Generalized Linear Models
The concept of Generalized Linear Models (‘GLMs’) didn’t click until I saw it from a lucky angle. When I first encountered it, I was confused by its arbitrary-looking design. The fortunate perspective came with understanding its motivation: to generalize linear regression. Someone broke linear regression into pieces, generalized each of them and put them back together, creating a machine with more knobs.
What are Generalized Linear Models?
We’ll start in the friendly world of linear regression. We assume a scalar \(y\) is normally distributed with a mean given as a linear combination of a vector \(\mathbf{x}\). This may be written:
\[\begin{align} p(y) & = \mathcal{N}(y \vert \mu,\sigma^2) \tag{1}\\ \mu & = \mathbf{x}^{\top}\boldsymbol{\beta} \tag{2} \end{align}\]We’ll call (1) the ‘distribution assumption’ and (2) the ‘expected-\(y\) and \(\mathbf{x}\) relation assumption’. The latter has this name because \(\mu\) is the expectation of \(y\) given \(\mathbf{x}\). That is, \(\mu=\mathbb{E}[y\vert\mathbf{x}]\).
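To make (1) and (2) concrete, here is a minimal simulation sketch. The coefficient values, noise level and sample size are made up for illustration, and \(\mathbf{x}\) is assumed to be an intercept term plus one feature. It draws \(y\) according to the two assumptions and then recovers \(\boldsymbol{\beta}\) with the closed-form least squares estimate, which coincides with the maximum likelihood estimate under the Normal assumption.

```python
import random

random.seed(0)

# Hypothetical values chosen for this illustration only.
true_beta = [1.0, 2.0]  # [intercept, slope]
sigma = 1.0

xs, ys = [], []
for _ in range(2000):
    x1 = random.uniform(-3.0, 3.0)
    mu = true_beta[0] * 1.0 + true_beta[1] * x1  # assumption (2): mu = x^T beta
    y = random.gauss(mu, sigma)                  # assumption (1): y ~ N(mu, sigma^2)
    xs.append(x1)
    ys.append(y)

# Closed-form least squares for one feature plus an intercept.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

print(intercept, slope)  # should land near true_beta
```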
A GLM generalizes both (1) and (2).
Generalizing the Distribution Assumption
We generalize from:
\(y\) is normally distributed given \(\mathbf{x}\).
to
\(\mathbf{y}\) has a distribution within the exponential family given \(\mathbf{x}\).
This is a clever move considering the exponential family is an elegant, expressive and useful object. To refresh, the vector \(\mathbf{y}\) has a distribution from the exponential family if its probability mass or density is:
\[\begin{align} p(\mathbf{y})=&\frac{1}{Z(\boldsymbol{\theta})}h(\mathbf{y})\exp\Big\{T(\mathbf{y})\cdot\boldsymbol{\theta}\Big\}\\ \end{align}\]according to some parameter vector \(\boldsymbol{\theta}\). The choice of the functions \(h(\cdot)\) and \(T(\cdot)\) will determine which distribution within the exponential family we’re using. So we could choose them such that we get the Normal distribution, but a different choice might give us the Binomial or Poisson or something else. We can’t choose \(Z(\boldsymbol{\theta})\), though; it’s there to ensure \(p(\mathbf{y})\) integrates to one. So a choice of \(h(\cdot)\) and \(T(\cdot)\) fixes \(Z(\cdot)\).
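As a rough sketch of how different choices of \(h(\cdot)\) and \(T(\cdot)\) yield different distributions, the snippet below plugs two standard choices into one generic function for the scalar case. The function name `exp_family_pmf` and the parameter values are my own for illustration. The Poisson and Bernoulli settings are the usual textbook ones, and each result is compared against the familiar formula for that distribution.

```python
import math

def exp_family_pmf(y, theta, h, T, Z):
    """p(y) = h(y) * exp(T(y) * theta) / Z(theta), for scalar y and theta."""
    return h(y) * math.exp(T(y) * theta) / Z(theta)

# Choice 1: Poisson with rate lam, where theta = log(lam).
poisson = dict(
    h=lambda y: 1.0 / math.factorial(y),
    T=lambda y: y,
    Z=lambda theta: math.exp(math.exp(theta)),
)

# Choice 2: Bernoulli with success probability p, where theta = log(p / (1 - p)).
bernoulli = dict(
    h=lambda y: 1.0,
    T=lambda y: y,
    Z=lambda theta: 1.0 + math.exp(theta),
)

lam, p = 2.5, 0.3  # illustrative parameter values
poisson_probs = [exp_family_pmf(y, math.log(lam), **poisson) for y in range(5)]
bernoulli_probs = [exp_family_pmf(y, math.log(p / (1 - p)), **bernoulli) for y in (0, 1)]

# The familiar formulas, for comparison.
poisson_direct = [lam ** y * math.exp(-lam) / math.factorial(y) for y in range(5)]
bernoulli_direct = [1 - p, p]

print(poisson_probs, poisson_direct)
print(bernoulli_probs, bernoulli_direct)
```

Only \(h(\cdot)\) and \(T(\cdot)\) were free choices here; each \(Z(\cdot)\) is whatever makes the probabilities sum to one.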
One thing to call out is that the dependent variable just went from necessarily a scalar (\(y\)) to a vector (\(\mathbf{y}\)); we’ve subtly generalized to multi-output predictions.
Before the Next Generalization
Suppose we’ve chosen \(h(\cdot)\) and \(T(\cdot)\) to recreate linear regression. We have:
\[\begin{align} p(y) & = \frac{1}{Z(\theta)}h(y)\exp\Big\{T(y)\cdot\theta\Big\} \tag{1}\\ \mu & = \mathbf{x}^{\top}\boldsymbol{\beta} \tag{2}\\ \end{align}\]Notice an issue? How does \(\mu\) connect to \(\theta\)? We have to build that bridge. We can get that done with a function \(\Psi(\cdot)\), which is defined to solve the problem:
\[\theta = \Psi(\mu)\]This means if we select \(h(\cdot)\) and \(T(\cdot)\) to recreate linear regression, \(\Psi(\cdot)\) will fall out as required. It is fully determined by the distribution we name and that distribution’s connection between \(\theta\) and the resulting expectation of \(y\)^{1}.
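To see \(\Psi(\cdot)\) fall out in this case, here is the Normal density worked through under one extra assumption that is mine, not part of the setup above: \(\sigma^2\) is treated as a known constant rather than as part of \(\theta\). Expanding the square and regrouping:

\[\begin{align} \mathcal{N}(y \vert \mu,\sigma^2) & = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(y-\mu)^2}{2\sigma^2}\Big\}\\ & = \underbrace{\exp\Big\{-\frac{\mu^2}{2\sigma^2}\Big\}}_{1/Z(\theta)}\;\underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{y^2}{2\sigma^2}\Big\}}_{h(y)}\;\exp\Big\{\underbrace{y}_{T(y)}\cdot\underbrace{\frac{\mu}{\sigma^2}}_{\theta}\Big\} \end{align}\]

Matching terms with the exponential family form gives \(T(y)=y\), \(\theta=\Psi(\mu)=\mu/\sigma^2\) and \(Z(\theta)=\exp\{\sigma^2\theta^2/2\}\).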
Now unsuppose so we may get back to generalizing.
Generalizing the Expected-\(y\) and \(\mathbf{x}\) Relation Assumption
We generalize from:
The scalar expectation of \(y\) is a linear combination of \(\mathbf{x}\).
to
Some simple function of the vector expectation of \(\mathbf{y}\) is a linear combination of \(\mathbf{x}\).
That simple function is known as a link function, labeled \(g(\cdot)\):
\[g(\boldsymbol{\mu})=\mathbf{x}^{\top}\boldsymbol{\beta}\]It’s a generalization because \(g(\cdot)\) could be the identity function and the length of \(\boldsymbol{\mu}\) could be one, landing us back in the world of linear regression. But neither needs to hold, in which case we land elsewhere. Note that if \(\boldsymbol{\mu}\) is a vector with length greater than one, \(\boldsymbol{\beta}\) is a matrix.
Also, the ‘simple’ requirement on \(g(\cdot)\) means it’s invertible, so we may write:
\[\boldsymbol{\mu}=g^{-1}(\mathbf{x}^{\top}\boldsymbol{\beta})\]
Putting It Together
Now that we have both generalizations, we put it all together:
\[\begin{align} p(\mathbf{y}) & = \frac{1}{Z(\boldsymbol{\theta})}h(\mathbf{y})\exp\Big\{T(\mathbf{y})\cdot\boldsymbol{\theta}\Big\} \tag{1}\\ \boldsymbol{\theta} & = \Psi(\boldsymbol{\mu}) \tag{bridge}\\ \boldsymbol{\mu} & = g^{-1}(\mathbf{x}^{\top}\boldsymbol{\beta}) \tag{2}\\ \end{align}\]And that’s it.
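The three lines can be read bottom to top as a small pipeline: \(\mathbf{x}\) gives \(\mu\), \(\mu\) gives \(\theta\), and \(\theta\) gives the distribution of \(y\). Below is a minimal sketch of that pipeline for one illustrative pairing I picked, the Poisson distribution with a log link, in the scalar-\(y\) case. The function names and the values of \(\mathbf{x}\) and \(\boldsymbol{\beta}\) are hypothetical.

```python
import math

# Link choice (assumed for this sketch): log link, so g^{-1} is exp.
def inverse_link(eta):
    return math.exp(eta)

# Distribution choice (assumed for this sketch): Poisson.
def psi(mu):
    return math.log(mu)  # bridge from the mean to theta

def h(y):
    return 1.0 / math.factorial(y)

def T(y):
    return y

def Z(theta):
    return math.exp(math.exp(theta))

def glm_pmf(y, x, beta):
    eta = sum(xi * bi for xi, bi in zip(x, beta))    # x^T beta
    mu = inverse_link(eta)                           # line (2)
    theta = psi(mu)                                  # line (bridge)
    return h(y) * math.exp(T(y) * theta) / Z(theta)  # line (1)

x = [1.0, 2.0]     # hypothetical input, first entry acts as an intercept
beta = [0.2, 0.4]  # hypothetical coefficients, so x^T beta = 1.0

probs = [glm_pmf(y, x, beta) for y in range(60)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
print(total, mean)  # total is close to 1, mean is close to exp(1.0)
```

For this pairing \(\Psi(\cdot)\) happens to be the same function as the link \(g(\cdot)\), so \(\theta\) equals \(\mathbf{x}^{\top}\boldsymbol{\beta}\). In general, the two steps do not have to cancel.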
When applying a GLM, there are two choices to make:
 Given \(\mathbf{x}\), what distribution should \(\mathbf{y}\) have: the Normal, the Poisson, the Binomial or something else? This choice determines \(h(\cdot)\) and \(T(\cdot)\), which forces \(\Psi(\cdot)\) and \(Z(\cdot)\).
 What should the link function \(g(\cdot)\) be? There are a few choices. Typically the answer to the previous question informs this one. Not all link functions can be combined with all distributions.
Answering these gives a full model specification. With it, we can learn \(\boldsymbol{\beta}\) from data.
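As one hedged illustration of learning \(\boldsymbol{\beta}\), the sketch below simulates data from a Poisson GLM with a log link and then maximizes the log-likelihood by plain gradient ascent. The optimizer, step size, sample size and coefficient values are my own simple stand-ins, not a description of what any particular GLM library does.

```python
import math
import random

random.seed(0)

def sample_poisson(rate):
    """Draw one Poisson sample using Knuth's multiplication method."""
    limit, k, prod = math.exp(-rate), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

true_beta = [0.5, 1.0]  # hypothetical coefficients used only to simulate data

data = []
for _ in range(1000):
    x = [1.0, random.uniform(-1.0, 1.0)]  # intercept plus one feature
    eta = sum(xi * bi for xi, bi in zip(x, true_beta))
    data.append((x, sample_poisson(math.exp(eta))))

def fit_poisson_glm(data, steps=500, learning_rate=0.1):
    """Gradient ascent on the average Poisson log-likelihood with a log link."""
    beta = [0.0, 0.0]
    n = len(data)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            mu = math.exp(sum(xi * bi for xi, bi in zip(x, beta)))
            for j in range(len(beta)):
                grad[j] += (y - mu) * x[j] / n  # d/d beta_j of log p(y)
        beta = [b + learning_rate * g for b, g in zip(beta, grad)]
    return beta

beta_hat = fit_poisson_glm(data)
print(beta_hat)  # should land near true_beta
```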
An Example
Suppose we have a classification problem where \(y\) is either 1 or 0. We answer the questions:
 Given an \(\mathbf{x}\), what should the distribution of \(y\) be? Since \(y\) is either 1 or 0, a natural choice is the Bernoulli distribution. If we ask the internet, we discover this implies \(h(y)=1\) and \(T(y)=y\). These then tell us \(\Psi(\mu)=\log(\frac{\mu}{1-\mu})=\theta\) and \(Z(\theta) = \exp(\theta)+1\).
 The expectation of \(y\) needs to be between 0 and 1 and \(\mathbf{x}^{\top}\boldsymbol{\beta}\) can vary over the real line, so we should pick a function that maps from (0, 1) to the real line. The logit function (which is coincidentally \(\Psi(\cdot)\) as well) is a reasonable choice. That is, \(g(\mu)=\log(\frac{\mu}{1-\mu})=\mathbf{x}^{\top}\boldsymbol{\beta}\).
If we substitute these settings and simplify, the model specification becomes:
\[\begin{align} p(y) & = \frac{1}{\exp(\mathbf{x}^{\top}\boldsymbol{\beta})+1}\exp\Big\{y\cdot\mathbf{x}^{\top}\boldsymbol{\beta}\Big\}\\ \end{align}\]And look at that–it’s logistic regression!
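A quick numerical check of that claim, as a sketch with made-up values for \(\mathbf{x}\) and \(\boldsymbol{\beta}\): evaluating the expression above at \(y=1\) should give the familiar sigmoid of \(\mathbf{x}^{\top}\boldsymbol{\beta}\), and the probabilities for \(y=0\) and \(y=1\) should sum to one.

```python
import math

def logistic_glm_pmf(y, x, beta):
    """The simplified model specification: exp(y * eta) / (exp(eta) + 1)."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    return math.exp(y * eta) / (math.exp(eta) + 1.0)

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

x = [1.0, 0.5, -2.0]    # hypothetical input
beta = [0.3, 1.2, 0.4]  # hypothetical coefficients
eta = sum(xi * bi for xi, bi in zip(x, beta))

p1 = logistic_glm_pmf(1, x, beta)
p0 = logistic_glm_pmf(0, x, beta)
print(p1, sigmoid(eta), p0 + p1)
```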
Final Comments
To survive the wilderness of GLMs, a few things should be noted:
 I’ve presented the general form of GLMs, but it’s easy to make choices of \(h(\cdot)\), \(T(\cdot)\) and \(g(\cdot)\) such that the optimization to determine \(\boldsymbol{\beta}\) is nearly impossible. Because of this, most GLM software will restrict the choices, such that no matter what is selected, it’ll be able to optimize it.
 At the same time, software will offer other generalizations I haven’t mentioned. They may offer a parameter that allows one to smoothly vary \(h(\cdot)\) or \(T(\cdot)\) or \(\Psi(\cdot)\) or provide a means to weight observations.
 \(\Psi(\cdot)\) can be unusual. Sometimes it maps from a single input to multiple outputs. Also, it often holds the exogenously determined parameters, parameters that impact the distribution of \(\mathbf{y}\) but aren’t learned from data.
These points can make the form offered look a bit different from how I’ve presented GLMs, but knowing them should make resolving differences easy.
References
I first formed this intuition when reading chapter 9 of Murphy (2012). Chapter 16 of Gelman et al. (2014) was a useful second perspective. The statsmodels documentation was useful for understanding how GLMs are expressed and handled in software.

K. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press. 2012.

A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari and D. B. Rubin. Bayesian Data Analysis Third Edition. Chapman and Hall/CRC. 2014.

Statsmodels contributors. Generalized Linear Models. Statsmodels 0.14.0.
Something to Add?
If you see an error, egregious omission, something confusing or something worth adding, please email dj@truetheta.io with your suggestion. If it’s substantive, you’ll be credited. Thank you in advance!
Footnotes

It’s not necessary for this, but the following is good to know in general. Picking a distribution within the exponential family fixes \(\Psi(\cdot)\), and this function relates \(\boldsymbol{\mu}\) (the expected \(\mathbf{y}\)) to the parameters \(\boldsymbol{\theta}\). So if you know \(\boldsymbol{\mu}\), you know the parameters as well. Because of this, \(\boldsymbol{\mu}\) is called the ‘mean parameters’. Along the same thread, \(\boldsymbol{\theta}\) is called the ‘canonical parameters’.