When does the Delta Method approximation work?
Approximations are useful, squirrelly things. Useful because they offer shortcuts, freeing up compute for other purposes. Squirrelly because the approximation error can vary from case to case, and that brings an annoying extra question of whether the approximation is good enough for the case considered. This isn’t always an issue; some approximations (e.g. Stirling’s approximation) are virtually always good enough. But for the approximation I’ll discuss here, the annoying extra question needs answering.
In fact, my main motivation for writing this was to develop an intuition for when this approximation works and fails. And that means having a mechanical visual, something I’m comfortable (or naive enough) believing generalizes to the cases I’ll encounter. This post is to share this perspective.
The Delta Method Approximation
The approximation concerns two ubiquitous mathematical objects: functions and distributions. It’s an approximation of the distribution produced by passing a given distribution through a function. Specifically, let \(X\) be normally distributed with mean \(\mu\) and variance \(\sigma^2\) and let \(f(x)\) be a function differentiable at \(\mu\), then the following holds approximately:
\[f(X) \sim \mathcal{N}(f(\mu),\, f'(\mu)^2\sigma^2)\]
where \(f(X)\) is the random variable produced by passing \(X\) into \(f(x)\), granted \(f'(\mu)\neq 0\). This follows from a two-term Taylor series expansion of \(f(x)\) at \(\mu\)^{1}.
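The statement can be checked directly by simulation. Below is a minimal sketch (my own illustrative choices of \(f\), \(\mu\) and \(\sigma\), not from the post): sample \(X\), push the samples through \(f\), and compare the empirical mean and standard deviation of \(f(X)\) against the Delta Method's predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: X ~ N(mu, sigma^2) with f(x) = log(x)
mu, sigma = 10.0, 0.5
f = np.log
f_prime = lambda x: 1.0 / x

# Monte Carlo estimate of the true distribution of f(X)
samples = f(rng.normal(mu, sigma, size=100_000))

# Delta Method approximation: N(f(mu), f'(mu)^2 * sigma^2)
approx_mean = f(mu)
approx_sd = abs(f_prime(mu)) * sigma

print(samples.mean(), approx_mean)  # both near log(10)
print(samples.std(), approx_sd)     # both near 0.05
```

Here the spread of \(X\) is small relative to the curvature of \(\log(x)\) near \(x=10\), so the two pairs of numbers agree closely.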
To be unambiguous, I’ll refer to this as the Delta Method approximation, because as Ver Hoef (2012) points out, the ‘Delta Method’ can refer to multiple techniques.
The Approximation Visualized
To me, the approximation’s statement isn’t obviously true, but should be. To that end, I offer the following visual, also demonstrating when this approximation is expected to work or fail.
(Figure: the approximation visualized, in a case where it works and a case where it fails.)
From this, it seems the approximation only works when \(f(x)\) is approximately linear in the region over which \(X\) is distributed^{2}. Now it feels obviously true.
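The same point can be made numerically. In the sketch below (again with illustrative choices, here \(f(x)=e^x\) and \(\mu=0\)), a small \(\sigma\) keeps \(X\) in a region where \(f(x)\) is nearly linear and the approximation holds; a large \(\sigma\) exposes the curvature and the approximation breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp  # f(x) = e^x, curved everywhere; f'(x) = e^x as well
mu = 0.0

# Small sigma: X concentrates where e^x is approximately linear
small = f(rng.normal(mu, 0.1, size=200_000))
# Large sigma: X spreads across the curvature of e^x
large = f(rng.normal(mu, 1.0, size=200_000))

# In both cases the Delta Method predicts N(e^mu, (e^mu * sigma)^2),
# i.e. mean 1.0 with standard deviation equal to sigma.
print(small.mean(), small.std())  # close to 1.0 and 0.1
print(large.mean(), large.std())  # far from 1.0 and 1.0
```

In the large-\(\sigma\) case, \(f(X)\) is a heavily skewed lognormal, so both the predicted center and the predicted spread miss badly.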
Other Comments
I’ll make a few other comments:
 The approximation to \(f(X)\) only references \(f(x)\) at \(x=\mu\). This is a major weakness. For those who might be familiar, it’s the same weakness found in the Extended Kalman Filter and improved upon with the Unscented Kalman Filter. See section 18.5 of Murphy (2012) for more on this.
 It makes the connection of slopes and variances clear. When passing \(X\) through \(f(x)\), the slope is a multiplier on \(X\)’s standard deviation.
 If we hope for a normal approximation to \(f(X)\) to work, we’ll need \(X\) to be normally distributed^{3}.
 This approximation generalizes to higher dimensions, where we deal in gradients and covariance matrices. In this case, I expect all the above issues to be even more precarious, since it takes only one nonlinear dimension to spoil the approach.
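The multivariate version can be sketched the same way. Assuming \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)\) and a scalar \(f\), the Delta Method gives \(f(\mathbf{X}) \approx \mathcal{N}(f(\boldsymbol{\mu}),\, \nabla f(\boldsymbol{\mu})^\top \Sigma\, \nabla f(\boldsymbol{\mu}))\). The example below (my own choices: \(f(x, y) = xy\) with a mild covariance) checks the predicted variance against simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative bivariate case: f(x, y) = x * y
f = lambda v: v[..., 0] * v[..., 1]
mu = np.array([3.0, 2.0])
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])

# Gradient of f is [y, x], evaluated at mu
grad = np.array([mu[1], mu[0]])

# Multivariate Delta Method: Var[f(X)] ~= grad^T Sigma grad
approx_var = grad @ Sigma @ grad

samples = f(rng.multivariate_normal(mu, Sigma, size=200_000))
print(samples.var(), approx_var)  # both near 1.09
```

Because \(f\) is only mildly nonlinear and \(\Sigma\) is small, the agreement is good; a single strongly curved coordinate direction would degrade it.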
In summary, I expect the occasion for this approximation is when we’re certain \(f(x)\) is smooth, we can differentiate it, and the variance of \(X\) is very small.
References

J. M. Ver Hoef. Who invented the delta method? The American Statistician. 2012.

K. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press. 2012.
Something to Add?
If you see an error, egregious omission, something confusing or something worth adding, please email dj@truetheta.io with your suggestion. If it’s substantive, you’ll be credited. Thank you in advance!
Footnotes

Different approximations can be produced by considering more terms in the Taylor series expansion. ↩

More precisely, it’s approximately linear in the sense that it’s well approximated by the linear Taylor series approximation. ↩

This excludes contrived circumstances where \(f(\cdot)\) is designed to make \(f(X)\) normally distributed. ↩