Phorgy Phynance

Archive for the ‘Maximum Likelihood Estimation’ Category

Weighted Likelihood for Time-Varying Gaussian Parameter Estimation

leave a comment »

In a previous article, we presented a weighted likelihood technique for estimating the parameters \theta of a probability density function \rho(x|\theta). The motivation is that, for time series, we may wish to weight more recent data more heavily. In this article, we apply the technique to a simple Gaussian density

\rho(x|\mu,\nu) = \frac{1}{\sqrt{\pi\nu}} \exp\left[-\frac{(x-\mu)^2}{\nu}\right].

In this case, the log likelihood is given by

\begin{aligned} \log\mathcal{L}(\mu,\nu) &= \sum_{i=1}^N w_i \log\rho(x_i|\mu,\nu) \\ &= -\frac{1}{2} \log\left(\pi\nu\right) - \frac{1}{\nu} \sum_{i=1}^N w_i \left(x_i - \mu\right)^2 \end{aligned}.

Recall that the maximum likelihood occurs when

\begin{aligned} \frac{\partial}{\partial\mu} \log\mathcal{L}(\mu,\nu) = \frac{\partial}{\partial\nu} \log\mathcal{L}(\mu,\nu) = 0. \end{aligned}

A simple calculation demonstrates that this occurs when

\begin{aligned} \mu = \sum_{i=1}^N w_i x_i \end{aligned}

and

\begin{aligned} \sigma^2 = \sum_{i=1}^N w_i \left(x_i - \mu\right)^2, \end{aligned}

where \sigma^2 = \nu/2.
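For reference, the calculation amounts to setting both partial derivatives of the log likelihood to zero and using \sum_{i=1}^N w_i = 1:

\begin{aligned} \frac{\partial}{\partial\mu} \log\mathcal{L}(\mu,\nu) &= \frac{2}{\nu} \sum_{i=1}^N w_i \left(x_i - \mu\right) = 0 \quad\Longrightarrow\quad \mu = \sum_{i=1}^N w_i x_i, \\ \frac{\partial}{\partial\nu} \log\mathcal{L}(\mu,\nu) &= -\frac{1}{2\nu} + \frac{1}{\nu^2} \sum_{i=1}^N w_i \left(x_i - \mu\right)^2 = 0 \quad\Longrightarrow\quad \frac{\nu}{2} = \sum_{i=1}^N w_i \left(x_i - \mu\right)^2. \end{aligned}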

Introducing a weighted expectation operator for a random variable X with samples x_i given by

\begin{aligned} E_w(X) = \sum_{i=1}^N w_i x_i, \end{aligned}

the Gaussian parameters may be expressed in a familiar form via

\mu = E_w(X)

and

\sigma^2 = E_w(X^2) - \left[E_w(X)\right]^2.

This simple result justifies the use of weighted expectations for time-varying Gaussian parameter estimation. As we will see, this is also convenient when coding financial time series analysis.
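As a concrete illustration, here is a minimal Python sketch of how these formulas might be coded. The exponential decay weighting (and the halflife parameter) is just one illustrative choice on my part; any nonnegative weights summing to one will do.

import numpy as np

def exponential_weights(n, halflife=60):
    # Weights that decay by half every `halflife` observations, normalized to
    # sum to one; the last (most recent) sample receives the heaviest weight.
    w = 0.5 ** (np.arange(n)[::-1] / halflife)
    return w / w.sum()

def weighted_gaussian_params(x, w):
    # Weighted MLE of the Gaussian parameters:
    # mu = E_w(X), sigma^2 = E_w(X^2) - E_w(X)^2.
    mu = np.sum(w * x)
    sigma2 = np.sum(w * x ** 2) - mu ** 2
    return mu, sigma2

# Example usage on a simulated return series.
returns = np.random.default_rng(0).normal(0.0005, 0.01, size=500)
w = exponential_weights(len(returns))
mu, sigma2 = weighted_gaussian_params(returns, w)
print(mu, np.sqrt(sigma2))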

 


Written by Eric

February 3, 2013 at 4:33 pm

More fun with maximum likelihood estimation

with one comment

A while ago, I wrote a post

Fun with maximum likelihood estimation

where I jotted down some notes. I ended the post with the following:

Note: The first time I worked through this exercise, I thought it was cute, but I would never compute \mu and \sigma^2 as above so the maximum likelihood estimation, as presented, is not meaningful to me. Hence, this is just a warm up for what comes next. Stay tuned…

Well, it has been over a year and I’m trying to get a friend interested in MLE for a side project we might work on together, so I thought it would be good to revisit it now.

To briefly review, the probability of observing N independent samples X\in\mathbb{R}^N may be approximated by

\begin{aligned} P(X|\theta) = \prod_{i = 1}^N P(x_i|\theta) = \left(\Delta x\right)^N \prod_{i=1}^N \rho(x_i|\theta),\end{aligned}

where \rho(x|\theta) is a probability density and \theta represents the parameters we are trying to estimate. The key observation becomes clear after a slight change in perspective.

If we take the Nth root of the above probability (and divide by \Delta x), we obtain the geometric mean of the individual densities, i.e.

\begin{aligned} \langle \rho(X|\theta)\rangle_{\text{geom}} = \prod_{i=1}^N \left[\rho(x_i|\theta)\right]^{1/N}.\end{aligned}
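Before moving on, a quick numerical sanity check (a toy example of my own, not from the original post): the geometric mean of the densities is the same as exponentiating the arithmetic mean of the log densities.

import numpy as np
from scipy.stats import norm

x = np.array([0.1, -0.3, 0.2, 0.05])         # toy samples
dens = norm.pdf(x, loc=0.0, scale=0.25)      # individual densities rho(x_i|theta)

geom_mean = np.prod(dens ** (1.0 / len(x)))  # product of rho^(1/N)
via_logs = np.exp(np.mean(np.log(dens)))     # exp of the average log density

print(np.isclose(geom_mean, via_logs))       # True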

In computing the geometric mean above, each sample is given the same weighting, i.e. 1/N. However, we may have reason to weight some samples more heavily than others; for example, if we are studying samples from a time series, we may want to weight the more recent data more heavily. This inspired me to replace 1/N with an arbitrary weight w_i satisfying

\begin{aligned} w_i\ge 0,\quad\text{and}\quad \sum_{i=1}^N w_i = 1.\end{aligned}

With no apologies for abusing terminology, I’ll refer to this as the likelihood function

\begin{aligned} \mathcal{L}(\theta) = \prod_{i=1}^N \rho(x_i|\theta)^{w_i}.\end{aligned}

Replacing w_i with 1/N would result in the same parameter estimation as the traditional maximum likelihood method.

It is often more convenient to work with the log likelihood, which has an even more intuitive expression

\begin{aligned}\log\mathcal{L}(\theta) = \sum_{i=1}^N w_i \log \rho(x_i|\theta),\end{aligned}

i.e. the log likelihood is simply the weighted (arithmetic) average of the log densities.
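In code, this is essentially a one-liner. Here is a minimal sketch; the Gaussian density in the (\mu,\nu) parameterization used in these posts and the exponential weighting scheme are my own illustrative choices.

import numpy as np

def weighted_log_likelihood(log_density, x, w, theta):
    # log L(theta) = sum_i w_i * log rho(x_i | theta)
    return np.sum(w * log_density(x, *theta))

def gaussian_log_density(x, mu, nu):
    # log of rho(x|mu,nu) = exp(-(x-mu)^2/nu) / sqrt(pi*nu)
    return -0.5 * np.log(np.pi * nu) - (x - mu) ** 2 / nu

x = np.random.default_rng(1).normal(0.0, 1.0, size=250)
w = 0.97 ** np.arange(len(x))[::-1]  # heavier weights on more recent samples
w /= w.sum()                         # w_i >= 0 and sum_i w_i = 1

print(weighted_log_likelihood(gaussian_log_density, x, w, (0.0, 2.0)))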

I use this approach to estimate stable density parameters for time series analysis, which is better suited to capturing risk in the tails. For instance, I used this technique when generating the charts in a post from back in 2009:

80 Years of Daily S&P 500 Value-at-Risk Estimates

which was subsequently picked up by Felix Salmon of Reuters in

How has VaR changed over time?

and Tracy Alloway of Financial Times in

On baseline VaR

If I find a spare moment, which is rare these days, I’d like to update that analysis and expand it to other markets; a lot has happened since August 2009. In particular, I’d like to look at other equity markets as well as fixed income. Due to their ability to cleanly model skew, stable distributions are particularly useful for analyzing fixed income returns.
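For readers who want to experiment with this, below is a rough Python sketch of how the stable parameters might be estimated by numerically maximizing the weighted log likelihood with SciPy. The starting values, bounds, and exponential weighting are illustrative assumptions on my part, not the exact setup used to generate those charts, and evaluating the stable density is slow.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import levy_stable

def neg_weighted_loglik(theta, x, w):
    # Negative weighted log likelihood for the stable density.
    alpha, beta, loc, scale = theta
    return -np.sum(w * levy_stable.logpdf(x, alpha, beta, loc=loc, scale=scale))

def fit_weighted_stable(x, w, theta0=(1.7, 0.0, 0.0, 0.01)):
    # Maximize the weighted log likelihood over (alpha, beta, loc, scale).
    # Keeping alpha away from the lower end of (0, 2] helps numerical stability.
    bounds = [(1.1, 2.0), (-1.0, 1.0), (None, None), (1e-6, None)]
    result = minimize(neg_weighted_loglik, theta0, args=(x, w),
                      method="L-BFGS-B", bounds=bounds)
    return result.x

# Illustrative usage on simulated heavy-tailed returns with decaying weights.
returns = 0.01 * np.random.default_rng(2).standard_t(df=3, size=250)
w = 0.99 ** np.arange(len(returns))[::-1]
w /= w.sum()
print(fit_weighted_stable(returns, w))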

Fun with maximum likelihood estimation

with 2 comments

The following is a fun little exercise that most statistics students have probably worked out as a homework assignment at some point, but since I have found myself rederiving it a few times over the years, I decided to write this post for the record to save me some time the next time this comes up.

Given a probability density \rho(x), we can approximate the probability of a sample falling within a region \Delta x around the value x_i\in\mathbb{R} by

P(x_i) = \rho(x_i)\Delta x.

Similarly, the probability of observing N independent samples X\in\mathbb{R}^N is approximated by

P(X) = \prod_{i = 1}^N P(x_i) = \left(\Delta x\right)^N \prod_{i=1}^N \rho(x_i).

In the case of a normal distribution, the density is parameterized by two parameters \mu and \nu and we have

\rho(x|\mu,\nu) = \frac{1}{\sqrt{\pi\nu}} \exp\left[-\frac{(x-\mu)^2}{\nu}\right].

The probability of observing the given samples is then approximated by

P(X|\mu,\nu) =   \left(\frac{\Delta x}{\sqrt{\pi}}\right)^N \nu^{-N/2} \exp\left[-\frac{1}{\nu} \sum_{i=1}^N (x_i - \mu)^2\right].

The idea behind maximum likelihood estimation is that the parameters should be chosen such that the probability of observing the given samples is maximized. This occurs when the differential vanishes, i.e.

dP(X|\mu,\nu) = \frac{\partial P(X|\mu,\nu)}{\partial \mu} d\mu + \frac{\partial P(X|\mu,\nu)}{\partial \nu} d\nu = 0.

Since d\mu and d\nu are arbitrary, the differential vanishes only when both components vanish, i.e.

\frac{\partial P(X|\mu,\nu)}{\partial \mu} = \frac{\partial P(X|\mu,\nu)}{\partial \nu} = 0.

The first component is given by

\frac{\partial P(X|\mu,\nu)}{\partial \mu} = P(X|\mu,\nu) \left[\frac{2}{\nu} \sum_{i=1}^N (x_i - \mu)\right]

and vanishes when

\mu = \frac{1}{N} \sum_{i = 1}^N x_i.

The second component is given by

\frac{\partial P(X|\mu,\nu)}{\partial \nu} = P(X|\mu,\nu) \left[-\frac{N}{2\nu} + \frac{1}{\nu^2} \sum_{i=1}^N (x_i - \mu)^2\right]

and vanishes when

\sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i-\mu)^2,

where \nu = 2\sigma^2.
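As a quick sanity check (a toy simulation of my own, not part of the original derivation), maximizing the log of P(X|\mu,\nu) numerically lands on the same closed-form estimates:

import numpy as np
from scipy.optimize import minimize

x = np.random.default_rng(3).normal(1.5, 2.0, size=10000)

# Closed-form estimates from the derivation above (with nu = 2 * sigma^2).
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)

def neg_loglik(theta):
    # Negative log of P(X|mu,nu), up to the additive constant N*log(Delta x);
    # we optimize over log(nu) so that nu stays positive.
    mu, log_nu = theta
    nu = np.exp(log_nu)
    return 0.5 * len(x) * np.log(np.pi * nu) + np.sum((x - mu) ** 2) / nu

result = minimize(neg_loglik, x0=(0.0, 0.0), method="Nelder-Mead")
mu_num, nu_num = result.x[0], np.exp(result.x[1])

print(mu_hat, sigma2_hat)   # closed form
print(mu_num, nu_num / 2)   # numerical maximum, converted back to sigma^2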

Note: The first time I worked through this exercise, I thought it was cute, but I would never compute \mu and \sigma^2 as above so the maximum likelihood estimation, as presented, is not meaningful to me. Hence, this is just a warm up for what comes next. Stay tuned…

Written by Eric

January 2, 2011 at 10:56 pm