Wednesday, May 12, 2010

Max likelihood estimation with income classes

Typically income data is presented in income classes:

Income Class Number of Observations
0-$20000 1036
$20000-$40000 495
$40000-$60000 201
$60000-$80000 102
$80000+ 38

In order to find the mean income based on such data we can fit e.g. a Pareto Distribution. One way of doing this is with a max likelihood estimation procedure. From http://www.math.uconn.edu/~valdez/math3632s10/M3632Week10-S2010.pdf we have:

image (EQ.1)

This is easy to code in GAMS, using a Pareto distribution:


set j 'bins' /b1*b5/;

table bin(j,*)
       
lo     up   observations
b1       0  20000      1036
b2   20000  40000       495
b3   40000  60000       201
b4   60000  80000       102
b5   80000    INF        38
;

parameters n(j),lo(j),up(j);
lo(j) = bin(j,
'lo');
up(j) = bin(j,
'up');
n(j) = bin(j,
'observations');

* Pareto distribution
* See: Evans, Hastings, Peacock, "Statistical Distributions", 3rd ed, Wiley 2000
* find mean income using MLE

variables a,c,L;
equation loglik;

a.lo=1;
c.lo=1;

a.l = 100;
c.l = 1;

loglik.. L =e=
sum(j, n(j) * log[ (1-(a/up(j))**c)$(ord(j)<card(j)) + 1$(ord(j)=card(j))
                                - (1-(a/lo(j))**c)$(
ord(j)>1) - 0$(ord(j)=1)] );


option optcr=0;
model mle /all/;

parameter mean;
solve mle maximizing L using nlp;
mean = c.l*a.l/(c.l-1);
display mean;

We use here that F(0) = 0 and F(INF) = 1 for the CDF of the Pareto distribution.

Note this approach can be also used to estimate the number of millionaires (ie income > 1e6) even though this number is hidden in the last income class.

Looking at http://www.jstor.org/stable/1914015 I was a little bit confused by

image (EQ.2)

however I think we can arrive at the earlier likelihood function. The factor n! can be dropped as it is constant. Taking logs we see:

image (EQ.3)

The last sum can be dropped as it is also constant. (Of course this can also be deducted directly from the product in EQ2).