Yet Another Math Programming Consultant: Max likelihood estimation with income classes

Wednesday, May 12, 2010

Max likelihood estimation with income classes

Typically income data is presented in income classes:

Income Class	Number of Observations
0-$20000	1036
$20000-$40000	495
$40000-$60000	201
$60000-$80000	102
$80000+	38

In order to find the mean income based on such data we can fit e.g. a Pareto Distribution. One way of doing this is with a max likelihood estimation procedure. From http://www.math.uconn.edu/~valdez/math3632s10/M3632Week10-S2010.pdf we have:

(EQ.1)

This is easy to code in GAMS, using a Pareto distribution:

set j 'bins' /b1*b5/;

table bin(j,*)
        lo     up   observations
b1       0 20000      1036
b2   20000 40000       495
b3   40000 60000       201
b4   60000 80000       102
b5   80000    INF        38
;

parameters n(j),lo(j),up(j);
lo(j) = bin(j,'lo');
up(j) = bin(j,'up');
n(j) = bin(j,'observations');

* Pareto distribution
* See: Evans, Hastings, Peacock, "Statistical Distributions", 3rd ed, Wiley 2000
* find mean income using MLE

variables a,c,L;
equation loglik;

a.lo=1;
c.lo=1;

a.l = 100;
c.l = 1;

loglik.. L =e= sum(j, n(j) * log[ (1-(a/up(j))**c)$(ord(j)<card(j)) + 1$(ord(j)=card(j))
                                - (1-(a/lo(j))**c)$(ord(j)>1) - 0$(ord(j)=1)] );

option optcr=0;
model mle /all/;

parameter mean;
solve mle maximizing L using nlp;
mean = c.l*a.l/(c.l-1);
display mean;

We use here that F(0) = 0 and F(INF) = 1 for the CDF of the Pareto distribution.

Note this approach can be also used to estimate the number of millionaires (ie income > 1e6) even though this number is hidden in the last income class.

Looking at http://www.jstor.org/stable/1914015 I was a little bit confused by

(EQ.2)