Tuesday, January 20, 2009

LAD regression

> I'm doing median regression with lp_solve. My LP-model is
> min sum(i, s1(i)+s2(i))
> s.t.
> sum(j, a(j)*x(i,j)) + s1(i) - s2(i) = b(i)
> s1(i),s2(i) >= 0
> where
> i=1..n observations,
> j=1..m describing dimensions,
> a(j) parameters to be identified,
> x(i,j) describing values for observation i,
> b(i) observed values,
> s1(i), s2(i) deviations of the described value from the observed value.
> For realistics instances I have > 100.000 observations. Many instances
> have rows i with equal x(i,j) but different b(i) (describing variables are
> equal but observations different). I wonder if there is a modeling trick
> to aggregate these rows into 1 or 2 constraints.

Don't know. But I have some suggestions to try out:

If j is large you may want to solve a larger but sparser problem where you prevent repeating the same sum(j, a(j)*x(i,j)) but using extra variables

   y(k) = sum(j, a(j)*x(k,j))

and then

   y(k) + s1(i) - s2(i) = b(i)

for suitable combinations (k,i). 

You could reduce the number of variables again by a different trick:

   -s(i) <= y(k)-b(i) <= s(i)
   min sum(i, s(i))

Also you may want to try different LP solvers.