Saturday, March 25, 2023

Simultaneous equation models and data errors

My experience is that using (optimization) models is a great way to provide quality assurance on the data. Data collection can be very difficult and expensive, and errors can easily creep in somewhere along the way. When using an optimization model, we will get hammered by such errors.

The models I work on are largely based on simultaneous equations. In my world, an optimization problem is a system of equations plus an objective. My argument here is that such models are very sensitive to data errors, much more so than, say, statistical analyses or machine learning models. That may sound really bad, but we can also turn this around and say that simultaneous equation models can be fantastic tools for stress testing data sets.

Here I try to demonstrate the issue with a small model. The idea is to introduce a small error in a somewhat large data matrix (where it is difficult to spot) and see how this affects the solution of a linear model \(\color{darkblue}A\color{darkred}x=\color{darkblue}b\).

Experiment

  1. Generate random data \(\color{darkblue}A\) and \(\color{darkblue}b\).
  2. Solve \(\color{darkblue}A\color{darkred}x=\color{darkblue}b\).
  3. Introduce an error in \(\color{darkblue}A\).
  4. Re-solve and observe the effects on the solution.
  5. Tremble in fear.

Here I just change one number in \(\color{darkblue}A\): I shift the decimal point by one in just one element. For a \(50 \times 50\) matrix this gives the following changes in \(\color{darkred}x\):


----     61 PARAMETER result  collected results

       original   perturbed        diff       %diff   different    signflip

i1       -1.049      -1.234       0.185      17.624   different
i2       -0.413      -0.294       0.119      28.851   different
i3        0.034      -0.163       0.197     572.422   different    signflip
i4       -0.761      -0.653       0.108      14.165   different
i5       -0.569      -0.546       0.023       4.002
i6       -0.333      -0.283       0.050      15.110   different
i7        0.330       0.371       0.041      12.302   different
i8       -0.128      -0.081       0.047      36.661   different
i9        0.358       0.259       0.099      27.651   different
i10      -0.279      -0.413       0.134      48.130   different
i11      -0.305      -0.418       0.113      37.000   different
i12       0.665       0.568       0.097      14.627   different
i13      -0.413      -0.549       0.136      32.948   different
i14      -0.885      -0.989       0.105      11.833   different
i15      -0.552      -0.542       0.010       1.793
i16      -0.652      -0.691       0.039       5.982
i17      -0.244      -0.162       0.082      33.752   different
i18       0.596       0.452       0.144      24.186   different
i19      -0.143      -0.018       0.124      87.086   different
i20      -0.465      -0.466       0.002       0.378
i21      -0.080       0.185       0.265     331.702   different    signflip
i22       0.767       0.789       0.022       2.890
i23       0.337       0.287       0.049      14.672   different
i24      -0.199      -0.120       0.079      39.891   different
i25      -0.800      -0.810       0.010       1.225
i26      -0.567      -0.557       0.009       1.672
i27      -0.341      -0.367       0.026       7.583
i28      -0.780      -0.921       0.141      18.025   different
i29       0.259       0.299       0.040      15.386   different
i30       0.219       0.077       0.143      65.061   different
i31      -0.570      -0.847       0.276      48.410   different
i32       0.235       0.244       0.009       3.896
i33       0.081      -0.042       0.123     151.192   different    signflip
i34       0.099       0.039       0.060      60.201   different
i35      -0.106      -0.052       0.053      50.595   different
i36       0.213       0.246       0.032      15.176   different
i37       0.310       0.304       0.005       1.771
i38      -0.012       0.225       0.238    1927.458   different    signflip
i39      -0.752      -0.677       0.075       9.989
i40       0.135       0.019       0.117      86.159   different
i41       0.270       0.179       0.090      33.449   different
i42       0.308       0.491       0.183      59.361   different
i43      -0.416      -0.428       0.012       2.953
i44       0.093       0.419       0.326     349.219   different
i45      -0.033      -0.163       0.130     399.702   different
i46       0.324       0.151       0.173      53.330   different
i47      -0.041      -0.134       0.094     229.921   different
i48       0.140       0.190       0.050      35.987   different
i49       0.851       1.024       0.173      20.391   different
i50      -0.210      -0.109       0.101      47.879   different

The values \(\color{darkred}x_i\) that change by more than 10% are marked as "different". Notice that several solution values even flip sign.
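Some standard linear-algebra background helps explain why a single entry can do so much damage (this is the classical perturbation bound, stated here as background rather than something the GAMS model computes). If \(\color{darkblue}A\) is perturbed to \(\color{darkblue}A+\Delta\color{darkblue}A\), the solution changes by \(\Delta\color{darkred}x\) with \[\frac{\|\Delta\color{darkred}x\|}{\|\color{darkred}x+\Delta\color{darkred}x\|} \le \kappa(\color{darkblue}A)\,\frac{\|\Delta\color{darkblue}A\|}{\|\color{darkblue}A\|},\qquad \kappa(\color{darkblue}A)=\|\color{darkblue}A\|\cdot\|\color{darkblue}A^{-1}\|.\] A small relative change in the data can thus be amplified by the condition number \(\kappa(\color{darkblue}A)\), and because every component of \(\color{darkred}x=\color{darkblue}A^{-1}\color{darkblue}b\) depends on every entry of \(\color{darkblue}A\), the damage is not confined to a few rows.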

We tend to assume that a few isolated errors have only a limited effect. As demonstrated here, that is often not the case at all. A database with, say, a 0.01% error rate sounds good. It isn't.

The problem sketched here is made much worse if data comes from different sources (which is usually the case). Many economists involved in modeling spend considerable time creating consistent and calibrated data sets. This is why. Another big concern is using very large data sets: the probability of encountering multiple issues quickly converges to 1, and the effort needed to detect and fix things increases substantially.  
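To put rough numbers on this (a back-of-the-envelope calculation, not part of the experiment): if every entry independently has a probability \(p\) of being wrong, a data set with \(n\) entries contains at least one error with probability \[1-(1-p)^n \approx 1-e^{-pn}.\] For the \(50\times 50\) matrix above (\(n=2500\)), a 0.01% error rate (\(p=10^{-4}\)) already gives about a 22% chance of at least one corrupted entry; for a data set with a million numbers, that probability is indistinguishable from 1.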

Here I considered a numerical error. In practice, we also see many other errors. Missing data, suppressed data, duplicates, logical errors, data extraction errors, inconsistent use of units, undocumented exceptions, classification inconsistencies, truncation errors, software limits, differences in locale settings, and definitional problems are just a start. Just something silly like NULL vs. 0 vs. NA can be a major headache.
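Even GAMS itself has a flavor of the NULL vs. 0 vs. NA problem: parameters are stored sparsely, so an explicitly supplied zero and a record that was never supplied at all become indistinguishable once the data is loaded. A minimal sketch (not part of the experiment above):

* GAMS does not store zeros: an explicit 0 and a missing record look the same
set k /r1*r3/;
parameter p(k) /
   r1  1
   r2  0
/;
* only r1 shows up in the display; r2 (given as 0) is now
* identical to r3 (never given at all)
display p;

GAMS has the special value EPS precisely to mark "this is a deliberate zero", but somebody has to remember to use it.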

I am also very wary of using live databases. To make runs reproducible, we need to work with fixed data, not something that can change without us even knowing. This can easily lead to losing your sanity and subsequent involuntary commitment. My approach: always take a snapshot and work with that. If technically possible, make the data read-only. In other words, stale data is underappreciated.
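In GAMS terms, the snapshot approach can be as simple as dumping the extracted data to a GDX file once and having every production run read that frozen file instead of the live source. A sketch, assuming the inputs live in parameters like \(\color{darkblue}A\) and \(\color{darkblue}b\) as in the appendix:

* once, right after extracting from the live database: freeze the data
execute_unload 'snapshot.gdx', A, b;

* every subsequent (reproducible) run reads the frozen file instead,
* with set i and parameters A(i,j), b(i) declared as before
$gdxIn snapshot.gdx
$load A b
$gdxIn

Making snapshot.gdx read-only at the file-system level adds a bit of extra protection.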

The same story can be told for production planning models. Even when a client tells me, "our data is very good," it is almost guaranteed that I will find data errors during the development of the model.

Conclusion


The goal of this little experiment is to scare the *** out of you. Just one small error in the data can lead to total disaster. I think everyone knows this in the back of their mind. However, being confronted with the phenomenon this directly is certainly revealing.

There are many textbooks on optimization. Admittedly, I have read only a tiny fraction of them, but data quality is a subject they hardly mention, if at all. This inattention is not warranted: data problems are a major issue when doing real optimization. A possible explanation is that many authors have a theoretical background in optimization rather than time spent in the trenches.

Appendix: GAMS model


$onText

  Experiment:
  show the difference in results for Ax=b when we change one element in
  A a little bit.

$offText

*--------------------------------------------------------
* data
*--------------------------------------------------------

set i /i1*i50/;
alias(i,j);

parameter A(i,j),b(i);
A(i,j) = uniform(-10,10);
b(i) = uniform(-10,10);

*--------------------------------------------------------
* solve Ax=b
*--------------------------------------------------------

variable x(i);

Equation
   lineq(i) 'linear equations'
;

lineq(i)..  sum(j, A(i,j)*x(j)) =e= b(i);

model m /all/;
solve m using cns;
display x.l;

parameter result(i,*) 'collected results';
result(j,'original') = x.l(j);

*--------------------------------------------------------
* introduce an error in A
*--------------------------------------------------------

* shift decimal point by one in one single number
A('i3','i5') = A('i3','i5')/10;

solve m using cns;
display x.l;
result(j,'perturbed') = x.l(j);
display result;

*--------------------------------------------------------
* report differences in solution x
*--------------------------------------------------------

acronym different,signflip;
result(j,'diff') = abs(result(j,'original')-result(j,'perturbed'));
result(j,'%diff') = 100*result(j,'diff')/max(0.0001,abs(result(j,'original')));
result(j,'different')$(result(j,'%diff')>10) = different;
result(j,'signflip')$(result(j,'original')*result(j,'perturbed') < 0) = signflip;
display result;

 

 

2 comments:

  1. It would be interesting to compare the condition numbers of the original and modified matrix A, even though the condition number relates errors in b to errors in the solution x.
    In commercial LP systems for crude oil refining there is always a place for simultaneous equations, for example with blends produced to a specification limit. A common approach is to use SLP solvers hinged on a "best guess" and an evaluation of tolerances applied to recursive properties.

  2. This is an interesting topic. A long time ago I purchased this book to look into it more: https://shop.elsevier.com/books/data-processing-and-reconciliation-for-chemical-process-operations/romagnoli/978-0-12-594460-1. Sadly, I didn't make much headway with it, but this post reminded me of it :)
