Monday, May 16, 2022

Ranking using numpy.argsort

I needed to find a ranking of a large data set. Working in Python, it makes sense to look at the numpy library for this.

Numpy has the function argsort, which returns index positions [1]. One would think these are exactly the ranks we are after. Unfortunately, this is not the case.

>>> import numpy as np
>>> a = [3.0, 1.0, 5.0, 2.0]
>>> indx = np.argsort(a)
>>> indx
array([1, 3, 0, 2], dtype=int64)

I would expect:
     indx = [2, 0, 3, 1]

Actually, the reported indices describe the inverse mapping: for each position in the (as yet unknown) sorted vector, they give the corresponding position in the original unsorted vector.

The first vector is the ranking I was after; the second is what argsort returns.

It is not very difficult to get the inverse mapping. Here are a few ways to do this:
  1. Using loops. Use the simple fact that for a permutation \(p\) and its inverse \(p'\), we have: \[p_i=j \iff p'_j=i\]
    >>> rank = np.empty_like(indx)
    >>> for i,j in enumerate(indx):
    ...     rank[j]=i
    >>> rank
    array([2, 0, 3, 1], dtype=int64)
    Looping is in general quite slow in Python, so we may want to look at some alternatives.
  2. Fancy indexing [2]. Here we use the array indx to permute the values [0,1,2,3].
    >>> rank = np.empty_like(indx)
    >>> rank[indx] = np.arange(len(indx))
    >>> rank
    array([2, 0, 3, 1], dtype=int64)
  3. Applying argsort twice [2]. This is a bit of a surprise, but it does exactly what we want.
    >>> rank = np.argsort(indx)
    >>> rank
    array([2, 0, 3, 1], dtype=int64)
    This one is the most intriguing of course: argsort(argsort(a)) gives the ranking.

An alternative is to use scipy.stats.rankdata. The above ranking can be replicated with:

>>> import scipy.stats as stats
>>> rank = stats.rankdata(a, method='ordinal') - 1
>>> rank
array([2, 0, 3, 1], dtype=int64)
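As a sanity check, here is a small script (assuming numpy is available) confirming that the three methods agree on a larger random vector, and that the resulting ranks indeed map each element to its position in the sorted vector:

```python
import numpy as np

rng = np.random.default_rng(123)
a = rng.normal(size=1000)
indx = np.argsort(a)

# 1. explicit loop over the inverse permutation
rank1 = np.empty_like(indx)
for i, j in enumerate(indx):
    rank1[j] = i

# 2. fancy indexing
rank2 = np.empty_like(indx)
rank2[indx] = np.arange(len(indx))

# 3. argsort applied twice
rank3 = np.argsort(indx)

assert np.array_equal(rank1, rank2)
assert np.array_equal(rank2, rank3)
# sorted_a[rank[i]] == a[i]: rank gives each element's position in the sorted vector
assert np.array_equal(np.sort(a)[rank1], a)
```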


Wednesday, May 4, 2022

Evil Sudoku

On [1] there is now an "evil" difficulty level. I find these very difficult to solve by hand (I usually give up).


Of course, a good MIP presolver/preprocessor devours these in no time. It should solve models like this in less than 0.1 seconds using zero Simplex iterations and zero B&B nodes.

Version identifier: | 2021-04-07 | 3a818710c
CPXPARAM_Advance                                 0
CPXPARAM_Threads                                 1
CPXPARAM_MIP_Display                             4
CPXPARAM_MIP_Pool_Capacity                       0
CPXPARAM_MIP_Tolerances_AbsMIPGap                0
Generic callback                                 0x50
Tried aggregator 2 times.
MIP Presolve eliminated 168 rows and 580 columns.
Aggregator did 28 substitutions.
Reduced MIP has 128 rows, 122 columns, and 501 nonzeros.
Reduced MIP has 122 binaries, 0 generals, 0 SOSs, and 0 indicators.
Presolve time = 0.06 sec. (0.93 ticks)
Found incumbent of value 0.000000 after 0.06 sec. (1.66 ticks)

Root node processing (before b&c):
  Real time             =    0.06 sec. (1.67 ticks)
Sequential b&c:
  Real time             =    0.00 sec. (0.00 ticks)
Total (root+branch&cut) =    0.06 sec. (1.67 ticks)

--- MIP status (101): integer optimal solution.
--- Cplex Time: 0.06sec (det. 1.67 ticks)

Proven optimal solution
MIP Solution:            0.000000    (0 iterations, 0 nodes)
Final Solve:             0.000000    (0 iterations)

Best possible:           0.000000
Absolute gap:            0.000000
Relative gap:            0.000000

CBC is a bit terser (not always the case):

COIN-OR CBC      38.2.1 96226ea8 Feb 19, 2022          WEI x86 64bit/MS Window

COIN-OR Branch and Cut (CBC Library 2.10)
written by J. Forrest

Calling CBC main solution routine...
No integer variables - nothing to do

Solved to optimality (within gap tolerances optca and optcr).
MIP solution:    0.000000e+00   (0 nodes, 0.019 seconds)

Best possible:   0.000000e+00
Absolute gap:    0.000000e+00   (absolute tolerance optca: 0)
Relative gap:    0.000000e+00   (relative tolerance optcr: 0.0001)



Appendix: GAMS model


$ontext

   Evil Sudoku from [1]

$offtext

sets
   a        'all areas (rows,columns,squares)' /r1*r9,c1*c9,s1*s9/
   i(a)     'rows' /r1*r9/
   j(a)     'columns' /c1*c9/
   s(a)     'squares' /s1*s9/
   u(a,i,j) 'areas with unique values a=area name, (i,j) are the cells'
   k        'values' /1*9/;

* populate u(a,i,j)
u(i,i,j) = yes;
u(j,i,j) = yes;
u(s,i,j)$(ord(s)=3*ceil(ord(i)/3)+ceil(ord(j)/3)-3) = yes;
display u;

* given values
table v0(i,j)
      c1  c2  c3  c4  c5  c6  c7  c8  c9
  r1                       9       5
  r2   7                           1
  r3           8   2           7       9
  r4   6                   4   9       8
  r5       4       1
  r6                                   5
  r7           2       3
  r8   8                   1   6       4
  r9                               7
;

* MIP model
binary variable x(i,j,k);
x.fx(i,j,k)$(v0(i,j)=k.val) = 1;
variable z 'dummy objective';

equations
   dummy         'dummy objective'
   unique(a,k)   'unique values in each area'
   onevalue(i,j) 'one value per cell';

dummy..         z =e= 0;
unique(a,k)..   sum(u(a,i,j), x(i,j,k)) =e= 1;
onevalue(i,j).. sum(k, x(i,j,k)) =e= 1;

model sudoku /all/;

* solve
solve sudoku minimizing z using mip;

* display solution
parameter v(i,j) 'solution';
v(i,j) = sum(k, k.val*x.l(i,j,k));
option v:0;
display v;

Most models have separate constraints for rows, columns, and squares. In this formulation, I generalize this to "areas". The multidimensional set \(u(a,i,j)\) contains for each area \(a\) the corresponding cells \((i,j)\).  You can display this set to view it. Once we have this set \(u\) properly populated, the uniqueness constraints become exceedingly simple. The idea is that constraints are more difficult to debug than sets/parameters. So I am inclined to spend more effort in creating somewhat complex sets, expecting some pay-off in constraints becoming so simple that few things can go wrong.
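To make the area construction concrete, here is a small Python sketch (not part of the GAMS model; the names are illustrative) that builds the same set \(u\) and checks its basic properties:

```python
from math import ceil

# Build u: for each area name, the set of cells (i,j) it contains.
# Areas: rows r1..r9, columns c1..c9, and 3x3 squares s1..s9.
u = {f"{t}{n}": set() for t in "rcs" for n in range(1, 10)}
for i in range(1, 10):
    for j in range(1, 10):
        u[f"r{i}"].add((i, j))
        u[f"c{j}"].add((i, j))
        s = 3 * ceil(i / 3) + ceil(j / 3) - 3   # same square formula as in the model
        u[f"s{s}"].add((i, j))

# every area contains exactly 9 cells...
assert all(len(cells) == 9 for cells in u.values())
# ...and every cell belongs to exactly 3 areas (its row, column, and square)
assert all(sum((i, j) in cells for cells in u.values()) == 3
           for i in range(1, 10) for j in range(1, 10))
```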

u(a,i,j) for different values of a

As usual, we have \[\color{darkred}x_{i,j,k} = \begin{cases} 1 &  \text{if cell $(i,j)$ has value $k$} \\ 0 & \text{otherwise}\end{cases}\] To recover the Sudoku solution, we form \[ \color{darkblue}v_{i,j} = \sum_k \color{darkblue}k\cdot \color{darkred}x^*_{i,j,k}\] 
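The recovery of \(v_{i,j}\) from the solution values \(x^*_{i,j,k}\) is easy to sketch in Python/numpy (a toy 2x2 example with made-up data, not the model's output):

```python
import numpy as np

# a tiny illustrative "solution" grid
vals = np.array([[1, 2],
                 [2, 1]])
n = vals.shape[0]

# x[i,j,k] = 1 if cell (i,j) has value k+1, else 0
x = np.zeros((n, n, n))
for i in range(n):
    for j in range(n):
        x[i, j, vals[i, j] - 1] = 1.0

# v[i,j] = sum_k k * x[i,j,k], with k running over the values 1..n
v = np.einsum('ijk,k->ij', x, np.arange(1, n + 1))
assert np.array_equal(v, vals)
```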

The final solution is:

----     73 PARAMETER v  solution

            c1          c2          c3          c4          c5          c6          c7          c8          c9

r1           3           2           4           7           1           9           8           5           6
r2           7           5           9           8           6           3           4           1           2
r3           1           6           8           2           4           5           7           3           9
r4           6           7           1           3           5           4           9           2           8
r5           9           4           5           1           8           2           3           6           7
r6           2           8           3           9           7           6           1           4           5
r7           4           9           2           6           3           7           5           8           1
r8           8           3           7           5           2           1           6           9           4
r9           5           1           6           4           9           8           2           7           3

Sunday, May 1, 2022

Getting rid of non-linearities

In [1], an attempt was made to implement the following non-linear constraint: \[\frac{\displaystyle\sum_t \max\left\{\sum_i CF_{i,t}\cdot q_i -L_t, 0 \right\}}{\displaystyle\sum_t L_t} \le 0.015 \] Obviously code like:

cfm = quicksum(max(quicksum(cf[i][t] * q[i] - L[t] for i in range(I)), 0) for t in range(T)) / quicksum(L[t] for t in range(T)) <= 0.015

is really not what we want to see. It is best not to fire up your editor right away and write Python code like this. Rather, get a piece of paper and focus on the math first.

There are two non-linearities: a division and a \(\max()\) function. Note that I am not sure which symbols are variables and which are parameters. I'll try to explain some possible cases below.


Division

Exogenous divisions of the form \[\frac{\color{darkred}x}{\color{darkblue}a}\] where \(\color{darkblue}a\) is a constant, are not an issue. This is really just a linear coefficient \(1/\color{darkblue}a\). If you have \[\frac{\color{darkred}x}{\sum_i \color{darkblue}a_i}\] in the constraints, I would advise calculating \[\color{darkblue}\alpha :=\sum_i \color{darkblue}a_i \] in advance, so we can simplify things to \[\frac{\color{darkred}x}{\color{darkblue}\alpha}\] It is a good habit to keep constraints as simple as possible, as they are more difficult to debug than parameter assignments.

If we have something like \[\frac{\color{darkred}x}{\color{darkred}y}\] (endogenous division) we need to be careful and worry about division by zero (or, more generally, that the expression can assume very large values if \(\color{darkred}y \approx 0\)). If \(\color{darkred}y\) is a positive variable, we can protect things a bit by using a lower bound, say \(\color{darkred}y\ge 0.0001\). This is difficult to do if \(\color{darkred}y\) is a free variable (we want a hole in the middle). If we have  \[\frac{\color{darkred}x}{\sum_i\color{darkred}y_i}\] it may be better to write this as: \[\begin{align}&\frac{\color{darkred}x}{\color{darkred}z}\\ & \color{darkred}z= \sum_i \color{darkred}y_i  \\ &\color{darkred}z\ge 0.0001  \end{align}\] at the expense of an extra variable and constraint. It has the additional advantage of making the model less non-linear (fewer non-linear variables, smaller non-linear part of the Jacobian). Note that we assumed here that \(\sum_i\color{darkred}y_i \ge 0\). If this expression can assume both negative and positive values, this will not work.

If the division is not embedded in other non-linear expressions, we often can multiply things out. E.g. if we have the constraint:  \[\frac{\color{darkred}x}{\sum_i\color{darkred}y_i} \le 0.015\] then we can write: \[\color{darkred}x \le  0.015 \cdot \sum_i\color{darkred}y_i\] This is now linear. This is what we can apply to our original constraint: \[\sum_t \max\left\{\sum_i CF_{i,t}\cdot q_i -L_t, 0 \right\} \le 0.015\cdot \sum_t L_t \]

Max function

We can feed \(\max\{\color{darkred}x,0\}\) to a general-purpose NLP solver. However, this is dangerous. Although this function is continuous, it is non-differentiable at \(\color{darkred}x=0\).

In the special case \[\max\{\color{darkred}x,0\} \le \color{darkred}y\] we can write: \[\begin{align}  \color{darkred}x &\le \color{darkred}y \\ 0 & \le \color{darkred}y\end{align} \] If \(\color{darkred}y\) represents an expression, you may want to use an extra variable to prevent duplication.  

If we have \[\sum_i \max\{\color{darkred}x_i,0\} \le \color{darkred}y\] we can do: \[\begin{align} & \color{darkred}z_i \ge 0 \\ & \color{darkred}z_i \ge \color{darkred}x_i \\ &\sum_i \color{darkred}z_i \le \color{darkred}y \end{align}\] Note that the variables \(\color{darkred}z_i\) are a bit difficult to interpret. They can be larger than \(\max\{\color{darkred}x_i,0\}\) if the inequality is not tight. The reason I use \(\color{darkred}z_i \ge \max\{\color{darkred}x_i,0\}\) is that it is sufficient for our purpose and that \(\color{darkred}z_i = \max\{\color{darkred}x_i,0\}\) is just way more difficult to represent.

This is the reformulation we can apply to our original problem: \[\begin{align}&z_t \ge \sum_i CF_{i,t}\cdot q_i -L_t \\ & z_t \ge 0 \\ & \sum_t z_t \le 0.015\cdot \sum_t L_t  \end{align}\] This is now completely linear: we have removed the division and the \(\max(.,0)\) function.
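A quick numeric spot check (pure Python, with made-up data) illustrates why this works: choosing each \(z_t\) at its smallest feasible value \(\max\{\sum_i CF_{i,t}\cdot q_i - L_t, 0\}\), the linear system is feasible exactly when the original nonlinear constraint holds:

```python
import random

random.seed(42)
I, T = 5, 12
CF = [[random.uniform(-1.0, 1.0) for _ in range(T)] for _ in range(I)]
q = [random.uniform(0.0, 1.0) for _ in range(I)]
L = [random.uniform(0.5, 1.5) for _ in range(T)]

# original nonlinear left-hand side
lhs = sum(max(sum(CF[i][t] * q[i] for i in range(I)) - L[t], 0.0)
          for t in range(T)) / sum(L)

# reformulation: the smallest z_t satisfying z_t >= 0 and
# z_t >= sum_i CF[i][t]*q[i] - L[t]
z = [max(sum(CF[i][t] * q[i] for i in range(I)) - L[t], 0.0) for t in range(T)]
feasible = sum(z) <= 0.015 * sum(L)

# nonlinear constraint holds <=> linear system is feasible
assert (lhs <= 0.015) == feasible
```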