Yet Another Math Programming Consultant: 2023

Sunday, November 19, 2023

Grouping items: a difficult combinatorial problem

In [1], a simple problem is described:

We have $n$ items (or orders) with a certain width.
We need to combine these items in groups (called patterns) with rather tight limits on the total width. The total length of a pattern (the sum of the lengths of the items assigned to this pattern) must be between 335 and 340.
As a result, we may not be able to assign all items. The remaining items cannot be formed into valid patterns.
The objective is to try to place as many items as possible into patterns.
An indication of the size of the problem: $n \approx 500$.

Data

Instead of immediately working on a full-known $n=500$ problem, I generated a random data set with a very manageable $n=50$ items. The widths were drawn from a discrete uniform distribution between 30 and 300. The data looks like:

----     15 PARAMETER w  item widths

order1   76.000,    order2  258.000,    order3  179.000,    order4  111.000,    order5  109.000,    order6   90.000
order7  124.000,    order8  262.000,    order9   48.000,    order10 165.000,    order11 300.000,    order12 186.000
order13 298.000,    order14 236.000,    order15  65.000,    order16 203.000,    order17  73.000,    order18  97.000
order19 211.000,    order20 147.000,    order21 127.000,    order22 125.000,    order23  65.000,    order24  70.000
order25 189.000,    order26 255.000,    order27  92.000,    order28 210.000,    order29 240.000,    order30 112.000
order31  59.000,    order32 166.000,    order33  73.000,    order34 266.000,    order35 101.000,    order36 107.000
order37 190.000,    order38 225.000,    order39 200.000,    order40 155.000,    order41 142.000,    order42  61.000
order43 115.000,    order44  42.000,    order45 121.000,    order46  79.000,    order47 204.000,    order48 181.000
order49 238.000,    order50 110.000

I stick to the pattern limits $\color{darkblue}L=335$ and $\color{darkblue}U=340$.

We need some estimate of the number of patterns to use. We could just guess. But a better approach is the following. An upper bound for the number patterns can be established quite easily: \[{\mathit{maxj}} = \left\lfloor \frac{\sum_i \color{darkblue}w_i}{\color{darkblue}L}\right\rfloor\] For our data set this number is:

----     29 PARAMETER maxj                 =       22.000  max number of patterns we can fill

This means we can safely use this number as the number of bins (patterns).

Scheduling Team Meetings

This is a simple problem from [1]:

I'm rusty on constraint optimization and am looking for help in this particular
use case. There are individuals who are each member to several teams. This is a
fixed many-to-many relationship and is determined a-priori. There are 3 time
slots where the teams can be scheduled to conduct a business meeting, but if an
individual is a member of more than one team which are both meeting at a given
time slot, they'll only be able to attend one. The objective is to schedule the
teams into the time slots, minimizing the number of overlaps of individuals.

For beginners, it is often a good idea to split the task in two:

Formulate a mathematical model (on a piece of paper)
Implement the model in code

GAMS: SMAX and sparsity

This is a discussion about the SMAX function in GAMS and how it behaves for sparse data.

The data structure we were facing was something like:

set
i 'cases' /case1*case100000/
j 'attribute' /j1*j25/
k 'attribute' /k1*k25/
t 'type' /typ1*typ2/
;

parameter p(i,j,k,t) 'positive numbers';
* note: for each i we have only one (j,k)

Plotting NUTS-2 maps from GAMS

NUTS-2 regions are statistical subnational regions (often provinces), mainly for the EU and UK [1].

NUTS hierarchy (from [1])

In [2] we can find mapping information in the form of Shapefiles[3] and related formats. I used the GeoJSON[4] format, and created a Python notebook script to extract a GAMS set from that file. The file is reproduced in the appendix below. The NUTS-2 codes form the set elements, and the name is stored as explanatory text. There is an option to generate Latin names instead of using the native alphabet. The Latin names are also inside the geojson file. E.g. we have:

EL65 'Πελοπόννησος'

which is in the Greek alphabet. Using the Latin name, this would look like:

EL65 'Peloponnisos'

Linear Programming Nonsense?

1. Inventory balance constraints

I came accross this text [1]:

Inventory Balance Constraint

Math and ChatGPT

Performing symbolic math steps is often related to pattern recognition. In theory, ChatGPT could be doing a good job here. I wanted to find the inverse of \[f(x) = {\mathrm{sign}}(x) \log(1+|x|)\] This function is a form of a signed logarithmic scaling. So, let's see what ChatGPT is telling us:

Julia vs Python

I gave a talk to economists (i.e., not professional programmers) about a Julia project we were working on. Julia is famous for its speed. It uses LLVM [1] as back-end for its JIT (Just In Time) compilation. As seeing is believing, here is an example algorithm which is used to compare performance between different languages and tools. This example was chosen as it is small, easy to explain, and easy to program while still showing meaningful time differences.

We have a square $[-1,+1]\times[-1,+1]$ and an inscribing circle with radius $1$. See the dynamic figure below. Their areas are $4$ and $\pi$ respectively. The idea is to draw $n$ points \[\begin{align}& x_i \sim U(-1,+1) \\ & y_i \sim U(-1,+1)\end{align}\]Let $m$ be the number of points inside the circle, i.e. with \[x_i^2+y_i^2\lt 1\] Obviously, from the ratio of the areas, we have \[\frac{m}{n} \approx \frac{\pi}{4}\] It follows that an estimate of $\pi$ is \[\hat{\pi}=4\frac{m}{n}\]

Simulation with n=1000

Critiquing a GAMS Model

It is always interesting to read GAMS models written by someone else. There are probably three things one can observe:

A nice formulation or concept that is useful to learn about.
A bad implementation: something that really should not be done that way.
A piece of code that is correct and defensible, but I would write it differently. This includes things like style, layout, formatting, etc.

My way of reading GAMS code is often to start editing and make it "my code". It is a bit slower process, but that comes with its advantages: better understanding of what is going on, and often cleaner code.

Here, I am looking at the model sambal.gms in the GAMS model library [1]. It is a very small model, but I have many thoughts about it. The complete model is reproduced in Appendix 1. Let's walk through it.

The matrix balancing problem is to find a nearby matrix such that row- and column sums are obeyed. A relative quadratic objective is used to minimize the sum of the squared deviations between the original data (the priors) and the final matrix. Zeros in the matrix need to be maintained: they can't become nonzero. This is sometimes called sparsity preservation. Often, sign-preservation is another condition. That is not part of this model. Note that, in this model, not only the matrix is updated but also the row and column totals.

Three-level Matrix Balancing

Matrix balancing: introduction

Matrix Balancing Models are often used in economic modeling exercises: they create consistent data sets from data originating from different, conflicting data sources. A standard example is updating a matrix subject to given row and column sums. An example can look like:

Update orange cells subject to row/column sums

The empty cells are zero, and they should remain zero. In other words, we need to preserve sparsity. Often, we have non-negativity restrictions on the values. The mathematical model can look like this:

Some TSP MTZ experiments

In [1], a question was posed about a TSP model using the MTZ (Miller-Tucker-Zemlin) subtour elimination constraints. The results with Julia/glpk were disappointing. With $n=58$ cities, things were taken so long that the solver seemed to hang. Here I want to see how a precise formulation with a good MIP solver can do better. As seeing is believing, let's do some experiments.

The standard MTZ formulation[1] can be derived easily. We use the binary variables \[\color{darkred}x_{i,j}=\begin{cases}1 & \text{if city $j$ is visited directly after going to city $i$}\\ 0 & \text{otherwise}\end{cases}\]

Coloring edges of a grid

This is a little coloring problem from [1]:

Given an m * n grid, each edge must be colored. However, there are 2 constraints:
Entire grid must use only 3 unique colors
Each square in the grid must be colored using exactly 2 colors (2 edges per color)

Let's try to build a GAMS model for this. The question itself is not so interesting, but the modeling for setting up these grids is.

A Julia thingy

In Julia, we can write 2x instead of 2*x. Not the most earth-shattering. But a bit special nonetheless.

The expression in the last cell is interpreted as a function call.

Sunday, July 2, 2023

Some confusion here

Sometimes you end up visiting strange sites. This is a question & answer site. The question is:

Obviously, this is not a good question. It is something like "what is the difference between a ham sandwich and butter?".

The answers are not so good either.

CVXPY DCP errors

CVXPY is a popular tool to model and solve convex optimization problems. Sometimes, it throws a "DCP" error, even for convex problems. DCP stands for Disciplined Convex Programming, the underlying framework for working with guaranteed convex models. The error says: I can not verify this is convex.

Here are some small examples of convex objectives (under minimization) one would expect to work.

Objective	CVXPY code	Result	Notes
\[\color{darkred}x^T \color{darkred}x\]	x.T@x	DCP error	print shows minimize x@x, i.e. transpose is dropped
\[\color{darkred}x^T \color{darkred}x\]	x@x	DCP error
\[\color{darkred}x^T \color{darkred}x\]	cp.sum_squares(x)		transformed into quad_over_lin(x, 1.0)
\[\color{darkred}x^T \color{darkblue}Q \color{darkred}x\]	x.T@Q@x		transformed into QuadForm(x,Q)
\[\color{darkred}y:=\color{darkred}x-\color{darkblue}p\]\[\color{darkred}x^T \color{darkblue}Q \color{darkred}y\]	y=x-p x.T@Q@y	DCP error
\[\color{darkred}x^T \color{darkblue}Q \color{darkred}x - \color{darkred}x^T \color{darkblue}Q \color{darkblue}p\]	x.T@Q@x - x.T@Q@p		first term transformed into QuadForm(x,Q)

Not everything makes sense to me. I am not sure why x.T@x is not properly recognized, but x.T@Q@x is.

Generate all solutions that sum up to one

In a post, the following question was posed:

We can select unique values $\displaystyle\frac{1}{i}$ for $i=1,\dots,n$. Find all combinations that add up to 1.

A complete enumeration scheme was slow even for $n=10$. Can we use a MIP model for this or something related?

A single solution is easily found using the model:

Mathematical Model
\[ \begin{align} & \sum_{i=1}^n \frac{1}{i} \cdot \color{darkred}x_i = 1 \\ & \color{darkred}x_i \in \{0,1\} \end{align}\]

Finding common patterns

In [1], the following problem is stated:

Given a boolean matrix, with $m$ rows and $n$ columns, find the largest pattern of ones that is found in at least $\color{darkblue}K$ rows. We can ignore cells where the pattern has a zero value: they don't count.

A small example [1] is given:

row 1:[0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1]
row 2:[0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1]
row 3:[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
row 4:[1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
row 5:[1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
row 6:[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

With $K=3$, we can form a pattern with 10 nonzero elements:

row 1:  [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1]
row 2:  [0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1]
row 3:  [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
row 4:  [1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
row 5:  [1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
row 6:  [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1] 
pattern:[1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

The pattern is shared by rows 4, 5, and 6.

Can we formulate a MIP model for this? My first attempt is as follows.

Solving as network with lowerbounds

In [1], we looked at the following problem:

Mathematical Model
\[ \begin{align} \min& \sum_{i,j} \color{darkblue}a_{i,j} \cdot \color{darkred}x_{i,j} \\ & \sum_j \color{darkblue}a_{i,j}\cdot \color{darkred}x_{i,j} \ge \color{darkblue}r_i && \forall i \\ & \sum_i \color{darkblue}a_{i,j}\cdot \color{darkred}x_{i,j} \ge \color{darkblue}c_j && \forall j \\ & \color{darkred}x_{i,j} \in \{0,1\} \end{align}\]

A large MIP model that should be solved as LP: the root node

In this post, I want to discuss an observation about the root node when solving a MIP model.

Problem Description

The model I want to tackle here is from [1]:

We have a large matrix $\color{darkblue}A=\color{darkblue}a_{i,j}$ with values 0 and 1. In addition, we have a minimum on the row and column totals. These are called $\color{darkblue}r_i$ and $\color{darkblue}c_j$. The goal is to remove as many 1's in the matrix $\color{darkblue}A$ subject to these minimum row and column totals.

Some GAMS embedded Python notes

Here are two issues you may want to be aware of. The discussion below is relevant for Windows and not so much for Unix variants.

Using Python Raw strings for directories

Here we use some Python inside an otherwise empty GAMS model:

$onEmbeddedCode Python:

dir = r"%system.fp%"
print(dir)

$offEmbeddedCode

In-process, in-memory databases

There are a few database systems that are a bit different. They are libraries that can be linked directly to your application. Linking can be done statically (during the compilation/linking step) or dynamically (using a shared library or DLL). Here I want to show two cases:

SQLite [1] used from R on data frames
DuckDB [2] used from Python, again on data frames

So these databases don't only run inside R or Python but also can operate directly on data frames.

Compression and large tables (Excel and CSV)

Here is an interesting little experiment: load a large CSV file into Excel. My original Powerpoint slide from a recent presentation was not as complete as it should be. Here is more size info:

Simultaneous equation models and data errors

My experience is that using (optimization) models is a great way to provide quality assurance on the data. Data collection can be very difficult and expensive. It can also be quite easy to have errors cropping up somewhere along the way. When using an optimization model, we will get hammered by such errors.

Algorithm vs. model

From [1]:

We are given a plane defined by Ax+By+Cz-D = 0 where D is significantly larger than A,B,C and GCD(A,B,C) = 1. How would I find all points (x, y, z), where x,y,z are integers and >= 0, that lie on the plane in an efficient manner?

So the poster asks for an algorithm to find $\color{darkred}x,\color{darkred}y,\color{darkred}z \in \{0,1,2,\dots\}$ such that \[\color{darkblue}A \cdot \color{darkred}x + \color{darkblue}B \cdot \color{darkred}y + \color{darkblue}C \cdot \color{darkred}z = \color{darkblue}D\] Besides the assumptions stated in the question, I'll further assume $\color{darkblue}A,\color{darkblue}B,\color{darkblue}C,\color{darkblue}D\gt 0$.

Choosing between NLP solvers: interior point or active set.

One way to categorize (local) nonlinear programming (NLP) solvers is active set methods and interior point solvers. Some representative large-scale sparse solvers are:

Active set: CONOPT, SNOPT. These are using SQP algorithms.
Interior point: IPOPT, Knitro. Note: Knitro also contains an active set algorithm.

These are two very different types of algorithms. They work very well on different types of problems. If you do a lot of nonlinear optimization, it actually makes sense to have both types of solvers available. With modeling tools like AMPL and GAMS, it is very easy to switch between solvers. Testing performance between two solvers is then also a snap.

Here I want to discuss some factors that can help in deciding whether to use an active set solver or an interior point solver.

Some approaches for moving data between MS Access and GAMS

Moving data between different environments is always more difficult than we hope. Here I list some approaches and actually try them out on a small dataset. We hit some bugs along the way and also a few conceptual stumbling blocks (even for this stylized example). We had some issues with Access as well as GAMS Connect.

This question came up in an actual project. My idea was: "Let me show you how this can be done". I am afraid, I got carried away a bit. But it demonstrates that we should not underestimate these, at first sight, menial tasks. When the data set becomes larger, the problems compound. We can't eyeball the data, and statistically, it is more likely we encounter some problems.

Another fast MIP model: covering

In [1], the following problem is stated:

There is a collection of $n=1,000$ test questions.
Each question covers a number of skills.
Given is a requirement for a number of questions for each required skill (e.g., 4 questions about skill 1, 3 questions about skill 2, etc.).
Create a test with the minimum number of questions that fulfills the requirements.

Of course, we see some remarks about NP-Hardness. Complexity theory is about bounds, about worst-case and asymptotic behavior. Even if we can't prove that a particular problem with a particular dataset given to a particular solver must be fast, it does not mean the opposite: that it must be slow. It is not at all unusual that we can solve things very quickly, even though the complexity bounds are terrible. Of course, this depends on the model, on the data set, and on the solver. Here, we see an open-source MIP solver can solve this particular problem in less than 0.01 seconds. Many well-educated people misunderstand complexity theory – to their detriment. They miss out on great ways to solve actual problems! Obviously, this is a problem with how complexity theory is taught.

Assigning jobs to machines without overlap

Here we consider the following problem from [1]:

We have jobs with a given start time and completion time
Jobs can be repeated on given days (e.g. job 1 needs to run on Monday, Wednesday, and Friday)
We want to assign jobs to machines in such a way that there is no overlap
The objective is to minimize the number of machines needed to execute all jobs

The planning horizon is a week, and the problems has 50 jobs.

Supplier selection: an easy MIP

This is a fairly standard supplier selection model. It was posted in [1]. It is interesting as it is a good fit for a MIP model: it is simple to model, it solves fast, and it is not obvious how to solve without an optimization model.

The problem is:

We want to order items in different quantities from suppliers.
Suppliers have an available inventory for these items. This can be zero.
We can split the ordering over different suppliers.
The cost structure is as follows:

Shipping cost is a fixed cost per supplier.
Item cost is a variable per-unit cost.

Populating SQLite databases

GAMS has three easy ways to populate a SQLite database:

Using the tool gdx2sqlite. This tool populates a SQLite database with data from a GDX file. This means we first have to export GAMS data to a GDX file. As there is quite some file I/O going on here (writing GDX file, reading GDX file, writing database), I would expect this to be slower than the next method.
The new GAMS-connect facility. This does not use intermediate files, and directly copies records from in-memory data. This should be the fastest.
Old fashioned CSV files. We first export data as a GDX file, and then use gdxdump to convert the data to a CSV file. Then sqlite can import the CSV file, and populate the database. There is much file I/O here, so this should be slow.

Tiny non-convex quadratic model brings solvers to their knees

Here is a very small geometric problem:

Given $n$ points in 2d space, find the smallest triangle that contains all these points.

Find the smallest triangle containing all points.

This looks like a simple problem. Somewhat to my surprise, my attempt here does not bear that out.

Sunday, November 19, 2023

Data

Saturday, October 21, 2023

Friday, October 20, 2023

Thursday, October 19, 2023

Thursday, October 5, 2023

1. Inventory balance constraints

Wednesday, September 27, 2023

Wednesday, September 20, 2023

Monday, September 4, 2023

Tuesday, August 29, 2023

Matrix balancing: introduction

Friday, August 4, 2023

Wednesday, July 12, 2023

Thursday, July 6, 2023

Sunday, July 2, 2023

Tuesday, June 27, 2023

Wednesday, May 10, 2023

Sunday, May 7, 2023

Tuesday, May 2, 2023

Wednesday, April 26, 2023

Problem Description

Tuesday, April 18, 2023

Using Python Raw strings for directories

Wednesday, April 5, 2023

Tuesday, April 4, 2023

Saturday, March 25, 2023

Thursday, March 16, 2023

Tuesday, March 14, 2023

Monday, March 6, 2023

Friday, February 24, 2023

Thursday, February 16, 2023

Wednesday, February 15, 2023

Monday, February 13, 2023

Saturday, January 28, 2023