Wednesday, August 21, 2019

R/Python + C++

In some recent projects, I was working on using algorithms implemented in C++ from R and Python. Basically the idea is: Python and R are great languages for scripting but they are slow as molasses. So, it may make sense to develop the time consuming algorithms in C++ while driving the algorithm from R or Python.

R and C++: Rcpp.

The standard way to build interfaces between R and C++ code is to use Rcpp [1,2].

It is possible to directly interface R with low-level C code, This will require a lot of code and knowledge of R internals. Rcpp automates a lot of this. E.g. Rcpp will take care of translating an R vector into a C++ vector.

Rcpp supports both small fragments of C++ code passed on as an R string to more coarse-grained file based approach [3]. For Windows, you need to download the GNU compilers [4].

If you are new to both  Rcpp and to building your own R packages [5] things may be a bit overwhelming.

Rstudio can help a lot. It supports a lot of very useful tasks:

• Syntax coloring for C++ code.
• Building projects.
• Git version control.
• Documentation tools (rmarkdown and bookdown). My documentation is also a good test: it executes almost all of the code when building the document.

 Editing C++ code in RStudio

Basically I never have to leave RStudio.

I have added an alternative driver file for my C++ code so I can debug it in Visual Studio. I used it only a few times: most of the time I just used RStudio.

Python and C++: pybind11

pybind11 [6] is in many respects similar to Rcpp, although it requires a little bit more programming to bridge the gap between Python and C++.

In the beginning of the above Youtube video [7], the presenter compares pybind11 with some of the alternatives:

• SWIG: author of SWIG says: don't use it
• ctypes: call c functions but not c++
• CFFI: call c functions
• Boost.Python: support for older C++ standards, but not much maintained
• pybind11: modern

As with Rcpp, calling the compiler is done through running a build or setup script. For Rcpp I used the GNU compilers, while pybind11/pip install supports the Visual Studio C++ compiler. This also means that if you have little experience with both pybind11 and creating packages, the learning curve may be steep.

References

1. http://www.rcpp.org
2. Dirk Eddelbuettel, Seamless R and C++ Integration with Rcpp, Springer, 2013
4. Hadley Wickham, R packages, O'Reilly, 2015
5. https://cran.r-project.org/bin/windows/Rtools/
7. Robert Smallshire, Integrate Python and C++ with pybind11, https://www.youtube.com/watch?v=YReJ3pSnNDo

Thursday, August 15, 2019

Scheduling teachers

From [1]:

I am trying to find a solution to the following problem:
• There are 6 teachers
• Each teacher can only work for 8 hours a day
• Each teacher must have a 30 minute break ideally near the middle of his shift
• The following table shows the number of teacher needed in the room at certain times:
• 7am - 8am 1 teacher
• 8am - 9am 2 teacher
• 10am - 11am 5 teacher
• 11am - 5pm 6 teacher
• 5pm - 6pm 2 teacher
What is a good way to solve this (ideally in python and Google OR-Tools) ?
Thank you

Initial analysis

From the demand data we see that we need all 6 teachers from 11am-5pm working without lunch break. This is 6 hours. So there is no way to allow a lunch break near the middle of the shift.

We have more problems. The picture of the demand data indicates we have no demand between 9am and 10am. That does not look right. Bad data is fact of life. Optimization models are especially sensitive to bad data.

I believe that looking critically at your data is essential for successful optimization applications. You can learn a lot from just a bit of staring.

Alternative problem 1

We can ask a few different questions. If we only allow shifts of the form: 4 hours work, 0.5 hour lunch, 4 hours work (i.e. lunch perfectly in the middle of the shift), how many teachers do we need to meet the demand. To model this, we can assume time periods of half an hour. The number of possible shifts is small:

With an enumeration of the shifts, we can model this as a covering problem:

Covering Model
\begin{align} \min\> &\color{darkblue}z= \sum_s \color{darkred} x_s \\ & \sum_{s|\color{darkblue}{\mathit cover}(s,t)} \color{darkred} x_s \ge \color{darkblue}{\mathit demand}_{t} && \forall t \\ &\color{darkred} x_i \in \{0,1,2,\dots\}\end{align}

Here $$\mathit{cover}(s,t)=\text{True}$$ if shift $$s$$ covers time period $$t$$. If we would interpret $$\mathit cover(s,t)$$ as a binary (data) matrix $\mathit cover_{s,t} = \begin{cases} 1 & \text{if shift s covers time period t}\\ 0 & \text{otherwise}\end{cases}$ we can also write:

Covering Model (alternative interpretation)
\begin{align} \min\> &\color{darkblue}z= \sum_s \color{darkred} x_s \\ & \sum_s \color{darkblue}{\mathit cover}_{s,t} \cdot \color{darkred} x_s \ge \color{darkblue}{\mathit demand}_{t} && \forall t \\ &\color{darkred} x_i \in \{0,1,2,\dots\}\end{align}

Our new assumptions are:

• we only allow shifts with a lunch break in the middle
• we add to our demand data: 4 teachers needed between 9am and 10am
After solving this easy MIP model we see:

----     55 VARIABLE x.L  number of shifts needed

shift1 2.000,    shift4 2.000,    shift5 2.000,    shift6 2.000

----     55 VARIABLE z.L                   =        8.000  total number of shifts


I.e. we need 8 teachers to handle this workload.

The picture shows we are overshooting demand in quite a few periods. Note: this picture illustrates the left-hand side (orange bars) and right-hand side (blue line) of the demand equation in the Covering Model.

Alternative problem 2

Lets make things a bit more complicated. We allow now the following rules for a single shift:

• The work period before the lunch bread is between 3 and 5 hours (or between 6 and 10 time periods)
• There is a lunch break of 0.5 or 1 hour.
• After lunch there is another work period of between 3 and 5 hours. The total number of working hours is 8.
When we enumerate the shifts, we see:

We now have 55 different shifts to consider.

The results look like:

----     72 VARIABLE x.L  number of shifts needed

shift1  1.000,    shift22 1.000,    shift26 1.000,    shift39 1.000,    shift49 1.000,    shift52 1.000
shift55 1.000

----     72 VARIABLE z.L                   =        7.000  total number of shifts


We see the number of teachers needed is 7. We are closer to the demand curve:

We see that if we add more flexibility we can do a bit better. Achieving 6 teachers is almost impossible. We would need to introduce shifts like: work 2 hours, lunch, work 6 hours. The teachers union would object.

It is noted there are quite a few other methods to solve models like this where we don't need to enumerate all possible shifts [2,3]. For larger problems it may not be feasible to employ the shift enumeration scheme we used here.

Wednesday, August 7, 2019

Finding the central point in a point cloud

Problem

The problem is easy to describe: given the coordinates of points $$p_i$$, find the central point $$x$$ that minimizes the sum of the distances from $$x$$ to $$p_i$$ [1]. This point is sometimes called the geometric median [2].

Data

We just use some random data on the unit square $$[0,1]\times[0,1]$$:

----     13 PARAMETER p  points (randomly generated)

x           y

point1   0.17174713  0.84326671
point2   0.55037536  0.30113790
point3   0.29221212  0.22405287
point4   0.34983050  0.85627035
point5   0.06711372  0.50021067
point6   0.99811763  0.57873338
point7   0.99113304  0.76225047
point8   0.13069248  0.63971876
point9   0.15951786  0.25008053
point10  0.66892861  0.43535638
point11  0.35970027  0.35144137
point12  0.13149159  0.15010179
point13  0.58911365  0.83089281
point14  0.23081574  0.66573446
point15  0.77585761  0.30365848
point16  0.11049229  0.50238487
point17  0.16017276  0.87246231
point18  0.26511455  0.28581432
point19  0.59395592  0.72271907
point20  0.62824868  0.46379787
point21  0.41330699  0.11769536
point22  0.31421227  0.04655151
point23  0.33855027  0.18209959
point24  0.64572713  0.56074555
point25  0.76996172  0.29780586
point26  0.66110626  0.75582167
point27  0.62744750  0.28386420
point28  0.08642462  0.10251467
point29  0.64125115  0.54530950
point30  0.03152485  0.79236064
point31  0.07276700  0.17566105
point32  0.52563261  0.75020767
point33  0.17812371  0.03414099
point34  0.58513117  0.62122998
point35  0.38936190  0.35871415
point36  0.24303462  0.24642154
point37  0.13050280  0.93344972
point38  0.37993791  0.78340046
point39  0.30003426  0.12548322
point40  0.74887411  0.06923246
point41  0.20201556  0.00506586
point42  0.26961305  0.49985148
point43  0.15128587  0.17416945
point44  0.33063773  0.31690605
point45  0.32208696  0.96397664
point46  0.99360221  0.36990306
point47  0.37288857  0.77197833
point48  0.39668414  0.91309632
point49  0.11957773  0.73547889
point50  0.05541847  0.57629980
mean     0.39042709  0.47299042


The last entry labeled mean is just the average of the $$x$$- and $$y$$-coordinates.

NLP Model

A straightforward non-linear programming model can look like:

Unconstrained NLP Model
$\min \sum_i \sqrt{\sum_c (\color{darkred}x_c-\color{darkblue}p_{i,c})^2 }$

We use $$c = \{x,y\}$$, i.e. we have $$x$$ and $$y$$-coordinates. Note that we use $$x$$ in two different contexts: element of set $$c$$, being the $$x$$-coordinate, and the decision variable $$x_c$$.

We can use the mean as a very good starting point to help the NLP solver. I.e. $x_c := \frac{\displaystyle\sum_{i=1}^n p_{i,c}}{n}$

The picture below shows why the mean is such a good starting point:

 Optimal center point is close to mean point

The numeric values are here:

----     45 PARAMETER results  x(center) vs mean

x           y     sumdist

mean  0.39042709  0.47299042 18.02414861
x     0.37048298  0.43857572 17.96891985


The sumdist column shows the objective values for these two points.

This is an easy NLP problem. Most NLP solvers just need a few iterations. With a system like GAMS or AMPL we get exact gradients automatically. That is much preferable to finite differences which seems the prevalent method people use in an R or Python environment.

Cone programming I

The above problem can also be written as a cone programming problem. This will allow us to use a different class of solvers to work on this problem. Here we use CVXPY [3] to express the model. The Python code can look like:

import cvxpy as cp

x = cp.Variable(2)   # center point

obj = cp.Minimize( cp.sum( [ cp.norm(x-p[i,:]) for i in range(N) ] ) )
prob = cp.Problem(obj)

objval = prob.solve(solver=cp.SCS, verbose=True)


This is very much a straight translation of our unconstrained NLP model. Although we only declared two $$x$$ variables, behind the scenes the model is blown up to a rather large one. We can see from the log:

----------------------------------------------------------------------------
SCS v2.1.1 - Splitting Conic Solver
(c) Brendan O'Donoghue, Stanford University, 2012
----------------------------------------------------------------------------
Lin-sys: sparse-direct, nnz in A = 150
eps = 1.00e-04, alpha = 1.50, max_iters = 5000, normalize = 1, scale = 1.00
acceleration_lookback = 0, rho_x = 1.00e-03
Variables n = 52, constraints m = 150
Cones:  soc vars: 150, soc blks: 50


The generated SOCP (second order cone programming) model is larger, but also very sparse. The solver has no problem solving it very quickly. With SOCP solvers we usually don't worry about an initial point like we used in the NLP model.

Cone programming II

If we don't use a modeling tool that can do these transformations automatically, we can use a DIY approach. Second-order cone constraints can be stated as: $||A_i x+b_i||_2 \le c_i^Tx + d_i \>\>\forall i$ This would imply that we can write our model as:

SOCP Model Attempt
\begin{align} \min & \sum_i \color{darkred} d_i \\ & \color{darkred} d_i^2 \ge \sum_c (\color{darkred}x_c-\color{darkblue}p_{i,c})^2 && \forall i \\ &\color{darkred} d_i \ge 0, \color{darkred} x_c \text{ free}\end{align}

Unfortunately this will yield a message like: CPLEX Error  5002: 'e(point1)' is not convex or Constraint 'e(point1)'(0) is not convex. Q should be positive semidefinite in a constraint with finite upper bound.  We can repair this as follows:

Repaired SOCP Model
\begin{align} \min & \sum_i \color{darkred} d_i \\ & \color{darkred} d_i^2 \ge \sum_c \color{darkred}y_{i,c}^2 && \forall i \\ & \color{darkred} y_{i,c} = \color{darkred} x_c -\color{darkblue}p_{i,c} && \forall i,c \\ &\color{darkred} d_i \ge 0, \color{darkred} x_c \text{ free}, \color{darkred} y_{i,c} \text{ free}\end{align}

This now solves quickly. We can now understand that CVXPY did quite a few steps before passing the model to the solver. As argued in [4], it is much better if the modeling system takes care of these reformulations. Some of them are not immediate obvious, and hand-crafted reformulations can be error-prone.

The NLP model has just 2 variables $$x_c$$. This SOCP model has 152 variables. Using $$n=50$$ data points, we added $$3n$$ variables. When looking at the number of constraints we see a similar thing. The NLP model has no constraints, but this SOCP model has $$3n=150$$ constraints ($$n$$ of them cones).

Conclusion

The min sum distance problem has a simple NLP formulation which can be improved by using a good initial point. It can also be formulated as a SOCP problem. Using high-level modeling tools this is not difficult. Without automatic reformulations things become a bit less obvious.

References

1. The point that minimizes the sum of euclidean distances to a set of n points, https://stackoverflow.com/questions/57277247/the-point-that-minimizes-the-sum-of-euclidean-distances-to-a-set-of-n-points
2. Geometric median, https://en.wikipedia.org/wiki/Geometric_median
3. https://www.cvxpy.org/
4. Victor Zverovich, Robert Fourer, Automatic Reformulation of Second-Order Cone Programming Problems, https://ampl.com/MEETINGS/TALKS/2015_01_Richmond_2E.2.pdf