Sunday, January 23, 2022

Coloring the US county map

A famous result in graph theory is that we can color a planar graph with just four colors [1]. The MIP model for coloring a graph is really simple. Let's try this with a relatively large data set: the map of US counties. We'll see that there are some hurdles on the way.


Building block: MIP model for coloring a graph


The problem is as follows: we want to assign a color to each node such that nodes connected by an arc don't have the same color. The objective is to minimize the number of different colors. For the MIP model we use an assignment variable:  \[\color{darkred}x_{n,c} = \begin{cases}1 & \text{if node $n$ gets color $c$} \\  0 & \text{otherwise}\end{cases}\] To keep track of the colors we have: \[\color{darkred}u_c = \begin{cases}1 & \text{color $c$ is used for some nodes} \\ 0 & \text{otherwise}\end{cases}\]


The MIP model can look like:

MIP Model
\[\begin{align} \min & \sum_c \color{darkred}u_c \\ & \sum_c \color{darkred}x_{n,c}=1 &&\forall n \\ & \color{darkred}x_{i,c}+\color{darkred}x_{j,c} \le \color{darkred}u_c && \forall \color{darkblue}{\mathit{arc}}_{i,j}, c \\ & \color{darkred}u_c \le \color{darkred}u_{c-1} && \forall c\gt 1 \\ & \color{darkred}x_{n,c} \in \{0,1\} \\ & \color{darkred}u_c \in \{0,1\} \end{align} \]



The last constraint is optional. It is used to get nicer-looking solutions and to reduce symmetry.

With a random graph with 20 nodes and 57 arcs, our minimum number of colors is 4. Of course, other graphs may need more or fewer colors.


This was a very small data set and the MIP model solved very fast (it used 0 nodes to prove optimality; although when solved in the root node, I guess more honest accounting would say this needed 1 node). Next, up to something bigger.

Building block: Map of US counties


To produce a map for all US counties, I used the GeoJSON file from [2]. After extracting the id fields from the JSON file we have a collection of counties encoded by their FIPS code [3]. These 5-digit codes have two parts: a 2-digit state id and a 3-digit county id. We have 3,221 counties in this file. This will become our set of nodes.

The data for the adjacency matrix comes from [4]. I had some issues with this file:

  • This file had more counties than I have in my map data. All counties not in the map file are ignored. These were all related to faraway territories: Virgin Islands, American Samoa, Guam, Northern Mariana Islands. 
  • The matrix in [4] is larger than strictly needed. It says all counties are adjacent to itself. Also, I made the matrix strictly upper triangular. I.e. two counties \(i\) and \(j\) can only be adjacent if \(i\lt j\). This makes the data smaller without losing any information.
  • There is a problem with county 27111. It has different names in the file.


     
  • County 27165 has an empty name in some of the records.
  • I suspect the fips codes are correct but the names are not. Because of the inconsistencies, I have the feeling that some of this data was entered manually instead of using automation. Copy-paste should be avoided like the plague!
  • Data problems are fact of life. Even for data from reputable sources. Always check incoming data for errors, and add data checks to the model.
After fixing the problems with the adjacency file, we can try to color the map. 

Debugging


The underlying graph has 3,221 nodes and 9,478 undirected arcs. The generated model is not small: 

MODEL STATISTICS

BLOCKS OF EQUATIONS           3     SINGLE EQUATIONS       60,102
BLOCKS OF VARIABLES           3     SINGLE VARIABLES       19,333
NON ZERO ELEMENTS       189,973     DISCRETE VARIABLES     19,332

The size depends quite a bit on the number of colors in set \(c\). The solution is however not what we would expect:

**** SOLVER STATUS     1 Normal Completion
**** MODEL STATUS      1 Optimal
**** OBJECTIVE VALUE                5.0000

Why don't we get an objective value of 4?  We need to do some debugging. Let's see if we can identify an offending node. I.e., when we remove that node (county), the remaining map can be properly colored with 4 colors. Our debugging tool becomes the following model:


MIP Debugging Model
\[\begin{align} \min\> & \color{darkred}z=\sum_c \color{darkred}u_c \\ &\color{darkred}z=4 \\ & \sum_c \color{darkred}x_{n,c}=\color{darkred}v_n &&\forall n \\ & \color{darkred}x_{i,c}+\color{darkred}x_{j,c} \le \color{darkred}u_c && \forall \color{darkblue}{\mathit{arc}}_{i,j}, c \\ & \sum_n \color{darkred}v_n = {\bf{card}}(n)-1 \\ & \color{darkred}x_{n,c} \in \{0,1\} \\ & \color{darkred}u_c \in \{0,1\} \\ & \color{darkred}v_n \in \{0,1\} \end{align} \]


The new variable \(\color{darkred}v_n\) indicates if node \(n\) is used. The node with \(\color{darkred}v_n=0\) in the solution is suspect. When we run this, Glades County in Florida is selected to be ignored. When we fix Glades county to be included, neighboring Okeechobee County is removed. So let's have a look at this part of Florida:




We see that Glades, Okeechobee, Martin, Palm Beach, and Hendry form a bit of a pie chart with one single point where they meet. Such a pie chart with a single meeting point can create problems:



There is no way we can color this with 4 colors. If we look at a description of the 4-color theorem, we actually see such a single point is removed from consideration:

The four-color theorem states that any map in a plane can be colored using four-colors in such a way that regions sharing a common boundary (other than a single point) do not share the same color. [5]

After removing arcs Glades-Martin and Glades-Palm Beach, our model finds a 4-coloring scheme for the US county map. The problem took about 3.5 minutes on my machine to solve to optimality. The problem is large but the underlying graph is very sparse.


The map


After fixing our data problems, and solving the coloring model, it is not very difficult to produce a proper 4-color map:




Refinements


One issue that can occur is that some counties may not have any neighbors. In our data set, we have three isolated counties, all part of Hawaii. 


----   6243 SET isolated2  county without neighbors (incl. name)

15001.Hawaii County, HI  
15003.Honolulu County, HI
15007.Kauai County, HI   


These counties do not get a proper color assigned in the model. However, we can simply assign any color to these counties. When minimizing the number of colors, it may make sense to fix these counties to color 1.

The problem has still a large number of symmetries. One thing we can do is fix any county (from the contiguous part of the US) to color 1. Then we can fix a single neighbor. This simple trick can help a bit.

When minimizing the number of colors, we may end up with a solution that is not very balanced: some colors are more used than others. We can add an objective where we make the number of times each color is used as equal as possible.
 

Conclusions


Applying the 4-color theorem to an actual largish data set provides some good lessons:
  • Data can never be trusted, even if it comes from the Census bureau.
  • Models can help in debugging by changing the question. Here we asked: is there a single node (county) we can remove to make the map four-colorable? This model led us directly to the underlying problem.  
  • Touching a neighboring county by a single point does not count in the 4-color theorem. It looks like these cases can happen in the real world.


References



Appendix: GAMS Model


$ontext

 
Standard Graph Coloring Model

 
Color map with US counties

$offtext


*---------------------------------------------------------------
* data
*---------------------------------------------------------------

sets
   n     
'nodes: counties (fips code)'
   a(n,n)
'arcs: adjacency -- upper-triangular part only'
   c     
'colors' /color1*color5/

   county
'county names'
   map(n,county<)
;

$include data.inc

* fix data wrt Glades County, FL
a(
'12043','12085') = no;
*a('12043','12099') = no;


scalar numCounties 'number of counties in map';
numCounties =
card(n);
display numCounties;

alias(n,i,j);

*---------------------------------------------------------------
* data checks
*---------------------------------------------------------------

abort$(card(map)<>numCounties) "check map";
abort$sum(a(i,j)$(ord(i)>=ord(j)),1) "a(i,j) is not strictly upper-triangular";

*---------------------------------------------------------------
* isolated counties (no neighboring counties), all in Hawaii
* we fix them outside the model
*---------------------------------------------------------------

set isolated(n) 'county without neighbors';
isolated(n) =
sum(j$(a(n,j) or a(j,n)),1)=0;
display isolated;


set isolated2(n,county) 'county without neighbors (incl. name)';
isolated2(isolated,county) = map(isolated,county);
option isolated2:0:0:1; display isolated2;

*---------------------------------------------------------------
* model
*---------------------------------------------------------------

binary variables
   x(n,c)
'assign color to node'
   u(c)  
'color is used'
;

variable
   numColors
'number of colors needed'
;

equations
   objective           
'minimize number of colors used'
   assign(n)           
'every node should have exactly one color'
   notSameColor(i,j,c) 
'adjacent vertices cannot have the same color'
   order(c)             
'ordering constraint (optional)'
;

objective..  numColors =e=
sum(c, u(c));

assign(n).. 
sum(c, x(n,c)) =e= 1;

notSameColor(a(i,j),c).. x(i,c)+x(j,c) =l= u(c);

order(c-1).. u(c) =l= u(c-1);


* fix isolated counties
x.fx(isolated,
'color1') = 1;

model color /all/;
option optcr=0, threads=16;
solve color minimizing numColors using mip;


display x.l,u.l,numColors.l;

*---------------------------------------------------------------
* export results
*---------------------------------------------------------------

parameter mapcolors(n);
mapcolors(n) =
sum(c, ord(c)*round(x.l(n,c)));

file f /mapcolors.csv/;
* csv format
f.pc=5;
put f,"id","color","county"/;
loop(n,
  
put n.tl:0,mapcolors(n):0:0;
  
loop(map(n,county),
      
put county.tl:0:0/;
   );
);
putclose;

*---------------------------------------------------------------
* create map
*---------------------------------------------------------------

$ontext

 
Notes: py is part of python 3.10
        
Make sure that the following packages are installed:
           
pandas, plotly

$offtext

file log /''/;
putclose log, "Creating map. This takes a minute..."/;
execute "py plotter.py";



The data file can be had from [6].


Appendix: Visualization



This is a visualization of the results using d3.

1 comment: