Pages

Tuesday, 26 November 2013

A graph labelling problem

I've been doing some work using the shapefiles for the current Canadian federal electoral districts.  Each shape in the shapefile corresponds to a riding (or, in a few cases, a part of a riding).  One thing I have noticed while working with them is that there is no apparent structure to the order in which the shapes appear.

For example, Shape 1 is the boundary of the Vegreville--Wainwright riding in Alberta, while Shape 2 is Bruce--Grey--Owen Sound in Ontario.  Since Alberta and Ontario do not even share a border, these two districts are far apart from each other.  Meanwhile, the ridings adjacent to Vegreville--Wainwright are numbered 5, 21, 144, 183 195, 217, and 240.  Considering there are only 312 shapes [1] in total, 5 and 21 are the only numbers remotely close to 1.  If there is any reasoning behind the numbering scheme, it's not easy to tell what it is by looking at the numbers alone.  One assumes it's arbitrary.

I use the numbers frequently, so it would be helpful if I could tell a little bit more about how close two ridings are based on their numbers.

The best numbering system would have consecutive numbers mapped to adjacent ridings, so that one riding was adjacent to another if and only if their labels differed by exactly 1.  This impossible to do for every riding, however, since it would require each riding to be adjacent to only two other ridings, but most are adjacent to three or more others.

Instead, we could number the ridings so that the numbers of  ``close'' ridings are also close.

There are a couple ways we could do this.  First, let's put the problem in terms of graph theory.  Let $G$ be the graph whose vertices are the electoral districts.  Define two vertices to be adjacent if their corresponding districts share a border (excluding the case where two districts share a point).  Suppose we label each vertex $v$ with a number $n(v)$ between 1 and the number of vertices, using each number for exactly one vertex.
  1. Ensure adjacent vertices have close numbers.  Thus, for each pair of adjacent vertices, compute the difference between their labels, then take the sum of these numbers over all adjacent pairs.  That is, compute
    $$\sum_{uv\in E}|n(u)-n(v)|$$
    Define the best labellings to be those that minimize minimize this sum.
  2. Approximate the the distances between vertices with the difference between their respective labels.  To this end, let $d(u,v)$ be the distance between two vertices in the graph and let $D(u,v)=|n(u)-n(v)|$.  The difference between the labels approximates the distance between the vertices if $|d(u,v)-D(u,v)|$ is small.  Thus, define the best labellings to be those that minimizes the sum
    $$\sum_{uv}|d(u,v)-D(u,v)|$$
The first approach bears some similarity to the graph bandwidth problem.   In the bandwidth problem, the goal is to minimize the maximum value of $|n(u)-n(v)|$.  If this value is $m$, then $m|E|$ is an upper bound on the sum that appears in the first approach.  It's possible, however, that the sum and the maximum are minimized by different labellings. 

The fact that the graph bandwidth problem is NP-hard suggests that it will also be difficult to minimize the sum.  I also suspect that the sum in the second approach will be of comparable difficulty.

Edit: One way of interpreting the graph bandwidth is not obvious from the name (the name is derived from matrices associated with the graph).  If we think of the labels $n(v)$ as locations of vertices along the line from 1 to $|V|$, then $|n(u)-n(v)|$ is the length of the edge.  So the graph bandwidth is the smallest possible value for the length of the longest edge.   We can say something similar about the first solution.  The goal in that case is to minimize the total length of the edges, or, equivalently, the average edge length.

[1] The astute reader will notice that there are 308 ridings in total for the last election, and there will be 338 in the next.  So where does this 312 come from?  Two ridings include islands that are too far from the mainland to be included in the same shape, so are included as separate shapes, and one riding includes 3 mainland components that are separated from each other by another riding, yielding 4 extra shapes.

No comments :

Post a Comment