Hexagonal coordinate representations in Hexeline


One of the central design principles in my still extremely early space shooter Hexeline (read: has neither space nor shooting) is that spacecraft will be defined in terms of a hexagonal grid, rather than cells of varying shape, which simplifies things for both the software and the player while still allowing more interesting shapes than a square grid.

A useful property of regular grids is that they are directly addressable; i.e., you can put the elements in an array and immediately access an element just by knowing its coordinate. Ideally, it’s also efficient to go from continuous spatial coordinates to a grid coordinate. This is trivial in a square grid: depending on how you define your coordinates, you can usually just divide spatial coordinates by the cell size to get the coordinate of the cell a point falls within.
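As an illustrative sketch of the square-grid case (the names and cell size here are assumptions, not from Hexeline), the lookup is just a floored division per axis:

```rust
// Hypothetical illustration: finding a square-grid cell is one division
// per axis, rounded towards negative infinity.
const SQUARE_CELL_SIZE: f32 = 32.0;

/// Returns the (column, row) of the square cell containing the point.
fn square_cell(x: f32, y: f32) -> (i32, i32) {
    ((x / SQUARE_CELL_SIZE).floor() as i32,
     (y / SQUARE_CELL_SIZE).floor() as i32)
}
```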

In an effort to get a similar property for hexagonal grids, I’ve taken the unusual step of basing a Newtonian physics engine on non-cartesian coordinates. It turns out this is not as bad as it sounds, and quite a few useful properties fall out of it.

Note: This post is going to be comparatively math-heavy and assumes some familiarity with linear algebra.

Performance Concerns

Before going into doing crazy things in the name of performance, I think it’s necessary to go over why the performance limits are so tight.

First, some simple target numbers: A minimum requirements system should be able to handle a simulation of 10’000 objects with 10 ms resolution, i.e., 100 frames per second. This initially looks like we’re talking about doing one million object updates per second, or 1 μs each on average.

The reality is more delicate due to the plans for handling networked play. There are two general approaches to networking: state replication and event replication. State replication is conceptually simpler, and simply involves each participant occasionally transmitting their current state. This is what my prior space shooter, Abendstern, did, as well as quite a few other games. The main downsides are high bandwidth use if the state is large (Abendstern had an exceedingly complicated protocol to alleviate this) and that it is hard or impossible to prevent cheating by simply modifying the client.

On the other hand, with event replication, only events from outside the game (i.e., player controls in our case) are transmitted, and all participants derive the state by running the simulation with those particular events. Cheating by modifying the client is impossible because other clients won’t have that modification; e.g., if you make a “speed hack”, you may well see yourself zooming around, but other players will see you slowly flying about like an idiot.

Supporting event replication generally requires building it in from the ground up. Calls to things like the C standard rand() are off-limits because different clients would derive different states. If a participant is not fast enough to run the simulation at the required speed, they cannot participate because there is no way to reduce the simulation granularity like most games do as a by-product of reduced frame rate.
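For instance, a deterministic replacement for `rand()` is straightforward to build in. The sketch below is illustrative, not Hexeline’s actual code: it uses xorshift64*, which yields an identical sequence on every client given the same seed, whereas the C standard `rand()` varies by platform.

```rust
/// Illustrative example: a tiny deterministic PRNG (xorshift64*) that all
/// participants can seed identically, keeping the simulations in lockstep.
struct LockstepRng {
    state: u64,
}

impl LockstepRng {
    fn new(seed: u64) -> Self {
        // Avoid the degenerate all-zero state.
        LockstepRng { state: seed.max(1) }
    }

    fn next(&mut self) -> u64 {
        // xorshift64* step: same output on every client for a given seed.
        let mut x = self.state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.state = x;
        x.wrapping_mul(0x2545F4914F6CDD1D)
    }
}
```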

Real-time games using event replication also have the awkward problem that they can learn about an event after that event’s time has passed. This necessitates being able to “rewind” time to apply the event, then “fast forward” back to the current time. With a real network in play, events will always be received after their nominal time. If we receive, say, 50 event batches per second, we need to rewind/fast-forward time 50 times per second. The amount of time is equal to the network latency; assuming that averages 200ms, we end up playing 200ms of time forward 50 times per second, or 10 seconds of simulation time per second of wall time.

In other words, the game needs to be able to run at ten times its natural speed, so rather than one microsecond average per object update, we have one hundred nanoseconds. On a 2 GHz processor, that’s two hundred clock cycles, or around four hundred instructions if the code manages to saturate a two-unit superscalar processor. That’s about the same amount of time that a single integer division takes.

Also, keep in mind that updating an object is not just a question of updating position/velocity/acceleration; the engine needs to also do collision detection at the same time. Most objects need to be significantly faster than the 100 ns limit to make room for the occasional update that takes longer.

Having established “counting clock cycles” territory, let’s move on to coordinate systems.

Hexagonal Coordinates

The original idea here came from Hexagon grids: coordinate systems and distance calculations. I highly recommend reading that article, but a short summary of the relevant bits follows anyway.

There are a few ways to assign axes to a hexagonal grid. To avoid confusion with normal cartesian coordinates, we’ll be using A, B, … to reference these axes instead of the usual X, Y, … convention.

A common method is to “zig-zag” one of the axes, as shown below. This results in a grid that can be thought of as a square grid with every odd column offset by half a square, since the zig-zagged axis is still perpendicular to the other axis on average.

[Figure: a hexagonal grid with a zig-zagging A axis and a straight B axis; each cell is labelled with its (A, B) coordinate, ranging from (-2, -1) to (2, 1).]

The downside of having a zig-zag axis is that the resulting space is non-linear. Any code which considers the area around such a hexagonal coordinate needs separate cases for even/odd columns. As described in the article, distance calculations are pretty tricky. There is also no obvious interpretation of continuous coordinates in this space — where would (0.5,0.5) fall on the above grid?

In order to make the space linear, we need to make the axes non-perpendicular. As described in the article, we can make distances work out by adding a third axis with the constraint that the sum of the coordinates is zero (i.e., $A+B+C=0$). This results in a grid that looks like this:

[Figure: a hexagonal grid with A, B, and C axes at 120° to each other; each cell is labelled with its (A, B, C) coordinate, ranging from (-2, 0, 2) to (2, 0, -2), every triple summing to zero.]

This coordinate system is a bit counter-intuitive at first, especially given the way all the coordinates change as one moves horizontally.

There are two equivalent ways to interpret $(a,b,c)$ coordinates:

  • They represent $(x’,y’,z’)$ cartesian coordinates on the 3-dimensional plane $x’+y’+z’=0$. A point on this plane corresponds to the 2-dimensional point $(x,y)$ where $(x,y)$ is the orthographic projection of $(x’,y’,z’)$ such that the projected axes are at 120° angles to each other.

  • $(a,b,c)$ are coordinates in a (non-orthogonal) vector space which can be transformed into cartesian $(x,y,z)$ coordinates with $z=0$.

Given these interpretations, a number of extremely useful properties reveal themselves:

  • There is a sensible interpretation of continuous coordinates.

  • Given the correct scaling factor, the L2 length of vector $(a,b,c)$ can be made to equal the L2 length of the corresponding $(x,y)$ vector.

  • An affine transform can be used to convert between cartesian and hexagonal coordinates. In other words, translating between the two is simply a single matrix-vector multiplication. Additionally, since each axis is entirely constrained by the other two, the $(a,b)$ space by itself is also an affine transform of $(x,y)$. $$\begin{bmatrix}a\\b\\c\end{bmatrix} = \begin{bmatrix}\sqrt{2}/\sqrt{3} & 0 & 0\\ -1/\sqrt{6} & 1/\sqrt{2} & 0\\ -1/\sqrt{6} & -1/\sqrt{2} & 0\end{bmatrix} \begin{bmatrix}x\\y\\0\end{bmatrix}; \qquad \begin{bmatrix}x\\y\end{bmatrix} = \begin{bmatrix}\sqrt{3}/\sqrt{2} & 0\\ 1/\sqrt{2} & \sqrt{2}\end{bmatrix} \begin{bmatrix}a\\b\end{bmatrix}$$

  • Any affine transform against cartesian space can itself be transformed to have the same effect against hexagonal coordinates.

  • Because lengths are preserved and the transform from cartesian space is linear, angles in hexagonal space measure the same as they do in cartesian space.

Since the $c$ coordinate can always be calculated as $-a-b$, hexagonal coordinates can generally be represented simply as $(a,b)$, an oblique two-dimensional space. This is Hexeline’s primary coordinate system. This has some interesting fallout, such as turning the common “bounding box” into a bounding rhombus.
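As a sanity check on those conversion matrices, here is a floating-point sketch (illustrative function names; Hexeline itself works in fixed-point integer coordinates) showing that the conversion round-trips and preserves lengths:

```rust
/// Illustrative sketch: cartesian -> hexagonal using the forward matrix.
/// Note that sqrt(2)/sqrt(3) = 2/sqrt(6).
fn cartesian_to_hex(x: f64, y: f64) -> (f64, f64, f64) {
    let a = 2.0 / 6.0f64.sqrt() * x;
    let b = -1.0 / 6.0f64.sqrt() * x + 1.0 / 2.0f64.sqrt() * y;
    // The third axis is fully determined by the constraint a + b + c = 0.
    let c = -a - b;
    (a, b, c)
}

/// Illustrative sketch: hexagonal -> cartesian using the inverse matrix.
fn hex_to_cartesian(a: f64, b: f64) -> (f64, f64) {
    let x = (3.0f64 / 2.0).sqrt() * a;
    let y = 1.0 / 2.0f64.sqrt() * a + 2.0f64.sqrt() * b;
    (x, y)
}
```

With this scaling, $a^2+b^2+c^2 = x^2+y^2$, which is the length-preservation property from the list above.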

Of course, the decision to use hexagonal coordinates pervasively was made for reasons beyond them being similar enough to cartesian coordinates to be workable; the properties above simply make them minimally practical. What makes them desirable are the operations that can be done efficiently on hexagonal coordinates but not on cartesian coordinates.

Translating to cell coordinates

As previously mentioned, one of the main motivations for using a hexagonal grid was to make cells directly addressable. For this to be useful, we need a way to efficiently determine a cell coordinate from an arbitrary point in space. This turns out to be quite efficient in hexagonal space.

[Figure: four adjacent hexagons labelled (0, 0, 0), (1, 0, -1), (0, 1, -1), and (1, 1, -2), with a shaded rhombus covering the set of points that round to the same corner.]

The first step is to approximate the cell coordinate by simply rounding the $(a,b)$ coordinates towards negative infinity. For example, in the graph to the right, all points in the shaded area are rounded to the top left corner of that area.

This might look like a strange approximation at first: only a minority of the rhombus is placed into the correct cell, and only a minority of that cell is covered by the rhombus. However, it’s actually ideal for the next step, as every rhombus covers exactly four cells in exactly the same way, and every cell is covered by four rhombi in exactly the same way. As we’ll shortly see, four happens to be a very convenient number of candidates to have.

Once we’ve obtained the approximation, we know the point falls within one of four cells, offset from the approximation by $(0,0)$, $(0,1)$, $(1,0)$, and $(1,1)$, respectively. To determine the final answer, we simply determine which of those candidates actually contains the point in question. To do this, we find the displacement $(Δa,Δb,Δc)$ from the point to the centre of a hexagon. The point is within the hexagon if $\max(Δa,Δb,Δc) - \min(Δa,Δb,Δc) ≤ 1$.

There isn’t an obvious way to illustrate the geometry behind what makes this condition work. To make it marginally more intuitive, note that there are exactly 6 combinations of displacements that this formula can output — one for each edge of the hexagon. The correctness can be verified by plotting coordinates along the edges of the unit hexagon and observing that it holds.

Along the edges of the hexagons, this test finds that a point is in two hexagons; similarly, at the vertices, it finds that the point is in all three neighbours. In Hexeline, handling these edge cases precisely doesn’t matter, so one hit is chosen arbitrarily.
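The steps above can be sketched as a scalar reference implementation (the constants are assumed placeholders, and Hexeline’s actual code is the SIMD version developed below):

```rust
// Scalar reference version of the cell lookup, with an assumed
// power-of-two cell size.
const CELL_SHIFT: u32 = 9;
const CELL_SIZE: i32 = 1 << CELL_SHIFT;

/// Is a point at displacement (da, db) from a hexagon's centre inside
/// that hexagon? Uses dc = -da - db and the max-minus-min distance.
fn in_hexagon(da: i32, db: i32) -> bool {
    let dc = -da - db;
    let max = da.max(db).max(dc);
    let min = da.min(db).min(dc);
    max - min <= CELL_SIZE
}

/// Find the (a, b) cell coordinate containing the continuous point (a, b).
fn cell_of(a: i32, b: i32) -> (i32, i32) {
    // Round towards negative infinity (arithmetic shift).
    let base_a = a >> CELL_SHIFT;
    let base_b = b >> CELL_SHIFT;
    // Test the four candidate hexagons covered by the rhombus.
    for &(oa, ob) in &[(0, 0), (0, 1), (1, 0), (1, 1)] {
        let centre_a = (base_a + oa) << CELL_SHIFT;
        let centre_b = (base_b + ob) << CELL_SHIFT;
        if in_hexagon(centre_a - a, centre_b - b) {
            return (base_a + oa, base_b + ob);
        }
    }
    unreachable!("the rhombus covers exactly these four cells")
}
```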

That sounds like a lot to do, but with a few tricks we can reduce it down to fairly few operations.

In Hexeline, coordinates are represented as signed 32-bit integers, or i32 in Rust parlance. The cell size (i.e., “1.0” in the abstract math above) is selected to be a power of two in order for the modulus and division operations to be reducible to bit masks and shifts. Finally, an i32x4 is a vector of 4 i32s, essentially an SSE XMM register. With that said, we can start translating all this to real code.

We’ll start with an $(a,b)$ coordinate in one register; i.e., $a$ is in lane 0 and $b$ in lane 1.

let coord: i32x4 = /* some point */;

Now, recall how we have exactly four candidate hexagons? We also have exactly four SIMD lanes, so we can test all four in parallel. To prepare, we set up two SSE registers containing just the $a$ coordinate repeated four times, and just the $b$ coordinate repeated four times.

let base_a = i32x4::splat(coord.extract(0));
let base_b = i32x4::splat(coord.extract(1));

From these, we can calculate the coordinates of the four candidate hexagons as well as their $Δa$ and $Δb$.

// Offset both coordinates by 0 or 1 cells.
let off_a = base_a + i32x4::new(0,         0, CELL_SIZE, CELL_SIZE);
let off_b = base_b + i32x4::new(0, CELL_SIZE,         0, CELL_SIZE);
// Round down to the "whole integer" coordinates. We can do this by just
// zeroing out the lower few bits of each lane.
let candidate_a = off_a & i32x4::splat(MASK);
let candidate_b = off_b & i32x4::splat(MASK);

let disp_a = candidate_a - base_a;
let disp_b = candidate_b - base_b;

What about $Δc$? Well, since we know that $a + b + c = 0$, i.e. $c = -a-b$, we know that $Δc = -Δa-Δb$, so we can simply calculate it.

let disp_c = -disp_a - disp_b;

Calculating the distance function is pretty straight-forward.

// .max() and .min() expand to the `pmaxsd` and `pminsd` SSE4.1
// instructions when available and are emulated otherwise.
let max_disp = disp_a.max(disp_b).max(disp_c);
let min_disp = disp_a.min(disp_b).min(disp_c);
let distance = max_disp - min_disp;

Now that we know the distance from all four candidates, we must simply select a candidate whose lane in distance is less than or equal to CELL_SIZE. First, we can subtract CELL_SIZE+1 from every lane to cause the sign bit to be set on all lanes where the membership function is satisfied (assuming no overflow). Then, we can use the movmskps (“move mask”) SSE instruction to find out which lanes have the sign bit set.

let valid_lanes = (distance - i32x4::splat(CELL_SIZE + 1)).movemask();
// valid_lanes is a 4-bit value containing a 1 bit for a candidate
// lane which is within `CELL_SIZE` distance or 0 otherwise.
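In scalar terms, the sign-bit trick amounts to the following (an illustrative sketch, not Hexeline’s code):

```rust
// After subtracting CELL_SIZE + 1, the sign bit is set exactly when
// distance <= CELL_SIZE; shifting it down extracts that bit.
fn lane_bit(distance: i32, cell_size: i32) -> u32 {
    ((distance - (cell_size + 1)) as u32) >> 31
}
```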

An obvious way to go from here would be to select a lane, then extract that lane from candidate_a and candidate_b. This works but is very slow, since AMD64 does not have any way to access a runtime-determined lane from an SSE register. If you do this, the compiler implements it by writing the SSE register to memory and then treating it like an array and reading the desired element back.

Instead, we should compute the top-left cell coordinate and then derive the $(a,b)$ offset of the candidate we want. The offset for each axis is either 0 or 1, so we can simply use an integer as a bitset and select a value out of it. I’m going to gloss over the exact contents of these lookup tables for this post since their derivation is uninteresting.

// Divide the input coordinate by `CELL_SIZE`
let base_cell = coord >> CELL_SHIFT;
// Compute A and B offsets directly from the movemask output.
// The binary constants control how we select from among the candidates.
let a_off = (0b11111100 >> (valid_lanes >> 1)) & 1;
let b_off = (0b11110010 >> (valid_lanes >> 1)) & 1;
// Finally, we can return the cell coordinate
(base_cell.extract(0) + a_off, base_cell.extract(1) + b_off)

This in fact compiles to extremely efficient code:

; Put the A and B coordinates into their own registers
pshufd xmm1,xmm0,0x0
pshufd xmm2,xmm0,0x55
; Calculate candidate coordinates
movdqa xmm3,XMMWORD PTR [rip+0x9fbee]
paddd  xmm3,xmm1
movdqa xmm4,XMMWORD PTR [rip+0x9fbf2]
pand   xmm3,xmm4
movdqa xmm5,XMMWORD PTR [rip+0x9fbf6]
paddd  xmm5,xmm2
pand   xmm5,xmm4
; Calculate A, B, C displacements
psubd  xmm3,xmm1
psubd  xmm5,xmm2
movdqa xmm1,xmm3
paddd  xmm1,xmm5
pxor   xmm2,xmm2
psubd  xmm2,xmm1
; Calculate overall distance
movdqa xmm1,xmm3
pmaxsd xmm1,xmm5
pmaxsd xmm1,xmm2
pminsd xmm5,xmm3
pminsd xmm5,xmm2
; Subtract CELL_SIZE+1 (folded into the constant).
movdqa xmm2,XMMWORD PTR [rip+0x9fcc6]
psubd  xmm2,xmm5
; Finish up distance calculation
paddd  xmm2,xmm1
; The move mask instruction to find out which candidate(s)
; are within the right distance
movmskps ecx,xmm2
; Select bits from the bitsets for A and B offset
shr    ecx,1
mov    edx,0xfc
shr    edx,cl
and    edx,0x1
mov    esi,0xf2
shr    esi,cl
and    esi,0x1
; Divide the input by CELL_SIZE
psrad  xmm0,0x9
; Extract the A and B coordinates and apply their offsets
movd   eax,xmm0
add    eax,edx
pextrd ecx,xmm0,0x1
add    ecx,esi
; Combine the result tuple and return
shl    rcx,0x20
or     rax,rcx
ret

That’s 36 instructions (excluding the return sequence), which my K10 processor can chew through in around 3 to 4 nanoseconds (12 to 16 clock cycles).

What’s this useful for? Collision detection, specifically testing whether a point particle collides with, e.g., a ship, after the coarse bounding-rhombus collision detection finds a candidate pair. With this capability, this test is simply a matter of finding the cell coordinate, then checking whether the ship has something there.
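For illustration, the final lookup might look something like the sketch below; the `Ship` layout and bounds here are entirely hypothetical, the point being that a cell coordinate makes the test a single array access.

```rust
/// Hypothetical ship cell storage: a directly-addressable grid of
/// occupancy flags, indexed [b][a], in the ship's own frame.
struct Ship {
    cells: [[bool; 8]; 8],
}

impl Ship {
    /// Does the ship have something at hexagonal cell (a, b)?
    fn occupied(&self, a: i32, b: i32) -> bool {
        if a < 0 || b < 0 || a >= 8 || b >= 8 {
            return false;
        }
        self.cells[b as usize][a as usize]
    }
}
```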

Collision detection between ships uses a very similar algorithm to find pairs of overlapping hexagons.

Conclusion

That about wraps it up for today, though there almost certainly will be more posts on this topic in the future.

If you want to read through more details, the source code is available online.