Efficient Bounding Rhombi

2017-08-06 in hexeline micro-optimisation

A bounding box generally refers to an axis-aligned rectangular region of space used as a first, coarse step of collision detection. Every object is given a bounding box which covers all space the object could possibly occupy. If the bounding boxes of two objects overlap, the simulation needs to do a more precise but expensive collision check; but if they do not overlap, they certainly do not collide and the pair can be skipped.

Since Hexeline uses an oblique coordinate system, the same approach results in bounding rhombi instead of boxes, though that distinction is not particularly interesting here. What is interesting is how bounding boxes/rhombi can be handled extremely efficiently with SIMD. I fully expect that this technique has been discovered by someone else previously, but it’s still worth discussing.

The rest of this post will just use “bounding box” and normal (X,Y) coordinates, so familiarity with “hexagonal coordinates” is not necessary.

There are a couple of ways a bounding box can be represented. The one we’re interested in here is the 2D interval representation, wherein we have two points $(x_0,y_0)$ and $(x_1,y_1)$ such that $x_0 ≤ x_1$ and $y_0 ≤ y_1$. We then say that any point $(x,y)$ is within the box if $x_0≤x≤x_1$ and $y_0≤y≤y_1$.
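As a point of reference, the containment rule can be written out in plain scalar code. This is only an illustrative sketch: the `Interval2D` type and its `contains` method are names made up here, not part of Hexeline.

```rust
/// Scalar model of the 2D interval representation (hypothetical type,
/// for illustration only). (x0, y0) is the lower corner and (x1, y1)
/// the upper corner, with x0 <= x1 and y0 <= y1.
#[derive(Clone, Copy, Debug)]
struct Interval2D {
    x0: i32,
    y0: i32,
    x1: i32,
    y1: i32,
}

impl Interval2D {
    /// A point is inside the box if it lies within both closed intervals.
    fn contains(self, x: i32, y: i32) -> bool {
        self.x0 <= x && x <= self.x1 && self.y0 <= y && y <= self.y1
    }
}
```

Both bounds are inclusive, so a point sitting exactly on an edge or corner counts as inside.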

Notice there are exactly four coordinate values. Four is a magic number here: it means we can make full use of a SIMD word! In other words, we can pack $(x_0,y_0,x_1,y_1)$ in a single i32x4 (i.e., a SIMD vector of 4 32-bit signed integers):

// A "newtype" which gets us type-safety from accidentally mixing with other
// uses of `i32x4`.
#[derive(Clone, Copy, Debug)]
struct BoundingBox(i32x4);
impl BoundingBox {
    fn new(x0: i32, y0: i32, x1: i32, y1: i32) -> Self {
        BoundingBox(i32x4::new(x0, y0, x1, y1))
    }
}

Before getting into how we check these for overlap, let’s think about another operation: union. In order to efficiently find candidate objects for collision checks, a common approach is to build some form of tree of bounding boxes. This requires each node to have a bounding box that encompasses at least all space of its children, i.e., to be a union of its children.

We can certainly implement union with the current representation.

impl BoundingBox {
    fn union(self, other: Self) -> Self {
        // x0 and y0 are chosen from the minima of the inputs
        let min = self.0.min(other.0);
        // x1 and y1 are chosen from the maxima of the inputs
        let max = self.0.max(other.0);
        // Take x0 and y0 from `min`, x1 and y1 from `max`
        BoundingBox(min.blend(max, 00, 01, 12, 13))
    }
}

Unfortunately, this isn’t great. With SSE4.1, it’s just a pminsd, pmaxsd, pblendw sequence. But for systems without min and max instructions, these need to be emulated, and the fact that there are two of them causes a large portion of the register file to be used just for this operation. It could be improved by inlining the min and max emulation to remove a lot of the redundant work, but we can still make something better for both cases.
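To give a feel for what that emulation costs, here is one common way to build a max out of a comparison and bitwise operations, shown on a single lane. This is an assumption about how such an emulation might look, not necessarily what any particular SIMD library does; `emulated_max` is a name invented for this sketch.

```rust
/// Branch-free max of one lane (hypothetical helper, for illustration).
/// A SIMD version would do the same thing on all four lanes at once.
fn emulated_max(a: i32, b: i32) -> i32 {
    // All ones when a > b, all zeros otherwise. This mirrors what a
    // lane-wise "greater than" comparison such as `pcmpgtd` produces.
    let mask = -((a > b) as i32);
    // Select `a` where the mask is set and `b` where it is clear.
    (a & mask) | (b & !mask)
}
```

That is a compare, two ANDs (one with a complemented mask) and an OR per emulated operation, and the min needs the same again with the operands swapped, which is where the register pressure comes from.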

The only reason we needed the extra instructions here is that the condition for choosing the lower-bound coordinate is different from that of choosing the upper-bound coordinate. With a simple tweak to the representation, we can make that condition the same. We simply negate $x_0$ and $y_0$, which allows finding the minimum $x_0$ and $y_0$ to be done by finding the maximum of their negated values. Thus, all lanes are selected by the same criterion.

impl BoundingBox {
    fn new(x0: i32, y0: i32, x1: i32, y1: i32) -> Self {
        BoundingBox(i32x4::new(-x0, -y0, x1, y1)) // Note negative signs
    }

    fn union(self, other: Self) -> Self {
        BoundingBox(self.0.max(other.0))
    }
}

With SSE4.1, this produces a single pmaxsd instruction. On ARM with NEON, it is also a single instruction. Elsewhere, max still needs to be emulated, but there’s now a lot less to do.
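To see that the lane-wise max really does compute the union, here is a scalar model that uses a plain `[i32; 4]` in place of `i32x4`. The `NegatedBox` name is made up for this sketch, and the loop stands in for what would be one vector instruction.

```rust
/// Scalar stand-in for the SIMD box (hypothetical type, for
/// illustration). Lanes are (-x0, -y0, x1, y1).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NegatedBox([i32; 4]);

impl NegatedBox {
    fn new(x0: i32, y0: i32, x1: i32, y1: i32) -> Self {
        NegatedBox([-x0, -y0, x1, y1]) // Note negative signs
    }

    /// Lane-wise max. The max of the negated lower bounds is the
    /// negation of the min of the lower bounds.
    fn union(self, other: Self) -> Self {
        let mut out = [0i32; 4];
        for i in 0..4 {
            out[i] = self.0[i].max(other.0[i]);
        }
        NegatedBox(out)
    }
}
```

For example, the union of the boxes (0,0)-(4,4) and (-2,1)-(3,6) comes out as (-2,0)-(4,6): the lanes are (0,0,4,4) and (2,-1,3,6), and their lane-wise max is (2,0,4,6).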

Checking whether two interval-based bounding boxes overlap is simply a matter of checking whether both one-dimensional intervals they comprise overlap.

Suppose we have a pair of one-dimensional intervals, $(l_0,l_1)$ and $(r_0,r_1)$. It may be non-obvious at first, but they overlap if and only if $l_0 ≤ r_1$ and $r_0 ≤ l_1$. However, in our bounding box representation, we don’t have the lower bounds directly; we have the negated lower bounds, so we would need to check $-(-l_0) ≤ r_1$ and $-(-r_0) ≤ l_1$.
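The one-dimensional condition is easy to check in isolation. A minimal sketch, using the un-negated bounds and a hypothetical `intervals_overlap` function, with each interval passed as a `(lower, upper)` tuple:

```rust
/// Closed intervals [l.0, l.1] and [r.0, r.1] overlap if and only if
/// each one starts no later than the other one ends.
/// (Hypothetical helper, for illustration only.)
fn intervals_overlap(l: (i32, i32), r: (i32, i32)) -> bool {
    l.0 <= r.1 && r.0 <= l.1
}
```

Because the intervals are closed, two intervals that merely touch at an endpoint, such as (0,5) and (5,9), count as overlapping, while (0,4) and (5,9) do not.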

This looks pretty inconvenient from a SIMD perspective: we’re doing a different operation on each lane. But it can actually be simplified quite a bit. We can start by putting all the $l$ terms on the left-hand side.

$-(-l_0) ≤ r_1$ and $l_1 ≥ -(-r_0)$

In the second inequality the $r$ term is negated, but in the first it is not, so multiply both sides of the first inequality by $-1$ (which also flips ≤ to ≥).

$(-l_0) ≥ -r_1$ and $l_1 ≥ -(-r_0)$
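Applying the same pair of conditions to both axes gives four inequalities, which can be stacked into a single lane-wise comparison. The superscripts $L$ and $R$ are notation introduced here to tell the two boxes apart; the left-hand side is the left box exactly as stored:

$$(-x_0^L,\ -y_0^L,\ x_1^L,\ y_1^L) \ \ge\ (-x_1^R,\ -y_1^R,\ x_0^R,\ y_0^R)$$

Lanes 0 and 1 say that the left box starts no later than the right box ends ($x_0^L ≤ x_1^R$ and $y_0^L ≤ y_1^R$), and lanes 2 and 3 say that the right box starts no later than the left box ends ($x_0^R ≤ x_1^L$ and $y_0^R ≤ y_1^L$).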

We now have something SIMD-friendly. Every lane of the right-hand side is negated, and we then perform the same comparison on every lane after a shuffle. We can view this as a two-step process: “invert” the right-hand side by negating all the terms and putting the upper bounds in the lower bounds’ lanes and vice-versa, then perform the comparison.

The first is fairly easy to implement:

// Another "newtype" around i32x4 so that we can't accidentally forget the
// invert step
#[derive(Clone, Copy, Debug)]
struct InvertedBoundingBox(i32x4);
impl BoundingBox {
    fn invert(self) -> InvertedBoundingBox {
        // Lanes:     0    1   2   3
        // Input:  (-x0, -y0, x1, y1)
        // Output: (-x1, -y1, x0, y0)
        InvertedBoundingBox((-self.0).shuf(2, 3, 0, 1))
    }
}

For the comparison step, we want to check that every lane of the left-hand side is greater than or equal to the corresponding lane of the right-hand side. Assuming space is not so large that overflow is a concern, we know that if $a≥b$, the result of $a-b$ will never have its sign bit set. The “move mask” primitive (movmskps in SSE speak) will give us the sign bit of all lanes in a single integer. Putting these together, we can simply subtract the two vectors and test that all the sign bits are zero.

impl BoundingBox {
    fn overlaps(self, other: InvertedBoundingBox) -> bool {
        let diff = self.0 - other.0;
        0 == diff.movemask()
    }
}
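The whole overlap test can be modelled in scalar code to check the lane bookkeeping. This sketch uses `[i32; 4]` in place of `i32x4` and emulates the move mask by collecting sign bits in a loop; `ScalarBox` and `InvertedScalarBox` are names made up for the illustration.

```rust
/// Scalar stand-in for the SIMD box (hypothetical type, for
/// illustration). Lanes are (-x0, -y0, x1, y1).
#[derive(Clone, Copy, Debug)]
struct ScalarBox([i32; 4]);

/// Inverted form. Lanes are (-x1, -y1, x0, y0).
#[derive(Clone, Copy, Debug)]
struct InvertedScalarBox([i32; 4]);

impl ScalarBox {
    fn new(x0: i32, y0: i32, x1: i32, y1: i32) -> Self {
        ScalarBox([-x0, -y0, x1, y1])
    }

    fn invert(self) -> InvertedScalarBox {
        let [neg_x0, neg_y0, x1, y1] = self.0;
        // Negate every lane, then swap the lower and upper halves.
        InvertedScalarBox([-x1, -y1, -neg_x0, -neg_y0])
    }

    fn overlaps(self, other: InvertedScalarBox) -> bool {
        // Emulate "subtract, then move mask": gather the sign bit of
        // each lane's difference into bit `i` of `mask`.
        let mut mask = 0u32;
        for i in 0..4 {
            let diff = self.0[i].wrapping_sub(other.0[i]);
            mask |= ((diff as u32) >> 31) << i;
        }
        // No sign bit set means every lane satisfied lhs >= rhs.
        mask == 0
    }
}
```

With this model, (0,0)-(4,4) overlaps (3,3)-(8,8), does not overlap (5,0)-(8,4), and does overlap (4,4)-(6,6), since boxes that share only a corner still count as overlapping under the closed-interval definition.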

That’s about it for bounding boxes. It may seem like all this just makes already fast operations even faster. However, both unions and overlap tests must be performed multiple times per update for every object, so they count against Hexeline’s 100 nanosecond budget, and making them as fast as possible is extremely important.
