The bad news is that their build system is extremely hand-rolled, so if it works for you, count yourself lucky: when it doesn't, you're in for 4 hours of Python hell.
Interesting stuff. Not sure if I read this right: is it 16- and 32-bit integer values that get sorted? If so, I'd love to see whether the GPU implementation can beat a competitive radix sort implementation on a CPU.
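For anyone who hasn't seen one, here is a minimal sketch of the kind of LSD radix sort I mean, for unsigned 32-bit integers. It is plain Python for readability only; a competitive CPU version would be written in C or C++ with tuned digit widths and memory access, and the name `radix_sort_u32` and the 8-bit digit size are just my choices for illustration.

```python
def radix_sort_u32(values, bits_per_pass=8):
    """LSD radix sort for unsigned 32-bit integers (illustrative, not tuned)."""
    mask = (1 << bits_per_pass) - 1
    out = list(values)
    for shift in range(0, 32, bits_per_pass):
        # Histogram of the current digit.
        counts = [0] * (mask + 1)
        for v in out:
            counts[(v >> shift) & mask] += 1
        # Exclusive prefix sum turns counts into bucket start offsets.
        total = 0
        for d in range(mask + 1):
            counts[d], total = total, total + counts[d]
        # Stable scatter into a fresh buffer, preserving order within a bucket.
        buf = [0] * len(out)
        for v in out:
            d = (v >> shift) & mask
            buf[counts[d]] = v
            counts[d] += 1
        out = buf
    return out
```

With 8-bit digits that is four passes over the data regardless of its initial order, which is why radix sort is a common baseline for fixed-width integer keys.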
It's 32 32-bit values which get sorted. I don't think a GPU sort would beat a CPU sort at this scale, even if you don't take kernel launch time into account. CPUs are simply too fast for (super-)small data, especially with AVX-512.
But if we're talking about a larger amount of data, that would be a different story, i.e. as part of a normal GPU merge sort.
It is also useful if your data already lives in GPU memory. For example, when you need to z-sort a bunch of particles in a 3D renderer's particle system.
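To make "z-sort" concrete, here is a rough CPU-side sketch in Python of what the sort computes: each particle gets a depth along the camera's view direction, and the particles are ordered back to front, as is typically wanted for alpha blending. The function name, the tuple layout, and the camera parameters are made up for illustration; on a GPU the same depth key would be fed to the device-side sort instead.

```python
def z_sort(particles, cam_pos, cam_forward):
    """Order particles back to front along the camera's view direction."""
    def depth(p):
        # Dot product of (particle - camera) with the forward vector.
        return sum((pc - cc) * fc for pc, cc, fc in zip(p, cam_pos, cam_forward))
    # Farthest first, so nearer particles are drawn on top.
    return sorted(particles, key=depth, reverse=True)
```

For a camera at the origin looking down +z, `z_sort([(0, 0, 1), (0, 0, 5), (0, 0, 3)], (0, 0, 0), (0, 0, 1))` puts the particle at z = 5 first and the one at z = 1 last.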
Without knowing what they actually use, I feel comfortable stating that the industry has moved on from Contraction Hierarchies to somewhat more flexible techniques. These let you take traffic information, road closures, user preferences, and whatnot into account without requiring a full re-processing of the input data with each traffic update. The state of the art is a two-step preprocessing that first decomposes the road network into cells, and then processes these cells independently. Sometimes it goes by the name of customisable route planning, sometimes it is called multi-level Dijkstra.
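To give a feel for the idea, here is a heavily simplified, single-level sketch in Python. It assumes the cell assignment is already given (a real system would use a graph partitioner and several nested levels), and all names (`customize`, `query`, the `graph` and `cell` dictionaries) are invented for this example rather than taken from any production router. The per-cell step computes shortcut distances between each cell's boundary nodes; a query then searches the source and target cells in full and crosses every other cell via shortcuts only.

```python
import heapq

def dijkstra(neighbors, source):
    """Plain Dijkstra; neighbors(u) returns a list of (v, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in neighbors(u):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def customize(graph, cell):
    """Per-cell step: distances between boundary nodes, staying inside the cell."""
    # A boundary node is any endpoint of an edge that crosses between cells.
    boundary = {n for u in graph for v, _ in graph[u]
                if cell[u] != cell[v] for n in (u, v)}
    shortcuts = {}
    for b in boundary:
        def inside(u, c=cell[b]):
            return [(v, w) for v, w in graph[u] if cell[v] == c]
        dist = dijkstra(inside, b)
        shortcuts[b] = [(v, d) for v, d in dist.items()
                        if v in boundary and v != b]
    return shortcuts

def query(graph, cell, shortcuts, s, t):
    """Shortest s-t distance, or None if t is unreachable."""
    full = {cell[s], cell[t]}  # cells searched edge by edge
    def neighbors(u):
        if cell[u] in full:
            return graph[u]
        # Elsewhere: only cell-crossing edges plus precomputed shortcuts.
        cut = [(v, w) for v, w in graph[u] if cell[v] != cell[u]]
        return cut + shortcuts.get(u, [])
    return dijkstra(neighbors, s).get(t)

# Toy road network: a chain a-b-c-d-e-f split into three cells.
graph = {
    "a": [("b", 1)],
    "b": [("a", 1), ("c", 2)],
    "c": [("b", 2), ("d", 1)],
    "d": [("c", 1), ("e", 5)],
    "e": [("d", 5), ("f", 1)],
    "f": [("e", 1)],
}
cell = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}
shortcuts = customize(graph, cell)
```

The point of the split is that a traffic update only requires re-running `customize` (and in a real system only for the affected cells), while the partition itself stays untouched.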