
To summarize:

If comparison is cheap (e.g. when sorting integers), pdqsort wins because it copies less data around and the instructions are less data-dependency-heavy.

If comparison is expensive (e.g. when sorting strings), timsort is usually a tiny bit faster (around 5% or less) because it performs a slightly smaller total number of comparisons.


I think it's fair to say that pdqsort (pattern-defeating quicksort) is overall the best unstable sort and timsort is overall the best stable sort in 2017, at least if you're implementing one for a standard library.

The standard sort algorithm in Rust is timsort[1] (slice::sort), but soon we'll have pdqsort as well[2] (slice::sort_unstable), which shows great benchmark numbers.[3] Actually, I should mention that neither implementation is 100% equivalent to what is typically considered timsort or pdqsort, but both are pretty close.
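For anyone who wants to try both, a minimal sketch (assuming a compiler where slice::sort_unstable has landed):

```rust
fn main() {
    // slice::sort is stable: equal keys keep their original relative order.
    let mut pairs = vec![(2, 'a'), (1, 'b'), (2, 'c'), (1, 'd')];
    pairs.sort_by_key(|p| p.0);
    assert_eq!(pairs, [(1, 'b'), (1, 'd'), (2, 'a'), (2, 'c')]);

    // slice::sort_unstable gives no such guarantee for equal elements, but
    // allocates nothing (here all keys are distinct, so the result is fully
    // determined anyway).
    let mut v = vec![5, 3, 1, 4, 2];
    v.sort_unstable();
    assert_eq!(v, [1, 2, 3, 4, 5]);
}
```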

It is notable that Rust is the first programming language to adopt pdqsort, and I believe its adoption will only grow in the future.

Here's a fun fact: Typical quicksorts (and introsorts) in standard libraries spend most of the time doing literally nothing - just waiting for the next instruction because of failed branch prediction! If you manage to eliminate branch misprediction, you can easily make sorting twice as fast! At least that is the case if you're sorting items by an integer key, or a tuple of integers, or something primitive like that (i.e. when comparison is rather cheap).

Pdqsort efficiently eliminates branch mispredictions and brings some other improvements over introsort as well - for example, the complexity becomes O(nk) if the input array is of length n and consists of only k different values. Of course, worst-case complexity is always O(n log n).
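The core trick is the one from BlockQuicksort: instead of branching on each comparison, store the element's offset unconditionally and advance the write cursor by the comparison result cast to an integer. A hedged sketch of that inner loop (the function name and the 128-element block size are illustrative, not pdqsort's actual code):

```rust
// Collect offsets of elements >= pivot within a block (block.len() <= 128).
// The loop body has no data-dependent branch: `offsets[count]` is written
// unconditionally, and `count` advances by the comparison cast to 0 or 1.
fn collect_offsets_branchless(block: &[i32], pivot: i32, offsets: &mut [u8; 128]) -> usize {
    let mut count = 0;
    for (i, &x) in block.iter().enumerate() {
        offsets[count] = i as u8;
        count += (x >= pivot) as usize;
    }
    count
}

fn main() {
    let block = vec![3, 9, 1, 7, 5];
    let mut offsets = [0u8; 128];
    let n = collect_offsets_branchless(&block, 5, &mut offsets);
    // Elements >= 5 sit at indices 1 (9), 3 (7), and 4 (5).
    assert_eq!(n, 3);
    assert_eq!(&offsets[..3], &[1, 3, 4]);
}
```

The swap phase then uses the collected offsets in a separate pass, so the unpredictable comparisons never steer the instruction pointer.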

Finally, last week I implemented parallel sorts for Rayon (Rust's data parallelism library) based on timsort and pdqsort[4].

Check out the links for more information and benchmarks. Keep in mind that the benchmarks are rather simplistic, so take them with a grain of salt.

I'd be happy to elaborate further and answer any questions. :)

[1] https://github.com/rust-lang/rust/pull/38192

[2] https://github.com/rust-lang/rust/issues/40585

[3] https://github.com/rust-lang/rust/pull/40601

[4] https://github.com/nikomatsakis/rayon/pull/379


To illustrate the point about branch misprediction, I implemented quicksort with intentionally skewed pivot selection. You can see the results here: https://github.com/lorenzhs/quicksort-pivot-imbalance

For this demo I'm sorting a permutation of the range 1 to n, so obtaining perfect or arbitrarily skewed pivots is free, which isn't realistic, but that's not the point. My experiments show that quicksort is fastest on my machine when 15% of the elements are on one side and 85% on the other. This is a tradeoff between branch prediction (if it always guesses "right side", it's right 85% of the time) and recursion depth (which grows as the pivot becomes more and more imbalanced).
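A hedged sketch of the setup (this is not the linked code, just an illustration of the trick: on a permutation of a known range, the value at any quantile is known in advance, so a pivot with any desired skew is free):

```rust
/// Quicksort a permutation of `base..base + v.len()`, always picking the
/// pivot at the `skew` quantile. Since the slice is a permutation of a known
/// range, that pivot value is simply `base + skew * (len - 1)` -- no work.
fn quicksort_skewed(v: &mut [usize], base: usize, skew: f64) {
    if v.len() <= 1 {
        return;
    }
    let pivot = base + ((v.len() - 1) as f64 * skew) as usize;
    // Partition: values below the pivot go to the front.
    let mut i = 0;
    for j in 0..v.len() {
        if v[j] < pivot {
            v.swap(i, j);
            i += 1;
        }
    }
    // Move the pivot itself into its final slot at index i.
    let p = i + v[i..].iter().position(|&x| x == pivot).unwrap();
    v.swap(i, p);
    let (left, rest) = v.split_at_mut(i);
    quicksort_skewed(left, base, skew);
    quicksort_skewed(&mut rest[1..], base + i + 1, skew);
}

fn main() {
    // A deterministic permutation of 0..1000 (7 is coprime to 1000).
    let mut v: Vec<usize> = (0..1000).map(|i| i * 7 % 1000).collect();
    quicksort_skewed(&mut v, 0, 0.15); // the 15/85 split from the experiment
    assert!(v.windows(2).all(|w| w[0] < w[1]));
}
```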

Of course, skewing the pivot is not what you want to do. Instead, look at sorting algorithms that don't suffer (as much) from branch mispredictions, such as pdqsort, super-scalar sample sort [1,2], or BlockQuicksort [3].

[1] excellent paper on a parallel in-place adaptation by some of my colleagues: https://arxiv.org/abs/1705.02257 - code will become available soon

[2] my less sophisticated implementation: https://github.com/lorenzhs/ssssort/, I also have a version using pdqsort as base case sorter, which is faster: https://github.com/lorenzhs/ssssort/blob/pdq/speed.pdf (in that plot, ssssort = super scalar sample sort with pdqsort as base case)

[3] https://arxiv.org/abs/1604.06697, implementation at https://github.com/weissan/BlockQuicksort - pdqsort uses the same technique


Thanks for making it clear in the function name whether a sort is stable or unstable. One of my longer debug sessions was tracing a porting error from Java to C#, whose default sort functions differ in stability.


Super picky reply: the GAP programming language (www.gap-system.org) used pdqsort before Rust :) (I know, as I implemented it).


Comparing sorting algorithms often says more about your benchmark than about the algorithms themselves. Random and pathological inputs are the obvious cases, but often you're dealing with something in between. Radix sort vs. O(n log n) comparison sorts is another issue.

So, what were your benchmarks like?


That is true - the benchmarks mostly focus on random cases, although there are a few benchmarks with "mostly sorted" arrays (sorted arrays with sqrt(n) random swaps).
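For reference, a sketch of how such a "mostly sorted" input can be generated (the exact RNG used in the real benchmarks isn't stated here, so a xorshift-style generator stands in):

```rust
// Simple xorshift64 step; any decent PRNG would do here.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// A sorted array of 0..n with sqrt(n) random swaps applied.
fn mostly_sorted(n: usize) -> Vec<u64> {
    let mut v: Vec<u64> = (0..n as u64).collect();
    let mut state: u64 = 0x243F_6A88_85A3_08D3; // arbitrary nonzero seed
    let swaps = (n as f64).sqrt() as usize;
    for _ in 0..swaps {
        let a = (xorshift(&mut state) as usize) % n;
        let b = (xorshift(&mut state) as usize) % n;
        v.swap(a, b);
    }
    v
}

fn main() {
    let v = mostly_sorted(10_000);
    // Still almost sorted: each swap creates at most ~4 descents.
    let descents = v.windows(2).filter(|w| w[0] > w[1]).count();
    assert!(descents <= 400);
}
```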

If the input array consists of several concatenated ascending or descending sequences, then timsort is the best. After all, timsort was specifically designed to take advantage of that particular case. Pdqsort performs respectably, too, and if you have more than a dozen of these sequences or if the sequences are interspersed, then it starts winning over timsort.

Anyway, both pdqsort and timsort perform well when the input is not quite random. In particular, pdqsort blows introsort (e.g. typical C++ std::sort implementations) out of the water when the input is not random[1]. It's pretty much a strict improvement over introsort. Likewise, timsort (at least the variant implemented in Rust's standard library) is pretty much a strict improvement over merge sort (e.g. typical C++ std::stable_sort implementations).

Regarding radix sort, pdqsort can't quite match its performance (it's O(n log n) after all), but can perform fairly respectably. E.g. ska_sort[2] (a famous radix sort implementation) and Rust's pdqsort perform equally well on my machine when sorting 10 million random 64-bit integers. However, on larger arrays radix sort starts winning easily, which shouldn't be surprising.

I'm aware that benchmarks are tricky to get right, can be biased, and are always controversial. If you have any further questions, feel free to ask.

[1]: https://github.com/orlp/pdqsort

[2]: https://github.com/skarupke/ska_sort


There are lots of pathologies to watch out for... Imagine sorting 10 million random ints, where 1% of them are random 64-bit values and 99% are in the range [0..9]. Might be extra fun in parallel.


Hi stjepang,

If the following criteria are met, then perhaps the branch-mispredict penalty is less of a problem:

1. You are sorting a large amount of data, much bigger than the CPU's LLC.

2. You can effectively utilize all cores, i.e. your sort algorithm parallelizes well.

In that case you are probably memory-bandwidth limited, so you're spending more time waiting on data than waiting on pipeline flushes (i.e. the consequence of mispredicts).


Absolutely - there are cases when branch misprediction is not the bottleneck. It depends on a lot of factors.

Another such case is when sorting strings because every comparison causes a potential cache miss and introduces even more branching, and all that would dwarf that one misprediction.


I have no doubt that pdqsort is faster than the other sorting algorithms, but before I swap std::sort for it in serious commercial apps I'd like some proof and battle testing.

That said, if it really is a drop-in replacement for std::sort, I'll try it out ASAP.


There is a chapter in Rustonomicon on subtyping and variance: https://doc.rust-lang.org/nomicon/subtyping.html


Author here. Yes, this is an interesting problem TimSort used to have. A counterexample for the incorrect TimSort is the run-length sequence "120, 80, 25, 20, 30". This simple sequence could easily have been discovered by generating sequences randomly and verifying that the TimSort invariants hold. The fact that it took 13 years to discover the bug (using formal verification!) is just silly :)

Fortunately, the problem is easy to fix and there's a brief mention of it in the comments: https://is.gd/txKd4I
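For the curious, here is a rough Rust rendering of the pre-fix merge rule, tracking only run lengths rather than actual runs (simplified from the Java code, so treat it as a sketch): pushing the runs 120, 80, 25, 20, 30 leaves the stack in a state that violates the invariant runs[i] > runs[i+1] + runs[i+2].

```rust
// Pre-fix merge_collapse logic: only the topmost runs are inspected, so an
// invariant violation deeper in the stack can go unnoticed.
fn merge_collapse(runs: &mut Vec<usize>) {
    while runs.len() > 1 {
        let mut n = runs.len() - 2;
        if n > 0 && runs[n - 1] <= runs[n] + runs[n + 1] {
            if runs[n - 1] < runs[n + 1] {
                n -= 1;
            }
            let merged = runs[n] + runs[n + 1];
            runs[n] = merged;
            runs.remove(n + 1);
        } else if runs[n] <= runs[n + 1] {
            let merged = runs[n] + runs[n + 1];
            runs[n] = merged;
            runs.remove(n + 1);
        } else {
            break;
        }
    }
}

// The invariant the implementation relied on for its stack-size bound.
fn invariant_holds(runs: &[usize]) -> bool {
    runs.windows(2).all(|w| w[0] > w[1])
        && runs.windows(3).all(|w| w[0] > w[1] + w[2])
}

fn main() {
    let mut stack = Vec::new();
    for &len in &[120usize, 80, 25, 20, 30] {
        stack.push(len);
        merge_collapse(&mut stack);
    }
    // The collapse rule merged 25 and 20, leaving [120, 80, 45, 30]:
    // 120 > 80 + 45 is false, so the invariant is broken.
    assert_eq!(stack, [120, 80, 45, 30]);
    assert!(!invariant_holds(&stack));
}
```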


To transfer ownership, you would wrap the object in Mutex<T>, send it via Sender<T>, use some other synchronization primitive, or simply pass it over at the moment the thread is spawned.

It's up to the synchronization primitives to ensure memory barriers are executed, not the language. So e.g. a Mutex<T> would execute a memory barrier when locking and unlocking. Rust then allows you to pass ownership in a trusted manner only, i.e. through those primitives.

Of course, you can also create an unsafe block and say: "Rust, trust me, I know what I'm doing, don't bother me, and these are the invariants I want you to enforce in safe code..." Mutexes are implemented this way.
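A minimal sketch of the channel case, where ownership moves into the spawned thread and back out through a Sender<T>:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // The Vec is created (and owned) inside the thread...
        let data = vec![1, 2, 3];
        // ...and ownership moves back to the parent through the channel.
        tx.send(data).unwrap();
    });
    let data = rx.recv().unwrap();
    handle.join().unwrap();
    // The parent now owns the Vec; no locks or copies were needed.
    assert_eq!(data, vec![1, 2, 3]);
}
```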


Yes, there is a simpler approach.

If your function returns Result<T, Box<Error>>, then you can use the try! macro or the ? operator to return any kind of error. The error will be automatically boxed and converted into the generic Box<Error>.

This works for two reasons:

1) try! and ? don't simply return Err(err) in the error case, as one might think. They actually return Err(From::from(err)).

2) The standard library has: impl<'a, E: Error + 'a> From<E> for Box<Error + 'a> { ... }
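A minimal sketch (written with the later `dyn Error` syntax; the file path is made up for illustration):

```rust
use std::error::Error;
use std::fs::File;
use std::io::Read;

// Both the io::Error from File::open/read_to_string and the ParseIntError
// from parse are converted into Box<dyn Error> by the `?` operator.
fn read_port(path: &str) -> Result<u16, Box<dyn Error>> {
    let mut s = String::new();
    File::open(path)?.read_to_string(&mut s)?;
    let port: u16 = s.trim().parse()?;
    Ok(port)
}

fn main() {
    // A made-up, nonexistent path: the io::Error comes back boxed.
    assert!(read_port("/no/such/file").is_err());
}
```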

Check out this article, starting from "The From trait": http://blog.burntsushi.net/rust-error-handling/


Another solution I hadn't considered is to simply impl std::convert::From<SomeError> for your custom error type for each SomeError that might be encountered.
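A sketch of that approach with a hypothetical ConfigError wrapping two error sources; `?` then converts via these From impls, with no boxing involved:

```rust
use std::io;
use std::num::ParseIntError;

// A hypothetical custom error type wrapping two possible failure sources.
#[derive(Debug)]
enum ConfigError {
    Io(io::Error),
    Parse(ParseIntError),
}

impl From<io::Error> for ConfigError {
    fn from(e: io::Error) -> ConfigError {
        ConfigError::Io(e)
    }
}

impl From<ParseIntError> for ConfigError {
    fn from(e: ParseIntError) -> ConfigError {
        ConfigError::Parse(e)
    }
}

// `?` converts either error type into ConfigError via the From impls above.
fn parse_port(s: &str) -> Result<u16, ConfigError> {
    Ok(s.trim().parse()?)
}

fn main() {
    assert_eq!(parse_port("8080").unwrap(), 8080);
    assert!(matches!(parse_port("not a port"), Err(ConfigError::Parse(_))));
}
```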


Awesome, thanks. Does returning boxed errors have a performance impact?


It should only have an impact when errors actually occur, I think.


Chris Lattner on the advantages of reference counting over tracing garbage collection: https://lists.swift.org/pipermail/swift-evolution/Week-of-Mo...

