More

sarah-ek · on April 25, 2024

i have, yes. i can't speak for openblas or mkl, but im familiar with eigen and nalgebra's implementations to some extent

nalgebra doesn't use blocking, so decompositions are handled one column (or row) at a time. this is great for small matrices, but scales poorly for larger ones

eigen uses blocking for most decompositions, other than the eigendecomposition, but they don't have a proper threading framework. the only operation that is properly multithreaded is matrix multiplication using openmp (and the unstable tensor module using a custom thread pool)

sarah-ek · on April 24, 2024

because we can do better

1980phipsi · on April 25, 2024

What features of Rust let you write faster code than C/C++/Fortran?

adgjlsfhk1 · on April 25, 2024

most of it's just general programming niceness. If you have to spend a few hours to wrestle with make/bazel/etc every time you need to reach for a dependency, you don't depend on things and have to rewrite them yourself. If your programming language doesn't have good ways of writing generic code, you either have to write the code once per precision (and do all your bugfixes/perf improvements in triplicate) or do very hacky metaprogramming where you use Python to generate your low level code (yes the Fortran people actually do this https://github.com/aradi/fypp), or use the C preprocesser which is even worse.

sarah-ek · on April 24, 2024

as far as i know, very little is shared. ndarray-linalg is mostly a lapack wrapper. nalgebra and faer both implement the algorithms from scratch, with nalgebra focusing more on smaller matrices

sarah-ek · on April 24, 2024

one issue that may be affecting the result is that openmp's threadpool doesn't play well with rayon's. i've seen some perf degradation in the past (on both sides) when both are used in the same program

i plan to address that after refactoring the benches by executing each library individually

jimbobraggins · on April 24, 2024

Very cool project! I'd suggest before running the new benchmarks to reach out to to the developers of the packages you are testing against to see if they think the benchmark you wrote is doing efficient calling conventions. I work on a large open source software project and we've had people claim they are 10x faster than us while they were really using our code in some very malformed ways.

Also stops them from grumbling after you post good results!

sarah-ek · on April 24, 2024

fair enough. i try to stay in touch with the eigen and nalgebra developers so i have a good idea on how to write code efficiently with them. for openblas and mkl i've been trying recently to call into the lapack api (benches doing that are still unpublished at the moment), that way im using a universal interface for that kinda stuff.

and of course i do check the cpu utilization to make sure that all threads are spinning for multithreaded benchmarks, and occasionally check the assembly of the hot loops to make sure that the libraries were built properly and are dispatching to the right code. (avx2, avx512, etc) so overall i try to take it seriously and i'll give credit where credit is due when it turns out another library is faster

sarah-ek · on April 24, 2024

lapack does expose a full pivoting lu as far as i can tell? https://netlib.org/lapack/explore-html/d8/d4d/group__getc2_g...

bee_rider · on April 24, 2024

If you are going to include vs MKL benchmarks in your repo, full pivoting LU might be one to consider. I think most people are happy with partial pivoting, so I sorta suspect Intel hasn’t heavily tuned their implementation, might be room to beat up on the king of the hill, haha.

sarah-ek · on April 24, 2024

funny you mention that, the full pivoting. it's one of the few benchmarks where faer wins by a huge margin

n faer mkl openblas

1024 27.06 ms 186.33 ms 793.26 ms

1536 73.57 ms 605.71 ms 2.65 s

2048 280.74 ms 1.53 s 8.99 s

2560 867.15 ms 3.31 s 17.06 s

3072 1.87 s 6.13 s 55.21 s

3584 3.42 s 10.18 s 71.56 s

4096 6.11 s 15.70 s 168.88 s

adgjlsfhk1 · on April 24, 2024

it might be interesting to add butterfly lu https://arxiv.org/abs/1901.11371. it's a way of doing a numerically stable lu like factorization without any pivoting, which allows it to parallelize better.

bee_rider · on April 24, 2024

It looks like they are describing a preconditioner there.

adgjlsfhk1 · on April 24, 2024

the key point is that the preconditioner allows you to skip pivoting which is really nice because the pivoting introduces a lot of data dependence.

sarah-ek · on April 24, 2024

looks interesting! thanks for sharing

heinrichhartman · on April 24, 2024

Thanks for pointing this out. Looks like only the python bindings are not included in nunpy.

sarah-ek · on April 24, 2024

the long compile times are mostly because im instantiating every dense decomposition in the library in one translation unit, for several data types (f32, f64, f128, c32, c64, c128)

sarah-ek · on April 24, 2024

not yet. a tensor api is on the long term todo list, but it's a big undertaking and i'd like to focus on matrix operations for the time being

sarah-ek · on April 24, 2024

i've contributed to eigen in the past and know enough about the internals of the codebase to know my way around safe `auto` usage

dannyz · on April 24, 2024

I wasn't worried about safe usage, more that some of the initialization may be moved inside the benchmarking function instead of outside of it like intended. I'm sure you know more about it than me though.

sarah-ek · on April 24, 2024

im in the process of refactoring the benchmark code at the moment, and plan to include mkl in the benches soon.

overall, the results show that faer is usually faster, or even with openblas, and slower than mkl on my desktop

mjhay · on April 24, 2024

Wow, that's impressive! I wouldn't expect anything to be able to beat MKL, given the optimizations made based on proprietary information.

sarah-ek · on April 24, 2024

author here, eigen is compiled with -fopenmp, which enables parallelism by default

bayindirh · on April 24, 2024

Hi! Thanks for chiming in!

Did you check with resource utilization? If you don't provide "OMP_NUM_THREADS=n", Eigen doesn't auto-parallelize by default.

sarah-ek · on April 24, 2024

i did check, yes

HN For You