Hacker News | faltet's comments

This is what happens when two nice libraries like HDF5 and Blosc2 cooperate in making a third one (PyTables) way more efficient.


Here you have a benchmark based on the MovieLens database:

http://nbviewer.ipython.org/github/Blosc/movielens-bench/blo...

The results are explained here:

http://www.blosc.org/docs/bcolz-EuroPython-2014.pdf

and, more in-depth here:

https://python.g-node.org/wiki/starving_cpu


Blaze is intended to be a much more general solution than a pure key-value store (although it will be able to tackle this use case too). But yes, the idea under BLZ is close to your projects, namely, leveraging the available resources in your computer in the most useful way.

But sorry, I strongly disagree that everything is a variation on things from the 70s: back in those good old years, the memory hierarchy was far simpler than it is nowadays, and dealing with today's hierarchy introduces a great deal of complexity into libraries designed to get the most out of modern computers.


The memory hierarchy can be pretty complicated, but the general design, the part that makes the rest possible, seems to me to be stacked ISAM. It might not be a simple stacking; you need to tune the parameters to deal with the various cache levels. But it's the same principle, isn't it?

I agree the hardware is quite different, vastly more complex.


That's right. But even in this case, Blosc, the internal compressor used in Blaze, can detect whether the data is compressible pretty early in the compression pipeline, and decide to stop compressing and just start copying (how early that decision is taken depends on the compression level).

The good news is that Blosc can still use threads in parallel for doing the copy, and this normally gives significantly better speed than a non-threaded memcpy() (the default for many modern systems).
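The bail-out idea above can be sketched in a few lines. This is not Blosc's actual pipeline (which works in C, blockwise, with shuffling); it is a loose stdlib illustration using `zlib` as the stand-in codec, with the ratio threshold `min_ratio` being my own hypothetical parameter:

```python
import os
import zlib

def try_compress(data: bytes, min_ratio: float = 1.1) -> bytes:
    """Compress only if it pays off; otherwise fall back to a plain copy.
    Loosely mimics Blosc's early bail-out on incompressible data."""
    packed = zlib.compress(data, 1)  # fast, low-effort attempt
    if len(data) / len(packed) >= min_ratio:
        return packed       # worth it: keep the compressed form
    return bytes(data)      # not worth it: just copy the input

repetitive = b"0123456789" * 100_000  # highly compressible
random_ish = os.urandom(1_000_000)    # essentially incompressible

print(len(try_compress(repetitive)))            # much smaller than input
print(len(try_compress(random_ish)))            # same size: plain copy
```

A real implementation would make the attempt on a small block rather than the whole buffer, which is what makes the decision cheap.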


> significantly better speed than a non-threaded memcpy()

Really? I always thought that a single core could always saturate the available memory bandwidth (unless you have some unusual architecture like NUMA). If you're seeing a multithreaded memcpy that performs better, maybe you're just stealing memory bandwidth from other processes (since AFAIK memory bandwidth is probably allocated on a per-thread basis), or maybe you're getting more CPU cache allocated to you because you're running on multiple cores?

This would be interesting to investigate.
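One way to investigate is to reproduce the chunk-splitting strategy a threaded memcpy uses. Below is a minimal Python sketch of a parallel copy; note that in CPython the GIL limits how much real parallelism this particular sketch achieves (Blosc does the equivalent in C threads), so it illustrates the strategy and its correctness rather than the speedup itself:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(src: bytes, nthreads: int = 4) -> bytearray:
    """Copy src into a fresh buffer, each worker handling one
    contiguous chunk (the same split a threaded memcpy would use)."""
    dst = bytearray(len(src))
    view = memoryview(dst)
    chunk = (len(src) + nthreads - 1) // nthreads  # ceil division

    def copy_chunk(i: int) -> None:
        start = i * chunk
        # Slices on both sides truncate identically at the buffer end.
        view[start:start + chunk] = src[start:start + chunk]

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        list(pool.map(copy_chunk, range(nthreads)))
    return dst

data = bytes(range(256)) * 4096  # 1 MiB test buffer
assert parallel_copy(data) == data
```

Timing this against a plain `bytes(src)` copy on a given machine would show whether the multi-threaded variant wins there; the per-chunk approach also keeps each worker's writes contiguous, which is friendly to hardware prefetchers.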


I'm not sure why, but yes, I'm seeing these speedups. You can see them too at http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks, paying attention to compression ratios of 1 (compression disabled). If you find any good reason why this is happening, I'm all ears.


No doubt that much longer :) But the important points to take away are:

1) You can store more data in the same storage capacity.

2) If the data is compressible, the I/O effort will be smaller.

3) If the compressor is fast enough, you may end up saving I/O time.

As for 2), Blosc can be pretty fast, as can be seen at http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks, so in general it will speed up I/O for compressible data (even if that data is in-memory).
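Point 3) is just arithmetic: you trade transfer time for codec time. A back-of-envelope model, with hypothetical throughput numbers chosen only for illustration:

```python
def io_time_s(size_mb, disk_mb_s, ratio=1.0, codec_mb_s=float("inf")):
    """Seconds to read size_mb of logical data: transfer the compressed
    bytes, then decode them (assumes the two stages do not overlap)."""
    return size_mb / ratio / disk_mb_s + size_mb / codec_mb_s

# 1 GB of data over a 100 MB/s disk:
plain = io_time_s(1000, disk_mb_s=100)  # 10.0 s, no compression
# Same data at 4x compression with a 1000 MB/s decompressor:
packed = io_time_s(1000, disk_mb_s=100, ratio=4, codec_mb_s=1000)  # 3.5 s

print(plain, packed)  # 10.0 3.5
```

The crossover is when the codec throughput drops below `disk_mb_s * ratio / (ratio - 1)`; faster codecs and better ratios both widen the win.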


HDF5 is a very nice format indeed, and in fact, BLZ is borrowing a lot of good ideas from it. However, HDF5 has its own drawbacks, like not being able to compress variable length datasets, the lack of a query/computational kernel or its flaky resiliency during updates. Also, its approach for distributing data among nodes diverges from our goals.

Finally, you are right, speed is pretty important for us, and we think that our approach can make a better use of current computer architectures.


I use HDF5 for storage/analytics of tick data; my experience has been that storing large sparse matrices is expensive in both storage space and speed.

I know that a lot of my peers in HFT have an illicit love for column stores, but for a lot of work, there's the need for converting to 'wide' format, which quickly can take a 10 million row matrix with a few columns to one now with a few thousand columns. (And thus stuff like KX, FastBit, etc becomes sort of suboptimal)

The need for massive 'last-of' information for time series leads to basically abandoning python/pandas/numpy and using C primitives and doing a lot more than you'd typically like 'online' but really a lot of this could happen behind the scenes with intelligent out of memory ops.
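For readers unfamiliar with the term, "last-of" here means keeping only the final observation per time bucket. A minimal pure-Python sketch, with made-up tick data (real HFT pipelines would do this over out-of-core columnar data, which is the point of the comment above):

```python
from itertools import groupby

# Hypothetical (timestamp_seconds, symbol, price) ticks, time-sorted.
ticks = [
    (0.1, "AAPL", 101.0),
    (0.7, "AAPL", 101.3),
    (1.2, "AAPL", 101.1),
    (1.9, "AAPL", 101.4),
]

def last_of(ticks, bucket_s=1.0):
    """Keep only the last tick in each time bucket ('last-of' downsampling)."""
    bucket = lambda t: int(t[0] // bucket_s)
    return [list(group)[-1] for _, group in groupby(ticks, key=bucket)]

print(last_of(ticks))  # [(0.7, 'AAPL', 101.3), (1.9, 'AAPL', 101.4)]
```

Doing this "online" in C, as the comment describes, avoids materializing the intermediate wide matrix; a store with a built-in computational kernel could push this reduction down to the chunks as they are read.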

So...I'm pretty excited for innovation in data stores -- I look forward to seeing more!

