More

fangjin · on Nov 22, 2021

Druid founder @zedruid writes a (satirical) benchmarking blog post about Druid vs Clickhouse and Rockset. The results are technically valid, but should you ever trust a benchmark?

zX41ZdbW · on Nov 22, 2021

They did not actually perform benchmark for ClickHouse, just copy-pasted the results from some old article: https://altinity.com/blog/clickhouse-nails-cost-efficiency-c...

fangjin · on Oct 11, 2016

While Druid was initially built for batch loads, the architecture has evolved substantially as the project has matured. Today, Druid supports exactly-once streaming ingestion from Kafka, and large production deployments routinely stream millions of events per second into Druid.

dataloopio · on Oct 11, 2016

Can you point me to a source for 'routinely stream millions of events per second into druid'?

While it is true that druid is great at querying billions of rows per second it's not very good at ingress. Here is a mailing list discussion for some background.

https://groups.google.com/forum/#!searchin/druid-user/benchm...

cheddar · on Oct 11, 2016

What kind of ingestion numbers are you working with? The thread you link to shows that Druid can ingest ~27.5k events/sec per node, which is roughly 2.376bn events a day per node.

While you can claim bias here too, we have multiple clusters ingesting in the high hundreds of thousands of events/second and our largest cluster does close to 2m/s. That's definitely scaled horizontally across multiple nodes.

If you are suggesting there is a system out there that can ingest millions of messages a second on a single node, I'd love to hear about it :).

edit: Ah, I see from the spreadsheet that you linked that there are systems out there that claim 2.5-3.5m writes per second per node. That's really quite amazing, would be awesome if you could provide the methodology used to collect those numbers. For example, if you are sending in 500 byte events (a rather common size for what we do), if my calculations are correct, you are now sustaining 14 Gbps, which means those benchmarks were done on some beefy hardware. Can you link to a blog post that details the methodology?

dataloopio · on Oct 11, 2016

Most benchmarks are given a colour for reliability and link to repeatability.

cheddar · on Oct 11, 2016

Ah, cool, I chased down what you are doing and figured out that you are doing an apples to oranges comparison.

As described in your benchmark description:

https://gist.github.com/sacreman/b77eb561270e19ca973dd505527...

You are running 200 agents emitting 6000 metrics a piece using Haggar to generate load, which is at

https://github.com/dalmatinerdb/haggar

The specific thing of interest is how you are generating your data, which looks like you have a single set of dimensions and 6000 metrics dangling off of it. The loop that populates all of the "metrics" are:

https://github.com/dalmatinerdb/haggar/blob/master/main.go#L...

And the thing that actually populates the bytes are at:

https://github.com/dalmatinerdb/haggar/blob/master/util.go#L...

So, if we take this to an apples-to-apples comparison, you have 200 agents sending a single event every second with 6000 metrics in it. That means that you are successfully ingesting 200 events per second in the way that we would measure event ingestion for Druid.

Note, also, that the thread you link to is ingesting 17 independent dimensions with each and every event that flows in. From the Daltaminer docs, it looks like you put all dimension data into postgres and you don't expect any large-scale deployment to ever need more than a single postgres node:

https://gist.github.com/sacreman/9015bf466b4fa2a654486cd79b7...

Look under "Setup Postgres".

We routinely have billions of unique combinations of dimension values per day flowing into our system. Delegating the finding of the right keys to a relational database for such operations is going to be very cost-prohibitive, not to mention, you are going to have to materialize hundreds of millions of keys in order to do a simple aggregate over the day.

So, I guess this is just another case where you should never trust benchmarks that you didn't do yourself or that don't follow a standard pattern like TPC-H. It's too easy for the same words to be used with different meanings.

dataloopio · on Oct 11, 2016

DalmatinerDB, InfluxDB, Prometheus and Graphite each claim their numbers based on similar benchmark methods. The results range from 500k/sec to a couple of million metrics /sec. Druid comparatively, for the same benchmark would be closer to 30k/sec. If that's factually wrong please post some details and we can update the spreadsheet.

Expanding the benchmark to cover cardinality and other aspects would indeed be comparing apples to oranges.

In terms of benchmarking DalmatinerDB with billions of unique combinations indexed in Postgres.. I think we know what will happen there :) That's what it's designed for. We can also shard in the query engine, or use any of the multi master Postgres options, but I doubt that would even be necessary.

fangjin · on Oct 11, 2016

The databases listed above, to the best of my knowledge, are commonly used for dev ops metrics data and share similar terminology. Druid on the other hand, draws much of its terminology from the OLAP world. As cheddar clarified in his post above, the benchmarks for Druid are misleading as it is not an apples to apples comparison (I suspect the benchmarks for ES also suffer from this problem). A single Druid event may consist of thousands of metrics.

gillh · on Oct 12, 2016

Agree with your analysis. At Netsil, one of the big factors that we considered was average query latency and fast-aggregations over high cardinality, multi-dimensional data. Few of our early customers told us that when they deployed solutions from other vendors (with storage engines such as Cassandra) at scale (800+ monitored instances), they would have to wait for several minutes to let the data render on their dashboards for a 1-day aggregation query. So, it was not just the scalable ingestion that was paramount, fast ad-hoc analytics functionality was equally important to us.

kylequest · on Oct 11, 2016

and that happens to be the system they built :-)

fangjin · on Oct 11, 2016

Here is a source from 2015: http://www.marketwired.com/press-release/metamarkets-clients...

You can also find additional information that folks have been willing to publicly share on scale and use cases here: http://druid.io/druid-powered.html

dataloopio · on Oct 11, 2016

Those sources contain literally no technical detail. At 1.1 million metrics per second is that a 40 node druid cluster?

fangjin · on Oct 11, 2016

I think we're using very different terminology here. An event in our world may contain thousands of metrics as part of the same event.

packetslave · on Oct 11, 2016

well, he's is a druid committer and CEO of a company built on top of it, so...

dataloopio · on Oct 11, 2016

No bias there then

user5994461 · on Oct 11, 2016

> millions of events per second

Source? That statement as-is is clearly overselling.

fangjin · on April 1, 2016

FWIW, Imply repackaged Druid in such a way that should make it much much easier to set up and evaluate. We've been porting our docs over to Druid for 0.9.0: http://imply.io/docs/latest/quickstart

There's also a production-ready docker distribution: https://hub.docker.com/r/imply/imply/

fangjin · on March 31, 2016

Druid has a SparkSQL connector: https://github.com/SparklineData/spark-druid-olap

joefkelley · on April 1, 2016

Cool! Yeah I poked around the druid site and didn't find anything originally, but this looks pretty promising. It's hard to tell how full-featured it is without getting hands-on, but I see something like this making Druid much more usable in a lot of analytics environments.

hbutani · on April 2, 2016

We have couple of companies running Tableau on top of this. The deployment is Tableau - Spark ThriftServer(with our extension) - Druid. We push down Slice and Dice and Star Join Queries as Druid Queries; all of Spark SQL is supported with some portions of a Query Plan being executed in Spark. We are working on supporting more Spark UDFs being pushed to Druid, performance improvements, and more coverage for Tableau. Further down we will support Star Schemas where some or all dimensions are not indexed. Happy to discuss specific SQL support or deployment questions. Please reach out to us.

- Harish.

fangjin · on March 31, 2016

Druid's main value add to the data infrastructure space is around power user-facing data applications at scale. The queries it is best at are OLAP/business intelligence style queries. It isn't really designed to be a general processing tool such as Hadoop or Spark. The open source data space is very complex, and there are many different solutions targeted towards many different use cases. Druid is better than other solutions at some of these use cases, and worse than other solutions at others.

I wrote my interpretation of the current open source data landscape here for anyone interested: http://imply.io/post/2015/11/04/big-data-zoo.html

deepanchor · on April 1, 2016

As someone who hasn't yet had the opportunity to use many of these systems, this was a great high-level overview of how the various systems fit together. Thanks for writing it!

scott_s · on April 1, 2016

Your Wikipedia link to "stream processors" is for the wrong kind of stream processors. For a decades, the digital signal processing and graphics worlds have used the streaming abstraction to design programs and hardware. Their applications are typically expressed in pipelines, and do have to continually process data.

"Big data" stream processing is obviously related as its a dataflow programming model, but it's still very different in practice. The streaming abstraction is generally more free-form, and not realized directly in hardware. I contrast the two kinds of streaming in Section 2 of a paper from a few years ago: http://www.scott-a-s.com/files/pact2012.pdf

fangjin · on March 31, 2016

Druid committer here. We spent a lot of time in the early days on making sure the system worked (at scale) and now we're spending more time to make it much easier to set up and manage.

pookeh · on March 31, 2016

It might be helpful perhaps to provide some kubernetes configuration set up (or like a presetup kubernetes running in vagrant) that has all the nodes correctly configured out of the box to easily get started with development and prototyping.

fangjin · on March 11, 2016

Druid (http://druid.io/druid-powered.html) is another option for similar workloads. Druid is a community-led open source data store used by many technology companies at very large scale. Comes with multiple visualization/open source applications, SQL interfaces, Grafana extensions, and a community to help with issues.

fangjin · on Nov 23, 2015

Hi Evan, the user groups (https://groups.google.com/forum/#!forum/druid-user) are very active and if you post your issue there, I"m sure someone could help.

fangjin · on Nov 23, 2015

The logo draws its origins from mathematics :) http://mathworld.wolfram.com/Implies.html

fangjin · on Aug 25, 2015

Druid does pre-aggregation (roll-up) of data at ingestion time and is also used at scale (30+ trillion events, ingesting over 1M+ events/s) by numerous large technology companies: http://druid.io/druid-powered.html

dsp1234 · on Aug 25, 2015

Note that the commenter mentioned mid-2014. That page first appeared on (or about) July 29th, 2014[1], and at that time only contained 4 names:

Metamarkets

Netflix

LiquidM

N3twork

So while today Druid may be in use by "numerous large technology companies", at the time the commenter was researching it wasn't showcasing as many large companies.

[1] https://web.archive.org/web/20140729014707/http://druid.io/d...

luckydata · on Aug 25, 2015

Hey Fan, I know you feel very strongly about Druid but at the time it wasn't the way to go, I can see how they might have opted to steer clear.

HN For You