Hacker News: atwong's comments

Scale. DuckDB chokes at a certain point (just like SQLite isn't the same as MySQL or PostgreSQL in terms of scalability). That's why they're building a better/bigger version.



MongoDB did something similar. It's open source for you to extend and host yourself but you can't build a cloud service for it.


Lots of other competitors like Apache Pinot, Apache Druid and StarRocks are fighting in that default analytics space.


StarRocks has compute/storage separation in open source, as an example.


When datasets are small and easily fit on a single node [a few terabytes], this isn't much of an issue. But when datasets grow far larger, or when compute/QPS needs grow while the dataset grows more slowly (that is, when the two sides don't scale in proportion to each other), separating compute from storage becomes vital. [The alternative is finding hardware servers or cloud instance types that support the same imbalance of compute and storage, which is often harder to do; it also locks you into a hardware configuration that can't scale dynamically as needs and workloads change.]

Apache Pinot also offers the same 2-tier compute/storage separation. And it also has nodes for minion [administrative] tasks. Again, these are more issues for larger scale analytical use cases.


> Apache Pinot also offers the same 2-tier compute/storage separation.

Based on looking at the docs, I don't think so. Maybe only with HDFS. Feel free to link to a page that says otherwise.


The Brokers are the compute layer. The Servers are the storage layer.

Note that this is separate from the fact that the Servers can also be in a tiered storage configuration.

https://docs.pinot.apache.org/basics/architecture


Not only. Transactions, UPDATEs, a cost-based optimizer (CBO), better join optimizations.

It seems that someone is stuck in 2016, when no good open-source alternatives to ClickHouse existed.


All the main players in ClickHouse's space, like Apache Pinot, Apache Druid, StarRocks, and PrestoDB, have mindshare and unicorns using their products. It sounds like you haven't seen what's happening in this space.


Trino, not Presto.

Presto, created at Facebook, was required to let any FB engineer merge without OWNERS approval (Facebook doesn't use OWNERS files unless their absence would create a SEV1).

The original creators of Presto subsequently forked it as PrestoSQL.

Facebook then asserted its trademark on the name Presto, so the creators renamed the fork to Trino.

https://trino.io/blog/2020/12/27/announcing-trino.html


Here's a list of other open source OLAP systems out there. Clickhouse is on the list along with others like StarRocks. https://atwong.medium.com/top-open-source-alternatives-to-ol...


There are other databases today that do real time analytics (ClickHouse, Apache Druid, StarRocks along with Apache Pinot). I'd look at the ClickHouse Benchmark to see who are the competitors in that space and their relative performance.


Yeah ClickHouse is definitely the way to go here. Its ability to serve queries with low latency and high concurrency is in an entirely different league from Snowflake, Redshift, BigQuery, etc.


StarRocks handles latency and concurrency as well as ClickHouse but also does joins. Less denormalization is needed, and you can use the same platform for traditional BI/ad-hoc queries.


I wasn't familiar with StarRocks so thanks for calling attention to it.

It appears to make very different tradeoffs in a number of areas, so that makes it a potentially useful alternative. In particular, transactional DML will make it much more convenient for workloads involving mutation. Plus, as you suggested, having a proper cost-based optimizer should make joins more efficient (I find ClickHouse joins to be fine for typical OLAP patterns but they do break down in creative queries...)

It's a bummer though that the deployment model is so complicated. One thing I truly like about ClickHouse is its ability to wring every drop of performance out of a single machine, with a super simple operational model. Being able to scale to a cluster is great but having to start there is Not Great.


ClickHouse also does joins.

Somehow StarRocks dudes appear in every relevant post with this false claim.


There's a difference between "supports the syntax for joins" and "does joins efficiently enough that they are useful."

My experience with ClickHouse is that its joins are not performant enough to be useful. So the best practice in most cases is to denormalize. I should have been more specific in my earlier comment.


Ack, an anonymous user on the internet said he couldn't make ClickHouse joins perform well, in a case he didn't describe.


Not "that anonymous user." In my experience, avoiding JOIN statements is a common best practice for ClickHouse users seeking performant queries on large datasets. A couple of examples: https://medium.com/datadenys/optimizing-star-schema-queries-... https://posthog.com/blog/secrets-of-posthog-query-performanc...


> avoiding Join statements is a common best practice for Clickhouse users seeking performant queries on large datasets

It's a common best practice on any database: if the joined tables don't both fit in memory, a merge join is an O(n log n) operation, which is indeed many times slower than scanning a denormalized schema, which runs in linear time.
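The cost argument above can be sketched with a toy in-memory sort-merge join (hypothetical table data; real engines spill sorted runs to disk, but the asymptotics are the same):

```python
# Sort-merge join: sort both inputs (O(n log n)), then merge with a
# single linear pass. The denormalized alternative skips the sort
# entirely and is just one O(n) scan of a prejoined table.

def sort_merge_join(left, right):
    """Join two lists of (key, value) rows on their key."""
    left = sorted(left)    # O(n log n)
    right = sorted(right)  # O(m log m)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # emit every right-side row matching this key
            j2 = j
            while j2 < len(right) and right[j2][0] == lk:
                out.append((lk, left[i][1], right[j2][1]))
                j2 += 1
            i += 1
    return out

# Hypothetical orders and users tables.
orders = [(2, "o2"), (1, "o1"), (3, "o3")]
users  = [(1, "alice"), (2, "bob"), (3, "carol")]
joined = sort_merge_join(orders, users)
assert joined == [(1, "o1", "alice"), (2, "o2", "bob"), (3, "o3", "carol")]

# The denormalized table already holds the same result, answerable
# with a single linear scan and no sorting.
denorm = [(1, "o1", "alice"), (2, "o2", "bob"), (3, "o3", "carol")]
```

When either input exceeds memory, the sort phase additionally pays for disk I/O on sorted runs, which is where the join falls furthest behind the linear scan.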


For real-time plus large historical data, the open-source options are TDengine/QuestDB, and the commercial ones are DolphinDB and kdb+. If you only need fast recent data and not large history, embedding is a good solution, which means H2/DuckDB/SQLite if open source, eXtremeDB if commercial. I've benchmarked and run applications on most of these databases, including real-time analytics.


Open-source ClickHouse also allows both real-time and large historical data.


QuestDB, kdb+ and the others mentioned are more geared toward time-series workloads, while ClickHouse is more toward OLAP. There are also exciting solutions on the streaming side of things with RisingWave etc.


As a database engineer, people ask me all the time, "What is the hottest area of database design right now?" It's using SIMD, which stands for single instruction, multiple data, to process a lot of data very, very fast.


Looking at https://landscape.cncf.io/card-mode?category=database, here's a list of the online analytical processing (OLAP) databases whose vendors are CNCF members.


ClickHouse times out on joins over TPC-H test data. https://celerdata.com/blog/clickhouse-vs.-starrocks-a-detail...

