One company looking to capitalize on this is Onehouse, a three-year-old Californian startup founded by Vinoth Chandar, who created the open source Apache Hudi project while serving as a data architect at Uber. Hudi brings the benefits of data warehouses to data lakes, creating what has become known as a “data lakehouse” and enabling features such as indexing and real-time queries on large datasets, whether the data is structured, semi-structured, or unstructured.
Remote: Preferred, but Hybrid fine within 20 miles of Irvine, CA
Willing to Relocate: No
Technologies: Strongest in Java, Kubernetes, MongoDB, SQL, Kafka, Data Lakehouse (wrote Java apps as a developer lead at IBM, was on the Kubernetes 0.1 team at Red Hat, spent 3+ years at MongoDB leading the technical direction at their #1 ARR account worldwide, and was Head of Community and DevRel at a $65M VC-funded startup building an open source version of Snowflake)
My game is changing the world at warp speed. I've honed my skills across the board - sales engineer, developer advocate, lead developer, even infrastructure and operations maestro. Think of me as a multi-class strategist with 10 patents under my belt and 20+ certifications in my arsenal – Java, .NET, you name it. I can build it, deploy it, talk the talk, and walk the walk, from bare metal to cloud-native code. So, what's the next challenge we'll tackle together?
I am looking for my next job in sales engineering or developer relations.
Data warehousing is becoming ubiquitous and open source with Iceberg. Powerful open source real-time analytics (RTA) engines like ClickHouse are designed to crunch numbers as fast as possible. How can you use Iceberg and ClickHouse together to burn rubber over glacially slow storage?
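One way to wire the two together: recent ClickHouse releases expose an `iceberg` table function that reads Iceberg tables directly from object storage. This is a sketch only; the S3 URL and credentials below are placeholders, and we just assemble the query text so the shape of the call is visible without a running server:

```python
# Sketch: reading an Iceberg table from ClickHouse via the iceberg() table
# function (available in recent ClickHouse releases). The S3 URL and
# credentials are placeholders. We only build the SQL text here; send it
# with any ClickHouse client (clickhouse-client, clickhouse-connect, JDBC).

def iceberg_count_query(s3_url: str, key_id: str, secret: str) -> str:
    """Build a ClickHouse query that counts rows in an Iceberg table on S3."""
    return (
        "SELECT count() FROM iceberg("
        f"'{s3_url}', '{key_id}', '{secret}')"
    )

print(iceberg_count_query(
    "https://example-bucket.s3.amazonaws.com/warehouse/events/",
    "AWS_KEY_ID",
    "AWS_SECRET",
))
```

Because the Iceberg data stays in cheap object storage, ClickHouse does the fast number-crunching while Iceberg handles the open, vendor-neutral table format.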
With an intake of 450,000 entries per second and a daily data volume of 12TB, the results were:
* Query performance improved 4x.
* Query P90 time improved from 500ms to 150ms.
* The cluster size was reduced from 60+ nodes to fewer than 10.
* Data storage volume reduced by about 40%.
* Overall cost reduced by more than 80%.
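As a sanity check on the ingest figures above, a quick back-of-the-envelope calculation (assuming decimal terabytes) shows the average entry is only a few hundred bytes:

```python
# Back-of-the-envelope check on the ingest figures above
# (assumes decimal units: 1 TB = 10**12 bytes).
entries_per_second = 450_000
daily_bytes = 12 * 10**12  # 12 TB/day

entries_per_day = entries_per_second * 86_400  # seconds in a day
bytes_per_entry = daily_bytes / entries_per_day

print(f"{entries_per_day:,} entries/day")         # 38,880,000,000 entries/day
print(f"~{bytes_per_entry:.0f} bytes per entry")  # ~309 bytes per entry
```

Small, narrow entries like these are exactly the shape of workload where columnar storage and aggressive compression pay off.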
I believe data lakehouse analytics will replace data warehouse analytics in the future. Data lakehouses offer a number of advantages over traditional data warehouses: open table formats, inexpensive object storage, and a single copy of the data serving both BI and machine learning workloads.
Before we start, what are TPC-H and TPC-DS, and why are they important?
TPC-H and TPC-DS are industry-standard benchmarks for measuring the performance of data warehouse and big data systems. TPC-H models ad-hoc decision-support queries over a relatively simple schema, while TPC-DS models a retail data warehouse with a larger schema and a more complex query set. Both are widely used by vendors and customers to evaluate different systems and to compare the performance of the same system over time.
> The program is installed with clickhouse-client, has no dependencies, and works on almost any flavor of Linux. You can apply it to any database dump, not just ClickHouse. For instance, you can generate test data from MySQL or PostgreSQL databases or create development databases that are similar to your production databases.
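The quote appears to describe clickhouse-obfuscator, which ships alongside clickhouse-client. A sketch of how an invocation is put together; the table structure, seed, and file names below are placeholders, and we only build the command string here rather than running the binary:

```python
# Sketch: assembling a clickhouse-obfuscator command line to anonymize a
# TSV dump while preserving its statistical shape. The structure, seed,
# and file names are placeholders; the flags follow the tool's documented
# interface. We build the command as a string so the call shape is visible
# without the binary installed.

def obfuscate_cmd(structure: str, seed: str, src: str, dst: str) -> str:
    """Build a shell command line for clickhouse-obfuscator."""
    return (
        "clickhouse-obfuscator"
        f" --structure '{structure}'"
        " --input-format TSV --output-format TSV"
        f" --seed '{seed}'"
        f" < {src} > {dst}"
    )

print(obfuscate_cmd("id UInt64, url String", "my-secret-seed",
                    "prod_dump.tsv", "dev_dump.tsv"))
```

The same seed always produces the same obfuscated output, so a team can regenerate an identical development dataset on demand without ever sharing the production data itself.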