
MrPowers · on Jan 19, 2024

I work at Databricks, but am pretty much just an OSS nerd, mainly focusing on Delta Rust recently: https://github.com/delta-io/delta-rs

I did some keyword research and wrote this post cause lots of folks are doing searches for Delta Lake vs Parquet. I'm just trying to share a fair summary of the tradeoffs with folks who are doing this search. It's a popular post and that's why I figured I would share it here.

MrPowers · on Jan 19, 2024

Yea, it is fair feedback.

I respect the Iceberg team & their work.

I've been shying away from that post cause I don't wanna start a flamewar, but I will reflect on this and reconsider. Thank you.

gregw2 · on Jan 19, 2024

You are right there will be a flamewar, and others will discount some of what you say because of your bias, you will get criticism and personal remarks (mostly off base) and you will suffer tremendous heat for it. I have been there in a past life re: unix wars.

But, particularly if you acknowledge opposing views in your content and don't hide counterarguments via cherry-picking, you will really add value to the data community by exposing the truth and educating people on both your team and the other team. That ultimately spurs improvements where both sides have gaps, which benefits the broader community.

It takes courage and care to put a controversial rigorous viewpoint out there; you do risk your "reputation". But, particularly if you make corrections where appropriate, people will recognize you as genuine.

It is not bad to have a point of view. What is bad is to hide your bias or counterarguments to deceive people.

Be part of the thesis + antithesis -> synthesis Hegelian dialog that brings progress. Ultimately, if you advocate for your customers (developers/data users), not "your team", you will perform a true service to the community, even if only you and a few others recognize it.

MrPowers · on Jan 19, 2024

Lots of organizations have Parquet data lakes and are considering switching to Delta Lake.

Converting a Parquet table to a Delta table is an in-place, cheap computation. You can just add the Delta Lake metadata to an existing Parquet table and then take advantage of transactions and other features. I don't think it's a meaningless comparison.

Iceberg is cool too.

BadHumans · on Jan 19, 2024

There is no Parquet table. Parquet is a compressed file format like a zip. Parquet can be read into table formats like Hive, Delta, etc. That is why this comparison makes no sense.

MrPowers · on Jan 19, 2024

Lots of Parquet files in the same directory are typically referred to as a "Parquet table".

Yes, Parquet can be compressed with zip, but snappy is much more common because it's splittable.

Parquet tables can be registered in a Hive metastore. Delta metadata can be added to a Parquet table to make it a Delta table.

BadHumans · on Jan 19, 2024

> Lots of Parquet files in the same directory are typically referred to as a "Parquet table".

This is my point though? This is an apples to oranges comparison. A directory of Parquet files is not a table format. Comparing Delta to Hive or Iceberg is a more apt comparison. I have worked with all types of companies and I have yet to work with one that is just using a directory of Parquet files and calling it a day without using something like Hive with it.

MrPowers · on Jan 19, 2024

Yea, comparing Delta Lake to Iceberg is more apt, but I've been shying away from that content cause I don't wanna flamewar. Another poster is asking for this post tho, so maybe I should write it.

I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered. If you persist a Spark DataFrame in Delta with save it's not registered in the Hive metastore. If you persist it with saveAsTable it is registered. I've been meaning to write a blog post on this, so you're motivating me again.

I've seen a bunch of enterprises that are still working with Parquet tables that aren't registered in Hive. I worked at an org like this for many years and didn't even know Hive was a thing, haha.

BadHumans · on Jan 19, 2024

> I don't really see how Delta vs Hive comparison makes sense. A Delta table can be registered in the Hive metastore or can be unregistered.

You are right about Delta tables in the Hive metastore, but if you are writing from the perspective of "there are companies that don't know what Hive is" then I feel the next step up is "there are companies that just stuff files in S3 and query them with Athena" (which handles all the Hive stuff for you when you make tables). I feel that explaining what Delta gives them over that is worthwhile.

chimerasaurus · on Jan 19, 2024

I agree with the points you make above.

MrPowers · on Jan 19, 2024

Yea, Spark works best with "right-sized" files.

Let's suppose you have a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes. I've worked on data lakes that require file listing operations that run for hours. Key/value object stores aren't as good at listing files as Unix filesystems are.

When Spark reads the 40,000 Parquet files it needs to figure out the schema. By default, it'll just grab the schema from one of the files and just assume that all the others have the same schema. This could be wrong.

You can set an option telling Spark to read the schemas of all 40,000 Parquet files and make sure they all have the same schema. That's expensive.

Or you can manually specify the schema, but that can be really tedious. What if the table has 200 columns?

The schema in the Parquet footer is perfect for a single file. I think storing the schema in the metadata is much better when data is spread across many Parquet files.

adolph · on Jan 19, 2024

> a data lake with 40,000 Parquet files. You need to list the files before you can read the data. This can take a few minutes.

Sounds like this data lake could use a Parquet file listing the Parquet files.


MrPowers · on Jan 19, 2024

Yea, that's exactly what Delta Lake does. All the table metadata is stored in a Parquet file (it's initially stored in JSON files, but eventually compacted into Parquet files). These tables are sometimes so huge that the table metadata is big data also.
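To make the idea concrete, here is a minimal sketch of replaying a transaction log to get the table's schema and current file list without listing any directory. It is a toy model, not the Delta Lake implementation: the action names `metaData`, `add`, and `remove` follow the Delta protocol, but the `schema` field, the file names, and the helpers `make_commit` and `replay` are simplified, hypothetical stand-ins.

```python
import json


def make_commit(actions):
    """Serialize actions as newline-delimited JSON, one action per line."""
    return "\n".join(json.dumps(a) for a in actions)


def replay(commits):
    """Replay commits in version order; return (schema, sorted active file paths)."""
    schema = None
    active = set()
    for commit in commits:
        for line in commit.splitlines():
            action = json.loads(line)
            if "metaData" in action:
                # Simplified: the schema lives once in table metadata, not per file.
                schema = action["metaData"]["schema"]
            elif "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return schema, sorted(active)


commits = [
    # Version 0: table created with two small files (hypothetical names).
    make_commit([
        {"metaData": {"schema": {"id": "long", "ts": "timestamp"}}},
        {"add": {"path": "part-0000.parquet"}},
        {"add": {"path": "part-0001.parquet"}},
    ]),
    # Version 1: a compaction replaces the two small files with one larger file.
    make_commit([
        {"remove": {"path": "part-0000.parquet"}},
        {"remove": {"path": "part-0001.parquet"}},
        {"add": {"path": "part-0002.parquet"}},
    ]),
]

schema, files = replay(commits)
print(schema, files)
```

A reader only has to fetch the log (or a compacted checkpoint of it), so the cost no longer depends on how slow the object store is at listing 40,000 files.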

MrPowers · on Jan 19, 2024

Looking at this now.

* Delta Lake supports merge-on-read via deletion vectors: https://delta.io/blog/2023-07-05-deletion-vectors/

* Why doesn't Delta Lake have efficient bulk load? Lots of the biggest datasets in the world are in Delta tables.

* Delta Lake definitely supports compaction: https://delta.io/blog/2023-01-25-delta-lake-small-file-compa...

* What does CLI support mean in the context of a Lakehouse storage system? You can open up a Spark shell or Python shell to interface with your Delta table. That's like saying "CSV doesn't have a CLI". I don't get it.

I didn't do a detailed review of the post.

MrPowers · on Jan 19, 2024

Yep, it is re-inventing database systems and you raise a great question.

At first glance, it seems like Delta Lake is inferior to a database. Most databases support multi-table transactions and Delta Lake only supports transactions for a single table. ACID transaction support is nothing new for a database.

Delta Lake is useful for large datasets and to keep costs low.

There are organizations that are ingesting hundreds of terabytes and petabytes of data into a Delta table every day. They're able to ingest data, perform upserts, and build realtime pipelines with this architecture.

Delta Lake is also free, so you only have to pay for storing the files in the cloud. This is usually a lot cheaper than a database.

Data warehouses are often packaged with a certain amount of shared RAM/storage. This can be a problem for a team with large workflows from many users. It's annoying to share compute with someone that's running a large experiment.

These are the main reasons enterprises shifted to data lakes and now Lakehouse storage systems. See this paper to learn more: https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

MrPowers · on Jan 19, 2024

There are a few features missing from the FOSS Scala/Spark implementation of Delta Lake, but I wouldn't say a lot. The FOSS version supports all the table features in the Delta Lake protocol.

The Delta Rust implementation is missing more table features, but we're closing the gap fast. We just added support for constraints to Delta Rust and are working on change data feed right now.

orthoxerox · on Jan 19, 2024

Delta Live Tables and automatic vacuuming are the two big features I'm missing.

MrPowers · on Jan 19, 2024

Delta Live Tables are a Databricks feature and aren't related to Delta Lake.

Can't you just set up a cron job to vacuum periodically?

switchbak · on Jan 19, 2024

DLT is a big feature of a full platform, you can’t really say that’s a missing feature of a delta library.

MrPowers · on Jan 19, 2024

Yea, there is a Rust implementation of the Delta Lake protocol that lets you do upserts without Spark too. This allows pandas, Polars, DataFusion, and PyArrow users to easily do upserts as well.

MrPowers · on Jan 19, 2024

Data Lakes (i.e. Parquet files in storage without a metadata layer) don't support transactions, require expensive file listing operations, and don't support basic DML operations like deleting rows.

Delta Lake stores data in Parquet files and adds a metadata layer to provide support for ACID transactions, schema enforcement, versioned data, and full DML support. Delta Lake also offers concurrency protection.
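One way to picture the concurrency protection is as optimistic concurrency on the log: two writers race to create the same numbered commit file, and only one can win. The sketch below illustrates that idea on a local filesystem using an exclusive create. It is an assumption-laden toy, not Delta Lake's code: `try_commit`, the payloads, and the temporary directory are hypothetical, and real object stores need their own put-if-absent mechanism.

```python
import os
import tempfile


def try_commit(log_dir, version, payload):
    """Try to create the commit file for `version`; return False if it already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes the create fail if another writer already claimed this version.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, '{"writer": "A"}')   # wins version 0
second = try_commit(log_dir, 0, '{"writer": "B"}')  # loses, must re-read and retry
retry = try_commit(log_dir, 1, '{"writer": "B"}')   # succeeds at the next version
print(first, second, retry)
```

The losing writer would normally re-read the log, check whether its changes conflict with what was committed, and then retry at the next version.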

This post explains all the features offered by Delta Lake in comparison to a plain vanilla Parquet data lake.

alexmolas · on Jan 19, 2024

Please, stop using LLM to provide post summaries. This comment is not adding value to the conversation.

MrPowers · on Jan 19, 2024

I actually wrote this. I thought it was going to be part of the post description and didn't realize it was going to be a comment.

alexmolas · on Jan 19, 2024

Sorry if my comment sounded too harsh. I've noticed a lot of people commenting autogenerated summaries of the posts, trying to farm karma I guess.

MrPowers · on Jan 17, 2024

Delta Lake solves a lot of the Parquet limitations mentioned in this post. Disclosure: I work on the Delta Lake project.

Parquet files store metadata about row groups in the file footer. Delta Lake adds file-level metadata in the transaction log. So Delta Lake can perform file-level skipping before even opening any of the Parquet files to get the row-group metadata.
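As a rough illustration of file-level skipping, suppose the log records a minimum and maximum timestamp for each file. A query with a range filter can then discard files from the metadata alone. This is only a sketch: `FileEntry`, `files_to_read`, the integer timestamps, and the file names are hypothetical, and the real statistics format is richer than this.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    path: str
    min_ts: int  # smallest timestamp stored in the file
    max_ts: int  # largest timestamp stored in the file


def files_to_read(entries, lo, hi):
    """Keep files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [e.path for e in entries if e.max_ts >= lo and e.min_ts <= hi]


entries = [
    FileEntry("a.parquet", 0, 99),
    FileEntry("b.parquet", 100, 199),
    FileEntry("c.parquet", 200, 299),
]

# Query: 150 <= ts <= 220. File a.parquet is skipped without ever being opened.
selected = files_to_read(entries, 150, 220)
print(selected)
```

Row-group skipping inside each remaining Parquet file can still happen afterwards; the log-level check just avoids opening files that cannot match.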

Delta Lake allows you to rearrange your data to improve file-skipping. You can Z Order by timestamp for time-series analyses.
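For intuition about what Z Ordering does when more than one column is involved, here is a small sketch of the underlying idea: interleave the bits of two column values into a single key and sort by it, so rows that are close in both columns land near each other. The function `interleave_bits` and the sample points are hypothetical, and this is not a description of how Delta Lake implements it. With a single column, such as a timestamp, the ordering is just a sort on that column.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key


# Sort a small 4x4 grid of (x, y) values by their Z-order key.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
print(ordered[:4])
```

When data sorted this way is written out in chunks, each file covers a compact region of both columns, which tightens the per-file min/max ranges used for skipping.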

Delta Lake also allows for schema evolution, so you can evolve the schema of your table over time.

This company may have a cool file format, but is it closed source? It seems like enterprises don't want to be locked into closed formats anymore.

Malcolmlisk · on Jan 17, 2024

Wow! I've been reading about Delta Lake for a while and I'm interested in the company. Is there a chance to drop a CV for remote work? (I am from Spain.)

Schema evolution is something that came up in a water cooler conversation the other day on my team.

adammarples · on Jan 17, 2024

Can you Z Order in Delta Lake? I thought that was one of the features Databricks had kept to themselves.
