
dmpetrov · on Oct 20, 2024

Right, DVC caches data for consistency and reproducibility.

If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.

WebDataset example: https://github.com/iterative/datachain/blob/main/examples/mu...

notrealyme123 · on Oct 21, 2024

Thank you! That's news to me. I will absolutely give it a try.

dmpetrov · on Oct 19, 2024

Yes. And if you track transformations of the binaries or ML training.

dmpetrov · on Oct 19, 2024

hi there! Maintainer and author here. Excited to see DVC on the front page!

Happy to answer any questions about DVC and our sister project DataChain https://github.com/iterative/datachain, which does data versioning under slightly different assumptions: no file copying and built-in data transformations.

ajoseps · on Oct 19, 2024

if the data files are all just text files, what are the differences between DVC and using plain git?

miki123211 · on Oct 19, 2024

DVC does a lot more than git.

It essentially makes sure that your results can reproducibly be generated from your original data. If any script or data file is changed, the parts of your pipeline that depend on it, possibly recursively, get re-run and the relevant results get updated automatically.

There's no chance of e.g. changing the structure of your original dataset slightly, forgetting to regenerate one of the intermediate models by accident, not noticing that the script to regenerate it doesn't work any more due to the new dataset structure, and then getting reminded a year later when moving to a new computer and trying to regen everything from scratch.

It's a lot like Unix make, but with the ability to keep track of different git branches and the data / intermediates they need, which saves you from needing to regen everything every time you make a new checkout, lets you easily exchange large datasets with teammates etc.

In theory, you could store everything in git, but then every time you made a small change to your scripts that e.g. changed the way some model works and slightly adjusted a score for each of ten million rows, your diff would be 10m LOC, and all versions of that dataset would be stored in your repo, forever, making it unbelievably large.
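The make-like behavior described in this comment can be sketched in a few lines. This is only an illustration of the idea, not DVC's implementation: the `Stage` class, the `stale_stages` function, the file names, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Hypothetical pipeline stage: a name plus the files it reads and writes."""
    name: str
    deps: tuple   # file names the stage reads
    outs: tuple   # file names the stage writes


def digest(data: bytes) -> str:
    """Content hash used to detect changes (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(data).hexdigest()


def stale_stages(stages, files, lock):
    """Return the names of stages that must re-run.

    `stages` must be listed in dependency order. `files` maps file name to
    current content. `lock` maps file name to the digest recorded at the
    last successful run.
    """
    dirty = set()   # outputs that will be regenerated by a stale stage
    stale = []
    for stage in stages:
        changed = any(
            dep in dirty or lock.get(dep) != digest(files[dep])
            for dep in stage.deps
        )
        if changed:
            stale.append(stage.name)
            dirty.update(stage.outs)   # downstream stages become stale too
    return stale
```

Changing one raw input marks the stage that reads it and every stage downstream of it as stale, while unrelated stages are left alone.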

amelius · on Oct 20, 2024

Sounds like it is more a framework than a tool.

Not everybody wants a framework.

JadeNB · on Oct 20, 2024

> Sounds like it is more a framework than a tool.

> Not everybody wants a framework.

The second part of this comment seems strange to me. Surely nothing on Hacker News is shared with the expectation that it will be interesting, or useful, to everyone. Equally, surely there are some people on HN who will be interested in a framework, even if it might be too heavy for other people.

amelius · on Oct 20, 2024

Just saying that what makes Git so appealing is that it does one thing well, and from this view DVC seems to be in an entirely different category.

stochastastic · on Oct 20, 2024

It doesn’t force you to use any of the extra functionality. My team has been using it just for the version control part for a couple years and it has worked great.

bach4ants · on Oct 26, 2024

Yep. I personally like DVC's pipeline implementation because it's lightweight and language-agnostic, but haven't gotten into using their experiment tracking features.

woodglyst · on Oct 20, 2024

This sounds a lot like the experimental project Jacquard [0] from Ink & Switch.

[0] https://www.inkandswitch.com/jacquard/notebook/

azinman2 · on Oct 19, 2024

So where do the adjusted 10M rows live instead? S3?

thangngoc89 · on Oct 20, 2024

DVC supports multiple remotes. S3 is one of them; there are also WebDAV, local FS, Google Drive, and a bunch of others. You can see the full list here [0]. Disclaimer: not affiliated with DVC in any way, just a user.

[0] https://dvc.org/doc/user-guide/data-management/remote-storag...
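As a rough illustration of how large data can live outside Git while Git tracks only a small pointer, here is a toy content-addressed store. It is a sketch under stated assumptions, not DVC's code: `ContentStore` is a hypothetical class, and the in-memory dict stands in for whichever remote (S3, WebDAV, a local directory) is configured.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store; the dict stands in for a remote such as S3."""

    def __init__(self) -> None:
        self._blobs = {}

    def push(self, data: bytes) -> str:
        """Store the data under its own hash and return that hash as a pointer."""
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data   # identical content is stored only once
        return key

    def pull(self, key: str) -> bytes:
        """Fetch the data that a pointer refers to."""
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)
```

Only the short pointer string returned by `push` would need to be committed; each dataset version gets its own pointer, and pushing unchanged content again adds nothing to the store.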

dmpetrov · on Oct 19, 2024

In this case, you need DVC if:

1. Files are too large for Git and Git LFS.

2. You prefer using S3/GCS/Azure as storage.

3. You need to track transformations/pipelines on the files: clean up a text file, train a model, etc.

Otherwise, vanilla Git may be sufficient.

agile-gift0262 · on Oct 20, 2024

It's not just for managing file versions. You can define a pipeline with different stages, plus the dependencies and outputs of each stage, and DVC will figure out which stages need running depending on which dependencies have changed. Stages can also output metrics and plots, and DVC has utilities to expose, explore and compare those.

johanneskanybal · on Oct 20, 2024

I mostly consult as a data engineer, not in ML ops, but I’m interested in some aspects of this. We have 10 years of Parquet files from 300+ different Kafka topics and we’re currently migrating to Apache Iceberg. We’ll backfill on an as-needed basis and it would be nice to track that with git. Would this be a good fit for that?

Another potential aspect would be tracking schema evolution in a nicer way than we currently do.

thx in advance, huge fan of anything-as-code and think it’s a great fit for data (20+ years in this area).

stochastastic · on Oct 20, 2024

Thanks for making and sharing DVC! It’s been a big help.

Is there any support that would be helpful? I’ll look at the project page too.

dmpetrov · on Oct 20, 2024

Thank you!

Just shoot an email to support and mention HN. I’ll read and reply.

dmpetrov · on Aug 17, 2024

Hey! I'm one of the creators of DataChain.

DataChain works on your local machine and manages files in storage (like images and PDFs in S3 or GCP). Users can slice and dice their files using metadata. Example:

- Download only files labeled "Cats" instead of the whole dataset. Use JSON/Parquet to get labels.

- Use LLMs to generate metadata. E.g., "Are there more than 3 people in the image?".

- Add custom metadata to create a rich "DataFrame" of your files

The API of the data frame is based on Python (Pydantic), but queries on Python objects are transpiled to SQL for the database (SQLite). Or you can just convert all the metadata into Pandas if you prefer.
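To make the "filter files by metadata, backed by SQLite" idea concrete, here is a minimal sketch using only the Python standard library. It does not use DataChain's API: `FileRecord`, `build_index`, `paths_with_label`, and the bucket paths are hypothetical, and a plain dataclass stands in for a Pydantic model.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical metadata row for one file in object storage."""
    path: str
    label: str
    size: int


def build_index(records):
    """Load file metadata into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT, label TEXT, size INTEGER)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [(r.path, r.label, r.size) for r in records],
    )
    return conn


def paths_with_label(conn, label):
    """Return the paths of files carrying the given label, sorted by path."""
    rows = conn.execute(
        "SELECT path FROM files WHERE label = ? ORDER BY path", (label,)
    )
    return [row[0] for row in rows]
```

With an index like this, only the paths returned for "Cats" would need to be downloaded, rather than the whole dataset.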

WDYT? I’d love to hear your thoughts!

dmpetrov · on Aug 9, 2024

Bridging the gap between AI and data warehouses is crucial, but I’m not sure SQL is the best fit for AI engineers who mainly work with Python and AI APIs.

At DataChain, we are solving this by creating a Python API that translates to SQL under the hood, which is pretty easy now with Pydantic. https://github.com/iterative/datachain

WDYT?

richardmeng · on Aug 9, 2024

Right, our product is designed for data practitioners who want snappy data analytics on unstructured data.

Thanks for sharing your project, super cool idea! What does it take if we want to integrate our SQL engine with datachain?

dmpetrov · on Aug 9, 2024

The open-source version uses SQLite. The SaaS version uses proprietary data warehouses, where your engine could be integrated.

dmpetrov · on June 5, 2022

> I do not treat Kubernetes as Cloud-Native... for end-users to achieve cloud native goals.

Do you mean that Cloud-Native is about applications that are built on top of K8S, not K8S itself?

What are the cloud native goals? Abstracting out end-users from any cloud?

allencloud · on June 5, 2022

> Do you mean that Cloud-Native is about applications that are built on top of K8S, not K8S itself?

Exactly. From the end user's perspective, cloud native is the API for accessing the cloud, even the distributed cloud of the future. Kubernetes is just one popular way for end users to consume cloud APIs. But AWS has its own way as well.

> What are the cloud native goals? Abstracting out end-users from any cloud?

We cannot reach a consensus unless we clarify from which position each opinion is offered. AWS, the leading cloud provider worldwide, would say that the AWS service API is the best practice of cloud native. The second- and third-place chasers would say that AWS's way is not cloud native at all, since AWS just wants to lock users into AWS. For end users, cloud native is a way to encapsulate every cloud provider's functionality in a single API layer, so that they can use any cloud with less effort and less cost.

dmpetrov · on June 5, 2022

I'd like to see research on the impact of WFH on people's spending. Many households have to pay extra for more space to have a home office while companies pay less :)

ghaff · on June 5, 2022

As with many things, averages are pretty worthless.

You mostly have two primary costs.

- Possible upgrades to living space so that you can "lend" your employer an office.

- Possible significant time/cost savings by not having to commute.

Mostly these aren't borne by the same people.

spangry · on June 5, 2022

It's just anecdotal, but I found myself unintentionally saving money like crazy when I was WFH during lockdown due to not shelling out every day for fuel, parking and buying lunch. But then again I'm someone who always pays the 'disorganisation tax'. I'm sure people with more optimised spending (e.g. catching the bus, bringing their lunch in from home etc.) didn't see as much of an effect.

dmpetrov · on June 5, 2022

Well... I need to rent a property with +1 bedroom to WFH. It is a substantial family expense.

dmpetrov · on June 1, 2022

It would be helpful to see how this differs from MLflow, in both the deployment part and the model registry part.

aguschin · on June 1, 2022

Hi! Thanks for the question! There are a few important differences:

- MLEM automatically extracts the metadata from the model for you. With MLflow, you need to specify ML framework and environment.

- For the Model Registry that you can build in Git with MLEM, you don't need to run a separate service or database; GitHub or GitLab is enough.

dmpetrov · on April 27, 2022

Yes, it's from the DVC/CML team! We started TPI as a "computational backend" for the CML project (CI/CD for ML). But then we realized that it can be useful as an independent tool.

dmpetrov · on April 27, 2022

Hey all, we are launching Terraform Provider Iterative (TPI).

It was designed for machine learning (ML/AI) teams and optimizes CPU/GPU expenses:

1. Spot instance auto-recovery (if an instance was evicted/terminated) with data and checkpoint synchronization

2. Auto-terminate instances when ML training is finished - you won't forget to terminate your expensive GPU instance for a week :)

3. Familiar Terraform commands and config (HCL)

The secret sauce is auto-recovery logic that is based on cloud auto-scaling groups and does not require any monitoring service to run (another cost-saving!). Cloud providers recover it for you. TPI just unifies auto-scaling groups for all the major cloud providers: AWS, Azure, GCP and Kubernetes. Yeah, it was tricky to unify all clouds :)
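Auto-recovery of this kind only pays off if the training job can resume from its last checkpoint. The sketch below shows that pattern in plain Python; it is not TPI code. The `train` function, the `Evicted` exception, and the `storage` dict (standing in for synchronized checkpoint storage) are assumptions made for the example.

```python
from typing import Optional


class Evicted(Exception):
    """Raised to simulate the cloud reclaiming a spot instance."""


def train(total_steps: int, storage: dict, evict_at: Optional[int] = None) -> int:
    """Train up to `total_steps`, resuming from the checkpoint in `storage`.

    Returns how many steps this invocation actually executed.
    """
    start = storage.get("step", 0)   # resume point; 0 on a fresh start
    step = start
    while step < total_steps:
        if evict_at is not None and step == evict_at:
            raise Evicted(f"evicted at step {step}")
        step += 1                 # stand-in for one real training step
        storage["step"] = step    # checkpoint after every step
    return step - start
```

If a run is evicted partway through, a second call with the same `storage` picks up at the saved step instead of starting over, so only the remaining steps are paid for.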

We'd love to hear your feedback!
