Right, DVC caches data for consistency and reproducibility.
If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.
Hi there! Maintainer and author here. Excited to see DVC on the front page!
Happy to answer any questions about DVC and our sister project DataChain https://github.com/iterative/datachain, which does data versioning with slightly different assumptions: no file copying and built-in data transformations.
It essentially makes sure that your results can reproducibly be generated from your original data. If any script or data file is changed, the parts of your pipeline that depend on it, possibly recursively, get re-run and the relevant results get updated automatically.
There's no chance of e.g. changing the structure of your original dataset slightly, forgetting to regenerate one of the intermediate models by accident, not noticing that the script to regenerate it doesn't work any more due to the new dataset structure, and then getting reminded a year later when moving to a new computer and trying to regen everything from scratch.
It's a lot like Unix make, but with the ability to keep track of different git branches and the data / intermediates they need, which saves you from needing to regen everything every time you make a new checkout, lets you easily exchange large datasets with teammates etc.
In theory, you could store everything in git, but then every time you made a small change to your scripts that e.g. changed the way some model works and slightly adjusted a score for each of ten million rows, your diff would be 10m LOC, and all versions of that dataset would be stored in your repo, forever, making it unbelievably large.
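To make the "don't store everything in git" point concrete, here is a minimal sketch of the general idea: the large file goes into a content-addressed cache, and only a tiny pointer file would be committed. This is an illustration under my own assumptions, not DVC's actual implementation; the `.ptr` suffix, the cache layout and the `add_to_cache` name are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file.

    Hypothetical helper: only the pointer file would be committed to git.
    """
    content = data_file.read_bytes()
    digest = hashlib.sha256(content).hexdigest()
    # Store the content under its own hash, so identical content is stored once.
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        cached.write_bytes(content)
    # The pointer is two short lines no matter how large the data file is.
    pointer = data_file.with_name(data_file.name + ".ptr")
    pointer.write_text(f"sha256: {digest}\nsize: {len(content)}\n")
    return pointer


def demo():
    """Track two versions of a dataset and return both pointer-file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "scores.csv"
        data.write_text("id,score\n1,0.50\n2,0.75\n")
        v1 = add_to_cache(data, root / "cache").read_text()
        data.write_text("id,score\n1,0.51\n2,0.76\n")  # every row changed slightly
        v2 = add_to_cache(data, root / "cache").read_text()
        return v1, v2
```

Even if every one of ten million rows changes, the diff in git is the hash and size lines of the pointer, while the cache (or a remote) holds the actual versions.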
The second part of this comment seems strange to me. Surely nothing on Hacker News is shared with the expectation that it will be interesting, or useful, to everyone. Equally, surely there are some people on HN who will be interested in a framework, even if it might be too heavy for other people.
It doesn’t force you to use any of the extra functionality. My team has been using it just for the version control part for a couple years and it has worked great.
Yep. I personally like DVC's pipeline implementation because it's lightweight and language-agnostic, but haven't gotten into using their experiment tracking features.
DVC supports multiple remotes. S3 is one of them; there are also WebDAV, local FS, Google Drive, and a bunch of others. You can see the full list here [0]. Disclaimer: not affiliated with DVC in any way, just a user.
It's not just for managing file versions. You can define a pipeline with different stages, along with the dependencies and outputs of each stage, and DVC will figure out which stages need running depending on which dependencies have changed. Stages can also output metrics and plots, and DVC has utilities to expose, explore and compare those.
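Roughly, the "figure out which stages need running" part can be sketched like this. It is a toy model, not DVC's code: the `Stage` class, `stale_stages` and the file names are made up for illustration, and stages are assumed to be listed in dependency order.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    deps: tuple  # files the stage reads (scripts count as deps too)
    outs: tuple  # files the stage writes


def digest(content: str) -> str:
    return hashlib.sha256(content.encode()).hexdigest()


def stale_stages(stages, files, recorded):
    """Return the names of stages that need re-running.

    files:    path -> current content
    recorded: path -> hash saved after the last successful run
    stages:   assumed to be in topological (dependency) order
    """
    dirty = {path for path, content in files.items()
             if recorded.get(path) != digest(content)}
    stale = []
    for stage in stages:
        if dirty.intersection(stage.deps):
            stale.append(stage.name)
            # Outputs of a re-run stage invalidate everything downstream.
            dirty.update(stage.outs)
    return stale


# Hypothetical three-stage pipeline used for illustration.
PIPELINE = [
    Stage("prepare", ("raw.csv", "prepare.py"), ("clean.csv",)),
    Stage("train", ("clean.csv", "train.py"), ("model.pkl",)),
    Stage("report", ("notes.md",), ("report.html",)),
]
```

With this model, editing `train.py` marks only `train` as stale, while editing `raw.csv` marks `prepare` and then `train`, because `clean.csv` becomes dirty once `prepare` is scheduled.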
I mostly consult as a data engineer, not ML ops, but I’m interested in some aspects of this. We have 10 years of Parquet files from 300+ different Kafka topics and we’re currently migrating to Apache Iceberg. We’ll backfill on an as-needed basis and it would be nice to track that with git. Would this be a good fit for that?
Another potential aspect would be tracking schema evolution in a nicer way than we currently do.
thx in advance, huge fan of anything-as-code and think it’s a great fit for data (20+ years in this area).
DataChain works on your local machine and manages files in storage (like images and PDFs in S3 or GCP). Users can slice and dice their files using metadata. Examples:
- Download only files labeled "Cats" instead of the whole dataset. Use JSON/Parquet to get labels.
- Use LLMs to generate metadata. E.g., "Are there more than 3 people in the image?".
- Add custom metadata to create a rich "DataFrame" of your files
The data-frame API is based on Python (Pydantic), but queries on Python objects are transpiled to SQL for the database (SQLite). Or you can just convert all the metadata into Pandas if you prefer.
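To give a feel for "Python expressions transpiled to SQL", here is a small standalone sketch using only the standard library. To be clear, this is not DataChain's API: `Column`, `Condition`, `select_files` and the `files` table are hypothetical names, and plain classes stand in for Pydantic models.

```python
import sqlite3


class Condition:
    """A SQL fragment plus its bound parameters."""

    def __init__(self, sql, params):
        self.sql = sql
        self.params = params

    def __and__(self, other):
        return Condition(f"({self.sql}) AND ({other.sql})",
                         self.params + other.params)


class Column:
    """Comparisons on a Column build SQL instead of evaluating in Python."""

    def __init__(self, name):
        self.name = name  # assumed to be a trusted identifier, not user input

    def __eq__(self, value):
        return Condition(f"{self.name} = ?", (value,))

    def __gt__(self, value):
        return Condition(f"{self.name} > ?", (value,))


def select_files(conn, condition):
    """Run the transpiled filter in SQLite and return the matching paths."""
    query = f"SELECT path FROM files WHERE {condition.sql} ORDER BY path"
    return [row[0] for row in conn.execute(query, condition.params)]


def build_demo_db():
    """Create an in-memory metadata table with made-up file records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT, label TEXT, people INTEGER)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [
            ("s3://bucket/a.jpg", "Cats", 0),
            ("s3://bucket/b.jpg", "Dogs", 5),
            ("s3://bucket/c.jpg", "Cats", 4),
        ],
    )
    return conn
```

Usage would look like `select_files(conn, (Column("label") == "Cats") & (Column("people") > 3))`; the parentheses are needed because `&` binds tighter than `==` in Python. Only the matching paths come back, so only those files would need downloading.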
Bridging the gap between AI and data warehouses is crucial, but I’m not sure SQL is the best fit for AI engineers who mainly work with Python and AI APIs.
At DataChain, we are solving this by creating a Python API that translates to SQL under the hood, which is pretty easy now with Pydantic. https://github.com/iterative/datachain
> Do you mean that Cloud-Native is about applications that are built on top of K8S, not K8S itself?
Exactly. From the end user's perspective, cloud native is the API for accessing the cloud, and in the future even the distributed cloud. Kubernetes is just one popular way for end users to consume cloud APIs, but AWS has its own way as well.
> What are the cloud native goals? Abstracting out end-users from any cloud?
We cannot reach a consensus unless we first make clear whose position we are speaking from. AWS, the leading cloud provider worldwide, would say that the AWS service API is the best practice of cloud native. The second- and third-place providers chasing AWS would say that AWS's way is not cloud native at all, since AWS just wants to lock users into AWS. For end users, cloud native is a way to encapsulate every cloud provider's functionality in a single API layer, so that they can use any cloud with less effort and less cost.
I'd like to see research on the impact of WFH on people's spending. Many households have to pay extra for more space to have a home office, while companies pay less :)
It's just anecdotal, but I found myself unintentionally saving money like crazy when I was WFH during lockdown, due to not shelling out every day for fuel, parking and lunch. But then again I'm someone who always pays the 'disorganisation tax'. I'm sure people with more optimised spending (e.g. catching the bus, bringing their lunch in from home etc.) didn't see as much of an effect.
Yes, it's from the DVC/CML team! We started TPI as a "computational backend" for the CML project (CI/CD for ML). But then we realized that it can be useful as an independent tool.
Hey all, we are launching Terraform Provider Iterative (TPI).
It was designed for machine learning (ML/AI) teams and optimizes CPU/GPU expenses:
1. Spot instances auto-recovery (if an instance was evicted/terminated) with data and checkpoint synchronization
2. Auto-terminate instances when ML training is finished - you won't forget to terminate your expensive GPU instance for a week :)
3. Familiar Terraform commands and config (HCL)
The secret sauce is auto-recovery logic that is based on cloud auto-scaling groups and does not require any monitoring service to run (another cost-saving!). Cloud providers recover it for you. TPI just unifies auto-scaling groups for all the major cloud providers: AWS, Azure, GCP and Kubernetes. Yeah, it was tricky to unify all clouds :)
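For context on what "auto-recovery with checkpoint synchronization" asks of the training script itself, here is a minimal sketch of a resumable loop. It does not use TPI or any cloud API: `Evicted`, `train` and `run_with_recovery` are hypothetical names, the eviction is simulated, and the "training" is a stand-in calculation.

```python
import json
import os
from pathlib import Path
from typing import Optional


class Evicted(Exception):
    """Stands in for a spot instance being reclaimed mid-run."""


def train(checkpoint: Path, total_steps: int, evict_at: Optional[int] = None) -> None:
    """Run (or resume) training, saving a checkpoint after every step."""
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())  # resume where we left off
    else:
        state = {"step": 0, "loss": 1.0}
    while state["step"] < total_steps:
        if evict_at is not None and state["step"] == evict_at:
            raise Evicted(f"instance lost at step {evict_at}")
        state["step"] += 1
        state["loss"] *= 0.9  # placeholder for a real training step
        # Write to a temp file, then rename, so a crash never leaves a torn file.
        tmp = checkpoint.with_name(checkpoint.name + ".tmp")
        tmp.write_text(json.dumps(state))
        os.replace(tmp, checkpoint)


def run_with_recovery(checkpoint: Path, total_steps: int, evict_at: int) -> int:
    """Restart training after a simulated eviction; return the restart count."""
    restarts = 0
    while True:
        try:
            train(checkpoint, total_steps, evict_at)
            return restarts
        except Evicted:
            restarts += 1
            evict_at = None  # the replacement instance is not evicted again
```

In a real setup the restart would come from the cloud provider's auto-scaling group rather than a `while True` loop, and the checkpoint would live in synchronized storage rather than on the local disk.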
If caching is not needed and streaming is required, we've created a sister tool, DataChain. It even supports WebDataset and can stream from tar archives and filter images by metadata.
WebDataset example: https://github.com/iterative/datachain/blob/main/examples/mu...
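As a rough illustration of streaming from a tar archive and filtering by metadata, here is a standard-library sketch. It is not DataChain or WebDataset code: it assumes the convention that files sharing a basename (e.g. `0001.jpg` and `0001.json`) form one sample and sit next to each other in the archive, and the function names and demo data are made up.

```python
import io
import json
import tarfile


def build_demo_tar() -> bytes:
    """Build a tiny in-memory archive with paired .json/.jpg entries (fake images)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, label in [("0001", "cat"), ("0002", "dog"), ("0003", "cat")]:
            entries = [
                (f"{key}.json", json.dumps({"label": label}).encode()),
                (f"{key}.jpg", f"fake-image-{key}".encode()),
            ]
            for name, payload in entries:
                info = tarfile.TarInfo(name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def _matches(sample: dict, wanted_label: str) -> bool:
    meta = json.loads(sample.get("json", b"{}"))
    return "jpg" in sample and meta.get("label") == wanted_label


def stream_samples(fileobj, wanted_label: str):
    """Yield (key, image_bytes) for samples whose JSON label matches.

    Mode "r|" reads the archive strictly front to back, so it also works on
    non-seekable streams such as a network response body.
    """
    current_key, sample = None, {}
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            key, _, ext = member.name.rpartition(".")
            if key != current_key:
                # A new basename means the previous sample is complete.
                if current_key is not None and _matches(sample, wanted_label):
                    yield current_key, sample["jpg"]
                current_key, sample = key, {}
            sample[ext] = data
    if current_key is not None and _matches(sample, wanted_label):
        yield current_key, sample["jpg"]
```

Only one sample is held in memory at a time, which is the point of streaming over caching the whole archive locally.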