Just dug through the datachain codebase to understand it a little more. While both projects have a DataFrame interface, I think they're very different!
Datachain seems to operate more at the orchestration layer, running Python libraries such as PIL and requests (for making API calls) and relying on an external database engine (SQLite, or BigQuery/ClickHouse) for the actual compute.
Daft is an actual data engine. Essentially, it's "multimodal BigQuery/ClickHouse". We've built out a lot of our own data system functionality, such as custom Rust-defined multimodal data structures, kernels that work on multimodal types, a query optimizer, distributed joins, etc.
In non-technical terms, I think this means that Datachain really is more of a "DBT" which orchestrates compute over an existing engine, whereas Daft is the actual compute/data engine that runs the workload. A project such as Datachain could actually run on top of Daft, which can handle the compute and I/O operations necessary to execute the requested workload.
DVC is closely tied to Git. We've heard people find that quite heavyweight when they're running experiments.
We think we can build a much better experience if we detach ourselves from Git. With Replicate, you just run your training script as usual, and it automatically tracks everything from within Python. You don't have to run any additional commands to track things.
Hey! I'm one of the founders at Comet.ml. We believe that Git should continue to be the approach for managing code (similar to dvc) but we adapted it to the ML workflow. Our approach is to compute a git patch on every run so later you can 'git apply' if you'd like (https://www.comet.ml/docs/user-interface/#the-reproduce-butt...).
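For anyone unfamiliar with the trick, the patch-per-run idea can be sketched with plain git commands (a minimal, self-contained toy; the file names and `run_42` label are made up, and Comet's actual implementation differs):

```shell
set -e
# Throwaway repo with one committed config file.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com && git config user.name you
echo "lr = 0.1" > train.cfg
git add train.cfg && git commit -qm "baseline"

# An experiment tweaks the code/config without committing:
echo "lr = 0.01" > train.cfg

# At run start, record the base commit plus a patch of uncommitted edits.
base=$(git rev-parse HEAD)
git diff HEAD > run_42.patch

# ...later, reproduce run 42's exact code state from those two artifacts:
git checkout -q "$base" -- .   # files back to the recorded commit
git apply run_42.patch         # re-apply the uncommitted edits
cat train.cfg                  # prints: lr = 0.01
```

The nice property is that the two artifacts (a commit hash and a patch file) are tiny and can be stored alongside the experiment's metrics.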
First of all, congrats on the launch! I do really like the aesthetics of the website, and the overall approach. It resonates with our vision and philosophy!
Good feedback on experiments feeling heavyweight! We've been focused on building a great foundation for managing data and pipelines in the previous DVC versions, and we were aware of this problem (https://github.com/iterative/dvc/issues/2799). As I mentioned, the Experiments feature is already in beta testing. It means users don't have to make commits anymore until they're ready, can still share experiments (it's a long topic and we'll write a blog post at some point, since I'm really excited about the way it's implemented using custom Git refs), get support for DL workflows (auto-checkpoints), and more. Would love to discuss and share details; it would be great to compare the approaches.
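Since custom Git refs came up: the core mechanic can be sketched in a few plain git commands (an illustrative toy, not DVC's actual scheme; the `refs/exps/` namespace and file names here are made up):

```shell
set -e
# Throwaway repo with a committed baseline.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com && git config user.name you
echo "epochs: 10" > params.yaml
git add params.yaml && git commit -qm "baseline"

# Try an experiment without committing to the branch:
echo "epochs: 20" > params.yaml
git add params.yaml

# Snapshot the staged state as a commit reachable only via a custom ref,
# so `git log` on the branch never shows it.
exp=$(git commit-tree "$(git write-tree)" -p HEAD -m "exp: epochs=20")
git update-ref refs/exps/epochs-20 "$exp"
git reset -q --hard            # working tree back to the baseline

# The experiment is still fully recoverable (and pushable, to share it):
git show refs/exps/epochs-20:params.yaml   # prints: epochs: 20
```

Because the experiment commit lives outside `refs/heads/`, normal branch history stays clean, yet the snapshot can be fetched, diffed, or promoted to a real branch later.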
We talked to a bunch of MLflow users, and the general impression we got is that it is heavyweight and hard to set up. MLflow is an all-encompassing "ML platform". Which is fine if you need that, but we're trying to just do one thing well. (Imagine if Git called itself a "software platform".)
In terms of features, Replicate points directly at an S3 bucket (so you don't have to run a server and Postgres DB), it saves your training code (for reproducibility and to commit to Git after the fact), and it has a nice API for reading and analyzing your experiments in a notebook.
Not really. We're trying to use MLflow with our "ML platform"[0]. Namely, it can save a model that expects high-dimensional inputs (which is most non-trivial models I've seen), but then "deploys" the model with the expectation of two-dimensional DataFrame inputs. Apparently, they're working on that.
There are also many ambiguities around Keras and TensorFlow, stemming from questions like "What is a Keras model? Is it a TensorFlow model now that they're integrated? Why are Keras models logged with the TensorFlow model logger when you use the autolog functionality?". These ambiguities are shared, since there are several ways to save and load models with TensorFlow, and we're looking into the Keras/TensorFlow integration closely. MLflow uses `cloudpickle`, and unpickling expects not only the same protocol but also the same Python version. We had to dig deeper than necessary.
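The protocol half of that brittleness is easy to see with the standard library's pickle (a minimal illustration; cloudpickle adds function-by-value pickling on top, which additionally ties artifacts to the interpreter's bytecode and hence the Python version):

```python
import pickle

# "Save a model" with the newest protocol this interpreter knows.
payload = pickle.dumps({"weights": [0.1, 0.2]},
                       protocol=pickle.HIGHEST_PROTOCOL)

# Protocol 2+ pickles start with a PROTO opcode followed by the protocol
# number, so the artifact itself declares what it needs to be loaded:
assert payload[0] == 0x80                      # PROTO opcode
print("pickled with protocol:", payload[1])

# An interpreter whose pickle module doesn't know that protocol raises
# on load; here we're on the same interpreter, so the round-trip works.
model = pickle.loads(payload)
print(model["weights"])                        # [0.1, 0.2]
```

So an artifact saved on a newer Python can simply refuse to load on an older one, before any of the class- or function-resolution problems even come into play.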
One other problem is when a model relies on ancillary functions, which you must be able to ship somehow. You end up tinkering with its guts, too.
Could you shed some light on how you deal with these matters? Namely: high-dimensional inputs for models, pre-processing/post-processing functions, serialization brittleness, and the Keras/TensorFlow "duality".
We have to inherit that complexity to spare our users from having to think about saving their experiments (we do that automatically to save models, metrics, and params). The workflow is: data --> collaborative notebooks with scheduling features and jobs --> (generate appbooks) --> automatically tracked models/params/metrics --> one-click deployment --> 'REST' API or form to invoke the model.
Congrats on the launch! This looks exciting. My company has been using Comet.ml, and they cover a few use cases that are missing here. Specifically, things like real-time visualizations and sharing experiments, which are key when working in a team. Are you planning on adding those?
CML looks really awesome. I've been to one of your online meetups. Are you planning to host more in the future? It would be great to learn about real production use cases!
Yes, we're aiming to have another at the end of July! If you have a particular use case you're interested in, we'd love to know. We might be able to develop some materials around it.
I'm a co-founder of Tulu.la. We're building a community-first solution: a mix of Twitch and Discord, but for professional events. Would be happy to share more details.
Hi Matthew. Is there a way to connect with the team behind cloudflare.tv? We're building something similar (Netflix Originals for tech conferences and meetups) and would be happy to talk with like-minded people.
Thank you! We aggregate data from different sources and enrich it with information from Twitter/websites/OpenGraph. We've tried to automate as much as possible, but some of the work is still done by people.
The stack is Go + Postgres + Elastic on the backend, and React + GraphQL + Material UI on the frontend.
Thank you for the suggestion! We'll add this to our roadmap.