Just dug through the datachain codebase to understand it a little more. While both projects have a DataFrame interface, I think they're very different!
Datachain seems to operate more at the orchestration layer, running Python libraries such as PIL and requests (for making API calls) and relying on an external database engine (SQLite, or BigQuery/ClickHouse) for the actual compute.
Daft is an actual data engine. Essentially, it's "multimodal BigQuery/ClickHouse". We've built out a lot of our own data system functionality, such as custom Rust-defined multimodal data structures, kernels that work on multimodal types, a query optimizer, distributed joins, etc.
In non-technical terms, I think this means that Datachain really is more of a "DBT" which orchestrates compute over an existing engine, whereas Daft is the actual compute/data engine that runs the workload. A project such as Datachain could actually run on top of Daft, which can handle the compute and I/O operations necessary to execute the requested workload.
DVC is closely tied to Git. We've heard people find that quite heavyweight when they're running experiments.
We think we can build a much better experience if we detach ourselves from Git. With Replicate, you just run your training script as usual, and it automatically tracks everything from within Python. You don't have to run any additional commands to track things.
Hey! I'm one of the founders at Comet.ml. We believe that Git should continue to be the approach for managing code (similar to dvc) but we adapted it to the ML workflow. Our approach is to compute a git patch on every run so later you can 'git apply' if you'd like (https://www.comet.ml/docs/user-interface/#the-reproduce-butt...).
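For anyone unfamiliar with the trick, the patch-per-run idea can be sketched with plain git commands (a minimal, self-contained toy; the file names and `run_42` label are made up, and Comet's actual implementation differs):

```shell
set -e
# Throwaway repo with one committed config file.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com && git config user.name you
echo "lr = 0.1" > train.cfg
git add train.cfg && git commit -qm "baseline"

# An experiment tweaks the code/config without committing:
echo "lr = 0.01" > train.cfg

# At run start, record the base commit plus a patch of uncommitted edits.
base=$(git rev-parse HEAD)
git diff HEAD > run_42.patch

# ...later, reproduce run 42's exact code state from those two artifacts:
git checkout -q "$base" -- .   # files back to the recorded commit
git apply run_42.patch         # re-apply the uncommitted edits
cat train.cfg                  # prints: lr = 0.01
```

The nice property is that the two artifacts (a commit hash and a patch file) are tiny and can be stored alongside the experiment's metrics.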
First of all, congrats on the launch! I do really like the aesthetics of the website, and the overall approach. It resonates with our vision and philosophy!
Good feedback on experiments feeling heavyweight! We've been focused on building a great foundation for managing data and pipelines in the previous DVC versions, and we were aware of this problem (https://github.com/iterative/dvc/issues/2799). As I mentioned, the Experiments feature is already in beta testing. It means users don't have to make commits anymore until they're ready, can still share experiments (it's a long topic and we'll write a blog post at some point, since I'm really excited about the way it's implemented using custom Git refs), get support for DL workflows (auto-checkpoints), and more. Would love to discuss and share details; it would be great to compare the approaches.
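Since custom Git refs came up: the core mechanic can be sketched in a few plain git commands (an illustrative toy, not DVC's actual scheme; the `refs/exps/` namespace and file names here are made up):

```shell
set -e
# Throwaway repo with a committed baseline.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email you@example.com && git config user.name you
echo "epochs: 10" > params.yaml
git add params.yaml && git commit -qm "baseline"

# Try an experiment without committing to the branch:
echo "epochs: 20" > params.yaml
git add params.yaml

# Snapshot the staged state as a commit reachable only via a custom ref,
# so `git log` on the branch never shows it.
exp=$(git commit-tree "$(git write-tree)" -p HEAD -m "exp: epochs=20")
git update-ref refs/exps/epochs-20 "$exp"
git reset -q --hard            # working tree back to the baseline

# The experiment is still fully recoverable (and pushable, to share it):
git show refs/exps/epochs-20:params.yaml   # prints: epochs: 20
```

Because the experiment commit lives outside `refs/heads/`, normal branch history stays clean, yet the snapshot can be fetched, diffed, or promoted to a real branch later.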
We talked to a bunch of MLflow users, and the general impression we got is that it is heavyweight and hard to set up. MLflow is an all-encompassing "ML platform". Which is fine if you need that, but we're trying to just do one thing well. (Imagine if Git called itself a "software platform".)
In terms of features, Replicate points directly at an S3 bucket (so you don't have to run a server and Postgres DB), it saves your training code (for reproducibility and to commit to Git after the fact), and it has a nice API for reading and analyzing your experiments in a notebook.
Not really. We're trying to use MLflow with our "ML platform"[0]. Namely, it can save a model that expects high-dimensional inputs (which is most non-trivial models I've seen), but then "deploys" the model with the expectation of two-dimensional DataFrame inputs. Apparently, they're working on that.
There are also many ambiguities around Keras and TensorFlow, stemming from questions like "What is a Keras model? Is it a TensorFlow model now that they're integrated? Why are Keras models logged with the TensorFlow model logger when you use the autolog functionality?". These ambiguities are shared, since there are several ways to save and load models with TensorFlow, and we're looking into the Keras/TensorFlow integration closely. MLflow uses `cloudpickle`, and unpickling expects not only the same protocol but also the same Python version. We had to dig deeper than necessary.
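The protocol half of that brittleness is easy to see with the standard library's pickle (a minimal illustration; cloudpickle adds function-by-value pickling on top, which additionally ties artifacts to the interpreter's bytecode and hence the Python version):

```python
import pickle

# "Save a model" with the newest protocol this interpreter knows.
payload = pickle.dumps({"weights": [0.1, 0.2]},
                       protocol=pickle.HIGHEST_PROTOCOL)

# Protocol 2+ pickles start with a PROTO opcode followed by the protocol
# number, so the artifact itself declares what it needs to be loaded:
assert payload[0] == 0x80                      # PROTO opcode
print("pickled with protocol:", payload[1])

# An interpreter whose pickle module doesn't know that protocol raises
# on load; here we're on the same interpreter, so the round-trip works.
model = pickle.loads(payload)
print(model["weights"])                        # [0.1, 0.2]
```

So an artifact saved on a newer Python can simply refuse to load on an older one, before any of the class- or function-resolution problems even come into play.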
One other problem is when a model relies on ancillary functions, which you must be able to ship somehow. You end up tinkering with its guts, too.
Could you shed some light on how you deal with these matters? Namely: high-dimensional inputs for models, pre-processing/post-processing functions, serialization brittleness, and the Keras/TensorFlow "duality".
We have to inherit that complexity to spare our users from having to think about saving their experiments (we do that automatically to save models, metrics, and params). The workflow is: data --> collaborative notebooks with scheduling features and jobs --> (generate appbooks) --> automatically tracked models/params/metrics --> one-click deployment --> 'REST' API or form to invoke the model.
Congrats on the launch! This looks exciting. My company has been using Comet.ml, and they cover a few use cases that are missing here. Specifically, things like real-time visualizations and sharing experiments, which are key when working in a team. Are you planning on adding those?
CML looks really awesome. I've been to one of your online meetups. Are you planning to host more in the future? It would be great to learn about real production use cases!
Yes, we're aiming to have another at the end of July! If you have a particular use case you're interested in, we'd love to know. We might be able to develop some materials around it.
I'm a co-founder of Tulu.la. We're building a community-first solution: a mix of Twitch and Discord, but for professional events. Would be happy to share more details.
Hi Matthew. Is there a way to connect with the team behind cloudflare.tv? We're building something similar (Netflix Originals for tech conferences and meetups) and would be happy to talk with like-minded people.
Thank you! We aggregate data from different sources and enrich it with information from Twitter/websites/OpenGraph. We've tried to automate as much as possible, but some of the work is still done by people.
The stack is Go + Postgres + Elastic on the backend, and React + GraphQL + Material UI on the frontend.
Thank you for the suggestion! We'll add this to our roadmap.