Hacker News | davidbuniat's comments

We let our Software Factory run for 15 hours autonomously. Output was 83 lines of highly optimized C++ code. 714 lines of tests. 8:1 test to code ratio. It fixed the bottleneck in a large codebase. Improved the TPC-H benchmark 2x. Verified memory leak using ASAN. Spent $160 of LLM calls.


totally, @MuffinFlavored, super-exciting.


This is awesome! Curious what your DSL looks like, and what's the accuracy of the LLM translating into your DSL?


It's similar to JavaScript, but not exactly identical. Here's an example that sends a summary of a calendar event two minutes before it starts (uses Gmail, OpenAI, and Slack):

```
function (recipient) {
  when: __0 = GoogleCalendar.when_next_event_is_in(time: B.Duration(120000));
  __1 = GoogleCalendar.get_next_event();
  __2 = OpenAI.get_summary_of(text: __1.description);
  __3 = BardeenCommons.get_string_concatenating_strings(strings: ["Your next event is:", __1.summary, "at", __1.startTime, "Here is a summary of the event:", __2]);
  Slack.send_message(message: __3, recipient: recipient);
}
```

Initial accuracy was about 10%, which was pretty meh TBH. With a lot of tweaking and tuning we were able to get it to 70%. This means that it takes about 2-3 attempts to get it to generate what's expected.

The great thing is that we only use AI to generate the DSL description of the automation and let the user tweak and tune it. Once it's there we just execute it with our engine.
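To make the "2-3 attempts" point concrete, here's a minimal sketch of a generate-validate retry loop. Everything here is hypothetical: `generate_dsl` and `parse_dsl` are stand-ins for the real LLM call and DSL parser, not Bardeen's actual pipeline.

```python
import random

def generate_dsl(prompt: str, rng: random.Random) -> str:
    # Hypothetical stand-in for the LLM call; succeeds ~70% of the time.
    if rng.random() < 0.7:
        return "function (recipient) { ... }"
    return "not valid DSL"

def parse_dsl(text: str) -> bool:
    # Hypothetical validity check standing in for the real DSL parser.
    return text.startswith("function")

def generate_with_retries(prompt: str, max_attempts: int = 3, seed: int = 0):
    rng = random.Random(seed)
    for attempt in range(1, max_attempts + 1):
        candidate = generate_dsl(prompt, rng)
        if parse_dsl(candidate):
            return candidate, attempt
    return None, max_attempts

# At 70% per-attempt accuracy, P(valid within 3 tries) = 1 - 0.3**3 ≈ 0.973,
# consistent with needing 2-3 attempts in practice.
```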


nice, pretty impressive to go from 10% to 70%. Do you also do AutoGPT-style looping, with re-asking and verification until the proper DSL is created?


Not yet, because honestly we were able to get where we needed in terms of accuracy without it. Creating a feedback loop and turning it into an agent-style interaction can be helpful for more complex automations, but honestly, you can get a lot of mileage out of what we have. Having said that, agents look really fascinating and some demos I saw are mind-blowing, so we will definitely look into it very closely.


thank you so much! We've recently announced a broad strategic collaboration with Intel Corporation to advance the field of AI data infrastructure. More specifically, this means initiatives in making sure Deep Lake datasets are trained super-smoothly on 3rd Gen Intel Xeon Scalable processors with built-in AI accelerators, as well as a range of technological improvements to Deep Lake given Intel's know-how in the field. Together, we will hopefully abstract away the need to build complex data infrastructure in-house :)


There's indeed plenty of tools out there, so the next few years are very interesting in terms of seeing which approach gets adopted by the wider audience. Feedback from the community (inc. feedback from you!) is instrumental in designing a tool that works well and addresses all the needs, so thanks a lot for your input re: CLI and branding. :)

Good observation, we've made sure that datasets can evolve better as you go. Specifically for your use case, you can query subsets of data and materialize it on the fly to be streamed, and then go back to a specific dataset "view" (i.e. saved query), as needed.

See how this works at around 5:50 here - https://youtu.be/SxsofpSIw3k

As for your last question, when the jpeg is appended using its file path, the compressed bytes get stored in the dataset without decompression/recompression. When the data is accessed as a numpy array, then the jpeg bytes are decompressed.
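The storage pattern described above (compressed at rest, decompressed only on access) can be sketched with the stdlib alone. This is an illustration of the pattern, not Deep Lake's actual implementation, and zlib stands in for jpeg:

```python
import zlib

class TensorStore:
    """Toy store: keeps samples as compressed bytes, decompresses on access."""
    def __init__(self):
        self._chunks = []

    def append_compressed(self, compressed: bytes):
        # Analogue of appending a jpeg by file path: the already-compressed
        # bytes are stored verbatim, no decompress/recompress round trip.
        self._chunks.append(compressed)

    def __getitem__(self, i) -> bytes:
        # Analogue of reading as a numpy array: decompress only on access.
        return zlib.decompress(self._chunks[i])

raw = b"pixel data " * 100
store = TensorStore()
store.append_compressed(zlib.compress(raw))
assert store._chunks[0] == zlib.compress(raw)  # stored bytes untouched
assert store[0] == raw                         # decompressed on read
```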

For researchers in Academia, our Growth plan is free. Since you work at a startup, the trial for Growth plan is for two weeks. If you want access, hit us up in the Community slack (slack.activeloop.ai - or you can just test the querying on public activeloop datasets!)


One more thing regarding compression: you can either deeplake.link() to your raw data lake without touching it, or use deeplake.read(), which preserves the compression as long as it matches the tensor's default compression.

https://docs.deeplake.ai/en/latest/deeplake.html?highlight=l... https://docs.deeplake.ai/en/latest/deeplake.html#deeplake.re...


thanks for the pointer, I'll definitely give deeplake a closer look


my pleasure, let us know if you have any feedback, would love to see you succeed with Deep Lake. :)


As a matter of fact, I do agree - the CV space is less crowded (until now). That's hopefully where we come in!

thanks a lot for the input, and thanks for trying us out. You know, it's always a work-in-progress, but we've actually done a major overhaul - this is our biggest release yet and I'd love it if you gave it a try, especially the querying feature we're super-proud of. :) https://docs.activeloop.ai/tutorials/querying-datasets

here's a couple of playbooks on how the new features (visualization + querying + version control) play together to solve complex workflows.

- https://docs.activeloop.ai/playbooks/training-with-lineage - https://docs.activeloop.ai/playbooks/evaluating-model-perfor... - https://docs.activeloop.ai/playbooks/training-reproducibilit...


thanks for the context and links! I replied to your comment slightly above in one comment. :)


Re: HF - we know them and admire their work (primarily, until very recently, focused on NLP, while we focus mostly on CV). As mentioned in the post, a large part of Deep Lake, including the Python-based dataloader and dataset format, is open source as well - https://github.com/activeloopai/deeplake.

Likewise, we curate a list of large open source datasets here -> https://datasets.activeloop.ai/docs/ml/, but our main thing isn't aggregating datasets (focus for HF datasets), but rather providing people with a way to manage their data efficiently. That being said, all of the 125+ public datasets we have are available in seconds with one line of code. :)

We haven't benchmarked against HF datasets in a while, but Deep Lake's dataloader is much, much faster in third-party benchmarks (see this: https://arxiv.org/pdf/2209.13705, and for an older version that was much slower than what we have now, see this: https://pasteboard.co/la3DmCUR2iFb.png). HF under the hood uses Git-LFS (to the best of my knowledge) and is not opinionated on formats, so LAION just dumps Parquet files on their storage.

While your setup would work for a few TBs, scaling to PBs would be tricky, including maintaining your own infrastructure. And yep, as you said, NAS/NFS wouldn't be able to handle the scale either (especially writes with 1k workers). I am also slightly curious about your use of mmap'd files with compressed image/video data (zero-copy won't happen unless you decompress inside the GPU ;)), but would love to learn more from you! Re: pricing, thanks for the feedback; storage is one component and is custom-priced for PB-scale workloads.
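On the mmap point: a memoryview over an mmap'd file is zero-copy for the raw bytes, but a compressed sample still forces a fresh allocation at decompress time. A stdlib sketch of that distinction (zlib standing in for jpeg/video codecs; an illustration, not a claim about the parent commenter's setup):

```python
import mmap, os, tempfile, zlib

payload = zlib.compress(b"frame bytes " * 1000)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    view = memoryview(mm)                 # zero-copy window onto the page cache
    decompressed = zlib.decompress(view)  # decompression must allocate a new buffer
    view.release()                        # release the export before mm closes

os.unlink(path)
assert decompressed == b"frame bytes " * 1000
```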


I was referring specifically to this page:

https://www.activeloop.ai/pricing/

which says

"Deep Lake Enterprise" and then "10TB of managed data (total)".

To me, that read as if the Enterprise plan is limited to a maximum of 10TB.


Ah, thank you so much for noticing this! This is a very important piece of feedback (we will fix it shortly). What we meant was that 10TB is the first tier of the enterprise plans, and the rest are custom-billed (typically because those also require other custom integrations, etc.).

If you find any other points of confusion, please send them our way and we will fix them; the community has been instrumental over the years in iterating on the product! :)


You can store your data either remotely or locally (see here on how https://docs.activeloop.ai/getting-started/creating-datasets...).

You can then visualize your datasets if they're stored on our cloud or in AWS/GCP, or you can drag and drop your local dataset in Deep Lake format into our UI (https://docs.activeloop.ai/dataset-visualization).

We do: version control, the Python-based dataloader, and the dataset format are all open source! Please check out https://github.com/activeloopai/deeplake.


a very fair point, we do usually say that our main (possible) competition is "traditional" data lakes! We've spent 4+ years designing this system specifically to resolve the issue for unstructured AI data like videos, images, and audio (and multimodal datasets, like the ones used to train models like Stable Diffusion, for instance).

Our main competitive advantage against the players you've mentioned is just that: our bet is that Deep Learning will overtake traditional BI workflows (especially with >90% of data generated today being unstructured), and we've been preparing for it. Traditional "BI data lakes" are pretty inefficient when it comes to storing data specifically for deep learning workflows. They also currently lack an entire suite of key features (visualization for those data types, a query engine based on tensors, etc.) needed to convince potential users.

As a matter of fact, we're seeing not only adoption from AI-first companies/startups who are building their infrastructure from the ground up, but mature companies who are hitting the limits of the traditional setups.

Keeping that in mind, we're working on making the onboarding for such companies much easier, so their cost of switching to a more efficient/performant setup is much lower.

As for Databricks specifically, we see them more as a complement, rather than a competitor.


"... our bet is that Deep Learning will overtake traditional BI workflows ..."

This is an interesting perspective. I have spent years in the traditional BI space, and my gut feeling is that analytics there are very much not fancy. Simple stuff seems to be where the real ROI is.

Are you saying that data storage, data model, etc that Activeloop puts in place to better support deep learning workflows will replace the data storage, data model, etc as the store of information but visualization and querying will still be like BI work? Or alternatively, are you saying that deep learning is on a roaring path to replace traditional BI analytics?


Thanks - it's very insightful to also hear your perspective as someone coming in from the BI space (if you have any more insights, please post them here, too). We have this internal joke where when one says their analytics is based on regressions, it's really an excel sheet, and if they say it's AI, it's a simple Ordinary Least Squares regression, and only a handful do actual AI/ML.

From what we are seeing in the market, both domains grow, but with an overlap, and it's expanding, too. I think while BI/Analytics would still be a major space, we would see more DL-based novel applications generating increasingly more business value (e.g. self-driving cars, robotics, agritech). After all, even in VERY traditional workflows/companies, like economic growth estimation, we're seeing DL being applied (e.g. looking at nightlight satellite imagery to estimate economic growth/urbanization).

So to answer your question: for some parts, I think it would be the former (complement), and for other applications it would call for replacement (particularly in the cases where companies use multimodal data).


So 90% of the data is unstructured, but 99% of the ML use cases are tabular (structured), where tree-based approaches can win against DL.

Also, 90% of the unstructured data is unlabeled. Hence, your calculation should be for "labeled" unstructured data; is this 90%? I would argue that outside big tech, this is 0.1%.

Your competition is not Databricks. Databricks' main use case is tabular data (both in the Delta Lake and in ML), i.e. Databricks competes with Snowflake. It tries to be a database, i.e. it tries to get out of the data lake.

I think your competition is S3 and R2 on the storage side, and transformer-based models (Hugging Face). Correct me if I am wrong, but the whole idea with transformer models is that the training was already done, and you can use a small amount of domain-specific data? I.e. you do not need a lot of storage?


"...I would argue that outside big tech, this is 0.1%."

Fair point regarding the unlabeled/unstructured data. One could also argue that labeled data isn't going to be a prerequisite forever (see https://ai.facebook.com/blog/the-first-high-performance-self...). We see a very sharp rise in unstructured data use for ML (especially a large spike caused by large generative models like DALL-E 2 and Stable Diffusion). In my opinion, the majority of the novel use cases are outside of big tech, and we also see a trend of "legacy" companies in media, manufacturing, etc. starting to build dedicated ML teams. The industry is still nascent, but it is growing fast. Frankly, we see the pain points we're solving resonate with so many more companies than just a year ago.

Agree re: Snowflake/Databricks, they are partners rather than competitors. We sit on top of S3/GCS or other blob storage and currently compete with the various in-house solutions that ML scientists build themselves. I do see your point regarding large foundational models that would only be fine-tuned on the tail end for various use cases. I believe there would still be companies building foundational models from scratch (currently at 5 billion images) so they can serve more application-specific products, and unstructured-data generators partnering with those companies, creating a good enough market for the tool.

