Cloudflare AI and Replicate are great for running off-the-shelf models, but anything custom is going to incur a 10+ minute cold start.
For running custom fine-tuned models on serverless, you could look into https://beam.cloud, which is optimized for serving custom models with extremely fast cold starts (I'm a little biased since I work there, but the numbers don't lie)
Serverless only works if the cold boot is fast. For context, my company runs a serverless cloud GPU product called https://beam.cloud, which we've optimized for fast cold start. We see Whisper in production cold start in under 10s (across model sizes). A lot of our users are running semi-real time STT, and this seems to be working well for them.
Is this because the users are streaming audio in a more conversational style?
For example, when you give Siri a command, you state it and then you stop speaking.
For most of ChatGPT's life, in OpenAI's iOS app, if you wanted to speak to input text, you would tap the record button and then tap it off, using either the app's own speech-to-text capability or Siri's input-field speech-to-text.
Conversational speech-to-text is more ongoing, though, which would make a 10-second cold start OK, because you don't sense as much lag while you're continuing to speak.
Or perhaps people in general record input longer than 10 seconds, and you are sending the first chunk as soon as possible to get Whisper going.
Then follow-up chunks are handled as warm boots, and the text is reassembled? Is that roughly correct?
Anything you can provide on the request and data flow that works with a longer cold boot time (single recording versus streaming, and how the audio is broken up) would be helpful.
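To make the question concrete, here's a minimal sketch of the flow I'm imagining. Everything here is hypothetical: the `transcribe` endpoint, the chunk size, and the simple space-join reassembly are my guesses, not Beam's actual API.

```python
def chunk_audio(audio: bytes, chunk_size: int) -> list[bytes]:
    """Split raw audio into fixed-size chunks."""
    return [audio[i:i + chunk_size] for i in range(0, len(audio), chunk_size)]


def transcribe_recording(audio: bytes, transcribe, chunk_size: int = 160_000) -> str:
    """Send chunks to a hypothetical transcribe(chunk) -> str endpoint.

    The first call absorbs the cold start; later chunks hit a warm
    container, and the partial transcripts are reassembled in order.
    """
    chunks = chunk_audio(audio, chunk_size)
    texts = [transcribe(chunk) for chunk in chunks]
    return " ".join(texts)
```

Is that roughly the shape of it, or does streaming work differently?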
What are you using for K8s autoscaling? We initially tried a few standard K8s scaling mechanisms and found that they didn't work well for GPU workloads. For example, if we were serving a low-RAM Huggingface model on GPU, it wouldn't trigger autoscaling. But since the GPU can only process one request at a time, the system would get bottlenecked while it waited to process requests one-by-one.
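One common workaround for this (a sketch of the general idea, not necessarily what any particular platform runs in production) is to scale on in-flight requests per replica rather than CPU/RAM, since a GPU model with tiny memory footprint can still only serve one request at a time:

```python
import math


def desired_replicas(in_flight: int, per_replica_concurrency: int,
                     max_replicas: int) -> int:
    """Target replica count so each GPU serves at most
    `per_replica_concurrency` concurrent requests."""
    if in_flight == 0:
        return 0  # scale to zero when idle (the serverless case)
    target = math.ceil(in_flight / per_replica_concurrency)
    return min(target, max_replicas)
```

With `per_replica_concurrency=1` (one request per GPU at a time), five queued requests would ask for five replicas instead of sitting behind one GPU, which is exactly the case CPU/RAM-based autoscaling misses.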
Sharing GPUs only really makes sense for GPUs that are large enough to share. MIGs can work for 80GB A100s but won't work with smaller cards like T4s. It also adds latency to the GPU operations. Unfortunately there's not yet a silver bullet for this stuff.
That’s why I was curious about utilization, since you mentioned low memory usage. I believe time slicing can work on those smaller cards these days. Did you explore any other optimizations, like batching or concurrency for the same model?
Model heterogeneity seems like a real challenge there: you could optimize usage if you knew all the sizes ahead of time and actually had GPU capacity to do efficient allocations, but it’s way harder than just doling out one GPU per pod.
edit: also, is the latency because of reduced resources? Or what do you mean?
I think the Beam website should be a lot clearer about how things work[0], but I think Beam is offering to bill you for your actual usage, in a serverless fashion. So, unless you're continuously running computations for the entire month, it won't cost $1200/mo.
If it works the way I think it does, it sounds appealing, but the GPUs also feel a bit small. The A10G only has 24GB of VRAM. They say they're planning to add an A100 option, but... only the 40GB model? Nvidia has offered an 80GB A100 for several years now, which seems like it would be far more useful for pushing the limits of today's 70B+ parameter models. Quantization can get a 70B parameter model running on less VRAM, but it's definitely a trade-off, and I'm not sure how the training side of things works with regards to quantized models.
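As a rough sanity check on why those VRAM numbers feel tight: the weights alone for an N-billion-parameter model take about N × (bits per weight) / 8 GB. This is a back-of-envelope sketch that ignores KV cache, activations, and runtime overhead, so real requirements are higher:

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    # 1e9 params * (bits / 8) bytes each, expressed in GB (1e9 bytes)
    return params_billion * bits_per_param / 8

# 70B at fp16 -> 140.0 GB; at 4-bit quantization -> 35.0 GB, which is
# still above both the 24GB A10G and the 40GB A100, and is why the
# 80GB A100 matters for models in this class.
```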
Beam's focus on Python apps makes a lot of sense, but what if I want to run `llama.cpp`?
Anyways, Beam is obviously a very small team, so they can't solve every problem for every person.
[0]: What is the "time to idle" for serverless functions? Is it instant? "Pay for what you use, down to the second" sounds good in theory, but AWS also uses per-second billing on tons of stuff... EC2 instances don't just stop billing you when they go idle, though; you have to manually shut them down and start them up. So, making the lifecycle clearer would be great. Even a quick example of how you would be billed might be helpful.
Slai is a tool to quickly build machine learning-powered applications. Our browser-based sandbox is the easiest way to build, deploy, and share machine learning models with zero setup [1]. We’re currently a team of four, and we’re looking to hire someone to help us with SRE / DevOps work.
You should have experience setting up and maintaining infrastructure at scale - ideally, you’ll have fluency with Docker/Kubernetes, EKS, KNative, Terraform, Terragrunt, Gunicorn, and Python. You should be able to communicate clearly in English, and you can work from anywhere (although if you prefer to work in person, we have an office in Cambridge, MA).
We have a hackathon-inspired culture – the team works in one week sprints and has very few meetings besides daily standup and a Friday afternoon all-hands. If this interests you, please send a brief email with your resume to eli at slai dot io.
On paper, SageMaker does everything, but it doesn’t do many of those things well. I think SageMaker is a great product for enterprises that want to maximize the products procured from a single vendor; it’s easy to buy SageMaker when all of your infra is already on AWS.
It’s fairly painful to productionize a model on SageMaker: they make you think about a lot of things and force you to fit into AWS primitives. Besides the code for the model, we don’t force users to think about anything. Our focus is helping engineers get models into production, not reading documentation.
Using our tool, you can fork a model and deploy it to production right away; there’s no time spent battling AWS primitives. We’re focused on developer experience above everything else, which means we enforce that sandboxes on our platform are consistent and reproducible.
Yep! You can upload pre-trained models — just upload a pickled binary of your model into the “data” section of the sandbox and then load and return the object in the train function.
We chose to do it this way to ensure that the binary you upload is properly tracked in our versioning system, and that it can be integrated into your handler.
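A minimal sketch of that workflow, based on the description above (the `model.pkl` file name and `data/` path are illustrative; check the sandbox docs for the exact layout):

```python
import pickle

# Locally, before uploading: serialize your pre-trained model.
# with open("model.pkl", "wb") as f:
#     pickle.dump(trained_model, f)


def train():
    # In the sandbox: load the binary you uploaded to the "data" section
    # and return it, so the platform can track it in the versioning
    # system and wire it into your handler.
    with open("data/model.pkl", "rb") as f:
        return pickle.load(f)
```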
Right now it's pretty hard to get GPU quota on AWS/GCP, so hopefully this is useful for you.