This is really cool to see, reminds me of the early days of CodeSandbox. Though this API looks _fantastic_. I love that you do VM configuration using `with`.
I tried to do something similar well over a decade ago during an internal hackathon (the motivation back then was speeding up destructive integration tests). My idea was to have the memory be a file on tmpfs, and simply `cp --reflink` it to get a copy-on-write clone. Then you wouldn't need to bother with userfaultfd or slow storage; the kernel would just magically do the right thing.
Unfortunately, the Linux kernel didn't support reflink on tmpfs (and still doesn't), and I'm not enough of a genius to have implemented it within 24 hours. :-)
I still believe it'd be nice to implement reflink for tmpfs, though. It's the perfect interface for copy-on-write forking of VM memory.
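For anyone curious, the core of the idea fits in a few lines. Here's a rough sketch (not anything from the hackathon, just an illustration) using the `FICLONE` ioctl that `cp --reflink` issues under the hood, with a plain-copy fallback for filesystems like tmpfs that reject it:

```python
import errno
import fcntl
import os

# FICLONE is the ioctl behind `cp --reflink`: _IOW(0x94, 9, int) from linux/fs.h.
FICLONE = 0x40049409

def reflink_clone(src_path: str, dst_path: str) -> str:
    """Try to copy-on-write clone src into dst; fall back to a full copy
    on filesystems (like tmpfs) that don't support reflink."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError as e:
            # tmpfs rejects FICLONE; do an ordinary byte copy instead.
            if e.errno not in (errno.EOPNOTSUPP, errno.ENOTTY,
                               errno.EXDEV, errno.EINVAL):
                raise
            while chunk := src.read(1 << 20):
                dst.write(chunk)
            return "copy"
```

On btrfs or XFS the clone is instant and shares extents until either side writes; on tmpfs you'd hit the fallback path, which is exactly the gap being lamented above.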
Glad to see the approach validated at scale! I hadn't seen your blog posts until they were linked here; I'm going to dig into the userfaultfd path. Would love to chat if you're open to it.
It's important to refresh entropy immediately after cloning. Still, there can be code that never assumed it could be cloned (even though there's always been `fork`, of course). Because of this, we don't live clone across workspaces for unlisted/private sandboxes, and we limit the use case to dev envs where no secrets are stored.
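A userspace analogue of the entropy problem, just to illustrate (this is a generic `fork` demo, not CodeSandbox's actual mechanism): any PRNG state is duplicated by a clone, so the child must reseed from fresh kernel entropy or it replays the parent's random stream.

```python
import os
import random

def reseed_after_clone() -> None:
    """Re-seed the PRNG from the kernel CSPRNG so a cloned process
    doesn't replay its parent's random stream (the same hygiene a
    cloned VM needs right after the clone)."""
    random.seed(os.urandom(32))

# Parent and child share identical PRNG state at the moment of the clone.
random.seed(1234)  # stand-in for whatever state existed pre-clone
r, w = os.pipe()
pid = os.fork()
if pid == 0:  # child: the "cloned VM"
    reseed_after_clone()
    os.write(w, random.getrandbits(64).to_bytes(8, "big"))
    os._exit(0)
os.waitpid(pid, 0)
child_value = int.from_bytes(os.read(r, 8), "big")
parent_value = random.getrandbits(64)
assert child_value != parent_value  # streams diverged after reseeding
```

Without the `reseed_after_clone()` call, both sides would draw identical values, which is the kind of surprise secret-handling code isn't written to expect.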
Oh wow! Unexpected and cool to see this post on Hacker News! Since then we have evolved our VM infra a bit, and I've written two more posts about this.
First, we started cloning VMs using userfaultfd, which allows us to bypass the disk and let children read memory directly from parent VMs [1].
And we also moved to storing memory snapshots compressed. To keep VM boots fast, we need to decompress on the fly as VMs read from the snapshot, so we chunk snapshots into 4 KB-8 KB pieces that are zstd-compressed [2].
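The chunking scheme is roughly like this (my own sketch, with zlib standing in for zstd so it stays stdlib-only; chunk size and function names are made up for illustration). Compressing each chunk independently is what makes random access possible: a read only decompresses the chunks it touches.

```python
import zlib

CHUNK = 4096  # the post mentions 4 KB-8 KB chunks; 4 KiB here

def compress_snapshot(data: bytes) -> list[bytes]:
    """Split a memory snapshot into fixed-size chunks and compress each
    one independently, so any chunk can be decompressed without its
    neighbors. (zlib stands in for zstd in this sketch.)"""
    return [zlib.compress(data[i:i + CHUNK])
            for i in range(0, len(data), CHUNK)]

def read_range(chunks: list[bytes], offset: int, length: int) -> bytes:
    """Serve a read at `offset` by decompressing only the chunks it
    overlaps, the way a booting VM would pull pages on demand."""
    first, last = offset // CHUNK, (offset + length - 1) // CHUNK
    buf = b"".join(zlib.decompress(chunks[i]) for i in range(first, last + 1))
    start = offset - first * CHUNK
    return buf[start:start + length]
```

The trade-off is the usual one: smaller chunks mean less wasted decompression per page fault but a worse compression ratio, which is presumably why the sweet spot landed in the 4 KB-8 KB range.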
Exactly, the result would've been different if the author hadn't disabled caching.
In this case it's because the iframes are loaded/unloaded multiple times, but we also spawn web workers where the same worker is spawned multiple times (for transpiling code in multiple threads, for example). In all those cases we rely on caching so we don't have to download the same worker code more than once.
If you want to be efficient in Amsterdam, you take a bike or public transport. That was faster than a car even before this change, and now even more so.