This lets AI work on sandboxes cloned from production instead of running on the production instances themselves. Yes, you can sandbox Claude Code on a production box, but there it cannot safely try changes that might break production. Sandboxes give AI that flexibility, allowing it to safely test changes and reproduce things via IaC like Ansible playbooks.
Blog author here. Actually, no. The model can be streamed into the DGX Spark, so we can run prefill of models much larger than 128GB (e.g. DeepSeek R1) on it. This feature is coming to EXO 1.0, which will be open-sourced soon™.
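For anyone wondering how prefill can run on a model that does not fit in memory, here is a toy sketch of the general layer-streaming idea. It is not EXO's implementation: `stream_layers` and `streamed_prefill` are hypothetical names, and the "layers" are trivial arithmetic standing in for real weights. It assumes only that layers can be fetched one at a time and applied to every prompt position before the next layer is fetched.

```python
from typing import Callable, Iterable, Iterator, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def stream_layers(num_layers: int) -> Iterator[Layer]:
    """Yield one toy layer at a time (hypothetical stand-in for
    loading a layer's weights from disk or over the network)."""
    for i in range(num_layers):
        scale = float(i + 1)  # toy "weights" for layer i
        # Bind scale as a default argument so each layer keeps its own value.
        yield lambda x, s=scale: [s * v + 1.0 for v in x]


def streamed_prefill(prompt_states: List[Vector],
                     layers: Iterable[Layer]) -> List[Vector]:
    """Apply each layer to all prompt positions, one layer at a time.

    Only the current layer is referenced inside the loop, so memory
    use is bounded by one layer plus the activations, not the whole model.
    """
    states = [list(s) for s in prompt_states]
    for layer in layers:
        states = [layer(s) for s in states]
        # The layer goes out of scope here; the next one is fetched lazily.
    return states


if __name__ == "__main__":
    # Two prompt positions, two streamed layers.
    print(streamed_prefill([[1.0], [2.0]], stream_layers(2)))
```

Because prefill processes the whole prompt at once, the cost of fetching each layer is spread across every prompt token, which is why streaming suits prefill better than token-by-token decoding.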
This looks like promising research, and I'm looking into reproducing it now. We want to lower the barrier to running large models as much as possible, so if this works, it would be a potential addition to the exo offering.
Yeah, combining these two would make a lot of sense. There is a big appetite to run larger models, even if slower, on clustered hardware. This way you can add compute to speed up the token rate, rather than adding it just to run the model at all.
It is also possible that some of these optimizations could help tune how work is distributed based on latency and bandwidth between nodes.
Does this work with Siri? I'm not running the beta, so I'm not familiar with the features and limitations, but I thought that it was either answering based on on-device inference (using a closed model) or Apple's cloud (using a model you can't choose). My understanding is that you can ask OpenAI via an integration they've built, and that in the future you may be able to reach out to other hosted models. I didn't see anything about being able to seamlessly reach out to your own locally hosted models, either as a Siri backup or for anything else. But like I said, I'm not running the beta!