We use a pretty standard tech stack of PyTorch + NCCL + MPI. We've used both OpenMPI and MPICH to varying degrees.
Kubeflow is interesting, but it solves a slightly different problem of scheduling/coordinating ML workflows on top of Kube. It doesn't get involved with how an ML job communicates within itself cross-node.
Probably OP was referring to the MPIOperator, TFOperator, PyTorchOperator, ... They are under the Kubeflow org, but can be deployed independently of Kubeflow itself. Several other projects use those operators to provide abstractions similar to the ones you mentioned in your blog post, e.g. gang scheduling, cross-node communication, ...
One difference is that these operators use the Kubernetes service interface for communication, generally exposing a headless service for each replica.
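Concretely, the per-replica headless Service looks roughly like this (the name, label key, and port are illustrative, not the operators' exact conventions):

```yaml
# Sketch of a headless Service giving one training replica a stable DNS name.
apiVersion: v1
kind: Service
metadata:
  name: trainer-worker-0            # hypothetical name
spec:
  clusterIP: None                   # "headless": DNS resolves directly to the pod IP
  selector:
    replica-index: "0"              # assumed label; the real operators use their own
  ports:
    - port: 29500                   # torch.distributed's default rendezvous port
```

Each replica can then reach its peers by DNS name instead of needing a separate discovery mechanism.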
Hi! Co-author here. We do keep the nodes running 24/7, so Kubernetes still provides the scheduling to decide which nodes are free or not at any given time. Generally starting a container on a pre-warmed node is still much much faster than booting a VM. Also, some of our servers are bare-metal.
EDIT: Also don't discount the rest of the Kubernetes ecosystem. It's more than just a scheduler. It provides configuration, secrets management, healthchecks, self-healing, service discovery, ACLs... there are absolutely other ways to solve each of these things. But when starting from scratch there's a wide field of additional questions to answer, problems to solve.
Isn't Kubernetes a pretty lousy scheduler when it doesn't take this into consideration? There are a number of schedulers used in high performance computing that should be able to do a better job.
Summary: scale has at least 2 different meanings. Scaling in resources doesn't really mean you need Kubernetes. Scaling in terms of workload diversity is a better use case for it.
Kubernetes is basically a knockoff of Borg, but Borg is designed (or evolved) to run diverse services (search, maps, gmail, etc.; batch and low latency). Ironically most people who run their own Kube clusters don't seem to have much workload diversity.
On the other hand, HPC is usually about scaling in terms of resources: running a few huge jobs on many nodes. A single job will occupy an entire node (and thousands of nodes), which is what's happening here.
I've never used these HPC systems but it looks like they are starting to run on the cloud. Kubernetes may still have been a defensible choice for other reasons, but as someone who used Borg for a long time, it's weird what it's turned into. Sort of like protobufs now have a weird "reflection service". Huh?
Exactly, we migrated to k8s not because we needed better scaling (EC2 auto scaling groups were working reasonably well for us), but because we kept inventing our own ways to do rolling deploys or run scheduled jobs, and had a variety of ways to store secrets. On top of that, developers were increasingly running their own containers with docker-compose to test services talking to each other.
We migrated to k8s to A) have a standard way to run containerized builds and get the benefit of "it works on my laptop" matching how it works in production (at least functionally), and B) establish a common set of patterns for managing deployed software.
Resource scheduling only became of interest after we migrated, when we realized that aggregating our payloads let us use things like spot instances without jeopardizing availability.
Condor and the like are for independent jobs "throughput computing" but the authors here are using MPI for tightly-coupled jobs. SLURM and Flux are actively-developed schedulers for these kind of jobs.
SLURM hits a nice sweet spot when you have a very traditional cluster: very homogeneous nodes (both hardware and software), standard logins (e.g. some kind of LDAP/AD), shared NFS files, trusted code. It's an absolute pain when you have:
- Lots of different kinds of nodes
- anything more complex dependency-wise than a handful of shared Conda envs
- anything involving docker
- anything vaguely untrusted
- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability
- anything more complex than 3-5 priority levels of scheduling
It's great if you hit that niche but it frankly struggles with the complexities of even moderately heterogeneous work loads.
It also just feels a bit dated. Even though Kube is complex, it's a joy to work with compared to SLURM. HashiCorp's stack is even better, imho.
Well, that's not a problem with Slurm (which will happily start your process on all nodes), but with typical MPI programming. And once you are running something computationally intensive over multiple nodes today, you are still using MPI.
>- anything more complex dependency wise than a handful of shared Conda envs
You can put whatever dependencies you want on your NFS (or copy them to your node). If you're running on a single node, it behaves 100% like running with a special login shell on OS XYZ, so I don't know what problems happen with dependencies. The main problem would be that it doesn't include any "service discovery" beyond OpenMPI.
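For what it's worth, a minimal job script along these lines might look like this (the paths, env name, and node counts are made up for illustration):

```bash
#!/bin/bash
# Hypothetical sbatch script: all dependencies live on the shared NFS mount,
# so nothing has to be installed per-node before the job runs.
#SBATCH --job-name=train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

# Activate an environment that lives on NFS (path is an assumption)
source /nfs/envs/myproject/bin/activate

# srun launches one task per node; MPI handles the rest of the communication
srun python train.py
```

From the job's point of view this really is just "run my shell environment on N nodes", which is why the dependency story is usually fine until NFS itself becomes the problem.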
>- anything involving docker
I have not used it, but there's enroot/singularity, the first of which is apparently dogfooded at Nvidia. You might need some adjustments for base images (because of MPI)... As I don't know the policy within these 5k+ cloud companies: can employees just execute any random image from Docker Hub there? That seems a little dangerous...
> anything vaguely untrusted
Is this linked to the Docker case? Does Kubernetes reboot nodes then? Slurm can do this. And while classical Slurm use cases definitely require shared accounts (because of the shared fs), Slurm should, afaik, merrily execute your programs even without any shared account other than the slurm user. You can attack this, obviously, but you can attack Kubernetes too; while it gets more scrutiny, it's also a byzantine collection of FANG-style requirements.
EDIT: What you can't work around is Slurm needing a comms channel back to the controller, though you could just firewall that off (jobs don't use Slurm to communicate...). As each job can execute a Prolog script, you can quite simply allow traffic to flow only between allocated nodes.
>- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability
that's indeed the case
>- anything more complex than 3-5 priority levels of scheduling
What kind of scheduling does Kubernetes implement? I guess you could write a plugin for Slurm doing that.
> It's great if you hit that niche but it frankly struggles with the complexities of even moderately heterogeneous work loads.
Except that your points didn't pertain to this (save maybe the dependencies one, if you mean actual service dependencies), I fully agree.
> you can put whatever dependencies you want on your NFS (or copy them to your node).
This is exactly what we do currently. For non-controlled data, this works. However, it gets really thorny when you involve CUI (controlled unclassified information), precisely because of the aforementioned shared fs.
Both SLURM and Kube let you write schedulers, but just getting SLURM to talk to the DB was a tough affair; some very poorly documented bugs were at play.
I haven't been on this project in a bit, so I don't recall the exact details. And maybe it's a lack of familiarity with SLURM. But I definitely felt hobbled by it. We are probably going to move to something based on HashiCorp's stack.
Yes, I guess you are still using NFSv3? We (really tiny vs. everyone else here) settled on that as well, because it requires less integration overall. Though if you're going the all-AD route, there's the auks plugin for running with NFSv4 (not sure how well ticket renewal works, though). And you can always just sbcast a zip of your tree and completely forego the NFS (if you store your data somewhere else). Normally you should also be able to write GRES plugins to "share" these resources.
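The sbcast route is roughly this, from inside an allocation (filenames and paths are illustrative):

```bash
# Fan the packed environment out to node-local storage on every
# allocated node, then unpack and run from there -- no NFS involved.
sbcast my_env.tar.gz /tmp/my_env.tar.gz
srun --ntasks-per-node=1 tar -xzf /tmp/my_env.tar.gz -C /tmp
srun /tmp/my_env/bin/python train.py
```

The trade-off is paying the broadcast cost up front instead of hammering the shared filesystem at import time.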
The problem with Slurm is how it's typically used: ssh into a shared login node with a shared file system, authorization tightly coupled to Linux users on that node, submit jobs with sbatch. A Kubernetes deployment feels much more modern and safe.
I have worked with containers + Slurm, where the vendor libmpi is injected into the container runtime [1] by a hook, which gives you close to bare-metal performance with some container goodness in terms of isolation and deployment.
Slurm should be the answer, but it isn't. In our ML environment, it required ML researchers to understand what is going on (more systems knowledge), and no one liked it. The situation devolved into sshing into machines and running jobs. You are right that Slurm is a good fit for HPC... I just don't think DL workloads are exactly that.
One FAANGUAMLetc engineer told me they SSH in, use Slurm, and track experiments by telling their manager which parameters were best the day before. This was very strange given that this company has a machine learning platform, so either this engineer did not use it, or did not use it much.
We were talking about our machine learning platform and taking it for a spin. We do have long-running notebook scheduling[0] but we wanted to be able to watch the notebook's output from multiple devices as it was running, and for it to survive closed tabs or network disconnections, not just get the results once it's done. We also wanted to be able to do that right from the notebook's interface, instead of SSH'ing and all that, as this was tedious and some of our users aren't that comfortable doing that.
Condor is clunky, but still in use in high energy physics, for example (LHC CMS detector data processing).
For greenfield deployments, I would recommend HashiCorp's Nomad before Kubernetes or Condor if your per-server container intent is ~1 (bare metal with a light hypervisor for orchestration), but I would still steer you to Kubernetes for microservices and web-based cookie-cutter apps (I know many finance shops using Nomad, and Cloudflare uses it with Consul, so there are no hard and fast rules).
Disclosure: Worked in HPC space managing a cluster for high energy physics. I also use (free version) Nomad for personal cluster workload scheduling.
I admit that Nomad is a fair middle ground due to its clean DSL and also because of the homogeneity of their workloads.
The team at OpenAI used the k8s API to build extensions around multi-tenancy (across teams) to saturate available allocations, plus task-specific scheduling modifications that were not supported by the stock k8s scheduler.
I don't know whether Nomad has this extensibility. Its plugins were limited to device plugins and task drivers when I last looked at it.
Another pro for Kubernetes is that it has a lot of inertia at the moment, with a large contributing community and a large pool of engineers experienced in using it. It's a guess, but I would assume the talent pool for HPC stuff isn't as big.
And yeah, I like the ability to easily support a diverse set of workloads on the same cluster. It's a simpler, easier-to-understand architecture compared to my previous experience with Hadoop.
Not sure that's a pro if your use case is just a platform for long running compute intensive jobs. The platform's goals may diverge even more from yours in the future, if a cloud provider's use case is the cause for a big rewrite for example.
A small part of said inertia is perhaps the CADT model of software development in action up close, where functionality can be redeveloped multiple times because someone is not satisfied with the outcome.
> Ironically most people who run their own Kube clusters don't seem to have much workload diversity.
This has not been my experience at all, but most of my clients are big corporations/enterprises. It's not uncommon to have a cluster with hundreds or thousands of different services running, from front-end static file servers to CRUD apps to machine learning. Even the startups I've worked with had at least a handful of different services they ran on K8s.
A better term to search for is "gRPC reflection service". I can't find the link, but I thought I saw people saying that this was idiomatic to use in many cases for Google Cloud, rather than compiling the schemas into the binary.
That feels weird to me because compilation was always the traditional / intended use of protobufs, whereas dynamic reflection was for a few non-critical debugging tools. I guess my point is that Google has a very specific computing environment and set of problems, and when they get exported to the outside world, they start to get warped because of the different set of problems. That feels like what happened with Borg/Kubernetes as well. I seem to see a lot of regrets about Kubernetes lately, from primary authors and operators.
Not to mention it's a well-known skillset that can more easily be hired for, as opposed to "come work on our crazy sauce job scheduler, you'll love it!"
Well, considering how looking at kubernetes config makes me question the choices I have made in my life that led me into this moment, "self-discovery" is not too far off, I think.
Utilize SNI and serve up a fake cert when someone scans you without a matching hostname. Censys is scraping you by IP, so it'll just see the fake cert.
You can do this in nginx by making the fake cert the first server block.
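A minimal sketch of that nginx setup (the cert paths and hostname are placeholders):

```nginx
# Catch-all block, listed first / marked default_server: connections with no
# matching SNI hostname (e.g. scans by bare IP) get a throwaway self-signed
# cert and are dropped without a response.
server {
    listen 443 ssl default_server;
    server_name _;
    ssl_certificate     /etc/nginx/fake/fake.crt;
    ssl_certificate_key /etc/nginx/fake/fake.key;
    return 444;   # nginx-specific: close the connection immediately
}

# Real site, only reachable when the client sends the right SNI hostname.
server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate     /etc/nginx/real/fullchain.pem;
    ssl_certificate_key /etc/nginx/real/privkey.pem;
    # ... real site config ...
}
```

Scanners indexing by IP then only ever record the fake cert, keeping your real hostname out of services like Censys.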
Not all bandwidth is created equal. Different providers can have widely different levels of performance. Higher-priced bandwidth _can_ also be better bandwidth.
Cloud providers haven't done a great job of explaining this so far, but Google Cloud's Network Service Tiers product is a good attempt at why they charge more for bandwidth, and how you can get it for cheaper: https://cloud.google.com/network-tiers/
That said, cheap (and presumably slower) bandwidth may be the right target for B2, given that its customers are probably using it for backups and private storage rather than, say, serving photos to millions of visitors around the globe.
So comparing AWS or Google transfer costs directly to B2 isn't necessarily a fair comparison, depending on your needs.
>So when you hit the limit you want AWS to stop your spending... how?
How? Very abruptly. I can stop the actual spend on my side (credit card side), but I need Amazon to implement a legal mechanism that stops me from being legally liable for a (potentially) unlimited amount. Without that it's too risky as a personal project... I'd rather do something safer, like BASE jumping.
>You're spending $/hr on compute, so terminate your instances.
If everything goes to plan. HN is full of horror stories about people waking up to huge bills after someone hacked their AWS account and mined bitcoin. This would be a toy/hobby project for me, though, and I'm painfully aware of my ignorance in this regard. Chances are I will screw up and end up with a bitcoin-mining hacker on my account... hence me needing a hard cap.
Yep, doing budgeting at the end of the month and I found a $20 Amazon charge. Turns out it was from a little bit of data I left on their DB service (which took _a lot_ of clicking to find the one I was actually using) after my free year had expired. Certainly not thousands of dollars, but for just experimenting with the system months ago, it was a rude wakeup call. Ideally I would have been able to say "Never spend more than $0.00, kill any services you have to, I'm _only_ using this for evaluation."
For development/personal stuff, yes, that is exactly what I want. I want to be able to put a hard cap on my downside.
Maybe they can make it a separate account type, maybe the cap can be distinguished by resource type, but for mucking around/side projects/etc it's unsettling to use a resource whose cost is potentially arbitrarily high if something goes wrong.
Yes, certainly, it's always possible that hardware will fail with no notice, and any architecture needs to account for that (this is equally true if you're running your own infrastructure, of course). For some applications downtime in these situations may be the right value tradeoff, for others they'll need to architect things assuming always-on redundancy. In terms of data durability, though, local SSD gets you a performance win by being local. If there's a catastrophic failure of that machine, data is gone.
What live migration buys you, the prospective cloud customer, is transparent avoidance of (a) Google's planned hardware maintenance windows (for example, network or power infrastructure maintenance) and (b) outages due to hardware failures which can be detected and migrated away from before they cause data loss.
The second category includes any number of situations which, if you can't migrate your workload, require taking a machine out of service, replacing some gear, bringing it back up, and hoping the replacement fixes the issue[0]. If we can instead migrate your VM away from the issue (say, for example, it's a bad root hard disk -- we need to replace the drive as lots of things in the VM hosting environment depend on it, but your workload is otherwise completely unaffected), we are able to service the affected machine with zero downtime for your service.
[0]: Think about the number of things that could manifest as a flaky network connection. Could be any (or all) of: bad cable, bad NIC, bad motherboard, bad CPU, or even bad RAM (since NICs use bus-mastering DMA from host memory in most cases). After migrating the workload away, we can take as long as necessary to reliably diagnose and fix the machine. Huge win for us and for our customers.