We use a pretty standard tech stack of PyTorch + NCCL + MPI. We've used both OpenMPI and MPICH to varying degrees.
Kubeflow is interesting, but it solves a slightly different problem of scheduling/coordinating ML workflows on top of Kube. It doesn't get involved with how an ML job communicates within itself cross-node.
Probably OP was referring to the MPIOperator, TFOperator, PyTorchOperator, ... They are under the Kubeflow org, but can be deployed independently of Kubeflow itself. Several other projects use those operators to provide abstractions similar to the ones you mentioned in your blog post, e.g. gang scheduling, cross-node communication, ...
One difference is that these operators use the Kubernetes service interface for communication, generally exposing a headless service for each replica.
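Concretely, the per-replica headless Service looks roughly like this (the name, label key, and port are illustrative, not the operators' exact conventions):

```yaml
# Sketch of a headless Service giving one training replica a stable DNS name.
apiVersion: v1
kind: Service
metadata:
  name: trainer-worker-0            # hypothetical name
spec:
  clusterIP: None                   # "headless": DNS resolves directly to the pod IP
  selector:
    replica-index: "0"              # assumed label; the real operators use their own
  ports:
    - port: 29500                   # torch.distributed's default rendezvous port
```

Each replica can then reach its peers by DNS name instead of needing a separate discovery mechanism.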
Hi! Co-author here. We do keep the nodes running 24/7, so Kubernetes still provides the scheduling to decide which nodes are free or not at any given time. Generally starting a container on a pre-warmed node is still much much faster than booting a VM. Also, some of our servers are bare-metal.
EDIT: Also don't discount the rest of the Kubernetes ecosystem. It's more than just a scheduler. It provides configuration, secrets management, healthchecks, self-healing, service discovery, ACLs... there are absolutely other ways to solve each of these things. But when starting from scratch there's a wide field of additional questions to answer, problems to solve.
Isn't Kubernetes a pretty lousy scheduler when it doesn't take this into consideration? There are a number of schedulers used in high performance computing that should be able to do a better job.
Summary: scale has at least 2 different meanings. Scaling in resources doesn't really mean you need Kubernetes. Scaling in terms of workload diversity is a better use case for it.
Kubernetes is basically a knockoff of Borg, but Borg is designed (or evolved) to run diverse services (search, maps, gmail, etc.; batch and low latency). Ironically most people who run their own Kube clusters don't seem to have much workload diversity.
On the other hand, HPC is usually about scaling in terms of resources: running a few huge jobs on many nodes. A single job will occupy an entire node (and thousands of nodes), which is what's happening here.
I've never used these HPC systems but it looks like they are starting to run on the cloud. Kubernetes may still have been a defensible choice for other reasons, but as someone who used Borg for a long time, it's weird what it's turned into. Sort of like protobufs now have a weird "reflection service". Huh?
Exactly, we migrated to k8s not because we needed better scaling (EC2 auto scaling groups were working reasonably well for us), but because we kept inventing our own ways to do rolling deploys or run scheduled jobs, and had a variety of ways to store secrets. On top of that, developers were increasingly running their own containers with docker-compose to test services talking to each other.
We migrated to k8s to A) have a standard way to run containerized builds and get the benefit of "it works on my laptop" matching how it works in production (at least functionally), and B) establish a common set of patterns for managing deployed software.
Resource scheduling only became of interest after we migrated, when we realized that aggregating our payloads let us use things like spot instances without jeopardizing availability.
Condor and the like are for independent jobs "throughput computing" but the authors here are using MPI for tightly-coupled jobs. SLURM and Flux are actively-developed schedulers for these kind of jobs.
SLURM hits a nice sweet spot when you have a very traditional cluster: very homogeneous nodes (both hardware and software), standard logins (e.g. some kind of LDAP/AD), shared NFS files, trusted code. It's an absolute pain when you have:
- Lots of different kinds of nodes
- anything more complex dependency-wise than a handful of shared Conda envs
- anything involving docker
- anything vaguely untrusted
- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability
- anything more complex than 3-5 priority levels of scheduling
It's great if you hit that niche but it frankly struggles with the complexities of even moderately heterogeneous work loads.
It also just feels a bit dated. Even though Kube is complex, it's a joy to work with compared to SLURM. HashiCorp's stack is even better, imho.
Well, that's not a problem with Slurm (which will happily start your process on all nodes), but with typical MPI programming. And once you are running something computationally intensive over multiple nodes today, you are still using MPI.
>- anything more complex dependency wise than a handful of shared Conda envs
You can put whatever dependencies you want on your NFS (or copy them to your node). If you're running on a single node, it behaves 100% like running with a special login shell on OS XYZ, so I don't know what problems happen with dependencies. The main problem would be that it doesn't include any "service discovery" beyond OpenMPI.
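For what it's worth, a minimal job script along these lines might look like this (the paths, env name, and node counts are made up for illustration):

```bash
#!/bin/bash
# Hypothetical sbatch script: all dependencies live on the shared NFS mount,
# so nothing has to be installed per-node before the job runs.
#SBATCH --job-name=train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1

# Activate an environment that lives on NFS (path is an assumption)
source /nfs/envs/myproject/bin/activate

# srun launches one task per node; MPI handles the rest of the communication
srun python train.py
```

From the job's point of view this really is just "run my shell environment on N nodes", which is why the dependency story is usually fine until NFS itself becomes the problem.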
>- anything involving docker
I have not used it, but there's enroot/singularity, the first of which is apparently dogfooded at Nvidia. You might need some adjustments for base images (because of MPI)... As I don't know the policy within these 5k+ cloud companies: can employees just execute any random image from Docker Hub there? That seems a little dangerous...
> anything vaguely untrusted
Is this linked to the Docker case? Does Kubernetes reboot nodes then? Slurm can do this. And while classical Slurm use cases definitely require shared accounts (because of the shared fs), Slurm should, afaik, merrily execute your programs even without any shared account other than the slurm user. You can attack this, obviously, but you can attack Kubernetes too; while it gets more scrutiny, it's also a byzantine collection of FANG-style requirements.
EDIT: What you can't work around is Slurm needing a comms channel back to the controller, though you could just firewall that off (jobs don't use Slurm to communicate...). As each job can execute a Prolog script, you can quite simply allow traffic to flow only between allocated nodes.
>- any kind of partitioning worse than 3 nines e.g. connectivity or uptime instability
that's indeed the case
>- anything more complex than 3-5 priority levels of scheduling
What kind of scheduling does Kubernetes implement? I guess you could write a plugin for Slurm doing that.
> It's great if you hit that niche but it frankly struggles with the complexities of even moderately heterogeneous work loads.
Except that your points didn't pertain to this (save maybe the dependencies one, if you mean actual service dependencies), I fully agree.
> you can put whatever dependencies you want on your NFS (or copy them to your node).
This is exactly what we do currently. For non-controlled data, this works. However, it gets really thorny when you involve CUI (controlled unclassified information), precisely because of the aforementioned shared fs.
Both SLURM and Kube let you write schedulers, but just getting SLURM to talk to the DB was a tough affair; some very poorly documented bugs were at play.
I haven't been on this project in a bit, so I don't recall the exact details. And maybe it's a lack of familiarity with SLURM. But I definitely felt hobbled by it. We are probably going to move to something based on HashiCorp's stack.
Yes, I guess you are still using NFSv3? We (really tiny vs. everyone else here) settled on that as well, because it requires less integration overall. Though if you're going the all-AD route, there's the auks plugin for running with NFSv4 (not sure how well ticket renewal works, though). And you can always just sbcast a zip of your tree and completely forego the NFS (if you store your data somewhere else). Normally you should also be able to write GRES plugins to "share" these resources.
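The sbcast route is roughly this, from inside an allocation (filenames and paths are illustrative):

```bash
# Fan the packed environment out to node-local storage on every
# allocated node, then unpack and run from there -- no NFS involved.
sbcast my_env.tar.gz /tmp/my_env.tar.gz
srun --ntasks-per-node=1 tar -xzf /tmp/my_env.tar.gz -C /tmp
srun /tmp/my_env/bin/python train.py
```

The trade-off is paying the broadcast cost up front instead of hammering the shared filesystem at import time.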
The problem with Slurm is how it's typically used: ssh into a shared login node with a shared file system, authorization tightly coupled to Linux users on that node, submit jobs with sbatch. A Kubernetes deployment feels much more modern and safe.
I have worked with containers + Slurm, where the vendor libmpi is injected into the container runtime [1] by a hook, which gives you close to bare-metal performance with some container goodness in terms of isolation and deployment.
Slurm should be the answer, but it isn't. In our ML environment, it required ML researchers to understand what is going on (more systems knowledge), and no one liked it. The situation devolved into sshing into machines and running jobs. You are right that Slurm is a good fit for HPC... I just don't think DL workloads are exactly that.
One FAANGUAMLetc engineer told me they SSH in, use Slurm, and track experiments by telling their manager which parameters were best the day before. This was very strange given that this company has a machine learning platform, so either this engineer did not use it, or did not use it much.
We were talking about our machine learning platform and taking it for a spin. We do have long-running notebook scheduling[0] but we wanted to be able to watch the notebook's output from multiple devices as it was running, and for it to survive closed tabs or network disconnections, not just get the results once it's done. We also wanted to be able to do that right from the notebook's interface, instead of SSH'ing and all that, as this was tedious and some of our users aren't that comfortable doing that.
Condor is clunky, but still in use in high energy physics, for example (LHC CMS detector data processing).
For greenfield deployments, I would recommend HashiCorp's Nomad before Kubernetes or Condor if your per-server container intent is ~1 (bare metal with a light hypervisor for orchestration), but I would still steer you to Kubernetes for microservices and web-based cookie-cutter apps (I know many finance shops using Nomad, and Cloudflare uses it with Consul, so there are no hard and fast rules).
Disclosure: Worked in HPC space managing a cluster for high energy physics. I also use (free version) Nomad for personal cluster workload scheduling.
I admit that Nomad is a fair middle ground due to its clean DSL and also because of the homogeneity of their workloads.
The team at OpenAI used the k8s API to build extensions around multi-tenancy (across teams) to saturate available allocations, plus task-specific scheduling modifications that were not supported by the stock k8s scheduler.
I don't know whether Nomad has this extensibility. Its plugins were limited to device plugins and task drivers when I last looked at it.
Another pro for Kubernetes is that it has a lot of inertia at the moment, with a large contributing community and a large pool of engineers experienced in using it. It's a guess, but I would assume the talent pool for HPC stuff isn't as big.
And yeah, I like the ability to easily support a diverse set of workloads on the same cluster. It's a simpler, easier-to-understand architecture compared to my previous experience with Hadoop.
Not sure that's a pro if your use case is just a platform for long running compute intensive jobs. The platform's goals may diverge even more from yours in the future, if a cloud provider's use case is the cause for a big rewrite for example.
A small part of said inertia is perhaps the CADT model of software development in action up close, where functionality can be redeveloped multiple times because someone is not satisfied with the outcome.
> Ironically most people who run their own Kube clusters don't seem to have much workload diversity.
This has not been my experience at all, but most of my clients are big corporations/enterprises. It's not uncommon to have a cluster with hundreds or thousands of different services running, from front-end static file servers to CRUD apps to machine learning. Even the startups I've worked with had at least a handful of different services they ran on K8s.
A better term to search for is "gRPC reflection service". I can't find the link, but I thought I saw people saying that this was idiomatic to use in many cases for Google Cloud, rather than compiling the schemas into the binary.
That feels weird to me because compilation was always the traditional / intended use of protobufs, whereas dynamic reflection was for a few non-critical debugging tools. I guess my point is that Google has a very specific computing environment and set of problems, and when they get exported to the outside world, they start to get warped because of the different set of problems. That feels like what happened with Borg/Kubernetes as well. I seem to see a lot of regrets about Kubernetes lately, from primary authors and operators.
Not to mention it's a well-known skillset that can more easily be hired for, as opposed to "come work on our crazy sauce job scheduler, you'll love it!"
Well, considering how looking at kubernetes config makes me question the choices I have made in my life that led me into this moment, "self-discovery" is not too far off, I think.
Utilize SNI and serve up a fake cert when someone scans you without a matching hostname. Censys is scraping you by IP, so it'll just see the fake cert.
You can do this in nginx by making the fake cert the first server block.
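A minimal sketch of that nginx setup (the cert paths and hostname are placeholders):

```nginx
# Catch-all block, listed first / marked default_server: connections with no
# matching SNI hostname (e.g. scans by bare IP) get a throwaway self-signed
# cert and are dropped without a response.
server {
    listen 443 ssl default_server;
    server_name _;
    ssl_certificate     /etc/nginx/fake/fake.crt;
    ssl_certificate_key /etc/nginx/fake/fake.key;
    return 444;   # nginx-specific: close the connection immediately
}

# Real site, only reachable when the client sends the right SNI hostname.
server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate     /etc/nginx/real/fullchain.pem;
    ssl_certificate_key /etc/nginx/real/privkey.pem;
    # ... real site config ...
}
```

Scanners indexing by IP then only ever record the fake cert, keeping your real hostname out of services like Censys.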
Not all bandwidth is created equal. Different providers can have widely different levels of performance. Higher-priced bandwidth _can_ also be better bandwidth.
Cloud providers haven't done a great job of explaining this so far, but Google Cloud's Network Service Tiers product is a good attempt at why they charge more for bandwidth, and how you can get it for cheaper: https://cloud.google.com/network-tiers/
That said, cheap (and presumably slower) bandwidth may be the right target for B2, given that its customers are probably using it for backups and private storage rather than, say, serving photos to millions of visitors around the globe.
So comparing AWS or Google transfer costs directly to B2 isn't necessarily a fair comparison, depending on your needs.
>So when you hit the limit you want AWS to stop your spending... how?
How? Very abruptly. I can stop the actual spend on my side (credit card side), but I need Amazon to implement a legal mechanism that stops me from being legally liable for a (potentially) unlimited amount. Without that it's too risky as a personal project... I'd rather do something safer, like BASE jumping.
>You're spending $/hr on compute, so terminate your instances.
If everything goes to plan. HN is full of horror stories about people waking up to huge bills after someone hacked their AWS account and mined bitcoin. This would be a toy/hobby project for me, though, and I'm painfully aware of my ignorance in this regard. Chances are I will screw up and end up with a bitcoin-mining hacker on my account... hence me needing a hard cap.
Yep, doing budgeting at the end of the month and I found a $20 Amazon charge. Turns out it was from a little bit of data I left on their DB service (which took _a lot_ of clicking to find the one I was actually using) after my free year had expired. Certainly not thousands of dollars, but for just experimenting with the system months ago, it was a rude wakeup call. Ideally I would have been able to say "Never spend more than $0.00, kill any services you have to, I'm _only_ using this for evaluation."
For development/personal stuff, yes, that is exactly what I want. I want to be able to put a hard cap on my downside.
Maybe they can make it a separate account type, maybe the cap can be distinguished by resource type, but for mucking around/side projects/etc it's unsettling to use a resource whose cost is potentially arbitrarily high if something goes wrong.
Yes, certainly, it's always possible that hardware will fail with no notice, and any architecture needs to account for that (this is equally true if you're running your own infrastructure, of course). For some applications downtime in these situations may be the right value tradeoff, for others they'll need to architect things assuming always-on redundancy. In terms of data durability, though, local SSD gets you a performance win by being local. If there's a catastrophic failure of that machine, data is gone.
What live migration buys you, the prospective cloud customer, is transparent avoidance of (a) Google's planned hardware maintenance windows (for example, network or power infrastructure maintenance) and (b) outages due to hardware failures which can be detected and migrated away from before they cause data loss.
The second category includes any number of situations which, if you can't migrate your workload, require taking a machine out of service, replacing some gear, bringing it back up, and hoping the replacement fixes the issue[0]. If we can instead migrate your VM away from the issue (say, for example, it's a bad root hard disk -- we need to replace the drive as lots of things in the VM hosting environment depend on it, but your workload is otherwise completely unaffected), we are able to service the affected machine with zero downtime for your service.
[0]: Think about the number of things that could manifest as a flaky network connection. Could be any (or all) of: bad cable, bad NIC, bad motherboard, bad CPU, or even bad RAM (since NICs use bus-mastering DMA from host memory in most cases). After migrating the workload away, we can take as long as necessary to reliably diagnose and fix the machine. Huge win for us and for our customers.