Hacker News | antongribok's comments

As someone who's in charge of close to an exabyte on Ceph, I couldn't disagree with you more.

Done correctly, Ceph is extremely reliable, resilient, and fast. Once you get over the initial learning curve, dare I say, even a joy to work with.


For parallel read/write access across many thousands of large-ish files (i.e. multiples of the minimum chunk size) I'm sure it does grand.

But for metadata-heavy operations, e.g. git, it's not the FS I would choose. Like Lustre, it can be fast if your workload aligns with its tradeoffs, but high metadata loads are not CephFS's strong point (nor that of many other distributed filesystems).


I concur, even though I have only used it as a hobbyist.


I think more people should know about the existence of ZRAM on modern Linux distributions. It's really changed the way I look at swap configs.

ZRAM is a compressed block device that is stored in RAM. It's great!

Previously, if I ever had high memory pressure situations, I really dreaded the slowdowns. Now, with swap sitting on top of /dev/zram0 it's a completely different experience.

I have ZRAM enabled on all of my personal machines, both laptops with limited memory, and desktops with 64 or 128GB of RAM. It's rarely used, but it is nice to have that extra room sometimes.

A zram device is so much faster than even the latest NVMe drives.


One interesting thing with zram (which OS X also does by default) is that certain memory leaks... effectively don't. I have a little raspberry pi where I have zram enabled. If I make a string in python and keep appending 'a's to it, eventually zram just soaks it up:

  $ zramctl
  NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
  /dev/zram0 zstd          3.8G  2.3G 13.2M 17.2M       4 [SWAP]
That's 2.3GB of 'a's compressed down to less than 20MB.
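
The effect is easy to reproduce outside the kernel. This is a minimal sketch that uses Python's zlib as a stand-in for the zstd compressor shown in the zramctl output, so the exact ratio will differ. The 4 KiB page size is also an assumption; the point is only that a long run of identical bytes compresses page by page to almost nothing.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size; zram compresses memory one page at a time


def compressed_size(data: bytes, page_size: int = PAGE_SIZE) -> int:
    """Compress data page by page and return the total compressed size in bytes."""
    total = 0
    for offset in range(0, len(data), page_size):
        total += len(zlib.compress(data[offset:offset + page_size]))
    return total


original = b"a" * (16 * 1024 * 1024)  # 16 MiB of 'a's, like the leaking string
compressed = compressed_size(original)
ratio = len(original) / compressed
print(f"{len(original)} bytes -> {compressed} bytes ({ratio:.0f}x smaller)")
```

Real application memory is rarely this repetitive, so typical zram ratios are far lower than what this toy case shows.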


macOS also uses compression in the virtual memory layer.

(It's fun to note that I try to type out "virtual memory" in this thread, because I don't want people to think I talk about virtual machines.)


Make sure you have solid Linux system monitoring in general. About 50% of running Ceph successfully at scale is just basic, solid system monitoring and alerting.


This line of advice basically comes down to: have a competent infrastructure team. Sometimes you gotta move fast, but this is where having someone on infrastructure who knows what they are doing comes in and pays dividends. No competent infra guy is going to skip setting up Linux monitoring. But you see some companies hit 100 people and get revenue, and then this type of thing blows up in their face.


Not sure what you mean about Ceph wanting to be in a single rack.

I run Ceph at work. We have some clusters spanning 20 racks in a network fabric that has over 100 racks.

In a typical Leaf-Spine network architecture, you can easily have sub 100 microsecond network latency which would translate to sub millisecond Ceph latencies.

We have one site that is Leaf-Spine-SuperSpine, and the difference in network latency is barely measurable between machines in the same network pod and between different network pods.


I meant that you can't span a single cluster over a high-latency link, unlike, for example, garagefs.


Ceph has synchronous replication: writes have to be acked by all replicas before the client gets an ack. Fundamentally, the latency of Ceph is at least the latency between the OSDs. This is a tradeoff Ceph makes for strong consistency.
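
That lower bound can be sketched as a small model. The function name and the round-trip times below are hypothetical, and the model ignores disk and CPU time entirely. It only illustrates why a single distant replica sets the floor for every write.

```python
def min_write_latency_ms(client_to_primary_rtt_ms: float,
                         primary_to_replica_rtts_ms: list[float]) -> float:
    """Lower bound on write latency under synchronous replication.

    The write is acked to the client only after every replica has acked,
    so the slowest primary-to-replica round trip is added on top of the
    client's own round trip to the primary.
    """
    slowest_replica = max(primary_to_replica_rtts_ms, default=0.0)
    return client_to_primary_rtt_ms + slowest_replica


# All OSDs inside one leaf-spine fabric (assumed ~0.1 ms round trips).
same_site = min_write_latency_ms(0.1, [0.1, 0.1])

# One replica behind an assumed 40 ms WAN link.
stretched = min_write_latency_ms(0.1, [0.1, 40.0])

print(f"same site: {same_site:.1f} ms, stretched: {stretched:.1f} ms")
```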


I know. We've run it for nearly a decade now. I mentioned it because a lot of uses for MinIO are pretty small.

I had 2 servers at home running its built-in site replication, and it was a super easy setup. Replicating that with Ceph would take far more hardware and work, so while Ceph might theoretically fit the feature list, realistically it isn't an option.


The Ceph way of doing asynchronous replication would be to run separate clusters and ship incremental snapshots between them. I don't know if anyone's programmed the automation for that, but it's definitely doable. For S3 only, radosgw has its own async replication thing.

https://ceph.io/en/news/blog/2025/stretch-cluuuuuuuuusters-p...

Disclaimer: ex-Ceph-developer.


Monash University is also a Ceph Foundation member.

They've been active in the Ceph community for a long time.

I don't know any specifics, but I'm pretty sure their Ceph installation is pretty big and used to support critical data.


What did the provider do? Did they put your IMEI onto some list of other customers that complained, where all of you get better network prioritization?

I'm genuinely curious.


I've lived most of my adult life in houses with forced air furnaces (albeit powered via natural gas, not propane), and what you are saying is inaccurate regarding indoor air pollution unless your furnace is in need of immediate replacement.

A modern furnace works via a heat exchanger, where the combustion-produced pollutants never mix with the indoor air being pushed through. All pollutants are expelled outside via a properly functioning chimney. This is one reason why you should have the furnace (and chimney function) inspected annually. Aging heat exchangers will show hotspots before there is a possibility of air being mixed, giving plenty of time to plan for a replacement. Of course there is a possibility of failure, which is why you should have a carbon monoxide detector.


I know I'm going to sound crazy here, but there is one more alternative. How about: Reduce, Reuse, Repair, Recycle?

I recently got a sewing machine for an unrelated project and around the same time I ordered it I had one of these cloth reusable bags rip, because I put too many heavy things in it. When I got the sewing machine, for practice I decided to see if I could fix the bag. It turned out to be surprisingly quick and easy. I didn't use any extra material besides the thread, and I believe the bag is much stronger now.


It's all about convenience, and the fact that we're trained from birth into being good little obedient consumers. Talk to your grandparents: back then they all had sewing machines and fixed their clothes and their shoes. Things were expensive and cherished. Now it's all cheap junk you have to consume as fast as possible before getting your next hit from Amazon. Now that everything is cheap and abundant, why would people bother?


Whenever the solution involves needing other people to act together at an expense (time in this case), you run into problems. Many people care only up until it takes more than words.


One of the essential items to have in your house is a sewing / repair kit. For things like bags you don't even need a sewing machine; you can fix them by hand. You don't even need to know how to sew, just stick the needle / wire through a couple of times until it's fixed.


Side note: people (men, mostly, still) who don't know how to use sewing machines are missing out on perhaps the most transformative, clever, empowering machine ever made. You could teach an entire curriculum just on the history, design, manufacturing and use of the sewing machine, and barely scratch the surface.

They are quite simply marvels. (Great Veritasium video about them too)


While most of what you speak of re Ceph is correct, I want to strongly disagree with your view of not filling up Ceph above 66%. It really depends on implementation details. If you have 10 nodes, yeah then maybe that's a good rule of thumb. But if you're running 100 or 1000 nodes, there's no reason to waste so much raw capacity.

With upmap and balancer it is very easy to run a Ceph cluster where every single node/disk is within 1-1.5% of the average raw utilization of the cluster. Yes, you need room for failures, but on a large cluster it doesn't require much.

80% is definitely achievable, 85% should be as well on larger clusters.
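
The link between balance and usable capacity can be sketched with a little arithmetic. Trouble starts when the fullest OSD reaches its limit, not when the average does, so the spread between OSDs comes straight off the top. The 0.90 per-OSD threshold below is an illustrative assumption rather than a recommendation, and the function name is hypothetical.

```python
def max_average_fill(osd_threshold: float, max_spread: float) -> float:
    """Highest average utilization that keeps the fullest OSD under the threshold.

    osd_threshold: fill fraction at which a single OSD becomes a problem.
    max_spread: how far above the cluster average the fullest OSD can sit.
    """
    if not 0.0 <= max_spread < osd_threshold <= 1.0:
        raise ValueError("expected 0 <= max_spread < osd_threshold <= 1")
    return osd_threshold - max_spread


# Well balanced: every OSD within 1.5 points of the average.
balanced = max_average_fill(0.90, 0.015)

# Poorly balanced: the fullest OSD runs 15 points above the average.
unbalanced = max_average_fill(0.90, 0.15)

print(f"balanced: {balanced:.1%}, unbalanced: {unbalanced:.1%}")
```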

Also re scale, depending on how small we're talking of course, but I'd rather have a small Ceph cluster with 5-10 tiny nodes than a single Linux server with LVM if I care about uptime. It makes scheduled maintenance much easier. Also, a disk failure on a regular server means a RAID group (or ZFS/btrfs?) rebuild. With Ceph, even at fairly modest scale you can have very fast recovery times.

Source: I've been running production workloads on Ceph at Fortune 50 companies for more than a decade, and yes I'm biased towards Ceph.


I defer to your experience and agree that it really depends on implementation details (and design). I've only worked on a couple of Ceph clusters built by someone else who left, around 1-2PB, 100-150 OSDs, <25 hosts, and not all the same disks in them. They started falling over because some OSDs filled up, and I had to quickly learn about upmap and rebalancing. I don't remember how full they were, but numbers around 75-85% were involved, so I get nervous around 75% based on my experience. We would suddenly commit 20TB of backup data, and that's a 2% swing. It was a regular pain in the neck and a stress point, and the problems of a creaking, amateurishly managed, under-invested Ceph cluster caused several outages and some data corruption. Just having some more free-space slack in it would have spared us.[1]

That whole situation is probably easier the bigger the cluster gets; any system with three "units" that has to tolerate one failing can only have 66% usable. With a hundred "units", 99% is usable. Too much free space only wastes money, while too full is a service-down disaster. For that reason I would prefer to err on the side of too much free space rather than too little.

Other than Ceph I've only worked on systems where one disk failure needs one hotspare disk to rebuild, anything else is handled by a separate backup and DR plan. With Ceph, depending on the design it might need free space to handle a host or rack failure, and that's pretty new to me and also leads me to prefer more free space rather than less. With a hundred "units" of storage grouped into 5 failure domains then only 80% is usable, again probably better with scale and experienced design.
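
The percentages above all come from the same rule of thumb, sketched here. It assumes equal-sized units and that the survivors must absorb all of the data from the failed ones. The function name is hypothetical.

```python
def max_usable_fraction(units: int, failures_to_tolerate: int = 1) -> float:
    """Fraction of raw capacity that can be filled while still leaving room
    to re-home the data of `failures_to_tolerate` failed units on the rest."""
    if units <= 0 or not 0 <= failures_to_tolerate < units:
        raise ValueError("need 0 <= failures_to_tolerate < units")
    return (units - failures_to_tolerate) / units


for label, units in [("3 units", 3), ("5 failure domains", 5), ("100 units", 100)]:
    print(f"{label}: {max_usable_fraction(units):.1%} usable")
```

Note that the granularity that matters is the failure domain, not the disk: a hundred disks grouped into 5 failure domains behave like 5 units in this model.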

If I had 10,000 nodes I'd rather have 10,100 nodes and better sleep than play "how close to full can I get this thing" while constantly on edge, waiting for a problem that takes down a 10,000-node cluster and all the things that needed such a big cluster. I'm probably taking some advice from Reddit threads talking about 3-node Ceph/Proxmox setups which say 66% and YouTube videos talking about Ceph at CERN - in those I think their use case is a bursty massive dump of particle accelerator data to ingest, followed by a quieter period of read-heavy analysis and reporting, so they need to keep enough free space for large swings. My company's use case was more backup data churn, lower peaks, less tidal, quite predictable, and we did run much fuller than 66%. We're now down below 50% used as we migrate away, and the clusters are much more stable.

[1] it didn't help that we had nobody familiar with Ceph once the builder had left, and these had been running a long time and partially upgraded through different versions, and had one-of-everything; some S3 storage, some CephFS, some RBDs with XFS to use block cloning, some N+1 pools, some Erasure Coding pools, some physical hardware and some virtual machines, some Docker containerised services but not all, multiple frontends hooked together by password based SSH, and no management will to invest or pay for support/consultants, some parts running over IPv6 and some over IPv4, none with DNS names, some front-ends with redundant multiple back end links, others with only one. A well-designed, well-planned, management-supported cluster with skilled admins can likely run with finer tolerances.


Thank you for this detailed reply!

I only want to add a small suggestion. I get that large distributed production systems will occasionally go down, but it would be great if you could look into reducing the latency of your status page.

By my count there was at least a 35 minute delay between when things broke and when the status page (https://fastmailstatus.com) was updated.

Also, I think it would have been nice to have a bit more explanation of this event than simply "database issues" [1]. Knowing that this was related to an upgrade would have made me feel a bit better between the time the status page was updated and the time the issue was resolved.

Thank you for your hard work and an excellent email service!

-A long time customer.

[1]: https://fastmailstatus.com/cme1fq7ej002dh0iu6z8pey4f

