Hacker News | dakami's comments

Funny you should ask. I just looked into this:

https://twitter.com/dakami/status/884715382061252608

Yeah. Ethereum blew up and all that compute is going towards GPU Mining.


My understanding (and please correct me if I'm wrong!) is that it's pretty easy to burn out FPGAs, electrically, in a way that it's not particularly trivial to do with other sorts of circuitry. So one downside of exposing your bitstreams is people are going to blow things up and demand new chips.


Rust isn't a sandbox. The whole point of a sandbox is it survives incorrect software at runtime. Rust is compile time magic. Sure, cool, different thing.


He's not saying that Rust is a sandbox, but rather that Rust should be used to write a sandbox.


He never said it was. That's why he's pushing nacl / qubes / xen.


was that...like...a hundred megs for a web page?


177 modules for a Hello World page? Is this satire?


Most of the modules involved are development-side resources that don't end up included in the build. Complaining about the number of modules is like installing Visual Studio, noticing that it takes several gigs on disk, and saying, "Gigabytes, for a hello world program?"


Sigh, that's an absurd way to measure. A "Hello world" is just that: a demo. No-one is:

1) actually making Hello, World pages that will go anywhere near a user

2) Using React in Hello, World pages (which they don't make)

3) Using Reactpack to build the React-based Hello World pages they aren't making


It's not absurd because:

1) Whatever you're making, you're starting out with 177 modules minimum if you use React + Reactpack.

2) See 1.


But if you're using React you are creating a not-insignificant webapp and the initial install time of 177 modules is totally irrelevant beyond that first install. It doesn't have any reflection on the size of the client JS file (webpack loaders, for example, only live on the dev machine).

Moreover, you can presumably globally install this, so it can cover every React-based project you need to build.

What is an acceptable number of modules?


Surely the important bit is the size of the code that's served to the end user, not the size of the development environment?

My Visual Studio install is easily at least 4GB, yet this doesn't matter to the end user as the packages produced are about 6MB total


That's Node, baby. Deal with it.


It's just an example to show what type of code `reactpack` builds, ES6 and JSX with style loading in this case. If you want a more minimal React-like build you can switch out React with lightweight alternatives like Preact or React-lite.
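
For instance, assuming a webpack-based setup (package names as they were at the time; `preact-compat` was the React-compatibility shim), the swap can be roughly:

```shell
# Replace React with Preact (sketch; verify package names for your setup)
npm uninstall --save react react-dom
npm install --save preact preact-compat

# Then alias React to the compat layer in webpack.config.js:
#   resolve: { alias: { 'react': 'preact-compat',
#                       'react-dom': 'preact-compat' } }
```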


For reference, I built a standalone datepicker module with preact + redux at under 20kb (min + gz)... worked out really well. By comparison, I went with regular React recently, and the payload quickly got over 200kb, though I needed a couple of relatively heavy libraries for the app in question.

Another point worth mentioning is using react-icons individually (svg based), same for component libraries where you are only using a few components... tends to work out much lighter. Also, if you do use bootstrap as a base css library, work from the source instead of monkey patching the .css output.


We don't build houses (he said from San Francisco)


-D flag in OpenSSH.
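
That is, OpenSSH's built-in SOCKS proxy (host and port below are placeholders):

```shell
# Open a dynamic (SOCKS) forward on local port 1080 through an SSH host;
# -N means "no remote command", just forward traffic
ssh -D 1080 -N user@example.com

# Then point the browser's SOCKS5 proxy setting at localhost:1080
```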


No apparent backend integration.


Does Webstorm actually integrate with front end development? I'd like to whip together simple projects that aren't actually hideous, and literally every cloud IDE I'm finding just doesn't have a way to go from HTML5 code to WYSIWYG or even structured output...


Are your Win10 devices domain joined? Sounds like malware.


So I just started experimenting with ZFS, because it seemed required for container snapshots.

Then I found out it fragments badly, and nobody can figure out how to write a defragmenter. So, uh, keep the FS below 60-80% full apparently.

Yeah.
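
For what it's worth, OpenZFS at least lets you watch how bad it is getting per pool (pool name is a placeholder; `fragmentation` here is free-space fragmentation, not file fragmentation):

```shell
# CAP is how full the pool is; FRAG is free-space fragmentation
zpool list -o name,size,allocated,free,capacity,fragmentation,health tank
```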


"Then I found out it fragments badly, and nobody can figure out how to write a defragmenter. So, uh, keep the FS below 60-80% full apparently."

Confirmed. Not FUD.

Our experience[1] is that things go to hell around 90%, and even if you bring it back below 90% there is a permanent performance degradation to the pool. To be safe, we try to keep things below 80%. That's probably a bit conservative, though.

ZFS needs defrag. It is not reasonable to give up 3 drives worth of capacity for the parity (raidz3, for instance) and then on top of that set aside another 10-20% as the "angels share".

[1] rsync.net


ZFS has defragmentation built into the very design of it!

It doesn't fragment; it actually turns all random writes into sequential ones, provided there is enough space, because ZFS uses copy-on-write atomic writes:

http://constantin.glez.de/blog/2010/04/ten-ways-easily-impro...

http://everycity.co.uk/alasdair/2010/07/zfs-runs-really-slow...

Now, for those of us in the Solaris / illumos / SmartOS world, this is well known and well understood. We either keep 20% free in the pool, or we turn off the defrag search algorithm. But with the Linux crowd missing out on 11 years of experience, I see there will be lots of misunderstanding of what is actually going on, and consequently lots of misinformation, which is unfortunate.


Experienced* SunOS admins are aware of that and can still end up -- accidentally, I think -- with ZFS filesystems with unacceptable performance, in a state that Oracle apparently didn't understand. There was a ticket open for months, but I don't know whether it ever got resolved.

* I'm not sure how experienced, but they have Sun hardware running that's older than ZFS.


The performance degradation is likely from full metaslabs and maybe from gang blocks, although ZFS does a fair job at preventing gang blocks by using best-fit behavior to minimize the external fragmentation that necessitates them. The magic threshold for best-fit behavior is 96% full at the metaslab level. This tends to be where slowdowns occur. On spinning disks, being near full also means that basically all of the outermost tracks have been used, so you are limited to the innermost tracks, which can halve bandwidth.

Anyway, it would be nice if you could provide actual numbers and metaslab statistics from zdb. The worst-case fragmentation that has been reported, and that I can confirm from data provided to me, is a factor-of-2 reduction in sequential read bandwidth on a pool of spinning disks after it had reached ~90% capacity. All files on it had been created out of sequence by BitTorrent.

A factor of 2 might be horrible to some people. I can certainly imagine a filesystem performing many times worse though. I would be interested to hear from someone who managed to do worse than that in a manner that cannot be prescribed to the best fit allocator protecting the pool from gang block formation.


I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux, but:

1) Having a ZIL helps with this, and in general.

2) ZFS changes strategy depending on how full it is: it spends more time avoiding further fragmentation rather than grabbing the first empty slot. This hit would go away if you get the free space back up.

3) Finally, there is a way[1] to have ZFS keep all the info it needs in RAM, to greatly alleviate the times when it starts hunting harder to prevent more fragmentation. It looks like the RAM requirements are 32GB/1PB... so not too bad IMO.

[1] https://blogs.oracle.com/bonwick/entry/space_maps


"I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux"

Look, I'll admit that we haven't done a lot of scientific comparisons between healthy pools and presumed-wrecked-but-back-below-80-percent pools ... but I know what I saw.

I think if you break the 90% barrier and either: a) get back below it quickly, or b) don't do much on the filesystem while it's above 90%, you'll probably be just fine once you get back below 90%. However, if you've got a busy, busy, churning filesystem, and you grow above 90% and you keep on churning it while above 90%, your performance problems will continue once you go back below, presuming the workload is constant.

Which makes sense ... and, anecdotally, is the same behavior we saw with UFS2 when we tunefs'd minfree down to 0% and ran on that for a while ... freeing up space and setting minfree back to 5-6% didn't make things go back to normal ...

I am receptive to the idea that a ZIL solves this. I don't know if it does or not.


The magic threshold is 96% per meta slab. LBA weighting (which can be disabled with a kernel module parameter or its equivalent on your platform) causes metaslabs toward the front of the disk to hit this earlier. LBA weighting is great for getting maximum bandwidth out of spinning disks. It is not so great once the pool is near full. I wrote a patch that is in ZoL that disables it on solid state disk based vdevs by default where it has no benefit.

That being said, since rsync.net makes heavy use of snapshots, the snapshots would naturally keep the allocations in metaslabs toward the front of the disks pinned. That would make it a pain to get the metaslabs back below the 96% threshold. If you are okay with diminished bandwidth when the pool is empty (assuming spinning disks are used), turn off LBA weighting and the problem should become more manageable.

That said, getting data on the metaslabs from `zdb -mmm tank` would be helpful in diagnosing this.
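
Concretely, on ZFS on Linux that might look like the following (pool name is a placeholder; the module parameter name is the ZoL one and may differ on other platforms):

```shell
# Dump per-metaslab allocation and fragmentation statistics for diagnosis
zdb -mmm tank | less

# Disable LBA weighting at runtime (assumed ZoL parameter name;
# make it persistent via a file in /etc/modprobe.d if it helps)
echo 0 > /sys/module/zfs/parameters/metaslab_lba_weighting_enabled
```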


You really shouldn't run non-CoW file systems above 90% either, including UFS and ext.


Agreed. I don't think anyone is arguing that you should do it.

What I believe, and what I think others have also concluded, is that it shouldn't be fatal. That is, when the dust has settled and you trim down usage and have a decent maintenance outage, you should be able to defrag the filesystem and get back to normal.

That's not possible with ZFS because there is no defrag utility ... and I have had it explained to me in other HN threads (although not convincingly) that it might not be possible to build a proper defrag utility.


My understanding is that the way to defrag ZFS is to do a send and receive. Combined with incremental snapshotting, this should actually be realistic with almost no downtime for most environments.

Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem.
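
A minimal sketch of that rewrite within one pool (dataset names are placeholders; the rename swap is the only step needing an outage):

```shell
# Rewrite a dataset sequentially by sending it to a fresh dataset
zfs snapshot tank/data@defrag
zfs send tank/data@defrag | zfs recv tank/data-new

# Catch up on changes with an incremental, then swap names briefly
zfs snapshot tank/data@defrag2
zfs send -i @defrag tank/data@defrag2 | zfs recv tank/data-new
zfs rename tank/data tank/data-old && zfs rename tank/data-new tank/data
```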


"Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem."

Yes, and that is why I did not mention recreating the pool as a solution. If your pool is big enough or expensive enough, that's still "fatal".


You ought to define what is fatal here. The worst that I have seen reported at 90% full is a factor of 2 on sequential reads off mechanical disks, which is acceptable to most people. Around that point, sequential writes should also suffer similarly from writes going to the innermost tracks.


(1) I'm not proposing recreating the pool - I'm proposing an approach to incrementally fixing the pool in an entirely online manner.

(2) If your pool is big enough/expensive enough, surely you've also budgeted for backups.


(1) Regardless of what you call it, it means having enough zpool somewhere else to zfs send the entire (90% full) affected zpool off to ... that might be impossible or prohibitively expensive depending on the size of the zpool.

(2) This has nothing to do with backups or data security in any way - it's about data availability (given a specific performance requirement).

You're not going to restore your backups to an unusable pool - you're going to build or buy a new pool and that's not something people expect to have to do just because they hit 90% and churned on it for a while.


You can send/receive to the same zpool and still defrag. With careful thought, this can be done incrementally and with very minimal availability implications.

I agree it's not ideal to have filesystems do this, but it also simplifies a lot of engineering. And I think direct user exposure to a filesystem with a POSIX-like interface is a paradigm mostly on the way out anyway, meaning it's increasingly feasible to design systems to not exceed a safe utilization threshold.


This does work.


On UNIX, there are two defragmentation utilities:

`tar` and `zfs send | zfs recv`.


What I think would help any COW file system is to delay snapshot (and clone) deletion longer, and to delete in groups, which would result in larger contiguous regions being freed. When one container is deleted, a small amount of space is freed and may then be used for writes, quickly filling up and thus increasing localized fragmentation -- or, more problematically for spinning drives, increasing seek times by causing recent writes to be scattered farther apart. To reduce read and write seeks, it's better to have larger free areas for COW to write to sequentially.

So it'd be nice if there were something like a "remove/hide" feature for containers, separate from delete or clean up, with a command that makes it easier to select many containers for deletion. At least on Btrfs this should be quite fast, and the background cleaner process that does the actual work of updating the ref count and freeing extents should have a priority such that it doesn't overly negatively impact other processes.

Some of this behavior may change on Btrfs as the free space tracking has been recently rewritten. Right now the default is the original space cache implementation, while the new free space b-tree implementation is a mount time option intended only for testing.


Using dedicated ZIL significantly reduces fragmentation:

http://www.racktopsystems.com/dedicated-zfs-intent-log-aka-s...

Anyone using ZFS in a serious capacity would have both dedicated ARC and ZIL.
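
Adding both to an existing pool is a one-liner each (device names are placeholders; the log device is worth mirroring since it holds not-yet-committed synchronous writes, while losing a cache device is harmless):

```shell
# Dedicated log (SLOG) vdev, mirrored for safety
zpool add tank log mirror nvme0n1 nvme1n1

# L2ARC cache device
zpool add tank cache sdb
```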


ZIL is only used on synchronous IO. Moving it to a dedicated SLOG device would have no impact on non-synchronous IO. A SLOG device does help on synchronous IO though.

That said, all file systems degrade in performance as they fill. I do not think there is anything notable about how ZFS degrades. The most that I have heard happen is a factor of 2 sequential read performance decrease on a system where all files were written by bit torrent and the pool had reached 90% full. That used mechanical disks. A factor of 2 in a nightmare scenario is not that terrible.


A log vdev is a log vdev (or a SLOG). ZIL is a badly overloaded term.

Ignoring logbias=throughput, when you have a slog you save on writing intents for small synchronous writes into the ordinary vdevs in the pool. If you do a lot of little synchronous writes, you can save a lot of IOPS writing their intents to the log vdev instead of the other vdevs. Log vdevs are write-only except at import (and at the end phases of scrubs and exports).

Here's the killer thing on an IOPS-constrained pool not dominated by large numbers of small synchronous writes: the reads get in the way of writes. ZFS is so good at aggregating writes that unless you are doing lots of small synchronous random writes, the write IOPS tend to vanish.

Reads are dealt with very well as well, especially if they are either prefetchable or cacheable. Random small reads are what kill ZFS performance.

Unfortunately, systems dominated by lots of rsync or git or other walks of filesystems tend to produce large numbers of essentially random small reads (in particular, for all the ZFS metadata at various layers, to reach the "metadata" one thinks of at the POSIX layer). This is readily seen with Brendan Gregg's various dtrace tools for zfs.

The answer is, firstly, an ARC that is allowed to grow large, and secondly high-IOPS cache vdevs (L2ARC). l2 hit rates tend to be low compared to ARC hits, but every l2 hit is approximately one less seek on the regular vdevs, and seeks are zfs's true performance killers.

Persistent L2ARC is amazing, but has been languishing at https://reviews.csiden.org/r/267/

It has several virtues that are quickly obvious in production. Firstly, you get bursts of l2arc hits near import time, and if you have frequently traversed zfs metadata (which is likely if you have containers of some sort running on the pool shortly after import) the performance improvement is obvious. Secondly, you get better data-safety; l2arc corruption, although rare in the real world, can really ruin your day, and the checksumming in persistent l2arc is much more sound. Thirdly, it can take a very long time for large l2arcs to become hot, which makes system downtime (or pool import/export) more traumatic than with non-persistent l2arc; rebuilds of full ~128GiB l2arc vdevs take a couple of seconds or so on all realistic devices. Even USB3 thumb drives (e.g. Patriot Supersonic or HyperX DataTraveler, both of which I've used on busy pools) are fast and give an IOPS uptick early on after a reboot or import, and of course you can have several of those on a pool; "real" SSDs give greater IOPS still. Fourthly, the persistent l2arc being available at import time means that early writes are not stuck waiting for zfs metadata to be read in from the ordinary vdevs; that data again is mostly randomly placed LBA-wise, and small, so there will be many seeks compared to the amount of data needed. Persistent l2arc is a huge win here, especially if for some reason you insist on having datasets or zvols that require DDT lookups (small synchronous high-priority reads if not in ARC or L2ARC!) at write time.

Maybe you could consider integrating it into ZoL since you guys have been busy exploring new features lately.

Finally, if you are doing bittorrent or some other system which produces temp files that are scattered somewhat randomly, there are two things you can do which will help: firstly, recordsize=1M (really; it's great for reducing write IOPS and subsequent read IOPS, and reduces pressure on the metadata in ARC), and secondly, particularly if your receives take a long time (i.e., many txgs), tell your bittorrent client to move the file to a different dataset when the file has been fully received and checked -- that will almost certainly coalesce scattered records.
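
Those two tweaks, as commands (dataset names are hypothetical):

```shell
# Large records cut write and read IOPS and metadata pressure for big files
zfs set recordsize=1M tank/torrents

# A separate dataset for completed downloads; telling the client to move
# finished files here rewrites their records contiguously
zfs create -o recordsize=1M tank/seeding
```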


The term ZIL is not overloaded. Unfortunately, users tend to misuse it because the ZIL's existence is hard to discover until it is moved into a SLOG device.

As for persistent L2ARC, it was developed for Illumos and will be ported after Illumos adopts a final version of it.


I'm using a somewhat older version of ZFS, but I tried having an SLOG (a dedicated ZIL disk) and it went essentially unused, so instead I moved the disk over to a second L2ARC, which helped a lot, as it doubled the throughput.

Further research showed that the ZIL is only needed for synchronous writes, which my workload didn't have any of.


When I looked at ZoL, the ARC was separate from the Linux page cache, so you get double buffering.


Only with mmap'ed files.


Why can't ZoL just not cache into ARC when mmaping then?


There is no reason why the driver cannot be patched to mmap into ARC. There are just many higher-priority things to do at the moment. In terms of performance, the value of eliminating double caching of mmap'ed data is rather small compared to other things in development. Later this year, ZoL will replace kernel virtual memory backed SLAB buffers with lists of pages (the ABD patches). That will improve performance under memory pressure by making memory reclaim faster and more effective versus the current code, which will excessively evict due to SLAB fragmentation. It should also bypass the crippled kernel virtual memory allocator on 32-bit Linux that prevents ZoL from operating reliably there. Additionally, workloads that cause the kernel to frequently count all of the kernel virtual memory allocations would improve tremendously.

Mmap'ing into ARC would probably come after that as it would make mapping easier.


ARC yes, ZIL no.


In a thread that is about the perils of ZFS fragmentation, you are replying to a link saying that a ZIL seriously reduces the risk of fragmentation, and saying that someone worried about fragmentation does not need to use a ZIL.

Why? If there's a legitimate reason, please expand.


I think he meant that they might not have one.

It's been a while since I looked at using ZFS for anything meaningful, but at the time (~6 years ago), while losing L2ARC was no big deal, losing dedicated ZIL was catastrophic. I think that's still true today.

So you need at least two ZIL devices in a mirror. On top of that, you really need something faster and lower latency for your ZIL vs. the ARC or main pool; people were trying to use SSDs but most commonly-available drives at the time would either degrade or fail in a hurry under load. So the options were RAM-based, e.g. STEC ZeusRAM on the high end, or some sort of PCI-X/PCIe RAM device. The former was not easy or cheap to acquire for testing stuff, and the latter made failover configs impossible.

I think that ZIL is also not soaking up all writes, just most writes meeting a certain criteria. Some just stream through to the pool. So I was always thinking of it as a protection device that also converted random writes to sequential. Some people don't think they need that.

I remember the fragmentation issue being a problem at the time, but also thinking it was probably going to get solved soon because there was so much interest and a whole company behind it. Then Oracle happened. My guess is that if it were still Sun and all the key people were still there, this would be a solved problem right now. As it is, Oracle probably wants you to buy all the extra storage anyway, and would love to offer professional services to get you out of the fragmentation bind you're in.


A lot has changed. Well - one thing actually: you no longer lose your ZFS pool if your dedicated ZIL log (called a SLOG) dies.

Here is some info on ZIL vs SLOG: http://www.freenas.org/blog/zfs-zil-and-slog-demystified/


Your information is out of date. Losing a SLOG device while the system is running is fine. As far as I know, it has always been fine (unless someone goofed on the initial implementation long before I became involved). All data in the ZIL is kept in memory, regardless of whether it is written to the main pool or to a SLOG device. The data is written to the main pool in a permanent fashion with the transaction group commit. If a SLOG device dies, that write-out still happens and the pool harmlessly stops using it. If the SLOG device dies on an exported pool, you need to set the zil_replay_disable kernel module parameter to allow the pool to be imported. The same might be true if you reboot (although I doubt it, but need to check).

You can test these things for yourself.
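
For example, with file-backed vdevs on a scratch machine (paths hypothetical; don't do this on a pool you care about):

```shell
# Build a throwaway pool from files, give it a SLOG, then yank the SLOG
truncate -s 1G /tmp/vdev0 /tmp/slog0
zpool create testpool /tmp/vdev0
zpool add testpool log /tmp/slog0

# Pool keeps running; the ZIL falls back to the main vdev
zpool remove testpool /tmp/slog0
zpool destroy testpool
```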


> Anyone using ZFS in a serious capacity would have both dedicated ARC and ZIL.

I contend that most people using ZFS in a serious capacity do not have a dedicated ZIL.


So to understand why this is, you have to appreciate the goals behind write-anywhere file layout (aka WAFL) file systems. [1]

One of the goals of such systems is that the copy of the file system on disk is always consistent: turn power off at any point and you can come right back up with a valid file system. This is accomplished by writing only to blocks on the free list. You construct updated inodes, from the changed file all the way up to the root inode, out of new blocks, and then to "step" forward you write a new root block. This is really neat, and it means that when you've done that step, you still have the old inodes and data blocks around; they just aren't linked, but you can link them to another "holder" inode attached to the name ".snapshot" and it will show you the file system just before the change. Write the old root block back into the real root block and "poof!" you have reverted the file system to the previous snapshot.

Ok, so that is pretty sweet and really awesome in a lot of ways, but it has a couple of problems. The first, as noted, is that it pretty much guarantees fragmentation, as it's always reaching for free blocks and they can be anywhere. On NetApp boxes of old that wasn't too big a deal, because everything was done per RAID stripe: you were fragmented, but you were also reading/writing full stripes in RAID, so you had the bandwidth you needed and fragmentation was absorbed by the efficiencies of full-stripe reads/writes. But the second issue arises when you start getting close to full: managing the free block list gets harder and harder. You are constantly under low-block pressure, so you are constantly trying to reclaim old blocks (on unused or expired snapshots), and that leads to a big drop in performance. The math is that you can't change more data between snapshot steps than the amount of space you have free. That is why NetApp filers would get cranky in build environments where automated builds would delete scads of intermediate files, only to rebuild and then relink them: a big percentage change in the overall storage.

On the positive side, storage is pretty darn cheap these days, so swapping in 3TB drives instead of 2TB drives means you could use all the storage you "planned" to use and keep the drives at 66% occupancy. Hard on users, though, who will yell at you: "It says it has 10TB of storage and is only using 6TB, but you won't expand my quota?" At such times it would be useful for the tools to lie, but that doesn't happen.

[1] Disclosure 1, I worked for 5 years at NetApp with systems that worked this way. Disclosure 2, an intern with NetApp (we'll call him Matt) was very impressed with this and went on to work at Sun for Jeff and similar solutions appeared in ZFS.


"One of the goals of such systems is that copy of the file system on disk is always consistent."

Goal yes, implementation no. WAFL does in fact have consistency problems, and filers do ship with a consistency checker called "wack"; if you ever need that tool, you'll probably have better luck throwing the filer in the trash and restoring from backups than waiting a month for it to complete.


Why not 'defrag' the free list during low io so this issue is somewhat mitigated?


At least on ZFS, the whole reason "defrag" is impractical is that a bunch of places in the FS structure assume the logical address of a block is immutable for the lifetime of the block, which makes a number of properties really easy and inexpensive, but also means that your life is suffering if you want to try to modify that particular constraint.

If you'd like to see some information on a feature that's been added while working around that particular constraint (or, rather, mitigating the impact of it), check out [1].

[1] - http://open-zfs.org/w/images/b/b4/Device_Removal-Alex_Reece_...


Defragmenting a merkle tree requires block pointer rewrite (BPR), which temporarily breaks the structure intended to keep data safe. The only code known to have achieved it performed poorly and is behind closed doors at Oracle.

The benefits in terms of defragmentation are also limited because ZFS does a fair job of resisting fragmentation related performance penalties. The most that I would expect to see on a pool where poor performance is not caused by the best fit allocator would be a factor of two on sequential reads.


As it says in that slide deck's first slide (after the title slide), second bullet, this particular device removal technique is to deal with an "oops" where one accidentally adds a storage vdev to an existing pool.

The zpool command-line utility tries hard to help you not shoot yourself in the foot, but "zpool add -f pool diskname" sometimes happens when "zpool add -f pool cache diskname" was meant. Everyone's done it once. Think of a system melting down because the l2arc has died: you're trying to replace it in a hurry, and you fat-finger the attempt to get rid of the "-n" and end up getting rid of "log" instead.

Without this device removal, that essentially dooms your pool -- there is no way to back out, and the best you can do is throw hardware at the pool (attach another device fast to mirror the single-device vdev, then try to grow the vdev to something temporarily useful, where "temporarily" almost always means "as long as it takes to get everything properly backed up", with the goal being the destruction and re-creation of the pool, plus restoral from backups).

With this device removal, you do not have to destroy your pool; you have simply leaked a small amount of space (possibly permanently) and will carry a seek penalty on some blocks (possibly permanently, but that's rarer) that get written to that vdev before the replacement.

As noted further in the slide deck (and in Alex's blog entries), this only works for single device vdevs -- you cannot remove anything else, like a raidz vdev, and you have to detach devices from mirror vdevs before removal.

Also, note the overheads: although you can remove a single-device vdev with a large amount of data on it, doing so is a wrecking ball to resources, particularly memory. You won't want to do something like:

Before:

  mirror-0
    disk0  2tb-used  3tb-disk-size
    disk1  2tb-used  3tb-disk-size
  mirror-1
    disk2  2tb-used  3tb-disk-size
    disk3  2tb-used  3tb-disk-size

do an expand dance, so you have

  mirror-0
    disk0  2tb-used  6tb-disk-size
    disk1  2tb-used  6tb-disk-size
  mirror-1
    disk2  2tb-used  3tb-disk-size
    disk3  2tb-used  3tb-disk-size

then detach disk3, then device-removal remove disk2, except in extremely special circumstances, and where you are well aware of the time it will take, the danger to the unsafe data in the pool during the removal (i.e., everything in former mirror-1), that your pool will be trashed beyond hope in the presence of crashes or errors during the removal, and that you will have a permanent expensive overhead in the pool after the removal is done.

It would almost certainly be much faster and vastly safer to make a new pool with the 6tb disks and zfs send data from the old one to the new one.


I think we're basically agreeing loudly over everything except the example being a demonstration of mitigating the impact of BPs being immutable while adding a feature that requires that statement be less than true - and I agree, the permanent overhead of a mini-DDT is a non-starter for anything other than the example case of "oops I added a device, time to evac it before $TONS_OF_DATA gets landed".

Certainly, it would be much less exciting to send|recv from poolA to poolB, and require no code changes and no GB per TB of data indirection overhead.

But this was intended as an example of how many caveats and problems are involved in even a "simple" feature involving shuffling data on-disk, and thus, why "defrag" is a horrendously hard problem in this environment.


On my SSDs, I can go to 96% full without issue using ZoL. ZFSOnLinux is patched to disable ZFS' LBA weighting on solid state storage though. Non-solid state storage tends to reach 96% in metaslabs early due to LBA weighting, although I would be fine with filling a pool to 90% with the recent code. Going much higher than that is probably not a good idea though.

I have yet to see evidence that 60%-80% causes issues unless the system is so overloaded that having performance drop a small amount is noticeable. On spinning disks, such a thing is only natural because there is not much space left in the outer platters.

That said, older versions years ago would enter best fit behavior at 80%, which is where the 80% talk originated.


Have you tried btrfs? It also has support for super-cheap copy-on-write operations which should make container images and the like a snap.

Not sure if your container tool supports btrfs snapshots of course, but it's conceptually simple, right?
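
The cheap copy-on-write primitive is exposed directly (file and subvolume paths below are placeholders):

```shell
# O(1) copy that shares extents until either file is modified (Btrfs/reflink)
cp --reflink=always base-image.raw clone.raw

# Snapshot a subvolume, e.g. as the basis for a container instance
btrfs subvolume snapshot /var/lib/machines/base /var/lib/machines/instance1
```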


I think the main issue here is that Btrfs is still developing. Its kernel doc file still says it's for benchmarking and review. [1] CoreOS devs decided to switch from Btrfs to overlay(fs) about 17 months ago. That's a long, long time in Btrfs development "years", given how much development happens on Btrfs. But I can't say whether CoreOS would have decided differently had today's Btrfs been what they were using in 2014.

RH/Fedora are very much centered on dm/LVM thin-provisioned snapshots with XFS for backing their containers. I think what you're seeing is distros doing something different with their container-backing approach in order to differentiate from other distros. Maybe it's a stab in the dark or a spaghetti-on-the-wall approach, but all of these storage backends are going to mature a lot in the interim, so ultimately it'll be good for everyone.

[1] https://git.kernel.org/cgit/linux/kernel/git/stable/linux-st...


Btrfs has been in beta for how many years now?

ZFS is protecting data in enterprise production environments since 2006 (Solaris 10 update 2).


Btrfs is not beta, see the discussion re: maturity below.

ZFS is an excellent filesystem, btrfs is an excellent filesystem. There is room for both excellent options for users.


For development containers, systemd-nspawn has had support for btrfs snapshots since 2014 or 2015. Simple to get going if you're on a Linux box with systemd, no other daemons or tools required.
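
For instance (option names as in systemd-nspawn of that era; directory paths are hypothetical):

```shell
# Boot a throwaway container: --ephemeral takes a btrfs snapshot of the
# tree and discards it on exit
systemd-nspawn --ephemeral --directory=/var/lib/machines/base --boot

# Or create a new tree as a btrfs snapshot of a template
systemd-nspawn --template=/var/lib/machines/base \
               --directory=/var/lib/machines/new
```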


Isn't this issue inherent to all COW systems (ZFS, WAFL, btrfs)?


have you actually benchmarked it?

anything with copy-on-write is going to fragment.

are you still running on spinning rust?

I've not seen any real performance hits until 90% full, but then any file system with large images suffers at that point.


We got bitten by this back in the Solaris days, in 2009, on a TV broadcasting production box with quite stringent uptime requirements: what happens is the defragmenter gets itself tied in knots and starts thrashing, and the symptom is 50% system CPU with no apparent cause. We got a Sun kernel engineer on call and all. SPOILER: it did in fact require a reboot to unfuck the system. After that, we just kept part of the disk in question empty, wasting the space.


A reboot implies a bug that had to be fixed. That can be assumed to be fixed everywhere by now.


... no, it doesn't imply anything of the sort. As far as I know, ZFS still has this issue - it can get itself tied in knots, and only a reboot will stop this from happening.


We used ZFS in production for a year and it was the worst decision we'd ever made, precisely for this reason.

Also, removing files uses up an insane amount of CPU and can block all FS operations if you get into the 80%+ full situation.

I'll also note that the interplay between the ARC and Linux MM is... "interesting".


