8 December 2023

I Have Strong Feelings Towards a File System

Disclaimer: I have no idea what I'm talking about when it comes to file systems. I know the basics, and I haven't paid much attention to them before, so please assume all the details in this article are flawed. It's just a blog post by someone who discovered an 18-year-old file system fairly recently and is excited about it.

I never thought I'd develop strong feelings towards a file system. It's such a boring piece of software after all. Don't get me wrong, it's super important, but what's there to be excited about? NTFS, FAT32, ext3, ext4... it's all kind of the same thing. Sure, there are incompatibilities, performance differences, size differences, file integrity differences, and so on. But it's all just kind of boring.

Hey, look, guys! My file system has journaling! It knows where to continue in case of power loss!

Cool.

But then there's ZFS. I love ZFS!

My Experience With File Systems Previously

Here's how I have dealt with file systems in the past: I have a disk, and I want to use it to store files on it. Usually I want to partition that disk slightly, and maybe I want to enroll it in some kind of RAID.

The process used to be this:

Use some sort of Disk Utility or partitioning tool like parted to write the partition table and define the layout. Try to figure out how to set up your RAID at this point.
If you're really fancy, activate LVM, which adds a bunch of overhead so that you can have a partition layout that's a bit more flexible.
And then all that flexibility gets thrown out the window anyways because then you mkfs.ext4 on that partition and now it's fixed.
Oh crap, I want to extend my partition. So I have to re-partition my drive while not losing any data. Let's hope I don't accidentally nuke everything! Every time I have to do this, I feel like a surgeon operating on an open heart.
Good news! Your partition has been extended and your data (probably) is still there! But your file system isn't extended so you still have the same amount of storage space available. So you have to resize2fs.
????
Profit?

Notice how this process is pretty much the same for every file system. The only real difference is whether you want your files to be readable by Windows, Linux, macOS or all of the above. And if you make the wrong choice, you'll get errors when copying files because the file is too big or the file name is too weird or long or too case (in)sensitive.

That's why I didn't really care. For Windows I'd pick NTFS, for Linux ext4, for macOS APFS and for USB sticks FAT32 or exFAT, and that's that.

Every now and then someone would say "hey, ZFS/BTRFS is pretty cool! It has snapshots." And every time I'd be like "ok, cool, I guess. Why do I need that? I have Time Machine. I'm not going to bother with a new file system. ext4 works for me."

And then I needed to update my NAS that's been running fine for the last 10 years...

It’s Not Just Snapshots

It's a new way of thinking about file systems. I imagine someone at Sun Microsystems was asking the question "What if file systems weren't shit?"

And born was ZFS. It's a completely new way of thinking about managing space on a disk. Conventional file systems and volume managers approach the problem quite differently: the work is split between managing the disks and managing the files, which of course makes sense when you think about it technically. But it turns out you lose a lot of flexibility and add a lot of complexity that way.

ZFS approaches it from a user perspective instead: I have a bunch of disks, and I have a bunch of files. I want to put x files onto y disks. Figure out the rest.

And boy, did they! With ZFS, you don't need to define partitions. You don't need to think about how much space you'll need 4 years from now. Everything is super flexible and super straightforward at the same time. Here's the basic concept:

You have a pool that consists of one or multiple disks.
You have datasets that are part of that pool.

Configuring these two components is as easy as it can be. For the pools, you can use the zpool command. For example, to create a RAID1 (mirror) pool across sda and sdb, you do this:

zpool create my_pool mirror sda sdb

The commands are just as simple if you want to create a non-RAID pool, or a very complex pool with a striped mirror and a write cache and a spare disk. You can even use zpool to create a pool on just a partition of an existing disk, instead of using the whole disk.
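To give a rough idea of what those variants might look like, here is a small sketch. It only prints the commands unless you set RUN=1, because zpool create claims the disks you name. The device names sdc to sdf and the partition sda3 are placeholders I made up, and I'm assuming the "write cache" maps to a separate log device (the log keyword), which as far as I understand mainly helps synchronous writes.

```shell
#!/bin/sh
# Dry-run helper: prints the command; only executes it when RUN=1 is set.
# Executing needs root and an installed ZFS, and takes over the named disks.
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

# The three functions are alternatives, not steps to run one after another.

# Plain pool without redundancy: data is spread across both disks.
create_striped() {
    run zpool create my_pool sda sdb
}

# Striped mirror (similar to RAID10) plus a log device and a hot spare.
# sdc..sdf are placeholder device names.
create_striped_mirror() {
    run zpool create my_pool \
        mirror sda sdb \
        mirror sdc sdd \
        log sde \
        spare sdf
}

# Pool on a single partition instead of a whole disk (sda3 is a placeholder).
create_on_partition() {
    run zpool create my_pool sda3
}

create_striped
create_striped_mirror
create_on_partition
```

Each vdev type is just another keyword followed by its disks, which is why even the "complex" layout stays a single command.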

Cool Pool, Now What?

So far you're probably not that impressed. Your partitioning GUI can do that too, after all. But the magic lies in the datasets.

A dataset holds your files. It's the actual file system. By the way, your newly created pool is already available to use and acts as a dataset. It's mounted at /my_pool by default and you can just store your files there and be done with it. But usually you'll want more control over where to put your data. Think of it like the classic partitions. In the context of a NAS, you may want to create a separate dataset for different users or different file types (for example, for your photos and documents). But really, it's up to you how to organize them. And if you're dissatisfied, just change them later. You don't have to think about where that "partition" lives, and you don't have to free up space later to create a new partition. It's like virtual, flexible partitions. A bit like folders with file system features.

Let's create a new dataset for photos:

zfs create my_pool/photos

Done. Your dataset has been mounted at /my_pool/photos and it's ready to store files. Of course, you can just define a different mount point if you so choose. No /etc/fstab tinkering required. And yes, you can even nest these datasets, if you want to.
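Here is a hedged sketch of the custom mount point and nesting part. The dataset names (documents, taxes) and the paths under /srv and /mnt are made up for illustration, and the script only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry-run helper: prints the command; only executes it when RUN=1 is set
# (executing needs root and an existing pool called my_pool).
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

dataset_demo() {
    # Create a dataset with a custom mount point instead of /my_pool/documents.
    run zfs create -o mountpoint=/srv/documents my_pool/documents
    # Nest a child dataset; it inherits the parent's mount point as a prefix,
    # so it ends up at /srv/documents/taxes.
    run zfs create my_pool/documents/taxes
    # Changed your mind? Move the parent later; the child follows along.
    run zfs set mountpoint=/mnt/paperwork my_pool/documents
    # Show where everything is mounted now.
    run zfs list -r -o name,mountpoint my_pool
}

dataset_demo
```

The mount point is just a property of the dataset, which is why there is no /etc/fstab entry to keep in sync.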

Configuring Datasets

Note that without configuring anything besides the name of what you want to call your dataset (photos), you already have a working file system that you can use to store data. By default, you will have access to 100% of the available storage space of the entire pool. Now, if you create a new dataset, say, for your backups, this second dataset will share the storage space with your first dataset. You don't have to define fixed sizes, it's all dynamic. If you fill up your backups, the free space available to your photos will decrease, and vice versa.

Let's say you don't want that. Maybe you want to only allow your backups to take up at most 100 GB of space. How would you do that?

zfs set quota=100g my_pool/backups

And that's where the power of ZFS lies. It's all dynamic! And it has tonnes of features that are available for you to use, but you don't have to.
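To make the shared space and the quota a bit more concrete, here is a toy calculation in plain shell. It is a simplified model with made-up numbers, not how ZFS actually does its accounting (the real thing also has to deal with metadata, snapshots and reserved space): a dataset can write as much as the pool has free, capped by whatever is left of its quota.

```shell
#!/bin/sh
# Smaller of two integers.
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# Space (in GB) a dataset can still write, in this simplified model.
#   $1 pool size, $2 space used in the whole pool,
#   $3 space used by this dataset, $4 its quota (0 means "no quota")
avail_gb() {
    pool_free=$(( $1 - $2 ))
    if [ "$4" -eq 0 ]; then
        echo "$pool_free"
    else
        min "$pool_free" $(( $4 - $3 ))
    fi
}

# Hypothetical 1000 GB pool: photos use 300 GB, backups use 60 GB.
echo "photos:  $(avail_gb 1000 360 300 0) GB available"    # no quota set
echo "backups: $(avail_gb 1000 360 60 100) GB available"   # quota=100g
```

With these numbers photos can still grow by 640 GB, while backups are capped at 40 GB by the quota. On a real pool, zfs list shows the actual values in its AVAIL column.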

What About Those Snapshots?

I mentioned before that one key feature of ZFS is snapshots. But why should you care?

Well, you don't have to. If you don't want snapshots (right now), you don't have to worry about them. But if you do, you can start taking them at any time. Snapshots allow you to freeze the file system at a certain point in time. Taking one is instant and initially takes up (almost) no space. It's just a pointer that says "here is where the snapshot should be."

Now, if you change or remove data, the snapshot size will increase, because all changes are applied in a way that doesn't destroy the snapshot. If you delete a file, it will appear as removed just like with any other file system. But its blocks aren't freed: the snapshot still points to them, so you can restore the file at any time. The same goes for edits to files. New blocks are written to a new part of the disk, leaving the old blocks available for the snapshot.
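In practice this might look roughly like the sketch below. The snapshot name before-cleanup and the file cat.jpg are made up, and the script only prints the commands unless RUN=1 is set, since rollback and destroy are not something to run by accident.

```shell
#!/bin/sh
# Dry-run helper: prints the command; only executes it when RUN=1 is set
# (executing needs root and an existing dataset my_pool/photos).
run() { if [ "${RUN:-0}" = 1 ]; then "$@"; else printf '%s\n' "$*"; fi; }

snapshot_demo() {
    # Freeze the current state of the dataset under a name of your choice.
    run zfs snapshot my_pool/photos@before-cleanup
    # List the dataset's snapshots and how much space each one holds on to.
    run zfs list -r -t snapshot -o name,used,referenced my_pool/photos
    # Snapshots are browsable read-only under the hidden .zfs directory,
    # so a single deleted file can simply be copied back.
    run cp /my_pool/photos/.zfs/snapshot/before-cleanup/cat.jpg /my_pool/photos/
    # Or roll the whole dataset back, discarding everything changed since.
    run zfs rollback my_pool/photos@before-cleanup
    # Destroying the snapshot frees the blocks that only it still referenced.
    run zfs destroy my_pool/photos@before-cleanup
}

snapshot_demo
```

Copying a file out of .zfs/snapshot is usually the gentler option, because a rollback throws away every change made after the snapshot.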

You can configure ZFS to automatically create snapshots every x hours, days, etc., or you can create manual snapshots (e.g. before updating your system). You can also configure it to delete old snapshots after a certain amount of time. Granted, managing automated tasks is a bit more complicated than I'd like it to be, but managing the actual snapshots is still quite simple.
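As far as I can tell, the scheduling itself is usually done by something outside the zfs command, like a cron job or a dedicated snapshot tool. Here is a minimal sketch of the two pieces such a job needs: a timestamped snapshot name and a rule for which old snapshots to delete. The auto- prefix and the function names are my own invention, not a ZFS convention.

```shell
#!/bin/sh
# Name for a new snapshot, e.g. my_pool/photos@auto-2023-12-08-0300.
snapshot_name() {
    echo "$1@auto-$(date +%Y-%m-%d-%H%M)"
}

# Read snapshot names on stdin and print all but the newest $1 of them.
# Relies on the timestamp format above sorting in chronological order.
snapshots_to_prune() {
    sort | awk -v keep="$1" '
        { line[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print line[i] }'
}

# Try the pruning rule on made-up names: keep the newest two.
printf '%s\n' \
    'my_pool/photos@auto-2023-12-08-0300' \
    'my_pool/photos@auto-2023-12-06-0300' \
    'my_pool/photos@auto-2023-12-07-0300' |
    snapshots_to_prune 2

# Against a real pool (needs root and ZFS), a cron job could do something like:
#   zfs snapshot "$(snapshot_name my_pool/photos)"
#   zfs list -H -r -t snapshot -o name my_pool/photos | grep '^my_pool/photos@auto-' |
#       snapshots_to_prune 24 | while read -r snap; do zfs destroy "$snap"; done
```

The sample run prints only the snapshot from 6 December, since the two newer ones are kept.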

Snapshots == Backups?

If you can use snapshots to roll back changes, like for example accidentally deleting a file, is it a backup?

Yes, and no. Yes, it's more of a backup than RAID, because you will be able to restore data. But it's still on the same device and drive. If you don't have RAID and the disk dies, your snapshots die, too. If your laptop gets stolen, your snapshots get stolen, too.

Wouldn't it be cool to have a mechanism that automatically copies these snapshots to a different system, so we can have real backups?

What if we just take the snapshots, and push them somewhere else?

Of course, ZFS can do that, too. Just set up a ZFS pool on an external machine, and push your snapshots via SSH to that machine. And since these snapshots are made on a block level and not on a file level, it's super quick. There's no need to check each file for differences. It just needs to know which snapshots are already present and it can push the rest. It doesn't matter whether you're backing up a snapshot of your entire device with millions of small files or just a snapshot of one large file, the transfer will still be (almost) equally quick. It's really amazing.
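The pushing part boils down to piping zfs send into zfs receive on the other machine. Here is a sketch of an incremental push, where only the blocks that changed between two snapshots travel over the wire. The host backup-host, the pool backup_pool and the snapshot names monday and tuesday are placeholders, and the function only prints the pipeline unless RUN=1 is set.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) an incremental replication of one dataset.
#   $1 dataset, $2 previous snapshot, $3 new snapshot,
#   $4 SSH host, $5 target dataset on that host
# The very first transfer has to be a full one, without "-i <previous>".
push_snapshot() {
    if [ "${RUN:-0}" = 1 ]; then
        zfs send -i "$1@$2" "$1@$3" | ssh "$4" zfs receive "$5"
    else
        echo "zfs send -i $1@$2 $1@$3 | ssh $4 zfs receive $5"
    fi
}

# backup-host and backup_pool are placeholder names.
push_snapshot my_pool/photos monday tuesday backup-host backup_pool/photos
```

The receiving side needs to already have the previous snapshot (monday here), otherwise there is nothing for the increment to build on.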

More Features and Data Integrity

There are a lot more cool features that I could be talking about, like deduplication, tiered storage handling, encryption, compression or the fact that you can use datasets as a virtual disk for your VMs (zvol). But one thing that I would like to mention is the fact that ZFS has excellent data integrity as well, which is fairly important for a file system.

While I won't pretend to understand half of how this works exactly, ZFS has a feature that allows the detection of data inconsistencies and, more importantly, the self-healing capabilities that come with it. While this may sound like something abstract that you probably don't have to worry about too much, it actually saved my data from being corrupted. I had a mirrored pool with a bad hardware controller that managed to corrupt data from both disks. Luckily, the corruption only occurred on one disk at a time, which gave ZFS the chance to detect this corruption, correct it, and notify me of it. Therefore, even though 100% of my disks were outputting (some) corrupted data, I did not lose any data in this case.

Granted, there was a bit of luck involved in the way the controller failed, but I'm fairly certain that this luck wouldn't have been sufficient with a more conventional RAID1 setup. The file system may not have detected this corruption at all, or it may have marked the first drive as failed, resulting in me replicating bad data from the second drive after replacing the "bad" drive, before realising that it was actually the controller that had failed.
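Here is how I picture the self-healing, as a toy simulation with two ordinary files standing in for the two disks of a mirror. This is only my mental model, not how ZFS implements it: cksum stands in for the real block checksums, and the function and file names are made up. The key idea is that the checksum is stored separately from the data, so a bad copy can be told apart from a good one.

```shell
#!/bin/sh
# Checksum of a file's contents (cksum is a stand-in for ZFS's checksums).
checksum() { cksum < "$1" | cut -d ' ' -f 1; }

# Read a "block" from a two-way mirror: print the first copy whose checksum
# matches the stored one, and overwrite the other copy if it has gone bad.
#   $1 expected checksum, $2 copy on disk A, $3 copy on disk B
read_and_heal() {
    expected=$1 copy_a=$2 copy_b=$3
    if [ "$(checksum "$copy_a")" = "$expected" ]; then
        good=$copy_a; other=$copy_b
    elif [ "$(checksum "$copy_b")" = "$expected" ]; then
        good=$copy_b; other=$copy_a
    else
        echo "unrecoverable: both copies are corrupt" >&2
        return 1
    fi
    if [ "$(checksum "$other")" != "$expected" ]; then
        cp "$good" "$other"
        echo "repaired $other" >&2
    fi
    cat "$good"
}

dir=$(mktemp -d)
printf 'holiday photo\n' > "$dir/disk_a"
cp "$dir/disk_a" "$dir/disk_b"
stored=$(checksum "$dir/disk_a")          # kept apart from the data itself

printf 'h0liday phot0\n' > "$dir/disk_a"  # silent corruption on disk A
read_and_heal "$stored" "$dir/disk_a" "$dir/disk_b"
```

The read returns the intact data from disk B and repairs disk A along the way. If disk B gets corrupted later, the same thing works in the other direction, which matches what happened to me. A plain mirror without checksums would only see two copies that disagree. On a real pool, zpool scrub my_pool checks everything in one go and zpool status reports what was found.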

TL;DR

ZFS is pretty cool and I should have looked at it years ago. So, if you haven't, check it out! There's no harm in trying (as long as you don't nuke your production data!), and I'm pretty sure you'll enjoy it, too. Plus, Linux support of ZFS is picking up rapidly and I'd say it's matured enough to trust it with your production data.

(Yes, I know of the recent bug, but if it has gone unnoticed for 10+ years, I'm not too worried about it. I'm not that lucky ;) )