While ZFS has a well-earned reputation for data integrity and reliability, ZFS native encryption has some incredibly sharp edges that will cut you if you don't know where to be careful. Unfortunately, I learned this the hard way, standing in a pool of my own blood and tears after thoroughly lacerating myself. I very nearly lost 8.5 TiB of data permanently after what should've been a series of simple, routine ZFS operations left me with an undecryptable dataset. Time has healed the wound enough that I am no longer filled with anguish just thinking about it, so I will now share my experience in the hope that you may learn from my mistakes. Together, we'll go over the unfortunate series of events that led to this happening and how it could've been avoided, learn how ZFS actually works under the hood, use our newfound knowledge to debug and reproduce the issue at hand, and finally compile a modified version of ZFS to repair the corrupted state and rescue our precious data. This is the postmortem of that terrible, horrible, no good, very bad week…
Part 1: An unfortunate series of events
The status quo
In the beginning, there were two ZFS pools: old and new (names changed for clarity). Each pool was hosted on an instance of TrueNAS CORE 13.0-U5.1 located at two different sites about an hour's drive apart with poor Internet connectivity between them. For this reason, a third pool sneakernet was periodically moved between the two sites and used to exchange snapshots of old and new datasets for backup purposes. ZFS dataset snapshots would be indirectly relayed from old to new (and vice versa) using sneakernet as an intermediate ZFS send/recv source/destination (e.g. old/foo@2023-06-01 -> sneakernet/old/foo@2023-06-01 -> new/old/foo@2023-06-01).
The new pool was natively encrypted from the very beginning. When ZFS snapshots were sent from new to sneakernet/new to old/new, they were sent raw, meaning that blocks were copied unmodified in their encrypted form. To decrypt and mount them on sneakernet or old, you would need to first load new's hex encryption key, which is stored in TrueNAS's SQLite database.
The old pool, on the other hand, was created before the advent of native encryption and was unencrypted for the first part of its life. Because it's desirable to encrypt data at rest, an encrypted dataset sneakernet/old was created for old using a passphrase encryption key when sneakernet was set up. Unencrypted snapshots were sent non-raw from old to sneakernet/old, where they were encrypted, and then sent raw from sneakernet/old to new/old. To decrypt and mount them on sneakernet or new, you would need to first load sneakernet's passphrase encryption key.
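In ZFS terms, the two flows looked roughly like this (dataset and snapshot names are illustrative; -w is the raw send flag):

```sh
# new is natively encrypted, so its snapshots travel raw (still encrypted) end to end
zfs send -w new/foo@2023-06-01 | zfs recv sneakernet/new/foo
# ...drive sneakernet to the other site...
zfs send -w sneakernet/new/foo@2023-06-01 | zfs recv old/new/foo

# old is unencrypted, so its snapshots are sent plain and get encrypted on receipt
# into the encrypted sneakernet/old dataset, then travel raw from there
zfs send old/foo@2023-06-01 | zfs recv sneakernet/old/foo
zfs send -w sneakernet/old/foo@2023-06-01 | zfs recv new/old/foo
```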
This was all tested thoroughly and snapshots were proven to be readable at each point on every pool.
Encrypting the old pool
Now that we had encrypted snapshots of old on sneakernet/old, we wanted to encrypt old itself. To do this, I simply took old offline during a maintenance window to prevent new writes, took snapshots of all datasets, sent them to sneakernet/old, and then sent the raw encrypted snapshots from sneakernet/old back to old/encrypted. Once I verified each dataset had been encrypted successfully, I destroyed the unencrypted dataset, updated the mount point of the encrypted dataset to that of the late unencrypted dataset, and then moved on to the next dataset. After all datasets were migrated, I used zfs change-key -i to make all child datasets inherit from the new old/encrypted encryption root, and then changed the key of the encryption root from a passphrase to a hex key, since TrueNAS only supported automatically unlocking datasets with hex encryption keys. Finally, I issued a zpool initialize to overwrite all the unencrypted blocks which were now in unallocated space.
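For a single dataset, the migration looked something like this (names, snapshot labels, and mount points are illustrative; the real process used incremental sends against the existing sneakernet/old datasets):

```sh
# Copy the final state through sneakernet/old, which encrypts it on receive,
# then raw-send the now-encrypted dataset back to old
zfs snapshot old/foo@final
zfs send -i old/foo@prev old/foo@final | zfs recv sneakernet/old/foo   # encrypted on receive
zfs send -w sneakernet/old/foo@final | zfs recv old/encrypted/foo      # raw copy back to old

# Verify old/encrypted/foo is readable, then retire the unencrypted original
zfs destroy -r old/foo
zfs set mountpoint=/mnt/foo old/encrypted/foo

# After every dataset is migrated: collapse key inheritance onto old/encrypted,
# switch its wrapping key from a passphrase to a hex key, and overwrite free space
zfs change-key -i old/encrypted/foo      # repeated for each child dataset
zfs change-key -o keyformat=hex -o keylocation=prompt old/encrypted
zpool initialize old
```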
Spoiler Alert: It may not be immediately obvious why, but changing the encryption key on old/encrypted silently broke backups of old datasets. Snapshots would still send and recv successfully, but they were no longer decryptable or mountable. Since the encryption key is not normally loaded, and we only load it when periodically testing the backups, we would not realize until it was too late.
Lesson: Test backups continuously so you get immediate feedback when they break.
Decommissioning the old pool
Later, the old pool was moved to the same site as the new pool, so we wanted to fully decommission old and migrate all its datasets to new. I began going about this in a similar way. I took old offline to prevent new writes, sent snapshots to sneakernet/old, and then to new/old. It was at this point that I made a very unfortunate mistake: I accidentally destroyed one dataset old/encrypted/foo before verifying the files were readable on new/old/foo, and I would soon realize that they were not.
Lesson: Wait to make all destructive changes together at the very end instead of interspersed where they could accidentally be performed in the wrong order.
The realization
What do you mean, permission denied? I am root!
Crap, I already destroyed old/encrypted/foo. This is not good, but I can still restore it from the remaining copy on sneakernet/old/foo.
Oh no, sneakernet/old is broken too. This is very not good!
In an act of desperation, I tried rebooting the machine, but it didn't change a thing.
It is at this point that I realized:
- Something has gone terribly wrong to prevent datasets on both sneakernet/old and new/old from mounting.
- Whatever it is, it's not likely going to be easy to diagnose or fix.
- There's a very real possibility the data might be gone forever.
I found myself in a hole and I wanted to stop digging. Fortunately, uptime was no longer critical for the old datasets after the relocation, so I could afford to step away from the keyboard, collect my thoughts, and avoid making the situation any worse than it already was.
Part 2: Debugging the issue
Once the worst of the overwhelming, visceral feelings that come with the realization that you may have just caused permanent data loss had subsided, I started to work the incident and try to figure out why the backups weren't mounting.
As a precaution, I first exported the old pool and took a forensic image of every disk in the pool. ZFS is a copy-on-write filesystem, so even though the dataset had been destroyed, most of the data was probably still on disk, just completely inaccessible with the normal ZFS tooling. In the worst case scenario, I may have had to try to forensically reconstruct the dataset from what was left on disk, and I didn't want to risk causing any more damage than I already had. Fortunately, I never had to use the disk images, but they still served as a valuable safety net while debugging and repairing.
Next, I realized that if we were to have any chance of debugging and fixing this issue, I needed to learn how ZFS actually works.
Learning how ZFS actually works
I unfortunately did not keep track of every resource I consumed, but in addition to reading the source and docs, I found these talks by Jeff Bonwick, Bill Moore, and Matt Ahrens (the original creators of ZFS) to be particularly helpful in understanding the design and implementation of ZFS:
- ZFS: The Last Word in File Systems Part 1
- ZFS: The Last Word in File Systems Part 2
- ZFS: The Last Word in File Systems Part 3
- How ZFS Snapshots Really Work
I highly recommend watching them all despite their age and somewhat poor recording quality, but will summarize the relevant information for those who don't have 3 hours to spare.
ZFS is a copy-on-write filesystem, which means that it does not overwrite blocks in place when a write is requested. Instead, the updated contents are written to a newly allocated block, and the old block is freed, which keeps the filesystem consistent if a write is interrupted. All blocks of both data and metadata are arranged in a Merkle tree structure where each block pointer contains a checksum of the child block, which allows ZFS to detect both block corruption and misdirected/phantom reads/writes. This means that any write will cause the block's checksum to change, which will then cause the parent block's checksum to change (since the parent block includes the block pointer, which includes the checksum of the child block that changed), and so on, all the way up to the root of the tree, which ZFS calls an uberblock.
Uberblocks are written atomically, and because of the Merkle tree structure, they always represent a consistent snapshot of the entire filesystem at a point in time. Writes are batched together into transaction groups identified by a monotonically increasing counter, and each transaction group when synced to disk produces a new uberblock and associated filesystem tree. Taking a snapshot is then as simple as saving an uberblock and not freeing any of the blocks it points to.
In addition to the checksum, each block pointer also contains the transaction group id in which the child block was written, which is called the block's birth time or creation time. ZFS uses birth times to determine which blocks have been written before or after a snapshot. Any blocks with a birth time less than or equal to the snapshot's birth time must have been written before the snapshot was taken, and conversely, any blocks with a birth time greater than the snapshot's birth time must have been written after the snapshot was taken.
One application of birth times is to generate incremental send streams between two snapshots. ZFS walks the tree but only needs to include blocks whose birth time is greater than the first snapshot's birth txg and less than or equal to the second's. In fact, you don't even need to keep the data of the first snapshot around: you can create a bookmark which saves the snapshot's creation transaction group (but none of the data blocks), delete the snapshot to free its data, and then use the bookmark as the source to generate the same incremental send stream.
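A minimal illustration of that workflow (names are made up):

```sh
zfs snapshot pool/data@1
zfs send pool/data@1 | zfs recv backup/data

zfs bookmark pool/data@1 pool/data#1    # remembers @1's birth txg, keeps no data blocks
zfs destroy pool/data@1                 # blocks unique to @1 can now be freed

zfs snapshot pool/data@2
zfs send -i pool/data#1 pool/data@2 | zfs recv backup/data    # same incremental as @1 -> @2
```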
Spoiler Alert: Chekhov's bookmark will become relevant later.
Learning how ZFS native encryption actually works
ZFS native encryption is a relatively new feature, which was first released in OpenZFS 0.8.0 (2019) and subsequently made it into FreeBSD 13.0 (2021) when OpenZFS was adopted.
In addition to the docs, I found this 2016 talk on ZFS Native Encryption by Tom Caputi (the original author of native encryption) to be helpful in understanding its design and implementation. Again, I will summarize the relevant information.
ZFS native encryption works by encrypting dataset blocks with a symmetric authenticated encryption cipher suite (AES-256-GCM by default). To use native encryption, you must create a new dataset with -o encryption=on, which generates a unique master key for the dataset. The dataset's master key is then used to derive block data encryption keys with a salted HKDF.
The master key can't be changed, so it is encrypted with a wrapping key which can be changed. The wrapping key is provided by the user with zfs load-key and can be changed with zfs change-key which re-encrypts the same master key with a new wrapping key.
The encrypted master keys are stored in each dataset since each dataset has its own master key, but the wrapping key parameters are stored on what is called the encryption root dataset. The encryption root may be the same encrypted dataset, or it may be a parent of the encrypted dataset. When a child encrypted dataset inherits from a parent encryption root, the encryption root's wrapping key is used to decrypt the child dataset's master key. This is how one key can be used to unlock a parent encryption root dataset and all child encrypted datasets that inherit from it at the same time instead of having to load a key for every single encrypted dataset.
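Sketched out with made-up names, the key hierarchy works like this:

```sh
zfs create -o encryption=on -o keyformat=passphrase pool/secure   # new master key, wrapped by the passphrase
zfs create pool/secure/child    # gets its own master key, but pool/secure is its encryption root

zfs load-key pool/secure        # supply the wrapping key once...
zfs mount -a                    # ...and both datasets can be mounted

# change-key re-wraps the master keys with a new wrapping key; the data is never re-encrypted
zfs change-key -o keyformat=hex -o keylocation=prompt pool/secure
```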
In our case, new, sneakernet/new, sneakernet/old, and old/encrypted are the encryption roots, and all child encrypted datasets inherit from them.
Forming a hypothesis
At this point, we now know enough to form a hypothesis as to what may have happened. Feel free to pause here and try to figure it out on your own.
Recall that sneakernet/old was created using a passphrase encryption key, and old/encrypted was created by raw sending sneakernet/old, so it initially used the same passphrase-derived wrapping key. When the old/encrypted encryption key was changed from a passphrase to a hex key, ZFS must have changed the wrapping key parameters on the old/encrypted encryption root and re-encrypted all child encrypted dataset master keys with the new hex wrapping key. Crucially, a new snapshot of old/encrypted was never taken and sent to sneakernet/old because it ostensibly didn't contain any data and was just a container for the child datasets.
Hypothesis: When subsequent snapshots were sent from old to sneakernet, the master keys of the child encrypted datasets were updated to be encrypted with the new hex wrapping key, but the sneakernet/old encryption root was never updated with the new hex wrapping key parameters because a new snapshot was never sent. Therefore, when we load the key for sneakernet/old, ZFS asks for the old passphrase, not a hex key, and when we try to mount sneakernet/old/foo, it tries and fails to decrypt its master key with the old passphrase wrapping key instead of the new hex wrapping key.
If correct, this would explain the behavior we're seeing. To test this hypothesis, let's try to reproduce the issue in a test environment.
Creating a test environment
TrueNAS CORE 13.0-U5.1 is based on FreeBSD 13.1, despite the different minor version numbers, so we'll create a FreeBSD 13.1 VM to test in. Make sure to include the system source tree and install on UFS so that we can build OpenZFS and reload the ZFS kernel module without rebooting.
TrueNAS CORE 13.0-U5.1 uses ZFS 2.1.11, so we'll want to build the same version from source for consistency. I started by reading the Building ZFS guide and following the steps documented there with some small modifications for FreeBSD since the page was clearly written with Linux in mind.
First, install the dependencies we'll need.
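On FreeBSD that's roughly the following (package names per the Building ZFS guide; the exact list may differ by version):

```sh
pkg install -y autoconf automake autotools gettext git gmake python
```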
Then, clone ZFS and check out tag zfs-2.1.11.
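The tag matching the TrueNAS release is zfs-2.1.11:

```sh
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.1.11
```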
Now, configure, build, and install ZFS.
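This is roughly the standard autotools dance, using gmake on FreeBSD:

```sh
./autogen.sh
./configure
gmake -j"$(sysctl -n hw.ncpu)"
gmake install
```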
Then, replace FreeBSD's ZFS kernel module with the one we just built.
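Since the root filesystem is UFS and no pools are imported, the stock module can simply be swapped out. The module path below is an assumption; adjust it to wherever your build put zfs.ko:

```sh
kldunload zfs || true              # unload the stock FreeBSD module if it is loaded
kldload /root/zfs/module/zfs.ko    # load the module we just built
```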
Finally, verify we're running version 2.1.11 as desired.
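zfs version reports both the userland and kernel module versions (output approximate):

```sh
zfs version
# zfs-2.1.11-1
# zfs-kmod-2.1.11-1
```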
Reproducing the issue
Now we're ready to try reproducing the issue. This took some iteration to get right, so I wrote a bash script that starts from scratch on each invocation and then runs the commands needed to reproduce the corrupt state. After quite a bit of trial and error, I eventually produced a reproducer script which does the following (a condensed sketch follows the list):
- Create 2 pools: src and dst.
- Create src/encryptionroot using a passphrase encryption key.
- Create src/encryptionroot/child which inherits src/encryptionroot as its encryption root.
- Create files and take snapshots src/encryptionroot@111 and src/encryptionroot/child@111.
- Send raw snapshots src/encryptionroot@111 and src/encryptionroot/child@111 to dst/encryptionroot and dst/encryptionroot/child respectively.
- Load encryption key for dst/encryptionroot using passphrase and mount encrypted datasets dst/encryptionroot and dst/encryptionroot/child. At this point, src and dst pools are in sync.
- Change the src/encryptionroot encryption key from passphrase to hex.
- Update files and take snapshots src/encryptionroot@222 and src/encryptionroot/child@222.
- Send a raw incremental snapshot of src/encryptionroot/child@222 to dst/encryptionroot/child, but do not send src/encryptionroot@222 which contains the key change!
- Unmount dst/encryptionroot and dst/encryptionroot/child and unload the cached encryption key for dst/encryptionroot.
- Load the encryption key for dst/encryptionroot using the passphrase since we didn't send the updated encryption root after changing the key.
- Try to remount dst/encryptionroot and dst/encryptionroot/child.
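Here's a condensed sketch of that script. File-backed pools, dataset names, and keys are placeholders; the real script also cleans up previous runs and verifies its work along the way, and here the destination is unmounted before the raw incremental receive (the ordering doesn't affect the outcome):

```sh
#!/bin/sh
set -e

# Two throwaway file-backed pools
truncate -s 512M /tmp/src.img /tmp/dst.img
zpool create src /tmp/src.img
zpool create dst /tmp/dst.img

# Passphrase-keyed encryption root on src with an inheriting child
echo "correct horse battery staple" | \
    zfs create -o encryption=on -o keyformat=passphrase src/encryptionroot
zfs create src/encryptionroot/child

# Snapshot @111, raw-send everything to dst, unlock dst with the passphrase
touch /src/encryptionroot/one /src/encryptionroot/child/one
zfs snapshot -r src/encryptionroot@111
zfs send -w src/encryptionroot@111       | zfs recv -u dst/encryptionroot
zfs send -w src/encryptionroot/child@111 | zfs recv -u dst/encryptionroot/child
echo "correct horse battery staple" | zfs load-key dst/encryptionroot
zfs mount dst/encryptionroot
zfs mount dst/encryptionroot/child      # works: src and dst are in sync

# Change the wrapping key on the src encryption root from a passphrase to a hex key
od -An -tx1 -N32 /dev/urandom | tr -d ' \n' > /tmp/hex.key
zfs change-key -o keyformat=hex -o keylocation=file:///tmp/hex.key src/encryptionroot

# Snapshot @222, but only send the CHILD -- the key change never reaches dst
touch /src/encryptionroot/two /src/encryptionroot/child/two
zfs snapshot -r src/encryptionroot@222
zfs unmount dst/encryptionroot/child
zfs unmount dst/encryptionroot
zfs unload-key dst/encryptionroot
zfs send -w -i src/encryptionroot/child@111 src/encryptionroot/child@222 | \
    zfs recv dst/encryptionroot/child

# Reload the key with the old passphrase (dst never saw the key change) and remount
echo "correct horse battery staple" | zfs load-key dst/encryptionroot
zfs mount dst/encryptionroot            # mounts fine
zfs mount dst/encryptionroot/child      # cannot mount: Permission denied
```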
When we run the reproducer, the root encrypted dataset dst/encryptionroot mounts successfully and we can read the old file from the first snapshot, but the child encrypted dataset dst/encryptionroot/child fails to mount with cannot mount 'dst/encryptionroot/child': Permission denied, just as we expected.
Now that we understand and can reliably reproduce the issue, we're a big step closer to fixing it!
Part 3: Recovering our data
Theoretically easy to fix
We know now that a child encrypted dataset will become unmountable if the following conditions are met:
- The wrapping encryption key on the encryption root is changed.
- A snapshot of the child encrypted dataset that was taken after the key change is sent.
- A snapshot of the encryption root that was taken after the key change is not sent.
Lesson: Always send a snapshot of the encryption root after changing the encryption key.
In theory, all we should have to do to fix it is send the latest snapshot of the encryption root.
Not so easy in practice
Unfortunately, this isn't enough to fix new and sneakernet; there are no snapshots or bookmarks left on the old encryption root from before the key change, and we can't generate an incremental send stream without one. Mapped onto our reproducer, it's as if src/encryptionroot@111 had been destroyed.
You might think we could forcibly send the entire encryption root, but zfs recv will reject it no matter what you do.
Lesson: Create bookmarks before destroying snapshots.
We need to find a way to create an incremental send stream that contains the key change, but how? We could try to manually craft a send stream containing the new key, but that sounds tricky. There's got to be a better way!
Idea for a hack
Recall that a snapshot is not the only valid source for generating an incremental send stream. What if we had a bookmark?
A bookmark works just as well as a snapshot for generating an incremental send stream, but we don't have a bookmark on old either. How is this any better?
Unlike a snapshot, which is effectively an entire dataset tree frozen in time (very complex), a bookmark is a very simple object on disk which consists of:
- The GUID of the snapshot.
- The transaction group the snapshot was created in.
- The Unix timestamp when the snapshot was created.
For example, this is what our bookmark looks like in zdb:
Note that zdb shows the GUID in hexadecimal versus zfs get guid which shows it in decimal, consistency be damned. The redaction_obj is optional and only used for redaction bookmarks, so we can ignore it.
A bookmark is simple enough that we could feasibly hack ZFS into manually writing one for us, provided that we can figure out the right values to use. The GUID and Unix timestamp don't really matter for generating an incremental send stream, so we could choose them arbitrarily if we had to, but the transaction group id really matters because that is what ZFS uses to determine which blocks to include.
But how can we figure out what transaction group the snapshot was created in if neither the snapshot nor a bookmark of the snapshot still exist? I initially considered walking the dataset trees on each pool, diffing them to find the newest block present on both datasets, and using its transaction group id, but I found a much easier way with one of ZFS's lesser known features.
A brief detour into pool histories
I didn't know about pool histories before embarking on this unplanned journey, but they are now yet another thing I love about ZFS. Every pool allocates 0.1% of its space (128 KiB minimum, 1 GiB maximum) to a ring buffer which is used to log every command that is executed on the pool. This can be used to forensically reconstruct the state of the pool over time.
ZFS also logs many internal operations in the pool history (search for spa_history_log in the source code) which can be viewed with the -i flag. For snapshots, this includes the transaction group (txg) id when the snapshot was created, which is exactly what we're looking for!
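For example, in our test environment (output format approximate), the internal history still records the txg of the snapshot even after the snapshot itself has been destroyed:

```sh
zpool history -i src | grep 'encryptionroot@111'
# 2023-06-01.12:00:00 [txg:1234] snapshot src/encryptionroot@111 (...)
```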
The GUID and creation timestamp are even easier: both are preserved by send/recv, so we can read them from the snapshot that still exists on dst.
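For example, in the test environment:

```sh
zfs get -Hp guid,creation dst/encryptionroot@111
```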
Now that we have all the values, the only missing piece is a way to manually create a bookmark containing arbitrary data.
Hacking ZFS to manually create a bookmark
To understand how ZFS creates a bookmark, we can trace the code path from zfs bookmark all the way down to dsl_bookmark_add which actually adds the bookmark node to the tree.
- command_table
- zfs_do_bookmark
- lzc_bookmark
- lzc_ioctl
- zfs_ioctl_fd
- zcmd_ioctl_compat
- zfs_ioctl_register bookmark
- zfs_ioc_bookmark
- dsl_bookmark_create
- dsl_bookmark_create_sync
- dsl_bookmark_create_sync_impl_book
- dsl_bookmark_node_add
This is the bookmark structure physically written to disk:
zfs/include/sys/dsl_bookmark.h
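Approximately, as of zfs-2.1.x (reproduced from the header above; check the actual file for the authoritative definition):

```c
typedef struct zfs_bookmark_phys {
	uint64_t zbm_guid;            /* GUID of the bookmarked snapshot */
	uint64_t zbm_creation_txg;    /* transaction group the snapshot was born in */
	uint64_t zbm_creation_time;   /* creation time of the bookmarked snapshot */
	/* everything below only exists in v2 bookmarks */
	uint64_t zbm_redaction_obj;   /* redaction list object (redaction bookmarks only) */
	uint64_t zbm_flags;
	uint64_t zbm_referenced_bytes_refd;
	uint64_t zbm_compressed_bytes_refd;
	uint64_t zbm_uncompressed_bytes_refd;
	uint64_t zbm_referenced_freed_before_next_snap;
	uint64_t zbm_compressed_freed_before_next_snap;
	uint64_t zbm_uncompressed_freed_before_next_snap;
	uint64_t zbm_ivset_guid;      /* IV set GUID, needed for raw sends */
} zfs_bookmark_phys_t;
```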
Only the first 3 fields are required for v1 bookmarks, while v2 bookmarks contain all 12 fields. dsl_bookmark_node_add only writes a v2 bookmark if any of the 9 v2 fields is non-zero, so we can leave them all zero to write a v1 bookmark.
After a few iterations, I had a patch which hijacks the normal zfs bookmark pool/dataset#src pool/dataset#dst code path to create a bookmark with arbitrary data when the source bookmark name is missing.
To test, we recompile ZFS, reload the kernel module, and reimport the pools.
Then, we create the bookmark ex nihilo using the magic bookmark name missing.
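With the patched module loaded, that looks something like this (the new bookmark's name is arbitrary; the txg, GUID, and timestamp it contains were compiled into the patch):

```sh
zfs bookmark src/encryptionroot#missing src/encryptionroot#111
```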
Success! We can now use the bookmark to generate an incremental send stream containing the new hex wrapping key parameters.
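Roughly (the stream file path is an assumption):

```sh
# Generate the raw incremental from our hand-made bookmark and inspect it
zfs send -w -i src/encryptionroot#111 src/encryptionroot@222 > /tmp/fix.zstream
zstreamdump < /tmp/fix.zstream | grep from_ivset_guid
#     from_ivset_guid = 0x0        <- a v1 bookmark carries no IV set GUID

# Unmount, unload the stale key, and try to receive it
zfs unmount dst/encryptionroot
zfs unload-key dst/encryptionroot
zfs recv -F dst/encryptionroot < /tmp/fix.zstream     # rejected
```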
But we can't receive the send stream.
The final obstacle
ZFS refuses the stream because it is missing a source IV set GUID (see from_ivset_guid = 0x0 in the zstreamdump above). This is because we created a v1 bookmark which does not contain the IV set GUID like a v2 bookmark would.
Since we know that the send stream is created using the right snapshots, we can temporarily disable checking IV set GUIDs to allow the snapshot to be received as described in errata 4.
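The check is controlled by the zfs_disable_ivset_guid_check module parameter; on FreeBSD it should be reachable as a vfs.zfs sysctl (the sysctl name below is inferred from the parameter name):

```sh
sysctl vfs.zfs.disable_ivset_guid_check=1
zfs recv -F dst/encryptionroot < /tmp/fix.zstream
sysctl vfs.zfs.disable_ivset_guid_check=0
```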
The moment of truth
And now for the moment of truth…
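In the test environment, loading the new hex key and remounting looks roughly like this (/tmp/hex.key is the key from the reproducer):

```sh
zfs load-key -L file:///tmp/hex.key dst/encryptionroot   # dst now expects the hex key
zfs mount dst/encryptionroot
zfs mount dst/encryptionroot/child      # mounts at last
ls /dst/encryptionroot/child            # the files are readable again
```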
At this point, we can now reliably fix the issue in our test environment. All we need to do now is use our hacked ZFS build to create the bookmark on old, send an incremental snapshot of the encryption root with the new key to sneakernet, and then send that snapshot from sneakernet to new. I rebuilt ZFS again with the correct transaction group, GUID, and creation timestamp for old, repeated the same steps with the names changed, and thanks to our thorough testing, it worked on the first try!
Conclusion
After a week of intense research and debugging, I had rescued our data back from the brink and could again sleep soundly at night. While I appreciated the opportunity to learn more about ZFS, I can't help but think about how this entire incident could have been avoided at several key points which translate directly into lessons learned:
- Test backups continuously so you get immediate feedback when they break.
- Wait to make all destructive changes together at the very end instead of interspersed where they could accidentally be performed in the wrong order.
- Always send a snapshot of the encryption root after changing the encryption key.
- Create bookmarks before destroying snapshots.
I hope that you may learn from my mistakes and avoid a similar incident. If you do happen to find yourself in a similar predicament, I'd love to hear from you regardless of whether this postmortem was helpful or not. My contact details can be found here.
Knowing what I now know about ZFS native encryption, I find it difficult to recommend until the sharp edges have all been filed down. In most cases, I'd prefer to encrypt the entire pool at the block device level and encrypt send streams with age. But if you really do need the flexibility offered by native encryption, always remember to mind the encryptionroot!