While ZFS has a well-earned reputation for data integrity and reliability, ZFS native encryption has some incredibly sharp edges that will cut you if you don't know where to be careful. Unfortunately, I learned this the hard way, standing in a pool of my own blood and tears after thoroughly lacerating myself. I very nearly lost 8.5 TiB of data permanently after what should've been a series of simple, routine ZFS operations left me with an undecryptable dataset. Time has healed the wound enough that I am no longer filled with anguish just thinking about it, so I will now share my experience in the hope that you may learn from my mistakes. Together, we'll go over the unfortunate series of events that led to this happening and how it could've been avoided, learn how ZFS actually works under the hood, use our newfound knowledge to debug and reproduce the issue at hand, and finally compile a modified version of ZFS to repair the corrupted state and rescue our precious data. This is the postmortem of that terrible, horrible, no good, very bad week…
Part 1: An unfortunate series of events
The status quo
In the beginning, there were two ZFS pools: old and new (names changed for clarity). Each pool was hosted on an instance of TrueNAS CORE 13.0-U5.1 located at two different sites about an hour's drive apart with poor Internet connectivity between them. For this reason, a third pool sneakernet was periodically moved between the two sites and used to exchange snapshots of old and new datasets for backup purposes. ZFS dataset snapshots would be indirectly relayed from old to new (and vice versa) using sneakernet as an intermediate ZFS send/recv source/destination (e.g. old/foo@2023-06-01 -> sneakernet/old/foo@2023-06-01 -> new/old/foo@2023-06-01).
The new pool was natively encrypted from the very beginning. When ZFS snapshots were sent from new to sneakernet/new to old/new, they were sent raw, meaning that blocks were copied unmodified in their encrypted form. To decrypt and mount them on sneakernet or old, you would need to first load new's hex encryption key, which is stored in TrueNAS's SQLite database.
The old pool, on the other hand, was created before the advent of native encryption and was unencrypted for the first part of its life. Because it's desirable to encrypt data at rest, an encrypted dataset sneakernet/old was created for old using a passphrase encryption key when sneakernet was set up. Unencrypted snapshots were sent non-raw from old to sneakernet/old, where they were encrypted, and then sent raw from sneakernet/old to new/old. To decrypt and mount them on sneakernet or new, you would need to first load sneakernet's passphrase encryption key.
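In ZFS terms, the two flows looked roughly like this (dataset and snapshot names are illustrative; -w is the raw send flag):

```sh
# new is natively encrypted, so its snapshots travel raw (still encrypted) end to end
zfs send -w new/foo@2023-06-01 | zfs recv sneakernet/new/foo
# ...drive sneakernet to the other site...
zfs send -w sneakernet/new/foo@2023-06-01 | zfs recv old/new/foo

# old is unencrypted, so its snapshots are sent plain and get encrypted on receipt
# into the encrypted sneakernet/old dataset, then travel raw from there
zfs send old/foo@2023-06-01 | zfs recv sneakernet/old/foo
zfs send -w sneakernet/old/foo@2023-06-01 | zfs recv new/old/foo
```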
This was all tested thoroughly and snapshots were proven to be readable at each point on every pool.
Encrypting the old pool
Now that we had encrypted snapshots of old on sneakernet/old, we wanted to encrypt old itself. To do this, I simply took old offline during a maintenance window to prevent new writes, took snapshots of all datasets, sent them to sneakernet/old, and then sent the raw encrypted snapshots from sneakernet/old back to old/encrypted. Once I verified each dataset had been encrypted successfully, I destroyed the unencrypted dataset, updated the mount point of the encrypted dataset to that of the late unencrypted dataset, and then moved on to the next dataset. After all datasets were migrated, I used zfs change-key -i to make all child datasets inherit from the new old/encrypted encryption root, and then changed the key of the encryption root from a passphrase to a hex key, since TrueNAS only supported automatically unlocking datasets with hex encryption keys. Finally, I issued a zpool initialize to overwrite all the unencrypted blocks which were now in unallocated space.
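For a single dataset, the migration looked something like this (names, snapshot labels, and mount points are illustrative; the real process used incremental sends against the existing sneakernet/old datasets):

```sh
# Copy the final state through sneakernet/old, which encrypts it on receive,
# then raw-send the now-encrypted dataset back to old
zfs snapshot old/foo@final
zfs send -i old/foo@prev old/foo@final | zfs recv sneakernet/old/foo   # encrypted on receive
zfs send -w sneakernet/old/foo@final | zfs recv old/encrypted/foo      # raw copy back to old

# Verify old/encrypted/foo is readable, then retire the unencrypted original
zfs destroy -r old/foo
zfs set mountpoint=/mnt/foo old/encrypted/foo

# After every dataset is migrated: collapse key inheritance onto old/encrypted,
# switch its wrapping key from a passphrase to a hex key, and overwrite free space
zfs change-key -i old/encrypted/foo      # repeated for each child dataset
zfs change-key -o keyformat=hex -o keylocation=prompt old/encrypted
zpool initialize old
```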
Spoiler Alert: It may not be immediately obvious why, but changing the encryption key on old/encrypted silently broke backups of old datasets. Snapshots would still send and recv successfully, but they were no longer decryptable or mountable. Since the encryption key is not normally loaded, and we only load it when periodically testing the backups, we would not realize until it was too late.
Lesson: Test backups continuously so you get immediate feedback when they break.
Decommissioning the old pool
Later, the old pool was moved to the same site as the new pool, so we wanted to fully decommission old and migrate all its datasets to new. I began going about this in a similar way. I took old offline to prevent new writes, sent snapshots to sneakernet/old, and then to new/old. It was at this point that I made a very unfortunate mistake: I accidentally destroyed one dataset old/encrypted/foo before verifying the files were readable on new/old/foo, and I would soon realize that they were not.
Lesson: Wait to make all destructive changes together at the very end instead of interspersed where they could accidentally be performed in the wrong order.
The realization
What do you mean, permission denied? I am root!
Crap, I already destroyed old/encrypted/foo. This is not good, but I can still restore it from the remaining copy on sneakernet/old/foo.
Oh no, sneakernet/old is broken too. This is very not good!
In an act of desperation, I tried rebooting the machine, but it didn't change a thing.
It is at this point that I realized:
- Something has gone terribly wrong to prevent datasets on both sneakernet/old and new/old from mounting.
- Whatever it is, it's not likely going to be easy to diagnose or fix.
- There's a very real possibility the data might be gone forever.
I found myself in a hole and I wanted to stop digging. Fortunately, uptime was no longer critical for the old datasets after the relocation, so I could afford to step away from the keyboard, collect my thoughts, and avoid making the situation any worse than it already was.
Part 2: Debugging the issue
Once the worst of the overwhelming, visceral feelings that come with the realization that you may have just caused permanent data loss had subsided, I started to work the incident and try to figure out why the backups weren't mounting.
As a precaution, I first exported the old pool and took a forensic image of every disk in the pool. ZFS is a copy-on-write filesystem, so even though the dataset had been destroyed, most of the data was probably still on disk, just completely inaccessible with the normal ZFS tooling. In the worst case scenario, I may have had to try to forensically reconstruct the dataset from what was left on disk, and I didn't want to risk causing any more damage than I already had. Fortunately, I never had to use the disk images, but they still served as a valuable safety net while debugging and repairing.
Next, I realized that if we were to have any chance of debugging and fixing this issue, I needed to learn how ZFS actually works.
Learning how ZFS actually works
I unfortunately did not keep track of every resource I consumed, but in addition to reading the source and docs, I found these talks by Jeff Bonwick, Bill Moore, and Matt Ahrens (the original creators of ZFS) to be particularly helpful in understanding the design and implementation of ZFS:
- ZFS: The Last Word in File Systems Part 1
- ZFS: The Last Word in File Systems Part 2
- ZFS: The Last Word in File Systems Part 3
- How ZFS Snapshots Really Work
I highly recommend watching them all despite their age and somewhat poor recording quality, but will summarize the relevant information for those who don't have 3 hours to spare.
ZFS is a copy-on-write filesystem, which means that it does not overwrite blocks in place when a write is requested. Instead, the updated contents are written to a newly allocated block, and the old block is freed, which keeps the filesystem consistent if a write is interrupted. All blocks of both data and metadata are arranged in a Merkle tree structure where each block pointer contains a checksum of the child block, which allows ZFS to detect both block corruption and misdirected/phantom reads/writes. This means that any write will cause the block's checksum to change, which will then cause the parent block's checksum to change (since the parent block includes the block pointer, which includes the checksum of the child block that changed), and so on, all the way up to the root of the tree, which ZFS calls an uberblock.
Uberblocks are written atomically, and because of the Merkle tree structure, they always represent a consistent snapshot of the entire filesystem at a point in time. Writes are batched together into transaction groups identified by a monotonically increasing counter, and each transaction group when synced to disk produces a new uberblock and associated filesystem tree. Taking a snapshot is then as simple as saving an uberblock and not freeing any of the blocks it points to.
In addition to the checksum, each block pointer also contains the transaction group id in which the child block was written, which is called the block's birth time or creation time. ZFS uses birth times to determine which blocks have been written before or after a snapshot. Any blocks with a birth time less than or equal to the snapshot's birth time must have been written before the snapshot was taken, and conversely, any blocks with a birth time greater than the snapshot's birth time must have been written after the snapshot was taken.
One application of birth times is to generate incremental send streams between two snapshots. ZFS walks the tree but only needs to include blocks whose birth time is greater than the first snapshot's birth txg and less than or equal to the second's. In fact, you don't even need to keep the data of the first snapshot around: you can create a bookmark which saves the snapshot's creation transaction group (but none of the data blocks), delete the snapshot to free its data, and then use the bookmark as the source to generate the same incremental send stream.
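A minimal illustration of that workflow (names are made up):

```sh
zfs snapshot pool/data@1
zfs send pool/data@1 | zfs recv backup/data

zfs bookmark pool/data@1 pool/data#1    # remembers @1's birth txg, keeps no data blocks
zfs destroy pool/data@1                 # blocks unique to @1 can now be freed

zfs snapshot pool/data@2
zfs send -i pool/data#1 pool/data@2 | zfs recv backup/data    # same incremental as @1 -> @2
```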
Spoiler Alert: Chekhov's bookmark will become relevant later.
Learning how ZFS native encryption actually works
ZFS native encryption is a relatively new feature, which was first released in OpenZFS 0.8.0 (2019) and subsequently made it into FreeBSD 13.0 (2021) when OpenZFS was adopted.
In addition to the docs, I found this 2016 talk on ZFS Native Encryption by Tom Caputi (the original author of native encryption) to be helpful in understanding its design and implementation. Again, I will summarize the relevant information.
ZFS native encryption works by encrypting dataset blocks with a symmetric authenticated encryption cipher suite (AES-256-GCM by default). To use native encryption, you must create a new dataset with -o encryption=on, which generates a unique master key for the dataset. The dataset's master key is then used to derive block data encryption keys with a salted HKDF.
The master key can't be changed, so it is encrypted with a wrapping key which can be changed. The wrapping key is provided by the user with zfs load-key and can be changed with zfs change-key which re-encrypts the same master key with a new wrapping key.
The encrypted master keys are stored in each dataset since each dataset has its own master key, but the wrapping key parameters are stored on what is called the encryption root dataset. The encryption root may be the same encrypted dataset, or it may be a parent of the encrypted dataset. When a child encrypted dataset inherits from a parent encryption root, the encryption root's wrapping key is used to decrypt the child dataset's master key. This is how one key can be used to unlock a parent encryption root dataset and all child encrypted datasets that inherit from it at the same time instead of having to load a key for every single encrypted dataset.
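Sketched out with made-up names, the key hierarchy works like this:

```sh
zfs create -o encryption=on -o keyformat=passphrase pool/secure   # new master key, wrapped by the passphrase
zfs create pool/secure/child    # gets its own master key, but pool/secure is its encryption root

zfs load-key pool/secure        # supply the wrapping key once...
zfs mount -a                    # ...and both datasets can be mounted

# change-key re-wraps the master keys with a new wrapping key; the data is never re-encrypted
zfs change-key -o keyformat=hex -o keylocation=prompt pool/secure
```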
In our case, new, sneakernet/new, sneakernet/old, and old/encrypted are the encryption roots, and all child encrypted datasets inherit from them.
Forming a hypothesis
At this point, we now know enough to form a hypothesis as to what may have happened. Feel free to pause here and try to figure it out on your own.
Recall that sneakernet/old was created using a passphrase encryption key, and old/encrypted was created by raw sending sneakernet/old, so it initially used the same passphrase-derived wrapping key. When the old/encrypted encryption key was changed from a passphrase to a hex key, ZFS must have changed the wrapping key parameters on the old/encrypted encryption root and re-encrypted all child encrypted dataset master keys with the new hex wrapping key. Crucially, a new snapshot of old/encrypted was never taken and sent to sneakernet/old because it ostensibly didn't contain any data and was just a container for the child datasets.
Hypothesis: When subsequent snapshots were sent from old to sneakernet, the master keys of the child encrypted datasets were updated to be encrypted with the new hex wrapping key, but the sneakernet/old encryption root was never updated with the new hex wrapping key parameters because a new snapshot was never sent. Therefore, when we load the key for sneakernet/old, ZFS asks for the old passphrase, not a hex key, and when we try to mount sneakernet/old/foo, it tries and fails to decrypt its master key with the old passphrase wrapping key instead of the new hex wrapping key.
If correct, this would explain the behavior we're seeing. To test this hypothesis, let's try to reproduce the issue in a test environment.
Creating a test environment
TrueNAS CORE 13.0-U5.1 is based on FreeBSD 13.1, despite the different minor version numbers, so we'll create a FreeBSD 13.1 VM to test in. Make sure to include the system source tree and install on UFS so that we can build OpenZFS and reload the ZFS kernel module without rebooting.
TrueNAS CORE 13.0-U5.1 uses ZFS 2.1.11, so we'll want to build the same version from source for consistency. I started by reading the Building ZFS guide and following the steps documented there with some small modifications for FreeBSD since the page was clearly written with Linux in mind.
First, install the dependencies we'll need.
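On FreeBSD that's roughly the following (package names per the Building ZFS guide; the exact list may differ by version):

```sh
pkg install -y autoconf automake autotools gettext git gmake python
```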
Then, clone ZFS and check out tag zfs-2.1.11.
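The tag matching the TrueNAS release is zfs-2.1.11:

```sh
git clone https://github.com/openzfs/zfs.git
cd zfs
git checkout zfs-2.1.11
```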
Now, configure, build, and install ZFS.
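This is roughly the standard autotools dance, using gmake on FreeBSD:

```sh
./autogen.sh
./configure
gmake -j"$(sysctl -n hw.ncpu)"
gmake install
```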
Then, replace FreeBSD's ZFS kernel module with the one we just built.
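Since the root filesystem is UFS and no pools are imported, the stock module can simply be swapped out. The module path below is an assumption; adjust it to wherever your build put zfs.ko:

```sh
kldunload zfs || true              # unload the stock FreeBSD module if it is loaded
kldload /root/zfs/module/zfs.ko    # load the module we just built
```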
Finally, verify we're running version 2.1.11 as desired.
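zfs version reports both the userland and kernel module versions (output approximate):

```sh
zfs version
# zfs-2.1.11-1
# zfs-kmod-2.1.11-1
```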
Reproducing the issue
Now we're ready to try reproducing the issue. This took some iteration to get right, so I wrote a bash script that starts from scratch on each invocation and then runs the commands needed to reproduce the corrupt state. After quite a bit of trial and error, I eventually produced a reproducer script which does the following (a condensed sketch follows the list):
- Create 2 pools: src and dst.
- Create src/encryptionroot using a passphrase encryption key.
- Create src/encryptionroot/child which inherits src/encryptionroot as its encryption root.
- Create files and take snapshots src/encryptionroot@111 and src/encryptionroot/child@111.
- Send raw snapshots src/encryptionroot@111 and src/encryptionroot/child@111 to dst/encryptionroot and dst/encryptionroot/child respectively.
- Load encryption key for dst/encryptionroot using passphrase and mount encrypted datasets dst/encryptionroot and dst/encryptionroot/child. At this point, src and dst pools are in sync.
- Change the src/encryptionroot encryption key from passphrase to hex.
- Update files and take snapshots src/encryptionroot@222 and src/encryptionroot/child@222.
- Send a raw incremental snapshot of src/encryptionroot/child@222 to dst/encryptionroot/child, but do not send src/encryptionroot@222 which contains the key change!
- Unmount dst/encryptionroot and dst/encryptionroot/child and unload the cached encryption key for dst/encryptionroot.
- Load the encryption key for dst/encryptionroot using the passphrase since we didn't send the updated encryption root after changing the key.
- Try to remount dst/encryptionroot and dst/encryptionroot/child.
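Here's a condensed sketch of that script. File-backed pools, dataset names, and keys are placeholders; the real script also cleans up previous runs and verifies its work along the way, and here the destination is unmounted before the raw incremental receive (the ordering doesn't affect the outcome):

```sh
#!/bin/sh
set -e

# Two throwaway file-backed pools
truncate -s 512M /tmp/src.img /tmp/dst.img
zpool create src /tmp/src.img
zpool create dst /tmp/dst.img

# Passphrase-keyed encryption root on src with an inheriting child
echo "correct horse battery staple" | \
    zfs create -o encryption=on -o keyformat=passphrase src/encryptionroot
zfs create src/encryptionroot/child

# Snapshot @111, raw-send everything to dst, unlock dst with the passphrase
touch /src/encryptionroot/one /src/encryptionroot/child/one
zfs snapshot -r src/encryptionroot@111
zfs send -w src/encryptionroot@111       | zfs recv -u dst/encryptionroot
zfs send -w src/encryptionroot/child@111 | zfs recv -u dst/encryptionroot/child
echo "correct horse battery staple" | zfs load-key dst/encryptionroot
zfs mount dst/encryptionroot
zfs mount dst/encryptionroot/child      # works: src and dst are in sync

# Change the wrapping key on the src encryption root from a passphrase to a hex key
od -An -tx1 -N32 /dev/urandom | tr -d ' \n' > /tmp/hex.key
zfs change-key -o keyformat=hex -o keylocation=file:///tmp/hex.key src/encryptionroot

# Snapshot @222, but only send the CHILD -- the key change never reaches dst
touch /src/encryptionroot/two /src/encryptionroot/child/two
zfs snapshot -r src/encryptionroot@222
zfs unmount dst/encryptionroot/child
zfs unmount dst/encryptionroot
zfs unload-key dst/encryptionroot
zfs send -w -i src/encryptionroot/child@111 src/encryptionroot/child@222 | \
    zfs recv dst/encryptionroot/child

# Reload the key with the old passphrase (dst never saw the key change) and remount
echo "correct horse battery staple" | zfs load-key dst/encryptionroot
zfs mount dst/encryptionroot            # mounts fine
zfs mount dst/encryptionroot/child      # cannot mount: Permission denied
```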
When we run the reproducer, the root encrypted dataset dst/encryptionroot mounts successfully and we can read the old file from the first snapshot, but the child encrypted dataset dst/encryptionroot/child fails to mount with cannot mount 'dst/encryptionroot/child': Permission denied, just as we expected.
Now that we understand and can reliably reproduce the issue, we're a big step closer to fixing it!
Part 3: Recovering our data
Theoretically easy to fix
We know now that a child encrypted dataset will become unmountable if the following conditions are met:
- The wrapping encryption key on the encryption root is changed.
- A snapshot of the child encrypted dataset that was taken after the key change is sent.
- A snapshot of the encryption root that was taken after the key change is not sent.
Lesson: Always send a snapshot of the encryption root after changing the encryption key.
In theory, all we should have to do to fix it is send the latest snapshot of the encryption root.
Not so easy in practice
Unfortunately, this isn't enough to fix new and sneakernet; there are no snapshots or bookmarks left on the old encryption root from before the key change, and we can't generate an incremental send stream without one. Mapped onto our reproducer, it's as if src/encryptionroot@111 had been destroyed.
You might think we could forcibly send the entire encryption root, but zfs recv will reject it no matter what you do.
Lesson: Create bookmarks before destroying snapshots.
We need to find a way to create an incremental send stream that contains the key change, but how? We could try to manually craft a send stream containing the new key, but that sounds tricky. There's got to be a better way!
Idea for a hack
Recall that a snapshot is not the only valid source for generating an incremental send stream. What if we had a bookmark?
A bookmark works just as well as a snapshot for generating an incremental send stream, but we don't have a bookmark on old either. How is this any better?
Unlike a snapshot, which is effectively an entire dataset tree frozen in time (very complex), a bookmark is a very simple object on disk which consists of:
- The GUID of the snapshot.
- The transaction group the snapshot was created in.
- The Unix timestamp when the snapshot was created.
For example, this is what our bookmark looks like in zdb:
Note that zdb shows the GUID in hexadecimal versus zfs get guid which shows it in decimal, consistency be damned. The redaction_obj is optional and only used for redaction bookmarks, so we can ignore it.
A bookmark is simple enough that we could feasibly hack ZFS into manually writing one for us, provided that we can figure out the right values to use. The GUID and Unix timestamp don't really matter for generating an incremental send stream, so we could choose them arbitrarily if we had to, but the transaction group id really matters because that is what ZFS uses to determine which blocks to include.
But how can we figure out what transaction group the snapshot was created in if neither the snapshot nor a bookmark of the snapshot still exist? I initially considered walking the dataset trees on each pool, diffing them to find the newest block present on both datasets, and using its transaction group id, but I found a much easier way with one of ZFS's lesser known features.
A brief detour into pool histories
I didn't know about pool histories before embarking on this unplanned journey, but they are now yet another thing I love about ZFS. Every pool allocates 0.1% of its space (128 KiB minimum, 1 GiB maximum) to a ring buffer which is used to log every command that is executed on the pool. This can be used to forensically reconstruct the state of the pool over time.
ZFS also logs many internal operations in the pool history (search for spa_history_log in the source code) which can be viewed with the -i flag. For snapshots, this includes the transaction group (txg) id when the snapshot was created, which is exactly what we're looking for!
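For example, in our test environment (output format approximate), the internal history still records the txg of the snapshot even after the snapshot itself has been destroyed:

```sh
zpool history -i src | grep 'encryptionroot@111'
# 2023-06-01.12:00:00 [txg:1234] snapshot src/encryptionroot@111 (...)
```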
The GUID and creation timestamp are even easier: both are preserved by send/recv, so we can read them from the snapshot that still exists on dst.
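For example, in the test environment:

```sh
zfs get -Hp guid,creation dst/encryptionroot@111
```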
Now that we have all the values, the only missing piece is a way to manually create a bookmark containing arbitrary data.
Hacking ZFS to manually create a bookmark
To understand how ZFS creates a bookmark, we can trace the code path from zfs bookmark all the way down to dsl_bookmark_add which actually adds the bookmark node to the tree.
- command_table
- zfs_do_bookmark
- lzc_bookmark
- lzc_ioctl
- zfs_ioctl_fd
- zcmd_ioctl_compat
- zfs_ioctl_register bookmark
- zfs_ioc_bookmark
- dsl_bookmark_create
- dsl_bookmark_create_sync
- dsl_bookmark_create_sync_impl_book
- dsl_bookmark_node_add
This is the bookmark structure physically written to disk:
zfs/include/sys/dsl_bookmark.h
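Approximately, as of zfs-2.1.x (reproduced from the header above; check the actual file for the authoritative definition):

```c
typedef struct zfs_bookmark_phys {
	uint64_t zbm_guid;            /* GUID of the bookmarked snapshot */
	uint64_t zbm_creation_txg;    /* transaction group the snapshot was born in */
	uint64_t zbm_creation_time;   /* creation time of the bookmarked snapshot */
	/* everything below only exists in v2 bookmarks */
	uint64_t zbm_redaction_obj;   /* redaction list object (redaction bookmarks only) */
	uint64_t zbm_flags;
	uint64_t zbm_referenced_bytes_refd;
	uint64_t zbm_compressed_bytes_refd;
	uint64_t zbm_uncompressed_bytes_refd;
	uint64_t zbm_referenced_freed_before_next_snap;
	uint64_t zbm_compressed_freed_before_next_snap;
	uint64_t zbm_uncompressed_freed_before_next_snap;
	uint64_t zbm_ivset_guid;      /* IV set GUID, needed for raw sends */
} zfs_bookmark_phys_t;
```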
Only the first 3 fields are required for v1 bookmarks, while v2 bookmarks contain all 12 fields. dsl_bookmark_node_add only writes a v2 bookmark if any of the 9 v2 fields is non-zero, so we can leave them all zero to write a v1 bookmark.
After a few iterations, I had a patch which hijacks the normal zfs bookmark pool/dataset#src pool/dataset#dst code path to create a bookmark with arbitrary data when the source bookmark name is missing.
To test, we recompile ZFS, reload the kernel module, and reimport the pools.
Then, we create the bookmark ex nihilo using the magic bookmark name missing.
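With the patched module loaded, that looks something like this (the new bookmark's name is arbitrary; the txg, GUID, and timestamp it contains were compiled into the patch):

```sh
zfs bookmark src/encryptionroot#missing src/encryptionroot#111
```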
Success! We can now use the bookmark to generate an incremental send stream containing the new hex wrapping key parameters.
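Roughly (the stream file path is an assumption):

```sh
# Generate the raw incremental from our hand-made bookmark and inspect it
zfs send -w -i src/encryptionroot#111 src/encryptionroot@222 > /tmp/fix.zstream
zstreamdump < /tmp/fix.zstream | grep from_ivset_guid
#     from_ivset_guid = 0x0        <- a v1 bookmark carries no IV set GUID

# Unmount, unload the stale key, and try to receive it
zfs unmount dst/encryptionroot
zfs unload-key dst/encryptionroot
zfs recv -F dst/encryptionroot < /tmp/fix.zstream     # rejected
```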
But we can't receive the send stream.
The final obstacle
ZFS refuses the stream because it is missing a source IV set GUID (see from_ivset_guid = 0x0 in the zstreamdump above). This is because we created a v1 bookmark which does not contain the IV set GUID like a v2 bookmark would.
Since we know that the send stream is created using the right snapshots, we can temporarily disable checking IV set GUIDs to allow the snapshot to be received as described in errata 4.
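The check is controlled by the zfs_disable_ivset_guid_check module parameter; on FreeBSD it should be reachable as a vfs.zfs sysctl (the sysctl name below is inferred from the parameter name):

```sh
sysctl vfs.zfs.disable_ivset_guid_check=1
zfs recv -F dst/encryptionroot < /tmp/fix.zstream
sysctl vfs.zfs.disable_ivset_guid_check=0
```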
The moment of truth
And now for the moment of truth…
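In the test environment, loading the new hex key and remounting looks roughly like this (/tmp/hex.key is the key from the reproducer):

```sh
zfs load-key -L file:///tmp/hex.key dst/encryptionroot   # dst now expects the hex key
zfs mount dst/encryptionroot
zfs mount dst/encryptionroot/child      # mounts at last
ls /dst/encryptionroot/child            # the files are readable again
```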
At this point, we can now reliably fix the issue in our test environment. All we need to do now is use our hacked ZFS build to create the bookmark on old, send an incremental snapshot of the encryption root with the new key to sneakernet, and then send that snapshot from sneakernet to new. I rebuilt ZFS again with the correct transaction group, GUID, and creation timestamp for old, repeated the same steps with the names changed, and thanks to our thorough testing, it worked on the first try!
Conclusion
After a week of intense research and debugging, I had rescued our data back from the brink and could again sleep soundly at night. While I appreciated the opportunity to learn more about ZFS, I can't help but think about how this entire incident could have been avoided at several key points which translate directly into lessons learned:
- Test backups continuously so you get immediate feedback when they break.
- Wait to make all destructive changes together at the very end instead of interspersed where they could accidentally be performed in the wrong order.
- Always send a snapshot of the encryption root after changing the encryption key.
- Create bookmarks before destroying snapshots.
I hope that you may learn from my mistakes and avoid a similar incident. If you do happen to find yourself in a similar predicament, I'd love to hear from you regardless of whether this postmortem was helpful or not. My contact details can be found here.
Knowing what I now know about ZFS native encryption, I find it difficult to recommend until the sharp edges have all been filed down. In most cases, I'd prefer to encrypt the entire pool at the block device level and encrypt send streams with age. But if you really do need the flexibility offered by native encryption, always remember to mind the encryptionroot!