OpenZFS corruption: 4-year fix of non-raw snapshot send with encryption


System information

Type	Version/Name
Distribution Name	Debian
Distribution Version	Buster
Linux Kernel	5.10.0-0.bpo.5-amd64
Architecture	amd64
ZFS Version	2.0.3-1~bpo10+1
SPL Version	2.0.3-1~bpo10+1

Describe the problem you're observing

Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:

zpool status -v

      pool: rpool
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption. Applications may be affected.
    action: Restore the file in question if possible. Otherwise restore the
            entire pool from backup.
       see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
      scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May 3 16:58:33 2021
    config:

            NAME         STATE     READ WRITE CKSUM
            rpool        ONLINE       0     0     0
              nvme0n1p7  ONLINE       0     0     0

    errors: Permanent errors have been detected in the following files:

            <0xeb51>:<0x0>

Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.
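
To spot this condition from a script, the error list can be pulled out of the status text. The following is only a minimal sketch, assuming the usual multi-line layout of `zpool status -v` output; the function names and the `rpool/example@snap` dataset in the sample are hypothetical and not part of my setup.

```python
import re

# Matches the bare object-number form, e.g. "<0xeb51>:<0x0>".
HEX_ENTRY = re.compile(r"^<0x[0-9a-fA-F]+>:<0x[0-9a-fA-F]+>$")


def permanent_errors(status_text):
    """Return the entries listed under 'errors: Permanent errors ...'."""
    entries = []
    in_errors = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors:"):
            # "errors: No known data errors" leaves the section closed.
            in_errors = "Permanent errors" in stripped
            continue
        if stripped.startswith("pool:"):
            # A new pool section starts; the error list has ended.
            in_errors = False
            continue
        if in_errors and stripped:
            entries.append(stripped)
    return entries


def classify(entry):
    """'object' for a bare <0x...>:<0x...> tag, 'named' for anything else."""
    return "object" if HEX_ENTRY.match(entry) else "named"


# Hypothetical sample: one bare tag and one entry that still has a name.
SAMPLE = """\
  pool: rpool
 state: ONLINE
errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/example@snap:<0x0>
"""

if __name__ == "__main__":
    for entry in permanent_errors(SAMPLE):
        print(classify(entry), entry)
```

An entry classified as `object` corresponds to the case where the snapshot has already been destroyed and only the tag remains.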

Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:

[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P U OE 5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140] dump_stack+0x6d/0x88
[393801.328149] spl_panic+0xd3/0xfb [spl]
[393801.328153] ? __wake_up_common_lock+0x87/0xc0
[393801.328221] ? zei_add_range+0x130/0x130 [zfs]
[393801.328225] ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275] ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302] arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331] arc_read_done+0x24d/0x490 [zfs]
[393801.328388] zio_done+0x43d/0x1020 [zfs]
[393801.328445] ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502] zio_execute+0x90/0xf0 [zfs]
[393801.328508] taskq_thread+0x2e7/0x530 [spl]
[393801.328512] ? wake_up_q+0xa0/0xa0
[393801.328569] ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574] ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576] kthread+0x116/0x130
[393801.328578] ? kthread_park+0x80/0x80
[393801.328581] ret_from_fork+0x22/0x30

However, I want to stress that this backtrace is not the original cause of the problem; it appears only if I run a scrub without first rebooting.

After that panic, the scrub stalled -- and a second error appeared:

zpool status -v

      pool: rpool
     state: ONLINE
    status: One or more devices has experienced an error resulting in data
            corruption. Applications may be affected.
    action: Restore the file in question if possible. Otherwise restore the
            entire pool from backup.
       see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
      scan: scrub in progress since Sat May 8 08:11:07 2021
            152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
            0B repaired, 0.00% done, no estimated completion time
    config:

            NAME         STATE     READ WRITE CKSUM
            rpool        ONLINE       0     0     0
              nvme0n1p7  ONLINE       0     0     0

    errors: Permanent errors have been detected in the following files:

            <0xeb51>:<0x0>
            rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>

I have found the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
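
The repeated-scrub part of that workaround could be scripted roughly as follows. This is a sketch under assumptions, not what I actually run: it assumes it is started as root after the reboot into single-user mode, that `zpool scrub -w` (wait for completion) is available as in OpenZFS 2.0 and later, and that the helper names and the `max_scrubs` limit are made up for illustration. It does not handle the reboots that are sometimes needed between scrubs.

```python
import subprocess


def zpool(*args):
    """Run a zpool subcommand and return its stdout (needs root and ZFS)."""
    result = subprocess.run(
        ["zpool", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def has_permanent_errors(status_text):
    """True if the status text reports permanent (data) errors."""
    return "Permanent errors have been detected" in status_text


def scrub_until_clean(pool, max_scrubs=5, run=zpool):
    """Scrub up to max_scrubs times.

    Returns the number of scrubs it took for the error list to clear,
    or None if errors were still reported after the last scrub.
    """
    for attempt in range(1, max_scrubs + 1):
        # -w waits for the scrub to finish (assumed available, see above).
        run("scrub", "-w", pool)
        if not has_permanent_errors(run("status", "-v", pool)):
            return attempt
    return None


if __name__ == "__main__":
    count = scrub_until_clean("rpool")
    if count is None:
        print("errors still present; a reboot and more scrubs may be needed")
    else:
        print(f"error list cleared after {count} scrub(s)")
```

The `run` parameter exists only so the loop can be exercised with a stand-in function instead of the real `zpool` binary.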

I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?

It is a laptop
It uses ZFS crypto (the others use LUKS)

I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.