System information
Distribution Name | Debian |
Distribution Version | Buster |
Linux Kernel | 5.10.0-0.bpo.5-amd64 |
Architecture | amd64 |
ZFS Version | 2.0.3-1~bpo10+1 |
SPL Version | 2.0.3-1~bpo10+1 |
Describe the problem you're observing
Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:
Of note, the <0xeb51> entry is sometimes shown as a snapshot name instead; if I zfs destroy that snapshot, the name is replaced by this hex tag.
Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:
However, I want to stress that this backtrace is not the original cause of the problem; it only appears if I run a scrub without first rebooting.
After that panic, the scrub stalled -- and a second error appeared:
I have found that the workaround for this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, possibly with some reboots in between, but eventually the errors clear. If I reboot before scrubbing, I do not get the panic or the hung scrub.
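As a rough sketch, the scrub-until-clean part of that workaround could be scripted as below. This is illustrative only and not something taken from the affected system: the function names, the pool name, and the attempt limit are hypothetical, and it assumes the machine has already been rebooted into single-user mode. It relies on `zpool scrub -w` (wait for the scrub to finish, available in OpenZFS 2.0 and later) and on the `errors: No known data errors` line that `zpool status` prints for a pool with no logged errors. It does not handle the reboots that are sometimes needed between scrubs.

```shell
#!/bin/sh
# Sketch only: function names and the default attempt limit are hypothetical.

# Succeeds if the `zpool status` text on stdin reports no data errors.
pool_is_clean() {
    grep -q '^errors: No known data errors'
}

# Scrub the given pool repeatedly until `zpool status` is clean,
# giving up after a maximum number of attempts (default 5).
scrub_until_clean() {
    pool=$1
    max=${2:-5}
    i=1
    while [ "$i" -le "$max" ]; do
        # -w blocks until the scrub has completed (OpenZFS 2.0+).
        zpool scrub -w "$pool" || return 1
        if zpool status -v "$pool" | pool_is_clean; then
            echo "clean after $i scrub(s)"
            return 0
        fi
        i=$((i + 1))
    done
    echo "still reporting errors after $max scrub(s)" >&2
    return 1
}
```

It would be invoked as, for example, `scrub_until_clean mypool 3`, where `mypool` stands in for the real pool name. A non-zero exit status means either a scrub failed to start or the pool still reports errors after the last attempt, at which point another reboot may be needed.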
I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?
- It is a laptop
- It uses ZFS crypto (the others use LUKS)
I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.
Describe how to reproduce the problem
I can't reproduce it at will. I have to wait for it to happen again.
Include any warning/errors/backtraces from the system logs
See above
Potentially related bugs
- I already mentioned "permanent errors (ereport.fs.zfs.authentication) reported after syncoid snapshot/send workload" (#11688), which seems similar, but a scrub doesn't immediately resolve the issue here.
- A quite similar backtrace, also involving arc_buf_destroy, is in "silent corruption for thousands files gives input/output error but cannot be detected with scrub - at least for openzfs 2.0.0" (#11443). The behavior described there has some parallels to what I observe. I am uncertain from the discussion what that means for this issue.
- In "silent corruption gives input/output error but cannot be detected with scrub, experienced on 0.7.5 and 0.8.3 versions" (#10697), there are some similar symptoms, but it looks like a different issue to me.