Note: v3-alpha discussion (#1114)
Unfortunately GitHub made a complete mess of the PR discussion. To try to salvage things, please use #1114 for new comments. Feedback/criticism are welcome and immensely important at this stage.
Table of contents ^
- Hello!
- Wait, a disk breaking change?
- What's new?
  - Implemented
    - Efficient metadata compaction
    - Efficient random writes
    - Better logging: No more sync-padding issues
    - Efficient inline files, no more RAM constraints
    - Independent file caches
    - Easier logging APIs: lfs3_file_fruncate
    - Sparse files
    - Efficient file name lookup
    - A simpler/more robust metadata tree
    - A well-defined sync model
    - Stickynotes, no more 0-sized files
    - A new and improved compat flag system
    - Error detection! - Global-checksums
    - Better traversal APIs
    - Incremental GC
    - Better recovery from runtime errors
    - Standard custom attributes
    - More tests!
    - Simple key-value APIs
  - Planned
  - Stretch goals
    - 16-bit and 64-bit variants
    - Config API rework
    - Block device API rework
    - Custom attr API rework
    - Alternative (cheaper) write-strategies
    - Advanced file tree operations
    - Advanced file copy-on-write operations
    - Reserved blocks to prevent CoW lockups
    - Metadata checks to prevent metadata lockups
    - Integrated block-level ECC
    - Disk-level RAID
  - Out-of-scope (for now)
- Code/stack size
- Benchmarks
- Funding
- Next steps
Hello! ^
Hello everyone! As some of you may have already picked up on, there's been a large body of work fermenting in the background for the past couple of years. Originally started as an experiment to try to solve littlefs's $O(n^2)$ metadata compaction, this branch eventually snowballed into more-or-less a full rewrite of the filesystem from the ground up.
There are still several chunks of planned work left, but now that this branch has reached feature parity with v2, there's nothing really stopping it from being merged eventually.
So I figured it's a good time to start calling this v3, and put together a public roadmap.
NOTE: THIS WORK IS INCOMPLETE AND UNSTABLE
Here's a quick TODO list of planned work before stabilization. More details below:
- Test framework rework (merged, mostly)
- Rbyds
- B-trees
- M-tree
- B-shrubs
- Fruncate
- Sync model rework
- Stickynotes
- Traversal API rework
- Incremental GC
- Global-checksums
- Key-value APIs
- On-disk block-map (in progress)
- Metadata redundancy
- Data redundancy
- Document, document, document
- Stretch goals
This work may continue to break the on-disk format.
That being said, I highly encourage others to experiment with v3 where possible. Feedback is welcome, and immensely important at this stage. Once it's stabilized, it's stabilized.
To help with this, the current branch uses v0.0 as its on-disk version to indicate that it is experimental. When it is eventually released, v3 will reject this version and fail to mount.
Unfortunately, the API will be under heavy flux during this period.
A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.
Wait, a disk breaking change? ^
Yes. v3 breaks disk compatibility from v2.
I think this is a necessary evil. Attempting to maintain backwards compatibility has a heavy cost:
-
Development time - The littlefs team is ~1 guy, and v3 has already taken ~2.5 years. The extra work to make everything compatible would stretch this out much longer and likely be unsustainable.
-
Code cost - The goal of littlefs is to be, well, little. This is unfortunately in conflict with backwards compatibility.
Take the new B-tree data-structure, for example. It would be easy to support both B-tree and CTZ skip-list files, but now you need ~2x the code. This cost gets worse for the more enmeshed features, and potentially exceeds the cost of just including both v3 and v2 in the codebase.
So I think it's best for both littlefs as a project and long-term users to break things here.
Note v2 isn't going anywhere! I'm happy to continue maintaining the v2 branch, merge bug fixes when necessary, etc. But the economic reality is my focus will be shifting to v3.
What's new ^
Ok, with that out of the way, what does breaking everything actually get us?
Implemented: ^
-
Efficient metadata compaction: $O(n^2) \rightarrow O(n \log n)$ ^
v3 adopts a new metadata data-structure: Red-black-yellow Dhara trees (rbyds). Based on the data-structure invented by Daniel Beer for the Dhara FTL, rbyds extend log-encoded Dhara trees with self-balancing and self-counting (also called order-statistic) properties.
This speeds up most metadata operations, including metadata lookup ( $O(n) \rightarrow O(\log n)$ ), and, critically, metadata compaction ( $O(n^2) \rightarrow O(n \log n)$ ).
This improvement may sound minor on paper, but it's a difference measured in seconds, sometimes even minutes, on devices with extremely large blocks.
-
Efficient random writes: $O(n) \rightarrow O(\log_b^2 n)$ ^
A much requested feature, v3 adopts B-trees, replacing the CTZ skip-list that previously backed files.
This avoids needing to rewrite the entire file on random writes, bringing the performance back down into tractability.
For extra cool points, littlefs's B-trees use rbyds for the inner nodes, which makes CoW updates much cheaper than traditional array-packed B-tree nodes when large blocks are involved ( $O(n) \rightarrow O(\log n)$ ).
-
Better logging: No more sync-padding issues ^
v3's B-trees support inlining data directly in the B-tree nodes. This gives us a place to store data during sync, without needing to pad things for prog alignment.
In v2 this padding would force the rewriting of blocks after sync, which had a tendency to wreck logging performance.
-
Efficient inline files, no more RAM constraints: $O(n^2) \rightarrow O(n \log n)$ ^
In v3, B-trees can have their root inlined in the file's mdir, giving us what I've been calling a "B-shrub". This, combined with the above inlined leaves, gives us a much more efficient inlined file representation, with better code reuse to boot.
Oh, and B-shrubs also make small B-trees more efficient by avoiding the extra block needed for the root.
-
Independent file caches ^
littlefs's pcache, rcache, and file caches can be configured independently now. This should allow for better RAM utilization when tuning the filesystem.
-
Easier logging APIs: lfs3_file_fruncate ^
Thanks to the new self-counting/order-statistic properties, littlefs can now truncate from both the end and front of files via the new lfs3_file_fruncate API.
Before, the best option for logging was renaming log files when they filled up. Now, maintaining a log/FIFO is as easy as:
```c
lfs3_file_write(&lfs, &file, entry, entry_size) => entry_size;
lfs3_file_fruncate(&lfs, &file, log_size) => 0;
```
-
Sparse files ^
Another advantage of adopting B-trees, littlefs can now cheaply represent file holes, where contiguous runs of zeros can be implied without actually taking up any disk space.
Currently this is limited to a couple of operations (a rough sketch follows the list):
- lfs3_file_truncate
- lfs3_file_fruncate
- lfs3_file_seek + lfs3_file_write past the end of the file
But more advanced hole operations may be added in the future.
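For example, seeking past the end of a file and then writing implicitly creates a hole. A rough sketch, assuming an already-mounted lfs3_t named lfs, and that the lfs3 file APIs keep their v2-style signatures and LFS3_SEEK_SET-style whence flags (the API is still in flux):
```c
// a sketch: create a ~1 MiB hole by seeking past EOF before writing;
// assumes v2-style open/seek/write signatures, which may still change
lfs3_file_t file;
lfs3_file_open(&lfs, &file, "sparse", LFS3_O_WRONLY | LFS3_O_CREAT);

// seek well past the end of the (empty) file...
lfs3_file_seek(&lfs, &file, 1024*1024, LFS3_SEEK_SET);

// ...and write; the skipped-over range becomes a hole, an implied run
// of zeros that takes up no actual disk space
lfs3_file_write(&lfs, &file, "end", 3);

lfs3_file_close(&lfs, &file);
```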
-
Efficient file name lookup: $O(n) \rightarrow O(\log_b n)$ ^
littlefs now uses a B-tree (yay code reuse) to organize files by file name. This allows for much faster file name lookup than the previous linked-list of metadata blocks.
-
A simpler/more robust metadata tree ^
As a part of adopting B-trees for metadata, the previous threaded file tree has been completely ripped out and replaced with one big metadata tree: the M-tree.
I'm not sure how much users are aware of it, but the previous threaded file tree was a real pain-in-the-ass with the amount of bugs it caused. Turns out having a fully-connected graph in a CoBW filesystem is a really bad idea.
In addition to removing an entire category of possible bugs, adopting the M-tree allows for multiple directories in a single metadata block, removing the 1-dir = 1-block minimum requirement.
-
A well-defined sync model ^
One interesting thing about littlefs: it doesn't have a strictly POSIX API. This puts us in a relatively unique position, where we can explore tweaks to the POSIX API that may make it easier to write powerloss-safe applications.
To leverage this (and because the previous sync model had some real problems), v3 includes a new, well-defined sync model.
I think this discussion captures most of the idea, but for a high-level overview (a rough usage sketch follows the list):
-
Open file handles are strictly snapshots of the on-disk state. Writes to a file are copy-on-write (CoW), with no immediate effect on the on-disk state or any other file handles.
-
Syncing or closing an in-sync file atomically updates the on-disk state and any other in-sync file handles.
-
Files can be desynced, either explicitly via lfs3_file_desync, or because of an error. Desynced files do not receive sync broadcasts, and closing a desynced file has no effect on the on-disk state.
-
Calling lfs3_file_sync on a desynced file will atomically update the on-disk state, any other in-sync file handles, and mark the file as in-sync again.
-
Calling lfs3_file_resync on a file will discard its current contents and mark the file as in-sync. This is equivalent to closing and reopening the file.
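A rough usage sketch of the model above, assuming an already-mounted lfs3_t named lfs and that lfs3_file_open/close keep their v2-style signatures (which may not survive the current API flux):
```c
// a sketch of the sync model; signatures may still change
lfs3_file_t file;
lfs3_file_open(&lfs, &file, "state", LFS3_O_WRONLY | LFS3_O_CREAT);

// writes are CoW, with no effect on the on-disk state or other handles yet
const char buffer[] = "hello";
lfs3_file_write(&lfs, &file, buffer, sizeof(buffer));

// sync atomically updates the on-disk state and any other in-sync handles
lfs3_file_sync(&lfs, &file);

// desync detaches the handle, it no longer receives sync broadcasts, and
// closing it would leave the on-disk state untouched
lfs3_file_desync(&lfs, &file);

// resync discards the handle's contents and marks it in-sync again,
// equivalent to closing and reopening the file
lfs3_file_resync(&lfs, &file);

lfs3_file_close(&lfs, &file);
```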
-
Stickynotes, no more 0-sized files ^
As an extension of the littlefs's new sync model, v3 introduces a new file type: LFS3_TYPE_STICKYNOTE.
A stickynote represents a file that's in the awkward state of having been created, but not yet synced. If you lose power, stickynotes are hidden from the user and automatically cleaned up on the next mount.
This avoids the 0-sized file issue, while still allowing most of the POSIX interactions users expect.
-
A new and improved compat flag system ^
v2.1 was a bit of a mess, but it was a learning experience. v3 still includes a global version field, but also includes a set of compat flags that allow non-linear addition/removal of future features.
These are probably familiar to users of Linux filesystems, though I've given them slightly different names:
- rcompat flags - Must understand to read the filesystem (incompat_flags)
- wcompat flags - Must understand to write to the filesystem (ro_compat_flags)
- ocompat flags - No understanding necessary (compat_flags)
This also provides an easy route for marking a filesystem as read-only, non-standard, etc, on-disk.
-
Error detection! - Global-checksums ^
v3 now supports filesystem-wide error-detection. This is actually quite tricky in a CoBW filesystem, and required the invention of global-checksums (gcksums) to prevent rollback issues caused by naive checksumming.
With gcksums, and a traditional Merkle-tree-esque B-tree construction, v3 now provides a filesystem-wide self-validating checksum via lfs3_fs_cksum. This checksum can be stored external to the filesystem to provide protection against last-commit rollback issues, metastability, or just for that extra peace of mind.
Funny thing about checksums. It's incredibly cheap to calculate checksums when writing, as we're already processing that data anyways. The hard part is, when do you check the checksums?
This is a problem that mostly ends up on the user, but to help, v3 adds a large number of checksum-checking APIs (probably too many, if I'm honest; a rough sketch follows the list):
- LFS3_M_CKMETA/CKDATA - Check checksums during mount
- LFS3_O_CKMETA/CKDATA - Check checksums during file open
- lfs3_fs_ckmeta/ckdata - Explicitly check all checksums in the filesystem
- lfs3_file_ckmeta/ckdata - Explicitly check a file's checksums
- LFS3_T_CKMETA/CKDATA - Check checksums incrementally during a traversal
- LFS3_GC_CKMETA/CKDATA - Check checksums during GC operations
- LFS3_M_CKPROGS - Closed checking of data during progs
- LFS3_M_CKFETCHES - Optimistic (not closed) checking of data during fetches
- LFS3_M_CKREADS (planned) - Closed checking of data during reads
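For example, the explicit variants could be called from a low-priority background task. A sketch, assuming they take just the filesystem and return a negative error code on mismatch (the exact signatures are my guess based on the naming):
```c
// a sketch: explicit checksum checks, e.g. from a background task;
// exact signatures/error codes are assumptions and may differ
int err = lfs3_fs_ckmeta(&lfs);  // walk and verify all metadata checksums
if (err) {
    // metadata checksum mismatch (or read error), recover/alert here
}

err = lfs3_fs_ckdata(&lfs);      // same, but also verify file data
if (err) {
    // data checksum mismatch
}
```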
-
Better traversal APIs ^
The traversal API has been completely reworked to be easier to use (both externally and internally).
No more callback needed; blocks can now be iterated over via the dir-like lfs3_traversal_read function.
Traversals can also perform janitorial work and check checksums now, based on the flags provided to lfs3_traversal_open.
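As a sketch of what this looks like in use: the lfs3_traversal_t/info types, the existence of lfs3_traversal_close, and the end-of-traversal convention below are my assumptions based on the existing dir API, and may not match the final API:
```c
// a sketch of the dir-like traversal API; types and return conventions
// here are assumptions and may differ from the final API
lfs3_traversal_t trv;
lfs3_traversal_open(&lfs, &trv, LFS3_T_CKMETA);

struct lfs3_tinfo tinfo;  // hypothetical per-block info struct
while (lfs3_traversal_read(&lfs, &trv, &tinfo) > 0) {
    // tinfo describes the next in-use block; checksums are checked
    // incrementally thanks to the LFS3_T_CKMETA flag above
}

lfs3_traversal_close(&lfs, &trv);
```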
-
Incremental GC ^
GC work can now be accomplished incrementally, instead of requiring one big go. This is managed by lfs3_fs_gc, cfg.gc_flags, and cfg.gc_steps.
Internally, this just shoves one of the new traversal objects into lfs3_t. It's equivalent to managing a traversal object yourself, but hopefully makes it easier to write library code.
However, this does add a significant chunk of RAM to lfs3_t, so GC is now an opt-in feature behind the LFS3_GC ifdef.
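Roughly, in use (the gc_flags/gc_steps config fields come from this branch and may still change; this also assumes the build defines LFS3_GC and a v2-style config struct):
```c
// a sketch of opt-in incremental GC; requires building with LFS3_GC,
// and the config field names may still change before stabilization
struct lfs3_config cfg = {
    // ... the usual block device/geometry configuration ...
    .gc_flags = LFS3_GC_CKMETA, // also check metadata checksums while GCing
    .gc_steps = 16,             // rough bound on work per lfs3_fs_gc call
};
// cfg is then passed to mount as usual

// later, whenever the system has some idle time:
int err = lfs3_fs_gc(&lfs);
if (err) {
    // GC hit an error, in-RAM state reverts to the last known-good state
}
```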
-
Better recovery from runtime errors ^
Since we're already doing a full rewrite, I figured let's actually take the time to make sure things don't break on exceptional errors.
Most in-RAM filesystem state should now revert to the last known-good state on error.
The one exception involves file data (not metadata!). Reverting file data correctly turned out to roughly double the cost of files. And now that you can manually revert with lfs3_file_resync, I figured this cost just isn't worth it. So file data remains undefined after an error.
In total, these changes add a significant amount of code and stack, but I'm of the opinion this is necessary for the maturing of littlefs as a filesystem.
-
Standard custom attributes ^
Breaking disk gives us a chance to reserve attributes 0x80-0xbf for future standard custom attributes:
- 0x00-0x7f - Free for user-attributes (uattr)
- 0x80-0xbf - Reserved for standard-attributes (sattr)
- 0xc0-0xff - Encouraged for system-attributes (yattr)
In theory, it was technically possible to reserve these attributes without a disk-breaking change, but it's much safer to do so while we're already breaking the disk.
v3 also includes the possibility of extending the custom attribute space from 8-bits to ~25-bits in the future, but I'd hesitate to use this, as it risks a significant increase in stack usage.
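If the attr API ends up looking like v2's lfs_setattr/lfs_getattr (not guaranteed, a custom attr API rework is still a stretch goal), user code would simply keep its attributes in the 0x00-0x7f range:
```c
// a sketch, assuming the attr API mirrors v2's lfs_setattr/lfs_getattr,
// which is not guaranteed (a custom attr API rework is a stretch goal)
#define MYAPP_ATTR_VERSION 0x01  // a user-attribute (uattr) id in 0x00-0x7f

uint32_t version = 3;
lfs3_setattr(&lfs, "config", MYAPP_ATTR_VERSION, &version, sizeof(version));

// 0x80-0xbf is now reserved for standard-attributes (sattrs), so user
// code should avoid squatting on those ids
```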
-
More tests! ^
v3 comes with a couple more tests than v2 (+~6812.2%):
|     | suites | cases | permutations | pls     | runtime  |
|-----|--------|-------|--------------|---------|----------|
| v2  | 22     | 198   | 11641        | 28741   | 54.16s   |
| v3  | 23     | 784   | 804655       | 2228513 | 1323.18s |

You may or may not have seen the test framework rework that went curiously under-utilized. That was actually in preparation for the v3 work.
The goal is not 100% line/branch coverage, but just to have more confidence in littlefs's reliability.
-
Simple key-value APIs ^
v3 includes a couple of easy-to-use key-value APIs (a rough usage sketch follows below):
- lfs3_get - Get the contents of a file
- lfs3_size - Get the size of a file
- lfs3_set - Set the contents of a file
- lfs3_remove - Remove a file (this one already exists)
This API is limited to files that fit in RAM, but if it fits your use case, you can disable the full file API with LFS3_KVONLY to save some code.
If your filesystem fits in only 2 blocks, you can also define LFS3_2BONLY to save more code.
These can be useful for creating small key-value stores on systems that already use littlefs for other storage.
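A rough sketch of what this enables. The argument order here is my guess, following the existing attr-style APIs, and may not match the final lfs3_get/lfs3_set signatures:
```c
// a sketch of a tiny key-value store built on the new APIs;
// exact signatures are assumptions and may differ
uint32_t boot_count = 0;
lfs3_get(&lfs, "boot_count", &boot_count, sizeof(boot_count));

boot_count += 1;
lfs3_set(&lfs, "boot_count", &boot_count, sizeof(boot_count));

// lfs3_size reports the size of the underlying file
lfs3_ssize_t size = lfs3_size(&lfs, "boot_count"); // => sizeof(boot_count)
```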
Planned: ^
-
Efficient block allocation, via optional on-disk block-map (bmap) ^
The one remaining bottleneck in v3 is block allocation. This is a tricky problem for littlefs (and any CoW/CoBW filesystem), because we don't actually know when a block becomes free.
This is in-progress work, but the solution I'm currently looking at involves 1. adding an optional on-disk block map (bmap) stored in gstate, and 2. updating it via tree diffing on sync. In theory this will drop huge file writes: $O(n^2 \log n) \rightarrow O(n \log_b^2 n)$
There is also the option of using the bmap as a simple cache, which doesn't avoid the filesystem-wide scan but at least eliminates the RAM constraint of the lookahead buffer.
As a plus, we should be able to leverage the self-counting property of B-trees to make the on-disk bmap compressible.
-
Bad block tracking ^
This is a much requested feature, and adding the optional on-disk bmap finally gives us a place to track bad blocks.
-
Pre-erased block tracking ^
Just like bad-blocks, the optional on-disk bmap gives us a place to track pre-erased blocks. Well, at least in theory.
In practice it's a bit more of a nightmare. To avoid multiple progs, we need to mark erased blocks as unerased before progging. This introduces an unbounded number of catch-22s when trying to update the bmap itself.
Fortunately, if instead we store a simple counter in the bmap's gstate, we can resolve things at the mrootanchor worst case.
-
Error correction! - Metadata redundancy ^
Note it's already possible to do error-correction at the block-device level outside of littlefs, see ramcrc32cbd and ramrsbd for examples. Because of this, integrating in-block error correction is low priority.
But I think there's potential for cross-block error-correction in addition to the in-block error-correction.
The plan for cross-block error-correction/block redundancy is a bit different for metadata vs data. In littlefs, all metadata is logs, which is a bit of a problem for parity schemes. I think the best we can do is store metadata redundancy as naive copies.
But we already need two blocks for every mdir, and one usually just sits unused when not compacting. This, combined with metadata usually being much smaller than data, makes the naive scheme less costly than one might expect.
-
Error correction! - Data redundancy ^
For raw data blocks, we can be a bit more clever. If we add an optional dedup tree for block -> parity group mapping, and an optional parity tree for parity blocks, we can implement a RAID-esque parity scheme for up to 3 blocks of data redundancy relatively cheaply.
-
Transparent block deduplication ^
This one is a bit funny. Originally block deduplication was intentionally out-of-scope, but it turns out you need something that looks a lot like a dedup tree for error-correction to work in a system that allows multiple block references.
If we already need a virtual -> physical block mapping for error correction, why not make the key the block checksum and get block deduplication for free?
Though if this turns out to not be as free as I think it is, block deduplication will fall out-of-scope.
Stretch goals: ^
These may or may not be included in v3, depending on time and funding:
-
16-bit and 64-bit variants ^
-
Config API rework ^
-
Block device API rework ^
-
Custom attr API rework ^
-
Alternative (cheaper) write-strategies (write-once, global-aligned, eager-crystallization) ^
-
Advanced file tree operations (lfs3_file_punchhole, lfs3_file_insertrange, lfs3_file_collapserange, LFS3_SEEK_DATA, LFS3_SEEK_HOLE) ^
-
Advanced file copy-on-write operations (shallow lfs3_cowcopy + opportunistic lfs3_copy) ^
-
Reserved blocks to prevent CoW lockups ^
-
Metadata checks to prevent metadata lockups ^
-
Integrated block-level ECC (ramcrc32cbd, ramrsbd) ^
-
Disk-level RAID (this is just data redund + a disk aware block allocator) ^
Out-of-scope (for now): ^
If we don't stop somewhere, v3 will never be released. But these may be added in the future:
-
Alternative checksums (crc16, crc64, sha256, etc) ^
-
Feature-limited configurations for smaller code/stack sizes (LFS3_NO_DIRS, LFS3_KV, LFS3_2BLOCK, etc) ^
-
lfs3_file_openat for dir-relative APIs ^
-
lfs3_file_openn for non-null-terminated-string APIs ^
-
Filesystem shrinking ^
-
High-level caches (block cache, mdir cache, btree leaf cache, etc) ^
-
Symbolic links ^
-
100% line/branch coverage ^
Code/stack size ^
[codemap: littlefs v1, v2, and v3, 1 pixel ~= 1 byte of code; larger interactive version linked at a pinned commit]
[codemap: littlefs v2 and v3 rdonly, 1 pixel ~= 1 byte of code; larger interactive version linked at a pinned commit]
Unfortunately, v3 is a little less little than v2.
On one hand, yes, more features generally means more code.
And it's true there's an opportunity here to carve out more feature-limited builds to save code/stack in the future.
But I think it's worth discussing some of the other reasons for the code/stack increase:
-
Runtime error recovery ^
Recovering from runtime errors isn't cheap. We need to track both the before and after state of things during fallible operations, and this adds both stack and code.
But I think this is necessary for the maturing of littlefs as a filesystem.
Maybe it will make sense to add a sort of LFS3_GLASS mode in the future, but this is out-of-scope for now.
-
B-tree flexibility ^
The bad news: The new B-tree files are extremely flexible. Unfortunately, this is a double-edged sword.
B-trees, on their own, don't add that much code. They are a relatively poetic data-structure. But deciding how to write to a B-tree, efficiently, with an unknown write pattern, is surprisingly tricky.
The current implementation, what I've taken to calling the "lazy-crystallization algorithm", leans on the more complicated side to see what is possible performance-wise.
The good news: The new B-tree files are extremely flexible.
There's no reason you need the full crystallization algorithm if you have a simple write pattern, or don't care as much about performance. This will either be a future or stretch goal, but it would be interesting to explore alternative write-strategies that could save code in these cases.
-
Traversal inversion ^
Inverting the traversal, i.e. moving from a callback to an incremental state machine, adds both code and stack as 1. all of the previous on-stack state needs to be tracked explicitly, and 2. we now need to worry about what happens if the filesystem is modified mid-traversal.
In theory, this could be reverted if you don't need incremental traversals, but extricating incremental traversals from the current codebase would be an absolute nightmare, so this is out-of-scope for now.
Benchmarks ^
A note on benchmarking: The on-disk block-map is key for scalable allocator performance, so benchmarks at this stage need to be taken with a grain of salt when many blocks are involved. Please refer to this version as "v3 (no bmap)" or something similar in any published benchmarks until this work is completed.
First off, I would highly encourage others to do their own benchmarking with v3/v2. Filesystem performance is tricky to measure because it depends heavily on your application's write pattern and hardware nuances. If you do, please share in this thread! Others may find the results useful, and now is the critical time for finding potential disk-related performance issues.
Simulated benchmarks ^
To test the math behind v3, I've put together some preliminary simulated benchmarks.
Note these are simulated and optimistic. They do not take caching or hardware buffers into account, which can have a big impact on performance. Still, I think they provide at least a good first impression of v3 vs v2.
To find an estimate of runtime, I first measured the number of bytes read, progged, and erased, and then scaled based on values found in the relevant datasheets (roughly as sketched below). The options here were a bit limited, but Winbond fortunately provides runtime estimates in the datasheets on their website:
-
NOR flash - w25q64jv
-
NAND flash - w25n01gv
-
SD/eMMC - Also w25n01gv, assuming a perfect FTL
I said optimistic, didn't I? I couldn't find useful estimates for SD/eMMC, so I'm just assuming a perfect FTL here.
These also assume an optimal bus configuration, which, as any embedded engineer knows, is often not the case.
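To be explicit, the estimate is roughly (my own formulation of the scaling described above, with the per-byte and per-block times pulled from the datasheets):

$$ t \approx (\text{bytes read}) \cdot t_{\text{read/byte}} + (\text{bytes progged}) \cdot t_{\text{prog/byte}} + (\text{blocks erased}) \cdot t_{\text{erase/block}} $$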
Full benchmarks here: https://benchmarks.littlefs.org (repo, commit)
And here are the ones I think are the most interesting:
Note that SD/eMMC is heavily penalized by the lack of an on-disk block-map! SD/eMMC breaks flash down into many small blocks, which tends to make block allocator performance dominate.
-
Linear writes, where we write a 1 MiB file and don't call sync until closing the file. ^
This one is the most frustrating to compare against v2. CTZ skip-lists are really fast at appending! The problem is they are only fast at appending:
-
Random writes, note we start with a 1MiB file. ^
As expected, v2 is comically bad at random writes. v3 is indistinguishable from zero in the NOR case:
-
Logging, write 4 MiB, but limit the file to 1 MiB. ^
In v2 this is accomplished by renaming the file, in v3 we can leverage lfs3_file_fruncate.
v3 performs significantly better with large blocks thanks to avoiding the sync-padding problem:
Funding ^
If you think this work is worthwhile, consider sponsoring littlefs. Current benefits include:
- Being able to complain about v3 not being released yet
- Being able to complain about the disk breakage v2 -> v3
I joke, but I truly appreciate those who have contributed to littlefs so far. littlefs, in its current form, is a mostly self-funded project, so every little bit helps.
If you would like to contribute in a different way, or have other requests, feel free to reach me at geky at geky.net.
As stabilization gets closer, I will also be open to contract work to help port/integrate/adopt v3. If this is interesting to anyone, let me know.
Thank you @micropython, @fusedFET for sponsoring littlefs, and thank you @Eclo, @kmetabg, and @nedap for your past sponsorships!
Next steps ^
For me, I think it's time to finally put together a website/wiki/discussions/blog. I'm not sure on the frequency quite yet, but I plan to write/publish the new DESIGN.md in chapters in tandem with the remaining work.
EDIT: Pinned codemap/plot links to specific commits via benchmarks.littlefs.org/tree.html
EDIT: Updated with rdonly code/stack sizes
EDIT: Added link to #1114
EDIT: Implemented simple key-value APIs