Fast and cheap bulk storage: using LVM to cache HDDs on SSDs

Since the inception of solid-state drives (SSDs), there has been a choice to make—either use SSDs for vastly superior speeds, especially with non-sequential reads and writes (“random I/O”), or use legacy spinning rust hard disk drives (HDDs) for cheaper storage that’s a bit slow for sequential I/O and painfully slow for random I/O.

The idea of caching frequently used data on SSDs and storing the rest on HDDs is nothing new—solid-state hybrid drives (SSHDs) embodied this idea in hardware form, while filesystems like ZFS support using SSDs as L2ARC. However, with the falling price of SSDs, this no longer makes sense outside of niche scenarios with very large amounts of storage. For example, I have not needed to use HDDs in my PC for many years at this point, since all my data easily fits on an SSD.

One of the scenarios in which this makes sense is for the mirrors I host at home. Oftentimes, a project will require hundreds of gigabytes of data to be mirrored just in case anyone needs it, but only a few files are frequently accessed and could be cached on SSDs for fast access. Similarly, I run many LLMs locally with Ollama, but only use a few of them frequently. The frequently used ones can be cached, while the rest can be loaded slowly from HDD when needed.

While ZFS may seem like the obvious option here, due to Linux compatibility issues with ZFS mentioned previously, I decided to use Linux’s Logical Volume Manager (LVM) instead for this task to save myself some headache. To ensure reliable storage in the event of HDD failures, I am running the HDDs in RAID 1 with Linux’s mdadm software RAID.

This post documents how to build such a cached RAID array and explores some considerations when building reliable and fast storage.

Table of contents

  1. Why use LVM cache?
  2. A quick introduction to LVM
  3. The hardware setup
  4. Why use RAID 1 on HDDs?
  5. Setting up RAID 1 with mdadm
  6. Creating the SSD cache partition
  7. Creating a new volume group
  8. Creating the cached LV
  9. Creating a filesystem
  10. Mounting the new filesystem
  11. Monitoring
  12. Conclusion

Why use LVM cache?

There are several alternative block device caching solutions on Linux, such as:

  • bcache: a built-in Linux kernel module that provides caching similar to LVM’s. I don’t like how it takes ownership of the entire block device and relies on non-persistent sysfs configuration (whereas LVM remembers all its configuration options), nor do I enjoy hearing all the reports of bcache corrupting data; and
  • EnhanceIO: an old kernel module that does something similar to bcache and LVM cache, but hasn’t been maintained for over a decade.

Since I am very familiar with LVM and have already used it for other reasons, I opted to use LVM for this exercise as well.

A quick introduction to LVM

If you aren’t familiar with LVM, we’ll need to first introduce some concepts, or none of the LVM portions of this post will make any sense.

First, we’ll need to introduce block devices, which are just devices with a fixed number of blocks that can be read at any offset. HDDs and SSDs show up as block devices, such as /dev/sda. They can be partitioned into multiple pieces, showing up as smaller block devices such as /dev/sda1, the first partition on /dev/sda. Filesystems can be created directly on block devices, but these block devices can also be used for more advanced things like RAID and LVM.

LVM is a volume manager that allows you to create logical volumes that can be expanded much more easily than regular partitions. In LVM, there are three major entity types:

  • Physical volumes (PVs): block devices that are used as the underlying storage for LVM;
  • Logical volumes (LVs): block devices that are presented by LVM, stored on one or more PVs; and
  • Volume groups (VGs): a group of PVs on which LVs can be created.

LVs can be used just like partitions to store files, with the flexibility of being able to expand them at will while they are actively being accessed, without having to be contiguous like real partitions.
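
To make these concepts concrete, here’s a minimal sketch of the usual workflow, with hypothetical device and volume names (not the ones used later in this post):

$ sudo pvcreate /dev/sdX1            # turn a partition into a PV
$ sudo vgcreate myvg /dev/sdX1       # create a VG backed by that PV
$ sudo lvcreate -n mylv -L 10G myvg  # carve a 10 GiB LV out of the VG
$ sudo lvextend -L +5G myvg/mylv     # grow the LV later, even while in use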

There are more advanced LV types, such as thin pools, which don’t allocate space for LVs until it is actually used to store data, and cached volumes, which this post is about.

The hardware setup

For the purposes of this post, we will assume that there are two SATA HDDs (4 TB each in my case), available as block devices /dev/sda and /dev/sdb:

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sda           8:0    0  3.6T  0 disk
sdb           8:16   0  3.6T  0 disk
...

Warning: before copying any commands, ensure that you are operating on the correct device. There is no undo button for most of the commands in this post, so be very careful lest you destroy your precious data! When in doubt, run lsblk to double check!

We’ll also assume that the SSD is /dev/nvme0n1 (2 TB in my case), and we will allocate 100 GiB of it as the cache by creating a partition.

Effectively, the setup looks like this:

Diagram of the LVM cache setup

Why use RAID 1 on HDDs?

Mechanical HDDs, like everything mechanical, fail. It’s an inevitable fact of life. There are two choices here:

  1. Treat your data as ephemeral and replace it when the drive fails, accepting the inevitable downtime this causes; or
  2. Store your data in a redundant fashion (i.e. with RAID), so that it continues to be available despite drive failures.

If your data is really that unimportant, I suppose you could store it on a single drive, or even use RAID 0 to stripe it across multiple drives such that it’s lost if any one drive fails, but benefit from being able to pool all the drives together.

However, as I learned the hard way, even easily replaceable data still takes effort to replace. I once deployed this exact setup with RAID 0, and one of the constituent drives suffered a failure, causing a few files to become unreadable. While I could easily download them again, it created a lot of downtime, since I had to destroy the entire array and start over after replacing the failed drive.

This may not matter for your use case, but I would rather that my mirror experience minimal downtime in the event of a drive failure. For this reason, I chose to run the drives together in RAID 1.

Setting up RAID 1 with mdadm

One thing worth noting before we start setting up RAID is that all block devices (either whole drives or partitions) in a RAID array must be identical in size. This presents some interesting challenges, since a 4 TB HDD isn’t always exactly the same size. Normally, for a drive to be sold as “4 TB,” it has to have at least 4 000 000 000 000 bytes (that’s 4 trillion bytes), which is around 3.638 TiB in power-of-two IEC units. Typically, drives have slightly more, though the exact amount varies by manufacturer or even model.
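
You can sanity-check that conversion in a Python shell, in the same way we’ll compute sector offsets below:

>>> 4e12 / 2**40
3.637978807091713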

This poses a problem when using non-identical drive models, which you are encouraged to do: drives produced in the same batch and subjected to the same workload tend to fail around the same time, so mixing models is a good precaution against simultaneous failures. A similar problem occurs when replacing a failed drive, especially if you can’t source an identical model.

To avoid this problem, we will partition the drive and cut the data partition off exactly at the 4 TB mark. This will ensure that any “4 TB” HDD could be similarly partitioned and used as a replacement. Another reason to partition is to avoid the drive being treated as uninitialized on operating systems that don’t understand Linux’s mdadm RAID, such as Windows.

Partitioning the drives

We’ll need to do some math to figure out which 512-byte logical sector to end the partition on. For a 4 TB drive, we want to end it at the exact 4 TB mark:

>>> 4e12/512 - 1
7812499999.0

Since partition tools typically ask for the offset of the last sector to be included in the partition, we’ll need to subtract 1.

To partition the drive, we first need to wipe everything on it:

$ sudo wipefs -a /dev/sda
...
$ sudo wipefs -a /dev/sdb
...

(You can skip this if you are using a brand new drive.)

Then, create the partition with gdisk:

$ sudo gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.9

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries in memory.

Command (? for help): n
Partition number (1-128, default 1):
First sector (34-7814037134, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-7814037134, default = 7814035455) or {+-}size{KMGTP}: 7812499999
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): fd00
Changed type of partition to 'Linux RAID'

Command (? for help): c
Using 1
Enter name: cached_raid1_a

Command (? for help): p
Disk /dev/sda: 7814037168 sectors, 3.6 TiB
Model: ST4000VN008-2DR1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): [redacted]
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 1539149 sectors (751.5 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7812499999   3.6 TiB    FD00  cached_raid1_a

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/sda.
The operation has completed successfully.

Now repeat this for /dev/sdb. Note that you don’t have to name the partitions with the c command, but it makes it easier to identify which partition is which if you have a lot of drives.

The partitions /dev/sda1 and /dev/sdb1 should now be available. If not, run partprobe to reload the partition table.

Creating the mdadm RAID array

Now we can create the array on /dev/md0 by running mdadm:

$ sudo mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 3906116864K
mdadm: automatically enabling write-intent bitmap on large array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

To avoid having to assemble this array on every boot, you should declare it in /etc/mdadm/mdadm.conf. To do this, first run a command to get the definition:

$ sudo mdadm --detail --scan
ARRAY /dev/md0 metadata=1.2 name=example:0 UUID=6d539f5d:5b37:4bf0:b2d9:2af5efc99e6a

Now, append the output to /etc/mdadm/mdadm.conf.
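
One way to do that in a single step, assuming the Debian-style /etc/mdadm/mdadm.conf path used above:

$ sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf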

Then, make sure that this configuration is updated in the initrd for all kernels:

$ sudo update-initramfs -u -k all
update-initramfs: Generating /boot/initrd.img-6.1.0-34-amd64
update-initramfs: Generating /boot/initrd.img-6.1.0-33-amd64
...

The RAID 1 array on /dev/md0 is now ready to be used as a PV containing the HDD storage.

Background operations

In the background, Linux’s MD RAID driver is working hard to synchronize the two drives so that they store identical data:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sde1[1] sdd1[0]
      3906116864 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  9.7% (379125696/3906116864) finish=402.8min speed=145930K/sec
      bitmap: 29/30 pages [116KB], 65536KB chunk

unused devices: <none>

We can safely ignore this and continue. It will finish eventually.
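
If you’d rather see a structured view of the array state and resync progress than watch /proc/mdstat, you can also query the array directly:

$ sudo mdadm --detail /dev/md0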

Creating the SSD cache partition

You’ll need a partition on an SSD to serve as the cache. This needs to be a real partition, not an LVM LV, as that would involve nested LVM, which has never worked reliably in my experience, and I’ve given up trying. This is especially nasty because I also use LVM to hold virtual machine disks, and if I allowed nested LVM across the board, the host machine could access all the LVM volumes inside all the VMs, which can cause data corruption.

If you don’t have unpartitioned space lying around, you’ll need to shrink a partition and reallocate its space as a separate partition.

Calculating the size

In my case, I had two partitions on my SSD, one EFI system partition (ESP) for the bootloader, and an LVM PV covering the rest of the disk. It looks something like this:

$ sudo gdisk -l /dev/nvme0n1
...
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          206847   100.0 MiB  EF00  EFI system partition
   2          206848      3907029134   1.8 TiB    8E00  main_lvm_pv

For a 100 GiB cache, we’ll need to shrink the LVM PV by 100 GiB, and then edit the partition table. To avoid off-by-one errors, we’ll first shrink the PV by a generous 200 GiB or so, fix up the partition table, and then expand the PV to fit the new partition afterwards.

Effectively, we want to end the LVM PV at sector 3697313934, which is exactly 100 GiB worth of 512-byte sectors before the current last sector:

>>> 3907029134 - 100*1024*1024*2
3697313934

Note that we multiply by 1024 once to convert from GiB to MiB, then a second time to convert from MiB to KiB, and there are two sectors per KiB.
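
If you’re carving out a different cache size, the same arithmetic generalizes. Here’s a small helper in the same Python-shell style; the 512-byte sector size and my disk’s last sector are assumptions you should replace with your own values:

>>> def pv_last_sector(current_last, cache_gib, sector_size=512):
...     return current_last - cache_gib * 1024**3 // sector_size
...
>>> pv_last_sector(3907029134, 100)
3697313934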

Shrink existing partition data

First, shrinking the PV:

$ sudo pvresize --setphysicalvolumesize 1600G /dev/nvme0n1p2
/dev/nvme0n1p2: Requested size 1.56 TiB is less than real size <1.82 TiB. Proceed?  [y/n]: y
  WARNING: /dev/nvme0n1p2: Pretending size is 3355443200 not 3906822287 sectors.
  Physical volume "/dev/nvme0n1p2" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized

If you aren’t using LVM, but instead a regular ext4 filesystem, you can try using resize2fs, passing the size as the second positional argument. This would require you to unmount the partition first, since ext4 doesn’t have online shrinking, unlike LVM.
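
As a rough sketch of what that would look like if /dev/nvme0n1p2 hypothetically held a plain ext4 filesystem instead of a PV (resize2fs insists on a clean e2fsck before shrinking):

$ sudo umount /dev/nvme0n1p2
$ sudo e2fsck -f /dev/nvme0n1p2
$ sudo resize2fs /dev/nvme0n1p2 1600G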

Editing the partition table

Then, we edit the partition table to shrink the partition for the PV and create a new one in the freed space:

$ sudo gdisk /dev/nvme0n1
...
Command (? for help): d
Partition number (1-2): 2

Command (? for help): n
Partition number (2-128, default 2):
First sector (34-3907029134, default = 206848) or {+-}size{KMGTP}:
Last sector (206848-3907029134, default = 3907028991) or {+-}size{KMGTP}: 3697313934
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 8e00
Changed type of partition to 'Linux LVM'

Command (? for help): n
Partition number (3-128, default 3):
First sector (34-3907029134, default = 3697315840) or {+-}size{KMGTP}:
Last sector (3697315840-3907029134, default = 3907028991) or {+-}size{KMGTP}:
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 8e00
Changed type of partition to 'Linux LVM'

Command (? for help): c
Partition number (1-3): 3
Enter name: cached_cache_pv

Command (? for help): p
...
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          206847   100.0 MiB  EF00  EFI system partition
   2          206848      3697313934   1.7 TiB    8E00  main_lvm_pv
   3      3697315840      3907028991   100.0 GiB  8E00  cached_cache_pv

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/nvme0n1.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.

Note that with gdisk, changing the size of a partition requires deleting it and recreating it with the same partition number at the same starting offset. The data in the partition is unaffected.

Now, we need to notify the kernel that the partition has shrunk:

$ sudo partprobe /dev/nvme0n1

Expand shrunk partition to fit new space

Then, we can expand the PV to fit all the available space:

$ sudo pvresize /dev/nvme0n1p2
  Physical volume "/dev/nvme0n1p2" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized
$ sudo pvdisplay /dev/nvme0n1p2
  --- Physical volume ---
  PV Name               /dev/nvme0n1p2
  PV Size               1.72 TiB / not usable <3.07 MiB
  ...

As we can see, the PV size now matches the reduced size of the partition exactly. With that done, we can use /dev/nvme0n1p3 as the PV holding our SSD cache.

Creating a new volume group

Now that we have the partitions to serve as our PVs, we can create a volume group called cached:

$ sudo vgcreate cached /dev/md0 /dev/nvme0n1p3
  WARNING: Devices have inconsistent physical block sizes (4096 and 512).
  Physical volume "/dev/md0" successfully created.
  Physical volume "/dev/nvme0n1p3" successfully created.
  Volume group "cached" successfully created

Creating the cached LV

Creating a cached LV is somehow a multistep process that requires a lot of math.

Creating an LV on the HDD

First, you’ll need to create an LV containing the underlying data. Let’s put it on /dev/md0, using up all available space. You can obviously use less space if you want and expand it later. This is the command:

$ sudo lvcreate -n example -l 100%FREE cached /dev/md0
  Logical volume "example" created.

Creating the cache metadata LV

Next, we need a cache metadata volume on the SSD. 1 GiB should be plenty:

$ sudo lvcreate -n example_meta -L 1G cached /dev/nvme0n1p3
  Logical volume "example_meta" created.

Creating the cache LV

Now, we’ll need to use all remaining space on the /dev/nvme0n1p3 PV to serve as our cache. However, -l 100%FREE will not work, because creating a cache pool requires some free space for a spare pool metadata LV, used for repair operations, of exactly the same size as the metadata LV. Since our metadata LV is 256 extents long, we’ll need to check how much space is available and reduce it by 256 (adjust this if your metadata size is different):

$ sudo pvdisplay /dev/nvme0n1p3
  --- Physical volume ---
  PV Name               /dev/nvme0n1p3
  VG Name               cached
  PV Size               <100.00 GiB / not usable 3.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              25599
  Free PE               25343
  Allocated PE          256

As you can see, we have 25343 extents left. Subtracting the 256 reserved for the spare metadata leaves 25087 extents for the cache.

We can now create the actual cache LV:

$ sudo lvcreate -n example_cache -l 25087 cached /dev/nvme0n1p3
  Logical volume "example_cache" created.

Creating a cache pool

We can now merge the cache metadata and actual cache LV into a cache pool LV:

$ sudo lvconvert --type cache-pool --poolmetadata cached/example_meta cached/example_cache
  Using 128.00 KiB chunk size instead of default 64.00 KiB, so cache pool has less than 1000000 chunks.
  WARNING: Converting cached/example_cache and cached/example_meta to cache pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Do you really want to convert cached/example_cache and cached/example_meta? [y/n]: y
  Converted cached/example_cache and cached/example_meta to cache pool.

Here, we used the default chunk size chosen by LVM, but depending on the size of your files, you might benefit from a different chunk size. The lvmcache(7) man page has this to say:

The value must be a multiple of 32 KiB between 32 KiB and 1 GiB. Cache chunks bigger than 512 KiB shall be only used when necessary.

Using a chunk size that is too large can result in wasteful use of the cache, in which small reads and writes cause large sections of an LV to be stored in the cache. It can also require increasing migration threshold which defaults to 2048 sectors (1 MiB). Lvm2 ensures migration threshold is at least 8 chunks in size. This may in some cases result in very high bandwidth load of transferring data between the cache LV and its cache origin LV. However, choosing a chunk size that is too small can result in more overhead trying to manage the numerous chunks that become mapped into the cache. Overhead can include both excessive CPU time searching for chunks, and excessive memory tracking chunks.
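
If you do want to override the default, lvconvert accepts a --chunksize option at cache-pool creation time. For example, a hypothetical 256 KiB chunk size for a workload dominated by larger files might look like this:

$ sudo lvconvert --type cache-pool --chunksize 256k \
      --poolmetadata cached/example_meta cached/example_cache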

Attach the cache pool to the HDD LV

Once that’s done, we can now attach the cache pool to the underlying storage to create a cached LV:

$ sudo lvconvert --type cache --cachepool cached/example_cache cached/example
Do you want wipe existing metadata of cache pool cached/example_cache? [y/n]: y
  Logical volume cached/example is now cached.

We can now see this LV:

$ sudo lvs
  LV      VG     Attr       LSize  Pool                  Origin          Data%  Meta%  Move Log Cpy%Sync Convert
  example cached Cwi-a-C--- <3.64t [example_cache_cpool] [example_corig] 0.01   0.62            0.00
  ...

Cache modes

Note that there are several cache modes in LVM:

  • writethrough (the default): any data written to the cached LV is stored in both the cache and the underlying block device. This means that if the SSD fails for some reason, you don’t lose your data, but it also means writes are slower; and
  • writeback: data is written to cache, and after some unspecified delay, is written to the underlying block device. This means that cache drive failure can result in data loss.

Basically, use writethrough if you want your data to survive an SSD failure, or writeback if you don’t care.

Since I am using RAID 1 for reliability, it’d be pretty silly to then use writeback and risk losing the data and creating an outage, so I kept the default of writethrough.

To use writeback, you can specify --cachemode writeback during the initial lvconvert, or use sudo lvchange --cachemode writeback cached/example afterwards.
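
You can check which mode a cached LV is currently using by adding the cache_mode field to the lvs output:

$ sudo lvs -o+cache_mode cached/example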

Creating a filesystem

Now that the cached LV is created, we just have to create a filesystem on it and mount it. For this exercise, we’ll use ext4, since that’s the traditional Linux filesystem and the most well-supported. I wouldn’t recommend using something like btrfs or ZFS since they are designed to access raw drives.

Creating an ext4 filesystem is simple:

$ sudo mkfs.ext4 /dev/cached/example
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done
Creating filesystem with 976528384 4k blocks and 244137984 inodes
Filesystem UUID: bb93c359-1915-4f09-b23f-2f3a5e8b8663
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848, 512000000, 550731776, 644972544

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

Mounting the new filesystem

Now, we need to mount it. We could just run mount, but it makes more sense to define a permanent place for it in /etc/fstab. For this exercise, let’s mount it at /example.

First, we create the /example mount point:

$ sudo mkdir /example

Then, we add the following line to /etc/fstab:

/dev/cached/example /example ext4 rw,noatime 0 2
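
If you prefer not to reference the device-mapper path, you can mount by filesystem UUID instead; the UUID was printed by mkfs.ext4 above and can also be read with blkid:

$ sudo blkid /dev/cached/example

UUID=bb93c359-1915-4f09-b23f-2f3a5e8b8663 /example ext4 rw,noatime 0 2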

Now, let’s mount it:

$ sudo systemctl daemon-reload
$ sudo mount /example
$ ls /example
lost+found

And there we have it. Our new cached LV is mounted on /example, and the default ext4 lost+found directory is visible. Now you can store anything you want in /example.

Monitoring

You can find most cache metrics by running lvdisplay on the cached LV:

$ sudo lvdisplay /dev/cached/example
  --- Logical volume ---
  LV Path                /dev/cached/example
  LV Name                example
  VG Name                cached
  ...
  LV Size                <3.64 TiB
  Cache used blocks      8.40%
  Cache metadata blocks  0.62%
  Cache dirty blocks     0.00%
  Cache read hits/misses 84786 / 40435
  Cache wrt hits/misses  222496 / 1883192
  Cache demotions        0
  Cache promotions       67420
  Current LE             953641
  ...
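
For a quick hit-rate figure, you can also scrape that output. This is just a sketch that parses lvdisplay’s human-readable text, so it will break if the field layout ever changes:

$ sudo lvdisplay /dev/cached/example \
    | awk '/Cache read hits\/misses/ { printf "read hit rate: %.1f%%\n", 100*$4/($4+$6) }'
read hit rate: 67.7%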

Conclusion

In the previous iteration of this setup, before the drive failure, I was able to hit a read cache hit rate of over 95% while storing a mix of mirrors and LLMs, with most of the files read very infrequently. If you have a similar workload, LVM caching is probably highly beneficial.

Note that this technique isn’t limited to caching HDDs. Another possible application is in the cloud, where you frequently have access to very large but slow block storage over the network and fast but small local storage. You can use LVM cache in this scenario too, using the local storage to cache the slower networked block device.

I hope this was helpful and you learned something about LVM. See you next time!
