Trying out the OrangePi RV2 and fixing a kernel bug


I picked up the Orange Pi RV2 off AliExpress for $61. This board interests me because it is a single board computer with an 8-core CPU based on the RISC-V ISA. It also has two Ethernet interfaces, two M.2 slots and USB 3.0. This board has the potential to be really amazing. There is even integrated WiFi, although I did not really test that.

One point I would like to make before going further is that this board absolutely requires active cooling. I tried using it without any and the board gets extremely hot and exhibits unpredictable behavior. You do not need to add a heatsink, but I pointed a fan at it like this to keep it cool. If you're curious, I used a power supply from CanaKit to run this board for most testing. The board lists a 5 volt 5 amp requirement, but I'm pretty sure that is based on the board having tons of USB devices plugged in drawing current.

A desk fan was used to keep the board cool while testing

I chose to boot the board off an SD Card I already had. You can download the official images here. I downloaded what they referred to as the "Ubuntu Image" that is based on Ubuntu Noble and is the server image. The SHA256 of the image ubuntu-24.04.2-preinstalled-server-riscv64.img is b8cf9717b2bee32cca3eb27e73b1663d36135a1185421a41397aa6f3097ecc27. I copied this to an SD card and the board booted up first try. The image automatically logs into the graphical console on the HDMI output with a user of "orangepi". The password for this user is orangepi. You can use sudo to add another user and add yourself to the sudoers file. Once you do that lock the original user with sudo passwd --lock orangepi so it cannot be used any longer.

I did try the desktop image but never got it to boot at all. You can just do something like sudo apt-get install xubuntu-desktop from the command line on the server image to change it into a desktop.

The good news is this is a fully functional system based off Ubuntu Noble (24.04) that can do totally normal things like install software using apt-get. This is unfortunately where the easy part ends.

Once I had the board running I was able to do sudo apt-get update && sudo apt-get full-upgrade. This works with no issues. The bad news is the board never reboots. No amount of power cycling or anything ever gets it to boot again. If I re-image the SD card from the downloaded image it boots up immediately. So my conclusion is that the upgrade process messes up the bootloader or something. I noticed update-initramfs was being run during the upgrade. I also noticed the kernel is reported as Linux orangepirv2 6.6.63-ky which appears to be non-standard for Ubuntu Noble.

With that in mind I did the following to prevent the kernel or bootloader from being upgraded

apt-mark hold u-boot-tools linux-base linux-libc-dev binutils-riscv64-linux-gnu linux-u-boot-orangepirv2-current initramfs-tools initramfs-tools-bin initramfs-tools-core

This should work, but I really don't want update-initramfs to run and potentially wipe out a working system. I think what someone did was start from Ubuntu Noble but modify it in some non-standard way with the kernel compiled for this board. So I did this

cp /usr/sbin/update-initramfs /usr/sbin/update-initramfs.backup
chmod 644 /usr/sbin/update-initramfs /usr/sbin/update-initramfs.backup
echo -e '#!/bin/bash\nexit 0' > /usr/sbin/update-initramfs
chmod 755 /usr/sbin/update-initramfs

This replaces the original update-initramfs with a script that does nothing but appears to work. After this I was able to do sudo apt-get update && sudo apt-get full-upgrade and get software updates. Obviously the kernel did not upgrade, but that seemed to be the root cause of the issue. Let me point out here that the board does not restart when I do shutdown -r now, but manually power cycling it after shutting down does work. More on that later.

Changing the repositories

At this point I noticed all the upgrades were being downloaded from Huawei servers, which are very slow from the US. So I did the following to disable those repositories

mv /etc/apt/sources.list.d/docker.list /etc/apt
mv /etc/apt/sources.list /etc/apt/sources.list.backup
truncate -s 0 /etc/apt/sources.list

Next I edited /etc/apt/sources.list.d/ubuntu.sources and set it to

Types: deb
URIs: http://ports.ubuntu.com/ubuntu-ports
Suites: noble noble-updates noble-backports
Components: main universe restricted multiverse
Signed-By: /usr/share/keyrings/ubuntu-archive-keyring.gpg

## Ubuntu security updates. Aside from URIs and Suites,
## this should mirror your choices in the previous section.
Types: deb
URIs: http://ports.ubuntu.com/ubuntu-ports
Suites: noble-security
Components: main universe restricted multiverse
Signed-By: /usr/share/keyrings/ubuntu-archive-keyring.gpg

This pulls from the standard Ubuntu mirror for RISC-V packages, which is faster since I am in the United States.

There are a bunch of services that fail to start. It looks like someone tried to set up compressed memory (the service is named orangepi-zram-config, so this is zram rather than zswap) but did not get it working. I have the 8 GB memory version of the board anyway so I don't care. The dnsmasq and smartmontools services are also trying to start. So I disabled all of those

systemctl disable dnsmasq
systemctl disable orangepi-zram-config
systemctl disable smartmontools

I also noticed that the "swappiness" of the system was set to 100, probably related to the failed compressed-memory configuration. So I added vm.swappiness=0 to the end of /etc/sysctl.conf to avoid swap usage unless absolutely needed.

The board has two M.2 PCIe slots. The manual lists these as "2 * PCIe 2.0 x 2 lanes, used for connecting NVMe SSD solid state drives". Since these are PCI express 2.0 by 2 lanes, this should be capable of 1 gigabyte per second read and write speeds.

The one on the back is a 2280 slot. I was able to fit a "10Gtek M.2 to SATA Adapter" in this slot. I was particularly interested in this since it should allow the board to be used as a NAS if I want to. This device identified as a 0001:01:00.0 SATA controller: ASMedia Technology Inc. ASM1166 Serial ATA Controller (rev 02) in the output of lspci. It worked without any real issue. What I did was connect up two external SATA drives I needed to wipe anyways to this and used dd to zero the entire drives. Power was supplied through an ATX power supply jumped on to run just the drives.

Orange Pi RV2 with the SATA controller

On the front side there is an M.2 PCIe slot in the 2230 size. I got a 256 GB NVMe drive and fitted it to that location. This showed up as 0002:01:00.0 Non-Volatile memory controller: Kingston Technology Company, Inc. OM3PDP3 NVMe SSD (rev 01). I was able to partition the drive with GPT, create an ext4 filesystem and mount it.

Orange Pi RV2 with a NVMe drive

I did a quick test of the write speed

$ dd bs=1M count=5000 if=/dev/zero of=test_file
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 15.921 s, 329 MB/s

And also the read speed

# echo 1 > /proc/sys/vm/drop_caches
# echo 2 > /proc/sys/vm/drop_caches
# dd bs=1M count=5000 if=test_file of=/dev/null
5000+0 records in
5000+0 records out
5242880000 bytes (5.2 GB, 4.9 GiB) copied, 8.15448 s, 643 MB/s

Neither result is terribly impressive, but this is far faster than any kind of SD card and just as convenient to use on this board.

Overall I think this is one of the coolest features of this board. Having two M.2 slots you can add high speed accessories without any additional adapters, which makes this much closer to a real computer. There isn't really much more to say here as everything just worked as I expected on the first attempt.

Even after modifying the OS to avoid updating the kernel and boot loader I never got it to reboot on its own. It always takes at least one power cycle. Interestingly, the board appears to shut down just fine.

The fact that the board cannot reboot is problematic if you intend to use the board as a remote server or similar role. I am sure if I had the designer of both the CPU and the board in the same room the three of us could figure out the issue. But I don't have that information. So I decided to see if I could work around this by using the system watchdog. I checked to see if it is loaded

ericu@orangepirv2:~$ ls -l /dev/watchdog*
crw------- 1 root root 10, 130 Jul 16 13:31 /dev/watchdog
crw------- 1 root root 246, 0 Jul 16 13:31 /dev/watchdog0

So the kernel is reporting the watchdog is available. Running lsmod shows me this

$ lsmod
Module                  Size  Used by
binfmt_misc            98304  1
sch_fq_codel          135168  4
bcmdhd               3321856  0
dhd_static_buf         32768  1 bcmdhd
ip_tables             122880  0
autofs4               315392  2
btrfs               14213120  0
blake2b_generic       131072  0
motorcomm             151552  2

Since the softdog module is not listed, these device file entries should correspond to some kind of hardware watchdog. In fact, checking /boot/config-6.6.63-ky shows # CONFIG_SOFT_WATCHDOG is not set. The idea here is that once the hardware watchdog is started, it automatically reboots the system unless it is pinged again at a regular interval. Most computing systems have something like this built in at a hardware level. But I have no idea what is responsible for implementing this. Let's hope for the best. What I did was install the watchdog package by doing the following

apt-get install watchdog
systemctl enable watchdog
systemctl start watchdog

This should get the watchdog daemon running, which should activate the watchdog. Next I ran these commands

sync
kill -9 `pidof watchdog`
touch /dev/watchdog
touch /dev/watchdog0

From what I can see, this does absolutely nothing. On every system with a functioning watchdog, this should result in the watchdog getting triggered after a while and the system rebooting. The Orange Pi RV2 doesn't reboot. In fact, it continues running just fine. So my theory is at least some of the following must be true

  1. The necessary software to reboot the board is not implemented
  2. The watchdog support in the kernel is not properly implemented
  3. The watchdog support in the kernel does not correspond to real hardware

Before doing more tests I did systemctl disable watchdog to stop the watchdog from being initialized during the boot process. I looked into the kernel configuration again and found these are the only options set for watchdogs

CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
CONFIG_WATCHDOG_OPEN_TIMEOUT=0
CONFIG_STM32_WATCHDOG=y
CONFIG_KY_WATCHDOG=y

The configuration CONFIG_KY_WATCHDOG does not appear to be something you'd find in a typical kernel from kernel.org. The only reference I found to that is from this repository. In the file drivers/watchdog/Makefile I found the line obj-$(CONFIG_KY_WATCHDOG) += x1_wdt.o. This presumably enables compilation of drivers/watchdog/x1_wdt.c which is completely absent from the normal Linux 6.6.63 kernel. That file contains this line MODULE_DESCRIPTION("Ky x1-plat Watchdog Device Driver");, so this is obviously a watchdog kernel module that has been added to the kernel. The same source code file contains this line of code

pr_info("System boots up not because of SoC watchdog reset.\n");

In the output of the dmesg command I found this line being written

[ 15.688070] System boots up not because of SoC watchdog reset.

So this is the right kernel module. I tried using wdctl to get more information but that got me nowhere

ericu@orangepirv2:/usr/lib/modules/6.6.63-ky/kernel/drivers$ sudo wdctl /dev/watchdog
wdctl: cannot read information about /dev/watchdog: No such file or directory
ericu@orangepirv2:/usr/lib/modules/6.6.63-ky/kernel/drivers$ sudo wdctl /dev/watchdog0
wdctl: cannot read information about /dev/watchdog0: No such file or directory

After digging through the manual in PDF form I found section 3.18 "Hardware watchdog test". It describes using a program "watchdog_test" which is found at /usr/local/bin/watchdog_test. I ran that command to see what it does.

ericu@orangepirv2:~$ sudo watchdog_test 10
[sudo] password for ericu:
open success
options is 32896,identity is X1 Watchdog
put_usr return,if 0,success:0
The old reset time is: 100
return ENOTTY,if -1,success:0
return ENOTTY,if -1,success:0
put_user return,if 0,success:0
put_usr return,if 0,success:0
keep alive
keep alive

The program expects to receive input on standard input, which in this case is just the terminal I am using to run it. Each time I pressed a key it printed keep alive. If this input stops, it should stop pinging the watchdog. At this point the board should reboot. This never happens so there is obviously a bug here. I checked the output of the command dmesg at this point.

[ 2387.436663] BUG: spinlock bad magic on CPU#4, watchdog_test/2515
[ 2387.442812] lock: reboot_lock+0x0/0x18, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[ 2387.451340] CPU: 4 PID: 2515 Comm: watchdog_test Not tainted 6.6.63-ky #1.0.0
[ 2387.458549] Hardware name: ky x1 orangepi-rv2 board (DT)
[ 2387.463915] Call Trace:
[ 2387.466365] [<ffffffff80006638>] dump_backtrace+0x1c/0x24
[ 2387.471854] [<ffffffff80f21b9c>] show_stack+0x2c/0x38
[ 2387.476979] [<ffffffff80f39684>] dump_stack_lvl+0x3c/0x54
[ 2387.482472] [<ffffffff80f396b0>] dump_stack+0x14/0x1c
[ 2387.487562] [<ffffffff80073b5c>] spin_bug+0x90/0xa0
[ 2387.492484] [<ffffffff80073c42>] do_raw_spin_lock+0xd6/0x112
[ 2387.498201] [<ffffffff80f43c06>] _raw_spin_lock+0x1a/0x22
[ 2387.503652] [<ffffffff80ac1de6>] spa_wdt_ping+0x1e/0xec
[ 2387.508931] [<ffffffff80ac0702>] __watchdog_ping+0x44/0x192
[ 2387.514564] [<ffffffff80ac088a>] watchdog_ping+0x3a/0x46
[ 2387.519890] [<ffffffff80ac1044>] watchdog_ioctl+0x10a/0x5b4
[ 2387.525511] [<ffffffff80241bd6>] __riscv_sys_ioctl+0x8a/0xb2
[ 2387.531208] [<ffffffff80f3a178>] do_trap_ecall_u+0x116/0x12a
[ 2387.536933] [<ffffffff80f445f6>] ret_from_exception+0x0/0x6e

I then ran the command again, but never provided it any input on the terminal session.

ericu@orangepirv2:~$ sudo watchdog_test 10
[sudo] password for ericu:
open success
options is 32896,identity is X1 Watchdog
put_usr return,if 0,success:0
The old reset time is: 100
return ENOTTY,if -1,success:0
return ENOTTY,if -1,success:0
put_user return,if 0,success:0
put_usr return,if 0,success:0

After waiting a while, the board rebooted on its own. My conclusion at this point is that the watchdog does not work properly, but it can be used to at least reboot the board. So far the behavior I can confirm is

Command tried                              Result
shutdown -r now                            tries to reboot but never comes back up
watchdog_test 10 with user input           nothing at all happens
watchdog_test 10 without user input        reboot successful
manually interacting with /dev/watchdog    nothing at all happens

So what is this program watchdog_test and what does it do? I inspected it with the file command and got this result

$ file watchdog_test
watchdog_test: ELF 64-bit LSB pie executable, UCB RISC-V, RVC, double-float ABI, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux-riscv64-lp64d.so.1, BuildID[sha1]=567ed09a0bee94bf92fb850eb111d5431e43e1d7, for GNU/Linux 4.15.0, not stripped

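The "for GNU/Linux 4.15.0" tag makes it look like this program was compiled quite a while ago, but as far as I can tell it only records the minimum kernel version the C library supports (4.15 is the first kernel release with RISC-V support), so it does not say much about the build date. Thankfully the binary is not stripped, so it still has its symbols. I don't know anything about 64-bit RISC-V assembly, but Ghidra does. After about 30 minutes with Ghidra, I was able to produce this decompiled source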

long Init(void)
{
  long lVar1;

  gp = &__global_pointer$;
  lVar1 = open_wrapper("/dev/watchdog",2);
  if (lVar1 < 0) {
    puts_wrapper("device open fail");
    lVar1 = -1;
  }
  else {
    puts_wrapper("open success");
  }
  return lVar1;
}

int main(int argc,char **argv)
{
  int unknown0;
  int inner_keepalive_loop_control;
  int uVar1;
  undefined4 in_register_00002054;
  undefined8 ioctl_result_1;
  int iStack_fc;
  int s_watchdog_info [2];
  char watchdog_info [32];
  undefined8 s_termios_start;
  // snip
  long ioctl_result_0;
  long watchdog_fd;

  gp = &__global_pointer$;
  if (CONCAT44(in_register_00002054,argc) == 2) {
    watchdog_fd = Init();
    ioctl_wrapper(0x80045704,2);
    ioctl_wrapper(watchdog_fd,0x80285700,s_watchdog_info);
    printf_wrapper("options is %d,identity is %s\n",(long)s_watchdog_info[0],watchdog_info);
    ioctl_result_0 = ioctl_wrapper(watchdog_fd,0x80045707,&iStack_fc);
    printf_wrapper("put_usr return,if 0,success:%d\n",ioctl_result_0);
    printf_wrapper("The old reset time is: %d\n",(long)iStack_fc);
    iStack_fc = 1;
    ioctl_result_1 = ioctl_wrapper(watchdog_fd,0x80045704,&iStack_fc);
    printf_wrapper("return ENOTTY,if -1,success:%d\n",ioctl_result_1);
    iStack_fc = 2;
    ioctl_result_1 = ioctl_wrapper(watchdog_fd,0x80045704,&iStack_fc);
    printf_wrapper("return ENOTTY,if -1,success:%d\n",ioctl_result_1);
    iStack_fc = strtol_wrapper(argv[1],0,10);
    ioctl_result_1 = ioctl_wrapper(watchdog_fd,0xc0045706,&iStack_fc);
    printf_wrapper("put_user return,if 0,success:%d\n",ioctl_result_1);
    ioctl_result_1 = ioctl_wrapper(watchdog_fd,0x80045707,&iStack_fc);
    printf_wrapper("put_usr return,if 0,success:%d\n",ioctl_result_1);
    do {
      do {
        usleep_wrapper(100000);
        tcgetattr_wrapper(0,&s_termios_start);
        // snip
        tcsetattr_wrapper(0,0,&uStack_90);
        inner_keepalive_loop_control._0_1_ = getc_wrapper(_stdin);
        tcsetattr_wrapper(0,0,&s_termios_start);
      } while ((char)inner_keepalive_loop_control == '\x1b');
      puts_wrapper("keep alive ");
      ioctl_wrapper(watchdog_fd,0x80045705,0);
    } while( true );
  }
  puts_wrapper("Usage : ./watchdog 10");
  return -1;
}

This doesn't recreate the exact source, but that is fine. What we can see is all this program really does is open /dev/watchdog and start using ioctl on it. The values such as 0x80045707 are just constants from the kernel identifying the specific ioctl to perform. The values 1 and 2 passed to WDIOC_SETOPTIONS are WDIOS_DISABLECARD and WDIOS_ENABLECARD, which turn the watchdog off and on respectively. The ioctl constants are

// copied from include/uapi/linux/watchdog.h in the source tree
#define WDIOC_GETSUPPORT     _IOR(WATCHDOG_IOCTL_BASE, 0, struct watchdog_info)
#define WDIOC_GETSTATUS      _IOR(WATCHDOG_IOCTL_BASE, 1, int)
#define WDIOC_GETBOOTSTATUS  _IOR(WATCHDOG_IOCTL_BASE, 2, int)
#define WDIOC_GETTEMP        _IOR(WATCHDOG_IOCTL_BASE, 3, int)
#define WDIOC_SETOPTIONS     _IOR(WATCHDOG_IOCTL_BASE, 4, int)
#define WDIOC_KEEPALIVE      _IOR(WATCHDOG_IOCTL_BASE, 5, int)
#define WDIOC_SETTIMEOUT     _IOWR(WATCHDOG_IOCTL_BASE, 6, int)
#define WDIOC_GETTIMEOUT     _IOR(WATCHDOG_IOCTL_BASE, 7, int)
#define WDIOC_SETPRETIMEOUT  _IOWR(WATCHDOG_IOCTL_BASE, 8, int)
#define WDIOC_GETPRETIMEOUT  _IOR(WATCHDOG_IOCTL_BASE, 9, int)
#define WDIOC_GETTIMELEFT    _IOR(WATCHDOG_IOCTL_BASE, 10, int)

The initial setup of the program can be summarized as

  1. open /dev/watchdog
  2. call ioctl WDIOC_SETOPTIONS with a value of 2
  3. call ioctl WDIOC_GETSUPPORT
  4. call ioctl WDIOC_GETTIMEOUT
  5. call ioctl WDIOC_SETOPTIONS with a value of 1
  6. call ioctl WDIOC_SETOPTIONS with a value of 2
  7. call ioctl WDIOC_SETTIMEOUT with the parsed command line argument as the timeout value
  8. call ioctl WDIOC_GETTIMEOUT

Next the program enters a loop. This loop does some stuff with the terminal, reads a character and then calls ioctl WDIOC_KEEPALIVE. This continues forever. This program is actually just a generic program for interacting with the Linux kernel watchdog via the ioctl calls. This should really work on any system with a functional watchdog. So I wrote a basic program like this, which should just set the timeout and exit. This should trigger the watchdog to perform a system reboot.

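#include <fcntl.h>          /* open, O_RDWR */
#include <stdlib.h>         /* exit */
#include <sys/ioctl.h>      /* ioctl */
#include <linux/watchdog.h> /* WDIOC_* constants */

int main(int argc, char ** argv) {
  int fd = open("/dev/watchdog", O_RDWR);
  if (-1 == fd) {
    exit(1);
  }

  /* 2 is WDIOS_ENABLECARD: turn the watchdog on */
  int option = 2;
  if (0 != ioctl(fd, WDIOC_SETOPTIONS, &option)) {
    exit(4);
  }

  /* timeout in seconds */
  int interval = 10;
  if (0 != ioctl(fd, WDIOC_SETTIMEOUT, &interval)) {
    exit(2);
  }

  /* 1 is WDIOS_DISABLECARD: turn it off, then back on as watchdog_test does */
  option = 1;
  if (0 != ioctl(fd, WDIOC_SETOPTIONS, &option)) {
    exit(3);
  }

  option = 2;
  if (0 != ioctl(fd, WDIOC_SETOPTIONS, &option)) {
    exit(4);
  }

  /* shorten the timeout to 1 second and exit without sending a keep alive */
  int timeout = 1;
  if (0 != ioctl(fd, WDIOC_SETTIMEOUT, &timeout)) {
    exit(4);
  }

  return 0;
}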

This program runs but does absolutely nothing. The board never reboots. Checking dmesg again

BUG: spinlock bad magic on CPU#5, my_watchdog/2528
lock: reboot_lock+0x0/0x18, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
CPU: 5 PID: 2528 Comm: my_watchdog Not tainted 6.6.63-ky #1.0.0
Hardware name: ky x1 orangepi-rv2 board (DT)
Call Trace:
[<ffffffff80006638>] dump_backtrace+0x1c/0x24
[<ffffffff80f21b9c>] show_stack+0x2c/0x38
[<ffffffff80f39684>] dump_stack_lvl+0x3c/0x54
[<ffffffff80f396b0>] dump_stack+0x14/0x1c
[<ffffffff80073b5c>] spin_bug+0x90/0xa0
[<ffffffff80073c42>] do_raw_spin_lock+0xd6/0x112
[<ffffffff80f43c06>] _raw_spin_lock+0x1a/0x22
[<ffffffff80ac1de6>] spa_wdt_ping+0x1e/0xec
[<ffffffff80ac0702>] __watchdog_ping+0x44/0x192
[<ffffffff80ac088a>] watchdog_ping+0x3a/0x46
[<ffffffff80ac1044>] watchdog_ioctl+0x10a/0x5b4
[<ffffffff80241bd6>] __riscv_sys_ioctl+0x8a/0xb2
[<ffffffff80f3a178>] do_trap_ecall_u+0x116/0x12a
[<ffffffff80f445f6>] ret_from_exception+0x0/0x6e

So my program triggers the same behavior as watchdog_test. My conclusion here is that there is a kernel bug. Looking back into drivers/watchdog/x1_wdt.c I find that it does reference the indicated reboot_lock in the spa_wdt_ping function as the stack trace indicates.

/* from drivers/watchdog/x1_wdt.c */
static int spa_wdt_ping(struct watchdog_device *wdd)
{
  int ret = 0;
  struct spa_wdt_info *info = container_of(wdd, struct spa_wdt_info, wdt_dev);

  spin_lock(&reboot_lock);
  spin_lock(&info->wdt_lock);

  /* reset counter */
  if (wdd->timeout > 0) {
    spa_wdt_write(info, WDT_WCR, 0x1);
  } else
    ret = -EINVAL;

  spin_unlock(&info->wdt_lock);
  spin_unlock(&reboot_lock);

  return ret;
}

I found an unrelated kernel discussion with the same error and a patch the discussion referred to. All this patch does is add a call to spin_lock_init, which is used to initialize a spin lock. Looking at the implementation, the spin lock referenced by &info->wdt_lock has a call to spin_lock_init but there is none for &reboot_lock. So the implementer just forgot to initialize the spin lock. I looked at the declaration and it is local to the file, declared as static spinlock_t reboot_lock;. None of the other watchdog drivers have this convention that I can see. The intention of this lock is to prevent the functions spa_wdt_ping and spa_wdt_restart_handler from executing concurrently. The spa_wdt_ping function is used to keep the watchdog alive, to prevent a reset. The other function appears to be used to execute an intentional restart of the board. It ends with the line pr_err("reboot system failed: this line shouldn't appear.\n"); which led me to that conclusion.

My thought here was just to move the spin lock into struct spa_wdt_info, which is already available in both contexts. But this already has a field spinlock_t wdt_lock, which is already used by spa_wdt_ping as well. This lock should be adequate to prevent concurrent execution of the two functions. At this point I made the decision to just remove the reboot_lock and rely entirely on the existing lock in the struct spa_wdt_info. Since this component is not built as a kernel module, I guess I am rebuilding the kernel to fix this. What I did was

  1. Clone this project that is a fork of the Linux kernel
  2. Checkout the orange-pi-6.6-ky branch
  3. Export the entire thing as a tar file using git archive --prefix=linux-orangepi/ --format=tar HEAD > linux-orangepi.tar
  4. Copy the tar file to the OrangePi RV2 and expand it
  5. Inside the kernel tree, run cp /boot/config-6.6.63-ky .config to start from the existing configuration
  6. Edit Makefile and set EXTRAVERSION = hydrogen18 at the top
  7. Run make menuconfig and save the configuration
  8. Run make -j 8 CFLAGS='-march=native -O3' CXXFLAGS='-march=native -O3'

These steps are based off this guide.

The last step will take a very long time, over an hour in my case. I did this on the NVMe drive I added, I really don't think you should try this on an SD card. I'm sure it is possible to cross compile this kernel but I have no real reason to bother. This is after all a fully featured 8-core computer. After this completes you can run sudo make modules_install && sudo make install. This updated /boot with new symbolic links but did not actually create everything. It was at this point I realized I probably needed update-initramfs to be useable again. I copied the backup back to the original location and ran sudo update-initramfs -v -c -k 6.6.63hydrogen18. There is a bunch of output about modules not found, but they appear to be related to things like vboxvideo which aren't needed in the context of a single board computer. At this point my /boot looks like this

ericu@orangepirv2:/hotels/ericu$ ls -lt /boot/
total 254368
lrwxrwxrwx 1 root root 24 Jul 20 01:03 uInitrd -> uInitrd-6.6.63hydrogen18
-rw-r--r-- 1 root root 68372247 Jul 20 01:03 uInitrd-6.6.63hydrogen18
-rw-r--r-- 1 root root 68372183 Jul 20 01:02 initrd.img-6.6.63hydrogen18
lrwxrwxrwx 1 root root 27 Jul 20 00:41 initrd.img -> initrd.img-6.6.63hydrogen18
lrwxrwxrwx 1 root root 20 Jul 20 00:41 initrd.img.old -> initrd.img-6.6.63-ky
lrwxrwxrwx 1 root root 24 Jul 20 00:41 vmlinuz -> vmlinuz-6.6.63hydrogen18
lrwxrwxrwx 1 root root 17 Jul 20 00:41 vmlinuz.old -> vmlinuz-6.6.63-ky
-rw-r--r-- 1 root root 261577 Jul 20 00:41 config-6.6.63hydrogen18
-rw-r--r-- 1 root root 5839742 Jul 20 00:41 System.map-6.6.63hydrogen18
-rw-r--r-- 1 root root 34558464 Jul 20 00:41 vmlinuz-6.6.63hydrogen18
-rw-rw-r-- 1 root root 142 Jul 11 03:00 orangepiEnv.txt
-rw-r--r-- 1 root root 20490121 Mar 12 01:28 uInitrd-6.6.63-ky
-rw-r--r-- 1 root root 20490057 Mar 12 01:28 initrd.img-6.6.63-ky
-rw-rw-r-- 1 root root 2544 Mar 12 01:27 boot.scr
-rw-rw-r-- 1 root root 1542 Mar 12 01:26 orangepi_first_run.txt.template
-rwxrwxr-x 1 root root 1152056 Mar 12 01:26 logo.bmp
-rw-rw-r-- 1 root root 230456 Mar 12 01:26 boot.bmp
lrwxrwxrwx 1 root root 13 Mar 12 01:25 dtb -> dtb-6.6.63-ky
drwxr-xr-x 3 root root 4096 Mar 12 01:25 dtb-6.6.63-ky
lrwxrwxrwx 1 root root 17 Mar 12 01:25 Image -> vmlinuz-6.6.63-ky
-rw-rw-r-- 1 root root 2472 Mar 12 01:24 boot.cmd
-rw-r--r-- 1 root root 261644 Mar 12 01:03 config-6.6.63-ky
-rw-r--r-- 1 root root 5839773 Mar 12 01:03 System.map-6.6.63-ky
-rwxr-xr-x 1 root root 34558464 Mar 12 01:03 vmlinuz-6.6.63-ky

I did not recompile the device tree as it is not needed. I then used the watchdog_test 10 command to reboot the board. However, it did not reboot. After looking at the original image I started from it appears there are really only two values that matter

  1. /boot/Image symbolic link to the vmlinuz file
  2. /boot/uInitrd symbolic link to the uInitrd file

After fixing /boot/Image on the filesystem the board booted running my recompiled kernel. I did a quick round of tests again and got the following results

Command tried                                                     Result
shutdown -r now                                                   reboot successful
watchdog_test 10 with user input                                  reboot successful
'watchdog' service running and shutdown -r                        tries to reboot but never comes back up
'watchdog' service running and watchdog_test 10 and user input    reboot successful

This result is a clear improvement. The unusual thing here is that just running shutdown -r with the watchdog service did not reboot for me. At this point I came to the following conclusion regarding what must have happened here

  1. Someone added support for this watchdog hardware to this forked version of the Linux kernel
  2. At some point, there was a bug identified in the watchdog driver code
  3. Someone added reboot_lock to try and address this problem
  4. reboot_lock was never properly initialized so this leaves us with the code as delivered, triggering the warning in the kernel logs

I looked at the implementation of both functions and observed they both call the function spa_wdt_write. This is not surprising as they both try to do things to the watchdog. The use of spin_lock is the correct mechanism to eliminate the possibility of two threads concurrently calling spa_wdt_write or other functions. However, this doesn't prohibit the function spa_wdt_restart_handler from running right after spa_wdt_ping releases the lock. In fact, it pretty much guarantees it. The userspace program that uses the ioctl WDIOC_KEEPALIVE to keep the watchdog going usually does this at some interval like 10 seconds. It's entirely possible the underlying hardware doesn't actually like whatever spa_wdt_write(info, WDT_WCR, 0x1) in the watchdog driver does being called effectively back to back. This is just a theory; I have no real evidence to support this.

I thought for a while about how I might fix this, but the easiest way I came up with was to add a 2 second delay to spa_wdt_restart_handler right after it takes the lock. This is done by calling msleep(2000), which puts the calling thread to sleep for 2000 milliseconds. (Strictly speaking, kernel code is not supposed to sleep while holding a spin lock; mdelay(2000) is the busy-waiting equivalent.) This should allow at least 2 seconds to pass since spa_wdt_ping ran. My logic for this is as follows

  1. 2 seconds of delay on a restart handler is basically irrelevant
  2. In a worst case scenario someone has set the watchdog interval to 2 seconds or lower and it just gets tripped anyway during this delay

So I added this and recompiled the kernel, then rebooted into my new kernel version. I reperformed the tests and it seemed that it basically worked in all cases. I decided to stress test this however. What I did was just had my workstation run a command in a loop via ssh. This means as soon as the board is online, my workstation winds up asking it to reboot. This is exactly what I would do if I were remotely rebooting the board

Test 1 - reboot command

I ran this command from my workstation

while true; do ssh '-o ConnectTimeout 3' orangepirv2 'shutdown -r now'; sleep 1 ; done

This test ran for over 10 minutes before eventually the board got stuck on the boot screen

The display just showed this forever until the board was manually power cycled

Test 2 - reboot command with the watchdog running

I ran this command from my workstation

while true; do ssh '-o ConnectTimeout 3' orangepirv2 'echo begin; sync; sudo systemctl start watchdog ; sleep 0.1; sudo shutdown -r now ; '; sleep 1 ; done

This test ran for over 10 minutes before eventually the board got stuck on the boot screen. So this is basically the same result as the prior test, I just had the watchdog service running

Test 3 - Enable the watchdog, send one keep alive and wait for it to restart

I ran this command from my workstation

while true; do ssh '-o ConnectTimeout 3' '-o ServerAliveInterval 1' '-o ServerAliveCountMax 2' orangepirv2 'echo begin; sync; sudo ./my_watchdog3 ; '; sleep 1 ; done

This test ran for 12 minutes and always was a successful reboot. So as long as you only trip the watchdog and do not actually ask for a system restart, it works it seems. The program my_watchdog3 is just a program I wrote to trip the watchdog timer reliably

Next steps

At this point I was thoroughly confused because I thought I had fixed things. While I did get the watchdog timer to work as intended, I still could not get the board to reboot reliably. I tried many things and even at one point changed power supplies just to rule out that as a possibility.

At some point it dawned on me that the restart handler that interacts with the watchdog might actually be optional. Sure enough I found this piece of code

/* from drivers/watchdog/x1_wdt.c in function spa_wdt_dt_init */
if (of_get_property(np, "spa,wdt-enable-restart-handler", NULL))
  info->enable_restart_handler = 1;
else
  info->enable_restart_handler = 0;

When info->enable_restart_handler is set to zero the restart handler isn't even used. The of_get_property function should read from the device tree and if that item is set, enable the watchdog timer being used for the reset. I checked the device tree files in /boot and the device tree overlay. I couldn't even find a reference to wdt-enable-restart-handler so my conclusion is this code wasn't even enabled. I realized that this meant my refactor to the restart handler had no real impact. The restart handler isn't even invoked at this point. This did leave me with a question. The original image never rebooted for me. After my first refactor, the tests showed that sometimes the board rebooted. But my change really should not have impacted that at this point. I can't explain this, other than to think I don't have access to the exact source code used to build the original kernel that comes in the SD card image.

I really don't want to deal with changing the device tree, so I refactored the code to look like this.

    if (of_get_property(np, "spa,wdt-enable-restart-handler", NULL))
        info->enable_restart_handler = 1;
    else
        info->enable_restart_handler = 1;

This causes the restart handler to always be enabled. This change alone did not really fix anything, but it did at least mean the restart handler should be invoked. At this point I sat down and analyzed what the restart handler actually does:

    /* from drivers/watchdog/x1_wdt.c in function int spa_wdt_restart_handler */
    spa_wdt_shutdown_reason(cmd);
    spa_enable_wdt_clk(info);

    /* clear WDT status */
    spa_wdt_write(info, WDT_WSR, 0x0);

    /* set timeout to 1/256 second */
    spa_wdt_write(info, WDT_WMR, 0x1);

    /* enable counter and reset/interrupt */
    spa_wdt_write(info, WDT_WMER, 0x3);

    /* reset counter */
    spa_wdt_write(info, WDT_WCR, 0x1);

    mpmu_aprr = info->mpmu_base + MPMU_APRR;
    reg = readl(mpmu_aprr);
    reg |= MPMU_APRR_WDTR;
    writel(reg, mpmu_aprr);

    mdelay(5000);

The comments were added by me. All this really does is:

  1. store some data about the shutdown reason
  2. set up the WDT with a 1/256 second timeout
  3. reset the WDT
  4. update some register MPMU_APRR
  5. wait 5 seconds in a busy loop

The last step is expected to still be running when the watchdog timer trips and reboots the system. The register MPMU_APRR is described in a source code comment as "MPMU_APSR is a dummy reg which is used to handle reboot cmds". Setting it does not seem strictly necessary. Likewise, spa_wdt_shutdown_reason is probably nice to have, but also not needed.

The other puzzling thing is the choice of writing a 1 to WDT_WMR. This is equivalent to a 1/256 second, or 3.9 millisecond, timeout on the WDT. You cannot actually request this through the ioctl command: the lowest value you can pass is 1 (second), which is translated to 256 before it is used. So I changed this line to spa_wdt_write(info, WDT_WMR, 0xff), which is 255/256 of a second, or effectively 1 second. I don't actually know that trying to set a 3.9 millisecond timeout causes a problem, but adding 1 second of delay is not going to matter.
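The arithmetic behind these values can be checked with a small sketch. It assumes, as described above, that one count in WDT_WMR is 1/256 of a second and that the ioctl path converts seconds by shifting left by 8. The names WDT_TICK_SHIFT, wdt_ticks_to_us and wdt_seconds_to_ticks are hypothetical and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption taken from the text: one WDT_WMR count is 1/256 of a second. */
#define WDT_TICK_SHIFT    8u                      /* hypothetical name */
#define WDT_TICKS_PER_SEC (1u << WDT_TICK_SHIFT)  /* 256 counts per second */

/* Convert a raw match value to whole microseconds, rounded down. */
static inline uint64_t wdt_ticks_to_us(uint32_t ticks)
{
    return ((uint64_t)ticks * 1000000u) / WDT_TICKS_PER_SEC;
}

/* Convert whole seconds to a raw match value, as the ioctl path is described to do. */
static inline uint32_t wdt_seconds_to_ticks(uint32_t seconds)
{
    /* Refuse inputs whose shifted result would not fit in 32 bits. */
    assert(seconds <= (UINT32_MAX >> WDT_TICK_SHIFT));
    return seconds << WDT_TICK_SHIFT;
}
```

Under these assumptions a raw value of 1 is 3906 microseconds, 0xff is 996093 microseconds, and a request for 1 second becomes a raw value of 256.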

I removed the msleep(2000) I had added to this function as this obviously did not do anything.

I then changed the calls to spin_lock / spin_unlock to spin_lock_irqsave / spin_unlock_irqrestore. This disables interrupts on the local CPU while the lock is held. There really should not be anything running at this point to interrupt the restart handler, but if there is, this change keeps it from doing so. I came to this decision by just looking around at other kernel code in drivers/watchdog.

So at this point, I wound up with a restart handler implementation that looks like this:

    static int spa_wdt_restart_handler(struct notifier_block *this,
                                       unsigned long mode, void *cmd)
    {
        struct spa_wdt_info *info = container_of(this, struct spa_wdt_info,
                                                 restart_handler);
        void __iomem *mpmu_aprr;
        u32 reg;
        unsigned long flags = 0;

        spin_lock_irqsave(&info->wdt_lock, flags);

        spa_wdt_shutdown_reason(cmd);
        spa_enable_wdt_clk(info);

        /* clear WDT status */
        spa_wdt_write(info, WDT_WSR, 0x0);

        /* set timeout to 1 seconds */
        spa_wdt_write(info, WDT_WMR, 1 << DEFAULT_SHIFT);

        /* enable counter and reset/interrupt */
        spa_wdt_write(info, WDT_WMER, 0x3);

        /* reset counter */
        spa_wdt_write(info, WDT_WCR, 0x1);

        mpmu_aprr = info->mpmu_base + MPMU_APRR;
        reg = readl(mpmu_aprr);
        reg |= MPMU_APRR_WDTR;
        writel(reg, mpmu_aprr);

        mdelay(5000);
        panic("reboot system failed");

        spin_unlock_irqrestore(&info->wdt_lock, flags);
        pr_err("reboot system failed: this line shouldn't appear.\n");

        return NOTIFY_DONE;
    }

At this point I re-ran test 2. The board rebooted just fine for a while, but eventually got stuck at the boot screen. I then tried different things with the watchdog service enabled. I wanted to do something that would lead to the watchdog getting tripped in a more realistic manner. What I did was create a kernel module that can be configured to start threads which are not preemptible and wait forever. It works as follows:

  1. A thread is created
  2. That thread takes a spin lock using spin_lock_irqsave.
  3. That thread starts additional threads, which attempt to take the same lock using spin_lock_irqsave
  4. The first thread then requests a busy wait from the kernel

Since all the threads are using spin_lock_irqsave, they won't be pre-empted under normal conditions. This means the CPU is eventually consumed 100% of the time and the watchdog timer gets tripped.

There are instructions on how to use this, but if you are looking for the TLDR, here it is:

    tar xf kernel_hog.tar.gz
    cd kernel_hog
    make all
    mknod /dev/hog c $(dmesg | grep 'hog loaded with device major number' | tail -n 1 | cut -d ' ' -f 8) 0
    echo -ne '\x0F\x00' | dd of=/dev/hog

The last step configures the kernel hog to start 15 threads and invokes it. Using the default watchdog service, I could never get the watchdog to trip using this. I don't know what about the normal watchdog service causes a problem here, so I just went ahead and implemented the simplest watchdog I could: a systemd service that sets the watchdog to a 10 second interval, then pings it once every second.
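The program behind such a service can be very small. What follows is a minimal sketch, not the service I actually ran: it assumes the standard Linux watchdog character device interface from linux/watchdog.h (WDIOC_SETTIMEOUT and WDIOC_KEEPALIVE), and the function names are hypothetical.

```c
#include <fcntl.h>
#include <linux/watchdog.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Set the timeout in seconds on an open watchdog fd. Returns 0 or -1. */
static inline int wdt_set_timeout(int fd, int seconds)
{
    return ioctl(fd, WDIOC_SETTIMEOUT, &seconds) == 0 ? 0 : -1;
}

/* Ping the watchdog once so the countdown starts over. Returns 0 or -1. */
static inline int wdt_ping(int fd)
{
    return ioctl(fd, WDIOC_KEEPALIVE, 0) == 0 ? 0 : -1;
}

/*
 * Open the watchdog device, set its timeout, then ping it every
 * interval_s seconds. Only returns (with -1) when something fails.
 * If this process stops being scheduled, the pings stop and the
 * hardware is expected to reset the board after timeout_s seconds.
 */
static inline int wdt_feed_forever(const char *dev, int timeout_s,
                                   unsigned interval_s)
{
    int fd = open(dev, O_WRONLY);
    if (fd < 0)
        return -1;

    if (wdt_set_timeout(fd, timeout_s) != 0) {
        close(fd);
        return -1;
    }

    for (;;) {
        if (wdt_ping(fd) != 0) {
            close(fd);
            return -1;
        }
        sleep(interval_s);
    }
}
```

A main function that calls wdt_feed_forever("/dev/watchdog", 10, 1) and is started from a systemd unit would match the 10 second interval and 1 second ping described above. The device path is the conventional one and is an assumption here.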

Once this was running, using the kernel hog module I was able to get the board to reboot. So my conclusion is that the watchdog is working as well as it is going to.

I actually did a bunch more testing around the board here. If I just constantly did something that rebooted the board, it would always eventually reach a point where it was stuck on the boot process. Power cycling the board would allow it to recover.

The patch I made to the kernel is a clear improvement over the current implementation. My actual theory for what has happened here is:

  1. There was an invalid usage of reboot_lock which I removed
  2. There was an issue in the restart handler that I fixed
  3. There is another unidentified issue around board boot up that needs to be fixed

My conclusion here is based on behavior I observed multiple times. This board has a boot issue. Even with the original image from Orange Pi, the board does not boot 100% of the time when power is applied. Additionally, I observed that when I ran test 2 the board eventually gets stuck at the boot screen. Toggling the power supply off then back on results in the board stuck at the boot screen again most of the time. This can be repeated again and again. If I toggle off the power supply for approximately 2 minutes and then back on, it boots up again.

My theory here is that the board has some kind of capacitor in a circuit that is expected to be at 0 volts during the bootup process. For some reason, rebooting may result in that capacitor having some small amount of charge retained. This keeps something in the system in a state it should not be in and results in a no boot condition. By leaving the board off for a while, it ensures the capacitor can discharge all the way and the board boots up the next attempt. Since this is related to some electrical voltage level, this explains why the reboot is not deterministic.

This is just a theory and I'm not going to investigate it further, as I don't think there is a software solution to this problem. It's not like I'm going to spin up a new PCB to fix this or something.

Kernel patches

The kernel I started from on GitHub seems to be slightly newer than the one included in the image I downloaded. There were several configuration parameters that were present in the downloaded source, but not in the kernel config from /boot. The specific commit I checked out on the orange-pi-6.6-ky branch is ae9e974d3e19f460b6397bfe8f0f1417a073ce05. I've included the patch showing the change I made to drivers/watchdog/x1_wdt.c, the complete revised source file, & the config I used.

This board is really impressive. If you can get it for a price that is reasonable you can probably find some use for it in some very interesting projects. It's currently listed for $93 on AliExpress which in my opinion is too high of a price for this board. But keep in mind this is the equivalent of an evaluation board. At this point this board is not reliable enough for all applications. I do have future plans for this board, which I intend to write more about when I have time.
