My 2-month beef with my own Linux environment (a developer cautionary tale)


Hi everyone! I want to share a two-month-long, insanity-inducing debugging session - part cautionary tale, part comedy - so you can have a quick laugh and hopefully avoid making the same mistakes I did.

For the past couple of months, I’ve been maintaining and experimenting with DebDroid, a project I built to repurpose older Android devices into portable desktops and lightweight home servers.

It’s worth noting that, unlike Termux, DebDroid runs a near-native Linux userland based on glibc, not a minimal runtime. This means it behaves much more like a standard Linux system, but it also encounters more frequent compatibility issues with the Android host. You can think of it as LXC for Android, or like a version of Kali NetHunter adapted for general-purpose use.

My original goal for DebDroid was to get sshd (the OpenSSH server) and gpg working reliably, since both tend to run into issues in a plain, manually-managed chroot environment.

After a quick debugging session, I discovered that older Android kernels (pre-3.17) don’t support the getrandom() system call. Huh? No big deal. I just needed to write my own stub implementation that reads directly from /dev/urandom, put it in a shared library that wraps syscall(), and preload that library through the dynamic linker. Easy, right?
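For illustration, the /dev/urandom fallback described above could look roughly like this. It is a minimal sketch, not DebDroid's actual code: the helper name `urandom_read` is borrowed from the wrapper shown later in this post, but its body and return convention (bytes read, or -1 with errno set) are assumptions.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed helper: fill buf with up to buflen bytes from /dev/urandom.
 * Returns the number of bytes read, or -1 with errno set on failure. */
ssize_t urandom_read(void *buf, size_t buflen)
{
    int fd = open("/dev/urandom", O_RDONLY);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < buflen) {
        ssize_t n = read(fd, (char *)buf + done, buflen - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, retry */
            int saved = errno;      /* close() may overwrite errno */
            close(fd);
            errno = saved;
            return -1;
        }
        if (n == 0)
            break;                  /* unexpected EOF, return what we have */
        done += (size_t)n;
    }

    close(fd);
    return (ssize_t)done;
}
```

Opening the device on every call keeps the sketch simple; a real preload library would probably want to cache the descriptor, and it ignores the `flags` argument that getrandom() accepts.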

In the meantime, I also created some scripts to automatically manage the environment and preload these runtime "patches" system-wide via /etc/ld.so.preload.

Everything was fun and games... until I tried to start an X11/Xfce4 VNC session to see if the project could support graphical environments without additional hand-rolled preloads. The session completely froze. The screen went black, and even the cursor failed to initialize. It was stuck on the ugly default Xorg one. I spent days staring at logs while fiddling with xstartup and DBus sessions, trying to figure out what went wrong.

Around this time, I also started using gdb and strace to determine why and where the xfce4-session processes kept hanging. Every time, it was a function blocked on a read(), write() or poll() call. Alright, I patch that function and retry... then another one. Patch, retry... another one. It was a caffeine-induced whack-a-mole game between me and the Linux environment. I eventually ended up with debug builds of nearly every major X11-related package just so I could patch the next stuck "offender". No package was safe from my wrath: GLib, GTK3, xfce4-session and many others, including their dependencies.

I started small by patching functions like g_spawn_sync, g_spawn_async and g_spawn_command_line_sync, recompiled everything directly on my puny tablet with 3GB of RAM and hoped for progress. Every patch seemed to fix something, only for a dozen other problems to appear. I even spent hours debugging with gdb sessions that sometimes hung themselves.

At some point I became paranoid and thought it must be systemd’s fault. I desperately grabbed a Devuan image and manually chrooted into it. Lo and behold, X11 worked perfectly. "Ah-ha! Systemd is the villain!!!" I thought (average Linux user moment, I know). I even modified my entire project to run on Devuan instead of Debian and updated the README to explain the breaking change and migration options. Victory was mine... or so I thought.

I integrated the Devuan setup into my normal environment and ran it... and it broke. Again! XD At this point, I was ready to give up on software development altogether, uninstall Arch and go touch some grass.

Then it hit me... syscalls keep hanging, the "offenders" are everywhere, and patching one just leads to another down the line. It must be that damn syscall wrapper I designed 2 months ago to fix a small compatibility issue between Linux and old Android kernels. Everything else (GLib, GTK, DBus, Xorg, Xfce4, ...) was misbehaving because the wrapper didn't forward any arguments to the real syscall(), so every call other than getrandom reached the kernel with garbage arguments, resulting in hangs in nearly every major package of the environment. Once fixed, everything worked immediately. I still can't believe I sabotaged myself this hard.

The ironic part:

syscall() is the foundation of the system, yet I completely ignored it for a full month. I patched libraries, recompiled packages, rewrote countless stub implementations, and blamed systemd. All of this while the real "offender" was right under my nose. Blocked syscalls that should never ever fail or hang are a spooky developer pit trap, even in Android chroot environments.

Lessons:

  • Never globally override syscall() unless you are ready to deal with the consequences.
  • Tiny compatibility fixes can spiral into months-long insanity trips.
  • If something seems impossible, check if you’re secretly the villain.

The "offender":

```c
long syscall(long number, ...) {
    static syscall_t real_syscall = NULL;
    if (!real_syscall) {
        real_syscall = (syscall_t)dlsym(RTLD_NEXT, "syscall");
    }

    if (number == SYS_getrandom) {
        void *buf;
        size_t buflen;
        unsigned int flags;
        va_list args;
        va_start(args, number);
        buf = va_arg(args, void *);
        buflen = va_arg(args, size_t);
        flags = va_arg(args, unsigned int);
        va_end(args);
        return urandom_read(buf, buflen);
    }
    // BUG: every argument after `number` is dropped here
    return real_syscall(number);
}
```

The fix:

```c
long syscall(long number, ...) {
    static syscall_t real_syscall = NULL;
    if (!real_syscall) {
        real_syscall = (syscall_t)dlsym(RTLD_NEXT, "syscall");
    }

    if (number == SYS_getrandom) {
        void *buf;
        size_t buflen;
        unsigned int flags;
        va_list args;
        va_start(args, number);
        buf = va_arg(args, void *);
        buflen = va_arg(args, size_t);
        flags = va_arg(args, unsigned int);
        va_end(args);
        return urandom_read(buf, buflen);
    }

    va_list args;
    va_start(args, number);
    long a1 = va_arg(args, long);
    long a2 = va_arg(args, long);
    long a3 = va_arg(args, long);
    long a4 = va_arg(args, long);
    long a5 = va_arg(args, long);
    long a6 = va_arg(args, long);
    va_end(args);
    // Correctly forwards variadic arguments
    // syscall accepts up to 6 arguments
    return real_syscall(number, a1, a2, a3, a4, a5, a6);
}
```
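If you want to poke at the forwarding pattern without preloading anything, here is a small standalone sketch of the same idea. The name `forwarding_syscall` is hypothetical, picked so it doesn't shadow libc's syscall(); it calls libc's syscall() directly instead of going through dlsym(). Reading six `long` values regardless of how many were actually passed mirrors the fix above; it relies on the platform's calling convention rather than on anything the C standard guarantees, so treat it as an assumption that holds on the usual Linux ABIs.

```c
#define _GNU_SOURCE
#include <stdarg.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical stand-in for the preloaded wrapper: pulls up to six
 * word-sized arguments off the va_list and hands them all to syscall(). */
long forwarding_syscall(long number, ...)
{
    va_list args;
    va_start(args, number);
    long a1 = va_arg(args, long);
    long a2 = va_arg(args, long);
    long a3 = va_arg(args, long);
    long a4 = va_arg(args, long);
    long a5 = va_arg(args, long);
    long a6 = va_arg(args, long);
    va_end(args);

    /* Unused trailing values are ignored by the kernel for syscalls
     * that take fewer than six arguments. */
    return syscall(number, a1, a2, a3, a4, a5, a6);
}
```

A quick sanity check is to compare `forwarding_syscall(SYS_getpid)` against `getpid()`, or to write into a pipe with `forwarding_syscall(SYS_write, (long)fd, (long)buf, (long)len)` and read the bytes back. Casting every argument to `long` at the call site matters here, since the wrapper reads each one as a `long`.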
