Why I Wrote Rmlx


I’ll start on a downbeat note: clearly, if there ever was a contest between R and Python for dominance of machine learning, R has lost. Nobody is writing AI in R.

There are probably a lot of reasons for that: R is niche, Python is general purpose; Python is slow but R is even slower; R is very REPL-oriented; R is rooted in academia.

It’s also fine. There is plenty of room for R in the world. An obvious niche is “statistical analysis and graphics on your laptop”. R has awesome graphics, at least two nice IDEs, and an unrivalled power to express common data science operations quickly and clearly. Python still has nothing as simple and expressive as

Academics, data journalists and data scientists in organizations doing research can all turn to R for this kind of work.

However, I do worry that R may stagnate. (Controversial take: interest in R from the pharmaceutical industry, which, ah, uses RTF as a standardized file format, is not necessarily a good sign.)

One thing we should aim for is dealing better with big data. Nowadays, even researchers without a big infrastructure should be able to analyse data with millions of rows and many predictors. There’s a lot of data around! We should use it!

It seems relevant that many of us work on laptops with a powerful built-in processor ideally suited to basic statistical operations. That is, we have Apple Silicon Macs with a built-in GPU.

But R can’t use the GPU!

An analogy: until a few years ago, we had multicore laptops but R couldn’t use the multiple cores because R was single-threaded. It took packages like “parallel” in base R, or the futureverse, to change that.
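For readers who haven't used it, here is a minimal sketch of what the “parallel” package makes possible. The bootstrap task, the helper name boot_mean and the choice of two workers are my own illustrative assumptions; only the parallel functions themselves come from base R.

```r
library(parallel)  # ships with base R

# Hypothetical toy task: one bootstrap replicate of a sample mean.
boot_mean <- function(i, values) mean(sample(values, replace = TRUE))

set.seed(1)
values <- rnorm(1e5)

# Two worker processes, so the sketch runs on almost any machine.
cl <- makeCluster(2L)

# Give the workers reproducible, independent random number streams.
clusterSetRNGStream(cl, 123)

# Spread 200 bootstrap replicates across the workers.
boot <- parSapply(cl, 1:200, boot_mean, values = values)

stopCluster(cl)

length(boot)  # one estimate per replicate
sd(boot)      # bootstrap standard error of the mean
```

The point of the analogy is that nothing about the statistics changes; the same function is simply handed to hardware that was sitting idle.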

Helpfully, Apple has released the MLX framework for working with Apple GPUs. It also has a CUDA backend, so Linux fans are not left out. So, I decided to write Rmlx, an R interface to that library. For the first time, R Mac users have easy access to operations on the GPU.

Rmlx provides R access to MLX’s C++ API, plus a few handy utilities like coordinate descent and functions for working with probability distributions. It doesn’t itself do statistics, but developers can use Rmlx to write fast statistical functions without needing to drop into C.

A companion library, RmlxStats, provides some basic statistical tests. There are MLX analogues to base R lm and glm, along with a very WIP version of glmnet::glmnet().

Here’s an example of the kind of work that Rmlx enables. We’ll use the large nycflights13 dataset. A simple regression runs quickly in base R:

> library(nycflights13)
> system.time(summary(lm(arr_delay ~ dep_delay, data = flights)))
   user  system elapsed
  0.077   0.005   0.086

But what if we want to control for the date of the flight?

> system.time(summary(lm(arr_delay ~ dep_delay + paste0(day,"/",month), data = flights)))
   user  system elapsed
 32.253   0.420  33.007

That’s OK to run once, but too slow to iterate on.
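Why is the second model so much slower? A rough sketch, using a small synthetic stand-in for the flights data (the data frame d and its values are made up for illustration): turning the date into a factor adds one dummy column per distinct date, so the design matrix grows from 2 columns to hundreds. The normal-equations solve below is a simplification I've swapped in to show the matrix work involved; lm() itself uses a QR decomposition.

```r
set.seed(42)
n <- 10000  # far fewer rows than flights, to keep the sketch quick

# Hypothetical stand-in for the flights data.
d <- data.frame(
  dep_delay = rnorm(n, mean = 10, sd = 30),
  day       = sample(1:28, n, replace = TRUE),
  month     = sample(1:12, n, replace = TRUE)
)
d$arr_delay <- 0.9 * d$dep_delay + rnorm(n, sd = 10)

# One predictor: intercept + slope, so 2 columns.
X_small <- model.matrix(arr_delay ~ dep_delay, data = d)

# Date as a factor: one dummy column per distinct date, minus a baseline.
X_big <- model.matrix(arr_delay ~ dep_delay + paste0(day, "/", month), data = d)

dim(X_small)
dim(X_big)

# Least squares via the normal equations: solve (X'X) b = X'y.
# Forming X'X costs on the order of n * p^2 operations, so the
# number of columns p matters a lot.
beta <- solve(crossprod(X_big), crossprod(X_big, d$arr_delay))
beta[2]  # coefficient on dep_delay; the simulated true value is 0.9
```

Big, dense matrix products like crossprod(X_big) are exactly the kind of work a GPU is good at.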

Let’s drop in RmlxStats::mlxs_lm() as a replacement for lm():

> library(RmlxStats)
> system.time(summary(mlxs_lm(arr_delay ~ dep_delay + paste0(day,"/",month), data = flights)))
   user  system elapsed
  4.348   0.404   3.167

Ten times better. It’s also faster than many functions from specialized packages like fixest or speedglm. GPU go brrrr.

If you don’t call summary(), it’s roughly eight times faster again:

> system.time(x <- mlxs_lm(arr_delay ~ dep_delay + paste0(day,"/",month), data = flights))
   user  system elapsed
  0.216   0.137   0.397

This is because MLX uses lazy evaluation, only running operations when they are needed. So developers can plug mlxs_lm(), or the underlying fit function, into more complex chains of output.
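If lazy evaluation is unfamiliar, here is a toy illustration of the idea in plain base R. To be clear, this is not the Rmlx API and not how MLX implements it; the lazy() helper is a hypothetical name I made up for the sketch. It only shows the principle: record the computation now, run it when somebody asks for the result.

```r
# A toy "lazy" value: store the computation, run it only on first use.
lazy <- function(compute) {
  cache <- NULL
  done <- FALSE
  list(
    force = function() {
      if (!done) {
        cache <<- compute()  # the expensive work happens here, once
        done <<- TRUE
      }
      cache
    },
    is_evaluated = function() done
  )
}

set.seed(1)
X <- cbind(1, matrix(rnorm(2000 * 50), nrow = 2000))
y <- rnorm(2000)

# Creating the lazy fit is near-instant: no linear algebra has run yet.
fit <- lazy(function() solve(crossprod(X), crossprod(X, y)))
fit$is_evaluated()  # FALSE

# Asking for the coefficients triggers the actual computation.
beta <- fit$force()
fit$is_evaluated()  # TRUE
```

In the timings above, summary() plays the role of force(): it needs actual numbers, so that is where the cost shows up.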

Both Rmlx and RmlxStats are very early in development. It’ll be interesting to see what works well on the GPU.

You may also be interested to know that both packages were almost entirely vibe-coded. (I know, I know… I’m sure a real programmer could do much better.) It’s been an interesting experience, with many upsides and downsides. The fact is, I could never have tackled an advanced R/C++ project like this without the help of an AI. We’ll see how it pans out.

There are many things to do to keep R moving forward for the 21st century. Lots of great people are working on them. One item on the checklist is to take advantage of modern computer hardware. Rmlx is one start at tackling this.
