Vilfredo Pareto probably never imagined his observation about Italian land ownership would become one of the most quoted principles in productivity circles. The 80/20 rule says that roughly 80% of effects come from 20% of causes, and it has been applied to everything from business strategy to personal habits. I think it applies just as well to R programming.
I've been writing R code for over three years now, and I've noticed something: most intermediate R users get stuck in what I call "local optima." Sure, they write code that gets the job done, but it's often held together by duct tape and hope. As someone in academia, believe me when I say I've seen my fair share of scruffy R scripts that are barely readable, impossible to maintain, and one change away from collapsing. Their authors are missing out on improvements that require minimal effort.
Here are the handful of changes that will give you the biggest quality-of-life improvements in your R workflows.
If you're still writing code like this, you’re fighting an uphill battle.
```r
df[df$column > 5 & !is.na(df$column), ]
```

The tidyverse isn't just about cleaner syntax; it's about a fundamentally different approach to data manipulation based on the Unix philosophy: do one thing and do it well. The transformation looks like this:
```r
# The tidyverse way
result <- df |>
  filter(column > 5, category %in% c("A", "B")) |>
  group_by(group) |>
  summarize(value = mean(value)) |>
  arrange(desc(value))
```

Notice I'm using the native pipe |> instead of %>%. This matters more than you think. The native pipe is faster, doesn't require loading magrittr, and plays better with R's debugging tools.
But here's the crucial point: understanding base R makes you a better tidyverse programmer. When group_by() isn't behaving as expected, or when you need to write a custom function that works with dplyr, base R knowledge becomes essential.
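For example, once you're comfortable with how R evaluates unquoted column names, writing your own dplyr-friendly helpers becomes much less mysterious. Here's a minimal sketch; the mean_by() helper and its arguments are made up for illustration:

```r
library(dplyr)

# A reusable summary helper: {{ }} forwards the unquoted column names
# the caller supplies into group_by() and summarize()
mean_by <- function(data, group_var, value_var) {
  data |>
    group_by({{ group_var }}) |>
    summarize(mean_value = mean({{ value_var }}, na.rm = TRUE), .groups = "drop")
}

# mtcars ships with base R, so this runs as-is
mean_by(mtcars, cyl, mpg)
```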
R is fundamentally different from many popular languages such as Python or Java. It's built around vectors, and fighting this design leads to unnecessarily inefficient code. If you're writing for loops to apply the same operation to each element of a vector, you're doing it wrong 90% of the time.
Consider this common pattern:
```r
# Slow
results <- numeric(length(data))
for (i in seq_along(data)) {
  results[i] <- expensive_function(data[i])
}
```

The purrr::map() family isn't just cleaner; it is often faster and handles edge cases better. The difference is evident:
```r
# Fast
results <- map_dbl(data, expensive_function)

# Or even better, if the function is vectorized
results <- expensive_function(data)
```

You can vectorize a scalar function using the aptly named Vectorize(). But don't fall into the "premature optimization" trap: sometimes a simple loop is more readable, especially for complex logic or when you need to stop early.
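As a quick illustration of Vectorize(), here's a sketch built around a made-up clamp() helper that only handles one value at a time:

```r
# A scalar-only helper: it works on a single value, not a vector
clamp <- function(x, lo, hi) {
  if (x < lo) return(lo)
  if (x > hi) return(hi)
  x
}

# Vectorize() wraps it in mapply() so it accepts whole vectors;
# here we only vectorize over x, keeping lo and hi as single values
clamp_vec <- Vectorize(clamp, vectorize.args = "x")

clamp_vec(c(-2, 0.5, 7), lo = 0, hi = 1)
#> [1] 0.0 0.5 1.0
```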
R's default data.frame can often feel slow once you hit a few hundred thousand rows. And it's not just about speed; it's also about memory, because intermediate copies of your data pile up as you chain dplyr transformations.
I learned this the hard way when crunching and analyzing datasets with ~5M rows on a decent-spec machine. What should have been some simple aggregations turned into a 10-minute coffee break, sometimes followed by an "out of memory" error.
Enter data.table and DuckDB, the two backends that can transform your relationship with large datasets. data.table is like data.frame's high-performance cousin, written in C and optimized for speed and memory efficiency. DuckDB is even more interesting: it's an embedded analytical database that lets you query datasets larger than your RAM without breaking a sweat.
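If you haven't met data.table before, its DT[i, j, by] syntax packs filtering, aggregation, and grouping into a single expression. A minimal sketch with made-up columns:

```r
library(data.table)

# A million-row toy table with made-up columns
dt <- data.table(
  group = sample(letters[1:3], 1e6, replace = TRUE),
  value = rnorm(1e6)
)

# Filter (i), aggregate (j), and group (by) in one expression,
# without creating intermediate copies of the data
dt[value > 0, .(mean_value = mean(value), n = .N), by = group]
```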
I can't say enough good things about DuckDB; I've written an entire blog post about how it can save you when working with huge datasets on low-memory devices. The key insight is that you can boost performance and scale to larger datasets without ever leaving the comfort of our beloved dplyr syntax.
The compound effect really kicks in when you combine these backends with formats such as Parquet instead of good old CSVs. Parquet files are typically smaller than equivalent CSVs, load 5-10x faster, and preserve column types, so you don't spend time converting strings back to dates.
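To make the combination concrete, here's a minimal sketch assuming a hypothetical sales.parquet file with region and amount columns; dbplyr translates the dplyr verbs into SQL that DuckDB executes:

```r
library(DBI)
library(duckdb)
library(dplyr)
library(dbplyr)

con <- dbConnect(duckdb::duckdb())

# Register the Parquet file as a lazy table; nothing is read into RAM yet
sales <- tbl(con, sql("SELECT * FROM read_parquet('sales.parquet')"))

# Familiar dplyr verbs run inside DuckDB; only the small aggregated
# result crosses back into R when we call collect()
by_region <- sales |>
  filter(amount > 0) |>
  group_by(region) |>
  summarize(total = sum(amount, na.rm = TRUE)) |>
  collect()

dbDisconnect(con, shutdown = TRUE)
```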
No matter the language, if you've never had a collaborator (or your future self) unable to run your code because of package version conflicts, you're either very lucky or haven't been collaborating long enough. The solution is virtual environments, which in R means renv, and it should be the first thing you set up in every new project.
renv creates isolated, portable environments for each project. When someone else opens your project, renv::restore() installs exactly the same package versions you used.
```r
# In every new project
renv::init()

# After installing new packages
renv::snapshot()
```

No more scripts that worked last month but mysteriously break today. It's a small upfront investment that pays dividends every time you revisit old code or collaborate with others.
Most R users write scripts like they're writing a lab notebook: a long sequence of commands that transforms data step by step. This works until you need to debug, test, or reuse any part of your analysis.
The functional programming mindset changes everything. Think in terms of small, composable functions. This isn't just about code organization. Functions are easier to test, debug, and reason about. They force you to think about inputs and outputs explicitly. And when something goes wrong (and it will), you can test each function in isolation instead of rerunning your entire analysis.
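To illustrate, here's a sketch of a lab-notebook script reorganized into small functions; the step names and columns are invented for the example:

```r
library(dplyr)

# Each step is a small function with an explicit input and output
remove_incomplete <- function(data) {
  data |> filter(!is.na(value), !is.na(group))
}

summarize_by_group <- function(data) {
  data |>
    group_by(group) |>
    summarize(mean_value = mean(value), n = n(), .groups = "drop")
}

# The analysis becomes a pipeline of named steps,
# and each step can be tested in isolation on a toy data frame
run_analysis <- function(raw) {
  raw |>
    remove_incomplete() |>
    summarize_by_group()
}

toy <- tibble(group = c("a", "a", "b", NA), value = c(1, 2, 3, 4))
run_analysis(toy)
```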
Modern machines have multiple cores, but most R code runs on a single thread. For embarrassingly parallel problems like bootstrapping, cross-validation, or applying the same function to different subsets of data, this is leaving performance on the table.
The future ecosystem makes parallelization trivial:
```r
library(future)
library(furrr)

# Tell R to use all available cores
plan(multisession)

# Instead of
results <- map(data_list, expensive_function)

# Do this
results <- future_map(data_list, expensive_function)
```

The syntax is nearly identical, but the performance improvement can be dramatic. On the rare occasions I can't use DuckDB, the future package has saved me a lot of idle time.
Here's what's interesting about these improvements: they compound. Better data storage makes parallelization more effective. Functional thinking makes testing easier. Reproducible environments make collaboration smoother. Each change makes the others more valuable.
This reminds me of James Clear's concept of "atomic habits": small changes that seem insignificant in isolation but create remarkable results when combined. The difference is that in R, unlike in personal habits, you can implement all of these changes in an afternoon and see immediate results.
R's ecosystem rewards this kind of systematic improvement more than any other language I've used. The tight integration between packages, the consistent grammar across the tidyverse, and the focus on domain-specific problems means that learning one improvement often unlocks several others.
My challenge to you: pick one of these techniques and implement it in your current project this week. Don't try to change everything at once; just start somewhere. Your future self and your collaborators will thank you.
If this resonated with you, consider sharing it with your colleagues. And if you have R workflow improvements that have transformed your productivity, I'd love to hear about them.