Rye Tables vs. Python/Pandas: A Different Way to Wrangle Data

2 hours ago 2

If you’ve ever filtered a CSV in Pandas or written multiple nested for loops to group data in Python, you’ll probably find Rye’s take on the problem interesting.

Rye is a small, homoiconic language inspired by Rebol and Factor. One of its most interesting value types is the Table. A first-class, immutable data structure for tabular data.

Tables are mostly manipulated with pure, composable (endomorphic1) functions. By using Rye’s pipe and op-words, common wrangling tasks become small, chainable expressions.

This post compares how three different approaches handle the same data-wrangling problems:

  • Rye tables
  • Python with Pandas
  • Plain Python (no libraries)

Along the way we’ll look at what makes Rye’s model unique, where Pandas shines, and why pure Python becomes verbose as complexity grows.


Rye, in this case, is a programming language - not the package manager with the same name.
I worked on Rye for years before I saw a post about Rye - the package manager on HN. That tool seems to have been superseded by uv… so time fixes some confusions on its own. :P

Rye aims to be a higher higher-level language with value types that reflect more how people work with information vs. how computers work. One such value type is a Table.

To learn more about tables, visit the Cookbook page: Working with Tables.


#1: Filtering and Transforming Sci-fi books

Scenario: Given a dataset of sci-fi books, find all books published after 1980, add a column showing how old each book is, sort by age, and limit to top 2 results.

This first example represents a common data processing pattern: filtering, transforming, and sorting a dataset.

Rye

books: table { 'title 'author 'year } { "Neuromancer" "William Gibson" 1984 "Snow Crash" "Neal Stephenson" 1992 "Foundation" "Isaac Asimov" 1951 "Dune" "Frank Herbert" 1965 } books .where-greater 'year 1980 |add-column 'age { year } { 2025 - year } |order-by 'age 'asc |columns? { 'title 'author 'age } |display ; Prints: ; | title | author | age | ; +-------------+-----------------+-----+ ; | Snow Crash | Neal Stephenson | 33 | ; | Neuromancer | William Gibson | 41 |

Rye’s example is written using pipe-words ( |word ), very common to Rye, and a fact that base Table functions are endomorphic. They accept a table and return a new table. This means you can compose and combine them in any order you need them to.

Find out more about Rye’s op and pipe-words

Python Pandas

Pandas supports very similar concepts, so the code is very similar in shape and size.

import pandas as pd books = pd.DataFrame([ {"title": "Neuromancer", "author": "William Gibson", "year": 1984}, {"title": "Snow Crash", "author": "Neal Stephenson", "year": 1992}, {"title": "Foundation", "author": "Isaac Asimov", "year": 1951}, {"title": "Dune", "author": "Frank Herbert", "year": 1965} ]) result = (books[books['year'] > 1980] .assign(age=lambda x: 2025 - x['year']) .sort_values('age') [['title', 'author', 'age']]) print(result)

Pandas does introduce some unusual syntax for Python that is not that self-explanatory. It seems to heavily utilize the accessor syntax.

  • books[books['year'] > 1980]
  • [['title', 'author', 'age']]

Python

Contrary to the above two, pure Python is a staple of imperative code, for its good and its bad. It’s a little more verbose, but it’s very basic and not hard to see what exactly is going on.

books = [ {"title": "Neuromancer", "author": "William Gibson", "year": 1984}, {"title": "Snow Crash", "author": "Neal Stephenson", "year": 1992}, {"title": "Foundation", "author": "Isaac Asimov", "year": 1951}, {"title": "Dune", "author": "Frank Herbert", "year": 1965} ] # Filter books after 1980 filtered_books = [] for book in books: if book["year"] > 1980: filtered_books.append(book) # Add age column for book in filtered_books: book["age"] = 2025 - book["year"] # Sort by age filtered_books.sort(key=lambda x: x["age"]) # Display a specific table print("| title | author | age |") print("+-------+--------+-----+") for book in filtered_books: print(f"| {book['title']:<12} | {book['author']:<15} | {book['age']:2} |")

Despite its simplicity, contrary to the above examples this code doesn’t display intent. We see 3 for loops but each is used for an absolutely different purpose. So comments that explain intent make sense here.

#2: Joining and Aggregating Space Colonies

Scenario: Join colony data with planet information, then group by planet type to calculate total population and average colony size, and save the result as an Excel (.xlsx) file.

Moving beyond simple transformations, this example tackles relational operations—joining datasets and performing aggregations.

Rye

colonies: table { 'id 'name 'planet_id 'population } { 1 "New Terra" 1 50000 2 "Olympus" 1 75000 3 "Titan Base" 2 12000 } planets: table { 'id 'name 'type } { 1 "Mars" "Terrestrial" 2 "Titan" "Moon" } colonies .inner-join planets 'planet_id 'id |group-by 'type { 'type count 'population sum 'population avg } |save\xlsx %colonies.xlsx |display ; Prints: ; | type | type_count | population_sum | population_avg | ; +------------------------------------------------------------+ ; | Terrestrial | 2 | 125000.000000 | 62500.000000 | ; | Moon | 1 | 12000.000000 | 12000.000000 |

In Rye even what are usual special forms are just functions, and this example just uses a few more functions. Like add-column in the first example, that accepted a table and two blocks of code/data. Here we have a group-by function with its third argument as its own little microdialect for specifying column to aggregate on and aggregate functions to use.

Rye dialects often rely on Rye’s many word types. Here a lit word ( ‘word ) or string would denote a column and a word ( word ) denotes an aggregate function to apply.

Python Pandas

import pandas as pd colonies = pd.DataFrame([ {"id": 1, "name": "New Terra", "planet_id": 1, "population": 50000}, {"id": 2, "name": "Olympus", "planet_id": 1, "population": 75000}, {"id": 3, "name": "Titan Base", "planet_id": 2, "population": 12000} ]) planets = pd.DataFrame([ {"id": 1, "name": "Mars", "type": "Terrestrial"}, {"id": 2, "name": "Titan", "type": "Moon"} ]) result = (colonies.merge(planets, left_on='planet_id', right_on='id') .groupby('type')['population'] .agg(['count', 'sum', 'mean']) .reset_index()) # Save to Excel file (requires: pip install openpyxl) result.to_excel('colonies.xlsx', index=False) print(result)

Again, Pandas example is of similar shape as Rye one. And again it seems we use some magical accessor syntax:

.groupby('type')['population']

Python

colonies = [ {"id": 1, "name": "New Terra", "planet_id": 1, "population": 50000}, {"id": 2, "name": "Olympus", "planet_id": 1, "population": 75000}, {"id": 3, "name": "Titan Base", "planet_id": 2, "population": 12000} ] planets = [ {"id": 1, "name": "Mars", "type": "Terrestrial"}, {"id": 2, "name": "Titan", "type": "Moon"} ] # Manual join operation joined_data = [] for colony in colonies: for planet in planets: if colony["planet_id"] == planet["id"]: joined_row = {**colony, **planet} joined_row["planet_name"] = planet["name"] joined_data.append(joined_row) # Manual grouping and aggregation from collections import defaultdict groups = defaultdict(list) for row in joined_data: groups[row["type"]].append(row) results = [] for planet_type, group_data in groups.items(): total_pop = sum(row["population"] for row in group_data) avg_pop = total_pop / len(group_data) results.append({ "type": planet_type, "count": len(group_data), "total_population": total_pop, "avg_population": avg_pop }) # Display results print("| type | count | total_pop | avg_pop |") print("+------+-------+-----------+---------+") for result in results: print(f"| {result['type']:<11} | {result['count']:3} | {result['total_population']:8} | {result['avg_population']:7.0f} |") # Note: Excel export would require additional libraries like openpyxl or xlsxwriter. # We don't save to Excel here.

This example highlights how complexity scales differently across approaches. The manual Python implementation becomes significantly more verbose when dealing with joins and aggregations, while both Pandas and Rye maintain relatively concise syntax.

Interactively solving Examples #1 and #2

You can visit asciinema to see the first two examples run interactively in a Rye console (REPL) while displaying the intermediate results.

Rye focuses on being especially useful in an interactive setting.

asciicast

#3: Nesting multi-table Exoplanet data

Scenario: Read 3 CSV files containing exoplanet data, observations, and station information. For each exoplanet, collect the top 3 observations (by confidence) and sum the population of all stations stationed near that exoplanet. Output as multi-layered JSON.

This example demonstrates more complex relational operations across multiple tables, joining data, nested aggregations, and structured output generation.

For better information organisation we load all data into a data context. All functions here again are endomorphic.

Rye

data: context { load\csv %exoplanets.csv |autotype 0.95 :exoplanets load\csv %observations.csv |autotype 0.95 :observations load\csv %stations.csv |autotype 0.95 :stations } data/exoplanets |add-column 'observations { id } { data/observations |where-equal 'exoplanet_id id |order-by 'confidence 'desc |columns? { 'date 'notes } |head 3 } |add-column 'stationed_population { id } { data/stations |where-equal 'stationed_near id |column? 'population |sum } |drop-column 'id |to-json |print

This example uses the fact that Table is a basic Rye value type, so it only makes sense that table cells can also hold tables. We add a column to the base table that holds tables to the top 3 observations, and we add another where we manually aggregate a value to its sum.

Python Pandas

import pandas as pd import json # Load CSV files with automatic type inference exoplanets = pd.read_csv('exoplanets.csv') observations = pd.read_csv('observations.csv') stations = pd.read_csv('stations.csv') # Functional approach with `apply()` requires helper functions for nested operations def get_observations_for_exoplanet(exoplanet_id): """Get top 3 observations for an exoplanet, ordered by confidence desc""" filtered_obs = observations[observations['exoplanet_id'] == exoplanet_id] sorted_obs = filtered_obs.sort_values('confidence', ascending=False) top_obs = sorted_obs[['date', 'notes']].head(3) return top_obs.to_dict('records') def get_stationed_population(exoplanet_id): """Get sum of population for stations near an exoplanet""" filtered_stations = stations[stations['stationed_near'] == exoplanet_id] return int(filtered_stations['population'].sum()) # Add nested observations and stationed population data exoplanets['observations'] = exoplanets['id'].apply(get_observations_for_exoplanet) exoplanets['stationed_population'] = exoplanets['id'].apply(get_stationed_population) # Remove the id column and convert to JSON result_json = exoplanets.drop('id', axis=1).to_json(orient='records', indent=2) print(result_json)

In this case the code between Pandas and Rye is not so similar. It seems Pandas requires the use of helper functions, that are then applied to get the aggregated / nested values.

Python

import csv import json from collections import defaultdict # Load exoplanets exoplanets = [] with open('exoplanets.csv', 'r') as file: reader = csv.DictReader(file) for row in reader: row['id'] = int(row['id']) row['mass'] = float(row['mass']) row['radius'] = float(row['radius']) row['temp'] = float(row['temp']) row['discovery_year'] = int(row['discovery_year']) exoplanets.append(row) # Load observations observations = [] with open('observations.csv', 'r') as file: reader = csv.DictReader(file) for row in reader: row['exoplanet_id'] = int(row['exoplanet_id']) row['confidence'] = float(row['confidence']) observations.append(row) # Load stations stations = [] with open('stations.csv', 'r') as file: reader = csv.DictReader(file) for row in reader: row['population'] = int(row['population']) row['stationed_near'] = int(row['stationed_near']) stations.append(row) # Group observations by exoplanet_id for efficient lookup obs_by_planet = defaultdict(list) for obs in observations: obs_by_planet[obs['exoplanet_id']].append(obs) # Group stations by stationed_near for efficient lookup stations_by_planet = defaultdict(list) for station in stations: stations_by_planet[station['stationed_near']].append(station) # Process each exoplanet for exoplanet in exoplanets: planet_id = exoplanet['id'] # Get top 3 observations by confidence planet_observations = obs_by_planet[planet_id] planet_observations.sort(key=lambda x: x['confidence'], reverse=True) top_observations = [] for obs in planet_observations[:3]: top_observations.append({ 'date': obs['date'], 'notes': obs['notes'] }) exoplanet['observations'] = top_observations # Sum stationed population total_population = sum(station['population'] for station in stations_by_planet[planet_id]) exoplanet['stationed_population'] = total_population # Remove the id column del exoplanet['id'] # Convert to JSON and print result_json = json.dumps(exoplanets, indent=2) print(result_json)

The pure-Python version quickly becomes verbose, so we would have to heavily refactor it or of course use a library like Pandas.

Example inputs

The three csv files we use in the example above:

exoplanets.csv

id,name,mass,radius,temp,discovery_year 1,Kepler-296e,1.79,1.5,251,2014 2,TOI-715b,3.02,1.55,280,2023 3,Gliese-667Cc,3.8,1.5,277,2011

observations.csv

exoplanet_id,date,confidence,notes 1,2015-01-20,0.95,Primary transit detected 1,2015-02-15,0.92,Secondary eclipse confirmed 1,2020-03-10,0.98,Follow-up observation 2,2015-01-18,0.89,Initial detection 3,2014-05-12,0.94,Multiple transits observed 2,2014-08-30,0.96,Orbital period confirmed 3,2014-05-12,0.88,Shallow transit 3,2014-11-15,0.90,Period validation 3,2023-02-14,0.97,Recent discovery

stations.csv

name,population,stationed_near Alpha Station,15000,1 Beta Outpost,8500,2 Gamma Research Base,2300,1 Delta Mining Station,12000,1 Epsilon Colony,45000,3 Zeta Trading Hub,22000,3

Example output

This is the JSON output that running a Rye script for example 3 produces.

[ { "name": "Kepler-296e", "mass": 1.79, "radius": 1.5, "temp": 251, "discovery_year": 2014, "observations": [ { "date": "2020-03-10", "notes": "Follow-up observation" }, { "date": "2015-01-20", "notes": "Primary transit detected" }, { "date": "2015-02-15", "notes": "Secondary eclipse confirmed" } ], "stationed_population": 29300 }, { "name": "TOI-715b", "mass": 3.02, "radius": 1.55, "temp": 280, "discovery_year": 2023, "observations": [ { "date": "2014-08-30", "notes": "Orbital period confirmed" }, { "date": "2015-01-18", "notes": "Initial detection" } ], "stationed_population": 8500 }, { "name": "Gliese-667Cc", "mass": 3.8, "radius": 1.5, "temp": 277, "discovery_year": 2011, "observations": [ { "date": "2023-02-14", "notes": "Recent discovery" }, { "date": "2014-05-12", "notes": "Multiple transits observed" }, { "date": "2014-11-15", "notes": "Period validation" } ], "stationed_population": 67000 } ]

More information

If you find Rye tables interesting, you can read more about them in the Cookbook page:

Cookbook / Working with Tables

To read more about the concepts and specifics of Rye visit:

For code, visit: Rye’s Github

Conclusion

Should you ditch Python and Pandas and use Rye? Of course not. Rye is still a work in progress programming language and it’s nowhere near stability and functionality compared to Python or Pandas.

I would love feedback on the concepts and whether Rye’s table abstractions could fit into your workflow. I’d also love suggestions on how to make it even more useful and internally consistent.

Read Entire Article