Show HN: Databroom – Clean data with GUI or CLI and generate Python/R scripts


A DataFrame cleaning tool with a command-line interface, an interactive GUI, and a programmatic API. It automatically generates reproducible Python/pandas code, R/tidyverse code, and CLI commands.

Available on PyPI. Requires Python 3.8+. MIT License.

🌐 Try the Demo App – Interactive GUI in the browser, no install needed.



The Problem: Manual Data Cleaning is Tedious

With pandas (manual approach):

```python
# Manual pandas approach: ~30 lines of code
import pandas as pd
import unicodedata

df = pd.read_excel("survey_data.xlsx")

# Remove empty columns
empty_threshold = 0.8
df = df.dropna(axis=1, thresh=int(empty_threshold * len(df)))

# Remove empty rows
df = df.dropna(how='all')

# Fix column names
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace('[^a-z0-9_]', '', regex=True)

# Normalize text in all string columns
def clean_text(text):
    if pd.isna(text) or not isinstance(text, str):
        return text
    # Remove accents
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    return text

string_columns = df.select_dtypes(include=['object']).columns
for col in string_columns:
    df[col] = df[col].apply(clean_text)

df.to_csv("cleaned_survey.csv", index=False)
```

Databroom approach:

```bash
databroom clean survey_data.xlsx \
  --clean-all \
  --empty-threshold 0.8 \
  --output-file cleaned_survey.csv \
  --output-code survey_cleaning.py \
  --verbose
```

Result: the same output from a single command, plus an auto-generated reproducible script.

| Feature | Manual pandas | Databroom |
|---|---|---|
| Lines of code | ~30 lines | 1 command |
| Time to implement | 10-15 minutes | 10 seconds |
| Error prone | High (manual logic) | Low (tested operations) |
| Reproducible | Need to save the script | Auto-generates code |
| Cross-language | Python only | Python + R output |
| GUI option | No | Yes (databroom gui) |
| Parameter tuning | Manual coding | CLI flags & GUI sliders |

Perfect for:

  • 🤖 Full automation - Transform your entire data cleaning pipeline into a single command
  • Quick data exploration and cleaning
  • Batch processing multiple files
  • Learning data cleaning best practices
  • Generating reproducible cleaning scripts
  • Teams needing consistent data preprocessing
  • Converting workflows between Python and R
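
On that last point, generating the R version of a pipeline is a one-argument change in the programmatic API shown below. A minimal sketch, assuming (as the --lang r flag suggests) that 'r' is the accepted language name for CodeGenerator:

```python
from databroom import Broom, CodeGenerator

# Clean once, then export the same recorded pipeline in both languages.
result = Broom.from_file('data.csv').clean_all()

for lang, path in [('python', 'pipeline.py'), ('r', 'pipeline.R')]:
    generator = CodeGenerator(lang)
    generator.load_history(result.get_history())
    generator.export_code(path)
```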

Installation

```bash
# Complete installation - CLI + GUI + API (recommended)
pip install databroom

# Verify installation
databroom --version

# CLI + API only (no Streamlit GUI)
pip install databroom[cli]

# GUI + API only (no CLI interface)
pip install databroom[gui]
```

Command Line Interface (Primary Interface)

Clean your data files instantly with powerful CLI commands:

```bash
# Smart clean everything (recommended)
databroom clean data.csv --clean-all --output-file clean.csv

# Column cleaning with custom threshold
databroom clean messy.xlsx --clean-columns --empty-threshold 0.8 --output-file cleaned.xlsx

# Complete cleaning pipeline with code generation
databroom clean survey.csv --clean-all --output-code cleaning_script.py --lang python

# Generate R/tidyverse code
databroom clean data.csv --clean-rows --output-code analysis.R --lang r

# Advanced options with verbose output
databroom clean dataset.json --clean-all --no-snakecase --verbose --info

# Launch interactive GUI
databroom gui

# List all available operations
databroom list
```

Interactive GUI

Launch the web-based interface for visual data cleaning:

```bash
databroom gui  # Opens http://localhost:8501 in your browser
```

(Screenshots: an animated preview of the GUI workflow; the Current Data tab with a preview of the loaded DataFrame, memory usage, and missing-values summary; the History & Pipeline tab for reviewing past cleaning steps and saving or running a pipeline; the Data Info tab with column types, non-null counts, and sample values; and the Export Code tab with auto-generated Python/pandas code for the performed operations.)

Programmatic API

Use Databroom directly in your Python scripts:

```python
from databroom import Broom

# Load and clean data with method chaining
broom = Broom.from_file('data.csv')
result = broom.clean_all()  # Smart clean everything

# Or use specific operations
result = (broom
          .clean_columns(empty_threshold=0.9)
          .clean_rows())

# Get cleaned DataFrame
cleaned_df = result.get_df()
print(f"Cleaned {cleaned_df.shape[0]} rows × {cleaned_df.shape[1]} columns")

# Generate reproducible code
from databroom import CodeGenerator

generator = CodeGenerator('python')
generator.load_history(result.get_history())
generator.export_code('my_cleaning_pipeline.py')
```

🖥️ Command Line Interface

  • Instant cleaning with intuitive flags and parameters
  • Batch processing capabilities for multiple files
  • Code generation in Python/pandas and R/tidyverse
  • Flexible output formats (CSV, Excel, JSON)
  • Rich help system with examples and colored output
  • Verbose mode for detailed operation feedback

🎨 Interactive GUI

  • Drag & drop file upload (CSV, Excel, JSON)
  • Live preview of cleaning operations
  • Interactive parameter tuning with sliders and inputs
  • Real-time code generation with syntax highlighting
  • One-click download of cleaned data and generated scripts
  • Operation history with undo functionality
  • Pipeline management: save current cleaning pipelines to JSON and re-upload them to reproduce or continue work

🐍 Programmatic API

  • Chainable methods for fluent data cleaning workflows
  • Factory methods for easy file loading (from_csv(), from_excel(), etc.)
  • History tracking for reproducible operations
  • Template-based code generation with Jinja2
  • Pipeline I/O: export and load pipelines directly from Python for automated cleaning sessions
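
A minimal sketch of what pipeline I/O can look like, using only the calls from the API example above plus the standard library, and assuming (as the save/load-as-JSON feature implies) that the recorded history is JSON-serializable:

```python
import json
from databroom import Broom, CodeGenerator

# Export: run a cleaning session and persist its history as a JSON pipeline.
result = Broom.from_file('data.csv').clean_columns(empty_threshold=0.9).clean_rows()
with open('pipeline.json', 'w') as f:
    json.dump(result.get_history(), f, indent=2)

# Load: a later session reads the pipeline back and replays it through the
# code generator, exactly as in the API example above.
with open('pipeline.json') as f:
    history = json.load(f)
generator = CodeGenerator('python')
generator.load_history(history)
generator.export_code('replayed_pipeline.py')
```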

📝 Code Generation

  • Complete scripts with imports, file loading, and execution
  • Cross-language support (Python/pandas ↔ R/tidyverse)
  • Template system for customizable output formats
  • Reproducible workflows that can be shared and version controlled
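
To make "complete scripts" concrete, the generated output for a simple clean-columns pipeline would be an ordinary standalone pandas script, roughly like this (illustrative, not verbatim Databroom output):

```python
# Auto-generated cleaning pipeline (illustrative example of generator output)
import pandas as pd

df = pd.read_csv('data.csv')

# clean_columns(empty_threshold=0.9)
df = df.dropna(axis=1, thresh=int(0.9 * len(df)))
df.columns = (df.columns.str.lower()
              .str.replace(' ', '_')
              .str.replace('[^a-z0-9_]', '', regex=True))

df.to_csv('data_clean.csv', index=False)
```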

🧰 Available Cleaning Operations

| Operation | CLI Flag | Purpose |
|---|---|---|
| 🧹 Clean All | --clean-all | Smart clean everything: columns + rows with all operations |
| 📌 Promote Headers | --promote-headers | Convert a data row to column headers |
| 📋 Clean Columns | --clean-columns | Clean column names: snake_case + remove accents + remove empty |
| 📊 Clean Rows | --clean-rows | Clean row data: snake_case + remove accents + remove empty |
| ✏️ Rename Column | --rename-column | Rename a column (pair with --rename-column-old/--rename-column-new) |
| 🔀 Reorder Columns | --reorder-columns | Reorder columns by listing the desired leading order |
| 🗑️ Drop Columns | --drop-columns | Remove selected columns from the dataset |
| 📌 Keep Columns | --keep-columns | Retain only the specified columns in the given order |
Full flag reference:

```bash
# Smart Operations (recommended)
--clean-all              # Clean everything: columns + rows
--clean-columns          # Clean column names only
--clean-rows             # Clean row data only

# Structure Operations
--promote-headers        # Convert data row to column headers
--promote-row-index 1    # Row index to promote (default: 0)
--keep-promoted-row      # Keep the promoted row in data
--rename-column --rename-column-old Producto --rename-column-new Item
--reorder-columns --reorder-columns-list "id,date,total"
--drop-columns --drop-columns-list "temp_column,notes"
--keep-columns --keep-columns-list "id,date,total"

# Advanced Options (disable specific operations)
--no-snakecase           # Keep original text case in rows
--no-snakecase-cols      # Keep original column name case
--no-remove-accents-vals # Keep accents in text values
--no-remove-empty-cols   # Keep empty columns

# Parameters
--empty-threshold 0.8    # Custom missing value threshold (default: 0.9)

# Output options
--output-file cleaned.csv # Save cleaned data
--output-code script.py   # Generate code file
--lang python             # Code language (python/r)

# Behavior options
--verbose                 # Detailed output
--quiet                   # Minimal output
--info                    # Show DataFrame info
```
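
If the chaining API mirrors these flags one-to-one, a structure-editing session from Python might look like the sketch below. Method and parameter names here (promote_headers, rename_column, keep_columns) are inferred from the flags, not confirmed API:

```python
from databroom import Broom

# Hypothetical sketch: method/parameter names guessed from the CLI flags above.
result = (Broom.from_file('invoice_export.xlsx')
          .promote_headers(row_index=1)           # --promote-headers --promote-row-index 1
          .rename_column('Producto', 'Item')      # --rename-column-old / --rename-column-new
          .keep_columns(['id', 'date', 'total'])  # --keep-columns-list "id,date,total"
          .clean_columns(empty_threshold=0.8))    # --empty-threshold 0.8

result.get_df().to_csv('invoices_clean.csv', index=False)
```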

```bash
# Clean survey data and generate analysis script
databroom clean survey_data.xlsx \
  --clean-all \
  --empty-threshold 0.7 \
  --output-file clean_survey.csv \
  --output-code survey_analysis.py \
  --verbose
```

```bash
# Generate R script for tidyverse users
databroom clean research_data.csv \
  --clean-all \
  --output-code tidyverse_pipeline.R \
  --lang r
```

```bash
# Process multiple files with consistent operations
for file in data/*.csv; do
  databroom clean "$file" \
    --clean-columns \
    --output-file "clean_$(basename "$file")" \
    --quiet
done
```
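
The same batch idea works from Python, using only the calls shown in the API example above:

```python
from pathlib import Path
from databroom import Broom

# Apply identical column cleaning to every CSV in data/, mirroring the
# shell loop above (clean_columns() with its default threshold).
for path in Path('data').glob('*.csv'):
    cleaned = Broom.from_file(str(path)).clean_columns().get_df()
    cleaned.to_csv(path.with_name(f'clean_{path.name}'), index=False)
```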

Architecture

Databroom follows a modular architecture designed for extensibility and maintainability:

```
databroom/
├── cli/                     # Command line interface (Typer + Rich)
│   ├── main.py              # Entry point and app configuration
│   ├── commands.py          # CLI commands (clean, gui, list)
│   ├── operations.py        # Operation parsing and execution
│   └── utils.py             # File handling and code generation
├── core/                    # Core cleaning engine
│   ├── broom.py             # Main API with method chaining
│   ├── pipeline.py          # Operation coordination and state management
│   ├── cleaning_ops.py      # Individual cleaning operations
│   └── history_tracker.py   # Automatic operation tracking
├── generators/              # Code generation system
│   ├── base.py              # Template-based code generator
│   └── templates/           # Jinja2 templates for Python/R
├── gui/                     # Modular Streamlit web interface
│   ├── app.py               # Main orchestrator (83 lines)
│   ├── components/          # Reusable UI components
│   │   ├── file_upload.py   # File upload and processing
│   │   ├── operations.py    # Data cleaning operations
│   │   ├── controls.py      # Step back, reset, reload controls
│   │   └── tabs.py          # Data display and export tabs
│   └── utils/               # GUI utilities
│       ├── session.py       # Session state management
│       └── styles.py        # CSS styling and theming
└── tests/                   # Comprehensive test suite
```
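
The piece that makes code generation possible is history_tracker.py: every cleaning operation records itself as it runs, and the generators replay that record as code. A minimal sketch of the pattern (illustrative only, not Databroom's actual implementation):

```python
import functools
import pandas as pd

def tracked(func):
    """Record each operation's name and arguments so a code generator
    can reproduce the session later."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self.history.append((func.__name__, args, kwargs))
        return func(self, *args, **kwargs)
    return wrapper

class MiniBroom:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.history = []

    @tracked
    def clean_rows(self):
        self.df = self.df.dropna(how='all')
        return self  # chainable, like the real API
```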

Development

```bash
# Clone repository
git clone https://github.com/onlozanoo/databroom.git
cd databroom

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,cli,all]"

# Run tests
pytest

# Run CLI locally
python -m databroom.cli.main --help
```

```bash
# Run full test suite
pytest

# Run with coverage
pytest --cov=databroom

# Run specific test categories
pytest -m "not slow"   # Skip slow tests
pytest tests/cli/      # Test CLI only
pytest tests/core/     # Test core functionality
```

```bash
# Format code
black databroom/
isort databroom/

# Lint
flake8 databroom/

# Type check
mypy databroom/
```

Current Version: v0.4 – Portable Pipelines Across GUI, CLI, and API

Design a cleaning pipeline once, then apply it anywhere:

  • Create a cleaning workflow visually in the GUI
  • Export it as a JSON pipeline
  • Run it headlessly via CLI or integrate into scripts and APIs
  • Re-import it to GUI for review or extension

This update makes your data prep workflows reusable, versionable, and automatable across any environment — without code duplication or switching tools.

✅ Fully Implemented

  • Smart Operations: --clean-all, --clean-columns, --clean-rows, --promote-headers
  • Modular GUI Architecture: Organized components with 86% code reduction
  • Complete CLI with simplified and legacy operations
  • Interactive Streamlit GUI with live preview and organized operations
  • Programmatic API with method chaining
  • Python and R code generation with parameter filtering
  • Comprehensive test suite
  • Save/load cleaning pipelines as JSON
  • Live on PyPI: pip install databroom
  • Dynamic new operations loading system
  • Extensible component-based GUI structure

🚧 In Active Development

  • Extended cleaning operations library
  • Advanced parameter validation
  • Performance optimizations
  • Enhanced error handling

📋 Planned Features

  • Preview in CLI
  • Configuration presets and templates
  • Enhanced batch processing workflows
  • Custom cleaning operation plugins system
  • Integration with pandas-profiling and data validation tools
  • Advanced data quality reporting and metrics

I welcome contributions! Here's how you can help:

  • 🐛 Bug Reports: Submit issues with detailed reproduction steps
  • 💡 Feature Requests: Propose new cleaning operations or CLI features
  • 📝 Documentation: Improve examples, tutorials, or API docs
  • 🧪 Testing: Add test cases or improve test coverage
  • 💻 Code: Implement new features or fix existing issues

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ❤️ for the data science community
