Show HN: Databroom – Clean data with GUI or CLI and generate Python/R scripts


A DataFrame cleaning tool with a command-line interface, an interactive GUI, and a programmatic API. It automatically generates reproducible Python/pandas code, R/tidyverse code, and CLI commands.

Available on PyPI. Requires Python 3.8+. MIT License.

🌐 Try the Demo App – Interactive GUI in the browser, no install needed.



The Problem: Manual Data Cleaning is Tedious

With pandas (manual approach):

```python
# Manual pandas approach: ~30 lines of code
import pandas as pd
import unicodedata

df = pd.read_excel("survey_data.xlsx")

# Remove empty columns
empty_threshold = 0.8
df = df.dropna(axis=1, thresh=int(empty_threshold * len(df)))

# Remove empty rows
df = df.dropna(how='all')

# Fix column names
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace('[^a-z0-9_]', '', regex=True)

# Normalize text in all string columns
def clean_text(text):
    if pd.isna(text) or not isinstance(text, str):
        return text
    # Remove accents
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(c for c in text if not unicodedata.combining(c))
    return text

string_columns = df.select_dtypes(include=['object']).columns
for col in string_columns:
    df[col] = df[col].apply(clean_text)

df.to_csv("cleaned_survey.csv", index=False)
```

Databroom approach:

```bash
databroom clean survey_data.xlsx \
  --clean-all \
  --empty-threshold 0.8 \
  --output-file cleaned_survey.csv \
  --output-code survey_cleaning.py \
  --verbose
```

Result: the same output from a single command, plus an auto-generated reproducible script.

| Feature | Manual pandas | Databroom |
|---|---|---|
| Lines of code | ~30 lines | 1 command |
| Time to implement | 10-15 minutes | 10 seconds |
| Error prone | High (manual logic) | Low (tested operations) |
| Reproducible | Need to save the script | Auto-generates code |
| Cross-language | Python only | Python + R output |
| GUI option | No | Yes (databroom gui) |
| Parameter tuning | Manual coding | CLI flags & GUI sliders |

Perfect for:

  • 🤖 Full automation - Transform your entire data cleaning pipeline into a single command
  • Quick data exploration and cleaning
  • Batch processing multiple files
  • Learning data cleaning best practices
  • Generating reproducible cleaning scripts
  • Teams needing consistent data preprocessing
  • Converting workflows between Python and R
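
On that last point, generating the R version of a pipeline is a one-argument change in the programmatic API shown below. A minimal sketch, assuming (as the --lang r flag suggests) that 'r' is the accepted language name for CodeGenerator:

```python
from databroom import Broom, CodeGenerator

# Clean once, then export the same recorded pipeline in both languages.
result = Broom.from_file('data.csv').clean_all()

for lang, path in [('python', 'pipeline.py'), ('r', 'pipeline.R')]:
    generator = CodeGenerator(lang)
    generator.load_history(result.get_history())
    generator.export_code(path)
```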

Installation

```bash
# Complete installation - CLI + GUI + API (recommended)
pip install databroom

# Verify installation
databroom --version

# CLI + API only (no Streamlit GUI)
pip install databroom[cli]

# GUI + API only (no CLI interface)
pip install databroom[gui]
```

Command Line Interface (Primary Interface)

Clean your data files instantly with powerful CLI commands:

```bash
# Smart clean everything (recommended)
databroom clean data.csv --clean-all --output-file clean.csv

# Column cleaning with custom threshold
databroom clean messy.xlsx --clean-columns --empty-threshold 0.8 --output-file cleaned.xlsx

# Complete cleaning pipeline with code generation
databroom clean survey.csv --clean-all --output-code cleaning_script.py --lang python

# Generate R/tidyverse code
databroom clean data.csv --clean-rows --output-code analysis.R --lang r

# Advanced options with verbose output
databroom clean dataset.json --clean-all --no-snakecase --verbose --info

# Launch interactive GUI
databroom gui

# List all available operations
databroom list
```

Interactive GUI

Launch the web-based interface for visual data cleaning:

```bash
databroom gui  # Opens http://localhost:8501 in your browser
```

(Screenshots: an animated preview of the GUI workflow; the Current Data tab with a preview of the loaded DataFrame, memory usage, and missing-values summary; the History & Pipeline tab for reviewing past cleaning steps and saving or running a pipeline; the Data Info tab with column types, non-null counts, and sample values; and the Export Code tab with auto-generated Python/pandas code for the performed operations.)

Programmatic API

Use Databroom directly in your Python scripts:

```python
from databroom import Broom

# Load and clean data with method chaining
broom = Broom.from_file('data.csv')
result = broom.clean_all()  # Smart clean everything

# Or use specific operations
result = (broom
          .clean_columns(empty_threshold=0.9)
          .clean_rows())

# Get cleaned DataFrame
cleaned_df = result.get_df()
print(f"Cleaned {cleaned_df.shape[0]} rows × {cleaned_df.shape[1]} columns")

# Generate reproducible code
from databroom import CodeGenerator

generator = CodeGenerator('python')
generator.load_history(result.get_history())
generator.export_code('my_cleaning_pipeline.py')
```

🖥️ Command Line Interface

  • Instant cleaning with intuitive flags and parameters
  • Batch processing capabilities for multiple files
  • Code generation in Python/pandas and R/tidyverse
  • Flexible output formats (CSV, Excel, JSON)
  • Rich help system with examples and colored output
  • Verbose mode for detailed operation feedback

🎨 Interactive GUI

  • Drag & drop file upload (CSV, Excel, JSON)
  • Live preview of cleaning operations
  • Interactive parameter tuning with sliders and inputs
  • Real-time code generation with syntax highlighting
  • One-click download of cleaned data and generated scripts
  • Operation history with undo functionality
  • Pipeline management: save current cleaning pipelines to JSON and re-upload them to reproduce or continue work

🐍 Programmatic API

  • Chainable methods for fluent data cleaning workflows
  • Factory methods for easy file loading (from_csv(), from_excel(), etc.)
  • History tracking for reproducible operations
  • Template-based code generation with Jinja2
  • Pipeline I/O: export and load pipelines directly from Python for automated cleaning sessions
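
A minimal sketch of what pipeline I/O can look like, using only the calls from the API example above plus the standard library, and assuming (as the save/load-as-JSON feature implies) that the recorded history is JSON-serializable:

```python
import json
from databroom import Broom, CodeGenerator

# Export: run a cleaning session and persist its history as a JSON pipeline.
result = Broom.from_file('data.csv').clean_columns(empty_threshold=0.9).clean_rows()
with open('pipeline.json', 'w') as f:
    json.dump(result.get_history(), f, indent=2)

# Load: a later session reads the pipeline back and replays it through the
# code generator, exactly as in the API example above.
with open('pipeline.json') as f:
    history = json.load(f)
generator = CodeGenerator('python')
generator.load_history(history)
generator.export_code('replayed_pipeline.py')
```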

📝 Code Generation

  • Complete scripts with imports, file loading, and execution
  • Cross-language support (Python/pandas ↔ R/tidyverse)
  • Template system for customizable output formats
  • Reproducible workflows that can be shared and version controlled
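
To make "complete scripts" concrete, the generated output for a simple clean-columns pipeline would be an ordinary standalone pandas script, roughly like this (illustrative, not verbatim Databroom output):

```python
# Auto-generated cleaning pipeline (illustrative example of generator output)
import pandas as pd

df = pd.read_csv('data.csv')

# clean_columns(empty_threshold=0.9)
df = df.dropna(axis=1, thresh=int(0.9 * len(df)))
df.columns = (df.columns.str.lower()
              .str.replace(' ', '_')
              .str.replace('[^a-z0-9_]', '', regex=True))

df.to_csv('data_clean.csv', index=False)
```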

🧰 Available Cleaning Operations

| Operation | CLI Flag | Purpose |
|---|---|---|
| 🧹 Clean All | --clean-all | Smart clean everything: columns + rows with all operations |
| 📌 Promote Headers | --promote-headers | Convert a data row to column headers |
| 📋 Clean Columns | --clean-columns | Clean column names: snake_case + remove accents + remove empty |
| 📊 Clean Rows | --clean-rows | Clean row data: snake_case + remove accents + remove empty |
| ✏️ Rename Column | --rename-column | Rename a column (pair with --rename-column-old/--rename-column-new) |
| 🔀 Reorder Columns | --reorder-columns | Reorder columns by listing the desired leading order |
| 🗑️ Drop Columns | --drop-columns | Remove selected columns from the dataset |
| 📌 Keep Columns | --keep-columns | Retain only the specified columns in the given order |
Full flag reference:

```bash
# Smart Operations (recommended)
--clean-all              # Clean everything: columns + rows
--clean-columns          # Clean column names only
--clean-rows             # Clean row data only

# Structure Operations
--promote-headers        # Convert data row to column headers
--promote-row-index 1    # Row index to promote (default: 0)
--keep-promoted-row      # Keep the promoted row in data
--rename-column --rename-column-old Producto --rename-column-new Item
--reorder-columns --reorder-columns-list "id,date,total"
--drop-columns --drop-columns-list "temp_column,notes"
--keep-columns --keep-columns-list "id,date,total"

# Advanced Options (disable specific operations)
--no-snakecase           # Keep original text case in rows
--no-snakecase-cols      # Keep original column name case
--no-remove-accents-vals # Keep accents in text values
--no-remove-empty-cols   # Keep empty columns

# Parameters
--empty-threshold 0.8    # Custom missing value threshold (default: 0.9)

# Output options
--output-file cleaned.csv # Save cleaned data
--output-code script.py   # Generate code file
--lang python             # Code language (python/r)

# Behavior options
--verbose                 # Detailed output
--quiet                   # Minimal output
--info                    # Show DataFrame info
```
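
If the chaining API mirrors these flags one-to-one, a structure-editing session from Python might look like the sketch below. Method and parameter names here (promote_headers, rename_column, keep_columns) are inferred from the flags, not confirmed API:

```python
from databroom import Broom

# Hypothetical sketch: method/parameter names guessed from the CLI flags above.
result = (Broom.from_file('invoice_export.xlsx')
          .promote_headers(row_index=1)           # --promote-headers --promote-row-index 1
          .rename_column('Producto', 'Item')      # --rename-column-old / --rename-column-new
          .keep_columns(['id', 'date', 'total'])  # --keep-columns-list "id,date,total"
          .clean_columns(empty_threshold=0.8))    # --empty-threshold 0.8

result.get_df().to_csv('invoices_clean.csv', index=False)
```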

```bash
# Clean survey data and generate analysis script
databroom clean survey_data.xlsx \
  --clean-all \
  --empty-threshold 0.7 \
  --output-file clean_survey.csv \
  --output-code survey_analysis.py \
  --verbose
```

```bash
# Generate R script for tidyverse users
databroom clean research_data.csv \
  --clean-all \
  --output-code tidyverse_pipeline.R \
  --lang r
```

```bash
# Process multiple files with consistent operations
for file in data/*.csv; do
  databroom clean "$file" \
    --clean-columns \
    --output-file "clean_$(basename "$file")" \
    --quiet
done
```
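
The same batch idea works from Python, using only the calls shown in the API example above:

```python
from pathlib import Path
from databroom import Broom

# Apply identical column cleaning to every CSV in data/, mirroring the
# shell loop above (clean_columns() with its default threshold).
for path in Path('data').glob('*.csv'):
    cleaned = Broom.from_file(str(path)).clean_columns().get_df()
    cleaned.to_csv(path.with_name(f'clean_{path.name}'), index=False)
```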

Architecture

Databroom follows a modular architecture designed for extensibility and maintainability:

```
databroom/
├── cli/                     # Command line interface (Typer + Rich)
│   ├── main.py              # Entry point and app configuration
│   ├── commands.py          # CLI commands (clean, gui, list)
│   ├── operations.py        # Operation parsing and execution
│   └── utils.py             # File handling and code generation
├── core/                    # Core cleaning engine
│   ├── broom.py             # Main API with method chaining
│   ├── pipeline.py          # Operation coordination and state management
│   ├── cleaning_ops.py      # Individual cleaning operations
│   └── history_tracker.py   # Automatic operation tracking
├── generators/              # Code generation system
│   ├── base.py              # Template-based code generator
│   └── templates/           # Jinja2 templates for Python/R
├── gui/                     # Modular Streamlit web interface
│   ├── app.py               # Main orchestrator (83 lines)
│   ├── components/          # Reusable UI components
│   │   ├── file_upload.py   # File upload and processing
│   │   ├── operations.py    # Data cleaning operations
│   │   ├── controls.py      # Step back, reset, reload controls
│   │   └── tabs.py          # Data display and export tabs
│   └── utils/               # GUI utilities
│       ├── session.py       # Session state management
│       └── styles.py        # CSS styling and theming
└── tests/                   # Comprehensive test suite
```
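
The piece that makes code generation possible is history_tracker.py: every cleaning operation records itself as it runs, and the generators replay that record as code. A minimal sketch of the pattern (illustrative only, not Databroom's actual implementation):

```python
import functools
import pandas as pd

def tracked(func):
    """Record each operation's name and arguments so a code generator
    can reproduce the session later."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        self.history.append((func.__name__, args, kwargs))
        return func(self, *args, **kwargs)
    return wrapper

class MiniBroom:
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.history = []

    @tracked
    def clean_rows(self):
        self.df = self.df.dropna(how='all')
        return self  # chainable, like the real API
```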

Development

```bash
# Clone repository
git clone https://github.com/onlozanoo/databroom.git
cd databroom

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev,cli,all]"

# Run tests
pytest

# Run CLI locally
python -m databroom.cli.main --help
```

```bash
# Run full test suite
pytest

# Run with coverage
pytest --cov=databroom

# Run specific test categories
pytest -m "not slow"   # Skip slow tests
pytest tests/cli/      # Test CLI only
pytest tests/core/     # Test core functionality
```

```bash
# Format code
black databroom/
isort databroom/

# Lint
flake8 databroom/

# Type check
mypy databroom/
```

Current Version: v0.4 – Portable Pipelines Across GUI, CLI, and API

Design a cleaning pipeline once, then apply it anywhere:

  • Create a cleaning workflow visually in the GUI
  • Export it as a JSON pipeline
  • Run it headlessly via CLI or integrate into scripts and APIs
  • Re-import it to GUI for review or extension

This update makes your data prep workflows reusable, versionable, and automatable across any environment — without code duplication or switching tools.

✅ Fully Implemented

  • Smart Operations: --clean-all, --clean-columns, --clean-rows, --promote-headers
  • Modular GUI Architecture: Organized components with 86% code reduction
  • Complete CLI with simplified and legacy operations
  • Interactive Streamlit GUI with live preview and organized operations
  • Programmatic API with method chaining
  • Python and R code generation with parameter filtering
  • Comprehensive test suite
  • Save/load cleaning pipelines as JSON
  • Live on PyPI: pip install databroom
  • Dynamic new operations loading system
  • Extensible component-based GUI structure

🚧 In Active Development

  • Extended cleaning operations library
  • Advanced parameter validation
  • Performance optimizations
  • Enhanced error handling

📋 Planned Features

  • Preview in CLI
  • Configuration presets and templates
  • Enhanced batch processing workflows
  • Custom cleaning operation plugins system
  • Integration with pandas-profiling and data validation tools
  • Advanced data quality reporting and metrics

I welcome contributions! Here's how you can help:

  • 🐛 Bug Reports: Submit issues with detailed reproduction steps
  • 💡 Feature Requests: Propose new cleaning operations or CLI features
  • 📝 Documentation: Improve examples, tutorials, or API docs
  • 🧪 Testing: Add test cases or improve test coverage
  • 💻 Code: Implement new features or fix existing issues

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ❤️ for the data science community
