Show HN: Datatune- Perform data transformations with Natural Language using LLMs

2 hours ago 3

PyPI version License PyPI Downloads Docs

Perform transformations on your data with natural language using LLMs

From source:

import os import dask.dataframe as dd import datatune as dt from datatune.llm.llm import OpenAI os.environ["OPENAI_API_KEY"] = "your-openai-api-key" llm = OpenAI(model_name="gpt-3.5-turbo") # Load data from your source with Dask df = dd.read_csv("tests/test_data/products.csv") print(df.head()) # Transform data with Map mapped = dt.Map( prompt="Extract categories from the description.", output_fields=["Category", "Subcategory"] )(llm, df) # Filter data based on criteria filtered = dt.Filter( prompt="Keep only electronics products" )(llm, mapped) # Get the final dataframe after cleanup of metadata and deleted rows after operations using `finalize`. result = dt.finalize(filtered) result.compute().to_csv("electronics_products.csv") new_df = dd.read_csv("electronics_products.csv") print(new_df.head())

products.csv

ProductID Name Price Quantity Description SKU 0 1001 Wireless Mouse 25.99 150 Ergonomic wireless mouse with 2.4GHz connectivity WM-1001 1 1002 Office Chair 89.99 75 Comfortable swivel office chair with lumbar su... OC-2002 2 1003 Coffee Mug 9.49 300 Ceramic mug, 12oz, microwave safe CM-3003 3 1004 LED Monitor 24" 149.99 60 24-inch Full HD LED monitor with HDMI and VGA ... LM-2404 4 1005 Notebook Pack 6.99 500 Pack of 3 ruled notebooks, 100 pages each NP-5005

electronics_products.csv

Unnamed: 0 ProductID Name ... SKU Category Subcategory 0 0 1001 Wireless Mouse ... WM-1001 Electronics Computer Accessories 1 3 1004 LED Monitor 24" ... LM-2404 Electronics Monitors 2 6 1007 USB-C Cable 1m ... UC-7007 Electronics Cables 3 8 1009 Bluetooth Speaker ... BS-9009 Electronics Audio

Transform data with natural language:

customers = dd.read_csv("customers.csv") mapped = dt.Map( prompt="Extract country and city from the address field", output_fields=["country", "city"] )(llm, customers)
# Filter to remove rows filtered = dt.Filter( prompt="Keep only customers who are from Asia" )(llm, mapped)

Datatune works with various LLM providers with the help of LiteLLM under the hood.:

# Using Ollama from datatune.llm.llm import Ollama llm = Ollama() # Using Azure from datatune.llm.llm import Azure llm = Azure( model_name="gpt-35-turbo", api_key=api_key, api_base=api_base, api_version=api_version) # OpenAI from datatune.llm.llm import OpenAI llm = OpenAI(model_name="gpt-3.5-turbo")

Datatune leverages Dask DataFrames to enable scalable processing across large datasets. This approach allows you to:

  • Process data larger than context length of LLMs
  • Execute parallel computations efficiently

If you're working with pandas DataFrames, convert them with a simple:

import dask.dataframe as dd dask_df = dd.from_pandas(pandas_df, npartitions=4) # adjust partitions based on your data size

Check out examples

Check out our documentation to learn how to use datatune.

Want to raise an issue or want us to build a new feature? Head over to issues and raise a ticket!

You can also mail us at [email protected]

MIT License

Read Entire Article