Turn fenic data sandboxes into versioned Hugging Face datasets


fenic is a PySpark-inspired DataFrame framework designed for building production AI and agentic applications. It can read datasets directly from the Hugging Face Hub.

Getting Started

To get started, install fenic with pip:

pip install fenic

Create a Session

Instantiate a fenic session with the default configuration (sufficient for reading datasets and other non-semantic operations):

import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

Overview

fenic is an opinionated data processing framework that combines:

  • DataFrame API: PySpark-inspired operations for familiar data manipulation
  • Semantic Operations: Built-in AI/LLM operations including semantic functions, embeddings, and clustering
  • Model Integration: Native support for AI providers (Anthropic, OpenAI, Cohere, Google)
  • Query Optimization: Automatic optimization through logical plan transformations

Read from Hugging Face Hub

fenic can read datasets directly from the Hugging Face Hub using the hf:// protocol. This functionality is built into fenic’s DataFrameReader interface.

Supported Formats

fenic supports reading the following formats from Hugging Face:

  • Parquet files (.parquet)
  • CSV files (.csv)

Reading Datasets

To read a dataset from the Hugging Face Hub:

import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

# Read a single CSV file from a dataset repository
df = session.read.csv("hf://datasets/datasets-examples/doc-formats-csv-1/data.csv")

# Read all Parquet files matching a glob pattern
df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Read from the auto-converted Parquet branch via the @~parquet revision alias
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet")

Reading with Schema Management

# Merge compatible schemas across multiple CSV files
df = session.read.csv("hf://datasets/username/dataset_name/*.csv", merge_schemas=True)

# Merge compatible schemas across multiple Parquet files
df = session.read.parquet("hf://datasets/username/dataset_name/*.parquet", merge_schemas=True)

Note: In fenic, a schema is the set of column names and their data types. When you enable merge_schemas, fenic tries to reconcile differences across files by filling missing columns with nulls and widening types where it can. Some layouts still cannot be merged; see the fenic documentation for the CSV and Parquet schema-merging limitations.
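For instance, if one shard has a column that another lacks, the merged result surfaces that column and fills it with nulls where it is absent. A minimal sketch, assuming a hypothetical repository with two Parquet shards whose schemas differ by one column (and that a list of paths composes with merge_schemas as in the surrounding examples):

# Hypothetical shards: part-0.parquet has (id, text); part-1.parquet adds a "score" column
df = session.read.parquet(
    [
        "hf://datasets/username/dataset_name/part-0.parquet",
        "hf://datasets/username/dataset_name/part-1.parquet",
    ],
    merge_schemas=True,
)

# Rows from part-0.parquet carry null in the merged "score" column
df.show()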

Authentication

To read private datasets, you need to set your Hugging Face token as an environment variable:

export HF_TOKEN="your_hugging_face_token_here"
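If you prefer to keep configuration in Python, the same variable can be set with os.environ; a minimal sketch (the token value is a placeholder, and it assumes the variable is set before the first read from the Hub):

import os

# Placeholder token; set this before any Hub reads so the reader can pick it up
os.environ["HF_TOKEN"] = "your_hugging_face_token_here"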

Path Format

The Hugging Face path format in fenic follows this structure:

hf://{repo_type}/{repo_id}/{path_to_file}

You can also specify dataset revisions or versions:

hf://{repo_type}/{repo_id}@{revision}/{path_to_file}

Features:

  • Supports glob patterns (*, **)
  • Dataset revisions/versions using @ notation:
    • Specific commit: @d50d8923b5934dc8e74b66e6e4b0e2cd85e9142e
    • Branch: @refs/convert/parquet
    • Branch alias: @~parquet
  • Requires HF_TOKEN environment variable for private datasets
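Putting these pieces together, and assuming the @~parquet alias is shorthand for the refs/convert/parquet branch, the following two reads should be equivalent:

# Full branch name
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@refs/convert/parquet/**/*.parquet")

# Branch alias for the same revision
df = session.read.parquet("hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet")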

Mixing Data Sources

fenic allows you to combine multiple data sources in a single read operation, including mixing different protocols:

df = session.read.parquet([
    "hf://datasets/cais/mmlu/astronomy/*.parquet",
    "file:///local/data/*.parquet",
    "./relative/path/data.parquet",
])

This lets you combine data from the Hugging Face Hub and local files in a single processing pipeline.

Processing Data from Hugging Face

Once loaded from Hugging Face, you can use fenic’s full DataFrame API:

Basic DataFrame Operations

import fenic as fc

session = fc.Session.get_or_create(fc.SessionConfig())

df = session.read.parquet("hf://datasets/imdb/plain_text/train-*.parquet")

# Keep only positive reviews and the columns of interest
positive_reviews = df.filter(fc.col("label") == 1).select("text", "label")

# Count rows per label
label_counts = df.group_by("label").agg(
    fc.count("*").alias("count")
)

AI-Powered Operations

To use semantic and embedding operations, configure language and embedding models in your SessionConfig. Once configured:

import fenic as fc

session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={
                "gpt-4o-mini": fc.OpenAILanguageModel(
                    model_name="gpt-4o-mini",
                    rpm=60,
                    tpm=60000,
                )
            },
            embedding_models={
                "text-embedding-3-small": fc.OpenAIEmbeddingModel(
                    model_name="text-embedding-3-small",
                    rpm=60,
                    tpm=60000,
                )
            },
        )
    )
)

df = session.read.parquet("hf://datasets/imdb/plain_text/train-00000-of-00001.parquet")

# Add an embedding column computed from the review text
df_with_embeddings = df.select(
    "*",
    fc.semantic.embed(fc.col("text")).alias("embedding")
)

# Add a sentiment column using the configured language model
df_analyzed = df_with_embeddings.select(
    "*",
    fc.semantic.analyze_sentiment(
        fc.col("text"),
        model_alias="gpt-4o-mini",
    ).alias("sentiment")
)

Example: Analyzing MMLU Dataset

import fenic as fc

session = fc.Session.get_or_create(
    fc.SessionConfig(
        semantic=fc.SemanticConfig(
            language_models={
                "gpt-4o-mini": fc.OpenAILanguageModel(
                    model_name="gpt-4o-mini",
                    rpm=60,
                    tpm=60000,
                )
            },
        )
    )
)

df = session.read.parquet("hf://datasets/cais/mmlu/astronomy/*.parquet")

# Keep astronomy questions and ask the model to rate each one's difficulty
processed_df = (
    df
    .filter(fc.col("subject") == "astronomy")
    .select("question", "choices", "answer")
    .select(
        "*",
        fc.semantic.map(
            "Rate the difficulty of this question from 1-5: {{question}}",
            question=fc.col("question"),
            model_alias="gpt-4o-mini",
        ).alias("difficulty")
    )
)

processed_df.show()
