Quantum Datasets Hub for Reproducible Benchmarks and QML


The Aqora Datasets Hub gives quantum researchers and teams a fast place to publish, discover, and reuse datasets built for reproducible benchmarks and QML experiments. Bring graph sets for MaxCut/QAOA, molecular Hamiltonians for VQE, circuit corpora in OpenQASM 3, and their classical companions. Load with pandas or polars, keep datasets private to your organization or share them publicly, and cite immutable versions so results line up across machines and time.

Quick start

URI format: aqora://{publisher}/{name}/v{version}
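The pieces compose with a plain f-string; the publisher and dataset names below are illustrative (reusing the example from the publishing section):

# Illustrative only: build a dataset URI from its parts
publisher, name, version = "alice", "graphzoo-3reg", "1.0.0"
uri = f"aqora://{publisher}/{name}/v{version}"
print(uri)  # aqora://alice/graphzoo-3reg/v1.0.0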

Load a public Parquet split directly into pandas:

import pandas as pd

URI = "aqora://stubbi/pharmacometric-events/v1.0.0"
df = pd.read_parquet(URI)
print(df.head())

You can also load with polars and push down a column selection + row filter before data transfer:

import polars as pl
from aqora_cli.pyarrow import dataset

df = pl.scan_pyarrow_dataset(dataset("bernalde/hamlib-binary-optimization", "v1.0.0"))
small_max3sat = (
    df.filter(
        (pl.col("problem") == "max3sat") & (pl.col("n_qubits") < 100)
    )
    .select(["hamlib_id", "collection", "instance_name", "n_qubits", "n_terms", "operator_format"])
    .collect()
)
print(small_max3sat.head(10))
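Because scan_pyarrow_dataset builds a lazy query, polars can push the filter and column selection down to the underlying Arrow dataset, so only the rows and columns you asked for are transferred.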

Why a dedicated hub for quantum-relevant data?

In classical ML, standardized dataset structures and hub-first distribution unlocked discoverability and reproducibility; we're bringing that workflow to quantum. Quantum research leans on curated datasets: graph families for QAOA/MaxCut, molecular Hamiltonians for VQE, circuit corpora, and re-encoded image/time-series data for QML. But sources are scattered, metadata is inconsistent, and version lineage is often unclear. That makes benchmark results hard to recreate and slows progress. The Aqora Datasets Hub formalizes the basics so you can focus on algorithms, not plumbing. Quantum-native artifacts like QASM circuits and Hamiltonians can be embedded directly into the dataset tables researchers work with in their code base.
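As a sketch of what that can look like (the dataset URI and the qasm column are hypothetical here, and qiskit.qasm3.loads requires the qiskit-qasm3-import package):

# Hypothetical dataset that stores OpenQASM 3 source in a string column "qasm"
import pandas as pd
from qiskit import qasm3  # needs qiskit-qasm3-import installed

df = pd.read_parquet("aqora://example/circuit-corpus/v1.0.0")  # hypothetical URI
circuit = qasm3.loads(df.loc[0, "qasm"])  # parse the first row back into a circuit
print(circuit.num_qubits, circuit.depth())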

What's in v1

  • Immutable versions for every dataset
  • Public and private publishing for teams and open science
  • Clean metadata: license (SPDX), tags, version, and size
  • Simple loading with ready-to-copy Python snippets

Reproducible benchmarks and experiments

Benchmarks and experiments only matter if others can rerun them. An immutable remote hub ensures everyone pulls the same bytes every time, without local path drift or silent file edits. Whether you're sweeping QAOA depths, comparing VQE ansätze, or training QML kernels, the recipe is simple:

  1. Pin a fixed dataset version: publisher/name@version.
  2. Seed Python/NumPy and your QC stack.
  3. Log the dataset version and your code commit in every run.
# Minimal, reproducible setup
import os, random, numpy as np, pandas as pd

SEED = 42
URI = "aqora://stubbi/pharmacometric-events/v1.0.0"  # pin a fixed version

os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED); np.random.seed(SEED)

# Optional: align Qiskit randomness
try:
    from qiskit.utils import algorithm_globals
    algorithm_globals.random_seed = SEED
except Exception:
    pass

# Deterministic shuffle + split
df = pd.read_parquet(URI).sample(frac=1.0, random_state=SEED).reset_index(drop=True)
n = len(df); n_train = int(0.8*n); n_val = int(0.1*n)
train, val, test = df[:n_train], df[n_train:n_train+n_val], df[n_train+n_val:]
print({"uri": URI, "seed": SEED, "n": n})
# → proceed with Qiskit / PennyLane / Cirq, etc.
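For step 3, one minimal option (assuming the script runs inside a git checkout; the manifest filename is arbitrary) is to write the pinned URI, seed, and commit hash next to your results:

# Sketch: record the dataset version and code commit for this run
import json, subprocess

commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
with open("run_manifest.json", "w") as f:
    json.dump({"dataset": URI, "seed": SEED, "code_commit": commit}, f, indent=2)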

Best practice for papers/notebooks

  • Pin publisher/name@version in the script header.
  • Log metrics with the dataset version and code commit hash.
  • Publish your eval script + config; anyone can reproduce by loading the same version.

Publish your dataset

Turning a lab folder into a dataset page on Aqora gives it a durable, citable reference. With immutable versions and clean metadata, others can rerun and verify your results without emailing for files or chasing drive links. Keep it private while you iterate, then switch to public when the paper lands; your citation and URI stay stable. This way you get clear attribution, and the community benefits from reproducible research. Follow these steps to create and publish your dataset on Aqora:

  1. Create the dataset at https://aqora.io/datasets/new
  2. Add a short README (purpose, limits, columns), a license (SPDX), and tags.
  3. Choose public or private visibility.
  4. Prepare data in CSV/Parquet/JSON. If you upload a directory, Aqora automatically generates a single Parquet file from it.
  5. Upload the data in one of two ways:
  • UI: drag and drop your files on the dataset page created in step 1.
  • CLI (example):
pip install -U aqora-cli
aqora login
aqora datasets upload alice/graphzoo-3reg --version 1.0.0 data.csv
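If your raw files are CSV, a quick way to produce the Parquet from step 4 yourself (file names here are illustrative; pandas needs pyarrow or fastparquet installed):

# Sketch: convert a CSV to Parquet before uploading
import pandas as pd

pd.read_csv("graphs.csv").to_parquet("data.parquet", index=False)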

FAQ

Is this only for "quantum" data?

No. Most experiments pair classical datasets (graphs, molecules, images, time series) with quantum artifacts. The hub treats both as first-class.

Do I need a special format?

Yes, we currently support CSV, Parquet, and JSON. If you upload a directory, we'll generate a single Parquet file from it.

Can I keep a dataset private to my lab?

Yes. Upload as private, share with your team, and switch to public when ready without changing the version you cite.

Do you issue DOIs per dataset or version?

Not yet. In the meantime, cite the immutable publisher/name@version. DOI support is on our roadmap.

What formats and sizes are supported?

CSV, Parquet, and JSON for tables. If you upload a directory, Aqora automatically generates a single Parquet file from it. Large datasets are supported. Use the CLI for multi-file or multi-GB uploads.
