The Graze Archives


Building on ATProto is a team sport. As we've shown previously, in open social, we only win when other folks in the ATmosphere win. In that spirit, the Graze team is delighted to announce access, effective immediately, to two archived datasets for researchers, developers, archivists, and anyone else looking to push the boundaries of the ATmosphere.

Turbostream

The turbostream has been available via websocket for about six months. In short, it is a stream of metadata-enriched posts in which the objects a post references, such as its author, mentioned users, and parent or quoted posts, are hydrated inline. Under the hood, we've been storing that data in S3 for long-term archival. We've now made that S3 bucket public and set it up for Requester Pays access. In theory, nearly every post should be in this archive, enriched with these referenced objects to the greatest extent possible.

Megastream

The megastream is a relatively new dataset: it is the turbostream, further enriched with ML inferences. At Graze, we run a handful of ML classifiers against every post so that our users can filter content by those classifications. We also generate several text embeddings and, as of recently, text transcriptions for every video passing through Bluesky. All of this is now generally available in the megastream bucket. While the turbostream archive begins on 2025-04-21, the megastream archive begins on 2025-09-09.

Graze Bluesky Archive Access

Two S3 buckets provide enriched Bluesky data snapshots as SQLite databases:

  • graze-turbo-01: Turbostream archive (hydrated references, no ML inferences)

  • graze-mega-02: Megastream archive (turbostream + ML inferences)

What's Inside

Each file contains a several-minute slice of the Bluesky firehose that has been progressively enriched:

Turbostream Archive (graze-turbo-01)

Available from: April 21, 2025

  • Jetstream: Raw Bluesky events (posts, likes, follows, etc.)

  • Turbostream: Hydrated references including full user profiles, mentions, parent/reply posts, and quoted posts

Megastream Archive (graze-mega-02)

Available from: September 9, 2025

  • Jetstream: Raw Bluesky events

  • Turbostream: Hydrated references

  • Megastream: Machine learning inferences added to each record

ML Inferences Included

The Megastream enrichment adds extensive analysis to each post, including:

  • Language detection: Probability scores for 20+ languages

  • Content moderation: Flags for violence, hate speech, self-harm, sexual content, harassment

  • Sentiment analysis: Positive, negative, and neutral classification

  • Topic classification: 20+ categories (Gaming, Arts & Culture, News, Sports, etc.)

  • Emotion detection: 28 emotions (Joy, Anger, Surprise, Sadness, Amusement, etc.)

  • Toxicity scores: Threat, insult, identity hate, obscenity levels

  • Financial sentiment: Market-relevant positive/negative/neutral signals

  • Marketing detection: Spam vs organic content classification

  • Text embeddings: Vector representations for semantic search (multiple models)

All inference scores are included as probability values (0-1 range) for each record.
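Because every inference is a probability between 0 and 1, filtering usually comes down to picking a threshold. The sketch below is illustrative only: the label names and the dictionary layout are hypothetical, since the actual column and field names in the archive are not documented here.

```python
from typing import Dict, List, Tuple


def labels_above(scores: Dict[str, float], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (label, score) pairs at or above the threshold, highest score first."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within [0, 1]")
    kept = [(label, p) for label, p in scores.items() if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical emotion scores for one post; real field names may differ.
example_scores = {"joy": 0.82, "anger": 0.03, "surprise": 0.55, "sadness": 0.10}
print(labels_above(example_scores, threshold=0.5))
# prints [('joy', 0.82), ('surprise', 0.55)]
```

The right threshold depends on the classifier and on how costly false positives are for your use case, so 0.5 is only a starting point.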

File Format

Turbostream Archive

jetstream_YYYYMMDD_HHMMSS.db.zip

Example:

jetstream_20250421_235152.db.zip

Megastream Archive

mega/mega_jetstream_YYYYMMDD_HHMMSS.db.zip

Example:

mega/mega_jetstream_20250909_181102.db.zip

Each .db.zip file is a compressed SQLite database containing enriched Bluesky posts from a specific time window.
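Since the table layout inside each database is not described here, a reasonable first step after downloading a file is to unzip it and ask SQLite which tables it contains. The following is a minimal sketch using only the Python standard library. It assumes each archive holds a single SQLite file; the helper names are illustrative.

```python
import sqlite3
import zipfile
from pathlib import Path
from typing import List


def extract_db(archive_path: str, out_dir: str) -> Path:
    """Extract the first file in a .db.zip archive and return its path.

    Assumption: the archive contains exactly one SQLite database.
    """
    with zipfile.ZipFile(archive_path) as zf:
        members = [name for name in zf.namelist() if not name.endswith("/")]
        if not members:
            raise ValueError(f"{archive_path} contains no files")
        return Path(zf.extract(members[0], path=out_dir))


def list_tables(db_path: Path) -> List[str]:
    """Return the names of the tables defined in a SQLite database file."""
    conn = sqlite3.connect(str(db_path))
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Usage, after downloading a file from the bucket:
# db_path = extract_db("mega_jetstream_20250909_181102.db.zip", "./extracted")
# print(list_tables(db_path))
```

From there, `PRAGMA table_info(<table>)` shows the columns of any table of interest.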

Prerequisites

  • An AWS account

  • AWS credentials configured (aws configure)

Accessing the Bucket

This is a Requester Pays bucket, which means you pay for data transfer costs when downloading files. Storage costs are covered by the bucket owner.

List All Files

aws s3 ls s3://graze-mega-02/mega/ --request-payer requester

Download a Specific File

aws s3 cp s3://graze-mega-02/mega/mega_jetstream_20250909_181102.db.zip . --request-payer requester

Download All Files

aws s3 sync s3://graze-mega-02/mega/ ./local-folder/ --request-payer requester

Using Python (boto3)

import boto3

s3 = boto3.client('s3')

# List files
response = s3.list_objects_v2(
    Bucket='graze-mega-02',
    Prefix='mega/',
    RequestPayer='requester'
)
for obj in response.get('Contents', []):
    print(obj['Key'])

# Download a file
s3.download_file(
    'graze-mega-02',
    'mega/mega_jetstream_20250909_181102.db.zip',
    'local_file.db.zip',
    ExtraArgs={'RequestPayer': 'requester'}
)
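A single `list_objects_v2` call returns at most 1,000 keys, and with one file every few minutes the archive holds far more than that. The sketch below pages through the listing with boto3's paginator and keeps only the keys whose filename timestamp falls inside a chosen window, so you can avoid downloading everything. The helper names are illustrative, and the timestamps are treated as naive datetimes because the filename's time zone is not stated.

```python
import re
from datetime import datetime
from typing import Iterator, Optional

# Matches the timestamp in names such as mega/mega_jetstream_20250909_181102.db.zip
_KEY_PATTERN = re.compile(r"jetstream_(\d{8})_(\d{6})\.db\.zip$")


def parse_key_timestamp(key: str) -> Optional[datetime]:
    """Return the timestamp embedded in an archive key, or None if there is none."""
    match = _KEY_PATTERN.search(key)
    if match is None:
        return None
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")


def iter_keys_between(s3_client, bucket: str, prefix: str,
                      start: datetime, end: datetime) -> Iterator[str]:
    """Yield keys whose embedded timestamp t satisfies start <= t < end."""
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, RequestPayer="requester")
    for page in pages:
        for obj in page.get("Contents", []):
            stamp = parse_key_timestamp(obj["Key"])
            if stamp is not None and start <= stamp < end:
                yield obj["Key"]


# Usage (requires boto3 and AWS credentials; list requests are billed to you):
# import boto3
# keys = iter_keys_between(boto3.client("s3"), "graze-mega-02", "mega/",
#                          datetime(2025, 9, 9), datetime(2025, 9, 10))
# for key in keys:
#     print(key)
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without touching the network.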

Important Notes

  • Always include --request-payer requester in your commands or the request will fail

  • You will be charged AWS data transfer costs for downloads

  • Storage costs are covered by the bucket owner

  • Anonymous access is not supported - you must use authenticated AWS credentials

Cost Estimation

AWS S3 data transfer pricing (as of 2025):

  • First 10 TB/month: $0.09/GB

  • Next 40 TB/month: $0.085/GB

  • Over 50 TB/month: Lower rates available

Check current pricing: https://aws.amazon.com/s3/pricing/
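For a rough budget before a large download, you can multiply the total size by the per-GB rate. The sketch below applies a single flat rate, defaulting to the $0.09/GB first-tier figure quoted above, and ignores volume tiers and per-request charges. It also assumes one GB is 2^30 bytes for billing purposes, and the function names are illustrative.

```python
from typing import Iterable, Mapping

BYTES_PER_GB = 2 ** 30  # assumption: a billed GB is 2^30 bytes


def total_size_bytes(objects: Iterable[Mapping]) -> int:
    """Sum the 'Size' field of entries from a list_objects_v2 'Contents' list."""
    return sum(obj["Size"] for obj in objects)


def estimate_transfer_cost(total_bytes: int, usd_per_gb: float = 0.09) -> float:
    """Estimate the download cost in USD at a flat per-GB rate."""
    if total_bytes < 0 or usd_per_gb < 0:
        raise ValueError("total_bytes and usd_per_gb must be non-negative")
    return (total_bytes / BYTES_PER_GB) * usd_per_gb


# Example: 250 GB at the default rate.
cost = estimate_transfer_cost(250 * BYTES_PER_GB)
print(f"${cost:.2f}")  # prints $22.50
```

Treat the result as an order-of-magnitude estimate and confirm against the current AWS pricing page before starting a full sync.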

Questions?

Contact Graze.social on BSky or via our site for assistance.
