Building on ATProto is a team sport. As we've shown previously, in open social we only win when other folks in the ATmosphere win. In that spirit, the Graze team is delighted to announce access, effective immediately, to two archived datasets for researchers, developers, archivists, and anyone else looking to push the boundaries of the ATmosphere.
Turbostream
The turbostream has been available via websocket for about six months. In short, it is a stream of metadata-enriched posts in which referenced objects are hydrated: the author of the post, mentioned users, parent and quoted posts, and so forth. Under the hood, we've been storing that data in S3 for long-term archival. We've now made that S3 bucket public and set it up for Requester Pays access. In theory, nearly every post should be in this archive, enriched with these referenced objects to the greatest extent possible.
Megastream
The megastream is a relatively new dataset: it is the turbostream, further enriched with ML inferences. At Graze, we run a handful of ML classifiers against every post so that our users can filter content by those classifications. We also generate several text embeddings and, as of recently, text transcriptions for every video passing through Bluesky. All of this is now generally available in the megastream bucket. While the turbostream archive begins at 2025-04-21, the megastream archive begins at 2025-09-09.
Graze Bluesky Archive Access
Two S3 buckets provide enriched Bluesky data snapshots as SQLite databases:
graze-turbo-01: Turbostream archive (hydrated references, no ML inferences)
graze-mega-02: Megastream archive (turbostream + ML inferences)
What's Inside
Each file contains a several-minute slice of the Bluesky firehose that has been progressively enriched:
Turbostream Archive (graze-turbo-01)
Available from: April 21, 2025
Jetstream: Raw Bluesky events (posts, likes, follows, etc.)
Turbostream: Hydrated references including full user profiles, mentions, parent/reply posts, and quoted posts
Megastream Archive (graze-mega-02)
Available from: September 9, 2025
Jetstream: Raw Bluesky events
Turbostream: Hydrated references
Megastream: Machine learning inferences added to each record
ML Inferences Included
The Megastream enrichment adds extensive analysis to each post, including:
Language detection: Probability scores for 20+ languages
Content moderation: Flags for violence, hate speech, self-harm, sexual content, harassment
Sentiment analysis: Positive, negative, and neutral classification
Topic classification: 20+ categories (Gaming, Arts & Culture, News, Sports, etc.)
Emotion detection: 28 emotions (Joy, Anger, Surprise, Sadness, Amusement, etc.)
Toxicity scores: Threat, insult, identity hate, obscenity levels
Financial sentiment: Market-relevant positive/negative/neutral signals
Marketing detection: Spam vs organic content classification
Text embeddings: Vector representations for semantic search (multiple models)
All inference scores are included as probability values (0-1 range) for each record.
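Because every score is a probability, a common way to consume them is to pick a cutoff and keep only the labels that clear it. The sketch below assumes the scores for one post have already been loaded into a plain dictionary of label to probability. The label names and the `labels_above` helper are hypothetical, since the actual column layout of the archive is not described here.

```python
from typing import Dict, List, Tuple


def labels_above(scores: Dict[str, float], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (label, probability) pairs at or above the threshold, highest first."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be within the 0-1 probability range")
    kept = [(label, p) for label, p in scores.items() if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical emotion scores for one post; real field names may differ.
example_scores = {"joy": 0.82, "anger": 0.03, "surprise": 0.55, "sadness": 0.10}
print(labels_above(example_scores, threshold=0.5))
```

The right threshold depends on the classifier and on how costly false positives are for your use case, so treat 0.5 as a starting point rather than a recommendation.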
File Format
Turbostream Archive
Example:
Megastream Archive
Example:
Each .db.zip file is a compressed SQLite database containing enriched Bluesky posts from a specific time window.
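Once a file is downloaded, one way to get oriented is to unzip it and ask SQLite which tables it contains. The following is a minimal sketch using only the Python standard library. It assumes the archive holds a single file ending in `.db`, and the `list_tables` helper name is ours, not part of any Graze tooling.

```python
import sqlite3
import tempfile
import zipfile
from pathlib import Path
from typing import List


def list_tables(archive_path: str) -> List[str]:
    """Extract a .db.zip archive to a temp directory and return its table names."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(tmp_dir)
        # Assumption: the archive holds one SQLite file ending in .db.
        db_files = sorted(Path(tmp_dir).rglob("*.db"))
        if not db_files:
            raise FileNotFoundError(f"no .db file found inside {archive_path}")
        conn = sqlite3.connect(str(db_files[0]))
        try:
            rows = conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            ).fetchall()
        finally:
            conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables` with the path to a downloaded archive returns the table names, which you can then inspect further with `PRAGMA table_info(...)` before writing queries.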
Prerequisites
An AWS account
AWS credentials configured (aws configure)
Accessing the Bucket
This is a Requester Pays bucket, which means you pay for data transfer costs when downloading files. Storage costs are covered by the bucket owner.
List All Files
Download a Specific File
Download All Files
Using Python (boto3)
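A minimal sketch of listing and downloading with boto3, assuming boto3 is installed and your AWS credentials are configured. The helper names (`iter_keys`, `download_key`, `main`) are illustrative. The key point is that both the listing call and the download pass `RequestPayer="requester"`, which is the boto3 counterpart of the CLI's `--request-payer requester` flag.

```python
from typing import Any, Iterator


def iter_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key in a Requester Pays bucket, page by page."""
    kwargs = {"Bucket": bucket, "RequestPayer": "requester"}
    if prefix:
        kwargs["Prefix"] = prefix
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(**kwargs):
        for obj in page.get("Contents", []):
            yield obj["Key"]


def download_key(s3_client: Any, bucket: str, key: str, dest_path: str) -> None:
    """Download one object, acknowledging that the requester pays for transfer."""
    s3_client.download_file(
        bucket, key, dest_path, ExtraArgs={"RequestPayer": "requester"}
    )


def main() -> None:
    """Print the first ten keys in the turbostream bucket (incurs AWS charges)."""
    import boto3  # imported here so the helpers above can be tested without it

    client = boto3.client("s3")
    for i, key in enumerate(iter_keys(client, "graze-turbo-01")):
        print(key)
        if i >= 9:  # stop after ten keys to keep the listing cheap
            break
```

Call `main()` to try it against the real bucket. Swap in `graze-mega-02` for the megastream archive, and use `download_key` with any key printed by the listing.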
Important Notes
Always include --request-payer requester in your commands or the request will fail
You will be charged AWS data transfer costs for downloads
Storage costs are covered by the bucket owner
Anonymous access is not supported - you must use authenticated AWS credentials
Cost Estimation
AWS S3 data transfer pricing (as of 2025):
First 100 GB/month: $0.09/GB
Next 10 TB/month: $0.085/GB
Over 50 TB/month: Lower rates available
Check current pricing: https://aws.amazon.com/s3/pricing/
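For a rough budget before a large download, multiplying the total size by a per-GB rate is usually enough. The sketch below is deliberately simple: it applies one flat rate, defaults to the $0.09/GB figure quoted above purely as an illustration, and ignores volume tiers, regional differences, and any free allowances. The `estimate_transfer_cost` helper is hypothetical.

```python
def estimate_transfer_cost(total_gb: float, usd_per_gb: float = 0.09) -> float:
    """Estimate download cost in USD at a single flat per-GB rate."""
    if total_gb < 0 or usd_per_gb < 0:
        raise ValueError("size and rate must be non-negative")
    return round(total_gb * usd_per_gb, 2)


# Example: 250 GB of archive files at the illustrative $0.09/GB rate.
print(estimate_transfer_cost(250))
```

At that illustrative rate, 250 GB works out to about $22.50. Pass the current rate from the AWS pricing page as `usd_per_gb` for a real estimate.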
Questions?
Contact Graze.social on BSky or via our site for assistance.
