Built to uncover hidden marketing signals on Reddit — and help power smarter growth for Cronlytic.com 🚀
📺 Click the thumbnail above to watch a full explainer — why I built this tool, how it works, and how you can use it to automate Reddit lead generation using GPT-4.
A Python application that scrapes Reddit for potential marketing leads, analyzes them with GPT models, and identifies high-value opportunities for a cron job scheduling SaaS.
## Table of Contents

- Overview
- Setup
- Configuration
- Running
- Results
- Project Structure
- Cost Controls
- Missing Features
- Why This Exists
- License
- Third-Party Licenses
## Overview

This tool uses a combination of Reddit's API and OpenAI's GPT models to:
- Scrape relevant subreddits for discussions about scheduling, automation, and related topics
- Identify posts that express pain points solvable by a cron job scheduling SaaS
- Score and analyze these posts to find high-quality marketing leads
- Store results in a local SQLite database for review
The application maintains a balance between focused (90%) and exploratory (10%) subreddits, intelligently refreshing the exploratory list based on discoveries. This exploration process happens automatically as part of the main workflow.
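The 90/10 split can be sketched roughly as follows. This is an illustrative snippet, not the project's actual implementation; the subreddit names are made up, and the real lists come from the config and the refreshable exploratory file.

```python
import random

# Hypothetical lists; the real ones come from config/config.yaml and discovery.
PRIMARY = ["r/devops", "r/selfhosted", "r/webdev", "r/sysadmin"]
EXPLORATORY = ["r/SaaS", "r/indiehackers"]

def pick_subreddit(explore_ratio=0.10):
    """Choose an exploratory subreddit ~10% of the time, a primary one otherwise."""
    pool = EXPLORATORY if random.random() < explore_ratio else PRIMARY
    return random.choice(pool)

random.seed(0)
picks = [pick_subreddit() for _ in range(10_000)]
explored = sum(p in EXPLORATORY for p in picks)
print(f"exploratory share: {explored / len(picks):.1%}")  # roughly 10%
```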
## Setup

Prerequisites:

- Python 3.8+
- Reddit API credentials (create a script app at https://www.reddit.com/prefs/apps)
- OpenAI API key
1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/cronlytic-reddit-scraper.git
   cd cronlytic-reddit-scraper
   ```
2. Create a virtual environment:

   ```bash
   python3 -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   ```
3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. Set up environment variables by copying `.env.template` to `.env`.
5. Edit `.env` and add your API credentials:

   ```ini
   REDDIT_CLIENT_ID=your_client_id
   REDDIT_CLIENT_SECRET=your_client_secret
   REDDIT_USER_AGENT="script:cronlytic-reddit-scraper:v1.0 (by /u/yourusername)"
   OPENAI_API_KEY=your_openai_api_key
   ```
## Configuration

Configure the application by editing `config/config.yaml`. Key settings include:
- Target subreddits: Primary subreddits and exploratory subreddit settings
- Post age range: Only analyze posts 5-90 days old
- API rate limits: Prevent hitting Reddit API limits
- OpenAI models: Which models to use for filtering and deep analysis
- Monthly budget: Cap total API spending
- Scoring weights: How to weight different factors when scoring posts
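These settings might be sketched as below. The keys are illustrative guesses, not the shipped schema; check `config/config.yaml` for the real names.

```yaml
# Hypothetical sketch of config/config.yaml; the real keys may differ.
subreddits:
  primary: [devops, selfhosted, webdev]
  exploratory_ratio: 0.10        # ~10% of fetches go to exploratory subreddits
post_age_days:
  min: 5
  max: 90
rate_limit:
  requests_per_minute: 60
openai:
  filter_model: gpt-4o-mini      # cheap pre-filter
  analysis_model: gpt-4.1        # deep analysis
monthly_budget_usd: 10
scoring:
  weights:
    relevance: 0.5
    engagement: 0.3
    recency: 0.2
```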
## Running

To run the pipeline once:
This will:
- Scrape posts from configured primary subreddits
- Automatically discover and scrape from exploratory subreddits
- Analyze all posts with GPT models
- Store results in the database
To run the pipeline daily at the configured time (TODO: fix the scheduler):
## Results

Results are stored in a SQLite database at `data/db.sqlite`, which you can query with the `sqlite3` CLI or any SQLite client.
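Since the schema isn't documented here, the following self-contained sketch shows the kind of query you might run; the `posts` table and its column names are hypothetical, so inspect the real database first.

```python
import sqlite3

# Hypothetical schema for illustration; the real data/db.sqlite tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id TEXT, title TEXT, score REAL)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [("a1", "Need a cron alternative", 0.92), ("b2", "Random chatter", 0.10)],
)

# Pull high-scoring leads, best first.
top = conn.execute(
    "SELECT title, score FROM posts WHERE score > 0.5 ORDER BY score DESC"
).fetchall()
print(top)  # [('Need a cron alternative', 0.92)]
```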
You can also use the included results viewer.
## Cost Controls

The application includes several safeguards to control API costs:
- Monthly budget cap (configurable in config.yaml)
- Efficient batch processing using OpenAI's Batch API
- Pre-filtering with less expensive models before using more powerful models
- Cost tracking and logging
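A monthly budget cap like the one described above can be sketched as a simple guard. This is a hypothetical class, not the project's actual implementation.

```python
# Minimal sketch of a monthly budget cap (hypothetical; not the real code).
class BudgetGuard:
    def __init__(self, monthly_cap_usd: float):
        self.cap = monthly_cap_usd
        self.spent = 0.0

    def try_spend(self, estimated_cost_usd: float) -> bool:
        """Approve a batch only if it fits within the remaining monthly budget."""
        if self.spent + estimated_cost_usd > self.cap:
            return False  # block the over-budget batch
        self.spent += estimated_cost_usd
        return True

guard = BudgetGuard(monthly_cap_usd=10.0)
print(guard.try_spend(4.0), guard.try_spend(5.0), guard.try_spend(2.0))
# True True False
```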
## Missing Features

| Feature | Status | Notes |
| --- | --- | --- |
| Reddit Scraping (Posts & Comments) | ✅ Done | Age-filtered, deduplicated, tracked via history table |
| Primary & Exploratory Subreddit Logic | ✅ Done | With refreshable `exploratory_subreddits.json` |
| GPT-4o Mini Filtering | ✅ Done | Via Batch API, scoring + threshold-based selection |
| GPT-4.1 Insight Extraction | ✅ Done | With Batch API, structured JSON, ROI + tags |
| SQLite Local DB Storage | ✅ Done | Full schema, type handling (post/comment) |
| Rate Limiting | ✅ Done | Real limiter applied to avoid Reddit bans |
| Budget Control | ✅ Done | Tracks monthly cost, blocks over-budget batches |
| Daily Runner Pipeline | ✅ Done | Logs step-by-step, fail-safe batch handling |
| Batch API Integration | ✅ Done | With file-based payloads + polling + result fetch |
| Cached Summaries → GPT Discovery | ✅ Done | Based on post text, fallback if prompt fails |
| Comment Scraping Toggle | ✅ Done | Controlled via config key (`include_comments`) |
| Retry on GPT Batch Failures | ✅ Done | Retries up to 10 times with exponential backoff |
| Parallel Subreddit Fetching | 🟡 Manual (sequential) | Consider async/threaded fetch in future |
| Tagged CSV Export / CLI | 🟡 Missing | Useful for non-technical review/debug |
| Multi-language / Non-English Handling | 🟡 Not supported | Detect & skip, or flag as English-only |
| Unit Tests / Mocks | 🟡 Not present | Add test coverage for scoring and DB logic |
| Dashboard/UI | ❌ Out of scope (by design) | CLI / SQLite interface is sufficient for now |
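The exponential-backoff retry noted for GPT batch failures can be sketched as a small helper. This is a hypothetical function, not the project's actual retry code.

```python
import time

def with_retries(fn, max_attempts=10, base_delay=1.0):
    """Call fn, retrying with exponential backoff (1x, 2x, 4x, ... base_delay)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            time.sleep(base_delay * (2 ** attempt))

# Demo: a function that fails twice, then succeeds on the third call.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient batch failure")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
print(result)  # "ok", after two retried failures
```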
## Why This Exists

This tool was created as part of the growth strategy for Cronlytic.com, a serverless cron job scheduler designed for developers, indie hackers, and SaaS teams.
If you're building something and want to:
- Run scheduled webhooks or background jobs
- Get reliable cron-like execution in the cloud
- Avoid over-engineering with full servers
👉 Check out Cronlytic — and let us know what you'd love to see.
## License

This project is source-available for personal and non-commercial use only. Commercial use (including hosting it as a backend or integrating it into products) requires prior approval.
See the LICENSE file for full terms.
## Third-Party Licenses

This project uses open-source libraries, which are governed by their own licenses:
- PRAW — MIT License
- APScheduler — MIT License
- OpenAI Python SDK — MIT License
- Reddit API — Subject to Reddit’s Terms of Service
Use of this project must also comply with these third-party licenses and terms.