How to Build a Reddit Sentiment Scanner in Python (Full Pipeline)
Published: March 2026 · Reading time: ~12 minutes
Tags: Python, Reddit API, sentiment analysis, PRAW, financial NLP, quant trading
If you’ve ever watched a stock move 15% on a Monday morning and then spent the weekend reading r/wallstreetbets wondering if the chatter had anything to do with it, you’re already halfway to understanding why Reddit sentiment data is worth building a pipeline for.
The problem isn’t getting the data. The Reddit API is free and Python makes it trivially easy. The problem is that raw Reddit posts are noise. Most of what gets posted on financial subreddits is low-quality, off-topic, or sarcastic, and a generic sentiment model trained on product reviews doesn’t know the difference between someone posting “🚀🚀🚀” ironically and someone posting it as a genuine bull signal.
This post walks through the full pipeline: collecting posts from Reddit with PRAW, classifying them with a finance-tuned model, and filtering down to the posts that are actually worth paying attention to. By the end you’ll have a script that runs in under a minute, processes 100 posts per subreddit, and returns a clean list of high-confidence signals.
What you’re building
A Python script that:
- Pulls the top posts from one or more financial subreddits using the Reddit API
- Sends them to a classification API in a single batch call
- Filters the results by quality, directional conviction, and relevance to a specific ticker
- Prints (or saves) only the posts worth looking at
The output looks like this:
```
NVDA → bullish [relevant | relevance: 0.89 | confidence: 0.74]
"Blackwell demand is insane. Hyperscalers not slowing capex. Long into earnings."

NVDA → bullish [relevant | relevance: 0.91 | confidence: 0.61]
"DD: Why I think NVDA prints to $200 by EOY – full breakdown inside"

---
2 signals from 100 posts
```

Clean, structured, filterable. Let’s build it.
Prerequisites
You need three things:
- Python 3.8+
- A Reddit developer app (free, takes two minutes; instructions below)
- A FinSignals API key (the free tier gives you 1,000 credits/month, enough for roughly a dozen 100-post scans)
Install the dependencies:
```
pip install praw requests
```

If you have the FinSignals Python SDK:

```
pip install finsignals-api
```

The examples below use the SDK. If you prefer raw HTTP, there’s a requests version at the end of the post.
Step 1: Create a Reddit developer app
Go to reddit.com/prefs/apps and click “are you a developer? create an app.”
- Name: anything (e.g. `sentiment-pipeline`)
- Type: select “script”
- Redirect URI: `http://localhost:8080` (doesn’t matter for script apps)
After you create it, you’ll see a client ID (the short string under the app name) and a client secret. Save both.
You’ll also need your Reddit username and password.
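If you’d rather not hardcode credentials in the script later, a small sketch that reads them from environment variables instead. The variable names here are my own convention, not anything Reddit requires:

```python
import os

def load_credentials():
    """Read Reddit app credentials from the environment.

    Raises KeyError with the missing variable's name if one isn't set,
    which is a clearer failure than an auth error deep inside PRAW.
    """
    return {
        "client_id": os.environ["REDDIT_CLIENT_ID"],
        "client_secret": os.environ["REDDIT_CLIENT_SECRET"],
        "username": os.environ["REDDIT_USERNAME"],
        "password": os.environ["REDDIT_PASSWORD"],
    }
```

This also keeps secrets out of version control if you ever commit the script.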
Step 2: Fetch posts with PRAW
PRAW (Python Reddit API Wrapper) handles all the OAuth and rate-limiting for you. Here’s a minimal fetch:
```python
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    username="YOUR_REDDIT_USERNAME",
    password="YOUR_REDDIT_PASSWORD",
    user_agent="finsignals-scanner/1.0 by YOUR_REDDIT_USERNAME",
)
```

To pull the top 100 posts from r/wallstreetbets right now:

```python
subreddit = reddit.subreddit("wallstreetbets")
posts = list(subreddit.hot(limit=100))
print(f"Fetched {len(posts)} posts")
```

The hot feed returns posts ranked by current momentum. You can also use `new` (most recent), `top` with a time filter (`day`, `week`, `month`), or `rising` for posts gaining traction fast. For a real-time scanner, `hot` or `rising` tend to give the most actionable data.
Each post object has `.title`, `.selftext` (the body), `.score` (upvotes), `.num_comments`, and a bunch of other metadata you can use later for additional filtering.
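Since the hot and rising feeds can overlap, a dedup helper is handy if you decide to fetch both. This is a sketch that assumes only that each post object has an `.id` attribute (PRAW submissions do):

```python
def dedupe_posts(*feeds):
    """Merge iterables of posts, keeping the first occurrence of each id.

    Prevents the same post from being classified (and billed) twice
    when it appears in more than one feed.
    """
    seen = set()
    merged = []
    for feed in feeds:
        for post in feed:
            if post.id not in seen:
                seen.add(post.id)
                merged.append(post)
    return merged

# Usage (requires an authenticated reddit instance):
# sub = reddit.subreddit("wallstreetbets")
# posts = dedupe_posts(sub.hot(limit=100), sub.rising(limit=50))
```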
Step 3: Prepare the batch payload
The classification API accepts a list of items, each with optional ticker, title, and body fields. The more context you give it, the better the relevance scoring works.
```python
def prepare_items(posts, ticker=None):
    items = []
    for post in posts:
        item = {
            "title": post.title,
            "body": post.selftext[:1500] if post.selftext else "",
        }
        if ticker:
            item["ticker"] = ticker
        items.append(item)
    return items
```

A few things worth knowing here:

- Truncate long posts. Most of the signal is in the first 1,500 characters. Sending the full body of a 10,000-word DD post doesn’t improve classification meaningfully and burns more processing time.
- Include the ticker when you have it. Supplying a ticker changes how the `relevance_score` is computed: it measures how on-topic the post is for that specific symbol, not just for financial content generally. A post about Apple earnings gets a high relevance score if you’re scanning for AAPL, and a low one if you’re scanning for NVDA.
- Title-only posts are fine. Many Reddit posts have no body text. The model classifies them on title alone; just pass an empty string for `body`.
Step 4: Classify the batch
With the SDK, this is a single call:
```python
import finsignals

client = finsignals.Client("fs_your_key_here")

items = prepare_items(posts, ticker="NVDA")
results = client.classify_batch(items)
print(f"Credits charged: {results.credits_charged}")
```

A batch of 100 items costs 1.00 + 99 × 0.70 = 70.3 credits. At 1,000 free credits per month, that’s about 14 full scans on the free tier. Upgrade to Starter ($29/mo, 100,000 credits) and you can run this every few minutes all month without thinking about it.
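If you want to budget credits ahead of time, the pricing rule above (1.00 credits for the first item in a batch, 0.70 for each additional) is easy to encode. The numbers come from this post, so double-check them against the current pricing page before relying on them:

```python
def batch_cost(n_items):
    """Credits for one batch: 1.00 for the first item, 0.70 each after."""
    if n_items <= 0:
        return 0.0
    return 1.00 + (n_items - 1) * 0.70

def scans_per_month(credits, items_per_scan):
    """How many full scans a monthly credit budget covers."""
    return int(credits // batch_cost(items_per_scan))

# batch_cost(100) -> 70.3, matching the arithmetic above
# scans_per_month(1000, 100) -> 14 on the free tier
```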
The response comes back as a list of output objects in the same order as your input items. Each output has:
- `sentiment` – `positive`, `negative`, or `neutral`, with probabilities
- `directionality` – `bullish`, `bearish`, or `neutral_direction`
- `quality` – `relevant`, `noise`, or `spam`
- `post_type` – `dd`, `news_reaction`, `technical_analysis`, `fundamentals`, `question`, or `general`
- `relevance_score` – a float in [0, 1]
- `author_confidence` – a float in [0, 1]
- `sarcasm` – a boolean
Step 5: Filter for signal
Raw classification isn’t the endpoint โ filtering is. Most of the posts in any financial subreddit are noise, off-topic, or low-quality. The classification output gives you the levers to filter them out programmatically.
Here’s a filter function that returns only the posts worth reading:
```python
def filter_signals(posts, outputs, direction=None, min_relevance=0.65):
    signals = []
    for post, output in zip(posts, outputs):
        # Skip noise and spam
        if output.quality.label != "relevant":
            continue
        # Skip low relevance (off-topic for the ticker)
        if output.relevance_score < min_relevance:
            continue
        # Skip sarcasm: inverted sentiment is worse than no signal
        if output.sarcasm:
            continue
        # Optional: filter by direction
        if direction and output.directionality.label != direction:
            continue
        signals.append({
            "post": post,
            "output": output,
        })
    return signals
```

The thresholds here are a starting point. `min_relevance=0.65` is conservative: lower it to 0.5 if you’re running a broader scan and want more volume; raise it to 0.80 if you want only the posts that are unambiguously about the ticker you care about.
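One way to pick `min_relevance` empirically rather than guessing: sweep a few thresholds over a batch you’ve already classified and see how many posts survive at each level. A minimal sketch:

```python
def sweep_relevance(scores, thresholds=(0.5, 0.65, 0.8)):
    """Count how many relevance scores clear each candidate threshold."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}

# After a classify_batch call:
# scores = [o.relevance_score for o in results.outputs]
# print(sweep_relevance(scores))
```

If the count barely changes between 0.65 and 0.80, the batch is cleanly separated and the stricter threshold costs you little; if it drops sharply, you are cutting into genuinely on-topic posts.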
The sarcasm filter is worth paying attention to. Reddit financial communities use heavy irony – “oh yeah definitely buy the top, brilliant strategy /s” – and a naive sentiment model reads it as positive. The sarcasm head flags these for removal. It’s marked as experimental in the docs, so don’t bet the farm on it, but in practice it catches the most egregious cases.
Step 6: Display the results
```python
def print_signals(signals):
    if not signals:
        print("No signals found.")
        return
    for s in signals:
        post = s["post"]
        out = s["output"]
        direction = out.directionality.label
        relevance = round(out.relevance_score, 2)
        confidence = round(out.author_confidence, 2)
        post_type = out.post_type.label
        print(f"\n{direction.upper()} [{post_type} | relevance: {relevance} | confidence: {confidence}]")
        print(f'"{post.title}"')
        print(f"  r/{post.subreddit.display_name} · {post.score} upvotes · {post.num_comments} comments")
        print(f"  https://reddit.com{post.permalink}")
```

The full script
Here it is end to end:
```python
import praw
import finsignals

# ── Config ─────────────────────────────────────────────────────────────────
REDDIT_CLIENT_ID = "YOUR_CLIENT_ID"
REDDIT_CLIENT_SECRET = "YOUR_CLIENT_SECRET"
REDDIT_USERNAME = "YOUR_REDDIT_USERNAME"
REDDIT_PASSWORD = "YOUR_REDDIT_PASSWORD"
FINSIGNALS_API_KEY = "fs_your_key_here"

SUBREDDITS = ["wallstreetbets", "investing", "stocks"]
TICKER = "NVDA"        # Set to None to scan without ticker context
DIRECTION = "bullish"  # "bullish", "bearish", or None for all directions
POSTS_LIMIT = 100      # Posts to fetch per subreddit
MIN_RELEVANCE = 0.65   # Minimum relevance score (0–1)
# ───────────────────────────────────────────────────────────────────────────


def fetch_posts(reddit, subreddit_name, limit):
    sub = reddit.subreddit(subreddit_name)
    return list(sub.hot(limit=limit))


def prepare_items(posts, ticker=None):
    items = []
    for post in posts:
        item = {
            "title": post.title,
            "body": post.selftext[:1500] if post.selftext else "",
        }
        if ticker:
            item["ticker"] = ticker
        items.append(item)
    return items


def filter_signals(posts, outputs, direction=None, min_relevance=0.65):
    signals = []
    for post, output in zip(posts, outputs):
        if output.quality.label != "relevant":
            continue
        if output.relevance_score < min_relevance:
            continue
        if output.sarcasm:
            continue
        if direction and output.directionality.label != direction:
            continue
        signals.append({"post": post, "output": output})
    return signals


def print_signals(signals, subreddit_name, total_posts):
    print(f"\n── r/{subreddit_name} ── {len(signals)} signals from {total_posts} posts ──")
    if not signals:
        print("  No signals matched your filters.")
        return
    for s in signals:
        post = s["post"]
        out = s["output"]
        print(
            f"\n  {out.directionality.label.upper()} "
            f"[{out.post_type.label} | "
            f"relevance: {round(out.relevance_score, 2)} | "
            f"confidence: {round(out.author_confidence, 2)}]"
        )
        print(f'  "{post.title}"')
        print(f"  {post.score} pts · {post.num_comments} comments · https://reddit.com{post.permalink}")


def main():
    reddit = praw.Reddit(
        client_id=REDDIT_CLIENT_ID,
        client_secret=REDDIT_CLIENT_SECRET,
        username=REDDIT_USERNAME,
        password=REDDIT_PASSWORD,
        user_agent="finsignals-scanner/1.0",
    )
    client = finsignals.Client(FINSIGNALS_API_KEY)
    for sub_name in SUBREDDITS:
        posts = fetch_posts(reddit, sub_name, POSTS_LIMIT)
        items = prepare_items(posts, ticker=TICKER)
        results = client.classify_batch(items)
        signals = filter_signals(
            posts,
            results.outputs,
            direction=DIRECTION,
            min_relevance=MIN_RELEVANCE,
        )
        print_signals(signals, sub_name, len(posts))
    print("\nDone. Total credits charged across all subreddits: check your dashboard.")


if __name__ == "__main__":
    main()
```

Copy that into scanner.py, fill in your credentials in the config block at the top, and run it:

```
python scanner.py
```

You’ll get output like:
```
── r/wallstreetbets ── 3 signals from 100 posts ──

  BULLISH [dd | relevance: 0.91 | confidence: 0.69]
  "NVDA DD: Blackwell demand still accelerating, hyperscaler capex not slowing"
  847 pts · 213 comments · https://reddit.com/r/wallstreetbets/comments/...

  BULLISH [news_reaction | relevance: 0.88 | confidence: 0.55]
  "NVDA up pre-market on Taiwan Semi data – what this means"
  312 pts · 89 comments · https://reddit.com/r/wallstreetbets/comments/...

  BULLISH [technical_analysis | relevance: 0.72 | confidence: 0.61]
  "NVDA holding the 200-day, looking for a bounce entry"
  204 pts · 56 comments · https://reddit.com/r/wallstreetbets/comments/...

── r/investing ── 1 signal from 100 posts ──

  BULLISH [fundamentals | relevance: 0.83 | confidence: 0.74]
  "The case for NVDA long-term: not just AI, the data center transition"
  1,204 pts · 341 comments · https://reddit.com/r/investing/comments/...
```

Extending it
A few ways to take this further once the basic pipeline is working.
Schedule it to run at market open
Add a cron job (or use the `schedule` library in Python) to run the scanner at 9:15 AM Eastern on weekdays:

```
# crontab -e (note: cron uses the server's local time)
15 9 * * 1-5 /usr/bin/python3 /path/to/scanner.py >> /var/log/scanner.log 2>&1
```

Save output to a CSV or database
Replace print_signals with a function that writes to a CSV or SQLite database, and you have a growing dataset of labeled Reddit posts tied to specific tickers and timestamps. This is genuinely useful for backtesting โ you can look back at the sentiment distribution for a ticker in the days before it moved and start building an intuition for what the signal distribution looks like ahead of a catalyst.
```python
import csv
from datetime import datetime

def save_signals(signals, filename="signals.csv"):
    with open(filename, "a", newline="") as f:
        writer = csv.writer(f)
        for s in signals:
            post = s["post"]
            out = s["output"]
            writer.writerow([
                datetime.utcnow().isoformat(),
                post.subreddit.display_name,
                post.id,
                post.title[:200],
                out.directionality.label,
                out.quality.label,
                round(out.relevance_score, 4),
                round(out.author_confidence, 4),
                out.sarcasm,
                out.post_type.label,
                post.score,
                post.num_comments,
                f"https://reddit.com{post.permalink}",
            ])
```

Scan multiple tickers
Change the config to loop over a watchlist:
```python
WATCHLIST = ["NVDA", "AAPL", "TSLA", "META", "AMD"]

for ticker in WATCHLIST:
    for sub_name in SUBREDDITS:
        posts = fetch_posts(reddit, sub_name, 50)
        items = prepare_items(posts, ticker=ticker)
        results = client.classify_batch(items)
        signals = filter_signals(posts, results.outputs, direction=None, min_relevance=0.70)
        print_signals(signals, sub_name, len(posts))
```

Note: a full watchlist scan is three subreddits × 50 posts each = 150 posts per ticker. At 5 tickers that’s 750 posts per run, costing about 530 credits at batch pricing (15 batches of 1.00 + 49 × 0.70 = 35.3 credits each). At that rate, the Starter plan ($29/mo, 100,000 credits) comfortably covers hourly scans during market hours all month.
Filter by post engagement
The Reddit post object includes .score and .num_comments. You can add an engagement floor to ignore posts nobody’s reading:
```python
posts = [p for p in raw_posts if p.score > 50 or p.num_comments > 20]
```

High-engagement posts aren’t inherently more accurate as signals, but they do represent the market’s attention – which is at least as relevant as whether the underlying thesis is correct.
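If you’d rather rank by attention than hard-filter, one option is to weight relevance by log-scaled engagement so heavily discussed posts sort first. The weighting scheme below is an illustration of the idea, not part of the API:

```python
import math

def attention_score(relevance, upvotes, comments):
    """Relevance weighted by log-scaled engagement.

    log1p dampens huge upvote counts; comments are weighted higher
    than upvotes on the assumption that they cost more effort.
    """
    engagement = math.log1p(upvotes + 2 * comments)
    return relevance * engagement

# Sort surviving signals so the most-read, most-relevant posts come first:
# signals.sort(
#     key=lambda s: attention_score(
#         s["output"].relevance_score, s["post"].score, s["post"].num_comments
#     ),
#     reverse=True,
# )
```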
Using requests instead of the SDK
If you prefer not to install the SDK, the REST call is straightforward:
```python
import requests

def classify_batch(items, api_key):
    response = requests.post(
        "https://api.finsignals.ai/v1/classify/batch",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json={"items": items},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Usage:
data = classify_batch(items, FINSIGNALS_API_KEY)
for post, output in zip(posts, data["outputs"]):
    print(post.title, "->", output["directionality"]["label"])
```

The response schema is documented in full at finsignals.ai/docs.
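One thing worth adding around any raw HTTP version is a retry wrapper for transient failures. This generic sketch retries on `OSError` (which `requests`’ connection errors subclass) with exponential backoff; the retry policy here is my own choice, not something the API prescribes:

```python
import time

def with_retries(fn, attempts=3, backoff=1.0, retryable=(OSError,)):
    """Call fn(); on a retryable exception, wait and try again.

    Backoff doubles each attempt (backoff, 2*backoff, ...). The last
    failure is re-raised so callers still see the real error.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (2 ** attempt))

# requests.ConnectionError subclasses OSError, so this covers it:
# data = with_retries(lambda: classify_batch(items, FINSIGNALS_API_KEY))
```

For HTTP-level errors (429, 5xx) you would check `response.status_code` inside the wrapped function and raise to trigger a retry; that behavior depends on how the API signals rate limits, so consult the docs.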
A word on what this isn’t
This pipeline surfaces posts worth reading. It doesn’t tell you what to trade, and the people running serious quant strategies treat social sentiment as one feature in a much larger model, not a standalone signal.
That said, the classification layer does something that a raw feed of Reddit posts can’t: it separates the posts that are directionally committed, high-quality, and on-topic from everything else. Whether you use that to trade, to research, or to build something on top of, that’s the part that’s actually hard to do well with a general-purpose model. A post that says “I wouldn’t touch this stock with a ten-foot pole” is negative sentiment. The same sentence at the end of three paragraphs of bullish analysis might be sarcasm. These distinctions are why domain-specific fine-tuning exists.
What to read next
- API documentation – full endpoint reference, batch pricing, and response schema
- API Tester – try classification without writing any code first
- Pricing – the free tier is 1,000 credits/month, Starter is $29/mo for 100,000
If you build something with this – a dashboard, a scanner, an alert system, anything – the contact page is open. Always interested in what people are building.