The Floor
Your workspace. Begin here.
“We are drowning in information, while starving for wisdom.” — E.O. Wilson, biologist
What is Thresh?
Every day, millions of people talk to each other on Reddit. They ask for help. They share what scares them. They argue about what matters. Most of it disappears into the scroll.
Thresh pulls it out.
Point it at any subreddit. Tell it what you're looking for. It hands you a clean dataset with a complete record of how it was collected. No accounts, no API keys, no code. Everything runs in your browser. Everything stays on your machine.
The threshing floor is among the oldest inventions in human history. High ground, flat stone, wind. The place where you brought the harvest and beat it until the grain fell free. Every civilization built one. Covenants were sealed on threshing floors. Altars were raised on them. Prophets used them to talk about judgment. They were never passive places. They were where you did the work of separation. The flood is information now, and the grain is what people are actually saying to each other when they think no one important is listening.
Quick Start
Enter a subreddit name and start collecting data in seconds. No API keys, no code, no setup.
How It Works
Recent Collections
No collections yet. Start by threshing a subreddit.
Who Uses This — And How
Public Health Researcher
"What are people in r/mentalhealth talking about this month?"
Harvest: Sort by score to find what resonates most
Winnow: Run Identify themes to map dominant concerns
Glean: Export CSV with anonymized usernames for IRB-ready analysis
Strategy tip: Collect the same subreddit monthly and compare word frequency tables. Emerging terms (e.g., a new medication name, a policy change) show you what’s shifting in the conversation before it shows up in the literature.
Journalist
"What questions are people asking in r/personalfinance about student loans?"
Harvest: Sort by num_comments for the biggest conversations
Winnow: Run Extract questions to find what people need answered
Glean: Provenance.txt gives your editor a transparent methodology section
Strategy tip: Try the same keyword across different subreddits (r/personalfinance, r/studentloans, r/povertyfinance). The same topic sounds different depending on who’s talking — that contrast is the story.
Graduate Student
"I need to compare discourse in r/science vs. r/conspiracy for my thesis."
Harvest: Sort by upvote_ratio to see consensus vs. division
Winnow: Run Sentiment analysis on each, then a Custom prompt comparing tone
Glean: Two exports, each with its own provenance — cite both in your methods section
Strategy tip: Use the Academic report format on each collection separately, then use a Custom prompt to compare them side by side. You now have a draft comparative analysis with proper methodology documentation for your advisor.
Community Organizer
"What are residents saying in our city's subreddit about the new transit plan?"
Strategy tip: Collect once with Top sort (what resonates) and once with Controversial sort (what divides). The gap between those two collections is where the real debate lives — and where a town hall should focus.
Labor Market Economist
"What pain points and unmet needs are job seekers describing in r/jobs right now?"
Harvest: Sort by score to find which frustrations resonate most widely
Winnow: Run Identify themes — salary transparency? ghosting? application fatigue? See what the data says
Glean: Generate an AI Research Report framed for your academic audience. You now have a first draft of a labor market sentiment brief grounded in real worker voices
Strategy tip: Thresh the same subreddit once a month for three months and compare word frequencies over time. That’s a longitudinal snapshot of worker sentiment — impossible to get from BLS data alone.
Political Campaign Manager
"What issues are people in r/Denver fired up about before our town hall?"
Harvest: Sort by num_comments to find what’s sparking the most debate
Winnow: Run Extract questions to see what voters are actually asking. Then Identify themes to categorize by issue: housing, crime, transit, cost of living
Glean: Generate a Town Hall Brief for your candidate. Export the data as a CSV backup for your comms team
Strategy tip: Collect from multiple city subreddits (r/Denver, r/DenverFood, r/DenverCirclejerk) to see how the same issues land in different community contexts. Each collection gets its own provenance seal.
Step 1 — Thresh
Configure and collect Reddit data. No API key needed.
“Attention is the rarest and purest form of generosity.” — Simone Weil, philosopher
Building Bigger Datasets
Reddit returns up to 100 posts per request. Thresh automatically paginates for you when you select 250 or 500 posts. But if you need richer data, the real power is systematic sampling: collecting the same subreddit under different sort methods, time filters, or points in time, then comparing the results.
Each collection is saved separately in your browser. All exports include full provenance documentation, so every dataset is replicable regardless of when or how you collected it.
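Under the hood, Reddit's listing endpoints cap each response at 100 items and hand back an `after` cursor for the next page, so a 250- or 500-post collection is a loop over that cursor. Here is a minimal sketch of that pagination loop, with the fetch function injected so the paging logic stands alone; `fetchPage` and its shape are assumptions for illustration, not Thresh's actual internals:

```javascript
// Collect up to `limit` posts by following Reddit-style `after` cursors.
// `fetchPage(after, count)` must resolve to { posts: [...], after: string|null };
// in a real collector it would wrap a request to the subreddit's public JSON listing.
async function collectPosts(fetchPage, limit, pageSize = 100) {
  const posts = [];
  let after = null;
  while (posts.length < limit) {
    const count = Math.min(pageSize, limit - posts.length);
    const page = await fetchPage(after, count);
    posts.push(...page.posts);
    after = page.after;
    if (!after || page.posts.length === 0) break; // no further pages
  }
  return posts;
}
```

A 250-post run is three fetches (100, 100, 50), which is also why the rate limit gauge barely moves under normal use.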
Step 2 — Harvest
Browse and explore your collected data.
“The eye sees only what the mind is prepared to comprehend.” — Robertson Davies, novelist
No Data Yet
Collect some Reddit data first, then come back here to explore it.
Step 3 — Winnow
Separate signal from noise. Start with the built-in word frequency table, then use Claude AI for deeper analysis.
“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.” — Daniel J. Boorstin, historian
No Data to Analyze Yet
Head over to Thresh to collect posts from a subreddit. Once you have a collection, come back here to explore word patterns and run AI analysis.
Step 4 — Glean
Export your data with full provenance documentation.
“Without data, you're just another person with an opinion.” — W. Edwards Deming, statistician
Nothing to Export
Collect some data first, then return here to export it.
About
Ethics, methodology, and how this tool works.
“Science is a way of thinking much more than it is a body of knowledge.” — Carl Sagan, astronomer
A Letter from the Builder
I built this because I know what it feels like to drown.
Not in water. In data.
I spent years writing a dissertation on social media discourse during the COVID-19 pandemic. Collecting tweets. Analyzing what people were saying to each other while the world fell apart. And what I found was genuinely illuminating. The patterns were there. The fear, the misinformation, the solidarity, the grief, the stubborn and sometimes beautiful ways people tried to make sense of something that didn't make sense. It was some of the most meaningful work I've ever done.
But the process of getting there was brutal. Even for me. Even with training and tools and institutional support. It required programming. It required API credentials. It required understanding rate limits and data schemas and pagination logic and a dozen other things that had nothing to do with the actual research question I was trying to answer.
And I kept thinking: if this is this hard for me, it is impossible for most people. The teachers who sense something shifting in how their students talk about mental health online. The journalists who can feel a story forming in a subreddit but can't prove it yet. The community organizers. The grad students. They're locked out. Not because they lack the curiosity or the intellect, but because the tools weren't built for them.
So I built this.
It can do much of what my dissertation did. Smaller scale, because of Reddit's rate limits. But that's fine. Small is good, if pointed in the right place, focused with the right lens. A hundred posts from the right subreddit in the right week will tell you more about how a community thinks than a thousand posts scraped without intention.
The world right now is full of quantitative social science. AI has made it easy to count things, to run regressions, to produce charts at speed. I'm grateful for that. But this project aims at something different. It aims at breathing life back into a research method I hold dear: qualitative inquiry. The close reading. The careful listening. Not because I'm an expert in it. Exactly the opposite, really. I chose the quantitative path early in my training, and I have admired qualitative work from afar ever since, with the quiet reverence of someone who knows they took the other road. There is something mysterious in it. Something irreducible. The act of sitting with people's words and asking what do they mean rather than how many are there.
At the deepest core, my hope is that Thresh marries the two again. That it gives you the structure to collect carefully, the numbers to orient yourself, and then the space to read. Really read. What people are saying. That it empowers anyone interested in understanding public discourse to gather data more easily, more ethically, and more transparently than the old tools ever allowed.
The ultimate recursion is this: what was once my dissertation is now everybody's tool.
The waters rise. The feed is a flood. But the threshing floor is high ground, and the work of separation is ancient, and the grain is worth the labor.
With hope,
Jacob E. Thomas, MA, PhD
February 2026
What Is This?
The Threshing Floor is a free, open-source tool for collecting and exporting Reddit data. It is designed for public health researchers, journalists, civic technologists, and anyone who believes public discourse is worth measuring.
It runs entirely in your browser. There is no server, no database, no account to create. Your data stays on your machine.
How It Works
Reddit's public pages serve JSON data alongside HTML. The Threshing Floor fetches this public data through a lightweight proxy (to handle browser security restrictions), then lets you browse, filter, and export it.
No Reddit API key is required. No authentication of any kind. The data collected is limited to what any person could see by visiting Reddit in a web browser.
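The actual requests go through the proxy, but the underlying endpoint pattern is simply a subreddit page with `.json` appended. This sketch builds such a URL; the helper name and option defaults are illustrative, while the `t` and `limit` parameter names follow Reddit's public listing conventions:

```javascript
// Build the public JSON listing URL for a subreddit -- the same data you
// see at reddit.com, with ".json" appended to the listing path.
function listingUrl(subreddit, { sort = "hot", time = "week", limit = 100 } = {}) {
  const params = new URLSearchParams({ t: time, limit: String(limit) });
  return `https://www.reddit.com/r/${subreddit}/${sort}.json?${params}`;
}
```

For example, `listingUrl("python", { sort: "top", time: "month" })` yields `https://www.reddit.com/r/python/top.json?t=month&limit=100`, which you can open directly in a browser tab.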
The workflow follows four steps. Each one builds on the last:
- Thresh (Collect) — Enter a subreddit, choose your sort method and time filter, set a post limit, and optionally filter by keyword. Click Thresh. The tool gathers posts (and optionally comments) and stores them in your browser.
- Harvest (Browse) — Explore your collected data in a sortable, searchable table. Click any post to read its full text and comments. Sort by score to see what resonates, by comments to see what sparks debate, or by date to follow the conversation over time.
- Winnow (Analyze) — Start with the built-in word frequency table, then use Claude AI to identify themes, extract sentiment, summarize discussions, or ask custom questions about your data. AI analysis is built in and free to use.
- Glean (Export) — Download your data as CSV or JSON, bundled in a ZIP with a provenance.txt file that documents exactly how it was collected. Or generate an AI Report: answer two questions about your research, choose your audience, and Claude produces a document tailored to your needs — an IMRaD paper for academics, a data-driven column for journalists, a town hall brief for organizers, or an op-ed for general audiences.
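The word frequency table that Winnow starts from is plain counting. A minimal sketch of the idea, assuming a small illustrative stop-word list (the app's actual list and tokenization may differ):

```javascript
// Count word occurrences across a set of post texts, skipping a short
// (illustrative) stop-word list, and return the top N as [word, count] pairs.
const STOP_WORDS = new Set(["the", "a", "an", "and", "or", "to", "of", "in", "is", "it"]);

function wordFrequency(texts, topN = 10) {
  const counts = new Map();
  for (const text of texts) {
    for (const word of text.toLowerCase().match(/[a-z']+/g) ?? []) {
      if (STOP_WORDS.has(word)) continue;
      counts.set(word, (counts.get(word) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN);
}
```

Running the same count on collections from different months is the cheapest longitudinal comparison available, which is why several of the strategy tips above lean on it.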
The AI Report Generator
The Glean page includes an AI Report Generator powered by Claude Opus 4.6, Anthropic's most capable model. You provide a research question, choose your audience, and optionally add context. Claude aggregates your collection metadata, summary statistics, word frequency analysis, and the content of up to 50 posts into a complete document.
Each audience gets a different document format:
- Academic — Formal IMRaD paper (Introduction, Methods, Results, Discussion). Hedged language, rigorous limitations, replication-ready methodology. For theses, journal submissions, and IRB documentation.
- Journalist — A data-driven column. Lede-first structure, compelling narrative, quotes from the data, and a methods note for transparency. For newsroom briefings and story pitches.
- Community Organizer — A town hall brief. Key findings as talking points, community voices highlighted, recommended actions, and a statistical snapshot. For city council meetings, campaign prep, and advocacy reports.
- General — An op-ed or explainer piece. Curious tone, narrative arc, accessible statistics, honest caveats. For personal research, newsletters, and public writing.
The report downloads as a formatted Word document (.docx) ready for editing in Microsoft Word or Google Docs. Markdown and clipboard copy are also available.
What was once a dissertation-level undertaking now takes ten minutes and a good question.
Your Data & Storage
Everything is stored in your browser only. There is no server database, no account system, and no cloud sync. Specifically:
- Collections (posts, comments, configuration) are saved in your browser's localStorage.
- Rate limit state and subreddit cache are also stored locally in your browser.
AI analysis is powered by Claude Opus 4.6 via a secure managed proxy. No API keys are required or stored on your device.
This means your data does not sync across browsers or devices. If you switch browsers, your collections will not follow.
Clearing Your History
To erase all Thresh data from your browser:
- Open your browser's Settings (or press Ctrl+Shift+Delete / Cmd+Shift+Delete)
- Navigate to Privacy & Security → Clear browsing data
- Select "Cookies and site data" (this includes localStorage)
- To clear only Thresh: open your browser's Developer Tools (F12), go to the Application tab, expand Local Storage, find the Thresh site, and delete individual keys or click Clear All
This will remove all saved collections, rate limit state, and cached subreddit data. It cannot be undone.
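If you are comfortable in the developer console, the DevTools steps can be scripted. This sketch removes every localStorage entry whose key starts with a given prefix; the `"thresh_"` prefix is a hypothetical example, so check the Application tab for the keys the app actually writes:

```javascript
// Remove every localStorage entry whose key starts with `prefix`.
// NOTE: "thresh_" below is a hypothetical prefix -- inspect DevTools >
// Application > Local Storage to find the real key names before running this.
function clearByPrefix(storage, prefix) {
  const doomed = [];
  for (let i = 0; i < storage.length; i++) {
    const key = storage.key(i);
    if (key.startsWith(prefix)) doomed.push(key);
  }
  doomed.forEach((key) => storage.removeItem(key));
  return doomed; // the keys that were deleted
}
// Usage in the console: clearByPrefix(window.localStorage, "thresh_");
```

Collecting the keys first and deleting afterwards avoids skipping entries, since removing items while iterating re-indexes the storage.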
The Rate Limit Gauge
The Rate Limit gauge at the bottom of the sidebar tracks how many requests remain in your current Reddit rate limit window. Reddit allows 100 requests per minute to its public JSON endpoints.
- Gold bar — plenty of requests remaining. Normal operation.
- Yellow bar (below 30%) — requests are running low. Consider pausing collection.
- Red pulsing bar (below 10%) — critical. Thresh will pause automatically if the limit is reached.
- Cooldown timer — if you hit the limit, a countdown appears showing when requests resume. The collect button is disabled until the cooldown expires.
The rate limit resets automatically each minute. Under normal use (25–100 posts per collection), you will rarely see it drop below gold.
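The behavior the gauge reports can be modeled as a sliding-window counter: up to 100 requests within any rolling minute. This is a sketch of that model, not Thresh's actual implementation, which may track the window differently:

```javascript
// A sliding-window rate limiter: allows up to `maxRequests` within any
// rolling `windowMs` span, and reports the remaining fraction of the budget.
class SlidingWindowLimiter {
  constructor(maxRequests = 100, windowMs = 60_000) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.timestamps = [];
  }
  // Try to record a request at time `now` (ms). Returns true if allowed.
  tryAcquire(now = Date.now()) {
    // Drop timestamps that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(now);
    return true;
  }
  // Fraction of the budget remaining -- roughly what the gauge displays.
  remainingFraction(now = Date.now()) {
    const live = this.timestamps.filter((t) => now - t < this.windowMs).length;
    return (this.maxRequests - live) / this.maxRequests;
  }
}
```

In this model the gold/yellow/red thresholds correspond to `remainingFraction()` crossing 0.3 and 0.1.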
Ethical Considerations
- Re-identification risk: Even with usernames removed, unique writing styles or specific details in posts may allow re-identification. Consider this when publishing findings.
- IRB guidance: If you are conducting academic research, consult your Institutional Review Board about whether your data collection constitutes human subjects research.
- Reddit's Terms: This tool accesses publicly available data. Please review Reddit's Terms of Service and API Terms regarding data collection and use.
- Consent: Reddit users post publicly, but they may not expect their posts to be aggregated and analyzed. Handle data with care and respect.
- Default anonymization: Exports anonymize usernames by default. You can disable this, but consider the implications before doing so.
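One common approach to default anonymization, shown here as a sketch rather than Thresh's actual implementation, is deterministic pseudonyms: within a single export, the same username always maps to the same label, so reply threads stay legible without exposing identities.

```javascript
// Map each username to a stable pseudonym ("user_1", "user_2", ...) within
// one export. Deterministic inside the export so conversation threads stay
// readable; no mapping back to real usernames is retained.
function makeAnonymizer() {
  const seen = new Map();
  return (username) => {
    if (!seen.has(username)) seen.set(username, `user_${seen.size + 1}`);
    return seen.get(username);
  };
}
```

Even with this in place, the re-identification caveat above still applies: distinctive prose or specific life details can identify an author regardless of the username label.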
Provenance
Every export includes a provenance.txt file documenting exactly how the data was collected: the subreddit, sort method, time filter, number of posts, date of collection, and any filters applied. This is the seal on every bundle — it gives you the language for a methods section, a transparency report, or a replication attempt.
AI Analysis
Thresh uses Claude Opus 4.6, Anthropic's most capable model, for analysis and research report generation. AI features are built in and free to use — no API key needed.
AI requests are routed through a secure managed proxy at api.the-threshing-floor.com. The API key is stored server-side as an encrypted secret. Your data is sent to Anthropic for analysis and is not logged or stored by the proxy.
Open Source
The Threshing Floor is free and open source. The code is available on GitHub.
Get in Touch
Found a bug? Have a feature request? Want to share how you’re using Thresh? Reach out anytime:
Citation
Thomas, J.E. (2026). The Threshing Floor: A browser-based tool for Reddit data collection and export. https://github.com/jethomasphd/The_Threshing_Floor
Built using Latent Dialogic Space