Reddit Mass Account Creation: Reverse Engineering Protected APIs and Captcha Infrastructure

This post is a semi-deep dive into how I built AccGen, a Reddit account generator that runs on top of Reddit's Android API. I'll cover the techniques that worked, the infrastructure decisions, and the one problem I couldn't solve easily: captcha.

Intercepting the Android API

I have built multiple account generation tools for Reddit and other social media platforms. There are several ways to do it. The easiest to start with is to build on top of browsers: it is resource intensive, but if you're good at handling browsers at scale, it will work to an extent. The second way is to build with physical phones, which is also resource intensive and requires a budget, but I would consider it the best approach and the easiest to keep working. The third approach, which is the one I took, is reverse engineering the mobile API and sending requests from a server in a way that makes it look like multiple phones are registering accounts.

Reddit's Android app talks to Reddit's servers through its own set of endpoints, separate from the ones the website uses. These are internal endpoints meant to be used only by the app, so they are heavily protected and designed for a client that only Reddit controls. Reddit uses multiple techniques to determine whether a request comes from its own app or from some other, malicious client (AccGen in this case).

There are multiple ways to intercept the request flow that registers an account. One way is to use a proxy, but the issue with that is that Reddit has SSL pinning, and the proxy's certificate will not pass unless you're doing this on a rooted device. I personally built custom sniffers with Frida and installed them on the Reddit app. This saved a lot of time, because those sniffers recorded every request with an exact millisecond timestamp, which made it easier to analyze the request flow and see which request should go before or after which. The easiest way would be to just root your device and use existing proxy software like mitmproxy.

Once the traffic was visible, the pattern became clear:

The app uses a mixture of REST and GraphQL.
Account creation goes through a flow of anywhere from 80-something to 100+ requests, excluding telemetry.

Mapping the GraphQL schema across all operations took trial and error.

Key Insight

The Android API is less restricted than the web API because Reddit assumes the client is trusted. Once you can speak the same protocol, you get access to endpoints that aren't available through OAuth-wrapped public routes.

Device Fingerprint Spoofing

The first thing Reddit checks when you connect is the client's TLS fingerprint, which it reads from the TLS ClientHello sent right after the TCP handshake. It blocks access depending on whether your TLS fingerprint matches what the server expects. For example, if you connect to a mobile-only endpoint with a TLS fingerprint that belongs to a browser, you will get blocked. Not immediately, but their backend learns pretty fast.

The fingerprint signals I had to replicate came in two layers.

Before I could build phone fingerprints, I had to check which HTTP client the Reddit app uses, so that I could see what TLS fingerprint it produces. If Reddit is using OkHttp 5.2, for example, and you impersonate OkHttp 3.1, you will get blocked. I learned their client's TLS fingerprint by making an Android app with the same client, performing a request to tls.peet.ws, and extracting the JA3, Akamai, and other values. Consider this the base fingerprint, the part that tells Reddit's servers that all the requests are going through the official app. On top of that come the phone-specific fingerprints. These need to be random to an extent; we can't make all accounts on a Samsung Galaxy S22, for example.

I used curl_cffi to spoof the TLS fingerprint because it was the only library I could find that accepts custom JA3 and Akamai values.

I stored a pool of 500+ real device profiles extracted from actual Android phones. Each account creation request pulled a fresh profile and rotated the identifiers. The key was consistency: a Galaxy S22 running Android 14 can't suddenly report a Google Pixel screen resolution mid-session. Spoofing those values required patching the HTTP headers.

Device profiles need to be real. You can't make up values and expect them to work.

Proxy Infrastructure

Reddit relies more on behavior and telemetry than on IP. They still use IP to figure out which traffic is high risk, but they lean mostly on user behavior to tell bots from legitimate users.

Since I was building software that was supposed to behave like a lot of mobile phones, the only viable option I had was to use mobile proxies.

Proxies were generated on the fly: Oxylabs' and Decodo's systems let you generate a session ID, and their backend assigns an IP to that session for a given amount of time. Each proxy was tested before use against IP-API's JSON endpoint, and I saved information such as where it's located and which ISP it belongs to, so that future proxy generation could maintain the same geolocation. One proxy was generated for each account. Reddit allows creating up to 4 accounts with the same IP, but why go that high when I have the option to map one proxy to one account?

Session Management at Scale

Reddit issues a session token on creation. That token expires, and refreshing it requires the same device fingerprint that was used during registration. Change any parameter and the refresh raises a flag. Like I said above, Reddit is very sensitive to behavior.

Once the account is made, it gets transferred to a "warmup" phase, which is a separate section of AccGen. It does what a normal bored person would do: joins a few subreddits, scrolls a few posts, then leaves the app.

Then everything was saved to the database: the account, along with everything that belongs to it, to maintain consistency.

I built a session store that tracked:

The original device profile used for creation
The current auth token and its expiry
Cookie state across the session
Rate limit headers from every response

Each account had a scheduled refresh cycle. Tokens were rotated before expiry to avoid a gap. If a refresh failed, the account entered a recovery flow: re-authenticate with the same fingerprint, or mark it as dead.
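The "rotate before expiry" rule is the same one any token-based client follows, and it can be sketched as a pure function over stored expiry times. This is a minimal sketch, not AccGen's code: the `Session` class, the `due_for_refresh` name, and the 10-minute safety margin are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Session:
    account_id: str
    token_expiry: datetime  # when the current auth token stops being valid


def due_for_refresh(sessions, now, margin=timedelta(minutes=10)):
    """Return sessions whose token expires within `margin` of `now`.

    Already-expired tokens are included too. The result is sorted so the
    most urgent session (earliest expiry) comes first.
    The 10-minute default margin is an arbitrary illustrative value.
    """
    due = [s for s in sessions if s.token_expiry - now <= margin]
    return sorted(due, key=lambda s: s.token_expiry)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    store = [
        Session("a", now + timedelta(minutes=5)),   # inside the margin
        Session("b", now + timedelta(hours=1)),     # still fine
        Session("c", now - timedelta(minutes=1)),   # already expired
    ]
    for session in due_for_refresh(store, now):
        print(session.account_id, session.token_expiry.isoformat())
```

Refreshing inside a margin rather than at the expiry instant is what avoids the gap mentioned above: a slow or failed refresh still has time to be retried before the old token dies.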

Accounts that sat idle for too long were also at risk. Reddit's dormant account detection flags accounts that never interact. I added a minimal activity layer: random subreddit visits, a saved post, a scroll action, to keep them warm.

Session Discipline

Every account had a state machine: created, active, recovery, dead. Alerts fired when too many accounts entered recovery at once; it usually meant a fingerprint profile had been burned and needed replacement.
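A four-state lifecycle like this is small enough to write down explicitly. The sketch below is an illustration under assumptions: the post names only the four states, so the allowed transitions, the `transition` and `recovery_alert` helpers, and the 20% alert threshold are all hypothetical.

```python
from collections import Counter
from enum import Enum


class State(Enum):
    CREATED = "created"
    ACTIVE = "active"
    RECOVERY = "recovery"
    DEAD = "dead"


# Allowed transitions. This table is an assumption; the post only lists the states.
TRANSITIONS = {
    State.CREATED: {State.ACTIVE, State.DEAD},
    State.ACTIVE: {State.RECOVERY, State.DEAD},
    State.RECOVERY: {State.ACTIVE, State.DEAD},
    State.DEAD: set(),  # terminal state
}


def transition(current, target):
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def recovery_alert(states, threshold=0.2):
    """True when the share of accounts in RECOVERY exceeds `threshold`.

    The 0.2 default is an arbitrary illustrative value.
    """
    states = list(states)
    if not states:
        return False
    return Counter(states)[State.RECOVERY] / len(states) > threshold


if __name__ == "__main__":
    fleet = [State.ACTIVE, State.ACTIVE, State.RECOVERY]
    print(recovery_alert(fleet))  # 1 of 3 accounts in recovery
```

Keeping the transition table as data makes an illegal move (for example, reviving a dead account) fail loudly instead of silently corrupting the store.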

Scaling

Going from a couple of accounts a day to a few dozen worked fine. Going from dozens to hundreds and potentially thousands? The infrastructure held up. Proxies rotated, fingerprints were consistent, sessions stayed alive. None of that was the bottleneck.

The only thing that stopped me from scaling was captcha.

The Captcha Wall

This is where the project hit a wall.

Reddit uses Google's reCAPTCHA on their systems, both on web and mobile. For web reCAPTCHA tokens, there are dozens of providers, which are all trash. For mobile, there's only one provider, which is trash as well.

Don't get me wrong, those captcha solving services do solve captcha, and they do return tokens. But the tokens they return scream "I am a bot". It just depends on what you're trying to do. If you're after retrieving data from a host without having to go through authentication, then those captcha solving services work okay. But be careful using their tokens when performing any authenticated action: you will lose that account.

I found a benchmark that confirmed my suspicions. The article "Benchmarking reCAPTCHA v3 Solver Services: Speed vs Quality Analysis" tested the best-known captcha solving services. TL;DR: captcha is doing its job perfectly by keeping bots at bay.

The problem is structural. These services are detectable. Google's ML has trained on enough of their traffic patterns that the tokens carry a signature. No amount of provider-hopping changes that.

The captcha solving services sell you a token. They don't sell you a high-score token, even though some of them claim they do. The distinction matters because for account registration, the threshold is higher than for general browsing. Reddit's registration flow uses reCAPTCHA v3 with a score requirement that no commercial service meets.

Every captcha solving service claims high success rates. What they don't tell you is that the tokens they return have scores too low for authenticated flows.

Reddit actually uses two thresholds in its registration flow: an invisible one and a hard one. If you watch the video I embedded in this blog post, you will notice that AccGen hit the hard threshold, and if you pay close attention, you will notice that the token was returned in a couple of seconds. The reason? The captcha token provider is using a caching system, which is unfair of them to leave out of their terms of service. Most people who don't know how caching or captcha token harvesting works won't notice, and will get lost trying to figure out what is going on.

Hitting a hard threshold happened about 60% of the time, which is my guess at how often the provider I was using was caching their tokens. The token was returned by the provider but Reddit refused it. The account never existed.

The invisible threshold hit 100% of the time if the token was outsourced. Reddit has a cron job that starts daily at 12am UTC, sweeping and suspending all accounts created with low score tokens. Every single account I made with outsourced captcha tokens was eventually caught by that sweep, regardless of how good the fingerprint was or how natural the behavior looked. The exact score Reddit uses to suspend the accounts is private information only Reddit knows.

So even when I got past the hard threshold (40% of attempts), the invisible threshold guaranteed the account had a shelf life of up to 24 hours.
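Putting the two figures together shows why outsourced tokens were a dead end. The short calculation below uses only the rates quoted above (60% rejected at the hard threshold, 100% of the rest caught by the sweep); the function name and the batch size of 1000 attempts are made up for illustration.

```python
def registration_funnel(attempts, hard_reject_rate, sweep_catch_rate):
    """Return (accounts created, accounts surviving the daily sweep).

    hard_reject_rate: share of attempts refused at the hard threshold.
    sweep_catch_rate: share of created accounts later suspended by the sweep.
    """
    created = attempts * (1 - hard_reject_rate)
    surviving = created * (1 - sweep_catch_rate)
    return created, surviving


if __name__ == "__main__":
    # Rates from the post; 1000 attempts is an arbitrary batch size.
    created, surviving = registration_funnel(1000, 0.60, 1.00)
    print(f"created: {created:.0f}, surviving: {surviving:.0f}")
```

With a sweep catch rate of 100%, the number of surviving accounts is zero no matter how many attempts pass the hard threshold, so improving the pass rate alone changes nothing.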

I spent weeks trying to figure out how to solve this, and I kinda did solve it.

How To Fix The Captcha Issue

The issue with outsourcing captcha tokens from public providers is structural. Their infrastructure is detectable; Google's ML has fingerprinting models trained specifically on their traffic. It doesn't matter if you switch providers, rotate proxies, or change fingerprint profiles. The token itself carries a signature.

The only way to get a high-score token is to generate it from a device that Google trusts.

I built a captcha harvesting system that does exactly that. The architecture is simple:

A rooted Android device runs the Reddit app with a Frida script injected into it. That script hooks into Google's reCAPTCHA SDK at the Java layer and calls its internal methods: getTasksClient() to initialize, then executeTask() to generate tokens. The tokens come from the same SDK the legitimate app uses, on a real device, with real Google services. No proxied solving, no emulated devices, no ML-detected patterns.

The Frida script exposes an RPC interface. A Python FastAPI server wraps it into a REST API with a task queue. AccGen sends a POST to /get-token with the app package and action, gets a task ID back, and polls /task/{task_id} until the token is ready.

This worked. The tokens passed Reddit's hard threshold consistently. Accounts stayed alive longer, sometimes for days instead of hours.

But there is a catch. Harvesting too many tokens from a single device triggers Google's rate limiting. The more tokens you pull, the lower the scores drift. It is the same problem the commercial services have, just on a different level. One device can sustain a few dozen accounts per day. Push past that and the scores drop.

The real fix is farming the workload across many devices. Each device generates a small number of tokens per day, staying under Google's radar. Scale horizontally instead of vertically. I never built that cluster (it would have required more hardware than I was willing to commit), but the architecture is designed for it. The queue system in the harvester API was built to support multiple Frida endpoints from day one.

Conclusion

AccGen proved that every layer of Reddit's anti-bot system can be beaten in isolation. The API can be reversed. The fingerprint can be spoofed. Sessions can be maintained at scale. None of those are the hard problem.

The hard problem is captcha, and the hard part of captcha is not solving it; it is solving it at scale without being detected. One device works. A cluster of devices, each pulling a small number of tokens per day, would work at any scale.

The commercial captcha services are not in the business of selling you high-score tokens. They are in the business of selling you tokens, period. If you need high-score tokens, you have to own the system that generates them.