The case for a bug bounty in ML
When I read an ML paper closely enough to actually try to extend it, I find something wrong about a third of the time. I don’t mean fraud (that’s pretty rare in my experience, and not really what I’m worried about anyway). I mean things like a hyperparameter sweep that wasn’t fully reported, or a baseline that was tuned less carefully than the proposed method, or quiet contamination in the benchmark, or a result that just doesn’t survive a different seed. Some of these matter for the central claim of the paper and some really don’t, but either way almost none of them ever get corrected.
This bothers me less than it used to, because over time I’ve come to think of it as a fairly normal failure of the publication economy and not a moral failing of any particular author. The people who tend to notice these things are graduate students extending the paper, or engineers trying to deploy the method, or just careful readers with time on their hands, and there isn’t really anywhere obvious for any of them to send what they’ve found. None of them are getting paid to do the additional work it would take to turn a private suspicion into a public correction either. So the error stays where it was; the paper keeps getting cited; and a few months later somebody else is building on top of it without knowing.
Peer review and replication, the two mechanisms we usually point at when we want to feel okay about all this, have stopped doing the work. NeurIPS 2024 received over 17,000 submissions, and a reviewer who actually wanted to rerun a paper’s experiments has maybe three weeks for it on top of a day job, for free. That just isn’t going to happen. Replication is in worse shape: a serious check on an expensive paper costs thousands of dollars in compute and weeks of senior engineering time, and the reward for doing it carefully is essentially nothing. There’s nowhere obvious to publish a replication, grant cycles aren’t really set up to fund this kind of work, and no tenure committee I know of will give you credit for it. So mostly it doesn’t happen, and we all sort of pretend it does. (I have specific examples I won’t name here.)
Software security ran into a structurally similar problem in the 90s. Vendors couldn’t find every vulnerability in their own products, and the users who did find vulnerabilities had no real incentive to write them up carefully. Bug bounty programs eventually changed the equilibrium by turning “find a flaw and document it well” into a paid activity. Two decades on, more or less every serious software product has a bounty program, and the security posture of the industry is in a pretty different place than where it started. The interesting thing about that history isn’t that bounties are some kind of magic bullet (they aren’t); it’s that paying for a kind of work nobody was previously paid for changed what kind of work got done.
The closest thing to an analog in science right now is ERROR (error.reviews), which launched in 2024 and pays specialists to check highly cited psychology papers. ERROR is a great project, but it’s also pretty deliberately designed for psychology: authors opt in to being checked, investigators come from a small pool of trusted reviewers, and the artifacts under review tend to be textual and statistical. That makes sense for psychology, because that’s how psychology papers are mostly constructed. ML papers are constructed differently, in ways that I think open up a different model.
Most ML papers basically come with downloadable artifacts attached. The code is usually on GitHub somewhere, the weights are on Hugging Face, and the benchmark is just public. Checking a claim doesn’t really need permission from the authors, or lab access, or biological samples, or institutional review. The entire surface of the claim is sitting on a server somewhere, which means adversarial checking, without opt-in from authors, is at least possible. That alone is a different game from ERROR’s.
A lot of the errors are also mechanically checkable in a way I find satisfying. If you want to know whether the test set was in the pretraining corpus, you can substring-match against indexed training data. If you want to know whether a baseline got the same compute budget as the proposed method, you can re-run things with matched compute. If you want to know whether the headline number actually survives across seeds, that’s maybe five GPU-days of work, give or take. These mostly aren’t questions of interpretation; they have answers, and the answers are reproducible. (This isn’t true in every field. In some areas of biology, “did this experiment replicate” is itself a contested question.) Mechanical checks matter more than judgment-based ones for a thing like this, because mechanical checks generate norms a field can absorb over time. One-off heroic replications generally don’t.
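To make “mechanically checkable” a bit more concrete, here is a minimal sketch of two of the checks above: a verbatim contamination match and a seed-robustness summary. Everything in it is illustrative. The function names, the 50-character minimum, and the “gap must exceed twice the seed noise” rule are my own assumptions, not an established protocol, and a real contamination check would run against an indexed corpus rather than a Python list.

```python
import re
import statistics


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def contaminated_examples(test_examples, corpus_docs, min_chars=50):
    """Return indices of test examples that appear verbatim in any corpus document.

    ASSUMPTION: min_chars=50 is an arbitrary cutoff; short strings match by chance.
    """
    docs = [normalize(doc) for doc in corpus_docs]
    hits = []
    for i, example in enumerate(test_examples):
        needle = normalize(example)
        if len(needle) < min_chars:
            continue  # too short to be meaningful evidence of contamination
        if any(needle in doc for doc in docs):
            hits.append(i)
    return hits


def seed_summary(proposed_scores, baseline_scores):
    """Compare per-seed scores (at least two seeds each) for a method and its baseline.

    ASSUMPTION: "survives" means the mean gap exceeds twice the larger
    per-seed standard deviation. This is a crude heuristic, not a significance test.
    """
    gap = statistics.mean(proposed_scores) - statistics.mean(baseline_scores)
    noise = max(statistics.stdev(proposed_scores), statistics.stdev(baseline_scores))
    return {"gap": gap, "noise": noise, "survives": gap > 2 * noise}


# Toy data, invented for illustration only.
corpus = [
    "Some web page. The quick brown fox jumps over the lazy dog "
    "near the old riverbank at dawn. More text."
]
tests = [
    "The quick brown fox jumps over the   lazy dog near the old riverbank at dawn.",
    "A completely different question about prime numbers and modular arithmetic.",
]
print(contaminated_examples(tests, corpus))  # -> [0]
print(seed_summary([71.2, 70.8, 71.5, 70.9, 71.1], [70.9, 71.3, 70.6, 71.2, 71.0]))
```

The point isn’t the specific thresholds. It’s that both checks return the same answer no matter who runs them, which is the property an adjudication panel would need.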
And then AI assistants have made the routine parts of careful checking dramatically cheaper than they were even two years ago. A research engineer paired with a strong assistant can read a paper, locate the artifacts, identify likely failure modes, draft a check protocol, and run it in a matter of days rather than months. The labor cost of a thorough check has come down sharply, while the credit assignment system in academia really hasn’t moved much at all, and that is the gap a bounty program would fill. Cheaper checking also widens the pool of people who can do the work: a careful non-specialist with a strong assistant can now verify things that would previously have required a domain expert, which means the checking pool can be much larger than the small invited panel ERROR works with.
A pilot would look something like this. Take a hundred ML papers from 2022 through 2024, weighted toward high citation count and downstream production use. Set up a bounty board with rough tiers for the kinds of issues you’d want surfaced: training-data contamination at the low end (something like $5K), unreported hyperparameter selection that flips the headline result (more like $10K), a failed replication of the central claim under a faithful protocol (around $15K), data leakage that invalidates the conclusion ($25K or so), and fabricated or impossibly strong baselines (somewhere in the $50K range). The exact numbers don’t matter that much; what matters is that the gradient roughly tracks how badly each kind of issue would mislead someone trying to build on the work. Adjudication happens via an independent panel of three senior researchers per claim, rotating across claims so that no single panel ends up dominating things. Original authors get a structured response window with the same publication footing as the checker, and everything is published, including the disputes that don’t ever cleanly resolve. (The unresolved cases are arguably the most interesting data the program would generate.) Total program cost, including adjudication, infrastructure, and compute reimbursement, lands somewhere in the $3-5M range, which is small relative to one frontier training run and roughly the cost of one year of a small academic lab.
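The tier structure above is simple enough to write down as data, which makes it easy to see how total payouts respond to different hit rates. The sketch below uses the payout figures from the paragraph above. The hit counts per hundred papers are placeholders I made up purely to show the arithmetic. They are not estimates of how often these problems occur, and the names `Tier`, `TIERS`, and `ASSUMED_HITS_PER_100_PAPERS` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    payout_usd: int


# Payouts follow the rough tiers described in the text, lowest to highest.
TIERS = [
    Tier("training-data contamination", 5_000),
    Tier("unreported hyperparameter selection that flips the headline result", 10_000),
    Tier("failed replication of the central claim", 15_000),
    Tier("data leakage that invalidates the conclusion", 25_000),
    Tier("fabricated or impossibly strong baselines", 50_000),
]

# ASSUMPTION: validated findings per 100 papers, one count per tier.
# These are invented placeholders, not measurements or predictions.
ASSUMED_HITS_PER_100_PAPERS = [10, 8, 5, 2, 1]


def payouts_increase(tiers):
    """Check that payouts rise strictly from the first tier to the last."""
    return all(a.payout_usd < b.payout_usd for a, b in zip(tiers, tiers[1:]))


def total_payout(tiers, hits):
    """Sum payout * validated findings across tiers."""
    if len(tiers) != len(hits):
        raise ValueError("need exactly one hit count per tier")
    return sum(tier.payout_usd * count for tier, count in zip(tiers, hits))


print(payouts_increase(TIERS))
print(total_payout(TIERS, ASSUMED_HITS_PER_100_PAPERS))
```

Swapping in different hit counts is the whole exercise here: it shows how sensitive the payout line is to the error rate the pilot is meant to measure, separately from the adjudication, infrastructure, and compute costs.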
The most boring outcome is that almost no errors actually surface, in which case the field is in better shape than the irreproducibility discourse would suggest, and a few million dollars is a pretty cheap way to find that out. A more interesting outcome is that a lot of errors surface but adjudication breaks down, because checkers and authors can’t agree on what even counts as a flaw. That probably tells us ML doesn’t have shared standards for what an experimental claim means, which is arguably more important to learn than any one paper’s correctness, and it suggests the next intervention to fund should be about methodological standards rather than error detection per se. The case I personally worry about most is that errors do surface and authors retaliate against the checkers in ways the program can’t shield them from. If that happens, we’ve learned that the cultural side of this is deeper than the incentive side, and that any future bounty program would need legal or institutional protections built in from the start. If it just works, we have a template, and a lot less hand-waving in the whole discourse.
Some of the obvious failure modes aren’t informative and the design has to handle them from the start: frivolous claims gaming the bounty pool, coordinated attacks on individual researchers, a selection bias toward papers with the most checkable artifacts (which would create a perverse incentive against open science). Most of the actual design work, in my experience thinking about this, is in the adjudication procedure and the protections for both checkers and authors. The bounty mechanism itself is honestly the easy part.
None of this would have made sense as a proposal three or four years ago. The labor cost of careful checking was much higher then, the downstream cost of a contaminated benchmark was much lower (papers mostly stayed in the literature instead of getting deployed into production systems, cited in procurement decisions, or built into safety evaluations), and there wasn’t really a generation of postdocs and engineers idling on skills that would be perfect for this kind of work. All three of those have shifted, and the institution that would coordinate the work simply doesn’t exist yet.
The field has spent the better part of a decade arguing about reproducibility from circumstantial evidence: survey papers, anecdotes, individual heroic replications. A bounty pilot, regardless of how it lands, would replace some of that argument with measurement. Adjacent fields that have run programs like this tend to come out with much sharper pictures of their own pathologies than they had going in, and ML is in a better position to learn from a pilot than those fields were, mostly because the artifacts are easier to check and the tooling has gotten meaningfully better.
Someone should run this. It probably doesn’t have to be me, and it probably shouldn’t live inside an existing institution either: a standing program is much easier to set up as a small independent organization than as a project inside a university, for basically the same reasons replication doesn’t fit inside normal grant cycles. What it needs is some initial funding, a small group of senior researchers willing to adjudicate, and the patience to let the data tell us whether to keep going.