Why we built AgentX-Ray, and what testing a model honestly really means.

Introduction

I did not set out to build a benchmark. I built one because I stopped trusting the ones we already had.

Every week a new model shows up claiming it beats everything before it. The scores look incredible. Then you actually use the thing and it falls apart the moment you push on it. The gap between what a model scores and what a model does turned out to be the whole problem.

The problem with scores

Most benchmarks test a model under perfect conditions. Clean questions, no pressure, nobody trying to trip it up. That tells you what a model can do on its best day. It tells you almost nothing about what happens when a real user pushes back, asks the same thing three different ways, or quietly tries to get it to break its own rules.

There is also a quieter problem. Benchmarks leak. The test questions end up in training data, the model memorizes the answers, and suddenly it looks brilliant on a test it has basically already seen. A score like that is not a measure of intelligence. It is a measure of how much of the test the model already had.
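To make the leak concrete, here is a rough sketch of one common way to look for it: check how many word n-grams from a test question also appear verbatim in a sample of training text. This is an illustration only, not how AgentX-Ray works. The function names, the whitespace tokenizer, and the default window of eight words are all assumptions made for the example.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in text (lowercased, whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that appear verbatim in corpus_text.

    Returns 0.0 when the question is shorter than n words, since there is
    nothing to compare.
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    corpus_grams = ngrams(corpus_text, n)
    return len(question_grams & corpus_grams) / len(question_grams)


if __name__ == "__main__":
    # Hypothetical strings, made up for the example.
    question = "the quick brown fox jumps over the lazy dog"
    scraped_page = "someone posted: the quick brown fox jumps over the lazy dog"
    print(overlap_fraction(question, scraped_page, n=4))
```

A high fraction suggests the question may have been seen in training. A low one proves little, because a reworded copy of the question slips straight past an exact-match check like this.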

So the number on the leaderboard and the trust you can actually place in the model are two different things. We wanted to close that gap.

What we built

AgentX-Ray is an adversarial proving ground. Instead of asking a model easy questions, it tries to break it. It probes for the things that actually matter when you put a model in front of real people. Does it hold its ground under pressure, or does it cave when a user keeps pushing? Does it make things up and say them confidently, or does it know when it does not know? Does it stay consistent when you reword the same question?
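The last of those questions, consistency under rewording, is the easiest to picture in code. The sketch below is a simplified illustration and not AgentX-Ray's actual scoring. It assumes a hypothetical `ask` function that sends a prompt to your model and returns its answer as text, and it compares answers by exact match after light normalization, which is much cruder than a real evaluation would need.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_score(ask: Callable[[str], str], paraphrases: Sequence[str]) -> float:
    """Share of paraphrases whose answer matches the most common answer.

    `ask` is a stand-in for whatever calls your model (an assumption here).
    Answers are lowercased and whitespace-collapsed before comparison.
    """
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    answers = [" ".join(ask(p).lower().split()) for p in paraphrases]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


if __name__ == "__main__":
    # A fake model for demonstration: it changes its answer on one rewording.
    canned = {
        "What is the capital of France?": "Paris",
        "France's capital city is which?": "paris",
        "Name the French capital.": "Lyon",
    }
    score = consistency_score(lambda prompt: canned[prompt], list(canned))
    print(f"consistency: {score:.2f}")
```

A score of 1.0 means every rewording got the same answer. Anything lower means the model's answer depends on how the question was phrased.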

The questions are built so a model cannot study for the test. You bring your own model and your own key. Your model stays yours. The score comes from how it behaves, not from what it claims.

It is also not only for large language models. One of the first models we tested was a tiny on-device search model, seven megabytes, running fully in a browser. It scored far better than its size suggested. That is the kind of thing you only learn when you actually test, instead of guessing from the spec sheet.

What happened next

The honest part. For a while the dashboard read zero. We had built the whole thing and not one stranger had run it yet. That is the hardest stretch for anyone building something, and pretending otherwise helps no one.

Then a real builder showed up, ran their model, and got a result they were proud of. They gave feedback, we shipped it the same day, and they started telling people. One real user who actually cares is worth more than a thousand empty signups. That is when it stopped being an idea and started being a thing people use.

Where it goes

The belief underneath all of this is simple. Nobody should have to take a model on faith. Not the maker, not the buyer, not the curious person who just wants to know if the hype is real. Maybe a model really is the best. Maybe it is not. You do not know until it is tested by someone who is not selling it.

That is what we are building. A neutral place where any model, from a solo founder or a frontier lab, can be put under real pressure and come out with a number that actually means something.

If you build models, come break ours, and let us break yours.
