โ† All Articles

What I've Learned Evaluating AI Every Day

I evaluate AI outputs from frontier models daily. Here's what actually works, what doesn't, and what the hype misses.

Part of my work involves evaluating outputs from frontier AI models. Claude, GPT, Gemini. I review responses, assess quality, identify problems, and flag concerning behavior.

It's given me a perspective that's different from both the AI hype and the AI doom narratives. I see these systems fail every day. I also see them do genuinely impressive things. The truth is more nuanced than either extreme.

Here's what I've learned from actually working with these systems, not just reading about them.

What These Systems Actually Do Well

They're genuinely good at synthesis

Give an AI a bunch of information and ask it to summarize, extract patterns, or reorganize. this is where they shine. They can process more context than humans and find connections across that context. Not because they "understand" it in the human sense, but because pattern-matching at scale is what they're built to do.

They're useful thinking partners

Not because they have good ideas. they don't really have ideas. But they're good at reflecting your ideas back in different forms. They can expand on a concept, suggest alternatives, point out things you might have missed. It's like having a conversation with a well-read generalist who never gets tired.

They can produce decent first drafts

The key word is "first." AI-generated content usually needs editing, but starting from a draft is often faster than starting from blank. The trick is knowing when AI drafts are good enough to refine versus when they're leading you in the wrong direction.

They handle repetitive cognitive work

Formatting, transforming data between structures, generating variations on a theme. tasks that are tedious for humans but require some language understanding. This is probably the most reliable use case I've seen.

Where They Consistently Fall Apart

They fabricate with confidence

This is the one that still surprises me, even though I see it daily. These models will present completely made-up information with the same confidence as accurate information. They'll cite papers that don't exist, quote statistics they invented, describe events that never happened. And they do it fluently, which makes it harder to catch.

When I evaluate outputs, catching fabrications is a significant part of the job. They're not rare edge cases. they're common enough that you can't trust AI outputs without verification.

They miss context that seems obvious

An AI can process more text than any human, but it doesn't understand context the way humans do. It misses implications. It takes things literally that were meant figuratively. It responds to what was written without grasping what was meant.

I've seen AI give technically correct answers to the wrong question because it interpreted the prompt differently than the user intended. The words matched, but the meaning didn't.

They're inconsistent in ways that matter

The same prompt can get different responses. Not just slightly different. sometimes contradictory. This makes them unreliable for anything that requires consistency. Good enough for brainstorming, problematic for decision-making.

They can't reliably follow complex instructions

Simple instructions work well. Complex, multi-step instructions break down. The more constraints you add, the more likely the AI is to violate some of them. This is a problem for anyone trying to build reliable systems on top of these models.

What the Hype Gets Wrong

"AI will replace [job X]"

Maybe eventually. But what I see in current systems is tools that can handle pieces of jobs, not entire jobs. They're better at augmenting human work than replacing it. The people I've seen use AI effectively treat it as a capable assistant, not as a replacement for thinking.

"AI understands [thing]"

It doesn't understand anything. It predicts what text should come next based on patterns. When that prediction aligns with useful output, it looks like understanding. When it doesn't, the illusion breaks. This isn't a criticism. it's just an accurate description of what these systems are.

"AI is almost [AGI/sentient/human-level]"

Having evaluated thousands of AI outputs, I can tell you: we're not close. These systems are impressive language processors, but they're not reasoning in any meaningful sense. They're very good at appearing to reason because they've been trained on text produced by reasoning humans.

What the Doom Gets Wrong

"AI is too dangerous to deploy"

There are legitimate risks, but the current systems are tools. Powerful tools can be dangerous when misused. That's true of lots of technologies. The answer is thoughtful deployment, not prohibition.

"We can't control AI"

Current AI systems are very controllable. They do what they're configured to do, within limits. The challenge isn't that we can't control them. it's that controlling them requires understanding their failure modes, and most teams deploying AI haven't done that work.

"AI will [catastrophic outcome]"

The more I work with these systems, the less impressed I am by catastrophic scenarios. Not because bad outcomes aren't possible, but because the more mundane failures are the ones I actually see. AI systems don't need to go rogue to cause problems. they just need to be deployed without adequate testing.

What I Actually Worry About

Having spent time in the weeds with these systems, here's what concerns me:

Overreliance on AI outputs

People trusting AI without verification. Using AI-generated content without review. Making decisions based on AI recommendations without understanding the limitations. The technology isn't the problem. the misplaced trust is.

Speed over safety

Companies are racing to deploy AI features. The market rewards shipping fast. But AI deployments need more testing than traditional software, not less. The incentives are misaligned.

Security as an afterthought

Prompt injection, data leakage, model manipulation. these vulnerabilities are real and under-addressed. Most AI deployments I read about don't seem to be thinking about security at all.

Capability outpacing understanding

We're deploying systems we don't fully understand. That's not unusual for technology, but the gap between capability and understanding feels particularly large with AI. We know what works, but we don't always know why it works or when it will stop working.

How I Think About It Now

After a year of intensive AI study and daily hands-on work, my view is pretty simple:

These are useful tools with real limitations. The limitations don't make them useless. The capabilities don't make them magic. They're good at some things, bad at others, and understanding the difference is the whole game.

Most of what goes wrong with AI isn't about AI being dangerous or uncontrollable. It's about people deploying AI without understanding what they're deploying. The technology isn't the hard part. The hard part is doing the work to use it responsibly.

That work isn't exciting. It's not in the headlines. But it's what separates AI that works from AI that becomes a cautionary tale.