Encoding Scientific Judgment into AI Systems
Pilots pass every benchmark yet rarely change a decision that matters. The durable opportunity is to encode scientific judgment into a shared reasoning substrate.
Moving Goalposts
A major enterprise pharma decides to cull 90% of AI pilots because of the 900 pilots running, only 10% were delivering real enterprise value.
This should be concerning for AI companies building in this space. How can it be that AI systems massively outperform on benchmarks and hit the technical metrics but still fail to translate that to enterprise value? AI pilots often succeed technically but fail to influence decisions, reduce cycle times, or justify continued investment.
Part of the problem is that the early wins in AI use cases have been entirely absorbed into the frontier models themselves. Document generation, literature summarization, data exploration, and other forms of procedural assistance are rapidly becoming baseline capabilities rather than differentiated products. Raising the floor of capabilities means AI companies have to tackle harder and more complex drug discovery problems. The further you are from program decisions, the less valuable the systems become, no matter how strong they perform on benchmarks.
The goalposts have shifted.
It’s not about giving 100 scientists a co-pilot that makes them 30% more productive anymore. The bigger opportunity is to make the judgment of an R&D org’s best scientists available across hundreds of decisions simultaneously.
The first pharma company to be able to encode scientific reasoning and judgment into substrates that humans and AI agents can understand, augment, and iterate on will unlock a fundamentally different model of AI-driven drug discovery.
But getting there requires understanding why the current agent architecture begins to break as its capabilities expand.
TL;DR
There must be an operating layer between an organization’s knowledge, the agents acting on it, and the decisions those agents are expected to support.
How do you teach AI to think like your scientists?
Drugs are the product of differentiated data and knowledge. Together, it becomes the most treasured and closely guarded resource in any serious pharmaceutical organization. They will not freely share the failed experiments, internal disagreements, proprietary datasets, decision histories, and tacit standards of evidence that distinguish how they discover drugs. Nor will they allow external vendors to train generalized models on the corpus from which their competitive advantage is derived.
This makes drug discovery a uniquely unforgiving environment for AI agents/co-scientists.
The incentives are structurally misaligned. AI companies are largely limited to public-domain training data, yet the public literature is a poor representation of how real drug-discovery decisions are made. It over-represents positive findings, omits much of the negative evidence, and strips away the organizational context, debate, and accumulated experience that determine whether a program advances or dies.
Most importantly, scientific judgment has been encoded into people, never in systems.
Capability and confusion scale together.
To meaningfully participate in real-world drug discovery work, AI agents must reach beyond what models absorbed in pre-training and handle modalities beyond text. Drug discovery is inherently multimodal, requiring diverse data types, transformations, and post-processing analytics.
Frontier models are already highly capable and they will only continue to get better. Modern agentic systems are designed to give them the tools to gather context and draw conclusions. For example, much of bioinformatics procedural work can be automated by workflow and analysis agents with access to computing resources.
However, the complexity of multimodal reasoning in life sciences means these aren’t one-off calls. A single question can send an agent through hundreds of files; each pass between the agent, the data, and the task consumes more tokens in the context loop. In long-running autonomous research, the promise is that the system continuously gathers context, reduces uncertainty, and converges on the best answer.
Here’s where problems start. Biology is messy. Contradiction is the normal state of evidence and real insight more often comes from knowing which data to ignore and which data to pay attention to. Rarely does it come from careful rereading of the same data to uncover some universal truth.
The more powerful the system, the more it gathers; the more it gathers, the more tokens it burns and the more context entropy it generates. Latency and costs scale non-linearly, as each decision requires more context, more reasoning, and more careful selection between actions. Left unchecked, agentic systems can become a runaway cost center, earnestly burning tokens on the way to a worse answer.
The obvious response is to narrow the aperture.



