Biology is a graph.

Why every pharma needs a knowledge graph as foundational infrastructure.

Nov 03, 2025

Biology Doesn’t Fit in Tables

If you’ve worked in drug discovery, or in the biological sciences generally, you’ve encountered the frustration of spending a ridiculous amount of time pulling data together to answer seemingly straightforward questions like “What are the druggable targets in pathway X that are overexpressed in tumor type Y?” Answering it usually means:

Opening six different databases
Downloading CSV files from PubMed, TCGA, ChEMBL, and your internal systems
Writing brittle Python scripts to join these datasets
Discovering that gene names don’t match across sources (HUGO vs. Entrez vs. Ensembl)
Spending three days on data wrangling before you can even start analysis
Producing an answer that’s already stale because you can’t easily update it

There are those who believe that agentic systems or “AI scientists” will eliminate these problems. They won’t (which is a topic for another blog post, maybe).

The reality is that this isn’t a tooling problem. It’s a representation problem. We’ve been trying to force inherently graph-structured data into tabular formats for decades, and it’s killing productivity and forcing us to build things that wouldn’t need to exist if the data were modeled properly.

The argument of this post is simple: Biology is a graph. Any other representation is a lossy approximation that handicaps our ability to do science at scale.

Why Biology is Fundamentally Graph Structured

Biological Meaning Exists in Relationships

An isolated biological entity has almost no meaning. The information content is encoded in how the parts interact, regulate, and influence each other. Consider TP53, perhaps the most studied gene in all of cancer biology. In any given database you’ll see an entry that looks like this:

Does this tell you anything useful about p53? Does it explain why it’s the guardian of the genome? Does it hint at why thousands of papers have been published on it?

No. This representation strips away everything that makes p53 biologically meaningful.

Biological meaning is an emergent property of the network.

As a disease driver, p53 is mutated in more than half of all human cancers. But the mutations are not uniform, and the mutational spectrum differs dramatically by cancer type. For example, in lung adenocarcinoma, mutations are mostly missense and cluster within the DNA-binding domain (DBD), creating stable yet dysfunctional proteins. In ovarian cancers, p53 mutations are a mixture of missense and truncating variants that eliminate protein function altogether. These differences have important biological and therapeutic implications. A single mutational frequency column cannot capture this type of nuance.

As a transcription regulator, p53 controls expression of hundreds of downstream genes. When DNA damage is detected, p53 upregulates CDKN1A (p21) to arrest the cell cycle, giving time for repair. It activates BBC3 (PUMA) and BAX to trigger apoptosis if damage is irreparable. It regulates RRM2B and SCO2 to modulate metabolism. It induces SESN1 and SESN2 to manage oxidative stress. The circuitry that p53 regulates is not a single output. It’s a context-dependent network of regulatory relationships that changes based on cell type, stress type, and cellular state.

As a protein, p53 exists in a sprawling web of physical interactions that determine its activity. In unstressed cells, MDM2 targets it for degradation to keep levels low. MDM4 reinforces this repression. Acetylation by p300/CBP enhances transcriptional activity by stabilizing it. Deacetylation by SIRT1 reverses activation. The point is that it’s not a simple on/off switch. These are complex, dynamically regulated networks of protein-protein interactions, post-translational modifications, and feedback loops that are, again, context-dependent.

As a therapeutic target, TP53 defines treatment strategies through multiple relationship types. In tumors with wild-type p53, MDM2 inhibitors can reactivate the p53 pathway. In tumors with mutant p53, the mutation creates new vulnerabilities: loss of TP53 function is synthetically lethal with WEE1 inhibitors, ATR inhibitors, and PARP inhibitors in certain genetic contexts. Mutant p53 can also gain oncogenic functions through interactions with other transcription factors. Understanding how to treat a TP53-altered cancer requires traversing from the mutation through the pathway network to potential therapeutic interventions.
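To make the contrast with a flat database entry concrete, here is a minimal sketch of the relationships described above written down as typed edges. The `Edge` type, the relation names, and the free-text `context` field are illustrative assumptions, not a standard schema; the edges themselves are transcribed from the paragraphs above.

```python
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    source: str
    relation: str                  # illustrative relation name, not a standard vocabulary
    target: str
    context: Optional[str] = None  # condition mentioned in the text, if any


# Relationships transcribed from the prose above.
P53_EDGES = [
    Edge("TP53", "upregulates", "CDKN1A", "DNA damage"),
    Edge("TP53", "activates", "BBC3", "irreparable damage"),
    Edge("TP53", "activates", "BAX", "irreparable damage"),
    Edge("TP53", "regulates", "RRM2B", "metabolism"),
    Edge("TP53", "regulates", "SCO2", "metabolism"),
    Edge("TP53", "induces", "SESN1", "oxidative stress"),
    Edge("TP53", "induces", "SESN2", "oxidative stress"),
    Edge("MDM2", "targets_for_degradation", "TP53", "unstressed cells"),
    Edge("MDM4", "represses", "TP53", "unstressed cells"),
    Edge("p300/CBP", "acetylates", "TP53"),
    Edge("SIRT1", "deacetylates", "TP53"),
]


def outgoing(node, edges=P53_EDGES):
    """Edges where `node` acts on something else."""
    return [e for e in edges if e.source == node]


def incoming(node, edges=P53_EDGES):
    """Edges where something else acts on `node`."""
    return [e for e in edges if e.target == node]


for edge in incoming("TP53") + outgoing("TP53"):
    print(f"{edge.source} -[{edge.relation}]-> {edge.target} ({edge.context})")
```

Even this toy version says more about p53 than a single row of attributes: what it acts on, what acts on it, and under which conditions.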

Table joins become nuclear-level explosions.

Try to place these relationships in tables in a relational database and the schema might look as follows:

genes table with basic gene info
gene_disease table linking disease associations
gene_expression table with transcriptional expression data across tissues, cell types, and clinical populations
protein_interactions table with protein-protein interactions (PPI)
regulatory_relationships table with transcription factor targets
pathways table with pathway membership
mutations table with variant data
drug_targets table with therapeutic relationships

This seems reasonable until you try to answer actual biological questions.

“What are the druggable proteins that interact with genes in the p53 pathway that are overexpressed in lung adenocarcinoma?”

In SQL, this requires:

Query the pathways table to get genes in the p53 pathway
Join to the gene_expression table filtered for lung adenocarcinoma
Join to the protein_interactions table to get interacting proteins
Join to the drug_targets table to filter for druggable proteins
Potentially join back to the genes table to get protein-coding gene information
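The steps above can be sketched against an in-memory SQLite database. Everything here is an assumption made for illustration: the column names, the placeholder gene IDs (`GENE_A` and so on), the invented fold-change values, and the arbitrary cutoff of 1.0. It is a toy, not a real schema or real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (gene_id TEXT PRIMARY KEY, biotype TEXT);
CREATE TABLE pathways (pathway TEXT, gene_id TEXT);
CREATE TABLE gene_expression (gene_id TEXT, tissue TEXT, log2_fold_change REAL);
CREATE TABLE protein_interactions (gene_a TEXT, gene_b TEXT);
CREATE TABLE drug_targets (gene_id TEXT, druggable INTEGER);

-- Placeholder genes and invented values, for illustration only.
INSERT INTO genes VALUES
  ('GENE_A', 'protein_coding'), ('GENE_B', 'protein_coding'),
  ('GENE_C', 'protein_coding'), ('GENE_D', 'protein_coding'),
  ('GENE_E', 'protein_coding');
INSERT INTO pathways VALUES ('p53', 'GENE_A'), ('p53', 'GENE_B'), ('other', 'GENE_E');
INSERT INTO gene_expression VALUES
  ('GENE_A', 'lung adenocarcinoma', 2.5),
  ('GENE_B', 'lung adenocarcinoma', 0.1),
  ('GENE_E', 'lung adenocarcinoma', 3.0);
INSERT INTO protein_interactions VALUES
  ('GENE_A', 'GENE_C'), ('GENE_B', 'GENE_D'), ('GENE_E', 'GENE_C');
INSERT INTO drug_targets VALUES ('GENE_C', 1), ('GENE_D', 1);
""")

# Interactions are stored one-directionally here to keep the query short;
# a real PPI table is undirected and would need a UNION or a second join.
QUERY = """
SELECT DISTINCT g.gene_id
FROM pathways AS p
JOIN gene_expression AS e      ON e.gene_id = p.gene_id
JOIN protein_interactions AS i ON i.gene_a = p.gene_id
JOIN drug_targets AS d         ON d.gene_id = i.gene_b
JOIN genes AS g                ON g.gene_id = d.gene_id
WHERE p.pathway = 'p53'
  AND e.tissue = 'lung adenocarcinoma'
  AND e.log2_fold_change >= ?
  AND d.druggable = 1
  AND g.biotype = 'protein_coding'
ORDER BY g.gene_id
"""

OVEREXPRESSION_CUTOFF = 1.0  # arbitrary threshold chosen for the example
result = conn.execute(QUERY, (OVEREXPRESSION_CUTOFF,)).fetchall()
print(result)
```

Note how much of the biology ends up hard-coded in the `WHERE` clause: the tissue label, the cutoff, and the single boolean that stands in for "druggable."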

That’s 4-5 joins across different tables. Now add in the complications:

Gene identifiers don’t match across tables (HUGO symbols vs. Entrez IDs vs. Ensembl IDs)
You need to filter by expression thresholds, but what threshold?
“Overexpressed” means overexpressed relative to normal tissue. That’s another join, and another threshold to choose.
“Druggable” has multiple definitions (ligandable pocket, known ligands, approved drugs), which means more branching logic
Pathway membership is ambiguous (is indirect regulation included? what about post-translational regulation?)

Each of these complications adds more joins, more logic, more query complexity. Change one criterion and you need to rewrite significant portions.
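Take just the first complication, identifier mismatch. Before any join can happen, every table has to be normalized to one naming scheme. Below is a minimal sketch of that step; the `XREFS` table and the `normalize` helper are hypothetical, and the IDs shown for TP53 and MDM2 should be checked against current HGNC, NCBI Gene, and Ensembl releases before being relied on.

```python
# Hypothetical cross-reference table covering two genes named in this post.
XREFS = [
    {"symbol": "TP53", "entrez": "7157", "ensembl": "ENSG00000141510"},
    {"symbol": "MDM2", "entrez": "4193", "ensembl": "ENSG00000135679"},
]


def build_lookup(xrefs):
    """Map every known identifier (any scheme, upper-cased) to its gene symbol."""
    lookup = {}
    for row in xrefs:
        for value in row.values():
            lookup[value.upper()] = row["symbol"]
    return lookup


LOOKUP = build_lookup(XREFS)


def normalize(identifier, lookup=LOOKUP):
    """Return the gene symbol for an identifier, or None if it is unknown."""
    return lookup.get(identifier.strip().upper())


# The same gene arriving from three sources under three names.
for raw in ["TP53", "7157", "ensg00000141510", "not_a_gene"]:
    print(raw, "->", normalize(raw))
```

In a relational setup this mapping becomes yet another table to join through, and every unmapped identifier silently drops rows from the final answer.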

Context-Dependent Relationships Cannot Be Flattened

This is the most fundamental problem with tabular representations. Biological relationships are entirely context-dependent, and context doesn’t fit neatly in columns.

Suppose you want to model “Gene X regulates Gene Y.” This isn’t a binary fact; it depends on:

Cell type
Developmental stage
Disease state
Environmental conditions
Genetic background
Measurement
Experimental conditions

In a table, you have three options, and all of them are bad.

Ignore the context. Store just “X regulates Y” and lose accuracy.
Add context columns to the table. But how many columns do you add? How often does this schema change? How do you query across contexts?
Create separate records for each context. Massive duplication.

In a graph, context can be modeled natively, as properties on an edge or as nodes attached to it, and recovered by traversal. It’s also easy to chain multiple patterns together to express increasingly complex relationships as network topologies.
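Here is a minimal sketch of that idea in plain Python, standing in for a real graph database. The genes (`GENE_X`, `GENE_Y`, `GENE_Z`), the effects, and the context values are all hypothetical; the point is only that context lives on the edge as open-ended properties and that the same pattern can be chained across hops.

```python
from dataclasses import dataclass


@dataclass
class Regulation:
    source: str
    target: str
    effect: str
    context: dict  # open-ended: new context keys need no schema change


# Hypothetical edges: the same gene pair behaves differently by cell type.
EDGES = [
    Regulation("GENE_X", "GENE_Y", "activates",
               {"cell_type": "hepatocyte", "disease_state": "healthy"}),
    Regulation("GENE_X", "GENE_Y", "represses",
               {"cell_type": "T cell", "disease_state": "healthy"}),
    Regulation("GENE_Y", "GENE_Z", "activates",
               {"cell_type": "hepatocyte", "disease_state": "healthy"}),
]


def regulates(source, **required):
    """Edges out of `source` whose context matches every required key."""
    return [
        e for e in EDGES
        if e.source == source
        and all(e.context.get(k) == v for k, v in required.items())
    ]


def downstream(source, max_hops, **required):
    """Chain the pattern: follow context-valid edges for up to `max_hops` hops."""
    frontier, seen = {source}, set()
    for _ in range(max_hops):
        frontier = {
            e.target for node in frontier for e in regulates(node, **required)
        } - seen
        seen |= frontier
    return seen


print(downstream("GENE_X", 2, cell_type="hepatocyte"))
print(downstream("GENE_X", 2, cell_type="T cell"))
```

Querying across contexts is just a matter of passing fewer constraints, and adding a new kind of context (say, developmental stage) means adding a key to some edges rather than migrating a table.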

Biological Identity Depends on Network Position

What a specific entity “is” depends on where it sits in the graph (or subgraph).

For example, is PKC an oncogene or a tumor suppressor? Well, it depends. Which tissue? Which pathway context? PKC promotes cell survival in some contexts and apoptosis in others. To understand the functional role, you must examine its network position.

Is CDK4/6 inhibition growth-suppressive or growth-promoting? In ER+ breast cancer with intact Rb, it’s suppressive. That’s why palbociclib works. In Rb-null cancers, blocking CDK4/6 can paradoxically enhance growth by removing cell cycle checkpoints. The therapeutic effect depends on the network context.

This is a fundamental principle: biological function is not an intrinsic property of an entity; it’s an emergent property of that entity’s position in a network of relationships.

Tables force you to assign properties to entities: “CDK4.function = cell cycle progression.” Graphs let you represent the truth: “CDK4 function emerges from its relationships with cyclins, CDK inhibitors, Rb, E2F, and downstream targets, which differ by cell type and genetic context.”

You need a graph.

So what? Why should you care?

Productivity cost: Scientists spend 30-40% of their time wrangling data instead of analyzing it. Most of this time is spent joining, aligning, and integrating datasets, i.e., reconstructing the relationship network that should have been preserved in the first place. No, an AI agent doesn’t help you do this either.

Correctness cost: Flattening relationships loses context, which leads to incorrect conclusions. How many drug targets have been pursued based on gene expression data that ignored tissue context? How many pathways are misunderstood because protein isoforms were conflated?

Innovation cost: Complex questions that require multi-hop reasoning are simply not asked because they’re too hard to answer in SQL. Scientists constrain their hypotheses to fit the limitations of their database, not the boundaries of biological possibility.

This is why every major drug discovery organization needs a knowledge graph. Not because graphs are trendy or cool, but because biology is a graph, and representing it otherwise is scientifically incorrect and practically limiting.