The DNAI Research Toolkit brings our complete cancer digital twin pipeline to your institution. No patient data leaves your network. Full multi-omics analysis, locally.
Choose the interface that fits your team. All three run entirely on your infrastructure.
Chat naturally with your data. The AI agent calls DNAI tools automatically, interprets results, generates visualizations, and answers follow-up questions.
Batch processing for bioinformatics teams. Analyze patients, run cohort discovery, validate predictions, and export structured results.
Integrate DNAI into your existing pipelines. Full programmatic access to every analysis component with structured return types.
The entire pipeline — models, inference, analysis — runs on your hardware. No cloud calls, no data uploads, no API keys for inference.
All model checkpoints ship with the toolkit. Inference runs on CPU, CUDA, or Apple MPS. No internet required.
Docker, Singularity/Apptainer for HPC, or native Python install. Works on Linux, macOS, and Windows.
Validate DNAI predictions against your outcomes. Share only C-indices and statistical summaries back to us.
Eight analysis modules powered by the same models behind the DNAI platform.
Converts each patient's multi-omics profile into a compact, structured 328-dimensional representation capturing proliferation, 50 biological pathways, immune context, and epigenetics in a single vector.
Predicts relative survival risk calibrated against 9,415 TCGA patients across 33 cancer types. Includes a site-robust checkpoint specifically designed for external institutional data.
Ranks driver genes per patient by combining somatic mutation evidence with pathway context and protein-level structure. Covers 2,932 genes across the cancer genome.
Identifies druggable vulnerabilities from loss-of-function mutations. Combines 28 curated gene-drug pairs with a trained ML classifier (ρ=0.776 on held-out cell lines) that predicts novel context-dependent vulnerabilities using pathway state, DepMap essentiality (18,435 genes), and drug embeddings.
Ranks somatic variants by immunotherapy potential, scoring tumor microenvironment permissiveness, clonal prevalence, expression level, and variant type to flag candidates for checkpoint inhibitors or vaccine pipelines.
Traces causal chains from mutations through signaling pathways to druggable targets using the SIGNOR knowledge graph. Fully deterministic and auditable — every recommendation traces back to specific genes.
Reconstructs the tumor’s clonal architecture from variant allele frequencies. Maps subpopulations to a fixed 4-slot ODE model with a Resistance Sentinel that preserves minority drug-resistant clones. Validated on 228K GENIE patients with per-clone drug sensitivity annotations.
Scores every driver mutation across 6 dimensions: genetic evidence, essentiality (DepMap), druggability, clinical precedent, synthetic lethality, and pathway centrality (SIGNOR). Ranks actionable targets with evidence tiers and resistance route forecasting. 71.8% of patients have at least one actionable target.
Predicts synergistic drug combinations from monotherapy data alone — zero-shot, without training on any combination data. Finds drug pairs that orthogonally target different clonal subpopulations: drug A suppresses dominant clones while drug B targets the Resistance Sentinel. Validated at ρ=0.800 on 1,209 drug pairs.
Optimizes drug dosing schedules by simulating tumor dynamics under treatment pressure with pharmacokinetic constraints (half-life, toxicity budgets, maximum cumulative dose). Finds when to switch drugs based on clonal evolution. Achieves 42% dose reduction vs standard concurrent dosing while preventing resistance.
Creates digital twins from targeted sequencing panels (MSK-IMPACT, FoundationOne, etc.) without requiring RNA-seq. A trained panel adapter maps mutation and CNA data from 167 known panels to the full 328-dimensional latent space, enabling analysis for the millions of patients with panel data only.
Automatically scans your entire cohort for hidden patterns: molecular subgroups, synergistic pathway interactions, resistance signatures, outlier patients, and mutation-pathway links — with built-in replication testing.
Auto-detection handles format conversion, gene symbol harmonization, and normalization. No manual preprocessing required.
| Modality | Formats | Notes |
|---|---|---|
| RNA-seq | CSV, TSV, GCT, H5AD, NPZ | Auto-normalizes counts/TPM/FPKM |
| Mutations | MAF, VCF, CSV | Gene + variant type + VAF |
| Copy Number | GISTIC, segment, gene-level | Any numeric matrix |
| Methylation | 450K / EPIC beta-values | Clipped to [0, 1] |
| Histology (WSI) | Pre-extracted NPZ / PT | UNI2-h 1536d embeddings |
| Clinical | CSV / TSV | Flexible column mapping |
RNA-seq expression is the primary modality. All others are optional and enhance the analysis when available. Missing modalities are handled gracefully via a Product-of-Experts architecture.
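To illustrate why a Product-of-Experts combination tolerates missing modalities, here is a minimal sketch. It assumes each modality encoder emits a Gaussian (mean and variance) per latent dimension and that a unit-variance prior is always present. The function name `fuse_experts` and the example numbers are hypothetical and are not part of the toolkit's API.

```python
from typing import Optional, Sequence, Tuple

Expert = Tuple[float, float]  # (mean, variance) for one latent dimension


def fuse_experts(experts: Sequence[Optional[Expert]],
                 prior: Expert = (0.0, 1.0)) -> Expert:
    """Precision-weighted product of Gaussian experts.

    A missing modality is passed as None and simply contributes nothing,
    so the same code path works for any subset of modalities.
    """
    prior_mean, prior_var = prior
    precision = 1.0 / prior_var            # start from the prior expert
    weighted_mean = prior_mean / prior_var
    for expert in experts:
        if expert is None:                 # modality not available
            continue
        mean, var = expert
        precision += 1.0 / var
        weighted_mean += mean / var
    fused_var = 1.0 / precision
    return weighted_mean * fused_var, fused_var


# Hypothetical per-modality estimates for a single latent dimension.
rna = (2.0, 0.5)
methylation = (1.0, 1.0)

full = fuse_experts([rna, methylation])   # both modalities present
rna_only = fuse_experts([rna, None])      # methylation missing
```

With fewer experts the fused variance grows, which is the sense in which optional modalities sharpen the estimate when they are available.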
These are findings that standard bioinformatics pipelines cannot produce. They require the multi-omics, pathway-structured latent representation.
The toolkit clusters patients in 328-dimensional latent space that integrates RNA, mutations, CNV, and methylation into pathway-level biology. Unlike PCA on raw expression, this captures cross-modal interactions invisible to single-modality analysis.
For every loss-of-function mutation, the toolkit combines 28 curated SL pairs with a trained ML classifier that predicts novel context-dependent vulnerabilities from DepMap essentiality (18,435 genes) and pathway state. Most clinicians check BRCA/PARP but miss TP53/WEE1, ARID1A/EZH2, ATM/ATR, and context-dependent vulnerabilities the classifier discovers.
Tests all 1,225 pairwise interactions across 50 Hallmark pathways for synergistic effects on survival. A tumor with moderate MYC AND moderate EMT may be far more aggressive than high MYC alone — the interaction term carries the signal.
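The figure of 1,225 is the number of unordered pairs among 50 pathways, C(50, 2). The sketch below enumerates those pairs and builds a product term for each pair of scores, which is the usual form of an interaction term in a regression model. Whether the toolkit uses exactly this product form is an assumption. The pathway names and `interaction_terms` are placeholders.

```python
from itertools import combinations
from math import comb
from typing import Dict, Tuple

# Placeholder names; the real Hallmark gene-set names are not reproduced here.
pathways = [f"PATHWAY_{i:02d}" for i in range(1, 51)]
pathway_pairs = list(combinations(pathways, 2))
n_pairs = len(pathway_pairs)  # equals comb(50, 2)


def interaction_terms(scores: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """Product term for every unordered pair of pathway scores."""
    return {(a, b): scores[a] * scores[b]
            for a, b in combinations(sorted(scores), 2)}


# Hypothetical pathway scores for one tumor.
terms = interaction_terms({"MYC": 0.5, "EMT": 0.4, "HYPOXIA": 0.1})
```

Each product term would then enter a survival model alongside the two main effects, so that a significant coefficient on the product reflects signal beyond either pathway alone.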
For each recurrent mutation, tests whether mutated patients have significantly different pathway activations vs wild-type — discovered from your data, not from databases. KRAS may activate different downstream pathways in LUAD vs PAAD.
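One simple way to run such a mutated-versus-wild-type comparison is a permutation test on the difference in mean pathway activation. The sketch below uses that approach purely as an illustration. The source does not state which statistical test the toolkit applies, and `permutation_p_value` and the activation values are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(mutated: Sequence[float],
                        wild_type: Sequence[float],
                        n_permutations: int = 2000,
                        seed: int = 0) -> float:
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(mutated) - mean(wild_type))
    pooled = list(mutated) + list(wild_type)
    n_mut = len(mutated)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_mut]) - mean(pooled[n_mut:]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical activation of one pathway in mutated vs wild-type patients.
mutated_scores = [2.1, 2.4, 2.2, 2.6, 2.3]
wild_type_scores = [0.9, 1.1, 1.0, 1.2, 0.8, 1.05]
p_value = permutation_p_value(mutated_scores, wild_type_scores)
```

When every recurrent mutation is tested against every pathway, the resulting p-values need a multiple-testing correction before any single link is treated as a finding.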
Compares 328d digital twins of responders vs non-responders to identify distinguishing dimensions, then traces them back to specific pathways, proliferation dynamics, and immune context.
Computes multi-dimensional distance from cohort centroid in latent space. These are your most interesting patients: potential novel subtypes, misdiagnosed samples, rare driver combinations, or basket trial candidates.
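A minimal sketch of centroid-based outlier ranking follows. It assumes plain Euclidean distance and uses three-dimensional toy vectors in place of the 328-dimensional twins. The distance metric the toolkit actually uses is not stated in the source, and both function names are hypothetical.

```python
import math
from typing import List, Sequence


def centroid_distances(latents: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance of every patient vector from the cohort centroid."""
    n = len(latents)
    dims = len(latents[0])
    centroid = [sum(row[d] for row in latents) / n for d in range(dims)]
    return [math.dist(row, centroid) for row in latents]


def rank_outliers(latents: Sequence[Sequence[float]], top_k: int = 5) -> List[int]:
    """Indices of the top_k patients farthest from the centroid."""
    distances = centroid_distances(latents)
    order = sorted(range(len(distances)),
                   key=lambda i: distances[i], reverse=True)
    return order[:top_k]


# Toy cohort: three similar patients and one far from the rest.
cohort = [
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0],
]
most_unusual = rank_outliers(cohort, top_k=1)
```

A single extreme patient pulls the centroid toward itself, so in practice a robust center or a covariance-aware distance may be preferable.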
The combination discovery engine predicts synergistic pairs from monotherapy data alone by simulating which drugs target different clonal subpopulations. It finds combinations where drug A suppresses dominant clones and drug B targets resistant minorities — validated at ρ=0.800 on 1,209 drug pairs, including leave-target-family-out proof of genuine discovery.
The schedule optimizer simulates tumor evolution under different dosing strategies and finds sequences that achieve the same tumor control with less total drug. In testing, optimized schedules reduce cumulative drug exposure by 42% vs concurrent dosing while still preventing resistance escape.
With the AI agent interface, researchers can ask any question about their data and the agent figures out how to answer it — “compare pathway profiles of my youngest vs oldest quartile”, “show me patients whose clonal architecture predicts resistance”, or “which patients are most similar to TCGA luminal B?” No pre-built dashboard needed.
HTML reports with digital twin summary, risk score, driver genes, SL opportunities, immunogenic variants, and clonal architecture.
Risk distribution, top drivers across the cohort, quality summary, and stratification analysis with statistical tests.
Prioritized non-obvious findings: latent subgroups, pathway interactions, resistance patterns, outliers, and mechanistic hypotheses.
CSV files (risk scores, drivers, full 328d latent space) and JSON results for integration with your existing pipelines.
C-index with bootstrap confidence intervals against your known outcomes, per-cancer-type breakdown, and diagnostic notes.
Built-in CLAUDE.md that teaches the AI agent about DNAI, so it can explain every result in the context of your specific data.
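For readers unfamiliar with the validation metric listed above, the sketch below computes Harrell's concordance index with a percentile bootstrap confidence interval. It is a simplified textbook version that ignores pairs with tied survival times, not the toolkit's implementation. The function names and the toy data are hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> Optional[float]:
    """Fraction of comparable pairs in which the higher-risk patient
    has the shorter observed survival. Returns None if no pair is comparable."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # the earlier patient of a pair must have an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable if comparable else None


def bootstrap_ci(times: Sequence[float],
                 events: Sequence[int],
                 risks: Sequence[float],
                 n_boot: int = 500,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the C-index."""
    rng = random.Random(seed)
    n = len(times)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients
        c = concordance_index([times[k] for k in idx],
                              [events[k] for k in idx],
                              [risks[k] for k in idx])
        if c is not None:
            stats.append(c)
    if not stats:
        raise ValueError("no resample contained a comparable pair")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Toy cohort: follow-up time, event indicator (1 = death observed), risk score.
times = [2.0, 4.0, 6.0, 8.0, 10.0]
events = [1, 1, 0, 1, 1]
risks = [0.9, 0.7, 0.6, 0.4, 0.1]

c_index = concordance_index(times, events, risks)
ci_low, ci_high = bootstrap_ci(times, events, risks)
```

Because the C-index depends only on the ordering of risk scores, it measures how well patients are ranked and says nothing about predicted survival times.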
The DNAI Research Toolkit is available for non-commercial research use under a collaboration agreement.
33 TCGA cancer types in the training set. Cancer types not listed below can still be analyzed using the UNKNOWN_EXTERNAL generalization token, with reduced but non-zero signal.
We believe transparency about limitations is essential. These are the known constraints of the current toolkit.
Survival predictions rank patients; they do not predict exact survival times
C-index 0.704 (internal), 0.633 (external multi-site). Best suited for cohort stratification and identifying high-risk vs low-risk groups, not individual prognosis.
Synthetic lethality combines 28 curated pairs with a trained ML classifier, but predictions beyond curated pairs are investigational
Curated pairs are from clinical trials and preclinical studies. The v2 ML classifier (ρ=0.776) extends predictions beyond known pairs using pathway context and DepMap, but novel predictions require experimental validation.
Trained primarily on adult cancers (TCGA) — accuracy varies by cancer type
33 adult cancer types are well-represented. Pediatric cancers, rare cancers, and cancer types not in TCGA will have reduced but non-zero predictive signal.
Histology features do not transfer across institutions
Whole-slide image embeddings capture scanner and staining characteristics alongside biology. Histology analysis is available within-institution only; cross-site histology analysis is suppressed by default.
Treatment recommendations are exploratory, not validated for clinical decisions
The treatment module provides hypothesis-generating rankings for research context. It has not undergone prospective clinical validation and should not guide treatment decisions.
Immunogenic variant candidates require downstream wet-lab confirmation
The toolkit scores variants by clonality, immune context, and expression, but HLA binding prediction and peptide-MHC stability testing are needed to confirm immunogenicity.
Clonal deconvolution from bulk sequencing has inherent resolution limits
Detection is reliable for 2–3 major clones. Resolving many small subclones requires high-depth sequencing with well-separated allele frequency distributions.
Research use only. The DNAI Research Toolkit is not a medical device and is not intended for clinical decision-making. All findings should be independently validated before any clinical application.
Fill in the form below. After review and approval, you will receive instructions on how to download and install the package.
Questions? Contact us at info@dnai.bio