Tech specs

A plain-English map of how MediGen retrieves, cites, refuses, and scales strategic synthesis from a focused demo into an enterprise system.

The retrieval problem

MediGen has to answer questions across a corpus that can grow past 50,000 documents without turning into a keyword search box. Lookup means finding the exact paragraph that names a term, date, clause, protocol, or obligation. Synthesis means connecting several retrieved paragraphs and explaining what they collectively say. At this scale, the hard part is deciding which passages deserve to shape the answer before the model writes a sentence.

Hybrid retrieval

The demo fuses two retrievers. BM25 (term overlap) rewards passages that share important query words. The dense proxy (char-n-gram TF-IDF) catches near matches, abbreviations, and wording differences without a hosted embedding service. Reciprocal Rank Fusion (RRF, k=60) combines both ranked lists so one weak method does not dominate the result.

Production retrieval stack: BM25 (term overlap) + Voyage AI embeddings (dense semantic) + Reciprocal Rank Fusion (RRF, k=60) + Cohere Rerank 3.5 (top 5).

The ACL pre-filter is applied at retrieval, never post-retrieval. Post-filtering is the most dangerous anti-pattern in tenant-aware RAG: aggregate signals (BM25 IDF, embedding spread) are then computed over documents the user is not allowed to see, so scores and rankings can leak information about other tenants' documents.
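The Reciprocal Rank Fusion step can be sketched in a few lines. This is a minimal illustration, not MediGen's implementation: the function name, the passage ids, and the two example rankings are made up, and only k=60 comes from the text above. Each passage scores the sum of 1 / (k + rank) over every list it appears in, with ranks starting at 1.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of passage ids into one scored list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from the two retrievers (best match first).
bm25_ranking = ["p3", "p1", "p7"]
dense_ranking = ["p1", "p9", "p3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
fused_ids = [passage_id for passage_id, _ in fused]
print(fused_ids)
```

A passage that both retrievers rank reasonably well (p1 here) ends up ahead of one that only a single retriever ranks first, which is the behavior the fusion step is meant to provide.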

Citations are the product

The answer is only useful if every claim can be traced back to a paragraph. Each cited claim carries a source marker, so reviewers can jump from the summary to the underlying record and check the wording themselves. In the product experience, clicking a citation chip scrolls to the source and highlights the paragraph. In production, the Anthropic Citations API binds each generated span to its cited source verbatim, and a DeBERTa-MNLI grounding verifier runs per claim before display — un-entailed claims are stripped or replaced with "no supporting evidence found".
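The per-claim gate described above might look roughly like the sketch below. Everything named here is an assumption for illustration: `toy_overlap_score` is a crude word-overlap stand-in, not an NLI model, and the 0.8 threshold is arbitrary. In production the scoring function would be replaced by the DeBERTa-MNLI verifier's entailment probability.

```python
import re
from typing import Callable, List, Tuple

NO_EVIDENCE = "no supporting evidence found"


def toy_overlap_score(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: fraction of claim words that occur in the source.

    This is NOT entailment; it only keeps the example self-contained.
    """
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hypothesis_words = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_words:
        return 0.0
    hits = sum(word in premise_words for word in hypothesis_words)
    return hits / len(hypothesis_words)


def verify_claims(
    claims: List[Tuple[str, str]],
    score: Callable[[str, str], float] = toy_overlap_score,
    threshold: float = 0.8,  # hypothetical cut-off
    replace: bool = True,
) -> List[str]:
    """Keep supported claims; strip or replace the rest before display."""
    shown = []
    for claim, source in claims:
        if score(source, claim) >= threshold:
            shown.append(claim)
        elif replace:
            shown.append(NO_EVIDENCE)
        # With replace=False the unsupported claim is simply dropped.
    return shown


source = "The term is five years from the effective date."
claims = [
    ("The term is five years.", source),
    ("The fee is waived.", source),
]
replaced = verify_claims(claims)
stripped = verify_claims(claims, replace=False)
```

The gate sits between generation and display, so a claim that fails verification never reaches the reviewer as if it were supported.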

Five trust mechanics

The product's trust contract is the entire product. Five mechanics make every answer auditable.

Clickable citations — every [n] marker maps to a source passage; clicking scrolls and highlights it.
Multi-turn follow-ups — the user can keep narrowing without re-explaining context; Phase 1 scope.
Refusal copy — when the engine has no evidence above the abstention floor, it refuses with a calm explanation rather than guessing. Reviewer-friendly default.
Share — copy / link / email actions on every answer; outputs leave the engine the way they came in: cited.
Audit footer — model, duration, retrieval scores, abstention gate state. Closed by default, expandable. The full reasoning trail is recoverable.
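As a rough illustration of the clickable-citation mechanic, the sketch below resolves each [n] marker in an answer to its source passage and reports markers that point at nothing. The function name and the sample data are hypothetical; the actual product may store and resolve citations differently.

```python
import re
from typing import Dict, List, Tuple


def resolve_citations(
    answer: str, passages: Dict[int, str]
) -> Tuple[Dict[int, str], List[int]]:
    """Map [n] markers in an answer to passages; list markers with no source."""
    markers = [int(n) for n in re.findall(r"\[(\d+)\]", answer)]
    resolved = {n: passages[n] for n in markers if n in passages}
    dangling = sorted({n for n in markers if n not in passages})
    return resolved, dangling


# Hypothetical answer and retrieved passages.
answer = "Notice must be given 30 days ahead [1]. Renewal is automatic [3]."
passages = {
    1: "Either party may terminate on 30 days' written notice.",
    2: "Fees are invoiced quarterly.",
}

resolved, dangling = resolve_citations(answer, passages)
```

A non-empty `dangling` list signals a marker the interface could not scroll to, which is worth treating as a defect rather than rendering a dead citation chip.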

Abstention is a feature

MediGen should refuse when the retrieved evidence is too thin. The demo uses a dual-signal floor: if lexical and proxy-semantic retrieval both stay weak, the engine does not ask the model to improvise. That matters because legal and scientific review punish confident guesses. Stanford RegLab (Magesh et al., 2025) benchmarked legal AI assistants and found Westlaw AI hallucinated on ~33% of queries and Lexis+ AI on ~17%. Refusal is a feature, not a failure mode.
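A dual-signal floor of this kind can be expressed as a small gate that runs before any generation call. The sketch below is illustrative only: the class name, the function name, and both threshold values are assumptions, since the text does not state the actual floors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstentionFloor:
    lexical_min: float   # hypothetical minimum top BM25 score
    semantic_min: float  # hypothetical minimum top proxy-semantic similarity


def should_abstain(
    top_lexical: float, top_semantic: float, floor: AbstentionFloor
) -> bool:
    """Refuse only when BOTH signals fall below their floors."""
    return top_lexical < floor.lexical_min and top_semantic < floor.semantic_min


floor = AbstentionFloor(lexical_min=2.0, semantic_min=0.25)

both_weak = should_abstain(0.5, 0.10, floor)      # neither signal has evidence
lexical_only = should_abstain(5.0, 0.10, floor)   # strong term match
semantic_only = should_abstain(0.5, 0.60, floor)  # strong near match
```

Requiring both signals to be weak keeps the gate from refusing questions that one retriever handles well, for example an exact clause number that only BM25 matches.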

Ask vs Research

Ask: single retrieval pass. End-to-end p95 target ≤ 15 seconds under expected pilot load. Typical 3–6 citations per answer.

Research: bounded multi-document synthesis with a visible plan, read-only corpus tools, and a structured cited report. End-to-end p95 target ≤ 120 seconds for bounded pilot tasks. Typical 6–12 citations per report.

Models we use here

A production MediGen deployment would use an approved Bedrock Claude or Azure OpenAI generation path inside MediGen's cloud boundary, selected during the cloud, security, and commercial review. This demo runs against a third-party inference API for speed of iteration; the production path keeps the same retrieval, citation, refusal, and audit mechanics around the approved model endpoint.

What the production system would add

A production deployment would run on a specialized model and tooling stack. Each component has one bounded job in the pipeline.

Docling — document parsing (PDFs, contracts, scans) with structure preservation
Voyage — embeddings for true semantic retrieval
Cohere Rerank 3.5 — relevance reordering on retrieval candidates
Approved generator — Bedrock Claude or Azure OpenAI path selected during implementation
DeBERTa-MNLI — grounding verifier (encoder-only, cannot generate; runs per claim before display)

Plus the surrounding production controls:

ACL pre-filter at retrieval time (fail-closed by tenant; G8 requires zero leakage across 50+ permission-pair tests)
Full audit log with six-year retention via S3 Object Lock
Eval harness — 450 questions across 6 sets: 150 golden + 75 refusal + 75 ACL pairs + 100 adversarial + 20 LegalBench + 30 quarterly fresh sample
10 launch gates G1–G10 (G3 caps hallucination at ≤5%; full set in the launch checklist)
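To make the fail-closed ACL pre-filter concrete, here is a minimal sketch in which the tenant filter runs before any scoring, so ranking only ever sees documents the caller may read. The `Passage` type, the function names, and the term-count scoring are all stand-ins for illustration, not the production retriever.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Passage:
    doc_id: str
    tenant_id: Optional[str]  # None models a document with no ACL label
    text: str


def acl_prefilter(corpus: List[Passage], tenant_id: Optional[str]) -> List[Passage]:
    """Return only passages owned by the caller's tenant; fail closed."""
    if not tenant_id:
        return []  # unknown caller sees nothing
    # Unlabeled passages never match, so they are excluded as well.
    return [p for p in corpus if p.tenant_id == tenant_id]


def search(query: str, corpus: List[Passage], tenant_id: Optional[str]) -> List[Passage]:
    """Filter first, then score, so no statistic touches other tenants' text."""
    visible = acl_prefilter(corpus, tenant_id)
    terms = set(query.lower().split())
    scored = []
    for passage in visible:
        words = set(passage.text.lower().split())
        scored.append((sum(term in words for term in terms), passage))
    scored.sort(key=lambda pair: -pair[0])
    return [passage for score, passage in scored if score > 0]


corpus = [
    Passage("d1", "tenant_a", "indemnity clause applies"),
    Passage("d2", "tenant_b", "indemnity clause waived"),
    Passage("d3", None, "indemnity clause draft"),
]
hits_a = [p.doc_id for p in search("indemnity clause", corpus, "tenant_a")]
hits_anonymous = search("indemnity clause", corpus, "")
```

A permission-pair test of the kind G8 describes would assert, for each (user, document) pair without access, that the document never appears in that user's results.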