MediGenDemo

Tech specs

A plain-English map of how MediGen retrieves, cites, refuses, and scales strategic synthesis from a focused demo into an enterprise system.

The retrieval problem

MediGen has to answer questions across a corpus that can grow past 50,000 documents without turning into a keyword search box. Lookup means finding the exact paragraph that names a term, date, clause, protocol, or obligation. Synthesis means connecting several retrieved paragraphs and explaining what they collectively say. At this scale, the hard part is deciding which passages deserve to shape the answer before the model writes a sentence.

Hybrid retrieval

BM25 (term overlap) rewards passages that share important query words. The dense proxy (character n-gram TF-IDF) catches near matches, abbreviations, and wording differences without a hosted embedding service. Reciprocal Rank Fusion (RRF, k=60) combines the two ranked lists by rank position rather than raw score, so neither method's scoring scale dominates the result and a passage ranked well by both rises to the top. In production, MediGen would use Voyage embeddings for true semantic retrieval and Cohere Rerank for final relevance ordering.
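
As a rough illustration of the fusion step, here is a minimal sketch of RRF with k=60. It assumes each retriever has already produced a best-first list of passage ids; the ids and the function name are hypothetical, and this is not the demo's actual code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of passage ids into one list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); raw scores are ignored.
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))

# Hypothetical ranked lists from the two retrievers.
bm25_ranking = ["p7", "p2", "p9"]
dense_ranking = ["p2", "p4", "p7"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because only rank positions enter the sum, BM25 scores and cosine similarities never need to be normalized against each other.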

Citations are the product

The answer is only useful if every claim can be traced back to a paragraph. Each cited claim carries a source marker, so reviewers can jump from the summary to the underlying record and check the wording themselves. In the product experience, clicking a citation chip scrolls to the source and highlights the paragraph. In production, the Anthropic Citations API would enforce source-linked generation; this demo uses simple [n] markers.
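
One way to check [n] markers after generation is sketched below. It assumes the answer is plain text, that each marker sits inside the sentence it supports, and that a dictionary maps marker numbers to paragraph ids; the sentence splitter is deliberately naive, and all names and sample text are hypothetical.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def check_citations(answer, sources):
    """Return (resolved, problems) for an answer that uses [n] markers.

    sources maps a marker number to a paragraph id (hypothetical format).
    """
    resolved, problems = [], []
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        markers = [int(n) for n in CITATION.findall(sentence)]
        if not markers:
            problems.append(("uncited", sentence))
            continue
        for n in markers:
            if n in sources:
                resolved.append((n, sources[n]))
            else:
                problems.append(("unknown_source", sentence))
    return resolved, problems

# Made-up answer and source map, for illustration only.
answer = (
    "Clause 4 sets a 30-day notice period [1]. "
    "The notice must be written [2][5]. "
    "Everyone agrees with this."
)
sources = {1: "doc-12#p3", 2: "doc-12#p4"}

resolved, problems = check_citations(answer, sources)
print(resolved)
print(problems)
```

A check like this only confirms that markers point at real paragraphs. It does not verify that the paragraph supports the claim, which is the job of a grounding verifier.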

Abstention is a feature

MediGen should refuse when the retrieved evidence is too thin. The demo uses a dual-signal floor: if lexical and proxy-semantic retrieval both stay weak, the engine does not ask the model to improvise. That matters because legal and scientific review punish confident guesses. Stanford RegLab reported in 2025 that Westlaw AI-Assisted Research hallucinated on about 33% of benchmark queries and Lexis+ AI on about 17%.
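
The dual-signal floor can be sketched as a small gate in front of the model call. The threshold values and names below are assumptions chosen for illustration; the demo's real floors are not stated here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Floors:
    lexical: float = 1.0    # hypothetical floor on the top BM25 score
    semantic: float = 0.15  # hypothetical floor on the top proxy cosine similarity

def should_abstain(top_bm25, top_semantic, floors=Floors()):
    """Abstain only when BOTH retrieval signals fall below their floor."""
    return top_bm25 < floors.lexical and top_semantic < floors.semantic

def answer_or_refuse(top_bm25, top_semantic, generate):
    """Call generate() only when at least one signal clears its floor."""
    if should_abstain(top_bm25, top_semantic):
        return "Not enough evidence in the corpus to answer this."
    return generate()

# The lambda stands in for the real model call.
print(answer_or_refuse(0.4, 0.05, lambda: "drafted answer"))
print(answer_or_refuse(3.2, 0.05, lambda: "drafted answer"))
```

Requiring both signals to be weak keeps the gate from refusing queries that match on exact terms but not on wording, or the reverse.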

Ask vs Research

Ask is for focused questions where the user needs a fast, cited answer from the strongest passages. Its target latency is about 3-4 seconds, with 3-6 citations. Research is for slower synthesis across more documents, competing evidence, and reviewer-ready context. Its target latency is about 10-15 seconds, with 6-12 citations. The modes share retrieval, but they spend different amounts of time on breadth, drafting, and citation density.
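
The two modes could be expressed as configuration over the shared retrieval pipeline. This sketch records only the targets stated above; the class and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mode:
    name: str
    latency_target_s: tuple  # (low, high) target in seconds
    citations: tuple         # (min, max) citations per answer

ASK = Mode("ask", latency_target_s=(3, 4), citations=(3, 6))
RESEARCH = Mode("research", latency_target_s=(10, 15), citations=(6, 12))

def citation_count_ok(mode, n_citations):
    """True if an answer's citation count falls in the mode's target range."""
    low, high = mode.citations
    return low <= n_citations <= high

print(citation_count_ok(ASK, 4))       # within the Ask range of 3 to 6
print(citation_count_ok(ASK, 8))       # above the Ask range
print(citation_count_ok(RESEARCH, 8))  # within the Research range of 6 to 12
```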

Models we use here

This demo uses Fireworks AI by default, specifically accounts/fireworks/models/llama-v3p1-70b-instruct. The reason is practical: the model is open-weight, hosted, fast enough for a demo, and cost-effective while we prove the product loop. A production MediGen deployment would use Claude on Amazon Bedrock inside the enterprise AWS HIPAA boundary, keeping model access, logging, and data movement inside the approved environment.

What the production system would add

  • Voyage embeddings (true semantic retrieval)
  • Cohere Rerank 3.5 (relevance reordering)
  • Anthropic Citations API + DeBERTa-MNLI grounding verifier
  • ACL pre-filter at retrieval (fail-closed by tenant)
  • Full audit log with object-lock retention
  • Eval harness (270-question golden + adversarial + refusal sets)
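
Of the items above, the fail-closed ACL pre-filter is the easiest to sketch. The version below assumes each passage is a dictionary carrying a tenant id and a list of allowed groups; those field names are hypothetical, and a real deployment would push this filter into the retrieval index rather than apply it afterwards.

```python
def acl_prefilter(passages, tenant_id, user_groups):
    """Keep only passages the caller may see; anything ambiguous is dropped."""
    if not tenant_id:
        return []  # fail closed: no tenant context means no results
    allowed = []
    for passage in passages:
        # A missing or different tenant id never matches.
        if passage.get("tenant_id") != tenant_id:
            continue
        acl = passage.get("allowed_groups")
        if not acl:
            continue  # missing or empty ACL is treated as "not visible"
        if set(user_groups) & set(acl):
            allowed.append(passage)
    return allowed

# Made-up passages for illustration.
passages = [
    {"id": "p1", "tenant_id": "t1", "allowed_groups": ["legal"]},
    {"id": "p2", "tenant_id": "t2", "allowed_groups": ["legal"]},
    {"id": "p3", "tenant_id": "t1"},  # no ACL recorded
    {"id": "p4", "tenant_id": "t1", "allowed_groups": ["clinical"]},
]

visible = acl_prefilter(passages, "t1", {"legal"})
print([p["id"] for p in visible])
```

Filtering before ranking means a passage the user cannot see never reaches the model, so it cannot leak into an answer or a citation.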