Clinical RAG Platform
Production retrieval-augmented generation for healthcare decision support
System Architecture
The Challenge
Healthcare organizations have vast knowledge bases: clinical protocols, research papers, and regulatory guidelines. Yet clinicians often cannot reach the right information at the point of care. Traditional keyword search fails because medical queries are nuanced and context-dependent.
A query like "anticoagulation protocol for post-surgical AFib patient with renal impairment" requires understanding medical terminology, patient context, and institutional guidelines simultaneously.
What I Personally Built
Hybrid Retrieval Pipeline
Designed and implemented the core retrieval architecture combining Vertex AI Vector Search with BM25 sparse retrieval. Built the reciprocal rank fusion algorithm that merges results from both approaches.
Clinical Embedding Fine-tuning
Fine-tuned sentence transformers on clinical text to handle medical abbreviations, drug names, and procedure codes that general models miss. Created the abbreviation expansion and medical term normalization pipeline.
Safety & Hallucination Detection
Built the confidence scoring system and hallucination detection pipeline using NLI models. Implemented the "evidence required" prompting strategy that forces citation of retrieved passages.
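The evidence-required idea can be illustrated with a minimal prompt template. This is a sketch only; the function name, passage numbering, and citation format are assumptions, and the production prompts in `prompts/clinical_qa.py` are not shown here.

```python
def build_evidence_required_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt that forces the model to cite retrieved passages."""
    # Number the passages so citations like [2] are unambiguous.
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the clinical question using ONLY the passages below.\n"
        "Every claim must cite a passage like [1]. If the passages do not\n"
        "contain the answer, reply exactly: I don't know.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Forcing an explicit "I don't know" escape hatch in the instructions is what makes downstream deferral possible: the model has a sanctioned way to decline instead of inventing an answer.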
Multi-Tenant Infrastructure
Architected the tenant isolation system ensuring different medical specialties have separate knowledge bases. Built the async processing pipeline with Cloud Run and the monitoring/alerting infrastructure.
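At its core, tenant isolation reduces to resolving a verified identity to that tenant's private index and nothing else. The sketch below assumes each JWT carries a `tenant_id` claim; the claim key, index names, and function name are illustrative, not the production `auth.py` middleware.

```python
# Hypothetical mapping from tenant to its private vector index.
TENANT_INDEXES = {
    "cardiology": "idx-cardiology-prod",
    "oncology": "idx-oncology-prod",
}


def resolve_index(claims: dict) -> str:
    """Map a verified JWT claims dict to that tenant's private index."""
    tenant = claims.get("tenant_id")
    if tenant not in TENANT_INDEXES:
        # Fail closed: an unknown tenant gets no index, never a default one.
        raise PermissionError(f"unknown tenant: {tenant!r}")
    return TENANT_INDEXES[tenant]
```

The fail-closed branch is the important design choice: a misconfigured or forged token raises rather than silently falling back to a shared knowledge base.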
Results & Outcomes
Key Achievements
- HIPAA-compliant infrastructure with full audit logging
- Zero safety incidents in production deployment
- Hybrid retrieval improved relevance by 34% over dense-only baseline
- Confidence calibration reduced false positives by 67%
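One common calibration technique is temperature scaling, sketched below; this is an assumption for illustration, not necessarily the calibrator used in production.

```python
import math


def temperature_scale(logit: float, temperature: float) -> float:
    """Soften a raw confidence logit; temperature > 1 reduces overconfidence."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


# With temperature 2.0, a raw logit of 4.0 (p ~ 0.98) drops to ~0.88,
# so borderline answers fall below a deferral threshold more often.
```

The temperature is fit on a held-out set so that reported confidences match empirical accuracy; overconfident answers then trip the deferral logic instead of reaching the clinician.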
Repository Structure
Pseudo-public architecture (proprietary content removed)
clinical-rag-platform/
├── src/
│   ├── api/
│   │   ├── routes/
│   │   │   ├── query.py              # Main RAG query endpoint
│   │   │   ├── ingest.py             # Document ingestion
│   │   │   └── health.py             # Health checks
│   │   ├── middleware/
│   │   │   ├── auth.py               # JWT + tenant isolation
│   │   │   └── audit.py              # HIPAA audit logging
│   │   └── main.py                   # FastAPI application
│   │
│   ├── retrieval/
│   │   ├── embeddings/
│   │   │   ├── clinical_embedder.py  # Fine-tuned embeddings
│   │   │   └── abbreviation_map.json # Medical abbreviations
│   │   ├── vector_search.py          # Vertex AI integration
│   │   ├── bm25_index.py             # Sparse retrieval
│   │   └── fusion.py                 # RRF implementation
│   │
│   ├── generation/
│   │   ├── context_builder.py        # Passage assembly
│   │   ├── prompts/
│   │   │   ├── clinical_qa.py        # Evidence-required prompts
│   │   │   └── safety_check.py       # Verification prompts
│   │   └── llm_client.py             # LLM API wrapper
│   │
│   ├── safety/
│   │   ├── confidence.py             # Calibrated scoring
│   │   ├── hallucination.py          # NLI verification
│   │   └── deferral.py               # "I don't know" logic
│   │
│   └── monitoring/
│       ├── metrics.py                # Prometheus metrics
│       ├── bigquery.py               # Analytics export
│       └── alerts.py                 # PagerDuty integration
│
├── infrastructure/
│   ├── terraform/
│   │   ├── cloud_run.tf              # Serverless compute
│   │   ├── vector_search.tf          # Vertex AI index
│   │   └── networking.tf             # VPC, IAM
│   └── docker/
│       └── Dockerfile                # Production image
│
├── tests/
│   ├── unit/
│   ├── integration/
│   └── safety/                       # Safety-specific tests
│       ├── test_hallucination.py
│       └── test_confidence.py
│
└── configs/
    ├── production.yaml
    └── safety_thresholds.yaml        # Confidence thresholds

Technical Implementation
Core RAG Pipeline
class ClinicalRAGPipeline:
    """Safety-critical RAG for healthcare decision support."""

    def __init__(self, config: RAGConfig):
        self.config = config
        self.embedder = ClinicalEmbedder(config.embedding_model)
        self.retriever = HybridRetriever(
            vector_store=VertexVectorSearch(config.index_id),
            bm25_index=ClinicalBM25Index(config.corpus_path)
        )
        self.generator = SafeGenerator(
            model=config.llm_model,
            safety_threshold=config.confidence_threshold
        )

    async def query(self, question: str, context: ClinicalContext) -> RAGResponse:
        # Hybrid retrieval: dense + sparse
        dense_results = await self.retriever.vector_search(
            self.embedder.encode(question), k=10
        )
        sparse_results = await self.retriever.bm25_search(question, k=10)

        # Reciprocal rank fusion
        merged = self.retriever.rrf_merge(dense_results, sparse_results)

        # Generate with safety checks
        response = await self.generator.generate(
            question=question,
            context=merged,
            clinical_context=context
        )

        # Enforce citation and confidence requirements
        if response.confidence < self.config.confidence_threshold:
            return RAGResponse.low_confidence(response)

        return response.with_citations(merged)

Clinical Embedding Service
import re

import numpy as np
from sentence_transformers import SentenceTransformer


class ClinicalEmbeddingService:
    """Fine-tuned embeddings for medical terminology."""

    def __init__(self, model_path: str):
        self.model = SentenceTransformer(model_path)
        self.medical_vocab = self._load_medical_vocabulary()
        self.abbreviation_map = self._load_abbreviation_expansions()

    def encode(self, text: str) -> np.ndarray:
        # Expand medical abbreviations
        expanded = self._expand_abbreviations(text)

        # Normalize medical terminology
        normalized = self._normalize_medical_terms(expanded)

        # Generate embedding with clinical context
        return self.model.encode(
            normalized,
            normalize_embeddings=True,
            show_progress_bar=False
        )

    def _expand_abbreviations(self, text: str) -> str:
        """Expand common medical abbreviations for better retrieval."""
        for abbrev, expansion in self.abbreviation_map.items():
            text = re.sub(rf'\b{abbrev}\b', f'{abbrev} ({expansion})', text)
        return text

Hallucination Detection
from typing import List

from transformers import pipeline


class HallucinationChecker:
    """Verify generated claims against retrieved evidence."""

    def __init__(self, nli_model: str = "microsoft/deberta-v3-large-mnli"):
        self.nli_pipeline = pipeline("text-classification", model=nli_model)
        self.confidence_threshold = 0.85

    async def verify_response(
        self,
        response: str,
        evidence: List[RetrievedPassage]
    ) -> VerificationResult:
        # Extract claims from response
        claims = self._extract_claims(response)

        verification_results = []
        for claim in claims:
            # Check each claim against all evidence
            support_scores = []
            for passage in evidence:
                # Score the premise/hypothesis pair with the NLI model
                result = self.nli_pipeline(
                    [{"text": passage.text, "text_pair": claim}]
                )[0]
                support_scores.append(
                    result['score'] if result['label'] == 'ENTAILMENT' else 0
                )

            max_support = max(support_scores) if support_scores else 0
            verification_results.append(ClaimVerification(
                claim=claim,
                supported=max_support >= self.confidence_threshold,
                confidence=max_support,
                supporting_passage=(
                    evidence[support_scores.index(max_support)]
                    if max_support > 0 else None
                )
            ))

        # Response is safe only if all claims are supported
        all_supported = all(v.supported for v in verification_results)
        return VerificationResult(
            is_safe=all_supported,
            claims=verification_results,
            should_defer=not all_supported
        )

Mathematical Formulation
Reciprocal Rank Fusion (RRF)
Combines dense (semantic) and sparse (BM25) retrieval results using rank-based fusion:

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

where $k$ is a smoothing constant, $R$ = set of rankings (dense + sparse), and $\mathrm{rank}_r(d)$ = position of document $d$ in ranking $r$.
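A minimal RRF merge can be sketched as follows; the function name and the $k = 60$ default are illustrative, and the production version in `fusion.py` is not shown here.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs by reciprocal rank, best first.

    Each document scores sum(1 / (k + rank)) over the rankings that
    contain it; higher totals sort earlier in the fused list.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents near the top of both rankings dominate the fused list, while a document found by only one retriever can still surface if it ranks highly there.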
Hybrid Retrieval Score
Final relevance score combines semantic similarity with lexical matching:

$$\mathrm{score}(q, d) = \alpha \cdot \cos\!\big(E(q), E(d)\big) + (1 - \alpha) \cdot \mathrm{BM25}(q, d)$$

where $E(\cdot)$ are clinical embeddings, $\alpha$ weights semantic similarity, and BM25 captures exact medical terminology matches.
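The blend can be sketched as below. The $\alpha$ default and the max-based BM25 normalization are assumptions, added because raw BM25 scores are unbounded while cosine similarity lies in $[-1, 1]$, so the two must be brought to a comparable scale before mixing.

```python
def hybrid_score(
    cosine_sim: float,
    bm25: float,
    bm25_max: float,
    alpha: float = 0.7,
) -> float:
    """Weighted blend of semantic similarity and normalized BM25.

    bm25_max is the highest BM25 score in the current result set,
    used to squash lexical scores into [0, 1] before mixing.
    """
    lexical = bm25 / bm25_max if bm25_max > 0 else 0.0
    return alpha * cosine_sim + (1.0 - alpha) * lexical
```

With $\alpha$ near 1 the system trusts the fine-tuned embeddings; lowering it gives exact matches on drug names and procedure codes more weight.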
Confidence-Gated Response
Safety-critical threshold determines whether to respond or defer:

$$\mathrm{answer}(q) = \begin{cases} \mathrm{generate}(q, C) & \text{if } \mathrm{conf}(q, C) \geq \tau \\ \text{defer } (\text{``I don't know''}) & \text{otherwise} \end{cases}$$

where $\tau$ is set conservatively for clinical applications, $C$ = retrieved context, and confidence is estimated via calibrated model outputs.
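The gate itself is a few lines; this sketch uses 0.85 as the default $\tau$ only because that matches the hallucination checker's threshold, and the deferral wording is illustrative rather than the production `deferral.py` logic.

```python
TAU = 0.85  # conservative clinical threshold; illustrative default


def gated_answer(answer: str, confidence: float, tau: float = TAU) -> str:
    """Return the answer only when confidence clears the safety threshold."""
    if confidence >= tau:
        return answer
    # Defer rather than risk an unsupported clinical claim.
    return "I don't know: confidence is below the clinical safety threshold."
```

The asymmetry is deliberate: a deferred answer costs the clinician a lookup, while a confidently wrong one can cost far more.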
Technical Stack
Infrastructure
- GCP Cloud Run (serverless compute)
- Vertex AI Vector Search
- Cloud Storage (document store)
- BigQuery (analytics)
- Terraform (IaC)
Application
- Python FastAPI backend
- Sentence Transformers (embeddings)
- DeBERTa (NLI verification)
- Async processing pipeline
- Comprehensive monitoring
Why Safety Matters
In healthcare, incorrect AI responses aren't just annoying; they're dangerous. This platform prioritizes safety over helpfulness, with explicit confidence thresholds and source attribution on every response.
The system is designed to say "I don't know" rather than risk providing incorrect medical information. This conservative approach is essential for clinical deployment where patient safety is paramount.