Clinical RAG Platform

Production retrieval-augmented generation for healthcare decision support

Serving 180+ hospitals · 6+ product teams · Safety-critical

System Architecture

The Challenge

Healthcare organizations have vast knowledge bases (clinical protocols, research papers, regulatory guidelines), but clinicians can't access the right information at the point of care. Traditional search fails because medical queries are nuanced and context-dependent.

A query like "anticoagulation protocol for post-surgical AFib patient with renal impairment" requires understanding medical terminology, patient context, and institutional guidelines simultaneously.


What I Personally Built

Hybrid Retrieval Pipeline

Designed and implemented the core retrieval architecture combining Vertex AI Vector Search with BM25 sparse retrieval. Built the reciprocal rank fusion algorithm that merges results from both approaches.

Python · Vertex AI · FastAPI

Clinical Embedding Fine-tuning

Fine-tuned sentence transformers on clinical text to handle medical abbreviations, drug names, and procedure codes that general models miss. Created the abbreviation expansion and medical term normalization pipeline.

PyTorch · Sentence Transformers · UMLS

Safety & Hallucination Detection

Built the confidence scoring system and hallucination detection pipeline using NLI models. Implemented the "evidence required" prompting strategy that forces citation of retrieved passages.

DeBERTa NLI · Calibration · Safety Thresholds

Multi-Tenant Infrastructure

Architected the tenant isolation system ensuring different medical specialties have separate knowledge bases. Built the async processing pipeline with Cloud Run and the monitoring/alerting infrastructure.

GCP Cloud Run · BigQuery · Terraform
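A minimal sketch of the tenant-scoping idea, assuming a hypothetical `TenantRouter` helper (names are illustrative, not the production API): every query is routed to an index namespace derived from verified JWT claims, so one specialty's corpus can never serve another tenant's request.

```python
class TenantRouter:
    """Map verified JWT claims to a tenant-scoped index name.

    Illustrative only; the real middleware lives in src/api/middleware/auth.py.
    """

    def __init__(self, index_prefix: str, allowed_tenants: set[str]):
        self.index_prefix = index_prefix
        self.allowed_tenants = allowed_tenants

    def index_for(self, claims: dict) -> str:
        tenant = claims.get("tenant_id")
        if tenant not in self.allowed_tenants:
            # Fail closed: unknown tenants are rejected outright
            raise PermissionError(f"unknown tenant: {tenant!r}")
        return f"{self.index_prefix}-{tenant}"


router = TenantRouter("clinical-rag", {"cardiology", "oncology"})
print(router.index_for({"tenant_id": "cardiology"}))  # clinical-rag-cardiology
```

Failing closed on an unrecognized tenant is the key design choice: a missing or stale claim yields an error, never a fallback to a shared index.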

Results & Outcomes

  • 180+ hospitals served
  • 6+ product teams using the platform
  • <500ms P95 query latency
  • 99.9% uptime SLA

Key Achievements

  • HIPAA-compliant infrastructure with full audit logging
  • Zero safety incidents in production deployment
  • Hybrid retrieval improved relevance by 34% over dense-only baseline
  • Confidence calibration reduced false positives by 67%
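The calibration technique behind that false-positive reduction is not specified above; temperature scaling is one standard approach, sketched here with an illustrative temperature value:

```python
import math


def calibrated_confidence(logit: float, temperature: float = 2.0) -> float:
    """Sigmoid confidence; temperature > 1 softens overconfident raw scores."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


# A raw logit of 4.0 maps to ~0.98 uncalibrated but ~0.88 after scaling,
# pushing borderline answers below a conservative response threshold.
print(calibrated_confidence(4.0, temperature=1.0))
print(calibrated_confidence(4.0, temperature=2.0))
```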

Repository Structure

Pseudo-public architecture (proprietary content removed)

clinical-rag-platform/
├── src/
│   ├── api/
│   │   ├── routes/
│   │   │   ├── query.py          # Main RAG query endpoint
│   │   │   ├── ingest.py         # Document ingestion
│   │   │   └── health.py         # Health checks
│   │   ├── middleware/
│   │   │   ├── auth.py           # JWT + tenant isolation
│   │   │   └── audit.py          # HIPAA audit logging
│   │   └── main.py               # FastAPI application
│   │
│   ├── retrieval/
│   │   ├── embeddings/
│   │   │   ├── clinical_embedder.py    # Fine-tuned embeddings
│   │   │   └── abbreviation_map.json   # Medical abbreviations
│   │   ├── vector_search.py      # Vertex AI integration
│   │   ├── bm25_index.py         # Sparse retrieval
│   │   └── fusion.py             # RRF implementation
│   │
│   ├── generation/
│   │   ├── context_builder.py    # Passage assembly
│   │   ├── prompts/
│   │   │   ├── clinical_qa.py    # Evidence-required prompts
│   │   │   └── safety_check.py   # Verification prompts
│   │   └── llm_client.py         # LLM API wrapper
│   │
│   ├── safety/
│   │   ├── confidence.py         # Calibrated scoring
│   │   ├── hallucination.py      # NLI verification
│   │   └── deferral.py           # "I don't know" logic
│   │
│   └── monitoring/
│       ├── metrics.py            # Prometheus metrics
│       ├── bigquery.py           # Analytics export
│       └── alerts.py             # PagerDuty integration
│
├── infrastructure/
│   ├── terraform/
│   │   ├── cloud_run.tf          # Serverless compute
│   │   ├── vector_search.tf      # Vertex AI index
│   │   └── networking.tf         # VPC, IAM
│   └── docker/
│       └── Dockerfile            # Production image
│
├── tests/
│   ├── unit/
│   ├── integration/
│   └── safety/                   # Safety-specific tests
│       ├── test_hallucination.py
│       └── test_confidence.py
│
└── configs/
    ├── production.yaml
    └── safety_thresholds.yaml    # Confidence thresholds

Technical Implementation

Core RAG Pipeline

Clinical RAG Pipeline
Safety-critical hybrid retrieval with confidence thresholds and mandatory citations
Python · FastAPI + Vertex AI
class ClinicalRAGPipeline:
    """Safety-critical RAG for healthcare decision support."""

    def __init__(self, config: RAGConfig):
        self.config = config
        self.embedder = ClinicalEmbedder(config.embedding_model)
        self.retriever = HybridRetriever(
            vector_store=VertexVectorSearch(config.index_id),
            bm25_index=ClinicalBM25Index(config.corpus_path)
        )
        self.generator = SafeGenerator(
            model=config.llm_model,
            safety_threshold=config.confidence_threshold
        )

    async def query(self, question: str, context: ClinicalContext) -> RAGResponse:
        # Hybrid retrieval: dense + sparse
        dense_results = await self.retriever.vector_search(
            self.embedder.encode(question), k=10
        )
        sparse_results = await self.retriever.bm25_search(question, k=10)

        # Reciprocal rank fusion
        merged = self.retriever.rrf_merge(dense_results, sparse_results)

        # Generate with safety checks
        response = await self.generator.generate(
            question=question,
            context=merged,
            clinical_context=context
        )

        # Enforce citation and confidence requirements
        if response.confidence < self.config.confidence_threshold:
            return RAGResponse.low_confidence(response)

        return response.with_citations(merged)

Clinical Embedding Service

Clinical Embedding Service
Fine-tuned embeddings with medical abbreviation expansion
Python · Sentence Transformers
import re

import numpy as np
from sentence_transformers import SentenceTransformer


class ClinicalEmbeddingService:
    """Fine-tuned embeddings for medical terminology."""

    def __init__(self, model_path: str):
        self.model = SentenceTransformer(model_path)
        self.medical_vocab = self._load_medical_vocabulary()
        self.abbreviation_map = self._load_abbreviation_expansions()

    def encode(self, text: str) -> np.ndarray:
        # Expand medical abbreviations
        expanded = self._expand_abbreviations(text)

        # Normalize medical terminology
        normalized = self._normalize_medical_terms(expanded)

        # Generate embedding with clinical context
        return self.model.encode(
            normalized,
            normalize_embeddings=True,
            show_progress_bar=False
        )

    def _expand_abbreviations(self, text: str) -> str:
        """Expand common medical abbreviations for better retrieval."""
        for abbrev, expansion in self.abbreviation_map.items():
            # re.escape guards against abbreviations containing regex metacharacters
            text = re.sub(rf'\b{re.escape(abbrev)}\b', f'{abbrev} ({expansion})', text)
        return text

Hallucination Detection

Hallucination Checker
NLI-based verification of generated claims against evidence
Python · DeBERTa + Transformers
from typing import List

from transformers import pipeline


class HallucinationChecker:
    """Verify generated claims against retrieved evidence."""

    def __init__(self, nli_model: str = "microsoft/deberta-v3-large-mnli"):
        self.nli_pipeline = pipeline("text-classification", model=nli_model)
        self.confidence_threshold = 0.85

    async def verify_response(
        self,
        response: str,
        evidence: List[RetrievedPassage]
    ) -> VerificationResult:
        # Extract claims from response
        claims = self._extract_claims(response)

        verification_results = []
        for claim in claims:
            # Check each claim against all evidence passages
            support_scores = []
            for passage in evidence:
                # Sequence-pair input: premise = evidence, hypothesis = claim
                result = self.nli_pipeline(
                    {"text": passage.text, "text_pair": claim}
                )
                score = result[0]["score"] if result[0]["label"] == "ENTAILMENT" else 0.0
                support_scores.append(score)

            max_support = max(support_scores) if support_scores else 0.0
            verification_results.append(ClaimVerification(
                claim=claim,
                supported=max_support >= self.confidence_threshold,
                confidence=max_support,
                supporting_passage=(
                    evidence[support_scores.index(max_support)]
                    if max_support > 0 else None
                )
            ))

        # Response is safe only if every claim is supported
        all_supported = all(v.supported for v in verification_results)
        return VerificationResult(
            is_safe=all_supported,
            claims=verification_results,
            should_defer=not all_supported
        )

Mathematical Formulation

Reciprocal Rank Fusion (RRF)

Combines dense (semantic) and sparse (BM25) retrieval results using rank-based fusion:

    RRF(d) = Σ_{r ∈ R} 1 / (k + rank_r(d))

where k is a smoothing constant (commonly k = 60), R is the set of rankings (dense + sparse), and rank_r(d) is the position of document d in ranking r.
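The fusion step can be sketched in a few lines of Python (`rrf_merge` here is illustrative; the production version lives in src/retrieval/fusion.py):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs by reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d3", "d1", "d7"]    # semantic ranking
sparse = ["d1", "d9", "d3"]   # BM25 ranking
print(rrf_merge([dense, sparse]))  # → ['d1', 'd3', 'd9', 'd7']
```

Documents ranked highly by both retrievers (here `d1`) float to the top, which is exactly the behavior that makes RRF robust to score-scale mismatches between dense and sparse retrieval.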

Hybrid Retrieval Score

Final relevance score combines semantic similarity with lexical matching:

    score(d, q) = α · cos(E(q), E(d)) + (1 − α) · BM25(q, d)

where E(·) are clinical embeddings, α weights semantic similarity, and BM25(q, d) captures exact medical terminology matches.
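A hedged sketch of the blend (the α weight and BM25 normalization cap here are illustrative, not the tuned production values):

```python
def hybrid_score(cosine_sim: float, bm25_score: float,
                 alpha: float = 0.7, bm25_cap: float = 25.0) -> float:
    """Blend semantic and lexical relevance.

    BM25 scores are unbounded, so they are clipped and scaled into
    [0, 1] before mixing with cosine similarity.
    """
    lexical = min(bm25_score / bm25_cap, 1.0)
    return alpha * cosine_sim + (1.0 - alpha) * lexical
```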

Confidence-Gated Response

Safety-critical threshold determines whether to respond or defer:

    respond(q) = answer(q, C)  if conf(q, C) ≥ τ,  otherwise defer

where τ is set conservatively for clinical applications, C is the retrieved context, and conf(q, C) is estimated via calibrated model outputs.
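The gate itself reduces to a few lines (the threshold and deferral wording here are illustrative; production thresholds live in configs/safety_thresholds.yaml):

```python
DEFERRAL_MESSAGE = (
    "I don't have sufficient evidence to answer this safely. "
    "Please consult institutional guidelines."
)


def gate_response(answer: str, confidence: float, tau: float = 0.85) -> str:
    """Return the answer only when calibrated confidence clears tau."""
    return answer if confidence >= tau else DEFERRAL_MESSAGE
```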

Technical Stack

Infrastructure

  • GCP Cloud Run (serverless compute)
  • Vertex AI Vector Search
  • Cloud Storage (document store)
  • BigQuery (analytics)
  • Terraform (IaC)

Application

  • Python FastAPI backend
  • Sentence Transformers (embeddings)
  • DeBERTa (NLI verification)
  • Async processing pipeline
  • Comprehensive monitoring

Why Safety Matters

In healthcare, incorrect AI responses aren't just annoying; they're dangerous. This platform prioritizes safety over helpfulness, with explicit confidence thresholds and source attribution on every response.

The system is designed to say "I don't know" rather than risk providing incorrect medical information. This conservative approach is essential for clinical deployment where patient safety is paramount.