Document Understanding Platform
Enterprise HIPAA-compliant document extraction with multi-provider LLM integration and human-in-the-loop learning
The Challenge
Healthcare organizations process thousands of insurance authorization and denial documents daily—faxed PDFs, scanned forms, multi-page documents with varying quality. Manual extraction is slow, error-prone, and doesn't scale.
The challenge isn't just OCR—it's understanding document structure, extracting specific fields with high confidence, validating against business rules, and maintaining HIPAA compliance throughout. When extraction confidence is low, humans need to review and correct, and those corrections should improve future performance.
System Architecture
The Solution
Enterprise document processing platform with a complete extraction pipeline:
Multi-Provider LLM Integration
Seamless switching between Anthropic Claude, OpenAI GPT-4, and Azure OpenAI. Provider abstraction enables cost optimization, fallback handling, and A/B testing different models on the same documents.
Dual OCR Engine Support
Tesseract and EasyOCR with automatic quality assessment. Per-word confidence scoring enables intelligent routing—high-confidence extractions auto-complete, low-confidence documents route to human review.
Human-in-the-Loop Learning
Every human correction is captured with context for reinforcement learning. Reward shaping distinguishes between corrections, additions, removals, and confirmations. Model performance is tracked per field with precision/recall/F1 metrics.
Business Rules Engine
Configurable validation rules with severity levels. Cross-field validation catches logical inconsistencies (e.g., denial reason + authorization number). Custom Python expressions for complex business logic.
Technical Implementation
Multi-Provider LLM Service
The LLM service abstracts provider differences, enabling seamless switching and fallback. Field definitions are database-driven, allowing runtime configuration without code changes.
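    import os
    from typing import Any, Dict, Optional

    import openai
    from anthropic import Anthropic
    from sqlalchemy.orm import Session

    # AzureOpenAIService and FieldService are project-internal classes (not shown).


    class LLMService:
        """Multi-provider LLM extraction with graceful fallback."""

        def __init__(self, db: Optional[Session] = None):
            # Initialize clients only for providers whose API key is present
            anthropic_key = os.getenv("ANTHROPIC_API_KEY")
            openai_key = os.getenv("OPENAI_API_KEY")
            self.anthropic_client = Anthropic(api_key=anthropic_key) if anthropic_key else None
            self.openai_client = openai.OpenAI(api_key=openai_key) if openai_key else None
            self.azure_openai_service = AzureOpenAIService()
            # Assumed constructor; the excerpt does not show how field_service is built.
            self.field_service = FieldService(db)

            self.default_provider = os.getenv("DEFAULT_LLM_PROVIDER", "anthropic")
            # Assumption: the env var name is illustrative; the excerpt does not show
            # where default_model is set.
            self.default_model = os.getenv("DEFAULT_LLM_MODEL", "")
            self.model_version = f"{self.default_provider}_{self.default_model}_v1.0"

        def extract_fields(self, ocr_text: str, provider: Optional[str] = None) -> Dict[str, Any]:
            """Extract fields with configurable field definitions from database."""
            provider = provider or self.default_provider
            model = self.default_model

            # Get field definitions (required vs optional); both are assumed to be lists
            required_fields = self.field_service.get_required_fields()
            optional_fields = self.field_service.get_optional_fields()
            all_fields = required_fields + optional_fields
            # _build_prompt is a hypothetical helper; prompt construction is elided in the excerpt.
            prompt = self._build_prompt(ocr_text, all_fields)

            # Route to appropriate provider
            if provider == "azure_openai":
                return self.azure_openai_service.extract_fields(ocr_text, all_fields)
            elif provider == "anthropic":
                result = self._extract_with_anthropic(prompt, model)
            elif provider == "openai":
                result = self._extract_with_openai(prompt, model)
            else:
                raise ValueError(f"Unsupported provider: {provider}")

            # Normalize the raw response into a field -> value dict
            extracted_data = self._parse_extraction_result(result)

            # Calculate confidence scores per field
            confidence_scores = self._calculate_confidence_scores(extracted_data)

            return {
                "extracted_fields": extracted_data,
                "confidence_scores": confidence_scores,
                "requires_review": self._requires_review(extracted_data, confidence_scores),
                "model_version": self.model_version,  # For RL tracking
            }

Document Quality Assessment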
Before extraction, documents are assessed for quality using computer vision metrics. Poor quality documents are flagged early with specific recommendations for improvement.
    from typing import Any, Dict

    import cv2
    import numpy as np
    from PIL import Image


    class DocumentQualityService:
        """Multi-metric document quality assessment."""

        def assess_document_quality(self, file_path: str) -> Dict[str, Any]:
            # Assumes file_path points to a single-page image; PDF rasterization is not shown.
            image = Image.open(file_path)

            # Image quality metrics
            quality_metrics = self._assess_image_quality(file_path, image)

            # Text quality via OCR confidence
            text_metrics = self._assess_text_quality(file_path)

            # Weighted overall score
            overall_score = self._calculate_overall_score(quality_metrics, text_metrics)

            return {
                "image_dpi": quality_metrics.get("dpi", 0),
                "image_clarity_score": quality_metrics.get("clarity_score", 0.0),
                "text_density_score": text_metrics.get("text_density", 0.0),
                "overall_quality_score": overall_score,
                "quality_issues": self._identify_issues(quality_metrics, text_metrics),
                # Arguments are elided in the original excerpt.
                "recommendations": self._generate_recommendations(...),
            }

        def _assess_image_quality(self, image_path: str, pil_image: Image.Image) -> Dict[str, Any]:
            # Convert to 8-bit grayscale for the OpenCV/NumPy metrics below
            gray = np.array(pil_image.convert("L"))

            # Laplacian variance for clarity (higher means sharper edges)
            clarity_score = float(cv2.Laplacian(gray, cv2.CV_64F).var())

            # Contrast and brightness analysis
            contrast = float(gray.std())
            brightness_score = 1.0 - abs(float(gray.mean()) - 128) / 128.0

            # Noise estimation via median filter comparison
            noise_level = self._estimate_noise_level(gray)

            return {
                "clarity_score": clarity_score,
                "contrast": contrast,
                "brightness_score": brightness_score,
                "noise_level": noise_level,
                # Remaining keys (e.g. "dpi") are elided in the original excerpt.
            }

Business Rules Validation
Extracted fields are validated against configurable business rules. Cross-field rules catch logical inconsistencies that single-field validation would miss.
    from typing import Any, Dict, Optional

    from sqlalchemy.orm import Session

    # BusinessRule, BusinessRuleViolation, and Document are the project's
    # SQLAlchemy models (not shown); the Document model name is assumed.


    class WorkflowService:
        """Business rule validation and intelligent assignment."""

        def __init__(self, db: Session):
            self.db = db

        def validate_business_rules(self, document_id: int) -> Dict[str, Any]:
            """Validate against all active business rules."""
            document = self.db.get(Document, document_id)
            if document is None:
                raise ValueError(f"Document {document_id} not found")

            active_rules = self.db.query(BusinessRule).filter(
                BusinessRule.is_active.is_(True)
            ).all()

            violations = []
            for rule in active_rules:
                violation = self._validate_single_rule(document, rule)
                if violation:
                    # Store in database for audit trail
                    rule_violation = BusinessRuleViolation(
                        document_id=document_id,
                        rule_id=rule.id,
                        violation_details=violation,
                        severity=rule.severity,
                    )
                    self.db.add(rule_violation)

                    # Only error-severity violations are returned to the caller
                    if rule.severity == "error":
                        violations.append(violation)

            # Assumption: this service owns the transaction for the audit records.
            self.db.commit()
            return {"has_violations": len(violations) > 0, "violations": violations}

        def _validate_cross_field_rule(self, document, rule, rule_def) -> Optional[Dict]:
            """Cross-field validation (e.g., denial + auth number conflict)."""
            if rule_def.get("logic") == "denial_no_auth_number":
                denial_reason = document.extracted_fields.get("denial_reason")
                auth_number = document.extracted_fields.get("authorization_number")

                if denial_reason and auth_number:
                    return {
                        "rule_name": rule.name,
                        "issue": "Denied documents should not have authorization numbers",
                        "severity": rule.severity,
                    }
            return None  # No conflict detected

Reinforcement Learning from Human Feedback
Human corrections are captured with full context for model improvement. Reward shaping provides nuanced feedback—confirmations are positive, corrections are mildly negative, missed fields are strongly negative.
    # HumanFeedback is the project's feedback model (not shown).


    class ReinforcementLearningService:
        """Human feedback capture for continuous model improvement."""

        def record_human_feedback(
            self,
            document_id: int,
            field_name: str,
            original_value: str,
            corrected_value: str,
            feedback_type: str,  # correction, addition, removal, confirmation
            model_version: str,
        ) -> HumanFeedback:
            # Calculate reward based on feedback type
            reward_score = self._calculate_reward(
                feedback_type, original_value, corrected_value
            )

            # _get_ocr_context is a hypothetical helper; the excerpt does not show
            # where ocr_context comes from.
            ocr_context = self._get_ocr_context(document_id, field_name)

            feedback = HumanFeedback(
                document_id=document_id,
                field_name=field_name,
                original_value=original_value,
                corrected_value=corrected_value,
                feedback_type=feedback_type,
                reward_score=reward_score,
                model_version=model_version,
                ocr_context=ocr_context,  # For training context
            )

            # Update model performance metrics
            self._update_model_performance(model_version, field_name, feedback_type)

            return feedback

        def _calculate_reward(self, feedback_type: str, original: str, corrected: str) -> float:
            """Reward shaping for RL training data."""
            if feedback_type == "confirmation":
                return 1.0  # Model was correct
            elif feedback_type == "correction":
                return -0.5  # Partial penalty (had something, was wrong)
            elif feedback_type == "addition":
                return -0.8  # Missed field entirely
            elif feedback_type == "removal":
                return -0.3  # Hallucinated field
            return 0.0  # Unknown feedback types are neutral

Mathematical Formulation
Key algorithms underlying the document understanding pipeline.
Per-Field Confidence Scoring
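Base Confidence:
$$c_{\text{base}}(f) = \begin{cases} 0.8 & \text{if } |v_f| \ge 2 \\ 0.5 & \text{if } 0 < |v_f| < 2 \end{cases}$$
where $v_f$ = extracted value for field $f$, $|v_f|$ = character length.
Format Validation Boost:
$$c(f) = c_{\text{base}}(f) + 0.1 \cdot \mathbb{1}\left[v_f \text{ matches expected pattern}\right]$$
+0.1 boost if value matches expected pattern (dates, emails, phone numbers).
Overall Document Confidence:
$$C_{\text{doc}} = 0.8 \cdot \frac{1}{|\mathcal{R}|} \sum_{f \in \mathcal{R}} c(f) + 0.2 \cdot \frac{1}{|\mathcal{O}|} \sum_{f \in \mathcal{O}} c(f)$$
where $\mathcal{R}$ = required fields (80% weight), $\mathcal{O}$ = optional fields (20% weight).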
RLHF Reward Shaping
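Feedback Reward Function:
$$r(a) = \begin{cases} +1.0 & a = \text{confirmation} \\ -0.5 & a = \text{correction} \\ -0.8 & a = \text{addition} \\ -0.3 & a = \text{removal} \\ 0 & \text{otherwise} \end{cases}$$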
Rationale:
- Missed fields (-0.8) penalized most: harder to catch in review than wrong values
- Corrections (-0.5): model found right location, wrong value
- Hallucinations (-0.3): easier to catch, less dangerous than omissions
- Confirmations (+1.0): positive reinforcement for correct extractions
Per-Field Performance Tracking:
$$F_1(f) = \frac{2 \, P_f \, R_f}{P_f + R_f}$$
where $P_f$ = precision, $R_f$ = recall for field $f$, computed from human feedback over a sliding window.
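The text does not say how feedback types map to precision and recall counts, so the following is only a sketch under one plausible assumption: a confirmation is a true positive, a removal is a false positive, an addition is a false negative, and a correction counts as both a false positive (wrong value emitted) and a false negative (right value missed). The class name `FieldMetrics` and the window size are hypothetical.

```python
from collections import deque
from typing import Deque, Dict


class FieldMetrics:
    """Sliding-window precision/recall/F1 for one field (illustrative mapping)."""

    def __init__(self, window: int = 500):
        # Oldest events fall out automatically once the window is full.
        self._events: Deque[str] = deque(maxlen=window)

    def record(self, feedback_type: str) -> None:
        self._events.append(feedback_type)

    def scores(self) -> Dict[str, float]:
        tp = fp = fn = 0
        for event in self._events:
            if event == "confirmation":
                tp += 1
            elif event == "removal":
                fp += 1
            elif event == "addition":
                fn += 1
            elif event == "correction":
                # Assumption: a wrong value is both a false positive and a miss.
                fp += 1
                fn += 1
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}


metrics = FieldMetrics(window=100)
for kind in ["confirmation"] * 6 + ["correction", "addition", "removal", "removal"]:
    metrics.record(kind)
field_scores = metrics.scores()
```

With six confirmations, one correction, one addition, and two removals, this mapping gives precision 6/9 and recall 6/8. A different mapping (for example, counting corrections only as false positives) would shift both numbers, so the choice should follow whatever the review workflow actually means by each feedback type.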
Document Quality Assessment
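Image Clarity (Laplacian Variance):
$$\sigma^2_{\text{lap}} = \operatorname{Var}\left(\nabla^2 I\right)$$
where $I$ = grayscale image. Higher variance indicates sharper edges. Blurry images have low $\sigma^2_{\text{lap}}$.
Brightness Score:
$$s_{\text{bright}} = 1 - \frac{|\mu - 128|}{128}$$
where $\mu$ = mean grayscale intensity. Optimal at 128 (mid-gray), penalizes too dark/light.
Weighted Quality Score:
$$Q = 0.20\, s_{\text{dpi}} + 0.25\, s_{\text{clarity}} + 0.15\, s_{\text{contrast}} + 0.10\, s_{\text{bright}} + 0.10\, s_{\text{noise}} + 0.15\, s_{\text{conf}} + 0.05\, s_{\text{density}}$$
Documents with $Q < 0.5$ are flagged with specific improvement recommendations.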
HIPAA Compliance Framework
Healthcare document processing requires strict compliance. The platform implements comprehensive safeguards:
Administrative
- 4-tier role-based access control
- JWT authentication with expiration
- Complete audit logging
- Session timeout management
Technical
- PHI access tracking per request
- Security headers (HSTS, CSP, etc.)
- Data integrity verification
- Automated retention policies
Audit Trail Example
Every PHI access is logged with user, session, IP, action, resource, and data hash for integrity verification:
    HIPAAAuditLog(
        user_id="reviewer_123",
        session_id="sess_abc",
        action="READ",
        resource_type="document",
        resource_id="doc_456",
        phi_accessed=True,
        ip_address="10.0.1.50",
        data_hash="sha256:a1b2c3..."
    )

Production Capabilities
Technology Stack
Backend
- FastAPI (async Python)
- PostgreSQL + SQLAlchemy
- Redis + Celery (queue)
- Tesseract + EasyOCR
Frontend
- React + TypeScript
- Vite (build tooling)
- Bootstrap components
- Real-time updates
LLM Providers
- Anthropic Claude (3 models)
- OpenAI GPT-4/3.5
- Azure OpenAI Service
Infrastructure
- Docker + Docker Compose
- Kubernetes-ready
- Prometheus metrics
- Swagger/OpenAPI docs
Technical Deep Dive
Key architectural decisions and trade-offs—the questions a senior engineer at a frontier AI lab would ask in a technical interview.
Each provider has a dedicated extraction method (_extract_with_anthropic, _extract_with_openai) that normalizes responses to a common format. The key insight is that all providers return JSON, but with different wrapper structures:
- Anthropic: response.content[0].text
- OpenAI: response.choices[0].message.content
- Azure: Same as OpenAI but with different auth headers
The _parse_extraction_result() method handles JSON cleanup (removing markdown fences) and field name normalization (mapping display names to internal names).
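As a rough illustration of the cleanup step described above, the sketch below strips a surrounding markdown fence, parses the JSON, and maps display names to internal names. The mapping table, the snake_case fallback for unknown keys, and the standalone function name are assumptions; the platform's actual `_parse_extraction_result()` is not shown and its field definitions come from the database.

```python
import json
import re
from typing import Any, Dict

# Built without a literal triple backtick so this example can sit inside a fence.
FENCE = "`" * 3
_FENCE_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*|\s*" + FENCE + r"$", re.IGNORECASE
)

# Hypothetical display-name -> internal-name mapping.
FIELD_NAME_MAP = {
    "Patient Name": "patient_name",
    "Authorization Number": "authorization_number",
    "Denial Reason": "denial_reason",
}


def parse_extraction_result(raw_text: str) -> Dict[str, Any]:
    """Strip markdown fences, parse JSON, and normalize field names."""
    cleaned = _FENCE_RE.sub("", raw_text.strip())
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object mapping field names to values")
    normalized: Dict[str, Any] = {}
    for key, value in data.items():
        # Assumption: unknown display names fall back to snake_case.
        fallback = key.strip().lower().replace(" ", "_")
        normalized[FIELD_NAME_MAP.get(key, fallback)] = value
    return normalized


raw = FENCE + 'json\n{"Patient Name": "Jane Doe", "Member ID": "A123"}\n' + FENCE
parsed = parse_extraction_result(raw)
```

Because every provider's text passes through the same function, downstream confidence scoring and validation never need to know which provider produced the response.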
Fallback logic: If the primary provider fails, the system checks get_available_providers() and routes to the next available. Provider availability is determined by API key presence at initialization.
Confidence scoring uses a heuristic approach combining field completeness and format validation:
- Base confidence: 0.8 for any non-empty extraction
- Format validation boost: +0.1 if field matches expected pattern (dates, emails, phones)
- Length penalty: Drops to 0.5 if value is <2 characters
Overall confidence is a weighted average: 80% from required fields, 20% from optional fields. This ensures missing required fields heavily impact the score.
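A minimal sketch of this heuristic, under stated assumptions: the regex patterns are illustrative rather than the platform's own, an empty extraction is scored 0.0, and an empty field group contributes nothing to the weighted average. The function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional

# Illustrative patterns only; the real patterns are not shown in the text.
FORMAT_PATTERNS = {
    "date": re.compile(r"^\d{2}/\d{2}/\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def field_confidence(value: Optional[str], pattern_name: Optional[str] = None) -> float:
    """Base 0.8, 0.5 for values under 2 characters, +0.1 on a format match."""
    if value is None or not value.strip():
        return 0.0  # Assumption: an empty extraction scores zero
    value = value.strip()
    score = 0.5 if len(value) < 2 else 0.8
    pattern = FORMAT_PATTERNS.get(pattern_name) if pattern_name else None
    if pattern is not None and pattern.match(value):
        score += 0.1
    return round(score, 2)


def overall_confidence(
    scores: Dict[str, float],
    required: Iterable[str],
    optional: Iterable[str],
) -> float:
    """Weighted average: 80% required fields, 20% optional fields."""

    def mean(names: Iterable[str]) -> float:
        names = list(names)
        if not names:
            return 0.0  # Assumption: an empty group contributes nothing
        return sum(scores.get(name, 0.0) for name in names) / len(names)

    return round(0.8 * mean(required) + 0.2 * mean(optional), 4)


scores = {
    "patient_name": field_confidence("Jane Doe"),
    "service_date": field_confidence("01/02/2024", "date"),
    "fax_number": field_confidence(""),
}
doc_confidence = overall_confidence(
    scores, required=["patient_name", "service_date"], optional=["fax_number"]
)
```

In this example the required fields average 0.85 and the single optional field is empty, so the document score is 0.8 × 0.85 + 0.2 × 0 = 0.68.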
Review Triggers (configurable via env vars):
- MIN_CONFIDENCE_THRESHOLD=0.7: Overall confidence below this triggers review
- REQUIRED_FIELDS_THRESHOLD=0.8: Any required field below this triggers review
- Missing required fields: Always triggers review regardless of confidence
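The three triggers could be combined along these lines. This is a sketch, not the platform's `_requires_review()`: the function name `review_reasons` and the reason strings are hypothetical, while the environment variable names and defaults come from the list above.

```python
import os
from typing import Any, Dict, Iterable, List

MIN_CONFIDENCE_THRESHOLD = float(os.getenv("MIN_CONFIDENCE_THRESHOLD", "0.7"))
REQUIRED_FIELDS_THRESHOLD = float(os.getenv("REQUIRED_FIELDS_THRESHOLD", "0.8"))


def review_reasons(
    extracted: Dict[str, Any],
    scores: Dict[str, float],
    overall: float,
    required: Iterable[str],
    min_confidence: float = MIN_CONFIDENCE_THRESHOLD,
    required_threshold: float = REQUIRED_FIELDS_THRESHOLD,
) -> List[str]:
    """Return the reasons a document needs human review (empty list means none)."""
    reasons: List[str] = []
    for name in required:
        if not extracted.get(name):
            # A missing required field always triggers review.
            reasons.append(f"missing required field: {name}")
        elif scores.get(name, 0.0) < required_threshold:
            reasons.append(f"low confidence on required field: {name}")
    if overall < min_confidence:
        reasons.append("overall confidence below threshold")
    return reasons


reasons = review_reasons(
    extracted={"patient_name": "J", "denial_reason": ""},
    scores={"patient_name": 0.5},
    overall=0.62,
    required=["patient_name", "denial_reason"],
    min_confidence=0.7,
    required_threshold=0.8,
)
requires_review = bool(reasons)
```

Returning the reasons rather than a bare boolean gives reviewers a hint about where to look first; the comparison is strict, so a field at exactly the base confidence of 0.8 does not trigger review on its own.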
The system captures human feedback with full context for future model fine-tuning:
Reward Shaping:
- Confirmation (+1.0): Human verified model was correct
- Correction (-0.5): Model extracted something, but wrong value
- Addition (-0.8): Model missed field entirely (false negative)
- Removal (-0.3): Model hallucinated a field (false positive)
Why these specific values? Missing fields (-0.8) is penalized more than wrong values (-0.5) because a wrong extraction at least indicates the model found the right location. Hallucinations (-0.3) are penalized less because they're easier to catch in review than missed fields.
Data captured for training: Each feedback record includes the original OCR context, model version, field name, original/corrected values, and reviewer ID. This enables:
- Fine-tuning on correction pairs (original → corrected)
- Prompt engineering improvements based on common error patterns
- Per-field performance tracking to identify weak spots
HIPAA compliance is implemented via middleware that intercepts every request:
- PHI endpoint detection: Routes like /documents, /upload, /review are flagged as PHI-accessing
- Audit logging: Every PHI access creates a HIPAAAuditLog record with user, session, IP, action (CRUD), resource type/ID, and data hash
- Data integrity: SHA-256 hash of access metadata enables tamper detection
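The PHI route check and the integrity hash can be sketched with the standard library alone. The function names, the canonical JSON serialization, and the exact set of hashed fields are assumptions; the text only states that a SHA-256 hash of the access metadata is stored.

```python
import hashlib
import hmac
import json
from typing import Any, Dict, Tuple

PHI_ROUTE_PREFIXES: Tuple[str, ...] = ("/documents", "/upload", "/review")


def is_phi_endpoint(path: str) -> bool:
    """True if the path is a PHI route or lies beneath one."""
    return any(path == p or path.startswith(p + "/") for p in PHI_ROUTE_PREFIXES)


def hash_access_metadata(metadata: Dict[str, Any]) -> str:
    # Canonical form: sorted keys, no extra whitespace, so equal metadata
    # always hashes to the same digest.
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_access_metadata(metadata: Dict[str, Any], stored_hash: str) -> bool:
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(hash_access_metadata(metadata), stored_hash)


entry = {
    "user_id": "reviewer_123",
    "session_id": "sess_abc",
    "action": "READ",
    "resource_type": "document",
    "resource_id": "doc_456",
    "ip_address": "10.0.1.50",
}
stored = hash_access_metadata(entry)
tampered = dict(entry, action="DELETE")
```

An unkeyed hash only detects tampering if the digest is stored somewhere the attacker cannot also rewrite; a keyed HMAC with a secret held outside the database would be the stronger variant.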
Security Headers (added to every response):
- X-Frame-Options: DENY (prevents clickjacking)
- Strict-Transport-Security: max-age=31536000 (forces HTTPS)
- Content-Security-Policy: default-src 'self' (mitigates XSS)
- X-Content-Type-Options: nosniff (prevents MIME sniffing)
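Kept framework-free, the header step amounts to merging a fixed set of values into each outgoing response. The helper name below is hypothetical; in the FastAPI application this logic would presumably live in the HTTP middleware mentioned above.

```python
from typing import Dict, Mapping

SECURITY_HEADERS: Dict[str, str] = {
    "X-Frame-Options": "DENY",
    "Strict-Transport-Security": "max-age=31536000",
    "Content-Security-Policy": "default-src 'self'",
    "X-Content-Type-Options": "nosniff",
}


def with_security_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the response headers with the security headers applied."""
    merged = dict(headers)
    # Security headers win over anything a route handler set.
    merged.update(SECURITY_HEADERS)
    return merged


response_headers = with_security_headers(
    {"Content-Type": "application/json", "X-Frame-Options": "SAMEORIGIN"}
)
```

Applying the values last means a handler cannot accidentally weaken them.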
Role-based access: 4-tier system (Admin, Supervisor, Reviewer, Viewer) with permission checks before PHI access. Supervisors can access quality checks; Reviewers can only access assigned documents.
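One way to picture the four tiers is as an ordered enum with checks run before any PHI is returned. The ordering, the function names, and the rule that Viewers cannot open PHI documents are assumptions; the text only states what Supervisors and Reviewers can access.

```python
from enum import IntEnum
from typing import Iterable


class Role(IntEnum):
    VIEWER = 1
    REVIEWER = 2
    SUPERVISOR = 3
    ADMIN = 4


def can_access_quality_checks(role: Role) -> bool:
    # Supervisors and above can access quality checks.
    return role >= Role.SUPERVISOR


def can_access_document(
    role: Role, user_id: str, assigned_reviewer_ids: Iterable[str]
) -> bool:
    if role >= Role.SUPERVISOR:
        return True
    if role == Role.REVIEWER:
        # Reviewers can only access documents assigned to them.
        return user_id in set(assigned_reviewer_ids)
    return False  # Assumption: viewers do not open PHI documents


allowed = can_access_document(Role.REVIEWER, "reviewer_123", ["reviewer_123"])
denied = can_access_document(Role.REVIEWER, "reviewer_999", ["reviewer_123"])
```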
Document quality assessment uses computer vision metrics before OCR:
Image Quality Metrics:
- DPI: Extracted from image metadata, minimum 200 recommended
- Clarity: Laplacian variance (cv2.Laplacian) — higher = sharper
- Contrast: Standard deviation of grayscale values
- Brightness: Distance from optimal mean (128) — penalizes too dark/light
- Noise: Difference between original and median-filtered image
Text Quality Metrics (via OCR):
- Text density: Ratio of detected text blocks to total blocks
- Average confidence: Mean OCR confidence across all words
- Readable ratio: Percentage of words with confidence >60%
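The text metrics above reduce to simple arithmetic over per-word OCR confidences. The sketch below assumes confidences on a 0 to 100 scale, with -1 marking rows that are not words (as in Tesseract's tabular output), and takes block counts as plain integers; the function name and return keys are hypothetical.

```python
from typing import Dict, Sequence


def text_quality(
    word_confidences: Sequence[float],
    blocks_with_text: int,
    total_blocks: int,
    readable_cutoff: float = 60.0,
) -> Dict[str, float]:
    """Compute text density, average confidence, and readable ratio (all 0 to 1)."""
    # Drop non-word rows, which carry a confidence of -1.
    valid = [c for c in word_confidences if c >= 0]
    if valid:
        average = sum(valid) / len(valid) / 100.0
        readable = sum(1 for c in valid if c > readable_cutoff) / len(valid)
    else:
        average = readable = 0.0
    density = blocks_with_text / total_blocks if total_blocks else 0.0
    return {
        "text_density": density,
        "average_confidence": average,
        "readable_ratio": readable,
    }


text_metrics = text_quality([95, 80, 40, -1, 65], blocks_with_text=3, total_blocks=4)
```

Here four valid words average 70% confidence, and three of them exceed the 60% cutoff, so the readable ratio is 0.75.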
Weighted overall score: DPI (20%), Clarity (25%), Contrast (15%), Brightness (10%), Noise (10%), Text confidence (15%), Text density (5%). Documents below MIN_QUALITY_THRESHOLD=0.5 are flagged with specific recommendations (e.g., "Scan at higher resolution", "Ensure document is in focus").
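Using the weights listed above, the overall score is a plain weighted sum. This sketch assumes every input has already been normalized to the range 0 to 1 with higher meaning better (so the noise score is high for clean images); the per-metric cutoff of 0.5 used for recommendations and the function names are hypothetical.

```python
from typing import Dict, List

WEIGHTS: Dict[str, float] = {
    "dpi": 0.20,
    "clarity": 0.25,
    "contrast": 0.15,
    "brightness": 0.10,
    "noise": 0.10,
    "text_confidence": 0.15,
    "text_density": 0.05,
}
MIN_QUALITY_THRESHOLD = 0.5

# Recommendation wording taken from the examples in the text.
RECOMMENDATIONS: Dict[str, str] = {
    "dpi": "Scan at higher resolution",
    "clarity": "Ensure document is in focus",
}


def overall_quality(scores: Dict[str, float]) -> float:
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"Missing metric scores: {sorted(missing)}")
    # Clamp each score to [0, 1] before weighting.
    return sum(w * min(max(scores[name], 0.0), 1.0) for name, w in WEIGHTS.items())


def recommendations(scores: Dict[str, float], cutoff: float = 0.5) -> List[str]:
    return [
        text for name, text in RECOMMENDATIONS.items() if scores.get(name, 0.0) < cutoff
    ]


sample = {
    "dpi": 0.3, "clarity": 0.2, "contrast": 0.6, "brightness": 0.6,
    "noise": 0.6, "text_confidence": 0.6, "text_density": 0.6,
}
quality = overall_quality(sample)
flagged = quality < MIN_QUALITY_THRESHOLD
advice = recommendations(sample)
```

For this sample the sum is 0.06 + 0.05 + 0.6 × 0.55 = 0.44, which falls below the 0.5 threshold, so the document is flagged with both recommendations.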
Design Trade-offs & Limitations
Heuristic confidence vs. model-based: Current confidence scoring uses rule-based heuristics rather than a trained confidence model. This was a deliberate choice for interpretability and ease of tuning, but a learned confidence estimator could be more accurate.
Offline RL feedback: Feedback is captured but not used for real-time model updates. The system generates training data for offline fine-tuning rather than online learning. This is appropriate for healthcare, where model changes require validation before deployment.
OCR-first architecture: The system OCRs entire documents before LLM extraction. An alternative would be vision-language models (GPT-4V, Claude 3) that process images directly. The OCR-first approach was chosen for cost efficiency and to leverage existing Tesseract infrastructure.
Why It Matters
This project demonstrates several capabilities relevant to frontier AI labs:
- Production ML Engineering: Not a prototype—full-stack application with database models, async processing, monitoring, and deployment configuration.
- LLM Application Architecture: Multi-provider abstraction, prompt engineering for structured extraction, confidence scoring, and graceful degradation.
- Human-in-the-Loop ML: Systematic capture of human feedback with reward shaping for continuous improvement—the foundation of RLHF.
- Healthcare Domain Expertise: HIPAA compliance isn't an afterthought—it's built into the architecture with comprehensive audit logging and access controls.
Portfolio Context
This project complements the Clinical RAG Platform by demonstrating a different aspect of healthcare ML: document processing vs. knowledge retrieval. Together, they show breadth across the healthcare AI stack—from unstructured document ingestion to structured knowledge access.