Document Understanding Platform

Enterprise HIPAA-compliant document extraction with multi-provider LLM integration and human-in-the-loop learning

HIPAA Compliant · Multi-Provider LLM · Human-in-the-Loop RL

The Challenge

Healthcare organizations process thousands of insurance authorization and denial documents daily—faxed PDFs, scanned forms, multi-page documents with varying quality. Manual extraction is slow, error-prone, and doesn't scale.

The challenge isn't just OCR—it's understanding document structure, extracting specific fields with high confidence, validating against business rules, and maintaining HIPAA compliance throughout. When extraction confidence is low, humans need to review and correct, and those corrections should improve future performance.

System Architecture

[Figure: pipeline diagram showing the main flow and the RL feedback loop. Caption: Human corrections improve extraction accuracy over time.]

The Solution

Enterprise document processing platform with a complete extraction pipeline:

Multi-Provider LLM Integration

Seamless switching between Anthropic Claude, OpenAI GPT-4, and Azure OpenAI. Provider abstraction enables cost optimization, fallback handling, and A/B testing different models on the same documents.

Dual OCR Engine Support

Tesseract and EasyOCR with automatic quality assessment. Per-word confidence scoring enables intelligent routing—high-confidence extractions auto-complete, low-confidence documents route to human review.
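The routing idea can be sketched in a few lines. This is a minimal illustration rather than the platform's code: the function name `route_document` and the two auto-complete thresholds are assumptions, and confidences are assumed to be normalized to the 0..1 range. Only the 60% readable-word cutoff comes from this write-up.

```python
from statistics import mean
from typing import Sequence

READABLE_WORD_CONFIDENCE = 0.60  # a word counts as readable above this (from the write-up)
AUTO_COMPLETE_MIN_MEAN = 0.85    # hypothetical threshold, for illustration only
MIN_READABLE_RATIO = 0.90        # hypothetical threshold, for illustration only


def route_document(word_confidences: Sequence[float]) -> str:
    """Return "auto_complete" or "human_review" from per-word OCR confidences."""
    if not word_confidences:
        # No recognized words at all: always send to a person.
        return "human_review"

    average = mean(word_confidences)
    readable = sum(c > READABLE_WORD_CONFIDENCE for c in word_confidences)
    readable_ratio = readable / len(word_confidences)

    if average >= AUTO_COMPLETE_MIN_MEAN and readable_ratio >= MIN_READABLE_RATIO:
        return "auto_complete"
    return "human_review"
```

Checking both the mean and the readable ratio keeps a handful of very confident words from masking a page that is mostly unreadable.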

Human-in-the-Loop Learning

Every human correction is captured with context for reinforcement learning. Reward shaping distinguishes between corrections, additions, removals, and confirmations. Model performance is tracked per field with precision/recall/F1 metrics.

Business Rules Engine

Configurable validation rules with severity levels. Cross-field validation catches logical inconsistencies (e.g., denial reason + authorization number). Custom Python expressions for complex business logic.

Technical Implementation

Multi-Provider LLM Service

The LLM service abstracts provider differences, enabling seamless switching and fallback. Field definitions are database-driven, allowing runtime configuration without code changes.

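LLM Service
Multi-provider extraction with configurable field definitions
Python · FastAPI + Anthropic/OpenAI
import os
from typing import Any, Dict, Optional

import openai
from anthropic import Anthropic
from sqlalchemy.orm import Session

# AzureOpenAIService and FieldService are project classes not shown in this excerpt.


class LLMService:
    """Multi-provider LLM extraction with graceful fallback."""

    def __init__(self, db: Optional[Session] = None):
        # Initialize clients based on available API keys
        self.anthropic_client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        self.openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
        self.azure_openai_service = AzureOpenAIService()
        # Assumption: the field service is built from the DB session (elided in the original)
        self.field_service = FieldService(db)

        self.default_provider = os.getenv("DEFAULT_LLM_PROVIDER", "anthropic")
        # Assumption: the default model comes from configuration; this env var name is hypothetical
        self.default_model = os.getenv("DEFAULT_LLM_MODEL", "")
        self.model_version = f"{self.default_provider}_{self.default_model}_v1.0"

    def extract_fields(self, ocr_text: str, provider: Optional[str] = None) -> Dict[str, Any]:
        """Extract fields with configurable field definitions from database."""
        provider = provider or self.default_provider
        model = self.default_model

        # Get field definitions (required vs optional); assumed to be lists
        required_fields = self.field_service.get_required_fields()
        optional_fields = self.field_service.get_optional_fields()
        all_fields = required_fields + optional_fields

        # Route to appropriate provider
        if provider == "azure_openai":
            return self.azure_openai_service.extract_fields(ocr_text, all_fields)

        # _build_prompt is a hypothetical helper; prompt construction is elided in the original
        prompt = self._build_prompt(ocr_text, all_fields)
        if provider == "anthropic":
            result = self._extract_with_anthropic(prompt, model)
        elif provider == "openai":
            result = self._extract_with_openai(prompt, model)
        else:
            raise ValueError(f"Unsupported LLM provider: {provider}")

        # Strip markdown fences and normalize field names
        extracted_data = self._parse_extraction_result(result)

        # Calculate confidence scores per field
        confidence_scores = self._calculate_confidence_scores(extracted_data)

        return {
            "extracted_fields": extracted_data,
            "confidence_scores": confidence_scores,
            "requires_review": self._requires_review(extracted_data, confidence_scores),
            "model_version": self.model_version,  # For RL tracking
        }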

Document Quality Assessment

Before extraction, documents are assessed for quality using computer vision metrics. Poor-quality documents are flagged early with specific recommendations for improvement.

Quality Assessment
Multi-metric analysis with OpenCV and PIL
Python · OpenCV + Tesseract
from typing import Any, Dict

import cv2
import numpy as np
from PIL import Image


class DocumentQualityService:
    """Multi-metric document quality assessment."""

    def assess_document_quality(self, file_path: str) -> Dict[str, Any]:
        # Load the page image (assumes file_path points to an image file)
        image = Image.open(file_path)

        # Image quality metrics
        quality_metrics = self._assess_image_quality(file_path, image)

        # Text quality via OCR confidence
        text_metrics = self._assess_text_quality(file_path)

        # Weighted overall score
        overall_score = self._calculate_overall_score(quality_metrics, text_metrics)

        return {
            "image_dpi": quality_metrics.get("dpi", 0),
            "image_clarity_score": quality_metrics.get("clarity_score", 0.0),
            "text_density_score": text_metrics.get("text_density", 0.0),
            "overall_quality_score": overall_score,
            "quality_issues": self._identify_issues(quality_metrics, text_metrics),
            # Arguments are elided in the original excerpt
            "recommendations": self._generate_recommendations(...),
        }

    def _assess_image_quality(self, image_path: str, pil_image: Image.Image) -> Dict[str, Any]:
        # Convert to an 8-bit grayscale array for the OpenCV metrics
        gray = np.array(pil_image.convert("L"))

        # Laplacian variance for clarity
        clarity_score = float(cv2.Laplacian(gray, cv2.CV_64F).var())

        # Contrast and brightness analysis
        contrast = float(gray.std())
        brightness_score = 1.0 - abs(float(gray.mean()) - 128) / 128.0

        # Noise estimation via median filter comparison
        noise_level = self._estimate_noise_level(gray)

        # brightness_score, noise_level, DPI, and any other metrics are
        # returned here too; their keys are elided in the original excerpt
        return {"clarity_score": clarity_score, "contrast": contrast}

Business Rules Validation

Extracted fields are validated against configurable business rules. Cross-field rules catch logical inconsistencies that single-field validation would miss.

Workflow Service
Business rule validation with cross-field logic
Python · SQLAlchemy + Custom Rules
from typing import Any, Dict, Optional

# BusinessRule, BusinessRuleViolation, and Document are project SQLAlchemy
# models not shown in this excerpt (the Document model name is assumed).


class WorkflowService:
    """Business rule validation and intelligent assignment."""

    def validate_business_rules(self, document_id: int) -> Dict[str, Any]:
        """Validate against all active business rules."""

        # Assumption: the document is loaded by primary key (elided in the original)
        document = self.db.query(Document).filter(Document.id == document_id).first()

        active_rules = self.db.query(BusinessRule).filter(
            BusinessRule.is_active.is_(True)
        ).all()

        violations = []
        for rule in active_rules:
            violation = self._validate_single_rule(document, rule)
            if violation:
                # Store every violation in the database for the audit trail
                rule_violation = BusinessRuleViolation(
                    document_id=document_id,
                    rule_id=rule.id,
                    violation_details=violation,
                    severity=rule.severity,
                )
                self.db.add(rule_violation)

                # Only error-severity violations are returned to the caller
                if rule.severity == "error":
                    violations.append(violation)

        return {"has_violations": len(violations) > 0, "violations": violations}

    def _validate_cross_field_rule(self, document, rule, rule_def) -> Optional[Dict]:
        """Cross-field validation (e.g., denial + auth number conflict)."""

        if rule_def.get("logic") == "denial_no_auth_number":
            denial_reason = document.extracted_fields.get("denial_reason")
            auth_number = document.extracted_fields.get("authorization_number")

            if denial_reason and auth_number:
                return {
                    "rule_name": rule.name,
                    "issue": "Denied documents should not have authorization numbers",
                    "severity": rule.severity,
                }

        # No violation found
        return None

Reinforcement Learning from Human Feedback

Human corrections are captured with full context for model improvement. Reward shaping provides nuanced feedback—confirmations are positive, corrections are mildly negative, missed fields are strongly negative.

RL Feedback Service
Human feedback capture with reward shaping
Python · SQLAlchemy + Custom RL
from typing import Optional

# HumanFeedback is a project SQLAlchemy model not shown in this excerpt.


class ReinforcementLearningService:
    """Human feedback capture for continuous model improvement."""

    def record_human_feedback(
        self,
        document_id: int,
        field_name: str,
        original_value: str,
        corrected_value: str,
        feedback_type: str,  # correction, addition, removal, confirmation
        model_version: str,
        # Assumption: the OCR context is passed in by the caller
        # (its source is elided in the original excerpt)
        ocr_context: Optional[str] = None,
    ) -> HumanFeedback:

        # Calculate reward based on feedback type
        reward_score = self._calculate_reward(
            feedback_type, original_value, corrected_value
        )

        feedback = HumanFeedback(
            document_id=document_id,
            field_name=field_name,
            original_value=original_value,
            corrected_value=corrected_value,
            feedback_type=feedback_type,
            reward_score=reward_score,
            model_version=model_version,
            ocr_context=ocr_context,  # For training context
        )

        # Update model performance metrics
        self._update_model_performance(model_version, field_name, feedback_type)

        return feedback

    def _calculate_reward(self, feedback_type: str, original: str, corrected: str) -> float:
        """Reward shaping for RL training data."""
        if feedback_type == "confirmation":
            return 1.0   # Model was correct
        elif feedback_type == "correction":
            return -0.5  # Partial penalty (had something, was wrong)
        elif feedback_type == "addition":
            return -0.8  # Missed field entirely
        elif feedback_type == "removal":
            return -0.3  # Hallucinated field
        return 0.0       # Unknown feedback types carry no reward

Mathematical Formulation

Key algorithms underlying the document understanding pipeline.

Per-Field Confidence Scoring

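Base Confidence:

$$c_{\text{base}}(f) = \begin{cases} 0.8 & \text{if } |v_f| \ge 2 \\ 0.5 & \text{if } 0 < |v_f| < 2 \end{cases}$$

where $v_f$ = extracted value for field $f$, $|v_f|$ = character length.

Format Validation Boost:

$$c(f) = c_{\text{base}}(f) + 0.1 \cdot \mathbb{1}\left[v_f \text{ matches the expected pattern}\right]$$

+0.1 boost if value matches expected pattern (dates, emails, phone numbers).

Overall Document Confidence:

$$C_{\text{doc}} = 0.8 \cdot \frac{1}{|\mathcal{R}|} \sum_{f \in \mathcal{R}} c(f) + 0.2 \cdot \frac{1}{|\mathcal{O}|} \sum_{f \in \mathcal{O}} c(f)$$

where $\mathcal{R}$ = required fields (80% weight), $\mathcal{O}$ = optional fields (20% weight).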

RLHF Reward Shaping

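Feedback Reward Function:

$$r(t) = \begin{cases} +1.0 & t = \text{confirmation} \\ -0.5 & t = \text{correction} \\ -0.8 & t = \text{addition} \\ -0.3 & t = \text{removal} \\ 0 & \text{otherwise} \end{cases}$$

where $t$ = feedback type.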

Rationale:

  • Missed fields (-0.8) penalized most: harder to catch in review than wrong values
  • Corrections (-0.5): model found right location, wrong value
  • Hallucinations (-0.3): easier to catch, less dangerous than omissions
  • Confirmations (+1.0): positive reinforcement for correct extractions

Per-Field Performance Tracking:

$$F_1(f) = \frac{2 \, P_f \, R_f}{P_f + R_f}$$

where $P_f$ = precision, $R_f$ = recall for field $f$, computed from human feedback over sliding window.
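One way to derive precision and recall from feedback events is sketched below. The mapping from feedback types to true/false positives is an assumption made for illustration (a correction is counted as both a false positive and a false negative); the write-up does not state the platform's convention. The class name `FieldMetrics` and the window size are also hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Dict


class FieldMetrics:
    """Sliding-window precision/recall/F1 for a single field."""

    def __init__(self, window: int = 500):
        # deque(maxlen=...) silently drops the oldest event once the window is full
        self._events: Deque[str] = deque(maxlen=window)

    def add(self, feedback_type: str) -> None:
        self._events.append(feedback_type)

    def scores(self) -> Dict[str, float]:
        counts = Counter(self._events)
        # Assumed convention, not taken from the platform:
        tp = counts["confirmation"]                     # extracted and correct
        fp = counts["correction"] + counts["removal"]   # extracted but wrong or spurious
        fn = counts["correction"] + counts["addition"]  # correct value was not produced

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}
```

Keeping one `FieldMetrics` instance per (model version, field) pair would make it possible to compare providers on the same field.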

Document Quality Assessment

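Image Clarity (Laplacian Variance):

$$\sigma_L^2 = \operatorname{Var}\left(\nabla^2 I\right)$$

where $I$ = grayscale image. Higher variance indicates sharper edges. Blurry images have low $\sigma_L^2$.

Brightness Score:

$$B = 1 - \frac{|\mu - 128|}{128}$$

where $\mu$ = mean grayscale intensity. Optimal at 128 (mid-gray), penalizes too dark/light.

Weighted Quality Score:

$$Q = 0.20\,q_{\text{dpi}} + 0.25\,q_{\text{clarity}} + 0.15\,q_{\text{contrast}} + 0.10\,q_{\text{brightness}} + 0.10\,q_{\text{noise}} + 0.15\,q_{\text{text conf}} + 0.05\,q_{\text{text density}}$$

where each $q$ is the score for the corresponding metric. Documents with $Q < 0.5$ are flagged with specific improvement recommendations.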

HIPAA Compliance Framework

Healthcare document processing requires strict compliance. The platform implements comprehensive safeguards:

Administrative

  • 4-tier role-based access control
  • JWT authentication with expiration
  • Complete audit logging
  • Session timeout management

Technical

  • PHI access tracking per request
  • Security headers (HSTS, CSP, etc.)
  • Data integrity verification
  • Automated retention policies

Audit Trail Example

Every PHI access is logged with user, session, IP, action, resource, and data hash for integrity verification:

HIPAAAuditLog(
    user_id="reviewer_123",
    session_id="sess_abc",
    action="READ",
    resource_type="document",
    resource_id="doc_456",
    phi_accessed=True,
    ip_address="10.0.1.50",
    data_hash="sha256:a1b2c3..."
)

Production Capabilities

  • 500+: Documents/hour (4 workers)
  • 100+: Concurrent batch processing
  • 3: LLM providers supported
  • 15+: Database models

Technology Stack

Backend

  • FastAPI (async Python)
  • PostgreSQL + SQLAlchemy
  • Redis + Celery (queue)
  • Tesseract + EasyOCR

Frontend

  • React + TypeScript
  • Vite (build tooling)
  • Bootstrap components
  • Real-time updates

LLM Providers

  • Anthropic Claude (3 models)
  • OpenAI GPT-4/3.5
  • Azure OpenAI Service

Infrastructure

  • Docker + Docker Compose
  • Kubernetes-ready
  • Prometheus metrics
  • Swagger/OpenAPI docs

Technical Deep Dive

Key architectural decisions and trade-offs—the questions a senior engineer at a frontier AI lab would ask in a technical interview.

Each provider has a dedicated extraction method (_extract_with_anthropic, _extract_with_openai) that normalizes responses to a common format. The key insight is that every provider returns the model's JSON as text, but inside a different response wrapper:

  • Anthropic: response.content[0].text
  • OpenAI: response.choices[0].message.content
  • Azure: Same as OpenAI but with different auth headers

The _parse_extraction_result() method handles JSON cleanup (removing markdown fences) and field name normalization (mapping display names to internal names).
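A cleanup step of this kind might look like the sketch below. It assumes the model returns a single JSON object, optionally wrapped in a markdown fence. The standalone function name `parse_extraction_result` and the `DISPLAY_TO_INTERNAL` mapping are illustrative and not the platform's actual code.

```python
import json
import re
from typing import Any, Dict

# Hypothetical display-name -> internal-name mapping (illustrative only).
DISPLAY_TO_INTERNAL = {
    "Authorization Number": "authorization_number",
    "Denial Reason": "denial_reason",
}

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
_FENCED = re.compile(
    r"^" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_extraction_result(raw_text: str) -> Dict[str, Any]:
    """Strip an optional markdown fence, parse JSON, normalize field names."""
    text = raw_text.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)

    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object of field values")

    # Unknown keys pass through unchanged.
    return {DISPLAY_TO_INTERNAL.get(key, key): value for key, value in data.items()}
```

Letting the JSON error propagate, rather than returning an empty result, gives the caller a clear signal to retry or fall back to another provider.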

Fallback logic: If the primary provider fails, the system checks get_available_providers() and routes to the next available. Provider availability is determined by API key presence at initialization.
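The fallback order can be illustrated with a small, self-contained sketch. Everything here is an assumption except the two API key variable names that appear in the service excerpt: the Azure variable name, the `extract_with_fallback` function, and the shape of the `extractors` mapping are hypothetical, and this version of `get_available_providers` is a stand-in for the platform's own.

```python
import os
from typing import Any, Callable, Dict, List, Mapping, Optional

PROVIDER_KEY_ENV = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "azure_openai": "AZURE_OPENAI_API_KEY",  # assumed variable name
}

Extractor = Callable[[str], Dict[str, Any]]


def get_available_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """A provider counts as available when its API key is present."""
    env = os.environ if env is None else env
    return [name for name, key in PROVIDER_KEY_ENV.items() if env.get(key)]


def extract_with_fallback(
    ocr_text: str,
    primary: str,
    extractors: Mapping[str, Extractor],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Try the primary provider first, then each remaining available provider."""
    available = get_available_providers(env)
    order = [primary] + [p for p in available if p != primary]
    errors: Dict[str, str] = {}

    for provider in order:
        if provider not in available or provider not in extractors:
            continue
        try:
            result = extractors[provider](ocr_text)
            result["provider"] = provider  # record which provider answered
            return result
        except Exception as exc:  # any provider error triggers fallback
            errors[provider] = str(exc)

    raise RuntimeError(f"all providers failed or unavailable: {errors}")
```

Recording the answering provider in the result matters for feedback tracking, since a correction should be attributed to the model that actually produced the value.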

Confidence scoring uses a heuristic approach combining field completeness and format validation:

  • Base confidence: 0.8 for any non-empty extraction
  • Format validation boost: +0.1 if field matches expected pattern (dates, emails, phones)
  • Length penalty: Drops to 0.5 if value is <2 characters

Overall confidence is a weighted average: 80% from required fields, 20% from optional fields. This ensures missing required fields heavily impact the score.
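The heuristic is simple enough to sketch directly. The example below follows the numbers given above, but the field names, the regular expressions, the zero score for empty values, and the choice to skip the format boost for length-penalized values are all assumptions for illustration.

```python
import re
from typing import Dict, Iterable, Optional

# Illustrative patterns; the platform's real field names and patterns are not shown.
FORMAT_PATTERNS = {
    "service_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "contact_email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}


def field_confidence(name: str, value: Optional[str]) -> float:
    if value is None or not value.strip():
        return 0.0  # assumption: empty extractions score zero
    value = value.strip()
    if len(value) < 2:
        return 0.5  # length penalty
    score = 0.8     # base confidence for a non-empty extraction
    pattern = FORMAT_PATTERNS.get(name)
    if pattern and pattern.match(value):
        score += 0.1  # format validation boost
    return round(score, 2)


def overall_confidence(
    fields: Dict[str, Optional[str]],
    required: Iterable[str],
    optional: Iterable[str],
) -> float:
    def average(names: Iterable[str]) -> float:
        names = list(names)
        if not names:
            return 0.0  # assumption: an empty group contributes nothing
        return sum(field_confidence(n, fields.get(n)) for n in names) / len(names)

    # 80% weight on required fields, 20% on optional fields
    return round(0.8 * average(required) + 0.2 * average(optional), 4)
```

For example, with required fields scoring 0.9 and 0.8 and one empty optional field, the overall confidence is 0.8 × 0.85 + 0.2 × 0 = 0.68.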

Review Triggers (configurable via env vars):

  • MIN_CONFIDENCE_THRESHOLD=0.7 — Overall confidence below this triggers review
  • REQUIRED_FIELDS_THRESHOLD=0.8 — Any required field below this triggers review
  • Missing required fields — Always triggers review regardless of confidence
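The three triggers above can be combined into a single check, sketched here under stated assumptions. The function name `review_reasons` and its signature are hypothetical; the environment variable names and default values are the ones listed above.

```python
import os
from typing import Dict, Iterable, List, Optional


def review_reasons(
    confidence_scores: Dict[str, float],
    overall_confidence: float,
    required_fields: Iterable[str],
    extracted: Dict[str, Optional[str]],
    min_confidence: Optional[float] = None,
    required_min: Optional[float] = None,
) -> List[str]:
    """Return the reasons a document needs review; an empty list means none."""
    if min_confidence is None:
        min_confidence = float(os.getenv("MIN_CONFIDENCE_THRESHOLD", "0.7"))
    if required_min is None:
        required_min = float(os.getenv("REQUIRED_FIELDS_THRESHOLD", "0.8"))

    reasons: List[str] = []
    if overall_confidence < min_confidence:
        reasons.append("overall confidence below threshold")

    for name in required_fields:
        if not extracted.get(name):
            # Missing required fields always trigger review
            reasons.append(f"missing required field: {name}")
        elif confidence_scores.get(name, 0.0) < required_min:
            reasons.append(f"low confidence on required field: {name}")
    return reasons
```

Returning the reasons rather than a bare boolean lets the review UI tell the reviewer why a document was routed to them.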

The system captures human feedback with full context for future model fine-tuning:

Reward Shaping:

  • Confirmation (+1.0): Human verified model was correct
  • Correction (-0.5): Model extracted something, but wrong value
  • Addition (-0.8): Model missed field entirely (false negative)
  • Removal (-0.3): Model hallucinated a field (false positive)

Why these specific values? Missed fields (-0.8) are penalized more than wrong values (-0.5) because a wrong extraction at least indicates the model found the right location. Hallucinations (-0.3) are penalized less because they are easier to catch in review than missed fields.

Data captured for training: Each feedback record includes the original OCR context, model version, field name, original/corrected values, and reviewer ID. This enables:

  • Fine-tuning on correction pairs (original → corrected)
  • Prompt engineering improvements based on common error patterns
  • Per-field performance tracking to identify weak spots
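As a rough illustration of the first use, feedback records could be turned into correction pairs as sketched below. The `FeedbackRecord` dataclass mirrors the fields listed above but is a stand-in for the real database model, and the prompt/chosen/rejected output format is an assumption rather than the platform's training format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class FeedbackRecord:
    """Stand-in for the stored feedback row (hypothetical shape)."""
    field_name: str
    original_value: Optional[str]
    corrected_value: Optional[str]
    feedback_type: str
    ocr_context: str
    model_version: str


def build_training_pairs(records: Iterable[FeedbackRecord]) -> List[Dict[str, str]]:
    """Keep only records where the human supplied a better value."""
    pairs: List[Dict[str, str]] = []
    for record in records:
        if record.feedback_type not in ("correction", "addition"):
            continue  # confirmations and removals carry no corrected value
        pairs.append({
            "prompt": f"Field: {record.field_name}\nDocument text:\n{record.ocr_context}",
            "rejected": record.original_value or "",  # what the model produced
            "chosen": record.corrected_value or "",   # what the reviewer entered
            "model_version": record.model_version,
        })
    return pairs
```

Because the OCR context contains PHI, any export like this would need the same access controls and audit logging as the documents themselves.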

HIPAA compliance is implemented via middleware that intercepts every request:

  • PHI endpoint detection: Routes like /documents, /upload, /review are flagged as PHI-accessing
  • Audit logging: Every PHI access creates a HIPAAAuditLog record with user, session, IP, action (CRUD), resource type/ID, and data hash
  • Data integrity: SHA-256 hash of access metadata enables tamper detection
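The integrity hash can be sketched with the standard library alone. The function names and the choice of canonical JSON serialization are assumptions; the write-up only states that a SHA-256 hash of the access metadata is stored.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def compute_access_hash(metadata: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding of the access metadata."""
    # sort_keys makes the hash independent of dictionary insertion order
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def verify_access_hash(metadata: Dict[str, Any], stored_hash: str) -> bool:
    """Recompute the hash and compare in constant time."""
    return hmac.compare_digest(compute_access_hash(metadata), stored_hash)
```

A plain hash only detects edits made by someone who cannot also rewrite the stored hash. Where that threat matters, a keyed construction such as HMAC with a secret held outside the database would be the usual strengthening.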

Security Headers (added to every response):

  • X-Frame-Options: DENY — Prevents clickjacking
  • Strict-Transport-Security: max-age=31536000 — Forces HTTPS
  • Content-Security-Policy: default-src 'self' — Prevents XSS
  • X-Content-Type-Options: nosniff — Prevents MIME sniffing
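The header set above fits in one small helper, sketched here independently of any web framework. The function name `apply_security_headers` is hypothetical; in a FastAPI application it would typically be called on `response.headers` inside an HTTP middleware.

```python
from typing import MutableMapping

# Header values as listed in this write-up.
SECURITY_HEADERS = {
    "X-Frame-Options": "DENY",
    "Strict-Transport-Security": "max-age=31536000",
    "Content-Security-Policy": "default-src 'self'",
    "X-Content-Type-Options": "nosniff",
}


def apply_security_headers(headers: MutableMapping[str, str]) -> MutableMapping[str, str]:
    """Set every security header, overwriting any value set upstream."""
    for name, value in SECURITY_HEADERS.items():
        headers[name] = value
    return headers
```

Overwriting rather than skipping existing values is a deliberate choice in this sketch, so that a single route cannot accidentally weaken the policy.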

Role-based access: 4-tier system (Admin, Supervisor, Reviewer, Viewer) with permission checks before PHI access. Supervisors can access quality checks; Reviewers can only access assigned documents.
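A permission check along these lines is sketched below. The ordering of the four tiers, the function names, and the decision that Viewers get no document-level PHI access are assumptions; the write-up only states the two rules for Supervisors and Reviewers.

```python
from enum import IntEnum
from typing import Optional


class Role(IntEnum):
    # Assumed ordering from least to most privileged
    VIEWER = 1
    REVIEWER = 2
    SUPERVISOR = 3
    ADMIN = 4


def can_access_document(role: Role, user_id: str, assigned_reviewer_id: Optional[str]) -> bool:
    """Decide whether a user may open a document containing PHI."""
    if role >= Role.SUPERVISOR:
        return True
    if role == Role.REVIEWER:
        # Reviewers can only access documents assigned to them
        return assigned_reviewer_id is not None and user_id == assigned_reviewer_id
    return False  # assumption: Viewers have no document-level PHI access


def can_access_quality_checks(role: Role) -> bool:
    """Quality checks are limited to Supervisors and above."""
    return role >= Role.SUPERVISOR
```

Running the check before the audit log entry is written would let denied attempts be logged as well as successful reads.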

Document quality assessment uses computer vision metrics before OCR:

Image Quality Metrics:

  • DPI: Extracted from image metadata, minimum 200 recommended
  • Clarity: Laplacian variance (cv2.Laplacian) — higher = sharper
  • Contrast: Standard deviation of grayscale values
  • Brightness: Distance from optimal mean (128) — penalizes too dark/light
  • Noise: Difference between original and median-filtered image

Text Quality Metrics (via OCR):

  • Text density: Ratio of detected text blocks to total blocks
  • Average confidence: Mean OCR confidence across all words
  • Readable ratio: Percentage of words with confidence >60%

Weighted overall score: DPI (20%), Clarity (25%), Contrast (15%), Brightness (10%), Noise (10%), Text confidence (15%), Text density (5%). Documents below MIN_QUALITY_THRESHOLD=0.5 are flagged with specific recommendations (e.g., "Scan at higher resolution", "Ensure document is in focus").
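The weighting can be written out as a short sketch. It assumes every metric has already been normalized to the 0..1 range with higher meaning better (so the noise entry is a cleanliness score); how the platform normalizes raw values such as Laplacian variance is not described here. The metric key names and function names are illustrative.

```python
from typing import Dict

# Weights as listed above; they sum to 1.0.
QUALITY_WEIGHTS = {
    "dpi": 0.20,
    "clarity": 0.25,
    "contrast": 0.15,
    "brightness": 0.10,
    "noise": 0.10,
    "text_confidence": 0.15,
    "text_density": 0.05,
}
MIN_QUALITY_THRESHOLD = 0.5


def brightness_score(mean_gray: float) -> float:
    """1.0 at mid-gray (128), falling linearly toward 0.0 at black."""
    return 1.0 - abs(mean_gray - 128.0) / 128.0


def overall_quality_score(metrics: Dict[str, float]) -> float:
    """Weighted sum of normalized metric scores."""
    missing = set(QUALITY_WEIGHTS) - set(metrics)
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    # Clamp each score to [0, 1] so one outlier cannot dominate
    return sum(
        weight * min(max(metrics[name], 0.0), 1.0)
        for name, weight in QUALITY_WEIGHTS.items()
    )


def is_flagged(metrics: Dict[str, float]) -> bool:
    return overall_quality_score(metrics) < MIN_QUALITY_THRESHOLD
```

Because clarity carries the largest weight, a sharp but low-resolution scan can still pass, while a blurry high-resolution one is more likely to be flagged.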

Design Trade-offs & Limitations

Heuristic confidence vs. model-based: Current confidence scoring uses rule-based heuristics rather than a trained confidence model. This was a deliberate choice for interpretability and ease of tuning, but a learned confidence estimator could be more accurate.

Offline RL feedback: Feedback is captured but not used for real-time model updates. The system generates training data for offline fine-tuning rather than online learning, which is appropriate for healthcare, where model changes require validation.

OCR-first architecture: The system OCRs entire documents before LLM extraction. An alternative would be vision-language models (GPT-4V, Claude 3) that process images directly. The OCR-first approach was chosen for cost efficiency and to leverage existing Tesseract infrastructure.

Why It Matters

This project demonstrates several capabilities relevant to frontier AI labs:

  • Production ML Engineering: Not a prototype—full-stack application with database models, async processing, monitoring, and deployment configuration.
  • LLM Application Architecture: Multi-provider abstraction, prompt engineering for structured extraction, confidence scoring, and graceful degradation.
  • Human-in-the-Loop ML: Systematic capture of human feedback with reward shaping for continuous improvement—the foundation of RLHF.
  • Healthcare Domain Expertise: HIPAA compliance isn't an afterthought—it's built into the architecture with comprehensive audit logging and access controls.

Portfolio Context

This project complements the Clinical RAG Platform by demonstrating a different aspect of healthcare ML: document processing vs. knowledge retrieval. Together, they show breadth across the healthcare AI stack—from unstructured document ingestion to structured knowledge access.