AI Document Processing Pipeline - Solution Overview
What this project solves
A business received large volumes of documents such as invoices, application forms, PDFs, emails, and structured uploads. The manual process was slow, error-prone, and operationally expensive. The goal was to build a backend pipeline that could ingest files, extract structured data with AI, validate it, route low-confidence cases to human review, and push clean results into downstream systems.
What had to be true
- Large files could not block the request path
- The system had to support async processing and retries
- Extraction needed confidence scoring, not blind trust
- Bad rows or bad fields should not poison the whole job
- Users needed status tracking and downloadable results
- The pipeline needed audit logs and replay tools
Stack
- Go
- AWS
- PostgreSQL
- Redis
- S3
- SQS
- Textract or OCR provider
- OpenAI/Anthropic
- ECS Fargate
- CloudWatch
Solution in plain English
The system accepted files through an API, stored them in S3, and created processing jobs. Workers then picked up the jobs asynchronously, extracted text, ran LLM-based field extraction, validated the output, and either:
- auto-applied high-confidence results
- or created review tasks for a human operator
This kept the upload fast, the processing resilient, and the outputs trustworthy.
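One way to picture the flow described above is as a small state machine over job status. The sketch below is an assumption: the status names and the allowed transitions are invented for illustration and are not taken from the project.

```go
package main

import "fmt"

// JobStatus values are hypothetical; the write-up does not list the real ones.
type JobStatus string

const (
	StatusUploaded    JobStatus = "uploaded"
	StatusOCRDone     JobStatus = "ocr_done"
	StatusExtracted   JobStatus = "extracted"
	StatusNeedsReview JobStatus = "needs_review"
	StatusValidated   JobStatus = "validated"
	StatusApplied     JobStatus = "applied"
	StatusFailed      JobStatus = "failed"
)

// allowed maps each status to the statuses a job may move to next.
// Applied and failed are terminal, so they have no entry.
var allowed = map[JobStatus][]JobStatus{
	StatusUploaded:    {StatusOCRDone, StatusFailed},
	StatusOCRDone:     {StatusExtracted, StatusFailed},
	StatusExtracted:   {StatusValidated, StatusNeedsReview, StatusFailed},
	StatusNeedsReview: {StatusValidated, StatusFailed},
	StatusValidated:   {StatusApplied, StatusFailed},
}

// CanTransition reports whether a job may move from one status to another.
func CanTransition(from, to JobStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(StatusExtracted, StatusNeedsReview)) // true
	fmt.Println(CanTransition(StatusUploaded, StatusApplied))      // false
}
```

Checking transitions in one place makes it harder for a retried or replayed message to push a job backwards or skip validation.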
High-level architecture
End-to-end processing flow
API flow
Worker flow
Core components
api-service
Handles:
- upload requests
- job creation
- status endpoints
- review UI endpoints
- audit endpoints
ocr-worker
Handles:
- OCR
- page/text extraction
- document splitting if needed
extraction-worker
Handles:
- LLM prompts for extraction
- schema mapping
- confidence scoring
- normalized output
validation-worker
Handles:
- business rule validation
- duplicate checks
- field constraints
- routing to review or auto-apply
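A rough sketch of field-level validation follows. It checks each field independently and collects every issue instead of stopping at the first one, so a single bad field does not fail the whole document. The field names, the date format, and the `validateInvoice` function are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// FieldIssue records one failed check so a reviewer can see exactly what is wrong.
type FieldIssue struct {
	Field  string
	Reason string
}

// validateInvoice runs every check and returns all issues found.
// Field names are hypothetical. A production system would likely use a
// decimal type for money; float parsing here only checks the shape.
func validateInvoice(fields map[string]string) []FieldIssue {
	var issues []FieldIssue

	if fields["invoice_number"] == "" {
		issues = append(issues, FieldIssue{"invoice_number", "required field is missing"})
	}

	amount, err := strconv.ParseFloat(fields["total_amount"], 64)
	if err != nil {
		issues = append(issues, FieldIssue{"total_amount", "not a number"})
	} else if amount < 0 {
		issues = append(issues, FieldIssue{"total_amount", "must not be negative"})
	}

	if _, err := time.Parse("2006-01-02", fields["due_date"]); err != nil {
		issues = append(issues, FieldIssue{"due_date", "expected YYYY-MM-DD"})
	}

	return issues
}

func main() {
	issues := validateInvoice(map[string]string{
		"invoice_number": "INV-001",
		"total_amount":   "abc",
		"due_date":       "2024-13-01",
	})
	for _, issue := range issues {
		fmt.Printf("%s: %s\n", issue.Field, issue.Reason)
	}
}
```

Returning a list rather than a single error gives the review queue enough detail to highlight each problem field.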
apply-worker
Handles:
- write-back to target systems
- webhooks
- exports
- idempotent updates
Data model
| Table | Purpose |
|---|---|
| document_jobs | Top-level job state |
| document_files | File metadata and storage pointer |
| document_pages | OCR text or page slices |
| extraction_runs | AI extraction attempts |
| extracted_fields | Structured output per field |
| validation_results | Pass/fail data |
| review_tasks | Human review queue |
| exports | Downstream write-back state |
| audit_logs | System and operator history |
Example field extraction shape
For a contract or invoice, the extracted fields might include:
- supplier name
- invoice number
- total amount
- due date
- currency
- line item count
- tax amount
- customer reference
For a CV, the extracted fields might include:
- name
- phone
- skills
- years of experience
- notice period
The same pipeline pattern works for both document types.
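To make the field lists above concrete, here is one possible shape for an extraction result, decoded from JSON. The struct names, JSON keys, and sample values are all made up for illustration; the project's actual schema is not shown in this write-up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExtractedField is a hypothetical per-field record: a name, the raw value
// as text, and a confidence score between 0 and 1.
type ExtractedField struct {
	Name       string  `json:"name"`
	Value      string  `json:"value"`
	Confidence float64 `json:"confidence"`
}

// Extraction groups the fields for one document. The same shape can carry
// invoice fields or CV fields; only the field names differ.
type Extraction struct {
	DocumentType string           `json:"document_type"`
	Fields       []ExtractedField `json:"fields"`
}

// parseExtraction decodes a JSON payload into an Extraction.
func parseExtraction(raw []byte) (Extraction, error) {
	var e Extraction
	if err := json.Unmarshal(raw, &e); err != nil {
		return Extraction{}, fmt.Errorf("decode extraction: %w", err)
	}
	return e, nil
}

func main() {
	// Sample payload with invented values.
	raw := []byte(`{
		"document_type": "invoice",
		"fields": [
			{"name": "invoice_number", "value": "INV-001", "confidence": 0.97},
			{"name": "due_date", "value": "2024-05-31", "confidence": 0.62}
		]
	}`)
	e, err := parseExtraction(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range e.Fields {
		fmt.Printf("%s=%s (%.2f)\n", f.Name, f.Value, f.Confidence)
	}
}
```

Keeping values as strings at this stage leaves type checks to validation, where a malformed value can be flagged per field.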
Confidence and review model
Auto-apply path
If the extraction is complete and passes both confidence and validation thresholds, the result can be applied automatically.
Review path
If the confidence is weak, required fields are missing, or validation fails, the document lands in a review queue with highlighted fields and suggested values.
Sending uncertain documents to a person, instead of rejecting them or applying them blindly, is what makes the pipeline usable in real operations.
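The two paths above can be sketched as a single routing function. This is a minimal illustration under assumptions: the `Field` type, the `route` function, and the 0.9 threshold used in the example are hypothetical, and real thresholds would likely vary per field or per document type.

```go
package main

import "fmt"

// Field is a hypothetical extracted field with its confidence score.
type Field struct {
	Name       string
	Value      string
	Confidence float64
	Required   bool
}

type Decision string

const (
	AutoApply Decision = "auto_apply"
	Review    Decision = "review"
)

// route returns the decision plus the names of fields a reviewer should look
// at. A document is auto-applied only if validation passed, no required field
// is empty, and every field meets the confidence threshold.
func route(fields []Field, validationPassed bool, minConfidence float64) (Decision, []string) {
	var flagged []string
	for _, f := range fields {
		if (f.Required && f.Value == "") || f.Confidence < minConfidence {
			flagged = append(flagged, f.Name)
		}
	}
	if len(flagged) > 0 || !validationPassed {
		return Review, flagged
	}
	return AutoApply, nil
}

func main() {
	fields := []Field{
		{Name: "invoice_number", Value: "INV-001", Confidence: 0.97, Required: true},
		{Name: "due_date", Value: "2024-05-31", Confidence: 0.62},
	}
	decision, flagged := route(fields, true, 0.9)
	fmt.Println(decision, flagged)
}
```

Returning the flagged field names alongside the decision gives the review screen what it needs to highlight the weak fields.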
Example worker code
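```go
import (
	"context"
	"fmt"
)

// Handle runs LLM field extraction for one job, stores the result,
// and queues the job for validation.
func (w *ExtractionWorker) Handle(ctx context.Context, jobID string) error {
	// Load the OCR text and the extraction template for this job.
	doc, err := w.Documents.Get(ctx, jobID)
	if err != nil {
		return fmt.Errorf("load document for job %s: %w", jobID, err)
	}

	prompt := w.Prompts.BuildExtractionPrompt(doc.Text, doc.Template)
	result, err := w.LLM.Extract(ctx, prompt)
	if err != nil {
		return fmt.Errorf("extract fields for job %s: %w", jobID, err)
	}

	// Persist fields and confidence before handing off to validation.
	if err := w.Extractions.Store(ctx, jobID, result.Fields, result.Confidence); err != nil {
		return fmt.Errorf("store extraction for job %s: %w", jobID, err)
	}

	return w.Queue.PublishValidationTask(ctx, jobID)
}
```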