AI Document Processing Pipeline - Solution Overview

What this project solves

A business received large volumes of documents such as invoices, application forms, PDFs, emails, and structured uploads. The manual process was slow, error-prone, and operationally expensive. The goal was to build a backend pipeline that could ingest files, extract structured data with AI, validate it, route low-confidence cases to human review, and push clean results into downstream systems.

What had to be true

Large files could not block the request path
The system had to support async processing and retries
Extraction needed confidence scoring, not blind trust
Bad rows or bad fields should not poison the whole job
Users needed status tracking and downloadable results
The pipeline needed audit logs and replay tools
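
The retry requirement can be sketched as a small backoff policy. This is a minimal illustration, not the project's actual code: `RetryPolicy`, `NextDelay`, and the delay values are assumptions. With SQS, the returned delay could be applied by changing the message's visibility timeout, and exhausted tasks could go to a dead-letter queue for replay.

```go
package main

import "time"

// RetryPolicy is a hypothetical policy type; the field values are illustrative.
type RetryPolicy struct {
	BaseDelay   time.Duration // wait before the first retry
	MaxDelay    time.Duration // upper bound on any single wait
	MaxAttempts int           // total attempts allowed, including the first
}

// NextDelay reports how long to wait after the given failed attempt (1-based).
// It returns false once attempts are exhausted, at which point the task should
// be parked for inspection and replay instead of being retried again.
func (p RetryPolicy) NextDelay(failedAttempt int) (time.Duration, bool) {
	if failedAttempt < 1 || failedAttempt >= p.MaxAttempts {
		return 0, false
	}
	delay := p.BaseDelay
	for i := 1; i < failedAttempt; i++ {
		delay *= 2 // exponential growth: base, 2x base, 4x base, ...
		if delay >= p.MaxDelay {
			return p.MaxDelay, true
		}
	}
	if delay > p.MaxDelay {
		delay = p.MaxDelay
	}
	return delay, true
}
```

With a base of one second, a cap of eight seconds, and five attempts, the waits are 1s, 2s, 4s, and 8s, after which the task is no longer retried.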

Stack

Go
AWS
PostgreSQL
Redis
S3
SQS
Amazon Textract or another OCR provider
OpenAI / Anthropic
ECS Fargate
CloudWatch

Solution in plain English

The system accepted files through an API, stored them in S3, and created processing jobs. Workers then picked up the jobs asynchronously, extracted text, ran LLM-based field extraction, validated the output, and either:

auto-applied high-confidence results
or created review tasks for a human operator

This kept the upload fast, the processing resilient, and the outputs trustworthy.

High-level architecture

flowchart LR
    U[User / Ops Team] --> API[Go API]
    API --> AUTH[Auth]
    API --> S3[S3]
    API --> PG[(PostgreSQL)]
    API --> REDIS[Redis]
    API --> Q[SQS]
    Q --> OCR[OCR Worker]
    OCR --> S3
    OCR --> PG
    OCR --> Q
    Q --> EXTRACT[AI Extraction Worker]
    EXTRACT --> LLM[LLM Provider]
    EXTRACT --> PG
    EXTRACT --> Q
    Q --> VALIDATE[Validation Worker]
    VALIDATE --> PG
    VALIDATE --> Q
    Q --> APPLY[Apply / Export Worker]
    APPLY --> CRM[CRM / ERP / Internal API]
    APPLY --> PG
    ADMIN[Admin / Review UI] --> API

End-to-end processing flow

flowchart TD
    A[Upload File] --> B[Store in S3]
    B --> C[Create Processing Job]
    C --> D[OCR / Text Extraction]
    D --> E[LLM Field Extraction]
    E --> F[Validation Rules]
    F --> G{Confidence High Enough?}
    G -- Yes --> H[Apply to System]
    G -- No --> I[Send to Human Review]
    H --> J[Mark Job Completed]
    I --> K[Reviewer Decision]
    K --> H

API flow

sequenceDiagram
    participant U as User
    participant API as Go API
    participant S3 as S3
    participant PG as PostgreSQL
    participant Q as SQS
    U->>API: POST /documents/upload
    API->>S3: Store file
    API->>PG: Create job
    API->>Q: Queue OCR task
    API-->>U: job_id + status URL
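
The upload sequence above could look roughly like this in Go. It is a sketch under stated assumptions: the `FileStore`, `JobStore`, and `TaskQueue` interfaces, the `filename` query parameter, the 32 MiB cap, and the `/documents/{job_id}/status` URL shape are all hypothetical, and context plumbing is left out to keep it short. The in-memory `memory` type stands in for S3, PostgreSQL, and SQS.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// Hypothetical ports standing in for S3, PostgreSQL and SQS.
type FileStore interface {
	Put(name string, data []byte) (key string, err error)
}
type JobStore interface {
	Create(fileKey string) (jobID string, err error)
}
type TaskQueue interface {
	Enqueue(taskType, jobID string) error
}

var ErrEmptyUpload = errors.New("empty upload")

type UploadResponse struct {
	JobID     string `json:"job_id"`
	StatusURL string `json:"status_url"`
}

type UploadService struct {
	Files FileStore
	Jobs  JobStore
	Queue TaskQueue
}

// Accept stores the file, records a job and queues the OCR task. No OCR or
// LLM work runs here, so the request path stays short.
func (s *UploadService) Accept(name string, data []byte) (UploadResponse, error) {
	if len(data) == 0 {
		return UploadResponse{}, ErrEmptyUpload
	}
	key, err := s.Files.Put(name, data)
	if err != nil {
		return UploadResponse{}, fmt.Errorf("store file: %w", err)
	}
	jobID, err := s.Jobs.Create(key)
	if err != nil {
		return UploadResponse{}, fmt.Errorf("create job: %w", err)
	}
	if err := s.Queue.Enqueue("ocr", jobID); err != nil {
		return UploadResponse{}, fmt.Errorf("queue ocr task: %w", err)
	}
	return UploadResponse{JobID: jobID, StatusURL: "/documents/" + jobID + "/status"}, nil
}

// ServeHTTP adapts Accept to POST /documents/upload and replies 202 Accepted.
func (s *UploadService) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	data, err := io.ReadAll(http.MaxBytesReader(w, r.Body, 32<<20))
	if err != nil {
		http.Error(w, "upload too large or unreadable", http.StatusBadRequest)
		return
	}
	resp, err := s.Accept(r.URL.Query().Get("filename"), data)
	switch {
	case errors.Is(err, ErrEmptyUpload):
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	case err != nil:
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusAccepted)
	_ = json.NewEncoder(w).Encode(resp)
}

// memory is an in-memory stand-in for all three ports, for local experiments.
type memory struct {
	files map[string][]byte
	tasks []string
}

func (m *memory) Put(name string, data []byte) (string, error) {
	key := fmt.Sprintf("uploads/%d-%s", len(m.files)+1, name)
	m.files[key] = data
	return key, nil
}

func (m *memory) Create(fileKey string) (string, error) {
	return fmt.Sprintf("job-%d", len(m.files)), nil
}

func (m *memory) Enqueue(taskType, jobID string) error {
	m.tasks = append(m.tasks, taskType+":"+jobID)
	return nil
}
```

Replying with 202 Accepted instead of 200 signals that the work has been queued, not finished, which matches the status-URL pattern in the diagram.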

Worker flow

sequenceDiagram
    participant Q as Queue
    participant OCR as OCR Worker
    participant AI as AI Worker
    participant VAL as Validation Worker
    participant PG as PostgreSQL
    Q->>OCR: OCR task
    OCR->>PG: Store extracted text
    OCR->>Q: Queue extraction task
    Q->>AI: Extraction task
    AI->>PG: Store extracted fields + confidence
    AI->>Q: Queue validation task
    Q->>VAL: Validation task
    VAL->>PG: Store validation result
    VAL->>Q: Queue apply or review task
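
One way to express this stage-to-stage hand-off is a dispatcher that decodes a queue message, runs the matching stage, and returns the follow-up task to publish. The sketch below is illustrative only: the `Task` message shape, the task type names, and the `Dispatcher` type are assumptions, not the project's actual wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Task is a hypothetical queue message body; the field names are assumptions.
type Task struct {
	Type  string `json:"type"` // for example "ocr", "extraction", "validation"
	JobID string `json:"job_id"`
}

// Handler runs one stage and names the task type to queue next ("" means stop).
type Handler func(jobID string) (next string, err error)

type Dispatcher struct {
	handlers map[string]Handler
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{handlers: map[string]Handler{}}
}

func (d *Dispatcher) Register(taskType string, h Handler) {
	d.handlers[taskType] = h
}

// Dispatch decodes one message, runs its stage, and returns the follow-up
// task to publish, or nil when the stage is terminal.
func (d *Dispatcher) Dispatch(body []byte) (*Task, error) {
	var t Task
	if err := json.Unmarshal(body, &t); err != nil {
		return nil, fmt.Errorf("decode task: %w", err)
	}
	h, ok := d.handlers[t.Type]
	if !ok {
		return nil, fmt.Errorf("no handler for task type %q", t.Type)
	}
	next, err := h(t.JobID)
	if err != nil {
		return nil, fmt.Errorf("%s task for job %s: %w", t.Type, t.JobID, err)
	}
	if next == "" {
		return nil, nil
	}
	return &Task{Type: next, JobID: t.JobID}, nil
}
```

Because each stage only names its successor, a stage can be replayed on its own by re-sending its task message.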

Core components

`api-service`

Handles:

upload requests
job creation
status endpoints
review UI endpoints
audit endpoints

`ocr-worker`

Handles:

OCR
page/text extraction
document splitting if needed

`extraction-worker`

Handles:

LLM prompts for extraction
schema mapping
confidence scoring
normalized output
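
Schema mapping and normalization could be sketched as follows. The JSON reply shape (a `value` and a `confidence` per field) is an assumption about what the prompt asks the model to return, and `ParseExtraction` is a hypothetical helper. A model's self-reported confidence is only one possible signal and may need to be combined with others.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Field is one extracted value with its confidence score in [0, 1].
type Field struct {
	Value      string  `json:"value"`
	Confidence float64 `json:"confidence"`
}

// ParseExtraction maps a raw LLM reply onto the expected schema. Keys that are
// not in the schema are dropped, values are trimmed, empty values are omitted,
// and out-of-range confidences are rejected.
func ParseExtraction(raw string, schema []string) (map[string]Field, error) {
	var reply map[string]Field
	if err := json.Unmarshal([]byte(raw), &reply); err != nil {
		return nil, fmt.Errorf("reply is not valid JSON: %w", err)
	}
	fields := make(map[string]Field, len(schema))
	for _, name := range schema {
		f, ok := reply[name]
		if !ok {
			continue
		}
		f.Value = strings.TrimSpace(f.Value)
		if f.Value == "" {
			continue
		}
		if f.Confidence < 0 || f.Confidence > 1 {
			return nil, fmt.Errorf("field %q: confidence %v outside [0, 1]", name, f.Confidence)
		}
		fields[name] = f
	}
	return fields, nil
}
```

Dropping unknown keys at this boundary keeps anything the model invents out of the stored fields.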

`validation-worker`

Handles:

business rule validation
duplicate checks
field constraints
routing to review or auto-apply
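
Field-level validation can collect every failure instead of stopping at the first one, so a single bad field does not poison the whole job. The rule set below is an illustrative assumption (field names, ISO date format, three-letter currency code); real rules depend on the business.

```go
package main

import (
	"errors"
	"math"
	"regexp"
	"sort"
	"strconv"
	"time"
)

// Rule checks one field value and returns a reason when it fails.
type Rule func(value string) error

// Failure records which field failed and why, so a reviewer can see it.
type Failure struct {
	Field  string
	Reason string
}

var currencyCode = regexp.MustCompile(`^[A-Z]{3}$`)

// InvoiceRules is a hypothetical rule set for invoice documents.
var InvoiceRules = map[string]Rule{
	"total_amount": func(v string) error {
		n, err := strconv.ParseFloat(v, 64)
		if err != nil || math.IsNaN(n) || math.IsInf(n, 0) {
			return errors.New("not a number")
		}
		if n < 0 {
			return errors.New("negative amount")
		}
		return nil
	},
	"due_date": func(v string) error {
		if _, err := time.Parse("2006-01-02", v); err != nil {
			return errors.New("not an ISO date (YYYY-MM-DD)")
		}
		return nil
	},
	"currency": func(v string) error {
		if !currencyCode.MatchString(v) {
			return errors.New("not a three-letter uppercase code")
		}
		return nil
	},
}

// Validate runs every rule and returns all failures, sorted by field name.
// A field with a rule but no value is reported as "missing".
func Validate(fields map[string]string, rules map[string]Rule) []Failure {
	var failures []Failure
	for name, rule := range rules {
		value, ok := fields[name]
		if !ok {
			failures = append(failures, Failure{Field: name, Reason: "missing"})
			continue
		}
		if err := rule(value); err != nil {
			failures = append(failures, Failure{Field: name, Reason: err.Error()})
		}
	}
	sort.Slice(failures, func(i, j int) bool { return failures[i].Field < failures[j].Field })
	return failures
}
```

An empty result means the document passed every rule; a non-empty one gives the review UI a per-field list of what to highlight.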

`apply-worker`

Handles:

write-back to target systems
webhooks
exports
idempotent updates
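
Idempotent updates matter because queues can deliver the same task more than once. A minimal sketch, assuming a key derived from the job, target, and payload: `IdempotencyKey` and `Applier` are hypothetical names, and the in-memory set stands in for something durable, such as a unique constraint on the `exports` table.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// IdempotencyKey derives a stable key from the job, the target system and the
// payload, so a redelivered task produces the same key as the first attempt.
func IdempotencyKey(jobID, target string, payload []byte) string {
	h := sha256.New()
	for _, part := range [][]byte{[]byte(jobID), []byte(target), payload} {
		h.Write(part)
		h.Write([]byte{0}) // separator, assuming IDs contain no zero bytes
	}
	return hex.EncodeToString(h.Sum(nil))
}

// Applier performs each distinct write at most once per process lifetime.
type Applier struct {
	mu    sync.Mutex
	done  map[string]bool
	write func(target string, payload []byte) error
}

func NewApplier(write func(target string, payload []byte) error) *Applier {
	return &Applier{done: map[string]bool{}, write: write}
}

// Apply returns true when the write was performed and false when it was
// skipped as a duplicate. A failed write is not recorded, so it can be retried.
// The lock is held during the write, which serializes writes in this sketch.
func (a *Applier) Apply(jobID, target string, payload []byte) (bool, error) {
	key := IdempotencyKey(jobID, target, payload)
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.done[key] {
		return false, nil
	}
	if err := a.write(target, payload); err != nil {
		return false, err
	}
	a.done[key] = true
	return true, nil
}
```

Where the target system accepts an idempotency key of its own, passing the same key downstream gives a second line of defense.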

Data model

Table	Purpose
`document_jobs`	Top-level job state
`document_files`	File metadata and storage pointer
`document_pages`	OCR text or page slices
`extraction_runs`	AI extraction attempts
`extracted_fields`	Structured output per field
`validation_results`	Pass/fail data
`review_tasks`	Human review queue
`exports`	Downstream write-back state
`audit_logs`	System and operator history
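
The top-level job state in `document_jobs` can be guarded by an explicit transition table. The status names below, including `failed`, are assumptions chosen to mirror the processing flow; the source does not list the actual values.

```go
package main

import "fmt"

// JobStatus is the top-level state of a job. The names are hypothetical.
type JobStatus string

const (
	StatusUploaded   JobStatus = "uploaded"
	StatusOCR        JobStatus = "ocr"
	StatusExtracting JobStatus = "extracting"
	StatusValidating JobStatus = "validating"
	StatusInReview   JobStatus = "in_review"
	StatusApplying   JobStatus = "applying"
	StatusCompleted  JobStatus = "completed"
	StatusFailed     JobStatus = "failed"
)

// allowed lists the legal next states for each state. Completed and failed
// are terminal, so they have no entries.
var allowed = map[JobStatus][]JobStatus{
	StatusUploaded:   {StatusOCR, StatusFailed},
	StatusOCR:        {StatusExtracting, StatusFailed},
	StatusExtracting: {StatusValidating, StatusFailed},
	StatusValidating: {StatusApplying, StatusInReview, StatusFailed},
	StatusInReview:   {StatusApplying},
	StatusApplying:   {StatusCompleted, StatusFailed},
}

// Transition returns the new status when the move is legal. Otherwise it
// returns the unchanged status and an error suitable for the audit log.
func Transition(from, to JobStatus) (JobStatus, error) {
	for _, next := range allowed[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("illegal transition %s -> %s", from, to)
}
```

Rejecting illegal moves in one place keeps a late or duplicated worker message from dragging a completed job back into an earlier state.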

Example field extraction shape

From a contract or invoice, the pipeline might extract:

supplier name
invoice number
total amount
due date
currency
line item count
tax amount
customer reference

From a CV, it might extract:

name
email
phone
skills
years of experience
notice period

The same pipeline pattern works for both; only the set of fields to extract changes.
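
One way to keep the pipeline generic is to describe each document type as a template of expected fields. The `Template` type and the split between required and optional fields below are assumptions for illustration; the field names follow the two lists above.

```go
package main

import "strings"

// Template lists the fields a document type is expected to yield.
type Template struct {
	Name     string
	Required []string
	Optional []string
}

// Which fields count as required is an assumption made for this sketch.
var InvoiceTemplate = Template{
	Name:     "invoice",
	Required: []string{"supplier_name", "invoice_number", "total_amount", "due_date", "currency"},
	Optional: []string{"line_item_count", "tax_amount", "customer_reference"},
}

var CVTemplate = Template{
	Name:     "cv",
	Required: []string{"name", "email"},
	Optional: []string{"phone", "skills", "years_of_experience", "notice_period"},
}

// MissingRequired returns the required fields that are absent or blank in the
// extracted values, in the order the template declares them.
func (t Template) MissingRequired(extracted map[string]string) []string {
	var missing []string
	for _, name := range t.Required {
		if strings.TrimSpace(extracted[name]) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}
```

Adding a new document type then means adding a template and a prompt, not a new worker.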

Confidence and review model

Auto-apply path

If the extraction is complete and passes both confidence and validation thresholds, the result can be applied automatically.

Review path

If the confidence is weak, required fields are missing, or validation fails, the document lands in a review queue with highlighted fields and suggested values.

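This fallback to human review is what makes the pipeline usable in real operations: uncertain results are never applied without a person checking them.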
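
The two paths reduce to a single routing decision. The sketch below assumes a per-field confidence in [0, 1] and one global threshold; `Decide`, `Outcome`, and the example threshold are hypothetical, and a real system might use per-field thresholds instead.

```go
package main

// Route is the outcome of the confidence and validation checks.
type Route string

const (
	RouteAutoApply Route = "auto_apply"
	RouteReview    Route = "review"
)

// Field is one extracted value with its confidence score.
type Field struct {
	Name       string
	Value      string
	Confidence float64
}

// Outcome carries the route and the fields a reviewer should look at first.
type Outcome struct {
	Route   Route
	Flagged []string
}

// Decide sends a document to review when any required field is missing, any
// field failed validation, or any field's confidence is below the threshold.
// Only a document with nothing flagged is auto-applied.
func Decide(fields []Field, missing, failed []string, threshold float64) Outcome {
	seen := map[string]bool{}
	var flagged []string
	flag := func(name string) {
		if !seen[name] {
			seen[name] = true
			flagged = append(flagged, name)
		}
	}
	for _, name := range missing {
		flag(name)
	}
	for _, name := range failed {
		flag(name)
	}
	for _, f := range fields {
		if f.Confidence < threshold {
			flag(f.Name)
		}
	}
	if len(flagged) == 0 {
		return Outcome{Route: RouteAutoApply}
	}
	return Outcome{Route: RouteReview, Flagged: flagged}
}
```

For example, with a threshold of 0.9, an invoice whose `total_amount` scored 0.62 goes to review with that one field flagged, while the same invoice is auto-applied at a threshold of 0.6.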

Example worker code

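// Handle runs LLM field extraction for one job and queues the validation step.
// The enclosing file needs to import "context" and "fmt".
func (w *ExtractionWorker) Handle(ctx context.Context, jobID string) error {
    // Load the document, including the text stored by the OCR stage.
    doc, err := w.Documents.Get(ctx, jobID)
    if err != nil {
        return fmt.Errorf("load document for job %s: %w", jobID, err)
    }

    // Build the prompt from the OCR text and the document's template.
    prompt := w.Prompts.BuildExtractionPrompt(doc.Text, doc.Template)

    // Ask the LLM for structured fields plus a confidence score.
    result, err := w.LLM.Extract(ctx, prompt)
    if err != nil {
        return fmt.Errorf("extract fields for job %s: %w", jobID, err)
    }

    // Persist the result before queueing the next stage, so the validation
    // worker always finds the extraction it is asked to check.
    if err := w.Extractions.Store(ctx, jobID, result.Fields, result.Confidence); err != nil {
        return fmt.Errorf("store extraction for job %s: %w", jobID, err)
    }

    return w.Queue.PublishValidationTask(ctx, jobID)
}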