AI Integration, Data Pipeline, Go, AWS
February 2025

AI Document Processing Pipeline

Async document ingestion pipeline with AI extraction, confidence scoring, and human review workflows.

5x faster processing, 35% manual workload reduction
Go, AWS, PostgreSQL, Redis, SQS, Textract, OpenAI, ECS Fargate

AI Document Processing Pipeline - Solution Overview

What this project solves

A business received large volumes of documents such as invoices, application forms, PDFs, emails, and structured uploads. The manual process was slow, error-prone, and operationally expensive. The goal was to build a backend pipeline that could ingest files, extract structured data with AI, validate it, route low-confidence cases to human review, and push clean results into downstream systems.

What had to be true

  • Large files could not block the request path
  • The system had to support async processing and retries
  • Extraction needed confidence scoring, not blind trust
  • Bad rows or bad fields should not poison the whole job
  • Users needed status tracking and downloadable results
  • The pipeline needed audit logs and replay tools

Stack

  • Go
  • AWS
  • PostgreSQL
  • Redis
  • S3
  • SQS
  • Textract or another OCR provider
  • OpenAI / Anthropic
  • ECS Fargate
  • CloudWatch

Solution in plain English

The system accepted files through an API, stored them in S3, and created processing jobs. Workers then picked up the jobs asynchronously, extracted text, ran LLM-based field extraction, validated the output, and either:

  • auto-applied high-confidence results
  • or created review tasks for a human operator

This kept the upload fast, the processing resilient, and the outputs trustworthy.

High-level architecture

flowchart LR
    U[User / Ops Team] --> API[Go API]
    API --> AUTH[Auth]
    API --> S3[S3]
    API --> PG[(PostgreSQL)]
    API --> REDIS[Redis]
    API --> Q[SQS]
    Q --> OCR[OCR Worker]
    OCR --> S3
    OCR --> PG
    OCR --> Q
    Q --> EXTRACT[AI Extraction Worker]
    EXTRACT --> LLM[LLM Provider]
    EXTRACT --> PG
    EXTRACT --> Q
    Q --> VALIDATE[Validation Worker]
    VALIDATE --> PG
    VALIDATE --> Q
    Q --> APPLY[Apply / Export Worker]
    APPLY --> CRM[CRM / ERP / Internal API]
    APPLY --> PG
    ADMIN[Admin / Review UI] --> API

End-to-end processing flow

flowchart TD
    A[Upload File] --> B[Store in S3]
    B --> C[Create Processing Job]
    C --> D[OCR / Text Extraction]
    D --> E[LLM Field Extraction]
    E --> F[Validation Rules]
    F --> G{Confidence High Enough?}
    G -- Yes --> H[Apply to System]
    G -- No --> I[Send to Human Review]
    H --> J[Mark Job Completed]
    I --> K[Reviewer Decision]
    K --> H

API flow

sequenceDiagram
    participant U as User
    participant API as Go API
    participant S3 as S3
    participant PG as PostgreSQL
    participant Q as SQS
    U->>API: POST /documents/upload
    API->>S3: Store file
    API->>PG: Create job
    API->>Q: Queue OCR task
    API-->>U: job_id + status URL

Worker flow

sequenceDiagram
    participant Q as Queue
    participant OCR as OCR Worker
    participant AI as AI Worker
    participant VAL as Validation Worker
    participant PG as PostgreSQL
    Q->>OCR: OCR task
    OCR->>PG: Store extracted text
    OCR->>Q: Queue extraction task
    Q->>AI: Extraction task
    AI->>PG: Store extracted fields + confidence
    AI->>Q: Queue validation task
    Q->>VAL: Validation task
    VAL->>PG: Store validation result
    VAL->>Q: Queue apply or review task
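Each worker finishes by queueing the next task, so the chain can be written as one lookup. The sketch below assumes hypothetical stage names; the real task types are not shown in this write-up.

```go
package main

import "fmt"

// Stage is a hypothetical label for a task type in the worker flow.
type Stage string

const (
	StageOCR      Stage = "ocr"
	StageExtract  Stage = "extract"
	StageValidate Stage = "validate"
	StageApply    Stage = "apply"
	StageReview   Stage = "review"
)

// NextStage returns the task to queue after a stage finishes.
// passed only matters after validation, where the flow branches
// into apply or review.
func NextStage(done Stage, passed bool) (Stage, error) {
	switch done {
	case StageOCR:
		return StageExtract, nil
	case StageExtract:
		return StageValidate, nil
	case StageValidate:
		if passed {
			return StageApply, nil
		}
		return StageReview, nil
	case StageReview:
		// A reviewer decision feeds back into the apply step.
		return StageApply, nil
	default:
		return "", fmt.Errorf("no stage follows %q", done)
	}
}
```

Keeping the transitions in one place makes replay simpler: a job can be re-queued at any stage and will follow the same path from there.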

Core components

api-service

Handles:

  • upload requests
  • job creation
  • status endpoints
  • review UI endpoints
  • audit endpoints

ocr-worker

Handles:

  • OCR
  • page/text extraction
  • document splitting if needed

extraction-worker

Handles:

  • LLM prompts for extraction
  • schema mapping
  • confidence scoring
  • normalized output

validation-worker

Handles:

  • business rule validation
  • duplicate checks
  • field constraints
  • routing to review or auto-apply
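Field constraints can be checked one field at a time, so a single bad value is recorded instead of failing the whole job. The sketch below is a minimal illustration; the `Rule` type, the two example rules, and the field names are assumptions.

```go
package main

import (
	"errors"
	"math"
	"sort"
	"strconv"
	"strings"
)

// FieldResult records the outcome for one field.
type FieldResult struct {
	Name  string
	Valid bool
	Error string
}

// Rule checks one raw field value. Hypothetical type for this sketch.
type Rule func(value string) error

// Required rejects empty or whitespace-only values.
func Required(v string) error {
	if strings.TrimSpace(v) == "" {
		return errors.New("value is required")
	}
	return nil
}

// NonNegativeAmount accepts finite numbers that are zero or greater.
func NonNegativeAmount(v string) error {
	n, err := strconv.ParseFloat(strings.TrimSpace(v), 64)
	if err != nil || math.IsNaN(n) || math.IsInf(n, 0) {
		return errors.New("not a valid number")
	}
	if n < 0 {
		return errors.New("must not be negative")
	}
	return nil
}

// ValidateFields runs every rule and records a result per field
// instead of stopping at the first failure.
func ValidateFields(fields map[string]string, rules map[string]Rule) []FieldResult {
	names := make([]string, 0, len(rules))
	for name := range rules {
		names = append(names, name)
	}
	sort.Strings(names) // stable order for storage and the review UI

	results := make([]FieldResult, 0, len(names))
	for _, name := range names {
		res := FieldResult{Name: name, Valid: true}
		if err := rules[name](fields[name]); err != nil {
			res.Valid = false
			res.Error = err.Error()
		}
		results = append(results, res)
	}
	return results
}
```

The per-field results are what a reviewer would see highlighted; the valid fields keep their extracted values.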

apply-worker

Handles:

  • write-back to target systems
  • webhooks
  • exports
  • idempotent updates
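Idempotent updates matter because a retried task must not write to the target system twice. One way to sketch this, assuming a key derived from the job and the target, is shown below. The `AppliedSet` type is an in-memory stand-in for persistent export state and is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// IdempotencyKey derives a stable key from the job and the target system.
// The key format is an assumption made for this sketch.
func IdempotencyKey(jobID, target string) string {
	sum := sha256.Sum256([]byte(jobID + "\x00" + target))
	return hex.EncodeToString(sum[:])
}

// AppliedSet remembers which keys were already written.
// It is an in-memory stand-in; a real system would persist this state.
type AppliedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

// NewAppliedSet returns an empty set ready for use.
func NewAppliedSet() *AppliedSet {
	return &AppliedSet{seen: make(map[string]bool)}
}

// ApplyOnce runs write only the first time a key is seen.
// It reports whether the write ran. A failed write is not marked,
// so a later retry can attempt it again.
func (s *AppliedSet) ApplyOnce(key string, write func() error) (bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[key] {
		return false, nil
	}
	if err := write(); err != nil {
		return false, err
	}
	s.seen[key] = true
	return true, nil
}
```

Holding the lock during the write keeps the sketch short but serializes all writes; a production version would more likely rely on a unique constraint in the database.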

Data model

| Table | Purpose |
|---|---|
| document_jobs | Top-level job state |
| document_files | File metadata and storage pointer |
| document_pages | OCR text or page slices |
| extraction_runs | AI extraction attempts |
| extracted_fields | Structured output per field |
| validation_results | Pass/fail data |
| review_tasks | Human review queue |
| exports | Downstream write-back state |
| audit_logs | System and operator history |

Example field extraction shape

From a contract or invoice, the pipeline might extract:

  • supplier name
  • invoice number
  • total amount
  • due date
  • currency
  • line item count
  • tax amount
  • customer reference

A CV parser might extract:

  • name
  • email
  • phone
  • skills
  • years of experience
  • notice period

The same pipeline pattern works for both document types; only the set of fields to extract changes.
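One way to keep the pipeline independent of the document type is to store every extracted field in the same shape. The sketch below assumes the LLM returns JSON keyed by field name with a value and a confidence score; that JSON layout and the `ParseExtraction` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExtractedField pairs a value with the confidence assigned to it.
type ExtractedField struct {
	Value      string  `json:"value"`
	Confidence float64 `json:"confidence"`
}

// Extraction maps field names to extracted fields. The same type
// serves invoices, CVs, or any other template.
type Extraction map[string]ExtractedField

// ParseExtraction decodes the raw output and lists required fields
// that are absent or empty.
func ParseExtraction(raw []byte, required []string) (Extraction, []string, error) {
	var ex Extraction
	if err := json.Unmarshal(raw, &ex); err != nil {
		return nil, nil, fmt.Errorf("decode extraction: %w", err)
	}
	var missing []string
	for _, name := range required {
		if f, ok := ex[name]; !ok || f.Value == "" {
			missing = append(missing, name)
		}
	}
	return ex, missing, nil
}
```

Switching from invoices to CVs then means passing a different list of required field names, such as `name` and `email` instead of `invoice_number` and `total_amount`.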

Confidence and review model

Auto-apply path

If the extraction is complete and passes both confidence and validation thresholds, the result can be applied automatically.

Review path

If the confidence is weak, required fields are missing, or validation fails, the document lands in a review queue with highlighted fields and suggested values.

The review path is what makes the pipeline usable in real operations: uncertain results reach a person instead of being applied blindly.
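The two paths reduce to a single routing decision per document. The sketch below is an illustration under stated assumptions: the `Field` struct, the `Route` function, and the idea of one shared confidence threshold are hypothetical, and the actual thresholds are not given in this write-up.

```go
package main

// Decision is the outcome of routing one document.
type Decision string

const (
	AutoApply Decision = "auto_apply"
	Review    Decision = "review"
)

// Field is a hypothetical view of one extracted field after validation.
type Field struct {
	Name       string
	Value      string
	Confidence float64
	Required   bool
	Valid      bool
}

// Route returns the decision plus the names of the fields a reviewer
// should look at. A document is auto-applied only when nothing is flagged.
func Route(fields []Field, minConfidence float64) (Decision, []string) {
	var flagged []string
	for _, f := range fields {
		missing := f.Value == ""
		if missing && !f.Required {
			continue // optional field that was not found; nothing to review
		}
		if missing || !f.Valid || f.Confidence < minConfidence {
			flagged = append(flagged, f.Name)
		}
	}
	if len(flagged) > 0 {
		return Review, flagged
	}
	return AutoApply, nil
}
```

Returning the flagged field names alongside the decision gives the review UI what it needs to highlight the weak fields next to their suggested values.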

Example worker code

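import (
    "context"
    "fmt"
)

// Handle runs LLM field extraction for one job and queues the validation step.
// Any error is returned to the caller so the task can be retried.
func (w *ExtractionWorker) Handle(ctx context.Context, jobID string) error {
    // Load the OCR text and the extraction template for this job.
    doc, err := w.Documents.Get(ctx, jobID)
    if err != nil {
        return fmt.Errorf("load document for job %s: %w", jobID, err)
    }

    prompt := w.Prompts.BuildExtractionPrompt(doc.Text, doc.Template)

    result, err := w.LLM.Extract(ctx, prompt)
    if err != nil {
        return fmt.Errorf("extract fields for job %s: %w", jobID, err)
    }

    // Persist fields and confidence before queueing, so the validation worker always finds them.
    if err := w.Extractions.Store(ctx, jobID, result.Fields, result.Confidence); err != nil {
        return fmt.Errorf("store extraction for job %s: %w", jobID, err)
    }

    return w.Queue.PublishValidationTask(ctx, jobID)
}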