🔍 SoRag Indexing Architecture

Indexing Workflow - Document Processing Pipeline

A linear LangGraph pipeline that takes documents from raw files to vector-search-ready data. It runs 6 stages in sequence (Task Init, Parse, Chunk, Embed, Upsert, Commit), backed by 59 component files.
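
As a rough illustration of the linear flow, the sketch below runs six stages in order and stops at the first failure. It is plain Python and deliberately does not use LangGraph. `IndexState`, `run_pipeline`, and the toy stage bodies are hypothetical names invented for this example, not SoRag's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class IndexState:
    """State handed from stage to stage (all field names are illustrative)."""
    raw: str
    metadata: dict = field(default_factory=dict)
    text: str = ""
    chunks: list = field(default_factory=list)
    vectors: list = field(default_factory=list)
    stored: int = 0
    status: str = "pending"
    error: Optional[str] = None
    trace: list = field(default_factory=list)


def task_init(state: IndexState) -> None:
    # Stand-in for "load metadata".
    state.metadata["raw_length"] = len(state.raw)


def parse(state: IndexState) -> None:
    # Stand-in parser: real parsers would handle PDF, image, text, doc.
    state.text = state.raw.strip()
    if not state.text:
        raise ValueError("empty document")


def chunk(state: IndexState, words_per_chunk: int = 4) -> None:
    # Toy fixed-size chunking by word count.
    words = state.text.split()
    state.chunks = [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def embed(state: IndexState) -> None:
    # Toy "embedding": character count and word count per chunk.
    state.vectors = [[float(len(c)), float(len(c.split()))] for c in state.chunks]


def upsert(state: IndexState) -> None:
    # Stand-in for writing to the vector stores.
    state.stored = len(state.vectors)


def commit(state: IndexState) -> None:
    state.status = "search_ready"


STAGES: list[tuple[str, Callable[[IndexState], None]]] = [
    ("task_init", task_init),
    ("parse", parse),
    ("chunk", chunk),
    ("embed", embed),
    ("upsert", upsert),
    ("commit", commit),
]


def run_pipeline(state: IndexState) -> IndexState:
    """Run the stages in order; the first exception ends the run as 'failed'."""
    for name, stage in STAGES:
        try:
            stage(state)
        except Exception as exc:
            state.status = "failed"
            state.error = f"{name}: {exc}"
            return state
        state.trace.append(name)
    return state
```

Calling `run_pipeline(IndexState(raw="some text"))` returns the final state. Its `trace` field lists the stages that completed, which makes it easy to see where a failed run stopped.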

graph TB
    %% Style definitions
    classDef workflow fill:#667eea,stroke:#fff,stroke-width:3px,color:#fff
    classDef node fill:#764ba2,stroke:#fff,stroke-width:2px,color:#fff
    classDef factory fill:#f093fb,stroke:#333,stroke-width:2px,color:#333
    classDef storage fill:#ff6b6b,stroke:#333,stroke-width:2px,color:#333
    classDef startEnd fill:#4CAF50,stroke:#fff,stroke-width:2px,color:#fff

    %% Entry and Exit
    START(("📄 Documents")):::startEnd
    END(("🔍 Search Ready")):::startEnd

    %% Main Pipeline
    WORKFLOW["🏗️ INDEXING WORKFLOW<br/>Linear LangGraph Pipeline"]:::workflow

    %% Processing Nodes
    TASK_INIT["📋 Task Init<br/>Load metadata"]:::node
    PARSE["📝 Parse<br/>Extract content<br/>4 file types"]:::node
    CHUNK["✂️ Chunk<br/>Split text<br/>4 strategies"]:::node
    EMBED["🧠 Embed<br/>Generate vectors<br/>Dense + Sparse"]:::node
    UPSERT["⬆️ Upsert<br/>Store in DBs<br/>5 systems"]:::node
    COMMIT["✅ Commit<br/>Finalize & cleanup"]:::node

    %% Component Factories
    subgraph COMPONENTS["Processing Components (59 files)"]
        PARSERS["📝 Parsers<br/>PDF, Image, Text, Doc"]:::factory
        CHUNKERS["✂️ Chunkers<br/>Fixed, Semantic, Page, Smart"]:::factory
        EMBEDDERS["🧠 Embedders<br/>BGE, OpenAI, BM25"]:::factory
        UPSERTERS["⬆️ Upserters<br/>FAISS, Qdrant, Pinecone, ES"]:::factory
    end

    %% Main flow
    START --> WORKFLOW
    WORKFLOW --> TASK_INIT

    %% Linear pipeline with error handling
    TASK_INIT --> PARSE
    PARSE --> CHUNK
    CHUNK --> EMBED
    EMBED --> UPSERT
    UPSERT --> COMMIT
    COMMIT --> END

    %% Component usage
    PARSE ==> PARSERS
    CHUNK ==> CHUNKERS
    EMBED ==> EMBEDDERS
    UPSERT ==> UPSERTERS

    %% Error paths
    TASK_INIT -.->|error| ERROR["❌ Failed"]:::startEnd
    PARSE -.->|error| ERROR
    CHUNK -.->|error| ERROR
    EMBED -.->|error| ERROR
    UPSERT -.->|error| ERROR
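
The diagram groups parsers, chunkers, embedders, and upserters as component factories that each stage draws from. One common way to build such a factory is a name-to-implementation registry, sketched below for chunkers. This is only an assumption about the pattern: `register_chunker`, `create_chunker`, and the two toy strategies are hypothetical. The Semantic and Smart strategies are left out because their behavior is not described here.

```python
from typing import Callable

# A chunker maps document text to a list of chunks.
Chunker = Callable[[str], list[str]]

# Hypothetical registry: strategy name -> chunker implementation.
_CHUNKERS: dict[str, Chunker] = {}


def register_chunker(name: str) -> Callable[[Chunker], Chunker]:
    """Decorator that records a chunker under a strategy name."""
    def decorator(fn: Chunker) -> Chunker:
        _CHUNKERS[name] = fn
        return fn
    return decorator


@register_chunker("fixed")
def fixed_chunker(text: str, size: int = 20) -> list[str]:
    # Fixed-size character windows; the default size is arbitrary.
    return [text[i:i + size] for i in range(0, len(text), size)]


@register_chunker("page")
def page_chunker(text: str) -> list[str]:
    # Assumes pages are separated by form-feed characters.
    return [page.strip() for page in text.split("\f") if page.strip()]


def create_chunker(strategy: str) -> Chunker:
    """Look up a chunker by strategy name; unknown names raise ValueError."""
    try:
        return _CHUNKERS[strategy]
    except KeyError:
        known = ", ".join(sorted(_CHUNKERS))
        raise ValueError(
            f"unknown chunking strategy {strategy!r}; known: {known}"
        ) from None
```

With a registry like this, the Chunk stage would only need a strategy name from configuration, and a new strategy could be added without touching the stage itself.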