Legacy Data to Structured Database
An automated pipeline that uses LLMs to structure unstructured legacy Word documents.
The Challenge: "Digital Paper"
The organization had inventory data trapped in thousands of legacy .docx files.
- Mixed formatting (tables vs. paragraphs)
- Non-standardized item naming
- Unstructured human comments
The Extraction Pipeline
1. A Python script walks the directory tree and extracts raw text and tables via `python-docx`.
2. GPT-4 analyzes the unstructured text to identify items and their conditions.
3. The model returns strictly formatted JSON via Function Calling, with item names mapped to canonical IDs.
4. Structured data is saved to SQL, linked to the extracted Room IDs.
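The first extraction stage can be sketched roughly as follows. `find_docx_files` and `extract_docx_text` are illustrative names, not the project's actual functions; `Document`, `.paragraphs`, and `.tables` are the real `python-docx` API.

```python
from pathlib import Path

def find_docx_files(root: str) -> list[Path]:
    """Walk the directory tree and collect every legacy .docx file."""
    return sorted(Path(root).rglob("*.docx"))

def extract_docx_text(path: Path) -> str:
    """Pull raw text from one document: body paragraphs plus table cells."""
    # Imported lazily so the directory walker works without python-docx installed.
    from docx import Document

    doc = Document(str(path))
    parts = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            parts.append(" | ".join(cell.text.strip() for cell in row.cells))
    return "\n".join(parts)
```

Flattening table rows into pipe-separated lines keeps the visual row structure visible to the LLM in the next stage.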
Logic & Transformation
Input (raw `.docx` text):

```
Bedroom Inventory
Oak Chest of Drawers (3) - Good
Note: Curtains are faded.
```

Output (structured JSON):

```json
{
  "room_id": "C2",
  "inventory": [
    {
      "item": "Desk Chair",
      "qty": 1,
      "attrs": ["Blue"],
      "condition": "Poor"
    },
    {
      "item": "Chest of Drawers",
      "qty": 1,
      "attrs": ["Oak", "3 Drawers"],
      "condition": "Good"
    }
  ]
}
```
Traditional regex parsing failed on these visually structured layouts. GPT-4 correctly inferred that "(3)" meant "3 drawers" in one context, while "(2)" meant a quantity of 2 in another.
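The Function Calling step can be sketched as a tool definition whose parameters are a JSON Schema matching the output shape above. The tool name `record_inventory` and the condition vocabulary are illustrative assumptions, not the project's actual values.

```python
# JSON Schema handed to the model via Function Calling, constraining replies
# to the structure shown above (field names mirror that example).
INVENTORY_SCHEMA = {
    "type": "object",
    "properties": {
        "room_id": {"type": "string"},
        "inventory": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "item": {"type": "string"},
                    "qty": {"type": "integer", "minimum": 1},
                    "attrs": {"type": "array", "items": {"type": "string"}},
                    # Condition vocabulary is illustrative.
                    "condition": {"type": "string", "enum": ["Good", "Fair", "Poor"]},
                },
                "required": ["item", "qty", "condition"],
            },
        },
    },
    "required": ["room_id", "inventory"],
}

# Tool definition in the OpenAI tools format; the name is hypothetical.
EXTRACT_TOOL = {
    "type": "function",
    "function": {
        "name": "record_inventory",
        "description": "Record the structured inventory for one room.",
        "parameters": INVENTORY_SCHEMA,
    },
}
```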
Caching
Implemented `dill` serialization to cache API responses locally, so the output schema could be iterated on without re-paying for API calls.
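A minimal version of such a cache, keyed on the prompt text. This sketch substitutes the stdlib `pickle` for `dill` to stay self-contained; `CACHE_DIR` and `cached` are illustrative names.

```python
import hashlib
import pickle  # the project used `dill`; pickle keeps this sketch stdlib-only
from pathlib import Path

CACHE_DIR = Path(".llm_cache")

def cached(fn):
    """Memoize expensive API calls to disk, keyed by a hash of the prompt."""
    def wrapper(prompt: str):
        CACHE_DIR.mkdir(exist_ok=True)
        key = hashlib.sha256(prompt.encode()).hexdigest()
        path = CACHE_DIR / f"{key}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())  # zero-cost replay
        result = fn(prompt)
        path.write_bytes(pickle.dumps(result))
        return result
    return wrapper
```

On a second run, every previously seen prompt is answered from disk, so downstream schema or database changes can be re-tested without touching the API.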
Concurrency
Used Python's `ProcessPoolExecutor` to process files in parallel, reducing runtime from hours to minutes.
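Since each file is independent, the fan-out can be sketched as below. The names are illustrative; a real `process_file` would run the extract, GPT-4, and validate chain rather than the stub shown here.

```python
from concurrent.futures import ProcessPoolExecutor

def process_file(path: str) -> str:
    # Stand-in for extract -> LLM -> validate. Must be a top-level function
    # so worker processes can pickle it.
    return path.upper()

def process_all(paths: list[str], workers: int = 4) -> list[str]:
    """Fan the per-file pipeline out across CPU cores, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, paths))
```

`pool.map` preserves input order, so results can be zipped back to their source paths when writing to SQL.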
Validation
Enforced a strict JSON Schema; validation errors were fed back into the prompt for self-correction.
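A sketch of that validate-and-retry loop, with a hand-rolled checker standing in for full JSON Schema validation. `call_llm`, `validate`, and `structure_with_retry` are hypothetical names.

```python
import json

def validate(payload: str) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    errors = []
    if not isinstance(data.get("room_id"), str):
        errors.append("room_id must be a string")
    for i, item in enumerate(data.get("inventory", [])):
        if not isinstance(item.get("qty"), int):
            errors.append(f"inventory[{i}].qty must be an integer")
    return errors

def structure_with_retry(call_llm, text: str, max_tries: int = 3) -> dict:
    """Ask the model for JSON; on violation, feed the errors back so it self-corrects."""
    prompt = text
    for _ in range(max_tries):
        reply = call_llm(prompt)
        errors = validate(reply)
        if not errors:
            return json.loads(reply)
        # Append the concrete violations to the next attempt's prompt.
        prompt = text + "\nYour last answer was invalid:\n- " + "\n- ".join(errors)
    raise ValueError("model never produced valid JSON")
```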