Inventory Extraction: GPT-4 Automation

Legacy Data to Structured Database

An automated pipeline that uses LLMs to turn unstructured legacy Word documents into a structured database.

The Challenge: "Digital Paper"

The organization had inventory data trapped in thousands of legacy .docx files.

  • Mixed formatting (tables vs. paragraphs)
  • Non-standardized item naming
  • Unstructured human comments

The Extraction Pipeline

Step 1
Ingest

A Python script walks the directory tree and extracts raw text and tables via `python-docx`.

Step 2
Interpret

GPT-4 analyzes unstructured text to identify items and conditions.

Step 3
Normalize

Returns strictly formatted JSON via Function Calling. Canonical ID mapping.
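A sketch of the Interpret/Normalize hand-off: a function-calling tool schema that mirrors the JSON output shown below, plus a post-processing pass for canonical IDs. The tool name, the condition enum, and the `CANONICAL_IDS` map are illustrative assumptions; the actual model call is omitted.

```python
# Tool schema passed to GPT-4 via Function Calling, forcing the model to
# return inventory data in a strict shape rather than free text.
INVENTORY_TOOL = {
    "type": "function",
    "function": {
        "name": "record_inventory",
        "description": "Record the normalized inventory for one room.",
        "parameters": {
            "type": "object",
            "properties": {
                "room_id": {"type": "string"},
                "inventory": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "item": {"type": "string"},
                            "qty": {"type": "integer"},
                            "attrs": {"type": "array", "items": {"type": "string"}},
                            "condition": {"type": "string", "enum": ["Good", "Fair", "Poor"]},
                        },
                        "required": ["item", "qty", "condition"],
                    },
                },
            },
            "required": ["room_id", "inventory"],
        },
    },
}

# Illustrative canonical-ID map; the real mapping is not shown in the write-up.
CANONICAL_IDS = {"desk chair": "CHAIR-DESK", "chest of drawers": "DRAWERS-CHEST"}


def normalize(payload: dict) -> dict:
    """Attach a canonical ID to each item after the model call returns."""
    for entry in payload["inventory"]:
        entry["canonical_id"] = CANONICAL_IDS.get(entry["item"].lower(), "UNKNOWN")
    return payload
```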

Step 4
Persist

Data saved to SQL. Linked to extracted Room IDs.
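The Persist step might look like this. The stack lists MySQL; `sqlite3` stands in here so the sketch is self-contained, and the table and column names are assumptions.

```python
# Sketch of the Persist step: write one room's normalized payload to SQL,
# keyed by the extracted room ID.
import sqlite3


def persist(conn: sqlite3.Connection, payload: dict) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inventory ("
        "room_id TEXT, item TEXT, qty INTEGER, condition TEXT)"
    )
    conn.executemany(
        "INSERT INTO inventory (room_id, item, qty, condition) VALUES (?, ?, ?, ?)",
        [
            (payload["room_id"], e["item"], e["qty"], e["condition"])
            for e in payload["inventory"]
        ],
    )
    conn.commit()
```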

Logic & Transformation

Legacy Document Input
// File: Room_C2.docx

Bedroom Inventory

1x Blue Desk Chair (Wobbly leg)
Oak Chest of Drawers (3) - Good

Note: Curtains are faded.

JSON Output (Function Call)

{
  "room_id": "C2",
  "inventory": [
    {
      "item": "Desk Chair",
      "qty": 1,
      "attrs": ["Blue"],
      "condition": "Poor"
    },
    {
      "item": "Chest of Drawers",
      "qty": 1,
      "attrs": ["Oak", "3 Drawers"],
      "condition": "Good"
    }
  ]
}

Traditional RegEx failed on visual structures. GPT-4 successfully inferred that "(3)" meant "3 drawers" in one context, but "(2)" meant "quantity: 2" in another.

Caching

Implemented `dill` serialization to cache API responses locally, allowing zero-cost iteration on the extraction schema.
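A minimal sketch of the response cache. The write-up names `dill`; the standard-library `pickle` is substituted here so the sketch runs without extra dependencies, and the cache directory and hash-based key are assumptions.

```python
# Cache API responses on disk, keyed by a hash of the prompt, so repeated
# runs over the same documents cost nothing while the schema is iterated on.
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path(".api_cache")  # assumed location


def cached_call(prompt: str, call_api) -> dict:
    """Return a cached response for `prompt` if present; otherwise call and cache."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    response = call_api(prompt)
    path.write_bytes(pickle.dumps(response))
    return response
```

`dill` was presumably chosen over `pickle` because it serializes a wider range of Python objects; the caching logic is the same either way.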

Concurrency

Used Python's `ProcessPoolExecutor` to process files in parallel, reducing runtime from hours to minutes.
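The fan-out can be sketched as below. `process_file` is a placeholder for the real per-document work (ingest, GPT-4 call, persist), and the worker count is an assumption.

```python
# Process many documents in parallel across CPU cores with
# concurrent.futures.ProcessPoolExecutor.
from concurrent.futures import ProcessPoolExecutor


def process_file(path: str) -> str:
    # Placeholder: the real function runs extraction plus the GPT-4 call.
    return path.upper()


def process_all(paths: list[str], workers: int = 8) -> list[str]:
    """Map process_file over all paths, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, paths))
```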

Validation

Enforced strict JSON Schema. Model validation errors were fed back into the prompt for self-correction.
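The validate-and-retry loop might look like this. A hand-rolled check stands in for the full JSON Schema, the retry prompt wording is an assumption, and the model call is passed in as a function so the loop itself is testable.

```python
# Validate the model's payload and feed any violations back into the
# prompt so the model can self-correct on the next attempt.
def validate_payload(payload: dict) -> list[str]:
    """Return a list of schema violations (empty list means valid)."""
    errors = []
    if not isinstance(payload.get("room_id"), str):
        errors.append("room_id must be a string")
    for i, entry in enumerate(payload.get("inventory", [])):
        if not isinstance(entry.get("qty"), int):
            errors.append(f"inventory[{i}].qty must be an integer")
    return errors


def extract_with_retry(prompt: str, call_model, max_attempts: int = 3) -> dict:
    """Retry the model call, appending validation errors to the prompt."""
    for _ in range(max_attempts):
        payload = call_model(prompt)
        errors = validate_payload(payload)
        if not errors:
            return payload
        prompt += "\nFix these schema errors: " + "; ".join(errors)
    raise ValueError("model never produced a valid payload")
```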

Technology Stack

  • Python 3
  • OpenAI API (GPT-4)
  • Pandas
  • MySQL