Legacy Data to Structured Database
An automated pipeline that uses LLMs to structure unstructured legacy Word documents.
The Challenge: "Digital Paper"
The organization had inventory data trapped in thousands of legacy .docx files.
- Mixed formatting (tables vs. paragraphs)
- Non-standardized item naming
- Unstructured human comments
The Extraction Pipeline
1. A Python script walks the directory tree and extracts raw text and tables via `python-docx`.
2. GPT-4 analyzes the unstructured text to identify items and their conditions.
3. The model returns strictly formatted JSON via Function Calling, with item names mapped to canonical IDs.
4. Structured data is saved to SQL, linked to the extracted Room IDs.
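The first extraction stage can be sketched roughly as follows. `find_docx_files` and `extract_docx_text` are illustrative names, not the project's actual functions; `Document`, `.paragraphs`, and `.tables` are the real `python-docx` API.

```python
from pathlib import Path

def find_docx_files(root: str) -> list[Path]:
    """Walk the directory tree and collect every legacy .docx file."""
    return sorted(Path(root).rglob("*.docx"))

def extract_docx_text(path: Path) -> str:
    """Pull raw text from one document: body paragraphs plus table cells."""
    # Imported lazily so the directory walker works without python-docx installed.
    from docx import Document

    doc = Document(str(path))
    parts = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            parts.append(" | ".join(cell.text.strip() for cell in row.cells))
    return "\n".join(parts)
```

Flattening table rows into pipe-separated lines keeps the visual row structure visible to the LLM in the next stage.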
Logic & Transformation
Input (raw `.docx` text):

```
Bedroom Inventory
Oak Chest of Drawers (3) - Good
Note: Curtains are faded.
```

Output (structured JSON):

```json
{
  "room_id": "C2",
  "inventory": [
    {
      "item": "Desk Chair",
      "qty": 1,
      "attrs": ["Blue"],
      "condition": "Poor"
    },
    {
      "item": "Chest of Drawers",
      "qty": 1,
      "attrs": ["Oak", "3 Drawers"],
      "condition": "Good"
    }
  ]
}
```
Traditional regex parsing failed on these visually structured layouts. GPT-4 correctly inferred that "(3)" meant "3 drawers" in one context, while "(2)" meant a quantity of 2 in another.
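The Function Calling step can be sketched as a tool definition whose parameters are a JSON Schema matching the output shape above. The tool name `record_inventory` and the condition vocabulary are illustrative assumptions, not the project's actual values.

```python
# JSON Schema handed to the model via Function Calling, constraining replies
# to the structure shown above (field names mirror that example).
INVENTORY_SCHEMA = {
    "type": "object",
    "properties": {
        "room_id": {"type": "string"},
        "inventory": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "item": {"type": "string"},
                    "qty": {"type": "integer", "minimum": 1},
                    "attrs": {"type": "array", "items": {"type": "string"}},
                    # Condition vocabulary is illustrative.
                    "condition": {"type": "string", "enum": ["Good", "Fair", "Poor"]},
                },
                "required": ["item", "qty", "condition"],
            },
        },
    },
    "required": ["room_id", "inventory"],
}

# Tool definition in the OpenAI tools format; the name is hypothetical.
EXTRACT_TOOL = {
    "type": "function",
    "function": {
        "name": "record_inventory",
        "description": "Record the structured inventory for one room.",
        "parameters": INVENTORY_SCHEMA,
    },
}
```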
Caching
Implemented `dill` serialization to cache API responses locally, so the output schema could be iterated on without re-paying for API calls.
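A minimal version of such a cache, keyed on the prompt text. This sketch substitutes the stdlib `pickle` for `dill` to stay self-contained; `CACHE_DIR` and `cached` are illustrative names.

```python
import hashlib
import pickle  # the project used `dill`; pickle keeps this sketch stdlib-only
from pathlib import Path

CACHE_DIR = Path(".llm_cache")

def cached(fn):
    """Memoize expensive API calls to disk, keyed by a hash of the prompt."""
    def wrapper(prompt: str):
        CACHE_DIR.mkdir(exist_ok=True)
        key = hashlib.sha256(prompt.encode()).hexdigest()
        path = CACHE_DIR / f"{key}.pkl"
        if path.exists():
            return pickle.loads(path.read_bytes())  # zero-cost replay
        result = fn(prompt)
        path.write_bytes(pickle.dumps(result))
        return result
    return wrapper
```

On a second run, every previously seen prompt is answered from disk, so downstream schema or database changes can be re-tested without touching the API.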
Concurrency
Used Python's `ProcessPoolExecutor` to process files in parallel, reducing runtime from hours to minutes.
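Since each file is independent, the fan-out can be sketched as below. The names are illustrative; a real `process_file` would run the extract, GPT-4, and validate chain rather than the stub shown here.

```python
from concurrent.futures import ProcessPoolExecutor

def process_file(path: str) -> str:
    # Stand-in for extract -> LLM -> validate. Must be a top-level function
    # so worker processes can pickle it.
    return path.upper()

def process_all(paths: list[str], workers: int = 4) -> list[str]:
    """Fan the per-file pipeline out across CPU cores, preserving input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_file, paths))
```

`pool.map` preserves input order, so results can be zipped back to their source paths when writing to SQL.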
Validation
Enforced a strict JSON Schema; validation errors were fed back into the prompt for self-correction.
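A sketch of that validate-and-retry loop, with a hand-rolled checker standing in for full JSON Schema validation. `call_llm`, `validate`, and `structure_with_retry` are hypothetical names.

```python
import json

def validate(payload: str) -> list[str]:
    """Return a list of schema violations; an empty list means valid."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    errors = []
    if not isinstance(data.get("room_id"), str):
        errors.append("room_id must be a string")
    for i, item in enumerate(data.get("inventory", [])):
        if not isinstance(item.get("qty"), int):
            errors.append(f"inventory[{i}].qty must be an integer")
    return errors

def structure_with_retry(call_llm, text: str, max_tries: int = 3) -> dict:
    """Ask the model for JSON; on violation, feed the errors back so it self-corrects."""
    prompt = text
    for _ in range(max_tries):
        reply = call_llm(prompt)
        errors = validate(reply)
        if not errors:
            return json.loads(reply)
        # Append the concrete violations to the next attempt's prompt.
        prompt = text + "\nYour last answer was invalid:\n- " + "\n- ".join(errors)
    raise ValueError("model never produced valid JSON")
```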