Text-to-SQL at Enterprise Scale: Lessons from Field Operations

Natural-language-to-SQL feels magical when it works. The first time a field engineer types "show me the wells where pressure dropped more than 10% last week" and gets a correct query back, you understand why people are excited.

The second time, after the model hallucinates a column name and a Lambda errors out, you understand why most Text-to-SQL systems don't make it to production. This is what worked.

The schema awareness problem

The model needs to know the schema. Not just table names: column types, foreign key relationships, common joins, business meaning of columns. "current_pressure_psi" and "current_pressure_bar" mean different things, and the model has no way to know that without help.

We solved this with a dedicated schema description doc, regenerated weekly from Snowflake metadata plus human-authored annotations on the columns that mattered most. The doc went into the system prompt as compressed YAML. About 4K tokens of schema for a 200-table warehouse.

Hallucinated columns

The most common failure mode: the model invents a plausible-looking column that doesn't exist. "well_id_v2" instead of "well_id". "pressure_reading" instead of "pressure_psi".

The fix is a validation pass before execution. Parse the generated SQL, check every column against the schema doc, and if anything doesn't match, hand the failure back to the model with a specific correction. Two iterations resolves 95% of hallucination errors.

Performance budgets

We capped query cost in Snowflake using a query tag and a circuit breaker. If a generated query would scan more than X bytes or take longer than Y seconds, it never ran.

The model occasionally produced beautifully correct but catastrophically expensive queries. Cartesian joins on telemetry tables. Full table scans where an index would have worked. Catching those is cheaper than running them.

Real user patterns

What surprised me: most queries field engineers actually want are simple. Top N, filter by date, group by well. The hard queries are rare and usually wrong on the first try.

We optimized hard for the easy queries (sub-second response, no hallucination tolerance) and gracefully degraded the hard ones (explanation of why it might be wrong, suggestion to refine).

What broke

Time zones. Always time zones. The warehouse stores timestamps in UTC, the field crews think in mountain time, and the model defaults to nothing in particular. We hardcoded the timezone in the system prompt and the errors stopped.