Classical MLOps assumed a model artifact: a file you trained, versioned, deployed. LLM applications break that assumption. "The model" is a prompt plus a retrieval setup plus tool definitions plus a context window strategy. There's no single artifact to version.
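One way to get a version anyway is to hash every component that shapes behavior into a single identifier. The sketch below is an illustration, not something described above: the function name `config_fingerprint` and the shape of its arguments are assumptions.

```python
import hashlib
import json


def config_fingerprint(prompt_template, retrieval, tools, context_strategy):
    """Return a short, stable ID for one combination of app components.

    All four arguments are hypothetical shapes chosen for illustration:
    a prompt string, a dict of retrieval settings, a list of tool
    definitions, and a dict describing the context window strategy.
    """
    payload = {
        "prompt_template": prompt_template,
        "retrieval": retrieval,
        "tools": tools,
        "context_strategy": context_strategy,
    }
    # sort_keys gives the same serialization regardless of dict key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Stamping this ID on every request makes it possible to tell which combination of prompt, retrieval settings, and tools produced a given output. It does not capture vendor-side weight changes, which need their own signal.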

Losing the single artifact changes what drift means, what you observe, what triggers a rebuild, and what you budget for.

What drift means for LLMs

For classical ML, drift means the input distribution changed. For LLM applications, drift can mean any of four things: user queries changed, retrieved context changed, model behavior changed (the vendor silently updated the weights), or tool outputs changed.

You monitor all four, and they require different signals.
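As an illustration of one of those signals, here is a minimal sketch of query drift, assuming each query has already been embedded into a fixed-length vector. It compares the centroid of a baseline window with the centroid of a recent window. This is a plain-Python stand-in, not Evidently's API, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def query_drift(baseline_embeddings, current_embeddings):
    """Distance between the mean query embedding of two time windows."""
    return cosine_distance(
        centroid(baseline_embeddings), centroid(current_embeddings)
    )
```

The same comparison can be run on embeddings of retrieved chunks. Model behavior and tool outputs are not covered by this sketch and need other checks.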

Observability stack

We use Evidently AI for data and embedding drift on the retrieval side, CloudWatch for system metrics, and custom traces for per-request observability.

The trace is the most useful artifact. Every production request gets a trace showing: input, retrieved context, tool calls, generated output, latency at each step, and cost. When something goes wrong, you start at the trace.
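A trace with those fields could be modeled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical, and a real system would likely emit the record to a tracing backend rather than just serialize it.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    name: str            # e.g. "retrieve", "tool:search", "generate"
    latency_ms: float
    cost_usd: float = 0.0


@dataclass
class Trace:
    request_id: str
    user_input: str
    retrieved_context: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    output: str = ""
    steps: list = field(default_factory=list)

    @property
    def total_latency_ms(self):
        return sum(s.latency_ms for s in self.steps)

    @property
    def total_cost_usd(self):
        return sum(s.cost_usd for s in self.steps)

    def to_json(self):
        # asdict recurses into the nested Step dataclasses.
        record = asdict(self)
        record["total_latency_ms"] = self.total_latency_ms
        record["total_cost_usd"] = self.total_cost_usd
        return json.dumps(record)
```

Recording latency and cost per step, rather than per request, shows which stage of a slow or expensive request was responsible.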

Retraining triggers

For classical ML, you retrain on a schedule or on drift. For LLM applications, you rebuild prompts and re-tune retrieval on quality drift. When eval scores drop below a threshold, regenerate the prompt and re-run evals.

Most of what classical MLOps calls "retraining" is, for LLM apps, prompt engineering with eval-gated deployment.
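An eval gate can be as small as the sketch below. The function name, the two thresholds, and the idea of comparing against the currently deployed prompt's scores are assumptions made for illustration. The text above only specifies that deployment depends on eval scores.

```python
from statistics import mean


def eval_gate(candidate_scores, baseline_scores,
              min_score=0.8, max_regression=0.02):
    """Decide whether a candidate prompt may ship.

    min_score and max_regression are hypothetical defaults; tune them
    to your own eval suite. Returns a (ship, reason) tuple.
    """
    cand = mean(candidate_scores)
    base = mean(baseline_scores)
    if cand < min_score:
        return False, f"mean score {cand:.3f} is below floor {min_score:.3f}"
    if cand < base - max_regression:
        return False, f"mean score {cand:.3f} regresses from baseline {base:.3f}"
    return True, f"mean score {cand:.3f} passes"
```

A candidate has to clear both an absolute floor and a comparison with what is already deployed, so a prompt that is merely acceptable cannot replace a better one.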

Cost monitoring

Tokens are expensive. Per-request cost dashboards are not optional. We set per-user, per-feature, and per-agent cost budgets, with alerts when any of them exceeds its projection.

The most common cost surprise is a feature that suddenly starts retrieving 20 chunks instead of 5 because someone tweaked a parameter. When that happened to us, the cost dashboard caught it before the bill arrived.
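The per-user, per-feature, and per-agent budgets can be checked with a small aggregation over per-request cost records. This is a sketch: the record keys (`user`, `feature`, `agent`, `cost_usd`) and the budget format are hypothetical, and alert delivery is left out.

```python
from collections import defaultdict


def over_budget(requests, budgets):
    """Return {(dimension, name): spend} for every exceeded budget.

    requests: iterable of dicts with 'user', 'feature', 'agent', 'cost_usd'.
    budgets:  dict keyed by (dimension, name), e.g. ("feature", "search").
    """
    spend = defaultdict(float)
    for req in requests:
        # Attribute each request's cost to all three dimensions.
        for dimension in ("user", "feature", "agent"):
            spend[(dimension, req[dimension])] += req["cost_usd"]
    return {
        key: total
        for key, total in spend.items()
        if key in budgets and total > budgets[key]
    }
```

A jump from 5 to 20 retrieved chunks multiplies the context tokens per request, so it would appear here as the affected feature crossing its budget early.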

When to retrain a prompt

When evals drop. Not on a schedule. Not because you read a blog post. The eval scores are the only honest signal you have.