Chapter 1

Prerequisites

Before You Optimize

⚠️ Before You Begin

Before diving into inference engineering, you need clear answers to five questions: Which model(s)? What interface? What latency budget? What unit economics? What usage patterns?

Early in building an AI product, the answers to these questions may not be clear. At this stage, it's usually better to use off-the-shelf APIs rather than investing in dedicated inference. As the product scales, requirements crystallize and inference engineering becomes a worthwhile pursuit.

Scale and Specialization

There are two ways to add AI models to your product:

  • Shared inference: Send your traffic to a public API endpoint for a given model and pay per million tokens or some other consumption-based metric.
  • Dedicated deployments: Rent GPUs and set up an inference service exclusively for your application, paying per hour of GPU time.

Shared versus dedicated inference is not exactly the same conversation as closed versus open models — there are plenty of shared endpoints for open models and many model labs offer large customers some kind of dedicated setup for their closed models. However, one of the key motivations for adopting open models is that it unlocks unrestricted dedicated inference.

Most AI products start with pay-per-token APIs because the tradeoffs make sense while looking for product-market fit.

| | Pros | Cons |
| --- | --- | --- |
| Shared | Zero overhead; pay only for consumption | Cost scales linearly with usage |
| | No cold start times; model always available | Provider uptime caps product uptime |
| | Minimal engineering; just need an API key | No control over latency or rate limits |

Over time, shifting to dedicated deployments becomes essential for three reasons:

  1. Scale: You are processing enough volume that it's more economical to pay per GPU than per million tokens.
  2. Specialization: You are running a custom or fine-tuned model, or you have specific latency or uptime requirements.
  3. Orchestration: Your product relies on multiple models and multi-step pipelines and you need to minimize network latency and deployment complexity.
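To make the scale argument concrete, here is a rough breakeven sketch. All prices and GPU counts below are hypothetical placeholders, not quotes; substitute your own numbers:

```python
# Rough breakeven: at what monthly token volume does a dedicated GPU
# deployment become cheaper than a pay-per-token shared API?
# All prices here are made-up placeholders for illustration.

def breakeven_tokens_per_month(
    api_price_per_million: float,   # $ per 1M tokens on the shared API
    gpu_price_per_hour: float,      # $ per GPU-hour for a dedicated deployment
    num_gpus: int,                  # GPUs needed to serve peak load
    hours_per_month: float = 730.0, # an always-on deployment
) -> float:
    """Token volume above which dedicated is cheaper than shared."""
    monthly_gpu_cost = gpu_price_per_hour * num_gpus * hours_per_month
    return monthly_gpu_cost / api_price_per_million * 1_000_000

# Example with made-up numbers: $0.50 / 1M tokens vs. two $2/hr GPUs.
volume = breakeven_tokens_per_month(0.50, 2.0, 2)
print(f"Breakeven: {volume / 1e9:.2f}B tokens/month")  # -> 5.84B tokens/month
```

The sketch ignores utilization, burstiness, and engineering cost, which is why the takeaway below stresses waiting for a clear business need rather than acting at the raw breakeven point.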

Key Takeaway

The switch to dedicated deployments puts you in charge of your own inference engineering. Only switch once there is a clear and immediate business need — it adds engineering surface area and increases the floor of your monthly spend.

About Your App

Every inference engineering decision you make will be downstream of your use case.

There are two cases where inference engineers need to build at the highest level of generality:

  • Foundation models: You have trained your own model and will sell consumption directly via a public shared inference API.
  • Inference platforms: You are building an inference platform, either internally or as a product, and need to support any model and any use case.

But most AI-native applications are vertical apps like code editors or customer service agents. When building inference for vertical apps, you want to add as many constraints as possible by getting specific with your use case.

AI-Native Applications

Generative AI models have unlocked a new class of applications across industries and domains:

| Category | Example | Considerations |
| --- | --- | --- |
| Agents | Prospecting agent for sales teams | One user action triggers many inference calls |
| Chat | Front-line customer support with RAG | Time to first token makes chat feel fast |
| Voice | Real-time translation between languages | End-to-end latency for natural conversation |
| Media | Virtual try-on for clothes and jewelry | Balance output quality vs. speed |
| Search | Legal document discovery | Offline corpus filling vs. online user requests |
| RecSys | E-commerce product recommendations | Consistent latency with high request volume |
| Completion | Tab completion for coding in an IDE | Full completion chunk at user's typing speed |
| Moderation | Scan user-generated content for safety | High throughput for cost-effective checks |

Online versus Offline

One of the primary tradeoffs in inference engineering is latency versus throughput. Lower latency makes your application faster, but higher throughput makes it cheaper at scale because you can use fewer GPUs for the same number of users.

Most AI applications — code completion, chat, voice agents — are online applications that run in real time and should be optimized for latency.

However, some applications have offline batch inference needs. Offline jobs are better served by high-throughput deployments: each individual request may be slower than an interactive user would tolerate, but the system as a whole processes far more requests per hour.

Example offline workloads include:

  • Catalog transcription: Transcribing a back catalog of podcasts or audio to make it searchable
  • Document processing: Embedding, converting, or analyzing documents on a regular cadence
  • Corpus preparation: Cleaning, embedding, or preparing massive corpora for model training
💡 Dual-Use Models

A single model can serve both online and offline jobs. Whisper, for example, could power both a real-time dictation app and a batch transcription job. If both use cases have enough volume, it's more cost-effective to create two separate deployments — one optimized for latency, the other for throughput.

Consumer versus B2B

Consumer applications are generally more cost-sensitive with less predictable usage patterns. A single launch can drive a spike in usage overnight. Prioritize marginal cost and flexibility.

Business-to-business products often have better margins and more stable usage but require high availability and consistently low latency. Prioritize latency and uptime.

In both cases, compliance can limit infrastructure options — data sovereignty, user privacy, and regulatory compliance are essential considerations.

Model Selection

All else being equal — hardware, runtime, optimizations, architecture — inference on a smaller model with fewer parameters will be faster and cheaper than inference on a larger model.

Key Takeaway

The most important decision in model performance optimization isn't the runtime engine or speculation algorithm — it's which model you choose to work with in the first place. Find or create the smallest, easiest-to-run model that's smart enough to handle the task.

Model Evaluation

Model evaluation, or evals, is the practice of systematically measuring model intelligence. High-conviction evals are a prerequisite for inference engineering:

  • Spend time wisely: Before investing in making a model fast, evals ensure the model is useful.
  • Establish a baseline: Some performance optimization techniques risk reducing model quality, requiring a baseline to compare against.

Intelligence benchmarks like MMLU or SWE-bench are useful for shortlisting models, but they have become saturated. Goodhart's Law states that "when a measure becomes a target, it ceases to be a good measure."

Tips for useful evaluation work:

  1. Look at your data: Check eval results against your intuition for the product and problem space
  2. Be precise: Focus evaluation on the hardest problems a model needs to solve
  3. Use tools: Don't reinvent the wheel on one of the fundamental problems in AI engineering

Fine-Tuning for Domain-Specific Quality

Fine-tuning is the practice of taking a pre-trained foundation model and adapting it to a specific use case by introducing new data.

If you can fine-tune a small model to pass your evals, you set yourself up for an easier time hitting your latency and cost targets. A great example is translating English into SQL — a tiny fine-tuned model of just a few billion parameters can reach equivalent performance to hundred-billion parameter general-purpose models on this specific task.

Distillation

Distillation is the process of using a large "teacher" model to train a smaller "student" model to emulate the larger model's behavior. Unlike fine-tuning on synthetic data, distillation shows the student the teacher's actual probability distributions, not just its final answers.

In January 2025, DeepSeek released distilled versions of DeepSeek-R1 built on Llama 3 and Qwen 2.5 architectures. These distilled models showed similar reasoning behavior (with somewhat lower benchmark scores) while being small enough to take advantage of existing performance work for their base architectures.

Measuring Latency and Throughput

The two most common performance metrics for LLMs are TTFT (time to first token) and TPS (tokens per second).

| Metric | Description | Phase | Goal |
| --- | --- | --- | --- |
| TTFT | How long until the user sees the first output token? | Compute-bound prefill | Lower is better |
| TPS | How many tokens per second after the first token? | Bandwidth-bound decode | Higher is better |

More specific TPS terms:

  • Perceived TPS: Observed tokens per second per user after the first token (latency)
  • Total TPS: Total tokens generated per second by the entire inference service (throughput)
  • Inter-token latency (ITL): Time between subsequent tokens. 10ms ITL = 100 TPS
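As a sketch, all three decode-side metrics can be derived from per-token arrival timestamps of a single streamed response. The timestamps below are made up for illustration; in practice you would record `time.perf_counter()` as each streamed chunk arrives:

```python
# Deriving TTFT, perceived TPS, and mean ITL from per-token arrival
# timestamps (in seconds) of one streamed response.

def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - request_start          # prefill + first token
    gen_time = token_times[-1] - token_times[0]    # decode phase only
    decoded = len(token_times) - 1                 # tokens after the first
    itl = gen_time / decoded if decoded else 0.0   # mean inter-token latency
    tps = decoded / gen_time if gen_time else 0.0  # perceived TPS
    return {"ttft_s": ttft, "itl_s": itl, "perceived_tps": tps}

# Five tokens arriving every 10 ms after a 200 ms prefill:
m = stream_metrics(0.0, [0.200, 0.210, 0.220, 0.230, 0.240])
print(m)  # TTFT 0.200s, ITL 0.010s -> 100 tokens/s
```

Note that perceived TPS deliberately excludes the first token, so it measures only the bandwidth-bound decode phase described in the table above.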

Latency Percentiles

LLM total response time is generally a right-skewed distribution — most times concentrate around a mode, but outliers can take significantly longer.

| Percentile | Meaning |
| --- | --- |
| P50 | Median latency — 1 in 2 requests is slower |
| P90 | 90th percentile — 1 in 10 requests is slower |
| P95 | 95th percentile — 1 in 20 requests is slower |
| P99 | 99th percentile — 1 in 100 requests is slower |
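These percentiles can be computed from a sample of measured response times with the standard library alone. The latencies below are synthetic, drawn from a right-skewed log-normal to mimic the distribution described above, not real measurements:

```python
# Computing P50/P90/P95/P99 latencies from a synthetic right-skewed sample.
import random
import statistics

random.seed(0)
# Synthetic latencies: a fast mode plus a long slow tail (log-normal).
latencies = [random.lognormvariate(mu=-0.5, sigma=0.6) for _ in range(10_000)]

# quantiles(n=100) returns the 99 cut points P1..P99.
cuts = statistics.quantiles(latencies, n=100)
for p in (50, 90, 95, 99):
    print(f"P{p}: {cuts[p - 1]:.3f}s")
```

On a skewed distribution like this, P99 lands well above the median, which is exactly why the takeaway below argues for tracking tail percentiles rather than averages alone.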

Key Takeaway

While driving down average latency matters, good performance work also focuses on reducing P90/P99 latencies. Outliers can dramatically affect user experience and trust. It's not good enough for most interactions to feel snappy if one in every ten takes several seconds.

End-to-End Metrics

Both inference-only and end-to-end metrics are valuable. Inference time tells you how effective your model performance work is, while end-to-end metrics reveal your users' perception of how fast your application is.

When inference time is fast but end-to-end time is slow, turn your attention to infrastructure rather than model performance optimization.
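A minimal sketch of capturing both measurements around one request, assuming a hypothetical `handle_request`/`call_model` pair standing in for your application logic and model call:

```python
# Separating inference time from end-to-end time for a single request.
# call_model is a hypothetical stand-in for the actual model call.
import time

def call_model(prompt: str) -> str:
    time.sleep(0.05)  # stand-in for real inference latency
    return "model output"

def handle_request(prompt: str) -> tuple[str, float, float]:
    e2e_start = time.perf_counter()
    # ... pre-processing, retrieval, queueing would happen here ...
    inf_start = time.perf_counter()
    out = call_model(prompt)
    inference_s = time.perf_counter() - inf_start
    # ... post-processing, network serialization would happen here ...
    e2e_s = time.perf_counter() - e2e_start
    return out, inference_s, e2e_s

out, inf_s, e2e_s = handle_request("hello")
# A large gap between e2e_s and inf_s points at infrastructure,
# not model performance.
print(f"inference {inf_s:.3f}s / end-to-end {e2e_s:.3f}s")
```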
