Before diving into inference engineering, you need clear answers to five questions: Which model(s)? What interface? What latency budget? What unit economics? What usage patterns?
Early on in building an AI product, the answers to these questions may not be clear. At that stage, it's usually better to use off-the-shelf APIs rather than invest in dedicated inference. As the product scales, the requirements crystallize and inference engineering becomes a worthwhile pursuit.
Scale and Specialization#
There are two ways to add AI models to your product:
- Shared inference: Send your traffic to a public API endpoint for a given model and pay per million tokens or some other consumption-based metric.
- Dedicated deployments: Rent GPUs and set up an inference service exclusively for your application, paying per hour of GPU time.
Shared versus dedicated inference is not exactly the same conversation as closed versus open models — there are plenty of shared endpoints for open models and many model labs offer large customers some kind of dedicated setup for their closed models. However, one of the key motivations for adopting open models is that it unlocks unrestricted dedicated inference.
Most AI products start with pay-per-token APIs because the tradeoffs make sense while looking for product-market fit.
| | Pros | Cons |
|---|---|---|
| Shared | Zero overhead, only pay for consumption | Cost scales linearly with usage |
| | No cold start times, model always available | Provider uptime caps product uptime |
| | Minimal engineering, just need an API key | No control over latency or rate limits |
Over time, shifting to dedicated deployments becomes essential for three reasons:
- Scale: You are processing enough volume that it's more economical to pay per GPU than per million tokens.
- Specialization: You are running a custom or fine-tuned model, or you have specific latency or uptime requirements.
- Orchestration: Your product relies on multiple models and multi-step pipelines and you need to minimize network latency and deployment complexity.
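The scale argument can be made concrete with a quick break-even estimate: find the monthly token volume at which an always-on GPU reservation costs less than paying per token. A minimal sketch, with all prices as illustrative placeholders rather than real quotes:

```python
# Rough break-even sketch: the monthly token volume above which a dedicated
# GPU deployment is cheaper than a pay-per-token API.
# All numbers below are illustrative placeholders, not real prices.

def breakeven_tokens_per_month(
    api_price_per_m_tokens: float,   # dollars per 1M tokens on a shared API
    gpu_hourly_rate: float,          # dollars per GPU-hour, dedicated
    gpus: int,                       # number of GPUs reserved
    hours_per_month: float = 730.0,  # always-on reservation
) -> float:
    """Monthly token volume above which dedicated is cheaper than shared."""
    monthly_gpu_cost = gpu_hourly_rate * gpus * hours_per_month
    return monthly_gpu_cost / api_price_per_m_tokens * 1_000_000

# Hypothetical: $2 per 1M tokens on a shared API vs. two GPUs at $4/hour
tokens = breakeven_tokens_per_month(2.0, 4.0, 2)
```

Below the break-even volume, the dedicated deployment sits partly idle while you pay for it anyway, which is one reason the switch should wait for a clear business need.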
Key Takeaway
The switch to dedicated deployments puts you in charge of your own inference engineering. Only switch once there is a clear and immediate business need — it adds engineering surface area and increases the floor of your monthly spend.
About Your App#
Every inference engineering decision you make will be downstream of your use case.
There are two cases where inference engineers need to build at the highest level of generality:
- Foundation models: You have trained your own model and will sell consumption directly via a public shared inference API.
- Inference platforms: You are building an inference platform, either internally or as a product, and need to support any model and any use case.
But most AI-native applications are vertical apps like code editors or customer service agents. When building inference for vertical apps, you want to add as many constraints as possible by getting specific with your use case.
AI-Native Applications
Generative AI models have unlocked a new class of applications across industries and domains:
| Category | Example | Considerations |
|---|---|---|
| Agents | Prospecting agent for sales teams | One user action triggers many inference calls |
| Chat | Front-line customer support with RAG | Time to first token makes chat feel fast |
| Voice | Real-time translation between languages | End-to-end latency for natural conversation |
| Media | Virtual try-on for clothes and jewelry | Balance output quality vs. speed |
| Search | Legal document discovery | Offline corpus filling vs. online user requests |
| RecSys | E-commerce product recommendations | Consistent latency with high request volume |
| Completion | Tab completion for coding in an IDE | Full completion chunk at user's typing speed |
| Moderation | Scan user-generated content for safety | High throughput for cost-effective checks |
Online versus Offline
One of the primary tradeoffs in inference engineering is latency versus throughput. Lower latency makes your application faster, but higher throughput makes it cheaper at scale because you can use fewer GPUs for the same number of users.
Most AI applications — code completion, chat, voice agents — are online applications that run in real time and should be optimized for latency.
However, some applications have offline batch inference needs. Offline jobs are better served by high-throughput deployments: each individual request may be too slow for a good interactive experience, but the system as a whole processes far more requests per hour.
Example offline workloads include:
- Catalog transcription: Transcribing a back catalog of podcasts or audio to make it searchable
- Document processing: Embedding, converting, or analyzing documents on a regular cadence
- Corpus preparation: Cleaning, embedding, or preparing massive corpora for model training
A single model can serve both online and offline jobs. Whisper, for example, could power both a real-time dictation app and a batch transcription job. If both use cases have enough volume, it's more cost-effective to create two separate deployments: one optimized for latency, the other for throughput.
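The dual-deployment pattern can be reduced to a simple routing decision at request time. A minimal sketch, where the endpoint URLs and the job schema are hypothetical stand-ins for your own infrastructure:

```python
# Sketch: routing the same model (e.g. Whisper) to one of two deployments,
# a latency-optimized endpoint for interactive use and a throughput-optimized
# endpoint for batch work. Both URLs are placeholders.

LATENCY_ENDPOINT = "https://realtime.example.internal/whisper"    # small batches, minimal queueing
THROUGHPUT_ENDPOINT = "https://batch.example.internal/whisper"    # large batches, high GPU utilization

def pick_endpoint(job: dict) -> str:
    """Interactive jobs go to the latency deployment; everything else batches."""
    return LATENCY_ENDPOINT if job.get("interactive", False) else THROUGHPUT_ENDPOINT
```

A live dictation request (`{"interactive": True}`) would hit the latency endpoint, while a back-catalog transcription job would be queued against the throughput endpoint.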
Consumer versus B2B
Consumer applications are generally more cost-sensitive with less predictable usage patterns. A single launch can drive a spike in usage overnight. Prioritize marginal cost and flexibility.
Business-to-business products often have better margins and more stable usage but require high availability and consistently low latency. Prioritize latency and uptime.
In both cases, compliance can limit infrastructure options — data sovereignty, user privacy, and regulatory compliance are essential considerations.
Model Selection#
All else being equal — hardware, runtime, optimizations, architecture — inference on a smaller model with fewer parameters will be faster and cheaper than inference on a larger model.
Key Takeaway
The most important decision in model performance optimization isn't the runtime engine or speculation algorithm — it's which model you choose to work with in the first place. Find or create the smallest, easiest-to-run model that's smart enough to handle the task.
Model Evaluation
Model evaluation, or evals, is the practice of systematically measuring model intelligence. High-conviction evals are a prerequisite for inference engineering:
- Spend time wisely: Before investing in making a model fast, evals ensure the model is useful.
- Establish a baseline: Some performance optimization techniques risk reducing model quality, requiring a baseline to compare against.
Intelligence benchmarks like MMLU or SWE-bench are useful for shortlisting models, but they have become saturated. As Goodhart's Law puts it, "when a measure becomes a target, it ceases to be a good measure."
Tips for useful evaluation work:
- Look at your data: Check eval results against your intuition for the product and problem space
- Be precise: Focus evaluation on the hardest problems a model needs to solve
- Use tools: Don't reinvent the wheel on one of the fundamental problems in AI engineering
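To make the baseline idea concrete, here is a toy sketch of an eval harness, where `model` is any callable from prompt to answer and the cases are placeholders; a real harness would use fuzzier matching and a purpose-built eval tool:

```python
# Toy eval harness: run a model over a fixed set of hard cases and report
# the pass rate, so optimized variants can be compared against a baseline.

def run_eval(model, cases: list[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) cases where the model's answer matches."""
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)

cases = [("2+2=", "4"), ("capital of France?", "Paris")]  # placeholder cases
baseline_score = run_eval(lambda p: {"2+2=": "4"}.get(p, ""), cases)  # 0.5
```

Running the same `cases` before and after a quantization or speculation change tells you whether the optimization cost you any quality.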
Fine-Tuning for Domain-Specific Quality
Fine-tuning is the practice of taking a pre-trained foundation model and adapting it to a specific use case by introducing new data.
If you can fine-tune a small model to pass your evals, you set yourself up for an easier time hitting your latency and cost targets. A great example is translating English into SQL: a tiny fine-tuned model of just a few billion parameters can match hundred-billion-parameter general-purpose models on this specific task.
Distillation
Distillation is the process of using a large "teacher" model to train a smaller "student" model to emulate the larger model's behavior. Unlike fine-tuning on synthetic data, distillation shows the student the teacher's actual probability distributions, not just its final answers.
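The soft-target idea can be illustrated with a toy calculation: the student's loss is the KL divergence between the teacher's and student's next-token distributions. The three-token vocabulary and all probabilities below are placeholder values for illustration:

```python
import math

# Toy sketch of the distillation objective. The student is trained to match
# the teacher's full next-token probability distribution (soft targets),
# not just the teacher's final answer.

def kl_divergence(teacher: list[float], student: list[float]) -> float:
    """KL(teacher || student), the quantity minimized during distillation."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

teacher_probs = [0.7, 0.2, 0.1]  # teacher's distribution over a toy vocabulary
student_probs = [0.6, 0.3, 0.1]  # student's current distribution
loss = kl_divergence(teacher_probs, student_probs)  # zero only when they match
```

Because the teacher's full distribution carries more signal per example than a single sampled answer, distillation typically transfers behavior more efficiently than fine-tuning on the teacher's outputs alone.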
In January 2025, DeepSeek released distilled versions of DeepSeek-R1 built on the Llama 3 and Qwen 2.5 architectures. These distilled models showed similar reasoning behavior with somewhat lower benchmark scores, but they were relatively small and could take advantage of existing performance work on those base architectures.
Measuring Latency and Throughput#
The two most common performance metrics for LLMs are TTFT (time to first token) and TPS (tokens per second).
| Metric | Description | Phase | Goal |
|---|---|---|---|
| TTFT | How long until the user sees the first output token? | Compute-bound prefill | Lower is better |
| TPS | How many tokens per second after the first token? | Bandwidth-bound decode | Higher is better |
More specific TPS terms:
- Perceived TPS: Observed tokens per second per user after the first token (latency)
- Total TPS: Total tokens generated per second by the entire inference service (throughput)
- Inter-token latency (ITL): Time between subsequent tokens. 10ms ITL = 100 TPS
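These metrics can be measured directly from a streaming response. A minimal sketch, assuming tokens arrive through any iterable streaming client (the `stream_tokens` argument and the simulated stream are stand-ins for a real API client):

```python
import time

# Sketch: measuring TTFT and perceived TPS from a streaming token iterator.

def measure_stream(stream_tokens):
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in stream_tokens:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start  # time to first token (prefill-dominated)
        count += 1
    elapsed_after_first = time.monotonic() - start - ttft
    # Perceived TPS counts tokens after the first, per the definition above.
    tps = (count - 1) / elapsed_after_first if elapsed_after_first > 0 else float("inf")
    return ttft, tps

# Toy usage with a simulated stream of 5 tokens, 10 ms apart:
def _fake_stream(n=5, delay_s=0.01):
    for _ in range(n):
        time.sleep(delay_s)
        yield "tok"

ttft, tps = measure_stream(_fake_stream())
itl_ms = 1000.0 / tps  # average inter-token latency implied by the measured TPS
```

Note the two clocks at work: TTFT reflects the compute-bound prefill, while TPS (equivalently ITL) reflects the bandwidth-bound decode.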
Latency Percentiles
LLM total response time is generally a right-skewed distribution — most times concentrate around a mode, but outliers can take significantly longer.
| Percentile | Meaning |
|---|---|
| P50 | Median latency — 1 in 2 requests is slower |
| P90 | 90th percentile — 1 in 10 requests is slower |
| P95 | 95th percentile — 1 in 20 requests is slower |
| P99 | 99th percentile — 1 in 100 requests is slower |
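Percentiles like these can be computed from raw latency samples with a nearest-rank method. A minimal stdlib-only sketch; the sample values are illustrative:

```python
import math

# Nearest-rank percentile over raw latency samples.

def latency_percentile(samples_ms: list[float], p: float) -> float:
    """Smallest sample such that at least p% of samples are <= it."""
    ordered = sorted(samples_ms)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

samples = list(range(1, 101))  # pretend each value is a response time in ms
p50 = latency_percentile(samples, 50)  # 50
p99 = latency_percentile(samples, 99)  # 99
```

On a right-skewed distribution, P99 can sit far above P50, which is exactly why tail percentiles deserve their own targets.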
Key Takeaway
While driving down average latency matters, good performance work also focuses on reducing P90/P99 latencies. Outliers can dramatically affect user experience and trust. It's not good enough for most interactions to feel snappy if one in every ten takes several seconds.
End-to-End Metrics
Both inference-only and end-to-end metrics are valuable. Inference time tells you how effective your model performance work is, while end-to-end metrics reveal your users' perception of how fast your application is.
When inference time is fast but end-to-end time is slow, turn your attention to infrastructure rather than model performance optimization.
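Separating the two measurements is a matter of timing each stage of the request path. A minimal sketch, where `preprocess`, `call_model`, and `postprocess` are placeholders for your own pipeline stages:

```python
import time

# Sketch: splitting inference time from end-to-end time, so you know whether
# to optimize the model or the surrounding infrastructure.

def timed_request(preprocess, call_model, postprocess, payload):
    t0 = time.monotonic()
    prepared = preprocess(payload)
    t1 = time.monotonic()
    output = call_model(prepared)
    t2 = time.monotonic()
    result = postprocess(output)
    t3 = time.monotonic()
    return result, {
        "inference_s": t2 - t1,               # model performance work moves this
        "end_to_end_s": t3 - t0,              # what the user actually experiences
        "overhead_s": (t3 - t0) - (t2 - t1),  # network, queueing, pre/post-processing
    }

# Toy usage with stand-in stages:
result, metrics = timed_request(
    preprocess=lambda x: x,
    call_model=lambda x: x.upper(),  # stand-in for the actual model call
    postprocess=lambda x: x,
    payload="hello",
)
```

When `overhead_s` dominates `inference_s`, further model-side optimization won't move what users feel; the win is in the infrastructure around the model.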
Check Your Understanding
When should you switch from shared inference to dedicated deployments?