Before diving into inference engineering, you need clear answers to five questions: Which model(s)? What interface? What latency budget? What unit economics? What usage patterns?
Early on in building an AI product, the answers to these questions may not be clear. At that stage, it's usually better to use off-the-shelf APIs rather than invest in dedicated inference. As the product scales, the requirements crystallize and inference engineering becomes a worthwhile pursuit.
Scale and Specialization#
There are two ways to add AI models to your product:
- Shared inference: Send your traffic to a public API endpoint for a given model and pay per million tokens or some other consumption-based metric.
- Dedicated deployments: Rent GPUs and set up an inference service exclusively for your application, paying per hour of GPU time.
Shared versus dedicated inference is not exactly the same conversation as closed versus open models — there are plenty of shared endpoints for open models and many model labs offer large customers some kind of dedicated setup for their closed models. However, one of the key motivations for adopting open models is that it unlocks unrestricted dedicated inference.
Most AI products start with pay-per-token APIs because the tradeoffs make sense while looking for product-market fit.
| | Pros | Cons |
|---|---|---|
| Shared | Zero overhead, only pay for consumption | Cost scales linearly with usage |
| | No cold start times, model always available | Provider uptime caps product uptime |
| | Minimal engineering, just need an API key | No control over latency or rate limits |
Over time, shifting to dedicated deployments becomes essential for three reasons:
- Scale: You are processing enough volume that it's more economical to pay per GPU than per million tokens.
- Specialization: You are running a custom or fine-tuned model, or you have specific latency or uptime requirements.
- Orchestration: Your product relies on multiple models and multi-step pipelines and you need to minimize network latency and deployment complexity.
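The scale argument can be made concrete with a quick break-even estimate: find the monthly token volume at which an always-on GPU reservation costs less than paying per token. A minimal sketch, with all prices as illustrative placeholders rather than real quotes:

```python
# Rough break-even sketch: the monthly token volume above which a dedicated
# GPU deployment is cheaper than a pay-per-token API.
# All numbers below are illustrative placeholders, not real prices.

def breakeven_tokens_per_month(
    api_price_per_m_tokens: float,   # dollars per 1M tokens on a shared API
    gpu_hourly_rate: float,          # dollars per GPU-hour, dedicated
    gpus: int,                       # number of GPUs reserved
    hours_per_month: float = 730.0,  # always-on reservation
) -> float:
    """Monthly token volume above which dedicated is cheaper than shared."""
    monthly_gpu_cost = gpu_hourly_rate * gpus * hours_per_month
    return monthly_gpu_cost / api_price_per_m_tokens * 1_000_000

# Hypothetical: $2 per 1M tokens on a shared API vs. two GPUs at $4/hour
tokens = breakeven_tokens_per_month(2.0, 4.0, 2)
```

Below the break-even volume, the dedicated deployment sits partly idle while you pay for it anyway, which is one reason the switch should wait for a clear business need.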
Key Takeaway
The switch to dedicated deployments puts you in charge of your own inference engineering. Only switch once there is a clear and immediate business need — it adds engineering surface area and increases the floor of your monthly spend.
About Your App#
Every inference engineering decision you make will be downstream of your use case.
There are two cases where inference engineers need to build at the highest level of generality:
- Foundation models: You have trained your own model and will sell consumption directly via a public shared inference API.
- Inference platforms: You are building an inference platform, either internally or as a product, and need to support any model and any use case.
But most AI-native applications are vertical apps like code editors or customer service agents. When building inference for vertical apps, you want to add as many constraints as possible by getting specific with your use case.
AI-Native Applications
Generative AI models have unlocked a new class of applications across industries and domains:
| Category | Example | Considerations |
|---|---|---|
| Agents | Prospecting agent for sales teams | One user action triggers many inference calls |
| Chat | Front-line customer support with RAG | Time to first token makes chat feel fast |
| Voice | Real-time translation between languages | End-to-end latency for natural conversation |
| Media | Virtual try-on for clothes and jewelry | Balance output quality vs. speed |
| Search | Legal document discovery | Offline corpus filling vs. online user requests |
| RecSys | E-commerce product recommendations | Consistent latency with high request volume |
| Completion | Tab completion for coding in an IDE | Full completion chunk at user's typing speed |
| Moderation | Scan user-generated content for safety | High throughput for cost-effective checks |
Online versus Offline
One of the primary tradeoffs in inference engineering is latency versus throughput. Lower latency makes your application faster, but higher throughput makes it cheaper at scale because you can use fewer GPUs for the same number of users.
Most AI applications — code completion, chat, voice agents — are online applications that run in real time and should be optimized for latency.
However, some applications have offline batch inference needs. Offline jobs are better served by high-throughput deployments: each individual request may be too slow for a good interactive experience, but the system as a whole processes far more requests per hour.
Example offline workloads include:
- Catalog transcription: Transcribing a back catalog of podcasts or audio to make it searchable
- Document processing: Embedding, converting, or analyzing documents on a regular cadence
- Corpus preparation: Cleaning, embedding, or preparing massive corpora for model training
A single model can serve both online and offline jobs. Whisper, for example, could power both a real-time dictation app and a batch transcription job. If both use cases have enough volume, it's more cost-effective to create two separate deployments: one optimized for latency, the other for throughput.
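The dual-deployment pattern can be reduced to a simple routing decision at request time. A minimal sketch, where the endpoint URLs and the job schema are hypothetical stand-ins for your own infrastructure:

```python
# Sketch: routing the same model (e.g. Whisper) to one of two deployments,
# a latency-optimized endpoint for interactive use and a throughput-optimized
# endpoint for batch work. Both URLs are placeholders.

LATENCY_ENDPOINT = "https://realtime.example.internal/whisper"    # small batches, minimal queueing
THROUGHPUT_ENDPOINT = "https://batch.example.internal/whisper"    # large batches, high GPU utilization

def pick_endpoint(job: dict) -> str:
    """Interactive jobs go to the latency deployment; everything else batches."""
    return LATENCY_ENDPOINT if job.get("interactive", False) else THROUGHPUT_ENDPOINT
```

A live dictation request (`{"interactive": True}`) would hit the latency endpoint, while a back-catalog transcription job would be queued against the throughput endpoint.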
Consumer versus B2B
Consumer applications are generally more cost-sensitive with less predictable usage patterns. A single launch can drive a spike in usage overnight. Prioritize marginal cost and flexibility.
Business-to-business products often have better margins and more stable usage but require high availability and consistently low latency. Prioritize latency and uptime.
In both cases, compliance can limit infrastructure options — data sovereignty, user privacy, and regulatory compliance are essential considerations.
Model Selection#
All else being equal — hardware, runtime, optimizations, architecture — inference on a smaller model with fewer parameters will be faster and cheaper than inference on a larger model.
Key Takeaway
The most important decision in model performance optimization isn't the runtime engine or speculation algorithm — it's which model you choose to work with in the first place. Find or create the smallest, easiest-to-run model that's smart enough to handle the task.
Model Evaluation
Model evaluation, or evals, is the practice of systematically measuring model intelligence. High-conviction evals are a prerequisite for inference engineering:
- Spend time wisely: Before investing in making a model fast, evals ensure the model is useful.
- Establish a baseline: Some performance optimization techniques risk reducing model quality, requiring a baseline to compare against.
Intelligence benchmarks like MMLU or SWE-bench are useful for shortlisting models, but they have become saturated. As Goodhart's Law puts it, "when a measure becomes a target, it ceases to be a good measure."
Tips for useful evaluation work:
- Look at your data: Check eval results against your intuition for the product and problem space
- Be precise: Focus evaluation on the hardest problems a model needs to solve
- Use tools: Don't reinvent the wheel on one of the fundamental problems in AI engineering
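To make the baseline idea concrete, here is a toy sketch of an eval harness, where `model` is any callable from prompt to answer and the cases are placeholders; a real harness would use fuzzier matching and a purpose-built eval tool:

```python
# Toy eval harness: run a model over a fixed set of hard cases and report
# the pass rate, so optimized variants can be compared against a baseline.

def run_eval(model, cases: list[tuple[str, str]]) -> float:
    """Fraction of (prompt, expected) cases where the model's answer matches."""
    passed = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return passed / len(cases)

cases = [("2+2=", "4"), ("capital of France?", "Paris")]  # placeholder cases
baseline_score = run_eval(lambda p: {"2+2=": "4"}.get(p, ""), cases)  # 0.5
```

Running the same `cases` before and after a quantization or speculation change tells you whether the optimization cost you any quality.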
Fine-Tuning for Domain-Specific Quality
Fine-tuning is the practice of taking a pre-trained foundation model and adapting it to a specific use case by introducing new data.
If you can fine-tune a small model to pass your evals, you set yourself up for an easier time hitting your latency and cost targets. A great example is translating English into SQL: a tiny fine-tuned model of just a few billion parameters can match hundred-billion-parameter general-purpose models on this specific task.
Distillation
Distillation is the process of using a large "teacher" model to train a smaller "student" model to emulate the larger model's behavior. Unlike fine-tuning on synthetic data, distillation shows the student the teacher's actual probability distributions, not just its final answers.
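The soft-target idea can be illustrated with a toy calculation: the student's loss is the KL divergence between the teacher's and student's next-token distributions. The three-token vocabulary and all probabilities below are placeholder values for illustration:

```python
import math

# Toy sketch of the distillation objective. The student is trained to match
# the teacher's full next-token probability distribution (soft targets),
# not just the teacher's final answer.

def kl_divergence(teacher: list[float], student: list[float]) -> float:
    """KL(teacher || student), the quantity minimized during distillation."""
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)

teacher_probs = [0.7, 0.2, 0.1]  # teacher's distribution over a toy vocabulary
student_probs = [0.6, 0.3, 0.1]  # student's current distribution
loss = kl_divergence(teacher_probs, student_probs)  # zero only when they match
```

Because the teacher's full distribution carries more signal per example than a single sampled answer, distillation typically transfers behavior more efficiently than fine-tuning on the teacher's outputs alone.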
In January 2025, DeepSeek released distilled versions of DeepSeek-R1 built on the Llama 3 and Qwen 2.5 architectures. These distilled models showed similar reasoning behavior with somewhat lower benchmark scores, but they were relatively small and could take advantage of existing performance work on those base architectures.
Measuring Latency and Throughput#
The two most common performance metrics for LLMs are TTFT (time to first token) and TPS (tokens per second).
| Metric | Description | Phase | Goal |
|---|---|---|---|
| TTFT | How long until the user sees the first output token? | Compute-bound prefill | Lower is better |
| TPS | How many tokens per second after the first token? | Bandwidth-bound decode | Higher is better |
More specific TPS terms:
- Perceived TPS: Observed tokens per second per user after the first token (latency)
- Total TPS: Total tokens generated per second by the entire inference service (throughput)
- Inter-token latency (ITL): Time between subsequent tokens. 10ms ITL = 100 TPS
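These metrics can be measured directly from a streaming response. A minimal sketch, assuming tokens arrive through any iterable streaming client (the `stream_tokens` argument and the simulated stream are stand-ins for a real API client):

```python
import time

# Sketch: measuring TTFT and perceived TPS from a streaming token iterator.

def measure_stream(stream_tokens):
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in stream_tokens:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start  # time to first token (prefill-dominated)
        count += 1
    elapsed_after_first = time.monotonic() - start - ttft
    # Perceived TPS counts tokens after the first, per the definition above.
    tps = (count - 1) / elapsed_after_first if elapsed_after_first > 0 else float("inf")
    return ttft, tps

# Toy usage with a simulated stream of 5 tokens, 10 ms apart:
def _fake_stream(n=5, delay_s=0.01):
    for _ in range(n):
        time.sleep(delay_s)
        yield "tok"

ttft, tps = measure_stream(_fake_stream())
itl_ms = 1000.0 / tps  # average inter-token latency implied by the measured TPS
```

Note the two clocks at work: TTFT reflects the compute-bound prefill, while TPS (equivalently ITL) reflects the bandwidth-bound decode.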
Latency Percentiles
LLM total response time is generally a right-skewed distribution — most times concentrate around a mode, but outliers can take significantly longer.
| Percentile | Meaning |
|---|---|
| P50 | Median latency — 1 in 2 requests is slower |
| P90 | 90th percentile — 1 in 10 requests is slower |
| P95 | 95th percentile — 1 in 20 requests is slower |
| P99 | 99th percentile — 1 in 100 requests is slower |
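Percentiles like these can be computed from raw latency samples with a nearest-rank method. A minimal stdlib-only sketch; the sample values are illustrative:

```python
import math

# Nearest-rank percentile over raw latency samples.

def latency_percentile(samples_ms: list[float], p: float) -> float:
    """Smallest sample such that at least p% of samples are <= it."""
    ordered = sorted(samples_ms)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

samples = list(range(1, 101))  # pretend each value is a response time in ms
p50 = latency_percentile(samples, 50)  # 50
p99 = latency_percentile(samples, 99)  # 99
```

On a right-skewed distribution, P99 can sit far above P50, which is exactly why tail percentiles deserve their own targets.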
Key Takeaway
While driving down average latency matters, good performance work also focuses on reducing P90/P99 latencies. Outliers can dramatically affect user experience and trust. It's not good enough for most interactions to feel snappy if one in every ten takes several seconds.
End-to-End Metrics
Both inference-only and end-to-end metrics are valuable. Inference time tells you how effective your model performance work is, while end-to-end metrics reveal your users' perception of how fast your application is.
When inference time is fast but end-to-end time is slow, turn your attention to infrastructure rather than model performance optimization.
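Separating the two measurements is a matter of timing each stage of the request path. A minimal sketch, where `preprocess`, `call_model`, and `postprocess` are placeholders for your own pipeline stages:

```python
import time

# Sketch: splitting inference time from end-to-end time, so you know whether
# to optimize the model or the surrounding infrastructure.

def timed_request(preprocess, call_model, postprocess, payload):
    t0 = time.monotonic()
    prepared = preprocess(payload)
    t1 = time.monotonic()
    output = call_model(prepared)
    t2 = time.monotonic()
    result = postprocess(output)
    t3 = time.monotonic()
    return result, {
        "inference_s": t2 - t1,               # model performance work moves this
        "end_to_end_s": t3 - t0,              # what the user actually experiences
        "overhead_s": (t3 - t0) - (t2 - t1),  # network, queueing, pre/post-processing
    }

# Toy usage with stand-in stages:
result, metrics = timed_request(
    preprocess=lambda x: x,
    call_model=lambda x: x.upper(),  # stand-in for the actual model call
    postprocess=lambda x: x,
    payload="hello",
)
```

When `overhead_s` dominates `inference_s`, further model-side optimization won't move what users feel; the win is in the infrastructure around the model.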
Check Your Understanding
When should you switch from shared inference to dedicated deployments?