Here's a problem I faced at my startup: we had 16 different fine-tuned classifiers for our customer support AI. Each one detected something different—issue resolution, suspicious messages, requests for human agents, phone claims, and more. The naive solution? Deploy 16 separate GPU instances. The monthly bill? $17,000+.
There had to be a better way. And there was: hot-swappable LoRA adapters. By serving all 16 classifiers from a single GPU, we cut our costs by 93% while actually improving latency. Here's how.
The Multi-Classifier Problem
Our AI support agent needed to make multiple real-time decisions during every conversation:
- Issue Resolution Detection: Has the customer's problem been solved?
- Escalation Detection: Is the customer asking for a human agent?
- Fraud Detection: Is someone trying to manipulate the system?
- Phone Claim Detection: Is this a device warranty claim?
- Topic Switching: Did the customer change subjects mid-conversation?
- ...and 11 more specialized classifiers
Each classifier was a fine-tuned Mistral 7B model with its own specialized training data. Traditional deployment would mean:
- 16 GPU instances running 24/7
- ~$1,500/month per instance (ml.g5.2xlarge on AWS)
- Complex routing logic to hit the right endpoint
- Network latency for each classifier call
The LoRA Solution: One Base Model, Many Adapters
LoRA (Low-Rank Adaptation) works by freezing the base model and training small "adapter" matrices. These adapters are tiny—typically 0.1-1% of the base model size. A 7B parameter model with a rank-32 LoRA adapter adds only ~85MB of weights.
Since adapters are small, we can load many adapters into GPU memory simultaneously and switch between them at inference time. The base model (7B parameters) stays loaded once; only the adapter weights (85MB each) change per request.
With 16 adapters at 85MB each, we add only 1.36GB to our memory footprint. A 24GB GPU can easily handle the 7B base model (~14GB in bfloat16) plus all adapters with room to spare for KV cache.
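To make the memory math concrete, here is a quick back-of-the-envelope sketch. It is illustrative only: it reuses this post's own figures (a 7B base model in bfloat16, ~85MB per rank-32 adapter, a 24GB ml.g5.2xlarge GPU), and the constants are assumptions you should swap for your own measurements.

# Rough GPU memory budget for one base model plus N LoRA adapters.
# All constants are assumptions mirroring the figures used in this post.
GPU_GB = 24.0                   # ml.g5.2xlarge (A10G, 24 GB)
BASE_MODEL_GB = 7e9 * 2 / 1e9   # 7B params x 2 bytes (bfloat16) ~= 14 GB
ADAPTER_MB = 85                 # rank-32 LoRA on the attention projections
N_ADAPTERS = 16
CUDA_OVERHEAD_GB = 1.0          # allocator + kernels, rough buffer

adapters_gb = N_ADAPTERS * ADAPTER_MB / 1000
kv_cache_gb = GPU_GB - BASE_MODEL_GB - adapters_gb - CUDA_OVERHEAD_GB

print(f"Base model:        {BASE_MODEL_GB:.1f} GB")
print(f"{N_ADAPTERS} adapters:       {adapters_gb:.2f} GB")
print(f"Left for KV cache: ~{kv_cache_gb:.1f} GB")

Run with these assumptions, the budget leaves roughly 7-8GB free, which is what the KV cache gets to work with later in this post.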
Implementation #1: HuggingFace PEFT
The simplest approach uses HuggingFace's PEFT library. Here's the model loading code:
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

def model_fn(model_dir):
    model_path = f'{model_dir}/'

    # Load base model with Flash Attention 2
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map={'': 0},
        attn_implementation="flash_attention_2"
    )

    # Load first adapter and wrap with PeftModel
    model = PeftModel.from_pretrained(
        model,
        model_path + "issueResolved",
        adapter_name="issueResolved",
        device_map={"": 0}
    )

    # Load additional adapters (they share the base model!)
    model.load_adapter(model_path + "isMessageWeird", adapter_name="isMessageWeird")
    model.load_adapter(model_path + "isPhoneClaim", adapter_name="isPhoneClaim")
    model.load_adapter(model_path + "requireExternalAccess", adapter_name="requireExternalAccess")
    model.load_adapter(model_path + "askingForAgent", adapter_name="askingForAgent")
    model.load_adapter(model_path + "customerNotUnderstanding", adapter_name="customerNotUnderstanding")
    # ... load up to 15+ adapters

    model.eval()
    return model
The key insight: we call load_adapter() for each specialized model, but they all share the same base weights. Inference then looks like this:
def predict_fn(inputs, model):
    # `tokenizer` and `device` are initialized alongside the model (omitted here)
    tokenized = tokenizer(inputs['input_text'], return_tensors="pt").to(device)

    # Hot-swap to the requested adapter
    if inputs.get("adapter_name"):
        model.set_adapter(inputs['adapter_name'])  # Instant switch!
        outputs = model.generate(**tokenized, output_scores=True,
                                 return_dict_in_generate=True)  # ... other generation kwargs elided
    else:
        # Use base model without any adapter
        with model.disable_adapter():
            outputs = model.generate(**tokenized, output_scores=True,
                                     return_dict_in_generate=True)  # ... other generation kwargs elided

    # Extract predictions and confidence
    probs = torch.stack(outputs.scores, dim=1).softmax(-1)
    # ... aggregate probabilities into `confidence` and decode `prediction` (omitted)
    return [{"confidence_score": confidence, "predicted_text": prediction}]
The set_adapter() call is nearly instantaneous—it just switches which adapter matrices are used in the forward pass. No model reloading, no memory reallocation.
HuggingFace Limitations
This approach works, but has some drawbacks:
- Sequential processing: Can't batch requests with different adapters
- Stateful: The "current adapter" is model state, which complicates concurrent requests (see the sketch after this list)
- Suboptimal KV cache: Standard attention implementation
- ~15 requests/second throughput on g5.2xlarge
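The "stateful" limitation deserves a concrete illustration. Because set_adapter() mutates the shared model object, concurrent requests that need different adapters must serialize the switch and the forward pass. Below is a minimal sketch with a hypothetical classify_with_adapter helper and a module-level lock; it is not our production handler, just the shape of the workaround.

import threading

# set_adapter() mutates shared model state, so two threads wanting different
# adapters must not interleave the switch and the forward pass.
_adapter_lock = threading.Lock()

def classify_with_adapter(model, tokenizer, text, adapter_name, device="cuda"):
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with _adapter_lock:                      # serializes every request
        model.set_adapter(adapter_name)
        outputs = model.generate(**inputs, max_new_tokens=1)
    return tokenizer.decode(outputs[0][-1])

The lock keeps results correct, but it also means requests for different classifiers cannot overlap on the GPU, which is exactly the throughput ceiling vLLM removes.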
Implementation #2: vLLM (3x Faster)
vLLM is a high-throughput inference engine that supports LoRA natively. It solves all the HuggingFace limitations:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Initialize vLLM with LoRA support
llm = LLM(
    model='./deployment_vllm',
    tokenizer='./deployment_vllm',
    enable_lora=True,        # Enable LoRA support
    max_model_len=8000,      # Max sequence length
    max_lora_rank=32,        # Max LoRA rank across all adapters
    max_loras=16             # Max concurrent LoRA adapters
)

# Create LoRA request objects (just metadata, not loaded yet)
lora_issue_resolved = LoRARequest(
    'issueResolved',         # Adapter name
    1,                       # Unique ID
    lora_local_path='./deployment_vllm/issueResolved/'
)

lora_message_weird = LoRARequest(
    'isMessageWeird',
    2,
    lora_local_path='./deployment_vllm/isMessageWeird/'
)

# ... create 14 more LoRARequest objects
Notice the key parameters: max_loras=16 tells vLLM to reserve space for 16 concurrent adapters, and max_lora_rank=32 sets the maximum rank across all adapters.
Inference is cleaner—we pass the adapter as a parameter rather than mutating state:
def predict_fn(inputs, model):
    input_text = inputs.get("input_text")
    adapter_name = inputs.get("adapter_name")

    sampling_params = SamplingParams(
        temperature=0,
        max_tokens=1,    # Single token for classification
        logprobs=1       # Get log probabilities
    )

    if adapter_name:
        # Get the corresponding LoRARequest
        lora_adapter = prompt2adapter[adapter_name]['adapter']
        # vLLM handles the adapter loading/switching internally!
        out = model.generate(
            input_text,
            lora_request=lora_adapter,     # Pass adapter per-request
            sampling_params=sampling_params
        )
    else:
        # Base model inference
        out = model.generate(input_text, sampling_params=sampling_params)

    predicted_text = out[0].outputs[0].text
    confidence = math.exp(out[0].outputs[0].cumulative_logprob)
    return [{"confidence_score": confidence, "predicted_text": predicted_text}]

This is stateless: each request specifies its own adapter, enabling proper concurrent processing.
How vLLM Achieves 3x Throughput: PagedAttention
vLLM's secret weapon is PagedAttention, a novel attention algorithm inspired by virtual memory in operating systems. To understand why it matters, we need to understand the KV cache problem.
The KV Cache Problem
During autoregressive generation, transformers cache the Key and Value tensors from previous tokens to avoid recomputation. This KV cache grows with sequence length and can consume massive memory:
- Mistral 7B, 8K context: ~2GB KV cache per sequence
- Batch of 10 sequences: ~20GB just for KV cache
Traditional implementations allocate contiguous memory per sequence. When sequences have different lengths, this creates severe fragmentation—up to 60% of GPU memory can be wasted on padding.
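As a sanity check on these numbers, the per-token KV cache cost follows directly from the model config: 2 tensors (K and V) x layers x KV heads x head dimension x bytes per element. The sketch below uses assumed Mistral-7B-like shapes; the exact figure depends on whether the model uses grouped-query attention and on the cache dtype, so treat the output as an estimate rather than an exact match for the figures above.

# Estimate KV cache size: 2 (K and V) x layers x kv_heads x head_dim x dtype bytes.
# Shapes below are assumptions; check your model's config.json.
def kv_cache_gb(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token_bytes / 1e9

print(kv_cache_gb(8_000))                  # one 8K sequence with grouped-query attention
print(kv_cache_gb(8_000, n_kv_heads=32))   # same length without GQA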
PagedAttention: Virtual Memory for LLMs
# Traditional attention: Contiguous KV cache per sequence
# Memory layout: [seq1_full_kv | seq2_full_kv | seq3_full_kv]
# Problem: Fragmentation when sequences have different lengths
# PagedAttention: Virtual memory for KV cache
# Memory layout: Fixed-size blocks that can be allocated anywhere
#
# Logical view (per sequence):
# Sequence 1: [Block 0] -> [Block 3] -> [Block 7]
# Sequence 2: [Block 1] -> [Block 4]
# Sequence 3: [Block 2] -> [Block 5] -> [Block 8] -> [Block 9]
#
# Physical memory: Blocks stored non-contiguously
# [B0|B1|B2|B3|B4|B5|B6|B7|B8|B9|...]
#
# Benefits:
# 1. Near-zero fragmentation (~4% waste vs ~60% in naive approach)
# 2. Dynamic allocation - sequences grow/shrink freely
# 3. Memory sharing - same prompt = same blocks (copy-on-write)

PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens each). Sequences don't need contiguous memory; they just maintain a list of block pointers. This provides:
- Near-zero fragmentation: Blocks are uniform size, ~4% waste vs ~60%
- Dynamic allocation: Sequences grow by acquiring new blocks
- Memory sharing: Identical prefixes share blocks (copy-on-write)
- Better batching: More sequences fit in memory = higher throughput
Continuous Batching
Traditional batching waits for all sequences in a batch to complete before starting new ones. vLLM uses continuous batching: as soon as one sequence finishes, a new request immediately takes its slot. This eliminates idle time and maximizes GPU utilization.
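A toy sketch helps build intuition for the scheduling loop. This is a simplification, not vLLM's actual scheduler, and the engine.step call is a hypothetical stand-in for one decode step over the live batch.

from collections import deque

# Toy illustration of continuous batching: finished sequences free their slot
# immediately and the waiting queue backfills it before the next decode step.
def continuous_batching_loop(engine, waiting: deque, max_batch_size: int):
    running = []
    while running or waiting:
        # Backfill free slots from the waiting queue before every step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step for every live sequence, regardless of when it arrived.
        finished = engine.step(running)   # stand-in: returns sequences that hit EOS
        running = [seq for seq in running if seq not in finished]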
Optimized LoRA Kernels
vLLM implements custom CUDA kernels that can process multiple LoRA adapters in a single batch:
# How vLLM handles LoRA adapters internally:
# 1. Base model weights are loaded once (frozen)
base_weight = model.layers[i].self_attn.q_proj.weight # Shape: [hidden, hidden]
# 2. LoRA adapters stored separately per adapter
lora_A = adapter.layers[i].self_attn.q_proj.lora_A # Shape: [rank, hidden]
lora_B = adapter.layers[i].self_attn.q_proj.lora_B # Shape: [hidden, rank]
# 3. Forward pass with LoRA:
# output = input @ base_weight.T + (input @ lora_A.T @ lora_B.T) * scaling
#
# The LoRA matrices are MUCH smaller:
# - Base: 4096 x 4096 = 16.7M params
# - LoRA (rank=32): (32 x 4096) + (4096 x 32) = 262K params (64x smaller!)
# 4. vLLM's optimization: Batched LoRA computation
# Multiple requests with different adapters can be batched together
# GPU kernels handle the per-request adapter selection efficiently

The key optimization: requests with different adapters can still be batched together. The kernel applies the appropriate adapter per-token based on the request metadata. This is why vLLM achieves 3x the throughput of HuggingFace.
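The arithmetic in the comment block above is easy to verify directly. The snippet below is a plain PyTorch reference of the LoRA forward pass for a single linear layer, not vLLM's fused kernel; it just shows that the adapter is a low-rank update on top of the frozen base projection, which is also why swapping adapters is cheap.

import torch

# Reference LoRA forward for one linear layer (illustrative, not vLLM's kernel).
hidden, rank, scaling = 4096, 32, 2.0           # scaling = lora_alpha / rank
x = torch.randn(8, hidden)                      # a batch of 8 token embeddings
base_weight = torch.randn(hidden, hidden)       # frozen base projection W
lora_A = torch.randn(rank, hidden) * 0.01       # trained adapter matrices
lora_B = torch.randn(hidden, rank) * 0.01

base_out = x @ base_weight.T
lora_out = (x @ lora_A.T) @ lora_B.T * scaling  # low-rank update: rank x hidden work
output = base_out + lora_out

# Equivalent "merged" weight: the adapter is just a rank-32 correction to W.
merged = base_weight + scaling * (lora_B @ lora_A)
assert torch.allclose(output, x @ merged.T, atol=1e-3)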
Deploying on AWS SageMaker
SageMaker provides a clean interface for deploying custom inference logic. Here's the complete inference script structure:
# inference.py - SageMaker entry point
import logging
import json
import math

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

logger = logging.getLogger(__name__)

# Global state
llm = None
prompt2adapter = {}
def model_fn(model_dir):
    """Called once when the endpoint starts"""
    global llm, prompt2adapter
    model_path = f'{model_dir}/'

    # Initialize all LoRA adapters
    adapters = [
        ('issueResolved', 1),
        ('isMessageWeird', 2),
        ('isPhoneClaim', 3),
        ('requireExternalAccess', 4),
        ('isSwitchingIssues', 5),
        ('tryingToManipulate', 6),
        ('askingForAgent', 7),
        ('customerNotUnderstanding', 8),
        ('askingAboutOtherDevice', 9),
        ('haveEnoughInfo', 10),
        ('nextStepsRelevant', 11),
        ('isReadyForNextStep', 12),
        ('isSameReco', 13),
        ('isAppropriateAgent', 14),
        ('isHowToQuestion', 15),
        ('isPhoneDevice', 16),
    ]
    for name, adapter_id in adapters:
        lora = LoRARequest(name, adapter_id, lora_local_path=f'{model_path}{name}/')
        prompt2adapter[name] = {'adapter': lora}

    # Initialize vLLM engine
    llm = LLM(
        model=model_path,
        tokenizer=model_path,
        enable_lora=True,
        max_model_len=8000,
        max_lora_rank=32,
        max_loras=16     # reserve slots for all 16 adapters (the default is 1)
    )
    logger.info('vLLM engine loaded with 16 LoRA adapters')
    return llm
def input_fn(json_request_data, content_type='application/json'):
    """Parse incoming request"""
    return json.loads(json_request_data)

def predict_fn(inputs, model):
    """Run inference with the specified adapter"""
    # ... (implementation shown above)

def output_fn(output, accept='application/json'):
    """Format response"""
    return json.dumps(output), accept

Deployment uses SageMaker's HuggingFace container with our custom inference script:
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# Create HuggingFace Model with custom inference script
huggingface_model = HuggingFaceModel(
    model_data='s3://my-bucket/model-artifacts/mistral-7b-multi-lora.tar.gz',
    role=role,
    transformers_version='4.37.0',
    pytorch_version='2.1.0',
    py_version='py310',
    entry_point='inference_vllm.py',   # Our custom script
    source_dir='./code',               # Directory with inference code
    model_server_workers=1,            # vLLM manages its own parallelism
    # Environment variables for vLLM
    env={
        'VLLM_ATTENTION_BACKEND': 'FLASH_ATTN',
        'CUDA_VISIBLE_DEVICES': '0',
    }
)
# Deploy to real-time endpoint
predictor = huggingface_model.deploy(
    instance_type='ml.g5.2xlarge',   # 24GB VRAM - fits 7B + 16 adapters
    initial_instance_count=1,
    endpoint_name='mistral-7b-multi-lora'
)

Model Artifact Structure
The model tarball uploaded to S3 has this structure:
mistral-7b-multi-lora.tar.gz
├── config.json # Base model config
├── tokenizer.json # Tokenizer files
├── model.safetensors # Base model weights (14GB)
├── issueResolved/
│ ├── adapter_config.json # LoRA config (rank, alpha, etc.)
│ └── adapter_model.safetensors # LoRA weights (~85MB)
├── isMessageWeird/
│ ├── adapter_config.json
│ └── adapter_model.safetensors
├── isPhoneClaim/
│ └── ...
└── ... (14 more adapter directories)

Invoking the Endpoint
import boto3
import json

runtime = boto3.client('sagemaker-runtime')

def classify_conversation(conversation_text, task_type):
    """
    Classify a conversation using the appropriate adapter.

    task_type options:
    - 'issueResolved': Is the customer's issue resolved?
    - 'isMessageWeird': Is this message suspicious/weird?
    - 'askingForAgent': Is customer asking for human agent?
    - ... (14 more classifiers)
    """
    payload = {
        'input_text': format_prompt(conversation_text, task_type),
        'adapter_name': task_type
    }
    response = runtime.invoke_endpoint(
        EndpointName='mistral-7b-multi-lora',
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    result = json.loads(response['Body'].read().decode())
    return result[0]  # {'confidence_score': 0.98, 'predicted_text': 'yes'}

# Example usage - run multiple classifiers on same conversation
conversation = "Customer: Thanks, that fixed it! Agent: Great to hear..."
results = {
    'issue_resolved': classify_conversation(conversation, 'issueResolved'),
    'asking_for_agent': classify_conversation(conversation, 'askingForAgent'),
    'is_phone_claim': classify_conversation(conversation, 'isPhoneClaim'),
}
Each request specifies which adapter to use via the adapter_name parameter. The endpoint handles the routing internally—no need for complex API gateway logic.
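Because vLLM batches requests across adapters, it also pays to fire the classifiers concurrently rather than one at a time. A client-side thread pool is enough; the sketch below reuses the classify_conversation helper from above and is illustrative rather than production code.

from concurrent.futures import ThreadPoolExecutor

# Fan out several classifiers for one conversation; vLLM batches them server-side.
TASKS = ['issueResolved', 'askingForAgent', 'isPhoneClaim', 'isMessageWeird']

def classify_all(conversation_text, tasks=TASKS):
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {task: pool.submit(classify_conversation, conversation_text, task)
                   for task in tasks}
        return {task: future.result() for task, future in futures.items()}

results = classify_all(conversation)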
Cost Analysis: 93% Savings
# Cost Analysis: Multi-LoRA vs Separate Deployments
# Option A: Separate endpoint per classifier (traditional approach)
# ----------------------------------------------------------------
# 16 classifiers × ml.g5.2xlarge ($1.515/hr) = $24.24/hr
# Monthly cost: $24.24 × 24 × 30 = $17,452.80
# Option B: Single endpoint with hot-swappable LoRA (our approach)
# ----------------------------------------------------------------
# 1 × ml.g5.2xlarge ($1.515/hr) = $1.515/hr
# Monthly cost: $1.515 × 24 × 30 = $1,090.80
# SAVINGS: $16,362/month (93.7% reduction!)
# Latency comparison (P50):
# - Separate endpoints: ~50ms (includes cold adapter load)
# - Multi-LoRA vLLM: ~20ms (adapters pre-loaded, just switch)
# Throughput comparison (requests/sec on single instance):
# - HuggingFace PEFT: ~15 req/s
# - vLLM with LoRA: ~45 req/s (3x faster)

The numbers are striking: $16,362/month in savings by consolidating 16 endpoints into one. And we're not sacrificing performance; we're actually gaining it through vLLM's optimizations.
Production Considerations
1. Adapter Loading Strategy
vLLM can lazy-load adapters on first use or preload them all at startup. For latency-sensitive applications, preloading is essential:
# Force preload all adapters at startup
for adapter in prompt2adapter.values():
    # Dummy request to trigger loading
    _ = llm.generate("test", lora_request=adapter['adapter'],
                     sampling_params=SamplingParams(max_tokens=1))

2. Handling High Concurrency
vLLM handles concurrent requests with different adapters gracefully, but there's a limit based on max_loras. If you exceed this, requests will queue. Monitor the num_requests_waiting metric in production.
3. Memory Planning
Budget GPU memory carefully:
- Base model: ~14GB (7B params × 2 bytes in bfloat16)
- LoRA adapters: ~85MB each × 16 = ~1.4GB
- KV cache: Remaining memory (shared across requests)
- Buffer: ~1GB for CUDA overhead
On a 24GB GPU, this leaves ~7GB for KV cache, supporting roughly 30-40 concurrent 4K-token sequences.
4. A/B Testing Adapters
Hot-swappable adapters make A/B testing trivial. Deploy both versions as different adapters and route traffic by including the adapter name in your request:
# A/B test between v1 and v2 of issue resolution model
adapter = 'issueResolved_v2' if random.random() < 0.1 else 'issueResolved_v1'
result = classify_conversation(text, adapter)

Benchmarks: Real-World Performance
On a production workload of 3,000 customer support conversations across all 14 active classifiers:
| Metric | HuggingFace PEFT | vLLM |
|---|---|---|
| P50 Latency | 45ms | 18ms |
| P99 Latency | 120ms | 42ms |
| Throughput | 15 req/s | 45 req/s |
| Memory Usage | 18.2GB | 15.8GB |
| Accuracy | 97.2% | 97.2% |
vLLM delivers 2.5x lower latency and 3x higher throughput with identical accuracy. The memory savings come from PagedAttention's efficient KV cache management.
Conclusion
Hot-swappable LoRA adapters are a game-changer for multi-model LLM deployments. The combination of:
- LoRA's parameter efficiency (85MB per specialized model)
- vLLM's PagedAttention (3x throughput improvement)
- SageMaker's managed infrastructure (easy deployment and scaling)
...enables serving dozens of specialized models from a single GPU at a fraction of the traditional cost. For us, it meant $16,000/month in savings and faster inference.
If you're running multiple fine-tuned models in production, this architecture should be your default. The days of "one GPU per model" are over.