“Our ML engineers want to use Modal for everything. Modal helped reduce our VLM document parsing latency by 3x and allowed us to scale throughput to >100,000 pages per minute.”
“Modal lets us deploy new ML models in hours rather than weeks. We use it across spam detection, recommendations, audio transcription, and video pipelines, and it’s helped us move faster with far less complexity.”
import modal

# CUDA devel base image; pin any tag that matches your torch build.
vllm_image = (
    modal.Image.from_registry("nvidia/cuda:12.8.1-devel-ubuntu22.04", add_python="3.12")
    .uv_pip_install("vllm==0.10.2", "torch==2.8.0")
)

# Persist downloaded model weights across container starts.
model_cache = modal.Volume.from_name("huggingface-cache", create_if_missing=True)

app = modal.App("vllm-inference")

@app.function(image=vllm_image, gpu="H100", volumes={"/root/.cache/huggingface": model_cache})
@modal.web_server(port=8000)
def serve():
    import subprocess

    # Launch vLLM's OpenAI-compatible server inside the container.
    cmd = "vllm serve Qwen/Qwen3-8B-FP8 --port 8000"
    subprocess.Popen(cmd, shell=True)
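Running modal deploy on this file takes the server live and prints a public URL, typically of the form https://<workspace>--vllm-inference-serve.modal.run. Because vLLM speaks the OpenAI-compatible API, any OpenAI client can then call it. A minimal client sketch follows; the base_url is a placeholder for your own deployment URL.

# Client sketch: replace base_url with the URL printed by modal deploy.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-workspace--vllm-inference-serve.modal.run/v1",
    api_key="EMPTY",  # vLLM requires no key unless you pass --api-key to the server
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B-FP8",
    messages=[{"role": "user", "content": "Summarize what Modal does in one sentence."}],
)
print(response.choices[0].message.content)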
Deploy any state-of-the-art or custom LLM using our flexible Python SDK.
Our in-house ML engineering team helps you implement inference optimizations specific to your workload.
You maintain full control of all code and deployments, so you can iterate instantly. No black boxes.
Modal’s Rust-based container stack spins up GPUs in < 1s.
Modal autoscales containers up and down, including to zero, for maximum cost efficiency; a configuration sketch follows below.
Modal’s proprietary cloud capacity orchestrator guarantees high GPU availability.
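As a sketch of what that autoscaling looks like in code, assuming the min_containers, max_containers, and scaledown_window parameters from recent Modal releases (the values shown are illustrative, not recommendations):

# Illustrative autoscaler settings for the serve() function above.
@app.function(
    image=vllm_image,
    gpu="H100",
    volumes={"/root/.cache/huggingface": model_cache},
    min_containers=0,      # scale to zero when there is no traffic
    max_containers=10,     # upper bound on concurrent GPU containers
    scaledown_window=120,  # seconds of idleness before a container is stopped
)
@modal.web_server(port=8000)
def serve():
    ...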