June 23 - 25, 2025
Denver, Colorado
Note: The schedule is subject to change.

You must be registered for Open Source Summit North America 2025 to participate in the sessions.

This schedule is displayed in Mountain Daylight Time (UTC/GMT -6).


Wednesday June 25, 2025 4:20pm - 5:00pm MDT
Large Language Models (LLMs) are reshaping how we build applications; however, efficiently serving them at scale remains a major challenge.

The vLLM serving engine, historically focused on single-node deployments, is now being extended into a full-stack inference system through our open-source project, **vLLM Production Stack**. This extension enables any organization to deploy vLLM at scale with high reliability, high throughput, and low latency.
Code: https://github.com/vllm-project/production-stack
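As a rough sketch of the single-command deployment described below, the project is distributed as a Helm chart; the repository URL, chart name, and values file here are illustrative assumptions — check the repository README for the current ones:

```shell
# Add the vLLM Production Stack Helm repository (URL assumed from the project's GitHub Pages).
helm repo add vllm https://vllm-project.github.io/production-stack

# Deploy the stack into the current Kubernetes context.
# "vllm-stack" and "values.yaml" are placeholder names for this sketch.
helm install vllm vllm/vllm-stack -f values.yaml
```

From there, Kubernetes manages the vLLM pods, and the stack's router and observability components are deployed alongside them.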

At a high level, the vLLM Production Stack project allows users to easily deploy to their Kubernetes cluster through a single command. vLLM Production Stack's optimizations include KV cache sharing to speed up inference (https://github.com/LMCache/LMCache), prefix-aware routing that directs inference queries to vLLM instances holding the corresponding KV caches, and robust observability features for monitoring engine status and autoscaling.
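To illustrate the idea behind prefix-aware routing, here is a simplified sketch — not the project's actual router; the function and instance names are hypothetical. Requests whose prompts share a leading prefix are hashed to the same vLLM instance, so the KV cache already built for that prefix can be reused instead of recomputed:

```python
import hashlib

def route_by_prefix(prompt: str, instances: list[str], prefix_len: int = 64) -> str:
    """Deterministically map a prompt's leading characters to one instance,
    so requests sharing a prefix hit the instance holding its KV cache."""
    prefix = prompt[:prefix_len]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]

# Requests that share a long system prompt land on the same instance.
system = ("You are a helpful assistant. Answer the question about the "
          "attached document as concisely and accurately as you can. ")
pods = ["vllm-pod-0", "vllm-pod-1", "vllm-pod-2"]
same_pod = (route_by_prefix(system + "Q: what is vLLM?", pods)
            == route_by_prefix(system + "Q: what is LMCache?", pods))  # True
```

A production router would match on token prefixes and track which instance actually holds each cache, but the core design choice is the same: make routing a deterministic function of the shared prefix so cache hits are predictable.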

Attendees will learn best practices and see live demonstrations of how these optimizations work together to improve LLM inference performance.
Speakers

Junchen Jiang

Assistant Professor, University of Chicago
Junchen Jiang is an Assistant Professor of CS at the University of Chicago. His research pioneers new approaches to LLM inference systems (https://github.com/vllm-project/production-stack and https://github.com/LMCache/LMCache). He received his Ph.D. from CMU in 2017 and his bachelor’s...

Yue Zhu

Staff Research Scientist, IBM Research
Yue Zhu is a Staff Research Scientist specializing in foundation model systems and distributed storage systems. Yue obtained a Ph.D. in Computer Science from Florida State University in 2021 and has consistently contributed to sustainability for foundation models and scalable and efficient...
Bluebird Ballroom 3E
  Open AI + Data

