June 23 - 25, 2025
Denver, Colorado
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for Open Source Summit North America 2025 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is displayed in Mountain Daylight Time (UTC-6).

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Monday June 23, 2025 4:30pm - 5:10pm MDT
Large language models (LLMs) require preprocessing vast amounts of data, often petabytes, and the complexity and scale of that work mean it can take days. This talk demonstrates how Kubeflow Pipelines (KFP) simplify LLM data processing with flexibility, repeatability, and scalability. These pipelines are used daily at IBM Research to build indemnified LLMs tailored for enterprise applications.

Different data preparation toolkits are built on Kubernetes, Rust, Slurm, or Spark. How would you choose one for your own LLM experiments or enterprise use cases, and why should you consider Kubernetes and KFP?

This talk describes how the open source Data Prep Toolkit leverages KFP and KubeRay for scalable orchestration of pipeline steps such as deduplication, content classification, and tokenization.

We share challenges, lessons, and insights from our experience with KFP, highlighting its applicability for diverse LLM tasks, such as data preprocessing, RAG retrieval, and model fine-tuning.
Speakers

Mohammad Nassar

Research Engineer, IBM
Mohammad Nassar, a Cloud Research Engineer at IBM Haifa, specializes in AI-driven data engineering, automation, and hybrid cloud technologies. With an M.Sc. in Computer Science from Technion, his research focused on coding theory and data systems. His work spans AI-powered data preparation...

Anish Asthana

Engineering Manager, Red Hat
Anish is an engineering manager at Red Hat in the OpenShift AI organization. He is working on making machine learning easier for the wider community by building a platform out with cloud capabilities at the core. Most recently, his interests have been focused on the Distributed Workloads...
Bluebird Ballroom 2C
Open AI + Data
