PyData Amsterdam 2024

pydiverse pipedag - A library for data pipeline orchestration optimized for high development iteration speed
09-20, 14:10–14:45 (Europe/Amsterdam), Escher

This talk presents github.com/pydiverse/pydiverse.pipedag, a data pipeline orchestration library for rapid iterative development with automatic cache invalidation. It lets users focus on their actual task: writing analytics and data transformation code in pandas, polars, sqlalchemy, ibis, and the like.

The talk is meant for people working with data pipelines at beginner to advanced levels. It teaches best practices for handling code versioning vs. data versioning, for working with small data samples during development, and for gradually improving the coding style of an existing pipeline.


Years of building data pipelines have taught us many things, but one point above all else: high iteration speed is indispensable for building the best possible model. During this time, we evaluated a large array of existing pipeline orchestration tools, but none fully fit our requirements. In particular, the desire to enable frequent changes along the whole transformation and model training pipeline led us to start a new open source project called pipedag. It makes it easy to run many manifestations of the same data pipeline with various sampling sizes; we call these pipeline instances. This enables quick development on the smallest possible data (i.e. executing a pipeline in seconds), either interactively or with the debugger support of an IDE. Pipedag caches intermediate and final outputs and takes care of cache invalidation automatically. Thus, changes affecting the whole pipeline can be developed and pushed, and the update of all pipeline instances of various sizes is handled automatically and correctly while developers already work on the next change. Effectively, pipedag helps avoid data versioning vs. code versioning problems when committing changes at a high pace.
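
To make the pipeline-instance idea concrete, here is a minimal sketch of one flow being run at different sampling sizes. The flow and task API follows pipedag's documented style, but the sample_fraction parameter, the file path, and the column names are made-up illustrations; in a real setup, per-instance settings such as sampling size typically live in the pipedag.yaml configuration file rather than in code.

    import pandas as pd
    from pydiverse.pipedag import Flow, Stage, Table, materialize

    @materialize(version="1.0.0")
    def load_customers(sample_fraction: float):
        # Hypothetical raw input, sampled down for small instances.
        df = pd.read_parquet("customers.parquet")
        return Table(df.sample(frac=sample_fraction, random_state=0), name="customers")

    @materialize(version="1.0.0", input_type=pd.DataFrame)
    def customer_features(customers: pd.DataFrame):
        return Table(customers.assign(tenure_years=customers["tenure_days"] / 365))

    def build_flow(sample_fraction: float) -> Flow:
        with Flow("example") as flow:
            with Stage("raw"):
                customers = load_customers(sample_fraction)
            with Stage("features"):
                customer_features(customers)
        return flow

    # A tiny instance for interactive work (runs in seconds) vs. a full one;
    # both assume a pipedag.yaml configuration file next to the script.
    build_flow(sample_fraction=0.001).run()
    build_flow(sample_fraction=1.0).run()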

Pipedag supports a wide variety of data transformation programming styles (pandas, polars, sqlalchemy, ibis, tidypolars, ...), so adopting it can take just a few hours of work without changing the existing data transformation code in the pipeline. Even extensive raw-SQL code bases can serve as a starting point. All these programming styles can communicate seamlessly without boilerplate code, which makes it possible to improve a code base gradually, task by task, rather than transforming it all at once. Pipedag follows a modular approach and can be dask-parallelized or layered on top of existing tools like prefect.
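
As a rough illustration of this interoperability (a sketch in the spirit of the examples in pipedag's README; the table and column names are made up), a pandas task, a lazy SQL task, and another pandas task can be chained without any hand-written glue for moving data between them:

    import pandas as pd
    import sqlalchemy as sa
    from pydiverse.pipedag import Flow, Stage, Table, materialize

    @materialize(version="1.0.0")
    def raw_orders():
        # Eager task: returns a pandas DataFrame, materialized as a table.
        data = {"customer_id": [1, 1, 2], "amount": [10.0, -1.0, 5.0]}
        return Table(pd.DataFrame(data), name="orders")

    @materialize(lazy=True, input_type=sa.Table)
    def valid_orders(orders: sa.Table):
        # Lazy task: returns a SQL query; pipedag materializes it in the
        # database and derives the cache key from the generated SQL.
        return Table(sa.select(orders).where(orders.c.amount > 0), name="valid_orders")

    @materialize(version="1.0.0", input_type=pd.DataFrame)
    def order_totals(valid: pd.DataFrame):
        # Back to pandas: the same table arrives as a DataFrame.
        return Table(valid.groupby("customer_id", as_index=False)["amount"].sum())

    with Flow("orders") as flow:
        with Stage("ingest"):
            orders = raw_orders()
        with Stage("transform"):
            valid = valid_orders(orders)
            order_totals(valid)

    flow.run()  # again assumes a pipedag.yaml configuration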

While https://github.com/pydiverse/pydiverse.pipedag/ is already used in professional deployments, this talk is the first to address a wider audience. It targets people working with data pipelines at beginner to advanced levels. We will discuss both the tool pipedag and general data pipeline best practices. Furthermore, we will explore in depth why managing multiple pipeline instances with varying input sizes is essential for high iteration speed and why automatic cache invalidation is a key feature for solving code versioning vs. data versioning issues. Further topics include the gradual improvement of existing pipeline code, validation and testing, the power of explorative SQL, cache invalidation for lazy Polars dataframe expressions, and parquet caching capabilities to mitigate slow database access.
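
To hint at what cache invalidation for lazy Polars dataframe expressions means in practice, here is a final sketch (assuming polars support along the lines of pipedag's other lazy backends; the table and column names are invented): because the task returns a lazy expression, pipedag can derive the cache key from the query plan itself, so editing the expression invalidates exactly the affected outputs, with no version string to bump and no risk of forgetting to bump it.

    import polars as pl
    from pydiverse.pipedag import Table, materialize

    @materialize(lazy=True, input_type=pl.LazyFrame)
    def active_customers(customers: pl.LazyFrame):
        # Changing this expression changes the plan and thereby the cache key,
        # triggering recomputation of this task and its downstream consumers.
        return Table(customers.filter(pl.col("active")), name="active_customers")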

See also: Slides (798.8 KB)
  • IOI 2000 / Jugend forscht (https://t.ly/DmvHG)
  • Electrical Engineering at KIT, Stanford University, and IMEC/KU-Leuven
    -- Research: Mapping software descriptions onto hardware targets + Transactional Memory
  • QuantCo – Optimizing high-stakes business decisions with data analytics

https://www.linkedin.com/in/mtrautmann