PyData Amsterdam 2024

From Data Pipelines to a Data Platform: Embracing Monorepo Architecture
09-19, 10:35–11:10 (Europe/Amsterdam), Van Gogh

In this talk we will dive into how we architect our on-premise Data Platform on top of a Python monorepo, which allows hundreds of contributors in different data roles to focus on their core qualities: delivering data-driven products.


Building data-driven products in globally operating financial institutions is complex. It requires handling petabyte-sized datasets that include personally identifiable information (PII) or payment data. Teams that build data products must ensure data quality, completeness, and accuracy while balancing data access, data security, and compliance with regulatory requirements. That's where our monorepo approach makes lives easier!

Our on-premise data platform is powered by a monorepo with over a million lines of Python code. Using a monorepo has allowed us to provide a “golden path” for individual contributors to produce high-value output while keeping cognitive overload to a minimum. I’ll highlight the power of a monorepo architecture and how it makes deployment, orchestration, storage, cross-team alignment, and governance simple for users, so that they can focus on the data and product derivatives.

Finally, we’ll talk about common traps when adopting monorepos: why a monorepo is not a good abstraction by itself, how to manage dependencies and prevent dependency hell, and how to safely roll out platform-wide changes.

Data Platform Engineer @ Adyen