PyData Amsterdam 2024

08:30
08:30
30min
Registration + Coffee
Amstel Room - OBA Oosterdok
08:30
30min
Registration + Coffee
Rokin Room - OBA Oosterdok
08:35
08:35
300min
Tutorial Day! No sessions in Kromhouthal, scroll right / down for tutorial sessions in OBA (Oosterdokskade 143, 1011 DL Amsterdam)

Tutorial day

Rembrandt
09:00
09:00
90min
Boost Your LLM: Building LLM Agents with LangChain
Ana Chaloska, Maria Bader, PhD

With this workshop you will familiarize yourself with the key concepts of an LLM agent using LangChain and the OpenAI chat completion endpoints.

Rokin Room - OBA Oosterdok
09:00
90min
Writing Python modules in Rust - PyO3 101
Cheuk Ting Ho

In this workshop, we will cover the very basics of using PyO3, a Rust library that packages Rust crates into Python modules. It is the most popular tool for creating Python libraries with Rust.

Amstel Room - OBA Oosterdok
10:30
10:30
30min
Snack Break
Amstel Room - OBA Oosterdok
10:30
30min
Snack Break
Rokin Room - OBA Oosterdok
11:00
11:00
90min
How you (yes, you!) can write a Polars Plugin
Marco Gorelli

Polars is a dataframe library taking the world by storm. It is fast and memory efficient, and it comes with a clean and expressive API. Sometimes, however, the built-in API isn't enough. And that's where its killer feature comes in: plugins. You can extend Polars, and solve practically any problem.

No prior Rust experience required, intermediate Python or general programming experience required. By the end of the session, you will know how to write your own Polars Plugin! This talk is aimed at data practitioners.

Amstel Room - OBA Oosterdok
11:00
90min
Prompt hacking for Generative AI
Sander van Dorsten, Myrthe Lammerse

💡 The best way to learn how to secure a system is to know how it breaks! In this tutorial you will protect your fictional company's private information that resides behind a Generative AI chatbot. You will work in teams of two to set up the best system prompt, defending yourself against hackers that want to steal your private data. In parallel, you will be tasked with trying to steal this same information from the accounts of other teams! The winner of the day walks away with a prize.

Rokin Room - OBA Oosterdok
12:30
12:30
60min
Lunch
Amstel Room - OBA Oosterdok
12:30
60min
Lunch
Rokin Room - OBA Oosterdok
13:30
13:30
90min
Master Advanced Web Scraping Techniques in Python
Fabien Vauchelles

Join me for an incredible workshop to unlock the full potential of Anti-Ban & Web Scraping in Python! From novice to virtuoso, you’ll learn advanced techniques for collecting crucial datasets to train AI models.

Amstel Room - OBA Oosterdok
13:30
90min
Mastering Data Flow: Empower Your Projects with Prefect's Pipeline Magic
Adam Hill, Chris Frohmaier

Embark on a transformative journey into the realm of data engineering with our 90-minute workshop dedicated to the recently released Prefect 3. In this hands-on session, participants will learn the ins and outs of building robust data pipelines using the latest features and enhancements of Prefect 3. From data ingestion to advanced analytics, attendees will gain practical insights to elevate their data engineering skills.

Rokin Room - OBA Oosterdok
15:00
15:00
30min
Afternoon Snack
Amstel Room - OBA Oosterdok
15:00
30min
Afternoon Snack
Rokin Room - OBA Oosterdok
15:30
15:30
90min
Sweet Summer Child Score
Laura Summers

Sweet Summer Child Score (SSCS) is an open source library to identify potential AI harms. In this tutorial we'll break into small groups and take the quiz online using a motivating scenario. Participants will practice mapping the risks of an AI system in a structured way, helping to formalise their instincts, identify potential harms, and plan next steps to better understand or reduce the risks of these harms.

Amstel Room - OBA Oosterdok
15:30
90min
Uplift Modeling for Marketing Personalization in Practice
Felipe Moraes, Hugo Manuel Proenca, Matteo Romeo

Are you a machine learning enthusiast looking to dive into the fascinating world of uplift modeling? Do you want to leverage advanced techniques to personalize user experiences and drive business outcomes? Join us for a dynamic session where we transform complex concepts into practical insights you can apply immediately!

Uplift modeling is a cutting-edge approach that goes beyond traditional predictive modeling by estimating the causal effects of treatments on individuals. This makes it the go-to framework for personalized marketing, customer retention, and beyond. Our tutorial is designed to provide you with a practical understanding of uplift modeling, complete with real-world Python examples.

Rokin Room - OBA Oosterdok
08:00
08:00
60min
Registration
Rembrandt
08:00
60min
Registration
Van Gogh
08:00
60min
Registration
Escher
08:00
60min
Registration
Mondriaan
09:00
09:00
30min
Opening notes

Opening notes of the conference

Rembrandt
09:30
09:30
50min
Keynote - Open-source Multimodal AI in the Wild
Merve Noyan

Merve will talk about multimodal AI, why open-source multimodal AI matters, the current state of the ecosystem and how you can get started with multimodal AI.

Rembrandt
10:20
10:20
15min
Break
Rembrandt
10:20
15min
Break
Van Gogh
10:20
15min
Break
Escher
10:20
15min
Break
Mondriaan
10:35
10:35
35min
Counting down for CRA - updates and expectations
Cheuk Ting Ho

The EU Commission is likely to vote on the Cyber Resilience Act (CRA) later this year. The CRA is an ambitious step towards protecting consumers from software security issues by creating a new list of responsibilities for software developers and providers. The Act also creates a new category of actor known as an “Open Source Steward”, which we think makes important allowances for public open source repositories like CPython and the Python Package Index (PyPI). Once the dust settles, everyone who makes software will need to consider the CRA’s mandates in their security roadmaps.

In this talk we will look at the timeline for the new legislation, any critical discussions happening around implementation and most importantly, the new responsibilities outlined by the CRA. We’ll also discuss what the PSF is doing for CPython and for PyPI and what each of us in the Python ecosystem might want to do to get ready for a new era of increased certainty – and liability – around security.

Target audience

Developers and maintainers whose project or product may be affected by the CRA. European legislation won’t just affect the European market, it will affect the software industry and the open source community globally as it is very hard to segregate one project or product from the EU market. So, this is for everyone in the Python community who shares their code with the world.

Goal

To educate the general public about CRA - how it can affect us and how to get ready for it. We also want to provide more information for the Python community about what has been done by the PSF regarding the CRA to reassure them that the Python community is aware and getting prepared for the CRA.

Mondriaan
10:35
35min
From Data Pipelines to a Data Platform: Embracing Monorepo Architecture
Maico Timmerman

In this talk we will dive into how we architect our on-premise Data Platform on top of a Python monorepo, which allows hundreds of contributors in different data roles to focus on their core strength: delivering data-driven products.

Van Gogh
10:35
35min
Run a benchmark they said. It will be fun they said.
Vincent D. Warmerdam

This is the story of a fun idea that turned into a huge benchmark before it turned into a rabbit hole.

I was trying to figure out reasonable default parameters for some of the components in the skrub library. In order to do that I was looking for datasets with a permissible license which I could use for benchmarking. This is how I stumbled on some old Kaggle competitions that still had their datasets publicly available. So I should just run a simple benchmark, right?

That's where it all started. There were many lessons. They will all be shared.

Rembrandt
10:35
35min
Understanding Polars Expressions when you're used to pandas
Marco Gorelli

When it comes to dataframes, pandas is the go-to library for many people. Yet Polars is taking the world by storm, and so many data practitioners are curious about trying it out. There is a learning curve though, as Polars introduces some concepts which pandas users might not be familiar with. This talk will be a deep dive into one of those concepts (expressions) and will focus on how you can understand them from a pandas perspective.

Escher
11:20
11:20
35min
Differential Privacy Made Practical
Rob Romijnders

With AI becoming more common, there’s a growing need for privacy in our data processing algorithms. Differential Privacy (DP) is a popular way to quantify privacy loss and has been adopted in many applications. Examples include the Android keyboard learning about user typing, Apple’s system to collect statistics on emoticons and website usage, and the US government releasing Census population statistics. We’ll discuss in this talk an intuitive and tangible definition of differential privacy and how the above examples implement DP. In an IPython notebook, we'll demonstrate the effects of Differential Privacy on a small-scale data science problem. Additionally, I’ll refer to Python repos for doing differential privacy at scale, including for deep learning.
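
As a rough illustration of what "quantifying privacy loss" can look like in code, here is a minimal sketch of the Laplace mechanism, one standard way to make a counting query differentially private. It is not taken from the talk or from any of the systems mentioned above; the function names and the sample data are made up for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a noisy count; a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon gives epsilon-DP."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
ages = [23, 35, 41, 52, 67, 29, 44]  # made-up data
print(private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng))
```

A smaller `epsilon` means more noise and stronger privacy; averaged over many releases, the noisy count stays centred on the true count.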

Rembrandt
11:20
35min
Polars 1.0 and beyond
Ritchie Vink

Polars is a novel query engine with a DataFrame front-end. This July it hit the 1.0 milestone, and this August it landed GPU support.

The 1.0 milestone has made the Polars team confident about the API going forward, and now they want to expand the functionality. This talk will go through what was done before 1.0 and what to expect after. The 1.0 release has cemented the API and set the constraints for the other engines the Polars team is working on.

Escher
11:20
35min
Productionizing Generative AI at ING: Navigating Risk, Compliance, and Defense Mechanism
Farzam Fanitabasi, Zeynab Rashidpour

Productionizing generative AI applications in a highly regulated banking environment such as ING comes with a plethora of challenges; from handling the technical complexities of massive foundational models, to addressing ethical considerations and rigorous risk assessments. These challenges are not unique to the financial sector but equally critical in other areas such as healthcare, and government. In this talk we will delve into these critical aspects, focusing on implementing, deploying, and monitoring generative AI applications with stringent compliance requirements and limited risk appetite. Highlighting analytical and human-in-the-loop approaches, we'll demonstrate how to establish robust first and second lines of defense, mitigate data and reputational risks, and ensure the accuracy of information, which is vital across various sectors.

Mondriaan
11:20
35min
The ML Monitoring Flow for Models Deployed to Production
Wojtek Kuberski

The talk will cover the core ML Monitoring Flow necessary to maintain and maximize the business impact of models deployed to production. We will focus on the three main steps of the Flow: Performance monitoring, Root Cause Analysis, and Issue resolution. In the performance monitoring part, we will cover the two core algorithms that allow us to estimate the predictive performance without ground truth: Confidence-Based Performance Estimation (CBPE) and Direct Loss Estimation (DLE). The Root Cause Analysis part will go over various Drift Detection algorithms, focusing especially on multivariate drift detection and linking the drop in performance to drift signals. In the issue resolution part, we will briefly cover the typical steps to fix ML failure and their applicability and limitations.
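
The idea of estimating performance without ground truth can be sketched in a few lines. The snippet below is a simplified illustration of the intuition behind confidence-based estimation for a binary classifier, assuming the predicted probabilities are well calibrated; it is not the full CBPE algorithm and not any library's implementation, and the function name is hypothetical.

```python
def estimated_accuracy(probabilities, threshold=0.5):
    """Estimate accuracy from calibrated P(y=1) scores, without labels."""
    if not probabilities:
        raise ValueError("need at least one prediction")
    total = 0.0
    for p in probabilities:
        predicted_positive = p >= threshold
        # With calibrated scores, a positive prediction is correct with
        # probability p, and a negative prediction with probability 1 - p.
        total += p if predicted_positive else 1.0 - p
    return total / len(probabilities)


scores = [0.9, 0.1, 0.8, 0.6]  # made-up model outputs
print(estimated_accuracy(scores))
```

The estimate is only as good as the calibration: if the scores drift away from true probabilities, the estimated accuracy drifts with them.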

Van Gogh
12:05
12:05
35min
Alice in the Open Source Land
Magdalena Kowalczuk

This talk recounts a first experience of stepping into the rabbit hole of contributing to open-source software, highlighting key learnings and practical steps for beginners. It covers overcoming self-doubt, learning through collaboration, and the unexpected joys of community engagement. You will hear what you can learn from contributing to Open Source as an aspiring Data Scientist, and what you probably will not.

Rembrandt
12:05
35min
Building a Data Platform from scratch
Rodel van Rooijen

Ever wondered how to start from scratch, without any existing data infrastructure? In this talk, I will share my experience of building a data platform from scratch at a startup. This talk is intended for data (platform) engineers, data scientists, and anyone interested in building a scalable data platform in the cloud using open-source tools.

I will discuss the challenges faced in designing and implementing this platform, as well as the lessons learned along the way. We'll answer questions such as, why build a data platform at a startup? Why pick open source over alternatives? How to deploy data infrastructure on Kubernetes? How to build the first data products?

Van Gogh
12:05
35min
Introduction to Causal Inference using pgmpy
Ankur Ankan

In the domain of data science, a significant number of questions are aimed at understanding and quantifying the effects of interventions, such as assessing the efficacy of a vaccine or the impact of price adjustments on the sales volume of a product. Traditional association based machine learning methods, predominantly utilized for predictive analytics, prove inadequate for answering these causal questions from observational data, necessitating the use of causal inference methodologies. This talk aims to introduce the audience to the Directed Acyclic Graph (DAG) framework for causal inference. The presentation has two main objectives: firstly, to provide an insight into the types of questions where causal inference methods can be applied; and secondly, to demonstrate a walkthrough of causal analysis on a real dataset, highlighting the various steps of causal analysis and showcasing the use of the pgmpy package.

Escher
12:05
35min
Show off your python code to even a complete newbie, using shiny for python
Thomas Wouters, Emiel Declercq

Struggling with static reports? This talk introduces Shiny for Python, a framework for crafting interactive web apps in minutes. Leverage your Python skills (pandas, matplotlib) to design user-friendly dashboards for real-time data analysis. Ideal for data scientists of all levels (no Shiny for R experience required).

Mondriaan
12:40
12:40
60min
Lunch break
Rembrandt
12:40
60min
Lunch break
Escher
12:40
60min
Lunch break
Mondriaan
12:40
45min
Women @ PyData Lunch gathering.

Interested in meeting other women attending PyData Amsterdam? Then join us for lunch in the Van Gogh room!

Van Gogh
13:40
13:40
50min
Algorithmic bias is everywhere (especially at Breeze) - what can we do about it?
Thomas Crul

In this talk, I will detail how we found out that our recommender system at Breeze may be showing discriminatory behavior, and what we've been doing since to attempt to solve this issue. I will dive into our visit to the Netherlands Institute for Human Rights and how we have been trying to gain insight into the issue through gaining expert feedback and performing an audit without violating privacy legislation.

Van Gogh
13:40
50min
From language to marketing: RNNs for data-driven multi-touch attribution at Booking.com
Narendra Mukherjee, Narmin Gulmammadova

Are you a data scientist thinking about the efficacy of digital marketing channels and the link between marketing and a customer’s decision to buy? Or, do you wonder about language models and their application to other sequence modelling tasks? This talk will introduce the field of marketing attribution modelling and show that language/sequence models (like attention-based RNNs) can be modified to function as flexible attribution models. In the course of the talk, we will use our work at Booking.com to illustrate the challenges involved in re-purposing language models for attribution, especially in evaluating these models in the absence of “ground truth” signals.

Rembrandt
13:40
50min
GenAI Beyond Chat with RAG, Knowledge Graphs and Python
Martin O'Hanlon

This session is a GenAI talk, where you will learn how Knowledge Graphs, Vectors and Retrieval Augmented Generation (RAG) can support your projects.

Mondriaan
13:40
50min
Time Series forecasting with NumPyro
Juan Orduz

We provide an introduction to implementing time series forecasting models in NumPyro. This allows us to write custom models over which we have complete control. Examples include demand forecasting via censored likelihoods and hierarchical time series models.

Escher
14:40
14:40
35min
Almost Perfect: A Benchmark on Algorithms for Quasi-Experiments
Raphael Tamaki

In this presentation, we will compare four algorithms that can be used for quasi-experiments in terms of the bias and variance between predicted and actual treatment effects and the confidence/credible intervals associated with the predictions:
- Difference-in-Differences
- Synthetic Control
- Meta-Learners
- Graphical Causal Models

By the end of this lesson, attendees will understand the shortcomings and benefits of the different algorithms and be better informed about which one best suits their needs.
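
To make the first of the listed algorithms concrete, here is a minimal sketch of the basic two-group, two-period Difference-in-Differences estimate. It is an illustration only, not the benchmark code from the presentation; the function name and the numbers are made up, and the estimate is only meaningful under the usual parallel-trends assumption.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Return (treated change) minus (control change) in group means."""
    def mean(xs):
        return sum(xs) / len(xs)

    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # The control group's change stands in for what would have happened
    # to the treated group without the intervention.
    return treated_change - control_change


effect = diff_in_diff(
    treated_pre=[10, 12], treated_post=[15, 17],  # made-up outcomes
    control_pre=[8, 10], control_post=[9, 11],
)
print(effect)
```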

Rembrandt
14:40
35min
Drift Detection on Irregular Time Series with Multiple Non-Uniform Seasonal Patterns Using MIST and DTW algorithms
Vitalie Spinu

This talk will delve into the technical and conceptual challenges associated with drift detection on irregular time series exhibiting non-uniform seasonal patterns, such as end-of-month, pay-day, or holiday effects. We will demonstrate how drifts can be efficiently identified using a combination of the MIST (Multiple Irregular Seasonalities and Trend decomposition) and DTW (Dynamic Time Warping) algorithms.
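
For readers unfamiliar with Dynamic Time Warping, the following is a minimal sketch of the classic dynamic-programming DTW distance between two numeric sequences. It covers only the DTW half of the approach described above (the MIST decomposition is not shown), and it is a textbook version rather than the speaker's implementation.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] matched again with an earlier b
                cost[i][j - 1],      # b[j-1] matched again with an earlier a
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because DTW lets one sequence stretch relative to the other, a pattern that is merely shifted or stretched in time still scores as close, which is why it suits irregularly timed seasonal effects.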

Van Gogh
14:40
35min
Open-source Machine Learning on Encrypted Data
Andrei Stoian

This talk is about data science on encrypted data with Python and Fully Homomorphic Encryption (FHE). FHE is a groundbreaking technology that secures data while preserving its utility, allowing data owners to process it even in its encrypted state. This talk will show how Concrete ML, a Python framework for data science with FHE, makes it easy to convert machine learning models to work with encrypted data, without knowing anything about cryptography.

The talk is addressed to privacy-minded Python developers who want to learn what can be built with FHE and how to build privacy and confidentiality features into their software. The talk is informative and assumes participants have some data science knowledge. The first part will focus on use-cases, and the second part on using Concrete ML in conjunction with other Python data processing libraries to implement those use-cases.

Escher
14:40
35min
Roseman Labs - Python-powered encrypted AI
Niek Bouman

Are you ready to push the boundaries of what’s possible with Python? Roseman Labs has developed a groundbreaking Python package, crandas, that puts cutting-edge cryptography right at your fingertips. If you’re familiar with pandas and scikit-learn, crandas will feel like a natural extension, empowering you to unlock the full potential of sensitive data, gain deeper insights, and make predictions without compromising privacy.

Mondriaan
15:15
15:15
20min
Break
Rembrandt
15:15
20min
Break
Van Gogh
15:15
20min
Break
Escher
15:15
20min
Break
Mondriaan
15:35
15:35
35min
AI: Ethical & Responsible by Design
Natasha Govender-Ropert

As AI continues to transform industries and our daily lives, ensuring that these technologies are developed ethically and responsibly has become a critical priority.
This talk will delve into the complex landscape of Ethical & Responsible AI, highlighting key considerations for mitigating risks and challenges that arise when developing and deploying AI technologies. We will explore fundamental principles such as bias, transparency, and accountability, and discuss real-world examples where these principles could be upheld or compromised.
In addition, the talk will detail the key principles of the EU AI Act and map out the implications of this legislation for developers, businesses, and society at large.

Mondriaan
15:35
35min
Build a personalized Commute virtual assistant in Python with Hopsworks and LLM Function Calling
Javier

The invention of the clock and the organization of time in zones have helped synchronize human activities across the globe. While timekeepers are better at planning and sticking to the plan, time optimists somehow believe that time is malleable and stretches as the deadline approaches. Nevertheless, whether you are an organized timekeeper or a creative timebender, external factors can affect your commute.

In this talk, we will define the different components necessary to build a personalized commute virtual assistant in Python. The assistant will help you analyze your historical lateness records, estimate future delays, and suggest the best time to leave home based on these predictions. It will be powered by an LLM and will use a technique called Function Calling to recognize the user intent from the conversation history.
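
The general shape of Function Calling can be sketched without any model at all: the LLM is asked to reply with a structured request naming a tool and its arguments, and ordinary Python code dispatches it. The snippet below is a hypothetical illustration, with a hard-coded string standing in for the model's reply; the tool name, its arguments, and its logic are invented here and do not come from the talk, Hopsworks, or any LLM vendor's API.

```python
import json


def estimate_delay(line: str, hour: int) -> int:
    """Hypothetical tool: return a made-up delay in minutes."""
    return 10 if 7 <= hour <= 9 else 2


# Registry of functions the assistant is allowed to call.
TOOLS = {"estimate_delay": estimate_delay}


def dispatch(tool_call_json: str):
    """Parse a tool call of the form {"name": ..., "arguments": {...}} and run it."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["name"]]  # raises KeyError for unknown tools
    return func(**call["arguments"])


# Stand-in for what a model might emit after reading the conversation.
model_output = '{"name": "estimate_delay", "arguments": {"line": "line-A", "hour": 8}}'
print(dispatch(model_output))
```

Keeping an explicit registry means the model can only trigger functions you have chosen to expose.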

Van Gogh
15:35
35min
Data Science for Social Good: Making Impact in Resource-Constrained Environments
Anastasiia Kulakova, Mehrzad Karami

How can data science be used to make a positive impact on the world? In this talk, we'll highlight some #data4good projects we ran with volunteers using only open source tooling, typically in a resource-constrained environment.

Rembrandt
15:35
35min
Synthetic Data for Localized Solutions
Atieno Ouma

Data scientists often encounter challenges related to sparse, privatized, or low-quality data. This talk will explore how synthetic data is changing problem-solving in markets with data scarcity, drawing from my experience in the African market, particularly in Kenya.
I will demonstrate how synthetic data has been utilized to develop models for proactive healthcare solutions.
The methodologies and lessons learned from these applications provide valuable insights that could inspire approaches in other markets, including Europe.

Escher
16:20
16:20
35min
Data for Social Good
CorrelAid NL

Are you interested in finding out how you can apply your (data science) skills to make a positive impact on the world? Or do you already work or volunteer for a non-profit organisation that could use a hand? Then this unconference session is for you!

Mondriaan
16:20
35min
How Dimensional is a `pandas.DataFrame`, anyway?
James Powell

Folks: for the last time, a pandas.DataFrame is one-dimensional. Yes, the docs call the pandas.DataFrame a “[t]wo-dimensional, size-mutable, potentially heterogeneous tabular data.” The docs are wrong. You're wrong. Everyone’s wrong. And I’m going to prove it.

Van Gogh
16:20
35min
Jounai.nl: Playing with New Tech to Reinvent the News
Maarten Sukel

Join me as I take you on a journey through the creation of Jounai, an AI-driven news platform that started as a fun side project but is now (fully) functional. The idea was simple: work with different tech than usual, combine it with what works, and perhaps use generative AI to innovate news consumption in a responsible way. We ended up with a free-to-use website with multiple automated AI-generated podcasts every day, with references to sources, always up-to-date news articles, and more.

In this talk, I will take you along in the exciting process of experimenting with new technologies and share the lessons learned along the way. From comparing 'old-school' machine learning and natural language processing (NLP) to what I call ‘API ML’, to changing to Azure while being used to working with AWS, and switching from Python to Java for web services because why not try: I'll share it all. Hint: I will likely build them in Python again in the future. Why? I will discuss that during the talk! We'll also look at how Vue.js measures up against its brother Nuxt.js for front-end development. This talk isn't just about creating an AI-generated news website; it's about continuous learning, staying curious, and keeping your mind sharp by embracing new challenges.

Rembrandt
16:20
35min
dbt-score: a linter for your dbt model metadata
Matthieu Caneill

dbt (Data Build Tool) is a great framework for creating, building, organizing, testing and documenting data models, i.e. data sets living in a database or a data warehouse. Through a declarative approach, it allows data practitioners to build data with a methodology inspired by software development practices.

This leads to data models being bundled with a lot of metadata, such as documentation, data tests, access control information, column types and constraints, 3rd party integrations... Not to mention any other metadata that organizations need, fully supported through the meta parameter.

At scale, with hundreds or thousands of data models, all this metadata can become confusing, disparate, and inconsistent. It's hard to enforce good practices and maintain them in continuous integration systems. We introduce in this presentation a linter we have built: dbt-score. It allows data teams to programmatically define and enforce metadata rules, in an easy and scalable manner.

Escher
17:00
17:00
50min
Lightning talks

Lightning talks

Rembrandt
17:50
17:50
10min
Closing notes

Closing notes

Rembrandt
18:00
18:00
120min
Social Event with Snowflake
Rembrandt
18:00
120min
Social Event with Snowflake
Escher
18:00
120min
Social Event with Snowflake
Mondriaan
18:15
18:15
60min
Pub Quiz

Show off your knowledge of all things, useful or useless, Python or not, in this pub quiz with the inimitable James Powell! Drinks, snacks and good vibes are kindly provided by Snowflake.

Van Gogh
08:30
08:30
30min
Registration
Rembrandt
08:30
30min
Registration
Van Gogh
08:30
30min
Registration
Escher
09:00
09:00
50min
Keynote - Applied NLP in the age of Generative AI
Ines Montani

Large Language Models (LLMs) and in-context learning have introduced a new paradigm for developing natural language understanding systems: prompts are all you need! Prototyping has never been easier, but not all prototypes give a smooth path to production. In this talk, I'll share the most important lessons we've learned from solving real-world information extraction problems in industry, and show you a new approach and mindset for designing robust and modular NLP pipelines in the age of Generative AI.

Rembrandt
09:50
09:50
15min
Break
Rembrandt
09:50
15min
Break
Van Gogh
09:50
15min
Break
Escher
09:50
15min
Break
Mondriaan
10:05
10:05
50min
Boosting AI Reliability: Uncertainty Quantification with MAPIE
Louis Lacombe, Thibault Cordier

MAPIE (Model Agnostic Prediction Interval Estimator) is your go-to solution for managing uncertainties and risks in machine learning models. This Python library, nestled within scikit-learn-contrib, offers a way to calculate prediction intervals with controlled coverage rates for regression, classification, and even time series analysis. But it doesn't stop there - MAPIE can also be used to handle more complex tasks like multi-label classification and semantic segmentation in computer vision, ensuring probabilistic guarantees on crucial metrics like recall and precision. MAPIE can be integrated with any model - whether it's scikit-learn, TensorFlow, or PyTorch. Join us as we delve into the world of conformal predictions and how to quickly manage your uncertainties using MAPIE.

Link to GitHub: https://github.com/scikit-learn-contrib/MAPIE
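
To give a feel for what "prediction intervals with controlled coverage" means, here is a minimal sketch of split conformal prediction for regression in plain Python. It deliberately does not use MAPIE's API; the function names and the residuals are made up, and it assumes a held-out calibration set that is exchangeable with future data.

```python
import math


def conformal_quantile(calibration_residuals, alpha):
    """Return the residual size that covers a fraction 1 - alpha of new points."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[k - 1]


def prediction_interval(point_prediction, calibration_residuals, alpha):
    """Symmetric interval around any model's point prediction."""
    q = conformal_quantile(calibration_residuals, alpha)
    return point_prediction - q, point_prediction + q


residuals = [1.0, -2.0, 0.5, 3.0]  # made-up (y_true - y_pred) on calibration data
print(prediction_interval(10.0, residuals, alpha=0.5))
```

The model itself is never inspected, only its residuals, which is why the approach works with any underlying model.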

Escher
10:05
50min
Private targeting strategies
Gilian Ponte

Recent advancements in causal inference have led to the emergence of sophisticated targeting methods, which are perceived as intrusive by consumers. In response, policymakers have recently imposed bans on targeting due to its privacy-invasive nature (e.g., Meta). In this talk, we introduce two private targeting strategies that we prove to satisfy differential privacy, a mathematical definition of privacy. These two private targeting strategies allow analysts to target customers while simultaneously establishing a level of privacy risk. The first strategy is "Private Causal Neural Networks" (PCNNs), which estimate the causal or incremental effect of a targeting intervention. The second strategy involves the randomization of the targeting decision. In two increasingly complex simulation studies, we benchmark how accurately the two private targeting strategies learn the population average treatment effect, the conditional average treatment effect (i.e., CATE), and its targeting profitability. In a field experiment with over 400,00 customers, we empirically apply the privacy protection strategies and visualize the inherent trade-off between privacy risk and profitability.

Van Gogh
10:05
90min
The Impact of Generative AI on Data Analytics: Panel Discussion
Mahmoud Yassin, Gautier Chenard, Hanna van der Vlis, Thomas Papadimos, Stephan De Goede, Vladimir Grevtsev

Generative AI is revolutionizing data analytics by enabling unprecedented levels of insight generation, automation, and decision-making capabilities. This panel brings together leading data and business experts to discuss changes in analytical processes, data quality requirements, data accessibility, data security, ethical applications of AI, and workforce preparation to effectively leverage generative AI technologies.

Mondriaan
10:05
50min
The Odyssey of Hacking LLMs: Insights from Two Shipmates sailing in the LLM CTF @ SaTML 2024
Thomas Fraunholz, Sandro Bauer

So come in and listen to the epic story of two unseasoned sailors who embarked on a journey to face the 44 trials posed by the Capture the Flag (CTF) competition for LLMs at this year's 'Conference on Secure and Trustworthy Machine Learning' (SaTML). Each test, one more difficult than the next, required them to break through the defense of the LLM to reveal its hidden secret...

What sounds like a game—and it was—has a serious background. LLMs, like any new technology, offer both opportunities and risks. And it is the latter we are concerned with. Perhaps you have heard of jailbreaks—prompts that can lead an LLM to not just be helpful and friendly but to assist in building a bomb. This competition was centered around this very question: Is it possible to secure an LLM with simple means such as prompts and filters?

This question grows more significant with the increasing spread of LLMs. The EU AI Act elevates this concern to a new level, classifying LLMs as General Purpose AI (GPAI) and explicitly requiring model evaluations, including 'conducting and documenting adversarial testing to identify and mitigate systemic risks' and to 'ensure an adequate level of cybersecurity protection.'

With this in mind, what could be greater than listening to two - now experienced - mariners who can tell you about the treacherous dangers of the seven seas? You'll learn firsthand about the current state of the art in adversarial attacks, how these can be practically applied, and how you can defend yourself in the future with the help of guardrails - or not. Untrained sailor, no matter how basic your knowledge of LLMs may be, don't miss this golden opportunity to prepare yourself for your own epic voyage with LLMs.

Rembrandt
11:05
11:05
35min
Causal Effect Estimation in Practice: Lessons Learned from E-commerce & Banking
Danial Senejohnny

Applying the tools and techniques from causal effect estimation in a practical setting can be challenging. Randomized experiments, the gold standard for effect estimation, are often not practical. Alternative solutions use observational (non-experimental) data, but they introduce their own challenges, which will be addressed in this talk. These challenges are rarely discussed in depth in textbooks and can be summarized as follows: 1) samples from the treatment group are not available all at once and may become available over time (online data stream), 2) appropriate control groups are not immediately available for comparison with the treatment group, and 3) the outcomes chosen for causal effect estimation should be in line with the business questions and accepted KPIs in the domain.

This self-contained talk targets general data scientists by presenting the theory of causal effect estimation in a simplified and visual (little-math) fashion. In addition, it shares technical & business requirements, lessons learned from e-commerce & banking, and results from applying causal effect estimation in practice.

Escher
11:05
35min
Going Beyond Copilot with AI Agents
Alex Shershebnev

In this discussion you'll learn how to go beyond Copilot with the use of AI Agents.

Van Gogh
11:05
35min
Terrible tokenizer troubles in large language models
Sander Land

Huge amounts of resources are being spent training large language models in an end-to-end fashion. But did you know that at the bottom of all these models remains an important but often neglected component that converts text to numeric inputs? As a result of weaknesses in this ‘tokenizer’ component, some inputs cannot be understood by language models, causing wild hallucinations, or worse.
This talk will cover some of our recent research in finding what text causes problems for a specific model, and show you how to break even the most advanced models.

Rembrandt
11:50
11:50
35min
How I hacked UMAP and won at a plotting contest
Jeroen Janssens

In this talk, I’ll share my journey of animating UMAP, a cutting-edge dimensionality reduction algorithm, by visualizing not just its final output but each intermediate step as well. I’ll explain why and how I modified UMAP’s source code, while also demonstrating the use of Polars for data wrangling, Plotnine for visualization, and ffmpeg for animation. The result ultimately earned me a runner-up position in the 2024 Plotnine plotting contest.

Escher
11:50
35min
Hunting unicorns with Network analysis
Valerio Ciotti

We construct and analyze a time-varying worldwide network of professional relationships among startups to predict long-term economic performance using network centrality measures.
The presentation will provide an overview of how the worldwide startup network is constructed from CrunchBase data using NetworkX. We will model employee flow and knowledge transfer as links between startups. Using a network centrality measure, we will rank early-stage (pre-seed) startups and evaluate how their ranked position correlates with their later success. Finally, we will touch on implications for entrepreneurs, investors, and policymakers.
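
As a rough illustration of ranking startups by centrality, here is a minimal sketch in plain Python. It uses degree centrality, which is only the simplest possible choice; the abstract does not say which centrality measure the talk uses. The startup names and employee moves are hypothetical, and the function `degree_centrality` is written here for illustration rather than taken from NetworkX.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Fraction of the other nodes each node is directly linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical employee moves: (startup left, startup joined).
# Each move is treated as an undirected link between the two startups.
moves = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

centrality = degree_centrality(moves)
# Startups ordered from most to least central.
ranking = sorted(centrality, key=centrality.get, reverse=True)
```

In this toy network, startup "A" is linked to every other startup and therefore ranks first, while "D" has a single link and ranks last.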

Van Gogh
11:50
35min
Meet the PyData speakers!

Meet the speakers

Mondriaan
11:50
35min
Retrieve me if you can: SLM-powered retrieval to scale freelancers matching at Malt
Marc Palyart, Warren Jouanneau

This talk unveils Malt's secret weapon for highly efficient freelancer matching - a powerful neural retriever. We'll showcase how Small Language Models (SLMs) help us instantly connect companies with their ideal freelancers. Discover how our retriever was built, deployed with a vector database and optimized to minimize resource consumption.

Rembrandt
12:25
12:25
60min
Lunch break
Rembrandt
12:25
60min
Lunch break
Van Gogh
12:25
60min
Lunch break
Escher
12:25
60min
Lunch break
Mondriaan
13:25
13:25
35min
How Research Teams Can Deliver Higher-Quality Insights Faster
Vasiliy Kaminskiy

Delivering high-quality insights swiftly is crucial for research teams aiming for excellence. The key to achieving this lies in operational efficiency across all stages of the research workflow.

Van Gogh
13:25
35min
Je ne regrette rien - Teaching Machine Learning Models Regret Avoidance
Laura Israel

When providing ML-based optimization products with multiple independent components, we can easily lose the holistic view of the problem that we are trying to solve and might end up not making optimal decisions. The easy way out - switching from a modular approach to a single unified optimization model - is often not feasible though. In this talk, I will discuss and demonstrate the implementation of an alternative method that will allow ML models to consider potential behavior of other components in a complex system using regret-sensitive loss functions. By using a simple game-theoretical formalization of the system and quantifying regret (i.e., the experience of a sub-optimal decision when information about the best action becomes available after the model was already called), we can widen the optimization scope of the model and increase the overall performance of complex decision processes.
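
To make the notion of regret concrete, here is a minimal sketch. It assumes that regret is measured as the best payoff available in hindsight minus the payoff of the action actually taken, and that it is added to a squared prediction error with a weight. The action names, payoffs, and the exact form of `regret_sensitive_loss` are hypothetical; the loss functions used in the talk may differ.

```python
def regret(payoffs, chosen):
    """Best payoff in hindsight minus the payoff of the chosen action."""
    return max(payoffs.values()) - payoffs[chosen]


def regret_sensitive_loss(prediction_error, payoffs, chosen, weight=1.0):
    """Squared prediction error plus a weighted regret penalty (illustrative form)."""
    return prediction_error ** 2 + weight * regret(payoffs, chosen)


# Hypothetical payoffs of each action, known only after the decision was made.
payoffs = {"discount": 4.0, "no_discount": 7.0, "bundle": 5.5}

regret_discount = regret(payoffs, "discount")    # 7.0 - 4.0
regret_best = regret(payoffs, "no_discount")     # the best action has zero regret
loss = regret_sensitive_loss(0.5, payoffs, "bundle", weight=2.0)
```

With a penalty of this kind, a model is charged not only for inaccurate predictions but also for steering the downstream decision away from the action that turned out best.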

Rembrandt
13:25
180min
Open Source Sprint: Narwhals
Marco Gorelli, Magdalena Kowalczuk

Narwhals is an extremely lightweight and extensible compatibility layer between dataframe libraries, and it needs your help! An open source sprint is the perfect opportunity to make your first contribution to open source. The core maintainers of the Narwhals package will prepare a list of easy and accessible first issues to get started with, and will be present in this session to guide you to make your first commit to the package. This is the perfect opportunity to give back to the Python ecosystem, while having some fun.

Mondriaan
13:25
35min
Polishing Python: Preventing Performance Corrosion with Rust
Mike Kraus

Python is beloved for its simplicity and versatility, but it can struggle with performance in compute-intensive tasks. Rust, on the other hand, offers high performance and memory safety. This talk will explain how you can harness the power of Rust to enhance Python modules using the PyO3 library.

We will explore this through a practical example: a pure Python payment handler and an optimized version where its functionality is abstracted away using Rust. This approach will demonstrate how to overcome performance bottlenecks while retaining the ease of use and flexibility that Python offers. However, like any tool, it comes with its own considerations and trade-offs.

This talk is particularly interesting for Machine Learning Engineers and Python developers seeking to boost the performance of their applications.

Escher
14:10
14:10
35min
SHAP beyond the standard graphics: co-design of ML-models in earth sciences
Hans Korving

Discover the transformative power of SHAP values in machine learning as we decode complex insights into actionable information for stakeholder involvement. Through a unique blend of feature attribution, dimensionality reduction, and clustering, we uncover the crucial drivers behind model predictions, enabling active participation in model co-design. Join us for an engaging session to explore practical case studies and share invaluable lessons for leveraging SHAP values effectively.

Van Gogh
14:10
35min
Uncertainty quantification: How much can you trust your machine learning model?
Mojtaba Farmanbar

Uncertainty identification in machine learning is crucial for making robust decisions, enhancing model trustworthiness, and assessing risks. By quantifying and understanding uncertainty, machine learning practitioners can build more reliable and trustworthy AI systems.

Imagine having a machine learning model that predicts whether a given image contains a cat or not. While traditional machine learning approaches provide binary predictions (cat or not cat) for each image, you do not know the confidence level of the model in each prediction.

Conformal prediction (CP) is a machine learning framework for uncertainty quantification that adds a layer of confidence estimation to model predictions. Instead of just giving a binary answer, it provides a set of possible outcomes (a prediction set) sized according to how confident the model is. These prediction sets come with coverage guarantees: they contain the true outcome in at least a specified percentage of cases. Importantly, conformal prediction is agnostic to the underlying machine learning model, and it makes no assumptions about the underlying data distribution. In other words, it is a model-agnostic and distribution-free approach.

As a result, conformal prediction offers a robust framework that empowers stakeholders to make more informed decisions, particularly in high-stakes domains such as healthcare, finance, and autonomous systems.
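
The idea can be sketched with split conformal prediction for the cat / not-cat example, in plain Python. This is a minimal sketch, not necessarily the method shown in the talk: it assumes a held-out calibration set that is exchangeable with new data, and it uses one minus the predicted probability of a label as the nonconformity score. All probabilities below are made up for illustration.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    # 1-indexed rank of the score used as the threshold.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(scores)[rank - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score (1 - probability) is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


# Hypothetical calibration data: probability the model gave to the TRUE label.
calibration_true_label_probs = [0.9, 0.8, 0.95, 0.6, 0.7, 0.85, 0.99, 0.45, 0.4]
scores = [1.0 - p for p in calibration_true_label_probs]

# Target: prediction sets that contain the true label at least 80% of the time.
threshold = conformal_quantile(scores, alpha=0.2)

confident = prediction_set({"cat": 0.92, "not cat": 0.08}, threshold)
uncertain = prediction_set({"cat": 0.52, "not cat": 0.48}, threshold)
```

For the confident image the prediction set holds a single label, while for the ambiguous image it holds both labels, which is how the method signals that the model cannot tell them apart at the requested coverage level.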

Rembrandt
14:10
35min
pydiverse pipedag - A library for data pipeline orchestration optimizing high development iteration speed
Martin Trautmann

This talk presents github.com/pydiverse/pydiverse.pipedag, a data pipeline orchestration library for rapid iterative development with automatic cache invalidation. It allows users to focus on their actual tasks: Writing analytics and data transformation code in pandas, polars, sqlalchemy, ibis, and the like.

The talk is meant for people working with data pipelines at beginner to advanced level. It teaches best practices in dealing with code versioning vs. data versioning, working with small data samples during development, and how to gradually improve the coding style of an existing pipeline.

Escher
14:55
14:55
35min
From mocking to rocking your tests with testcontainers
Barend Linders

Our software and services often have closely linked dependencies. When writing Python unit tests, one tends to mock away these dependencies (e.g. a database call). Mocking - while great in some scenarios - has some drawbacks as well. At Nicolab we have embraced testcontainers for a lot of these situations. I would like to show you this tool, which lets you easily spin up your own containers representing your external dependencies and use them in your tests, without ever having to leave your Python / pytest environment.

Van Gogh
14:55
35min
How to measure a city
Françoise Provencher

This story is about urbanism, geospatial analysis, and how people look to numbers for facts when, in fact, numbers are opinions. You will learn how some cities in Canada are designing indicators to measure how livable they are, the tradeoffs of good metric design, and how your methodology encodes your opinions into the numbers. This talk will benefit anyone using data to support decisions, and no prior knowledge is required.

Rembrandt
14:55
35min
Who's Who and Where's What: Dealing With Names and Addresses Around the World
Philip Blair

Data scientists and app developers today must deal with data coming from different regions around the world. Whether handling signup form data, scraping news articles, or building LLM pipelines, among the most common types of unstructured text data seen today are names of people and addresses; however, the conventions for writing them differ completely depending on the country of origin. In this talk, intended for data scientists and non-technical stakeholders alike, I will provide an introduction to what these types of data can look like, a number of misconceptions about what they do or don't contain, and some examples of how to work with them.

Escher
15:30
15:30
20min
Break
Rembrandt
15:30
20min
Break
Van Gogh
15:30
20min
Break
Escher
15:50
15:50
35min
Debugging as an experimental science
Sarah Diot-Girard

If there is only one experience shared by anyone who ever wrote code, it is debugging. Then, why is it so often a frustrating experience, abstruse and wasteful?
It does not have to be that way. This talk will focus on methods to help with making debugging a rational, positive experience, and we will explore how debugging can even help with gaining some valuable knowledge about your codebase.

Escher
15:50
35min
From Predictions to Action: Fusing Machine Learning and Mixed Integer Linear Programming
Stanley van de Meent, Jakub Tomaszewski

Imagine leveraging AI to make critical decisions, only to find that your perfectly optimized solution causes more problems than it solves. This talk uncovers the intriguing yet challenging fusion of Machine Learning (ML) and Mixed Integer Linear Programming (MILP). We will delve into how combining these powerful tools can lead to breakthroughs—or disasters—if not managed carefully. From misaligned objectives to feedback loops spiraling out of control, real-world scenarios will illustrate where things can go wrong and how to avoid these pitfalls. By the end, you will have a roadmap for harnessing this powerful combination without falling into common traps.

Van Gogh
15:50
35min
LLM Security 101 - An Introduction to AI Red Teaming
Richie Lee

At the intersection of cybersecurity and data science, AI red teaming uses adversarial attacks to test and secure LLM systems, playing a key role in AI security and safety. Following Microsoft's best practices, this hands-on session is tailored to data scientists who acknowledge the need to secure Gen AI systems, but simply do not know how (yet). Attendees will learn to assess and mitigate LLM risks, observe a live penetration testing demo, and gain practical steps to embark on their own AI security journeys.

Rembrandt
16:30
16:30
50min
Keynote - The Art of Language: Mastering Multilingual Challenges in LLMs
Marzieh Fadaee

Multilingual Natural Language Processing (NLP) has played a pivotal role in the recent advancements of Large Language Models (LLMs). The ability to understand and generate text in multiple languages has expanded the capabilities of these models, making them more versatile and accessible to a global audience. In this talk we explore the current landscape of multilingual LLMs, addressing the challenges and opportunities that lie ahead. The discussion will cover critical topics such as the scarcity of multilingual datasets, the evaluation and benchmarking of multilingual models, and the unique safety considerations when dealing with diverse languages.

Additionally, the talk will highlight the challenges and gains of the global open science efforts, such as Aya and Global Exams, to build state of the art multilingual models and resources. Finally, we discuss the unexplored areas in multilingual NLP, providing insights into potential future research directions and the ongoing efforts to enhance the performance and applicability of LLMs.

Rembrandt
17:20
17:20
10min
Closing notes

Closing notes Friday

Rembrandt
18:00
18:00
35min
Changing Hats: Techie vs Comic
Arda Kaygan

.

Van Gogh