09-19, 10:35–11:10 (Europe/Amsterdam), Rembrandt
This is the story of a fun idea that turned into a huge benchmark before it turned into a rabbit hole.
I was trying to figure out reasonable default parameters for some of the components in the skrub library. To do that, I went looking for datasets with a permissive license that I could use for benchmarking. That's how I stumbled on some old Kaggle competitions that still had their datasets publicly available. So I should just run a simple benchmark, right?
That's where it all started. There were many lessons. They will all be shared.
I wanted to introduce a new feature into a machine learning library. But that meant I also needed to convince people that it was a good idea, so it was time to find me a convincing benchmark. It's hard to find a single dataset that can persuade a lot of people ... but people seem to take Kaggle seriously ... so maybe I should compare against a large set of old competitions?
Long story short, I started downloading. This led me to a collection of datasets with classification/regression/time-series tasks ready for comparison. That's a lot to handle manually, so I created a Python package to help me benchmark in a somewhat calm manner.
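To give a flavour of what "benchmarking calmly" looks like, here is a minimal sketch (not the actual package) of the kind of loop it wraps: one all-defaults scikit-learn pipeline with skrub's TableVectorizer, cross-validated over a folder of prepared datasets. The folder layout and the "target" column name are assumptions for illustration.

```python
# A minimal, illustrative sketch -- not the actual package. It assumes a
# hypothetical folder of prepared CSVs that each contain a "target" column.
from pathlib import Path

import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

results = []
for csv_path in Path("datasets/").glob("*.csv"):  # assumed layout
    df = pd.read_csv(csv_path)
    X, y = df.drop(columns=["target"]), df["target"]  # assumed target column

    # Everything on defaults: TableVectorizer turns the mixed column types
    # into numbers, the gradient booster does the learning.
    pipeline = make_pipeline(TableVectorizer(), HistGradientBoostingClassifier())
    scores = cross_val_score(pipeline, X, y, cv=5)
    results.append({"dataset": csv_path.stem, "mean_cv_score": scores.mean()})

print(pd.DataFrame(results).sort_values("mean_cv_score", ascending=False))
```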
However, the surface area only grew once I got there. Suddenly I was dealing with questions like:
- How do you set up proper experiments for all these datasets?
- What's a reliable way to deal with the compute? Do I have to use the big three cloud providers, or is there a better avenue worth exploring?
- What can we learn about sensible defaults for machine learning pipelines? Is there really such a thing as "one general setting"? Or do we always need to tune per dataset?
- How hard is it to get close to the best-performing Kaggle scores? How far can we get with simple models? Do hyperparameters really matter? Is XGBoost All You Need™? Or is scikit-learn a simple winner?
As always with rabbit holes, it quickly turned into an ants' nest, because there's always one more thing to check. All of that will be explained in this talk, and I'll try to spend plenty of time sharing lessons that I did not expect to learn, but did nonetheless.
Vincent is a senior data professional, and recovering consultant, who has worked as an engineer, researcher, team lead, and educator. He is especially interested in understanding algorithmic systems so that failures can be prevented. As such, he prefers simpler solutions that scale and worries more about data quality than about the number of tensors we throw at a problem. He's also well known for creating calmcode, as well as a dozen or so open-source packages.
He's currently employed at probabl, where he works together with scikit-learn core maintainers to improve the ecosystem of tooling.