PyData Amsterdam 2024

How Dimensional is a `pandas.DataFrame`, anyway?
09-19, 16:20–16:55 (Europe/Amsterdam), Van Gogh

Folks: for the last time, a pandas.DataFrame is one-dimensional. Yes, the docs call the pandas.DataFrame a “[t]wo-dimensional, size-mutable, potentially heterogeneous tabular data.” The docs are wrong. You're wrong. Everyone’s wrong. And I’m going to prove it.


In this talk, we will discuss the foundational theory underlying common structures encountered when modelling data in Python: the builtin Python list, tuple, and dict; moving on to array.array in the Python standard library; continuing to the numpy.ndarray; and concluding with an in-depth discussion of the pandas.array, pandas.Series, and pandas.DataFrame. Along the way, we’ll discuss the meaning of homogeneity and heterogeneity, as well as adapted concepts like “loose homogeneity” and “loose heterogeneity.” We’ll discuss why pandas.DataFrame.unstack is actually semantically meaningful (but why pandas.DataFrame.melt is not), and we’ll discuss an incredibly common use-case (portfolio analyses on time series trading data) that is almost always incorrectly modelled, to the detriment of ease and speed of computation.
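
As a rough illustration of the structures listed above (not material taken from the talk itself), the sketch below contrasts how each container treats element types and then applies pandas.DataFrame.unstack to long-format data. The tickers, dates, prices, and variable names are hypothetical, invented only for this example.

```python
from array import array

import numpy as np
import pandas as pd

# A builtin list places no constraint on element types (heterogeneous).
mixed = [1, "two", 3.0]

# array.array enforces a single typecode for every element (homogeneous).
ints = array("i", [1, 2, 3])
try:
    ints.append("four")
    array_rejected = False
except TypeError:
    array_rejected = True

# A numpy.ndarray has exactly one dtype for the whole array,
# regardless of how many dimensions it has.
grid = np.array([[1, 2], [3, 4]])

# A pandas.DataFrame carries one dtype per column; each column is a
# one-dimensional pandas.Series. (Hypothetical sample data.)
df = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB"],
        "price": [10.5, 20.25],
        "volume": [100, 200],
    }
)

# Long-format prices indexed by (date, ticker): one row per observation.
index = pd.MultiIndex.from_product(
    [["2024-01-01", "2024-01-02"], ["AAA", "BBB"]],
    names=["date", "ticker"],
)
long_prices = pd.DataFrame({"price": [10.5, 20.25, 10.75, 20.0]}, index=index)

# unstack moves the "ticker" index level into the columns,
# giving one row per date and one column per ticker.
wide_prices = long_prices.unstack("ticker")
```

Inspecting `df.dtypes` shows a separate dtype for each column, whereas `grid.dtype` is a single value for the entire array. After the `unstack` call, `wide_prices["price"]` has one column per ticker, and every column shares the same meaning and dtype.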

James serves as lead instructor for Don't Use This Code. Don't Use This Code provides consulting, coaching, and training services to a number of clients in the financial services and tech industries, helping them develop greater expertise in the use of Python for data analysis, computational simulation, and automation.

If you are interested in developing your staff's expertise in Python, Pandas, or with data analysis and software development in general, please reach out to us at [email protected]