09-20, 14:55–15:30 (Europe/Amsterdam), Escher
Data scientists and app developers today must deal with data coming from different regions around the world. Whether they are handling signup form data, scraping news articles, or building LLM pipelines, they will find that names of people and addresses are among the most common types of unstructured text data; however, the conventions for writing them differ widely depending on the country of origin. In this talk, intended for data scientists and non-technical stakeholders alike, I will provide an introduction to what these types of data can look like, a number of misconceptions about what they do or don't contain, and some examples of how to work with them.
Everyone has preconceptions about what names and addresses look like, depending on where they are from. When building products or analyzing data involving people and places from other cultures, it is important to make sure that these biases don't cause things to break unexpectedly. This introduction to the world of names and addresses aims to walk listeners through the following:
- What names typically look like in western countries versus other regions of the world
- Assumptions about names that don't always hold true when looking around the world
- The standard format of addresses that we are used to in Europe/North America
- Where these address formats break down when going elsewhere
- Open source and proprietary tools that Python data scientists can use to process these data, with a demo of libraries such as nominally and pypostal
- Rethinking why we want to parse names at all, and what alternatives we have
While the final sections will be lightly technical, the talk as a whole is meant to serve as an interesting overview for both technical and non-technical audiences, and aims to help all of us build tools that work better for everyone.
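As a small taste of the kind of assumption the talk examines, the sketch below shows how a common shortcut, treating the first word of a name as the given name and the last word as the family name, goes wrong outside the cases it was written for. The function `naive_split` is a hypothetical helper written for this illustration, not part of nominally, pypostal, or any other library, and the sample names are illustrative choices rather than examples taken from the talk.

```python
from typing import Tuple


def naive_split(full_name: str) -> Tuple[str, str]:
    """Hypothetical helper: assume 'given name first, family name last'."""
    parts = full_name.split()
    # First whitespace-separated token is taken as the given name,
    # last token as the family name; anything in between is dropped.
    return parts[0], parts[-1]


# Fits the assumption: given name, then family name.
print(naive_split("Ada Lovelace"))

# Family name comes first, so the two fields end up swapped.
print(naive_split("Mao Zedong"))

# Two family names ("García Márquez"): the first one is silently lost.
print(naive_split("Gabriel García Márquez"))

# A single-word name: the same token is reported as both fields.
print(naive_split("Sukarno"))
```

Only the first call gives a sensible answer. The others run without raising an error, which is what makes this kind of bug easy to miss until real data arrives.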
Philip is the director of Blair Software, an Amsterdam-based AI consultancy specializing in NLP software. Originally from the United States, he spent nearly a decade doing applied research and development on NLP systems. He conducts a mixture of AI software development, AI corporate advisory work, and advising software companies on the impact of transatlantic AI regulation in Europe and the United States.