09-20, 11:05–11:40 (Europe/Amsterdam), Rembrandt
Huge amounts of resources are being spent training large language models in an end-to-end fashion. But did you know that at the bottom of all these models sits an important but often neglected component that converts text to numeric inputs? As a result of weaknesses in this ‘tokenizer’ component, some inputs cannot be understood by language models, causing wild hallucinations, or worse.
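To make this concrete, here is a minimal sketch of what a tokenizer does: it splits text into subword pieces and maps each piece to an integer ID before the model ever sees it. The choice of the open-source tiktoken library and its GPT-2-style "r50k_base" vocabulary is purely illustrative and not something the talk depends on.

```python
# Minimal illustration of tokenization: text -> subword pieces -> integer IDs.
# Library and vocabulary are example choices, not the talk's specific setup.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # GPT-2-style BPE vocabulary

for text in ["Hello, world!", " SolidGoldMagikarp"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> ids={ids} pieces={pieces}")
```

The second string is the notorious glitch string that gives the paper its name; strings like it end up in the vocabulary but are rarely seen during model training, which is exactly the kind of weakness the talk explores.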
This talk will cover some of our recent research on finding which text causes problems for a specific model, and will show you how to break even the most advanced models.
The talk will be based on our recent work “Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models”, but will not go into much mathematical depth. It will focus mainly on interesting results and lessons learnt along the way, as well as on how this relates to better data engineering in general.
15m: The more informative side: introducing tokenization, going into a little technical depth on the simple case of the technique for finding under-trained tokens, and showing some ways that models break (a rough sketch of the idea follows this outline).
10m: The more entertaining side: a ‘zoo of bugs’ covering a number of individual cases across models, aiming to horrify as well as to generalize to bugs outside this specific area.
5m: Wrapping up with some suggestions on whether and how we can avoid the worst of such issues.
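As a rough illustration of the flavour of the simple case mentioned in the outline, the sketch below flags tokens whose embedding rows have unusually small norms as candidates for being under-trained. This is a simplification for illustration only, not the exact indicators used in the paper, and the choice of GPT-2 via Hugging Face transformers is an assumption made just for the example.

```python
# Sketch of one simple indicator for under-trained token candidates:
# embedding rows that barely moved during training tend to have small norms.
# Illustrative simplification; not the paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # example model; any causal LM with accessible embeddings would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

emb = model.get_input_embeddings().weight.detach()  # shape: (vocab_size, dim)
norms = emb.norm(dim=1)

# List the tokens with the smallest embedding norms as candidates.
k = 20
candidate_ids = torch.argsort(norms)[:k].tolist()
for tid in candidate_ids:
    print(tid, repr(tokenizer.convert_ids_to_tokens(tid)), float(norms[tid]))
```

In practice such candidates would still need to be verified, for example by prompting the model and checking whether it can handle or repeat the token at all.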
No prior knowledge of LLM training will be assumed; only some experience using language models via chat interfaces.
Sander is a Machine Learning engineer at Cohere, working on post-training, reward modelling, and model evaluation. Originally from Groningen, he now lives in Denmark.