Last time I mentioned One-Hot Encoding and how to tell if your model actually works. Today let’s talk about how models cheat—because they will, given any chance.
One-Hot Encoding: The Simple Way
Last time I covered embeddings (the smart, multi-dimensional approach). Before that, we had one-hot encoding.
Think of it as a giant checklist. Red, Blue, Green becomes [1,0,0], [0,1,0], [0,0,1].
It’s simple but dumb. Red is as different from Pink as it is from Tractor. No relationships, no nuance. With 10,000 categories, you get 10,000 columns. Memory nightmare.
Use it for simple stuff. For anything complex, you want embeddings.
The Validation Trap: Stop Splitting Randomly
This is where everyone messes up. You split data 80-20 training/validation. Most people random shuffle.
Don’t do that.
If you have time series data (sales, stock prices), random splitting lets your model cheat. It sees Tuesday, skips Wednesday, sees Thursday. When asked to predict Wednesday? It’s not predicting—it’s interpolating because it saw Thursday’s data.
Split chronologically instead. First 18 months train, last 6 months validate. Force the model to actually predict the future, not fill in gaps it already saw.
The One Metric Problem
Everyone chases high accuracy. It feels good. But accuracy lies.
Build a disease detector for something affecting 1 in 1,000 people. A model that always says “no disease” is 99.9% accurate. Looks great on paper, completely useless in reality—it misses every sick person.
Better Approaches
Don’t rely on accuracy alone:
- Precision/Recall: Are you catching rare events? How many false alarms?
- Confusion Matrix: Shows exactly where the model gets confused
- Eye Test: Look at what it gets wrong. Any patterns?
Also check if the model’s predictions follow the right trends. Does it match the training data patterns?
The Real Lesson
Training models isn’t just hitting run and watching loss decrease. It’s about setting up fair tests the model can’t cheat.
Give a model random splits and one metric? It’ll find the easiest path—which isn’t real learning.
It’s like a student who memorizes a practice test. Ace it because they remember the questions. Give them the real test with different wording? Completely lost. They never understood the concepts, just memorized answers.