🐿️
25

Pro tip: I was stuck on a weird AI training bug for a week until I tried a simple fix

I was training a small image model on my laptop and it kept crashing after 3 hours every time. I tried changing the learning rate and batch size, but nothing worked. A friend in Denver told me to just shuffle the training data order before each run, which I thought was too basic to help. I did it anyway, and the next run went all the way to completion without a single crash. Has anyone else fixed a stubborn training issue with something that seemed too easy?
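
For anyone wanting to try the same thing, here is a minimal sketch of reshuffling the sample order before a run. It is not the exact code from the run described above. The `samples` list, the file names, and the `shuffled_order` helper are all hypothetical, and it uses only the Python standard library so it does not depend on any particular training framework.

```python
import random

def shuffled_order(num_samples, seed=None):
    """Return a random ordering of the indices 0..num_samples-1."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    order = list(range(num_samples))
    rng.shuffle(order)
    return order

# Hypothetical dataset: (image_path, label) pairs in their original on-disk order.
samples = [(f"img_{i}.png", i % 2) for i in range(10)]

# Pick a new seed per run (or pass None) to get a different order each time.
order = shuffled_order(len(samples), seed=42)
shuffled_samples = [samples[i] for i in order]
```

Passing a fixed seed makes a given run reproducible, which helps if the crash comes back and you need to replay the same order.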
3 comments

ericking · 2mo ago
Ever try just restarting your computer before a long training run?
7
brookewood · 2mo ago
Maybe @ericking is onto something with that restart trick for clearing memory leaks.
2
max808 · 24d ago
Curious about that crash pattern - did you notice if it always failed around the same epoch or batch number before the shuffle? Had a similar situation where the dataset was accidentally sorted by difficulty, so it hit a wall when it got to the weird edge cases all at once. The shuffle probably spread those hard examples out enough to keep the loss from spiking too high at any single point.
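
A rough way to see the effect, as a sketch only: the `difficulty` scores here are made up (a real dataset would need some proxy, such as per-sample loss from an earlier run), and `batch_means` is a hypothetical helper. It compares the hardest batch in a sorted dataset against the hardest batch after a shuffle.

```python
import random

def batch_means(values, batch_size):
    """Mean of each consecutive batch of `values`."""
    means = []
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        means.append(sum(batch) / len(batch))
    return means

# Made-up difficulty scores, sorted from easy (0.0) to hard (1.0).
difficulty = [i / 999 for i in range(1000)]

# Sorted order: the last batch is made up entirely of the hardest samples.
sorted_peak = max(batch_means(difficulty, 50))

# Shuffled order: hard samples are spread across batches.
shuffled = difficulty[:]
random.Random(1).shuffle(shuffled)
shuffled_peak = max(batch_means(shuffled, 50))

print(f"hardest batch, sorted:   {sorted_peak:.3f}")
print(f"hardest batch, shuffled: {shuffled_peak:.3f}")
```

If the sorted peak is much higher than the shuffled one, the original order was front-loading easy samples and saving the hard ones for a single stretch, which would fit a failure that shows up at the same point every run.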
3