Recurse Center - Batch 3 - Cycle 20250305-20250307 - Debugging


Debugging

This cycle was spent getting back from a little break and continuing to debug my speech transformer.

Day 1

Unfortunately my copy of the CommonVoice dataset got deleted, so I spent the day re-downloading it and running scripts to set up the dataset again. Luckily everything was scripted, so it didn't take too long!

CommonVoice dataset scripts

Day 2

Today was spent mostly debugging why I was getting bad output from my model while training. Most of that time went into the CharacterVocabularly class and its encode/decode functions, to understand how tokenization and decoding were functioning and whether some issue there was causing these weird output patterns in the text as it passed through the model.
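A round-trip check is one cheap way to rule the vocabulary in or out as the culprit. The sketch below is not the actual CharacterVocabularly class: `ToyCharVocab`, its method signatures, and the `pad_to` and `keep_specials` parameters are all hypothetical stand-ins, and the special token names are only assumed from the strings that show up in the decoded output in this post.

```python
class ToyCharVocab:
    """Hypothetical stand-in for a character vocabulary (not the real class)."""

    # Token names assumed from the decoded strings shown in this post.
    SPECIALS = ["PAD", "SOS", "EOS", "unknown_char"]

    def __init__(self, chars, use_special_chars=True):
        self.use_special_chars = use_special_chars
        # Special tokens get the first ids, then every distinct character.
        self.itos = list(self.SPECIALS) + sorted(set(chars))
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def encode(self, text, pad_to=None):
        unk = self.stoi["unknown_char"]
        # Characters missing from the vocabulary map to unknown_char.
        ids = [self.stoi.get(ch, unk) for ch in text]
        if self.use_special_chars:
            ids = [self.stoi["SOS"]] + ids + [self.stoi["EOS"]]
        if pad_to is not None:
            ids += [self.stoi["PAD"]] * (pad_to - len(ids))
        return ids

    def decode(self, ids, keep_specials=False):
        toks = [self.itos[i] for i in ids]
        if not keep_specials:
            toks = [t for t in toks if t not in self.SPECIALS]
        return "".join(toks)


vocab = ToyCharVocab("abcdefghijklmnopqrstuvwxyz ")
text = "for example colombia chile argentina and venezuela"

# Round trip: encoding then decoding should give back the original text.
assert vocab.decode(vocab.encode(text, pad_to=60)) == text

# Keeping the special tokens makes out-of-vocabulary characters visible.
print(vocab.decode(vocab.encode("bridge."), keep_specials=True))
```

If the round trip holds for the real class too, the repeated characters are more likely coming from the model or the training loop than from tokenization. In this toy version a period is not in the vocabulary, so it decodes as unknown_char; the unknown_char right before EOS in the sample output could have a similar cause, though that is only a guess.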

With using_special_chars = True, we get output that looks like this:

Predicted text: ooooooooooooooooooooooooooooooooounknown_charhoooooounknown_charunknown_charhhvvvbsvvvbbboooolllllll...
Actual text: SOShe also fought at the battle of bothwell bridgeunknown_charEOSPADPADPADPADPADPADPADPADPADPADPADPA...

With using_special_chars = False, we get this kind of output:

Predicted text: dddd   fffyyfffffdddiooooooffooeeeehhiooooooonnnnniioooooooinnniinnnnniiuueeeeeeexyyyfyjuooeeeeedeuu
Actual text: for example colombia chile argentina and venezuela

Something is happening (or not happening) with the predicted text, so I'll have more sleuthing to do. Excited to see what I end up learning!
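To make "weird output" a bit more measurable between runs, a few summary numbers per string can help. This is only a sketch: `text_stats` is a hypothetical helper, not part of my training code, and the statistics it reports (length, distinct characters, longest run of a repeated character, most common characters) are just one possible choice.

```python
from collections import Counter
from itertools import groupby


def text_stats(text):
    """Summarize how repetitive a decoded string is."""
    # groupby yields one group per run of identical consecutive characters.
    runs = [len(list(group)) for _, group in groupby(text)]
    return {
        "length": len(text),
        "distinct_chars": len(set(text)),
        "longest_run": max(runs, default=0),
        "top_chars": Counter(text).most_common(3),
    }


# Strings copied from the using_special_chars = False example above.
predicted = "dddd   fffyyfffffdddiooooooffooeeeehhiooooooonnnnniioooooooinnniinnnnniiuueeeeeeexyyyfyjuooeeeeedeuu"
actual = "for example colombia chile argentina and venezuela"

print("predicted:", text_stats(predicted))
print("actual:   ", text_stats(actual))
```

Logging these numbers every few hundred steps would show whether the predictions are moving toward the shape of the targets, even while the text itself is still unreadable.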

Day 3

Today was spent writing this log and continuing to debug the model. Something I want to figure out is a quicker debug feedback loop: starting training, getting output, and stopping the process to free up the GPU memory.
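One possible shape for that loop is to launch training as a child process with a time limit, since GPU memory held by a process is released once the process exits. The sketch below uses only the standard library; `run_briefly` is a hypothetical helper, and the `train.py` script name and `--max-steps` flag in the usage comment are placeholders rather than my real entry point.

```python
import subprocess
import sys


def run_briefly(cmd, max_seconds):
    """Run cmd for at most max_seconds and return whatever it printed.

    subprocess.run kills the child when the timeout expires, and memory
    held by that process (GPU memory included) is released on exit.
    """
    try:
        done = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the same stream
            timeout=max_seconds,
        )
        raw = done.stdout
    except subprocess.TimeoutExpired as exc:
        # Output collected before the timeout, or None if there was none.
        raw = exc.output or b""
    return raw.decode("utf-8", errors="replace")


# Hypothetical usage; the script name and flag are placeholders:
# print(run_briefly([sys.executable, "-u", "train.py", "--max-steps", "20"], max_seconds=120))
```

The `-u` flag keeps the child's output unbuffered so that lines printed before the timeout are actually captured. One caveat: only the direct child is killed, so worker processes it spawned (data loaders, for instance) may need their own cleanup.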

Things for next cycle

I'll be spending the next cycles continuing to train and debug my model. My hope is to get a working demo by the end of the month!