Recurse Center - Batch 3 - Cycle 20250205-20250207 - Training


Training

I started training my speech transformer! A cool moment came during some debugging, when I decoded a tokenized piece of text that I had only ever seen as a list of integers and watched this sentence "emerge" from just numbers.

print("text: ", text)

...

text:  tensor([[ 1, 26, 11, 28,  3, 30,  5, 15,  8, 22, 22, 30, 16, 28, 30, 22, 18, 24,
         15,  3,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0]])

...

print('string: ', self.cv.decode(text[0].tolist()))

...

string:  why bless my soul

I'm dealing with an OutOfMemory error right now. It seems like there is a bug in my collate_fn(), which is supposed to cap frame lengths at 20,000, but batches are entering my attention function much larger than that, causing memory issues (computing attention scores scales quadratically with sequence length).

audio feature shape:  torch.Size([1, 46675, 80])
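
My next step is to make the collate_fn actually enforce that cap before padding. A rough sketch of what I have in mind, assuming each batch item is an (audio_features, text_tokens) pair (the exact layout is just illustrative):

import torch
from torch.nn.utils.rnn import pad_sequence

MAX_FRAMES = 20_000

def collate_fn(batch):
    # batch is a list of (audio_features, text_tokens) pairs
    # truncate anything over the cap so attention stays manageable,
    # then pad everything to the longest remaining length in the batch
    audios = [a[:MAX_FRAMES] for a, _ in batch]
    texts = [t for _, t in batch]

    audio_padded = pad_sequence(audios, batch_first=True)                 # (batch, frames, 80)
    text_padded = pad_sequence(texts, batch_first=True, padding_value=0)  # pad id 0
    return audio_padded, text_padded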

I'm excited to continue debugging and scale up training until I can get to a full YOLO run, hopefully next cycle!

Day 1

Today I started working on my SpeechTransformerLearner class, which is where training happens (for some reason I feel like calling it learning rather than training, even though I know training is the convention).

SpeechTransformerLearner class
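
Roughly, the class bundles the model, optimizer, and data together with the training logic. A minimal sketch of the shape I'm going for (names and arguments aren't final):

class SpeechTransformerLearner:
    def __init__(self, model, optimizer, train_loader, device="cuda"):
        self.model = model.to(device)
        self.optimizer = optimizer
        self.train_loader = train_loader
        self.device = device

    def compute_loss(self, batch):
        # forward pass + loss for one batch of (spectrogram, tokens) pairs
        ...

    def learn(self, epochs):
        # the main loop: iterate batches, backprop, step the optimizer, log metrics
        ...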

The first function I worked on was compute_loss(), which lets us see whether or not our model is getting any better at predicting words from the spectrogram/text pairs used in training/learning.
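
Concretely, it's a cross-entropy over the decoder's next-token predictions. A sketch of what I mean, assuming teacher forcing and a padding id of 0 (those trailing zeros in the tensor above):

import torch.nn.functional as F

def compute_loss(model, audio_features, text_tokens, pad_id=0):
    # teacher forcing: feed tokens[:-1] to the decoder, predict tokens[1:]
    decoder_input = text_tokens[:, :-1]
    targets = text_tokens[:, 1:]

    logits = model(audio_features, decoder_input)  # (batch, seq_len, vocab_size)

    # ignore padding so the loss only counts real tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )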

For speech recognition, the metric we really care about is Word Error Rate (WER), which is measured by comparing a speech recording's ground-truth transcript against the transcript the model predicts from that same recording.

WER is measured by this formula:

        WER = (Substitutions + Deletions + Insertions) / (Total Words in Ground Truth)

It can be calculated by finding the word-level Levenshtein distance between the predicted words and the ground-truth transcription.
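
A sketch of that calculation, just to show the idea (not necessarily the code that ends up in the learner):

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()

    # dp[i][j] = edit distance between the first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # substitution, deletion, insertion
                dp[i][j] = 1 + min(dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1])

    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("why bless my soul", "why bless the soul"))  # 0.25: one substitution out of four words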

Day 2

Today I worked on more functions for my learner, plus a few other helper functions.

I also paired with another Recurser on some virtual machine / devops-y stuff on Heap, which was loads of fun :)

Day 3:

Today I continued working toward my first training run, starting by writing more code for checkpointing, logging, and tracking metrics like grad norm to see whether gradient clipping is necessary, and if so, what value to clip at.
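
The grad-norm part is small: torch's clip_grad_norm_ returns the total norm before clipping, so it can double as the metric to log even while I'm still deciding on a clipping value. A sketch of the step I'm adding (the max_norm of 1.0 is just a placeholder, not a tuned choice):

import torch

def training_step(model, optimizer, loss, max_norm=1.0):
    loss.backward()

    # clip_grad_norm_ returns the total grad norm *before* clipping,
    # which is exactly the metric I want to track
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

    optimizer.step()
    optimizer.zero_grad()
    return grad_norm.item()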

I was able to hook my learn() function up to Weights and Biases, and it was cool to see my training show up in the dashboard and get some MLOps experience.
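
The hookup itself is mostly just initializing a run and logging a dict of metrics each step; roughly like this, where the project name and metric values are placeholders for whatever I end up logging:

import wandb

run = wandb.init(project="speech-transformer")  # placeholder project name

# inside learn(), once per step
wandb.log({"train/loss": 3.91, "train/grad_norm": 1.2}, step=0)  # dummy values for illustration

run.finish()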

Finally, I started to get some training off the ground, but encountered some bugs around how my custom batching works.

I did get to do some fun debugging and fiddled around with decoding some tokenized strings, which was gratifying to see!

This was the first text I got to see come out of decode():

string:  dance of fire is a small trilogy

I'm really excited to get this working and scale up testing as much as I can.

Things for next cycle

Next cycle I'm going to continue training, and hopefully get that working on Ray.