Recurse Center - Batch 3 - Cycle 20250128-20250130 - Processing


Processing

This was a pretty good cycle with a lot of heads-down work on text and audio preprocessing of the CommonVoice dataset. I also got to pair a bit on some new infrastructure setup on Heap.

Day 1

Today I worked on the audio and text dataset pre-processing pipeline for my speech transformer. According to the paper, they used the "Wall Street Journal (WSJ) dataset, training on si284, validating on dev93 and evaluating on eval92 set."

It's not specified in the paper which specific WSJ speech corpus they used, as the Linguistic Data Consortium offers two: WSJ0 and WSJ1.

Either way, they cost $1,500 and $2,500 respectively for non-members, which is... a bit outside of my budget at the moment :)

So instead, I'm using Mozilla's CommonVoice dataset: a completely free, open-source, and publicly available speech dataset made up of community-donated speech samples.

I spent a good amount of time exploring the dataset, which you can follow along with here:

commonvoice.ipynb

I also worked on some classes for pre-processing this dataset for training my speech transformer, in a new speech transformer dataset file:

CommonVoiceDataset

From some of the data analysis, we were able to determine that filtering out speech recordings with sentences > 100 characters would still allow us to use more than 95% of the dataset, which is good for keeping our sequence length tight for efficient training.

# keep only rows whose sentence length fits within the configured maximum
self.tsv_items = self.tsv_items[self.tsv_items['sentence'].str.len() <= self.cfg.max_seq_length]
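
To sanity-check that cutoff, here's roughly the kind of check I ran in the notebook (a sketch; the validated.tsv filename and the 100-character threshold are assumptions on my part):

import pandas as pd

# rough sketch: what fraction of clips survive a 100-character sentence cutoff?
tsv_items = pd.read_csv("validated.tsv", sep="\t")
coverage = (tsv_items["sentence"].str.len() <= 100).mean()
print(f"{coverage:.1%} of clips have sentences of 100 characters or fewer")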

I also learned how to use the apply() function in pandas to efficiently clean up elements in our dataset, like renaming file extensions and cleaning up sentences.

For example, we can write functions like these:

# replace .mp3 with .wav
def replace_mp3_with_wav(path):
    return path.replace(".mp3", ".wav")

# clean sentences
def clean_sentences(s):
    return s.split('\t')[0].strip().lower()

And "apply" them like this!:

self.tsv_items['path'] = self.tsv_items['path'].apply(replace_mp3_with_wav)

self.tsv_items['sentence'] = self.tsv_items['sentence'].apply(clean_sentences)

One of the more interesting things I came across while working on this Dataset class was handling normalization. From the paper, they write:

"The input acoustic features were 80-dimensional filterbanks extracted with a hop size of 10ms and a window size of 25ms, extended with temporal first and second order differences and per-speaker mean subtraction and variance normalization."

I was curious to understand more about how this normalization worked and why it was necessary. Asking Claude, I got some answers:

Prompt:

Can you help me with normalization? Still don't provide any code.

Response:

For normalization, we have a few options based on the paper's specification of "per-speaker mean subtraction and variance normalization":

Simple Approach (per utterance):

- Calculate mean across time and frequency for each utterance

- Subtract mean

- Calculate variance

- Normalize by standard deviation

Pros: Simple, no speaker tracking needed

Cons: Not exactly what paper describes

Per-Speaker Approach (matches paper):

- Track speaker IDs from CommonVoice

- Calculate stats across all utterances from same speaker

- Store/cache statistics

- Apply during getitem

Pros: Matches paper exactly

Cons: More complex, needs speaker management

Global Statistics (alternative):

- Calculate stats across entire dataset

- Store in dataset class

- Apply same normalization to all samples

Pros: Simple, consistent

Cons: Loses speaker-specific characteristics

For best results, we should implement the per-speaker approach. This would need:

1. Speaker ID tracking in init

2. Statistics calculation/storage

3. Lookup during normalization

Would you like me to explain how to implement any of these approaches in detail?

Remember: The paper used per-speaker normalization because different speakers have different acoustic characteristics (pitch, volume, speaking rate, etc.)

I figured out where speaker ID was stored in the dataset (as client_id), but still wanted to understand a bit more about what this normalization would actually accomplish for training purposes. In particular, because of my research on bias in speech recognition systems, I was really curious to know how normalization affected how the model would handle (or not handle) different acoustic characteristics:

Prompt:

And why do we need to do normalization?

Response:

Normalization is important for several reasons:

- Speaker Variation:

    - Different speakers have different voice characteristics

    - Some speak louder/softer

    - Different pitch ranges

    - Different speaking rates

    - Different accents/pronunciations
- Model Training Benefits:

    - Makes training more stable

    - Helps model focus on important features

    - Reduces impact of speaker-specific variations

    - Makes features more consistent across speakers

- Technical Reasons:

    - Brings features to similar scale

    - Prevents very large/small values

    - Helps with gradient flow

    - Makes optimization easier

- Paper's Approach:

    - They use per-speaker normalization

    - Helps model learn speaker-independent patterns

    - Makes model more robust to different speakers

Think of it like this:

    - Without normalization: Model has to learn to handle raw differences between speakers

    - With normalization: Model can focus on the actual speech patterns rather than speaker-specific characteristics

Would you like me to:

    - Show what the features look like before/after normalization?

    - Explain different normalization strategies?

    - Or continue with implementation?

I ended up going with a simpler approach to normalization, which normalizes on a per-utterance basis.
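
For reference, here's a minimal sketch of that per-utterance approach, assuming features is a tensor of filterbank features for a single clip (the function name and the epsilon are my own):

import torch

def normalize_utterance(features, eps=1e-8):
    # per-utterance normalization: zero mean, unit variance over the whole utterance
    mean = features.mean()
    std = features.std()
    return (features - mean) / (std + eps)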

I'll probably go back and re-evaluate that approach at a later date to see how it affects model performance. All said though, I was really surprised to learn how normalization works to average out speaker-specific characteristics in order to make training more efficient and effective, which is something that comes up in Mills and Li's research on speech recognition in Vocal Features. Could this be one reason why ASR systems exhibit bias?

Finally, I worked on setting up batching with a custom collate function that follows the paper's approach to preparing the data for the model:

"In the training stage, the samples were batched together by approximate feature sequence length and each training batch contained 20000-frames features."

More on training next cycle!

Day 2

Today I paired with a fellow Recurser on setting up and playing around with Ray, an orchestration infrastructure tool that handles queuing and running jobs across GPU machines (among other things). I got to learn a bit more about Kubernetes, SSH tunneling, and port forwarding, and got to see a job get initialized on Ray (though it didn't run due to some disk space issues on Heap :/ )

In debugging the disk space issue, I learned about ncdu as a prettier way to see and analyze disk space usage.

Day 3

Today was mostly spent reviewing notes and writing this log.

Things for next cycle

Next cycle I should be training my speech transformer! I'm hoping I can do more stuff with Ray as well (and maybe do my training through its interface?)