Recurse Center - Batch 3 - Cycle 20250112-20250114 - Encoding


Encoding

The past few days were really productive, and I made a lot of progress on the speech transformer.

A lot of what I'm trying to do is brand new for me (building an automatic speech recognition system from scratch), and without a mentor to introduce me to the concepts I'm unfamiliar with, I've been relying on Claude in Cursor to explain them at a high level so I can write the code myself. I've explicitly asked it to never provide any code, and instead to help me build intuition and guide me through the implementation with questions and explanations. A sample prompt might look like:

I'm implementing the decoder portion of a Speech Transformer following the paper "Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition". I have already implemented:

1. Complete encoder with:
   - Initial feature extraction (Conv2D layers)
   - Linear projection and positional encoding
   - Multi-head attention (with optional masking)
   - Feed-forward networks
   - Layer normalization and residual connections

Now I need to implement the decoder. According to the paper, each decoder block should have:

1. Masked multi-head self-attention
2. Cross-attention with encoder outputs
3. Position-wise feed-forward networks
4. Layer normalization and residual connections

Additionally, the decoder needs:

1. Character embedding layer at the input
2. Final linear + softmax for output probabilities

Please help guide me through implementing these components. Don't provide code directly - instead help me build intuition and guide me through the implementation with questions and explanations.

Or I might write a prompt with the image of the model attached and some text from the paper itself:

Hi! Please read the attached prompt and start helping me along. Today I want to finish the implementation of the decoder. Please don't provide any code, just help me build intuition and guide me through the implementation with questions and explanations. I've also attached an image of the architecture. Here is the description of the decoder from the paper: "The decoder is shown in the right half of Figure 2. We firstly employ a learned character-level embedding to convert the character sequence to the output encoding of dimension dmodel, which is added with the positional encoding. Then, the sum of them are inputted to a stack of Nd decoder-blocks to obtain the final decoder outputs. Differently from the encoder-block, each decoder-block has three sub-blocks: The first is a masked multi-head attention which has the same queries, keys and values. And the masking is utilized to ensure the predictions for position j can depend only on the known outputs at positions less than j. The second is a multi-head attention whose keys and values come from the encoder outputs and queries come from the previous sub-block outputs. The third is also positionwise feed-forward networks. Like the encoder, layer normalization and residual connection are also performed to each sub-block of the decoder. Finally, the outputs of decoder are transformed to the probabilities of output classes by a linear projection and a subsequent softmax function."
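
Working through conversations like that, the structure I have in my head for a single decoder block looks roughly like the sketch below. This is intuition-building scaffolding rather than my actual implementation: the class name, hyperparameters, and defaults are placeholders, and it uses PyTorch's built-in nn.MultiheadAttention for brevity instead of my own attention classes.

import torch as t
import torch.nn as nn

class DecoderBlockSketch(nn.Module):
    """Rough sketch of one decoder block from the paper's description:
    masked self-attention, cross-attention over the encoder outputs, and a
    position-wise feed-forward network, each followed by a residual
    connection and layer normalization (placeholder sizes, not my real code)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: t.Tensor, encoder_out: t.Tensor) -> t.Tensor:
        # Causal mask: position j may only attend to positions <= j
        tgt_len = x.shape[1]
        causal_mask = t.triu(t.ones(tgt_len, tgt_len, dtype=t.bool), diagonal=1)

        # 1. Masked self-attention (queries, keys, and values all come from x)
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + attn_out)

        # 2. Cross-attention (queries from the decoder, keys/values from the encoder)
        attn_out, _ = self.cross_attn(x, encoder_out, encoder_out)
        x = self.norm2(x + attn_out)

        # 3. Position-wise feed-forward network
        x = self.norm3(x + self.ffn(x))
        return x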

I've found this way of working with LLMs like Claude extremely helpful, and it's something I was introduced to while working through ARENA during my last batch. In one of the notebooks, they encourage you to use LLMs to help understand code or concepts you aren't familiar with or are seeing for the first time.

From ARENA, they write:

We'll be discussing more advanced ways to use GPT 3 and 4 as coding partners / research assistants in the coming weeks, but for now we'll look at a simple example: using GPT to understand code. You're recommended to read the recent LessWrong post by Siddharth Hiregowdara in which he explains his process. This works best on GPT-4, but I've found GPT-3.5 works equally well for reasonably straightforward problems (see the section below).

This is where I got introduced to the concept of "playing in easy mode" vs "playing in hard mode":

Is using GPT in this way cheating? It can be, if your first instinct is to jump to GPT rather than trying to understand the code yourself. But it's important here to bring up the distinction of playing in easy mode vs playing in hard mode. There are situations where it's valuable for you to think about a problem for a while before moving forward because that deliberation will directly lead to you becoming a better researcher or engineer (e.g. when you're thinking of a hypothesis for how a circuit works while doing mechanistic interpretability on a transformer, or you're pondering which datastructure best fits your use case while implementing some RL algorithm). But there are also situations (like this one) where you'll get more value from speedrunning towards an understanding of certain code or concepts, and apply your understanding in subsequent exercises. It's important to find a balance!

It's been nice finding that balance: learning something new from Claude, and then trying to implement the concept on my own. When something similar comes up again later, I can feel myself getting better at writing things from scratch that I've never seen before, as my intuition continues to grow and new concepts begin to stick :)

Day 1

Today was mostly spent working on the encoder side of the architecture diagram above. That meant implementing a lot of classes and writing tests for them. At the end of the day, our SpeechTransformer class ended up looking like this:

class SpeechTransformer(nn.Module):
    """Speech Transformer model that converts speech spectrograms to text.

    The model follows the architecture from "Speech-Transformer: A No-Recurrence 
    Sequence-to-Sequence Model for Speech Recognition" paper.

    Architecture Overview:
    1. Two Conv2d layers with stride 2 reduce time and frequency dimensions by 4x
    2. Linear projection to d_model dimension
    3. Positional encoding
    4. Transformer encoder blocks
    5. Layer norm

    Currently encoder-only!

    Input shape: [batch_size, time_steps, freq_bins]
    Output shape: [batch_size, reduced_time_steps, d_model]
    """

    def __init__(self):
        super().__init__()
        self.cfg = Config()

        # encoder components
        self.repeat = Repeat(self.cfg)
        self.conv2d_block_one = Conv2DBlock(
            self.cfg,
            self.cfg.n_channels,
            self.cfg.n_out_channels,
            self.cfg.conv2d_kernel_size,
            self.cfg.conv2d_stride,
            self.cfg.conv2d_padding,
        )
        self.conv2d_block_two = Conv2DBlock(
            self.cfg,
            self.cfg.n_out_channels,
            self.cfg.n_out_channels,
            self.cfg.conv2d_kernel_size,
            self.cfg.conv2d_stride,
            self.cfg.conv2d_padding,
        )
        self.reshape = Reshape(self.cfg, "b c ts fb -> b ts (c fb)")
        self.linear = Linear(self.cfg)
        self.encoder_positional_encoder = PositionalEncoder(self.cfg)
        self.encoder_blocks = nn.Sequential(
            *[EncoderBlock() for _ in range(self.cfg.n_encoder_layers)]
        )
        self.layer_norm = LayerNorm(self.cfg)

        # encoder sequential
        self.encoder = nn.Sequential(
            self.repeat,
            self.conv2d_block_one,
            self.conv2d_block_two,
            self.reshape,
            self.linear,
            self.encoder_positional_encoder,
            self.encoder_blocks,
            self.layer_norm
        )

        # decoder components
        self.character_embedding = CharacterEmbedding(self.cfg)
        self.decoder_positional_encoder = PositionalEncoder(self.cfg)

PyTorch has this really nice construct called nn.Sequential, which lets you stack a bunch of nn.Modules together, and then you can simply call:

def forward(self, x: Float[t.Tensor, "batch time_steps freq_bins"]) -> Float[t.Tensor, "batch reduced_time d_model"]: # type: ignore
    return self.encoder(x)

And nn.Sequential will call the forward methods of all those modules in order! That way you pass your input into the Sequential just once, and your data flows through all of those layers, sequentially. :)
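
As a tiny standalone illustration (toy layers here, not the actual model):

import torch as t
import torch.nn as nn

# nn.Sequential calls each module's forward in order, feeding the output
# of one module in as the input of the next.
toy = nn.Sequential(
    nn.Linear(80, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
)

x = t.randn(4, 100, 80)  # [batch, time_steps, freq_bins]
out = toy(x)             # same as applying nn.Linear -> ReLU -> nn.Linear by hand
print(out.shape)         # torch.Size([4, 100, 256])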

Day 2

Today I started working on the decoder side of the model, and got as far as writing the classes for encoding/decoding input text (CharacterVocabulary) and creating our character embeddings (CharacterEmbedding). Here is what those two classes look like.

CharacterVocabulary
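
In spirit, the vocabulary class is something like the sketch below: the special tokens, default alphabet, and method names here are placeholders rather than the exact implementation in the repo.

class CharacterVocabulary:
    """Maps characters to integer ids and back (sketch).

    Assumes a small fixed alphabet plus a few special tokens; the real
    class may differ in the tokens it uses and how it handles unknowns.
    """

    def __init__(self, characters: str = "abcdefghijklmnopqrstuvwxyz '"):
        self.special_tokens = ["<pad>", "<sos>", "<eos>"]
        self.tokens = self.special_tokens + list(characters)
        self.char_to_id = {c: i for i, c in enumerate(self.tokens)}
        self.id_to_char = {i: c for c, i in self.char_to_id.items()}

    def __len__(self) -> int:
        return len(self.tokens)

    def encode(self, text: str) -> list[int]:
        # Wrap the character ids in <sos> ... <eos> markers
        ids = [self.char_to_id[c] for c in text.lower() if c in self.char_to_id]
        return [self.char_to_id["<sos>"]] + ids + [self.char_to_id["<eos>"]]

    def decode(self, ids: list[int]) -> str:
        # Drop special tokens when converting back to text
        return "".join(
            self.id_to_char[i] for i in ids
            if self.id_to_char[i] not in self.special_tokens
        )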

CharacterEmbedding
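
And the embedding class is essentially a thin wrapper around nn.Embedding. Again, this is a sketch with placeholder defaults rather than the exact code; the real class pulls its vocabulary size and d_model from the Config.

import torch as t
import torch.nn as nn

class CharacterEmbedding(nn.Module):
    """Looks up a learned d_model-dimensional vector for each character id (sketch)."""

    def __init__(self, vocab_size: int = 31, d_model: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, ids: t.Tensor) -> t.Tensor:
        # ids: [batch, seq_len] of character ids -> [batch, seq_len, d_model]
        return self.embedding(ids)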

Day 3

Today I had to handle some non-RC things.

Things for next cycle

My hope is to finish the decoder side of the speech transformer model, and to spend some more time documenting and reviewing what I've been learning. It's been a lot so far! Hopefully by the end of the next cycle we can start training this speech transformer on open datasets like CommonVoice.