Recurse Center - Batch 2 - Cycle 20241203-20241205 - Dimension


Dimension

This cycle I mainly worked on Zora and finished up some data preprocessing on Heap (and fixed some Ansible code along the way!).

Day 1

I worked on a bug in the Ansible code that manages the Heap cluster, debugging a nested for loop that could be improved with some proper data structures.
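To illustrate the general idea (not the actual Ansible code, which isn't shown here), below is a minimal Python sketch of swapping a nested loop for a one-time index. The host names, the `mounts` records, and both function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical inventory data; the real Heap cluster variables are not shown in this post.
hosts = ["node-a", "node-b", "node-c"]
mounts = [
    {"host": "node-a", "path": "/data"},
    {"host": "node-c", "path": "/scratch"},
    {"host": "node-a", "path": "/home"},
]


def mounts_by_host_nested(hosts, mounts):
    """Nested-loop version: rescans every mount record for every host."""
    result = {}
    for host in hosts:
        result[host] = []
        for mount in mounts:
            if mount["host"] == host:
                result[host].append(mount["path"])
    return result


def mounts_by_host_indexed(hosts, mounts):
    """Indexed version: group the mounts once, then look each host up."""
    index = defaultdict(list)
    for mount in mounts:
        index[mount["host"]].append(mount["path"])
    # .get() avoids creating empty entries for hosts without mounts.
    return {host: index.get(host, []) for host in hosts}


nested = mounts_by_host_nested(hosts, mounts)
indexed = mounts_by_host_indexed(hosts, mounts)
```

Both functions return the same mapping, but the indexed one walks each list only once and keeps the matching logic in a single place, which tends to be easier to debug.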

I finished CommonVoice data preprocessing. The directory of wav files, produced from the mp3 files, totals 390GB :) Compare that to the zipped version of the download, which was only 84GB.

(base) jo@broome:~$ du -sh /data/jo/commonvoice/*
84G     /data/jo/commonvoice/commonvoice.tar.gz
483G    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13

(base) jo@broome:~$ du -sh /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/*
79M     /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/clip_durations.tsv
92G     /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/clips
390G    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/clips_wav
4.7M    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/dev.tsv
92M     /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/invalidated.tsv
104M    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/other.tsv
1.3M    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/reported.tsv
4.7M    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/test.tsv
352M    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/train.tsv
456K    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/unvalidated_sentences.tsv
223M    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/validated_sentences.tsv
555M    /data/jo/commonvoice/cv-corpus-19.0-2024-09-13/en/validated.tsv
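For reference, here is a rough sketch of how an mp3-to-wav pass like this could be scripted in Python by shelling out to ffmpeg. This is not necessarily how I ran it: the 16 kHz mono output settings and the helper names are assumptions for the example, and only the `clips` and `clips_wav` directory names come from the listing above.

```python
import subprocess
from pathlib import Path


def wav_path_for(mp3_path: Path, wav_dir: Path) -> Path:
    """Map clips/<name>.mp3 to clips_wav/<name>.wav."""
    return wav_dir / (mp3_path.stem + ".wav")


def ffmpeg_command(mp3_path: Path, wav_path: Path, sample_rate: int = 16000) -> list:
    """Build the ffmpeg call; the sample rate and mono output are assumed settings."""
    return [
        "ffmpeg", "-loglevel", "error", "-y",
        "-i", str(mp3_path),
        "-ar", str(sample_rate),  # resample to the target rate
        "-ac", "1",               # downmix to a single channel
        str(wav_path),
    ]


def convert_all(clips_dir: Path, wav_dir: Path) -> None:
    """Convert every mp3 in clips_dir, skipping files that already have a wav."""
    wav_dir.mkdir(parents=True, exist_ok=True)
    for mp3_path in sorted(clips_dir.glob("*.mp3")):
        wav_path = wav_path_for(mp3_path, wav_dir)
        if wav_path.exists():
            continue
        subprocess.run(ffmpeg_command(mp3_path, wav_path), check=True)


example_cmd = ffmpeg_command(Path("clips/a.mp3"), wav_path_for(Path("clips/a.mp3"), Path("clips_wav")))
```

Since wav stores uncompressed audio, the output is much larger than the mp3 input: here `clips_wav` (390G) is a bit over four times the size of `clips` (92G).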

Day 2

I started porting the training code to the Zora library.

Day 3

Today was more porting of training code.

Things for next cycle

I'm going to start working on my presentation for Zora, continue working on the library, keep maintaining Heap, and find time for some other fun programming :)