GSoC Red Hen Lab — Week 2
Pipeline for Multimodal Television Show Segmentation
Hi, welcome back! If you haven't already read the introductory article from the community bonding period, I strongly recommend doing so before continuing.
It's currently week two of Google Summer of Code and so far I've been having a blast setting up my project. During the first week I familiarized myself with the CWRU HPC clusters and developed the baseline code for music classification.
To accomplish this task I set out to use the inaSpeechSegmenter library. At an early stage, Frankie helped point out that the library was not well suited to run on the Rosenthal collection as-is. One of the major targets for this week is to adapt the library so it can asynchronously process batches of 8-hour-long mp4 files.
Goals for the week
- Modify the Slurm job to copy the source code and a batch of mp4 files to the temporary directory on the allocated GPU node.
- Modify the featGenerator method of inaSpeechSegmenter to process multiple files, split them into 45-minute segments, and classify each segment as music/noise/speech.
- If 1 & 2 are accomplished, proceed to save the outputs back to the gallina home directory.
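To make the 45-minute chunking concrete, here is a minimal sketch of how an 8-hour file breaks into fixed-length segments. This is not the actual inaSpeechSegmenter code; `segment_bounds` is a hypothetical helper working in seconds:

```python
SEGMENT_LEN = 45 * 60  # 45 minutes, in seconds

def segment_bounds(duration_s, segment_len=SEGMENT_LEN):
    """Return (start, end) pairs covering [0, duration_s) in fixed chunks."""
    bounds = []
    start = 0
    while start < duration_s:
        end = min(start + segment_len, duration_s)
        bounds.append((start, end))
        start = end
    return bounds

# An 8-hour recording yields ten full 45-minute chunks plus a 30-minute remainder.
print(segment_bounds(8 * 3600))
```

Each pair can then be handed to the feature-extraction step independently, which is what makes batching and parallelism possible later.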
Difficulties along the way
One of the most troublesome problems this week was setting up the Singularity image on the HPC login node and then using the Singularity container on the GPU nodes. Since the home directory on the HPC is quite limited, I set up a symbolic link pointing to my gallina home directory. However, this caused a load of issues when trying to use Singularity within a GPU node: the GPU nodes don't have access to the gallina directory, so the .singularity folder was nowhere to be found. Ultimately, this prevented me from using the GPU nodes.
Conclusion
I was able to successfully write a bash script that dynamically loads the mp4 files based on the batch index value.
n=$SLURM_ARRAY_TASK_ID
i=0
allFiles=()
while IFS= read -r line; do
  if [ "$i" -eq "$n" ]; then
    echo "i equal to n"
    # Word-split the matching line into its individual file paths
    allFiles+=($line)
    for f in "${allFiles[@]}"; do
      echo "$f"
      rsync -az "hpc3:${f}" /tmp/$USER/mtvss/data/tmp/video_files
    done
  fi
  i=$((i+1))
done < /tmp/$USER/mtvss/data/tmp/batch_cat1.txt
As shown above, the script reads batch_cat1.txt line by line. When a line's index matches the batch index (the array task id), it loops through all the file paths on that line and uses rsync to copy them to /tmp.
In the end, I was able to make significant changes to the media2feats module to process files in 45-minute segments. However, the way I was starting and joining the threads was still suboptimal. Frankie suggested using a queue data structure to store the processed 45-minute segments; the queue can then be used to yield the features (mfcc, loge) for segmentation. This multithreaded approach would substantially increase the rate at which the files are processed.
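The queue-based design can be sketched roughly as follows. This is a simplified producer/consumer illustration, not the actual media2feats code; `extract` stands in for the real feature-extraction step that would return (mfcc, loge) for one 45-minute segment:

```python
import queue
import threading

def feature_producer(segments, extract, out_q):
    # Featurize each 45-minute segment and hand it to the consumer.
    for seg in segments:
        out_q.put(extract(seg))
    out_q.put(None)  # sentinel marks the end of the stream

def stream_features(segments, extract, maxsize=2):
    # A bounded queue keeps at most `maxsize` featurized segments in memory,
    # so decoding the next segment overlaps with segmenting the current one.
    out_q = queue.Queue(maxsize=maxsize)
    t = threading.Thread(target=feature_producer, args=(segments, extract, out_q))
    t.start()
    while True:
        feats = out_q.get()
        if feats is None:
            break
        yield feats
    t.join()

# Toy usage: `extract` here just doubles a number; in media2feats it would
# return the feature arrays for one chunk.
print(list(stream_features([1, 2, 3], lambda s: s * 2)))  # [2, 4, 6]
```

The bounded queue is the key design choice: it decouples the decode/featurize step from the segmentation step without letting an 8-hour file's worth of features pile up in memory at once.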