GSoC Red Hen Lab — Week 6

Harshith Mohan Kumar
4 min read · Jul 25, 2022
GSoC 2022 Red Hen Lab

Pipeline for Multimodal Television Show Segmentation

Hi, welcome back! If you haven’t already read the introductory article from the community bonding period, I strongly recommend doing so before continuing.

The focus this week is to build the next feature of the multimodal pipeline. Previously, I finished implementing a stable version of the pipeline that extracts audio features and the timestamps of segments containing music. Using these segmentation labels, I plan to extract and annotate keyframe images in order to train an image classifier that filters out false positives.

Next week marks the midway point of the Google Summer of Code project. In my original proposal, I aimed to finish the first stage of the pipeline, which includes music segmentation and image classification, by this midway point. So this week, I plan to wrap up the image classification work and integrate the classifier as a filter in the existing music pipeline.

Goals for the week

  1. Run the stage one pipeline on ~30% of the category one files.
  2. Continue developing annotation tools to label images.
  3. Extract a variety of keyframe images from various files for image classification training.

Work Done

The first thing I did was submit the working music segmentation pipeline code to the Slurm controller as an array job. I submitted around 5–6 jobs that could run in parallel. Each job could process up to 500 mp4 files and took approximately 9 hours to finish. To avoid re-running the pipeline on files that had already been handled, I integrated checks to determine whether an mp4 file had already been fully or partially processed. If a file has only been partially processed, the pipeline picks up at the last 45-minute segment and continues from there. In the end, I had segmented around 3,000+ mp4 files, each around 8 hours long, which amounts to about 2.7 years of video footage. As impressive as that may sound, 3,000 files only covers about 30% of the category 1 files.
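To illustrate the resume logic, here is a minimal sketch of the kind of check involved. The per-segment output layout, file names, and helper name are my own assumptions for illustration, not the actual pipeline code.

```python
# Minimal sketch of the resume check, assuming one output CSV is written
# per 45-minute segment under results/<video_stem>/segment_<idx>.csv.
from pathlib import Path
from typing import Optional

SEGMENT_MINUTES = 45

def next_unprocessed_segment(video_path: Path, results_dir: Path,
                             duration_min: float) -> Optional[int]:
    """Return the index of the first segment with no output file,
    or None if every segment of this video has already been processed."""
    total_segments = int(duration_min // SEGMENT_MINUTES) + 1
    out_dir = results_dir / video_path.stem
    for idx in range(total_segments):
        if not (out_dir / f"segment_{idx}.csv").exists():
            return idx   # partially processed: resume from this segment
    return None          # fully processed: skip this file entirely
```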

Moving on to the image classification task, I first extracted and stored keyframes from the music intervals to be labeled later. To make the labeling process quicker for myself, I extracted three keyframes from each music interval. I then built an annotation tool with Python in a Jupyter notebook to display these batches of three keyframes at a time. Each batch of three keyframes is labeled with a value from the set {1, 2, 3}, where 1 indicates the presence of a title sequence, 2 indicates a commercial, and 3 indicates that I’m not sure what the images contain.
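As a rough sketch of how three evenly spaced keyframes could be pulled from each music interval with OpenCV (the function name, output naming, and sampling positions are assumptions of mine, not the pipeline’s exact code):

```python
# Sketch: grab three evenly spaced frames from a (start_s, end_s) music interval.
# Interval boundaries are assumed to come from the music-segmentation output.
import cv2

def extract_keyframes(video_path: str, start_s: float, end_s: float,
                      out_prefix: str, n: int = 3) -> None:
    cap = cv2.VideoCapture(video_path)
    for i in range(n):
        # Sample at 1/4, 2/4, 3/4 of the interval to capture transitions.
        t_ms = (start_s + (i + 1) * (end_s - start_s) / (n + 1)) * 1000
        cap.set(cv2.CAP_PROP_POS_MSEC, t_ms)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{out_prefix}_kf{i}.jpg", frame)
    cap.release()
```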

The screenshot below shows an example of a title-sequence shot from the Dr. Phil show. In the following screenshot, a commercial for Tylenol is displayed and annotated. This example shows the value of extracting three keyframes as opposed to one, since it clearly captures the transitions throughout the commercial. Finally, the last screenshot is quite noisy and hard to make out. Since it would be difficult to determine whether such images belong to a commercial or a title sequence, I label them as unknown. Doing so also helps avoid introducing noisy data into the image classifier’s training set.

Annotation Tool: Example of Title Sequence
Annotation Tool: Example of Commercial
Annotation Tool: Example of Unknown

Once the image dataset was curated, the next step was to build and train an image classifier. I decided to use the ResNet50V2 pretrained model provided by TensorFlow Keras. In my original proposal I suggested using MobileNetV2 due to its efficiency; however, since I have access to wonderful Tesla V100 GPUs, compute is hardly an issue. I took the pre-trained ResNet50V2 and added an average pooling layer, followed by a dropout layer, and finally a dense layer. These topmost layers are used to fine-tune the model on the images I’ve collected.
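A minimal sketch of that architecture is shown below; the input size, dropout rate, and three-class output head are my own assumptions for illustration.

```python
# Sketch of the fine-tuning head on a frozen ResNet50V2 backbone.
# Input size, dropout rate, and the 3-class output are assumptions.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50V2

base = ResNet50V2(weights="imagenet", include_top=False,
                  input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained backbone

x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(3, activation="softmax")(x)  # title / commercial / unknown

model = Model(inputs=base.input, outputs=outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```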

Model Summary

Surprisingly, the model performed incredibly well right off the bat with minimal hyper-parameter tuning. The training accuracy, training loss, and validation loss for each epoch are shown in the screenshot below.

Training Output for 20 Epochs
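For completeness, a hedged sketch of the training call, continuing from the `model` defined in the sketch above. The directory layout, image size, and batch size are assumptions; only the 20 epochs match the run shown.

```python
# Sketch of the training call; directory layout and batch size are assumptions.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "keyframes/train", image_size=(224, 224), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "keyframes/val", image_size=(224, 224), batch_size=32)

history = model.fit(train_ds, validation_data=val_ds, epochs=20)
```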

Conclusion

The work performed this week was highly successful and has allowed me to get back on track with the roadmap in my original proposal. The only downside is that I still have to figure out a rule-based filtering method and an accuracy measure for producing the final stage-1 metadata. I believe this will be the primary goal for next week. Additionally, I’ll be working on my midterm evaluation in the following week.
