GSoC Red Hen Lab — Week 9

Harshith Mohan Kumar
4 min read · Aug 18, 2022
GSoC 2022 Red Hen Lab

Pipeline for Multimodal Television Show Segmentation

Hi, welcome back! If you haven’t already read the introductory article based on the community bonding period then I strongly recommend you do so before continuing.

This week I had the opportunity to meet with Professor Tim Groeling and Professor Francis Steen to discuss the results from stage-1 and explore ways to filter the data using simple rules before feeding it to the RNN-DBSCAN clustering pipeline.

Goals for the week

  1. Create a presentation illustrating the progress to date.
  2. Using the code developed from last week, extract and store image features for clustering.
  3. Design an outline for the RNN-DBSCAN implementation.

Work Done

Previously, I had been working on a presentation to visualize my midway progress. This week I picked it back up and added the latest work. In this article, I'll showcase the most important slides; if you would like to view the entire presentation, click here.

Let's start with a quick exploratory data analysis of the category-1 data. As the pie chart in the screenshot below shows, the mp4 files are distributed roughly evenly across the years 1995 to 2006. There are 10,399 mp4 files in total, and on average each one is around 6–8 hours long.
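For anyone curious, this kind of EDA only takes a few lines. The directory layout and filename pattern below are assumptions for illustration, not the actual Red Hen storage layout:

```python
# Minimal EDA sketch: count mp4 files per year.
# The root path and the year-in-filename pattern are assumptions.
from pathlib import Path
from collections import Counter
import re

video_root = Path("/path/to/category1")        # hypothetical root directory
year_pattern = re.compile(r"(19|20)\d{2}")      # pull a 4-digit year from the filename

counts = Counter()
for mp4 in video_root.rglob("*.mp4"):
    match = year_pattern.search(mp4.name)
    if match:
        counts[match.group(0)] += 1

total = sum(counts.values())
print(f"total mp4 files: {total}")
for year in sorted(counts):
    print(f"{year}: {counts[year]} files ({counts[year] / total:.1%})")
```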

EDA Results

Now, at this point in time, I've processed about 4,500 mp4 files using the stage-1 pipeline. The audio features and the speech/music/noise labels for these files have been generated and stored, amounting to approximately 1.5 terabytes of data.

To briefly recap the performance of the fine-tuned ResNet50V2 model, I’ve added the plots showing the accuracy and loss in relation to epochs.

Accuracy vs Epoch for fine-tuned ResNet50V2
Loss vs Epoch for fine-tuned ResNet50V2

Finally, let's move on to the analysis of the outputs obtained from stage-1. I created a scatter plot with prediction confidence on the Y-axis and start time on the X-axis. Before creating this plot, I filtered out all predictions with confidence below 95% so the evaluation focuses on the most reliable detections. This plot is shown below.
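A plot like this can be put together roughly as follows; the file name and column names are assumptions about the stage-1 output format, not the exact schema:

```python
# Sketch of the confidence-vs-start-time scatter plot.
# "stage1_outputs.csv" and the column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("stage1_outputs.csv")
df = df[df["confidence"] >= 0.95]               # keep only high-confidence predictions

colors = {"commercial": "tab:green", "title_sequence": "tab:blue"}
for label, group in df.groupby("label"):
    plt.scatter(group["start_time"], group["confidence"],
                s=8, label=label, color=colors.get(label))

plt.xlabel("start time (s)")
plt.ylabel("prediction confidence")
plt.legend()
plt.show()
```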

Scatter plot — confidence vs start times

From a quick visual inspection of the plot, we can see that the commercial and title-sequence detections are a bit sporadic. In a perfect situation, we would see alternating green and blue dots with high confidence values. The goal now is to reduce the “noise” visible in this scatter plot.

There are various ways to go about this. One is to keep only high-confidence outputs, but even high-confidence predictions can be wrong. We could combat this by increasing the size of the training dataset, and Professor Tim indicated that his team at UCLA could help me with that task. Professor Francis suggested looking at a method like Apriori to generate rule-based approaches for further processing the extracted start and stop times. He also suggested using closed-captioning text to raise confidence levels, a text-based modality that hasn't been explored yet.
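To make the rule-based idea concrete, here is a toy sketch of one such rule: merging back-to-back detections of the same class into a single segment. The (label, start, end, confidence) tuple format is an assumption, not the actual stage-1 schema:

```python
# Toy rule-based post-processing sketch: merge consecutive detections of the
# same class that sit close together in time into a single segment.
def merge_consecutive(segments, max_gap=5.0):
    """segments: list of (label, start, end, confidence) tuples (hypothetical format)."""
    merged = []
    for label, start, end, conf in sorted(segments, key=lambda s: s[1]):
        if merged and merged[-1][0] == label and start - merged[-1][2] <= max_gap:
            # Extend the previous segment instead of starting a new one.
            prev_label, prev_start, prev_end, prev_conf = merged[-1]
            merged[-1] = (label, prev_start, max(prev_end, end), max(prev_conf, conf))
        else:
            merged.append((label, start, end, conf))
    return merged
```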

Moving on from our meeting, I worked on extracting and storing image features by submitting the modified code as array jobs. While those jobs ran in the background, I started reading the RNN-DBSCAN paper in greater detail. Although the full text is not freely available online, I reached out to the author, Krzysztof Cios, and he kindly sent me the full-text version of the paper. I also started exploring Frankie's implementation (sklearn-ann) on GitHub.
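Below is a minimal sketch of what the per-frame feature extraction looks like. The actual pipeline uses the fine-tuned ResNet50V2 model from stage-1; the stock ImageNet weights, frame size, and file paths here are placeholders for illustration:

```python
# Sketch: pooled ResNet50V2 features for a list of frame images.
# Weights, input size, and paths are assumptions; the real pipeline
# loads the fine-tuned model instead of the stock ImageNet one.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.resnet_v2 import ResNet50V2, preprocess_input

feature_extractor = ResNet50V2(include_top=False, weights="imagenet", pooling="avg")

def extract_features(frame_paths, batch_size=32):
    """Return a (num_frames, 2048) array of pooled ResNet50V2 features."""
    features = []
    for i in range(0, len(frame_paths), batch_size):
        batch = [tf.keras.utils.load_img(p, target_size=(224, 224))
                 for p in frame_paths[i:i + batch_size]]
        x = preprocess_input(np.array([tf.keras.utils.img_to_array(img) for img in batch]))
        features.append(feature_extractor.predict(x, verbose=0))
    return np.concatenate(features)

# e.g. np.save("features.npy", extract_features(list_of_frame_paths))
```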

Conclusion

By the end of this week, I had a pretty good picture of the end product I'm trying to build. From the input of the UCLA professors, I understood the requirements and the ways in which the UCLA team could use my pipeline as a filtering tool during the manual annotation process. The pipeline I'm currently building cuts annotation costs by a significant factor by roughly identifying “hot spots” to look at.

Next week, I'll be implementing and submitting the clustering jobs to produce clusters of the existing image features. This stage-2 pipeline is one of many ways to further filter the data and add another high-level layer of segmentation; a rough sketch of how it might be wired up follows below. When combined with the existing stage-1, it will allow manual annotators to quickly go through these clusters and label them on a per-show basis.
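Here is that sketch of the planned stage-2 clustering using Frankie's sklearn-ann package. The RnnDBSCAN import path and parameters are based on my reading of the repository and should be treated as assumptions until I've actually run them:

```python
# Rough sketch of stage-2 clustering over stored image features.
# The sklearn_ann import path and RnnDBSCAN parameters are assumptions
# from the sklearn-ann README, not verified code.
import numpy as np
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline
from sklearn_ann.cluster.rnn_dbscan import RnnDBSCAN

features = np.load("features.npy")              # stored ResNet50V2 image features

clusterer = make_pipeline(
    KNeighborsTransformer(n_neighbors=10, mode="distance"),
    RnnDBSCAN(n_neighbors=10, input_guarantee="kneighbors"),
)
labels = clusterer.fit_predict(features)        # -1 marks noise, as in DBSCAN
print(np.unique(labels, return_counts=True))
```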
