GSoC Red Hen Lab — Week 7
Pipeline for Multimodal Television Show Segmentation
Hi, welcome back! If you haven’t already read the introductory article from the community bonding period, I strongly recommend doing so before continuing.
With the completion of week 7, I’m happy to say that I’m more than halfway through my Google Summer of Code project! At the start of this week, I was asked to fill out the contributor midterm evaluation form, which takes only a few minutes to complete. This form gauges the progress of your project and collects feedback for Google and the mentors. Similarly, mentors are asked to fill out the mentor midterm evaluation form, and their feedback is shared with the contributors at the end of the week.
During this week, I wrapped up the loose ends of pipeline stage-1. Last week I successfully trained the image classifier to act as a filtering mechanism; however, it was not yet integrated into the existing pipeline. The main task for this week, therefore, was to integrate the fine-tuned image classifier by having it produce and store the classification labels in another CSV.
Goals for the week
- Extract additional images from the commercial CSV provided by Professor Tim Groeling.
- Store and retrieve the trained weights for the fine-tuned ResNet50V2.
- Adapt the code to include the image classification/filtering stage.
Work Done
The first thing I did was increase the size of the training dataset by including images from the commercial CSV provided by Professor Tim Groeling. This CSV contains the start and stop times of commercials within several mp4 files from 1989, produced by the UCLA team through human annotation. Using my labeling code, I fed in these timestamps and extracted over 1,000 additional images, bringing the final size of my image dataset to approximately 4,000 images.
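The extraction code itself isn’t shown in this post, but a minimal sketch of the idea using OpenCV might look like the following (the column names `filename`, `start`, and `stop`, and the one-frame-per-second sampling rate, are my own assumptions for illustration):

```python
import csv
import cv2  # OpenCV, used here for seeking and grabbing frames


def extract_frames(csv_path, video_dir, out_dir, label="commercial"):
    """Extract roughly one frame per second from each annotated segment."""
    with open(csv_path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            cap = cv2.VideoCapture(f"{video_dir}/{row['filename']}")
            start, stop = float(row["start"]), float(row["stop"])
            t = start
            while t < stop:
                cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)  # seek to timestamp
                ok, frame = cap.read()
                if ok:
                    cv2.imwrite(f"{out_dir}/{label}_{i}_{int(t)}.jpg", frame)
                t += 1.0  # advance one second per sampled frame
            cap.release()
```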
While training the model, I noticed a few areas in which I could improve the topmost layers. First and foremost, I added a softmax activation function to retrieve the relative “confidence” of each class prediction. I also added a few more checks to ensure that the data transformation during the prediction step runs smoothly.
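The exact architecture isn’t shown in the post, but a minimal Keras sketch of a fine-tuning head with a softmax output might look like this (the layer sizes, dropout rate, and two-class setup are my assumptions):

```python
import tensorflow as tf

# Frozen ResNet50V2 backbone with a small trainable classification head.
base = tf.keras.applications.ResNet50V2(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False  # fine-tune only the head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    # Softmax turns logits into per-class "confidence" scores.
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```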
These changes allowed me to increase the overall accuracy and performance of the image classifier. Next, I moved on to building the prediction code. I started from the top by creating a route for my pipeline to extract timestamps from the music segmentation CSV files and feed them to the image classification stage. A downside of doing this is that I now have to keep the mp4 file rsynced until the entire stage-1 pipeline has finished executing on it.
After extracting the timestamps, I fed them to a script which extracts 5 key frames within each music segment. I extract 5 frames so I can average the prediction confidences and get a better overall estimate. I then save all this metadata into a CSV with the columns [label (title sequence or commercial), start time, end time, prediction confidence]. A sketch of this step follows, and an example output for a specific mp4 file is shown below.
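Roughly, the averaging and CSV-writing step looks like this (the class names, the helper names, and the assumption that frames arrive as 224×224 RGB arrays are mine):

```python
import csv
import numpy as np
import tensorflow as tf


def classify_segment(model, frames, classes=("title_seq", "commercial")):
    """Average softmax predictions over a segment's key frames."""
    batch = tf.keras.applications.resnet_v2.preprocess_input(
        np.stack(frames).astype("float32"))
    probs = model.predict(batch, verbose=0).mean(axis=0)  # mean over 5 frames
    return classes[int(probs.argmax())], float(probs.max())


def write_labels(model, segments, out_csv):
    """segments: iterable of (start, end, frames) per music segment."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "start", "end", "confidence"])
        for start, end, frames in segments:
            label, conf = classify_segment(model, frames)
            writer.writerow([label, start, end, round(conf, 4)])
```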
By observing the output, we can see a general pattern in which title sequences are sandwiched between multiple commercials. In theory, this is the ideal pattern, and it reassures us that the audio segmentation and image classification pipeline is doing a relatively good job. However, there are instances of multiple title sequences, such as indices 43–45. The first two are separated by a very short gap in time, which indicates that they are most likely part of the same title sequence. The other (index 45) occurs approximately 14 minutes after the previous one. Such instances introduce a lot of uncertainty into the pipeline.
Therefore, I need to explore a few rule-based approaches to weed out noisy segmentation outputs prior to clustering. One such rule is sketched below.
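For example, a simple merging rule could collapse nearby title-sequence detections like indices 43–44 into one (a sketch only; the 60-second threshold is a placeholder, not a tuned value):

```python
def merge_close_title_sequences(rows, max_gap=60.0):
    """Merge consecutive title-sequence rows separated by < max_gap seconds.

    rows: dicts with keys 'label', 'start', 'end', 'confidence',
    sorted by start time.
    """
    merged = []
    for row in rows:
        prev = merged[-1] if merged else None
        if (prev and prev["label"] == "title_seq" == row["label"]
                and row["start"] - prev["end"] < max_gap):
            # Extend the previous detection; keep the higher confidence.
            prev["end"] = row["end"]
            prev["confidence"] = max(prev["confidence"], row["confidence"])
        else:
            merged.append(dict(row))
    return merged
```

A rule like this would merge indices 43–44 while leaving a detection 14 minutes away (like index 45) untouched, flagging it for closer inspection instead.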
Conclusion
Now that the first stage of the pipeline is implemented and running successfully, the next step is to proceed to the clustering stage. At the same time, I must keep collecting the image-filtered labels. I intend to publish another release on my GitHub to isolate and freeze the working version of pipeline stage-1 before moving on to the next stage.
I must also ensure that the image filtering stage works well on all edge cases. During my call with Frankie, we discussed extracting and storing image features from the ResNet50V2 for clustering. Although my original proposal outlined a multimodal clustering algorithm, we think it’s best to focus on one modality first; by analyzing the results, we can then decide whether or not to add the music features.
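We haven’t settled on the exact extraction code yet, but one plausible sketch is to reuse the backbone’s global-average-pooled output as the feature vector, averaged per segment (the pooling choice and the per-segment averaging are my assumptions):

```python
import numpy as np
import tensorflow as tf

# The pooled ResNet50V2 backbone yields a 2048-d vector per frame.
extractor = tf.keras.applications.ResNet50V2(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(224, 224, 3))


def segment_features(frames):
    """Return one 2048-d feature vector per segment (mean over key frames)."""
    batch = tf.keras.applications.resnet_v2.preprocess_input(
        np.stack(frames).astype("float32"))
    feats = extractor.predict(batch, verbose=0)
    return feats.mean(axis=0)  # average the per-frame embeddings
```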
We also discussed alternative ways of averaging these predictions and decided to look into them further next week.