GSoC Red Hen Lab — Week 10
Pipeline for Multimodal Television Show Segmentation
Hi, welcome back! If you haven’t already read the introductory article from the community bonding period, I strongly recommend you do so before continuing.
The main goal for this week is to implement the clustering algorithm and analyze its outputs. The idea is to get a working implementation as quickly as possible, leaving time to tune and optimize the algorithm down the line.
Throughout the week, while working on the RNN-DBSCAN algorithm, I also started to think about what my final product would be. In my original proposal, the output of the second stage consists of metadata for each show: the show number, start and stop times, show name, network name, and prediction accuracy.
As the GSoC timeline approaches its end, I’m carefully assessing what I can get done and produce as my final output.
Goals for the week
- Implement the RNN-DBSCAN algorithm for clustering
- Tune the RNN-DBSCAN N Neighbors parameter
- Analyze performance of RNN-DBSCAN
Work Done
First things first, I started the week off strong by implementing RNN-DBSCAN using the sklearn-ann library built by Frankie. Around the start of the week, I had about 1.4 million image feature vectors being fed as input to the clustering algorithm. The screenshot below shows the shape of the input fed to RNN-DBSCAN.
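For readers unfamiliar with the algorithm, the core idea of RNN-DBSCAN is that a point counts as a core point when enough *other* points list it among their k nearest neighbors (its reverse-nearest-neighbor count), and clusters are grown over the kNN graph of core points. The sketch below is my own minimal illustration of that idea using plain scikit-learn and scipy; it is not the sklearn-ann implementation, and the function name is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse.csgraph import connected_components

def rnn_dbscan_sketch(X, n_neighbors=10):
    """Toy RNN-DBSCAN: a point is core if its reverse-kNN count is at
    least n_neighbors; clusters are connected components of the
    symmetrized kNN graph restricted to core points; the rest is noise."""
    knn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    graph = knn.kneighbors_graph(X, mode="connectivity")   # sparse kNN adjacency
    rev_counts = np.asarray(graph.sum(axis=0)).ravel()     # reverse-kNN counts
    core = rev_counts >= n_neighbors
    sym = graph.maximum(graph.T).tocsr()                   # symmetrize the graph
    sym = sym[core][:, core]                               # keep core points only
    _, comp_labels = connected_components(sym, directed=False)
    labels = np.full(X.shape[0], -1)                       # -1 = noise
    labels[core] = comp_labels
    return labels

# Two well-separated synthetic blobs, standing in for real image features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])
labels = rnn_dbscan_sketch(X, n_neighbors=10)
```

The appeal over plain DBSCAN is that the single `n_neighbors` parameter replaces the density radius `eps`, which is hard to pick for high-dimensional image features.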
Initially, I had a few issues loading this large amount of data into memory. To work around the problem, I loaded the array, saved it as a .npy file, and then reloaded the new file in mmap mode. Memory-mapped files are very convenient because they allow access to small segments of a large file on disk without reading the entire file into memory. The speed-up in loading the input data was substantial, and the change also eliminated the Bus Error I was getting while executing the clustering algorithm.
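The save-then-memmap trick looks roughly like this (the file name and array shape here are illustrative, not the real 1.4-million-feature array):

```python
import numpy as np

# One-time step: save the feature array to disk as a .npy file
features = np.random.rand(10_000, 512).astype(np.float32)  # stand-in for real features
np.save("features.npy", features)

# Later: memory-map the file so only the slices you touch are read from disk
mm = np.load("features.npy", mmap_mode="r")
print(mm.shape)       # shape is available without loading the data itself
batch = mm[:1024]     # only this slice is actually pulled into memory
```

With `mmap_mode="r"` numpy returns a read-only `np.memmap` that the OS pages in on demand, which is why it avoids the out-of-memory crashes a plain load can trigger.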
With the mmap implementation in place, the clustering code ran smoothly. Its runtime, however, is substantial: over 12 hours on 1.4 million feature vectors. For that reason, I decided to tune the N-neighbors parameter on a smaller subset. To ensure that the chosen subset reflects the population, I took all the files from the same day across one year. The output of the clustering algorithm is shown below.
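Tuning on a subset can be as simple as sweeping candidate parameter values and scoring each labelling, for example with the silhouette coefficient. In this sketch, scikit-learn's DBSCAN stands in for RNN-DBSCAN purely to keep the example self-contained (its `min_samples` is loosely analogous to the N-neighbors parameter being tuned), and the synthetic subset and `eps=1.0` are placeholder assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Stand-in for the day-sampled subset of image features
rng = np.random.default_rng(42)
subset = np.vstack([rng.normal(0, 0.3, (100, 2)),
                    rng.normal(4, 0.3, (100, 2))])

best = None
for k in (5, 10, 20):  # candidate neighbourhood sizes
    labels = DBSCAN(eps=1.0, min_samples=k).fit_predict(subset)
    mask = labels != -1                    # silhouette is undefined for noise points
    if mask.sum() < 2 or len(set(labels[mask])) < 2:
        continue                           # need at least two clusters to score
    score = silhouette_score(subset[mask], labels[mask])
    if best is None or score > best[1]:
        best = (k, score)

print(best)  # (best parameter value, its silhouette score)
```

Note that noise points (label -1) are excluded before scoring; including them would drag the silhouette down and muddy the comparison between parameter values.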
Conclusion
This week I successfully implemented the RNN-DBSCAN algorithm and got it working on the features extracted in stage 1 of the pipeline. This marks the end of week 10, which means there are only two more weeks left in Google Summer of Code! Although I’m fairly on track with my proposal, I’m a bit sad that the program is coming to an end soon.
Next week I’ll be exploring ways to filter the input and improve clustering performance. As the screenshot shows, the silhouette coefficient is 0.106, which is quite low. I believe this is because the input data contains many insignificant points.
In the following weeks, I’ll also have to be very clear about what my end output is, and I’ll have to perform sufficient performance analysis on the pipeline I’ve built to ensure it is a tool that is useful for the UCLA team.