GSoC Red Hen Lab — Week 11

Harshith Mohan Kumar
4 min read · Sep 10, 2022
GSoC 2022 Red Hen Lab

Pipeline for Multimodal Television Show Segmentation

Hi, welcome back! If you haven’t already read the introductory article based on the community bonding period then I strongly recommend you do so before continuing.

In the previous week, I presented the outputs of RNN-DBSCAN on a small subset of the features extracted in stage 1. To evaluate the clustering, I reported the number of clusters, the number of noise points, and the Silhouette Coefficient. The Silhouette Coefficient was quite low (~0.1); a value close to zero indicates that the clusters are indistinct (the separation between clusters is not significant).
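The evaluation described above can be sketched with scikit-learn. This is a minimal sketch, assuming the stage-1 features are a NumPy array and that RNN-DBSCAN, like DBSCAN, labels noise points as -1; `summarize_clustering` is a hypothetical helper name, not part of the pipeline code.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def summarize_clustering(features: np.ndarray, labels: np.ndarray):
    """Report cluster count, noise count, and Silhouette Coefficient.

    DBSCAN-style labelings mark noise points with -1, so noise is
    excluded from the cluster count and from the silhouette computation.
    """
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    mask = labels != -1
    # silhouette_score needs at least 2 clusters among non-noise points
    score = None
    if len(set(labels[mask].tolist())) >= 2:
        score = silhouette_score(features[mask], labels[mask])
    return n_clusters, n_noise, score
```

A score near 1 means tight, well-separated clusters; values near 0, as observed here, mean neighboring clusters overlap heavily.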

I brought up this issue during my weekly meeting with my mentor Frankie, who noted that there isn't a single definitive metric for evaluating unsupervised algorithms. He suggested counting the number of points in each cluster and plotting the resulting distribution as a histogram. Beyond evaluation metrics, we also discussed the computational cost of running the clustering algorithms on millions of image features at a time.

Based on this conversation, this week I focused on performance analysis and code optimization.

Goals for the week

  1. Explore Silhouette Coefficient and cluster point distribution for [2,3,4,5,10] N neighbors.
  2. Create a subset of the existing stage-1 features based on the Rosenthal recording settings.

Work Done

I started the week by running RNN-DBSCAN on a subset created using the Rosenthal VCR settings data entered by Heidy. Professor Tim Groeling provided this settings file, suggesting that the recording settings between certain dates listed in the Excel sheet were quite similar and well structured. I wrote a small Python script to extract the meaningful start and stop dates, then used them in my pipeline code to create the subsets before running the RNN-DBSCAN algorithm.
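The date-based subsetting step can be sketched as follows. This is a simplified illustration, assuming the start/stop dates have already been parsed out of the Excel sheet and that each feature file can be mapped to a recording date; the helper names are hypothetical, not the actual pipeline code.

```python
from datetime import date

def in_recording_ranges(d: date, ranges: list) -> bool:
    """True if date d falls inside any [start, stop] range
    extracted from the VCR settings sheet (inclusive bounds)."""
    return any(start <= d <= stop for start, stop in ranges)

def subset_by_date(files: dict, ranges: list) -> list:
    """Keep only feature files whose recording date lies in a range.

    `files` maps a file identifier to its recording date; deriving
    that date from real Rosenthal filenames is assumed here.
    """
    return [name for name, d in files.items()
            if in_recording_ranges(d, ranges)]
```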

Although I hypothesized that the subsets might improve the Silhouette Coefficient, the improvement was modest: the new score was approximately 0.3. As noted earlier, exploring alternative measures of clustering performance should paint a better picture of what exactly is going on. For that reason, I modified the code to store and visualize the number of points in each cluster.

Right off the bat, it was clear that the noise points outnumbered every individual cluster. Additionally, for n-neighbors values of [2, 3, 4], a very high number of clusters formed (~10,000+). I created a histogram of the cluster sizes using 20 bins; the output is shown below.
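The histogram of cluster sizes can be produced directly from the label array. A minimal sketch, again assuming DBSCAN-style labels with -1 for noise; the function name is illustrative only.

```python
import numpy as np

def cluster_size_histogram(labels: np.ndarray, bins: int = 20):
    """Histogram the number of points per cluster.

    Noise (-1) is counted as its own group, mirroring the analysis
    in the post where the noise bin dominates the plot.
    """
    _, counts = np.unique(labels, return_counts=True)
    hist, edges = np.histogram(counts, bins=bins)
    return counts, hist, edges
```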

The numerical output of the histogram of points per cluster, with the number of bins set to 20.

The visual plot of the histogram was dominated by the noise points, but the other bins had an almost equal distribution. In the screenshot shown above, bins [1–13] hold around 1,430 points in total. The outer bins [0 and 14–19] appear to be outliers, since they contain either far too many points or none at all.

Continuing with the analysis, I also calculated the mean, median, mode, and standard deviation of the cluster sizes for each n-neighbors setting. The values are shown below.

n=2 · clusters: 21001 · mean: 6.92 · median: 4.0 · mode: 3 (count 8647) · std: 202.87

n=3 · clusters: 13479 · mean: 10.78 · median: 5.0 · mode: 5 (count 4738) · std: 307.90

n=4 · clusters: 8720 · mean: 16.66 · median: 5.0 · mode: 5 (count 4500) · std: 606.35

n=5 · clusters: 3939 · mean: 36.89 · median: 9.0 · mode: 6 (count 755) · std: 1333.12
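These per-setting statistics can be reproduced from the labels alone. A sketch using only NumPy and the standard library (the original output used `scipy.stats.mode`, which returns the `ModeResult` objects shown; a `Counter` gives the same mode without that dependency):

```python
import numpy as np
from collections import Counter

def cluster_stats(labels: np.ndarray) -> dict:
    """Mean, median, mode, and standard deviation of cluster sizes
    (points per cluster), as reported for each n-neighbors setting."""
    _, counts = np.unique(labels, return_counts=True)
    mode_value, mode_count = Counter(counts.tolist()).most_common(1)[0]
    return {
        "n_clusters": len(counts),
        "mean": float(np.mean(counts)),
        "median": float(np.median(counts)),
        "mode": (mode_value, mode_count),
        "std": float(np.std(counts)),
    }
```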

Observing these statistics, it's evident that the median is consistently around five for most n-neighbors settings. This makes sense: in stage one I randomly extract five image features for each range of timestamps. I originally did this so that prediction confidence would be higher and more accurate, averaging over five instances rather than relying on one random instance within the range. During clustering, however, each group of five instances may be getting isolated into its own cluster.
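One possible remedy, hinted at above, is to collapse the five sampled features per timestamp range into a single averaged vector before clustering, so RNN-DBSCAN sees one point per range instead of five near-duplicates. A sketch of that idea, not the pipeline's current code:

```python
import numpy as np

def average_per_range(features: np.ndarray,
                      range_ids: np.ndarray) -> np.ndarray:
    """Average the feature vectors that share a timestamp-range id,
    producing one representative vector per range."""
    out = []
    for rid in np.unique(range_ids):
        out.append(features[range_ids == rid].mean(axis=0))
    return np.stack(out)
```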

Another eye-catching metric is the standard deviation. These values are significantly high, most likely due to the large count of noise points and a few heavily populated clusters. Bin one from the histogram holds a total of 22,432 points, far more than any other cluster. This illustrates that a few clusters are very large while many clusters contain only around five points each; the distribution resembles a long-tailed distribution.

Conclusion

This week marks the second-to-last week of the coding phase for Google Summer of Code. The project has more or less come to an end; next week I'll continue analyzing the clustering algorithm and produce a final report for the project.
