GSoC Red Hen Lab — Week 4
Pipeline for Multimodal Television Show Segmentation
Hi, welcome back! If you haven’t already read the introductory article from the community bonding period, I strongly recommend doing so before continuing.
Last week I ran into a bit of a memory hiccup. When running my pipeline on the GPU compute nodes, memory was exhausted after the music segmentation stage had processed approximately 8 mp4 files. Each file took up about 2GB, and I had requested 30GB in total, so there was no way I would be able to copy over 100 files and process them at once.
With some help from my mentor Frankie, we planned out a producer-consumer thread setup. The producer thread would be in charge of copying over the mp4 files, and the consumer thread would execute the music segmentation. Once the consumer finished with a file, it could be removed from memory since it was no longer required.
This was one of my main tasks for week four. This article describes my progress, goals, and difficulties throughout the week.
Goals for the week
- Implement a multithreaded producer-consumer queue to drive the pipeline.
- Execute the pipeline on category one mp4 files and store features/segmentation labels.
Work Done
I implemented the producer-consumer threads by passing a Python Queue between the threads to communicate which mp4 file is currently being processed and which can be removed. We decided that there should be at most 8 files in the queue at a time in order to keep memory usage stable. My high-level pipeline code ended up looking like this:
from queue import Queue
from threading import Thread

# Create a bounded queue to hold loaded files
loaded_files = Queue(maxsize=8)
# Queue to indicate when all files have been processed
finished = Queue()

producer = Thread(target=load_files, ...)
consumer = Thread(target=process_files, ...)
producer.start()
consumer.start()
producer.join()
consumer.join()
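The load_files and process_files targets are elided above. As a minimal sketch of their shape, with copy_file, segment_music, and delete_file as hypothetical stand-ins for the actual rsync copy, segmentation, and cleanup steps, they look something like this:

from queue import Queue

SENTINEL = None  # pushed by the producer to signal that no more files are coming

def load_files(loaded_files: Queue, filenames):
    for name in filenames:
        copy_file(name)         # hypothetical helper: rsync the mp4 over ssh
        loaded_files.put(name)  # blocks once 8 files are already waiting
    loaded_files.put(SENTINEL)

def process_files(loaded_files: Queue, finished: Queue):
    while True:
        name = loaded_files.get()
        if name is SENTINEL:
            break
        segment_music(name)  # hypothetical helper: run the music segmentation stage
        delete_file(name)    # hypothetical helper: free the ~2GB the copy occupies
    finished.put(True)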
For more technical information, check out my code on GitHub.
In addition to the producer-consumer thread setup, I also used memray, a great open-source memory profiler for Python. I used this tool only during debugging, to determine whether there were any other memory leaks within my code.
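For reference, memray can be scoped to a specific section of code via its Tracker context manager; a minimal sketch, where pipeline.bin and run_pipeline are placeholder names:

from memray import Tracker

# Record every allocation made while the block runs into a capture file,
# which can later be rendered with `memray flamegraph pipeline.bin`.
with Tracker("pipeline.bin"):
    run_pipeline()  # hypothetical entry point for the segmentation pipeline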
Shifting the file-copying task from the bash script to the Python script meant that I’d have to invoke rsync with ssh, which led to a few issues along the way: the Singularity container did not have the appropriate packages and had to be recompiled with rsync and openssh-client.
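The rsync call itself boils down to shelling out with subprocess; a rough sketch, with the host and paths as placeholders:

import subprocess

def copy_file(remote_path: str, local_dir: str) -> None:
    # Pull a single mp4 from the remote archive over ssh; check=True raises
    # CalledProcessError if rsync exits with a non-zero status.
    subprocess.run(
        ["rsync", "-az", "-e", "ssh", f"user@remote-host:{remote_path}", local_dir],
        check=True,
    )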
Frankie and I moved our weekly meeting up a day, and in doing so I was able to add a few more features on Friday. During our call, we discussed the current progress and what is to come in the next few weeks leading up to the midway period. We decided to branch the code into two paths: the first would use the existing version of the code to run batch jobs, while the second would build on top of the existing version to implement keyframe extraction. These two paths became my goals for the upcoming week.
Difficulties along the way
As briefly mentioned before, running rsync within a Singularity container through a Python script on the GPU node initially ran into a few issues. The solution was quite rudimentary: I had to bind my /home/$USER directory so that my .ssh configs would be visible within the Singularity container.
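Concretely, the fix amounts to a bind flag on the container invocation, something along these lines (the image and script names are placeholders):

# Mount the host home directory into the container so ~/.ssh is visible
singularity exec --bind /home/$USER:/home/$USER pipeline.sif python3 run_pipeline.py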
Conclusion
With my code up and running, the next task is to scale it up and have it run on thousands of files. While waiting for those results to come in, I’ll build on my existing code in parallel to extract keyframes from the segmented files. With three weeks left before the midway period, I hope to extract keyframes and train a deep learning image classifier, which I’d then be able to use to filter out false positives produced by the noisy music segmentation. The addition of these few elements would wrap up a highly successful halfway progress report!