The idea of this project was to create a real-time translation tool that would translate from British Sign Language (BSL) into English text.
Because it was real-time, the idea was that the tool would use your camera to film the user signing, translate those signs into English as they were made, and write that text on the screen.
It was quite an ambitious project, so fairly quickly the scope was limited to just the BSL alphabet. This gave 26 possible signs, all distinct from one another, which greatly simplified the problem.
The dataset was obtained from YouTube videos and manually labelled: for each video, a CSV file detailed the starting and ending timestamps of each sign.
This data could then be used to train a custom AI model.
Once this was working on a PC, I also did some work to port it to an Android app (just for inference - not for training).
Surprisingly, the application actually worked quite well. It was able to predict signs with ~60% accuracy: obviously not good enough for daily use, but far above random chance, so it was clearly working.
The project was eventually stopped because the small amount of labelled data became severely limiting. Obtaining more data would have been extremely time-consuming, and it wasn't clear how much it would actually help: most YouTube videos of BSL are made to teach others, so the signing is deliberately slowed down compared to real-life BSL users. Extending the application to cover more than just the BSL alphabet would have hit similar problems.
In the sections below I describe roughly how the system worked, though I am writing this long after the project finished, so there may be some inaccuracies.
Processing the data
To do the initial data processing, I started with a number of videos which go through the BSL alphabet. Each video has an accompanying CSV file detailing the start and end timestamp of each sign and the letter it corresponds to.
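Loading those label files is straightforward with the standard library. The column names (`start`, `end`, `letter`) are an assumption here; I can't confirm the exact headers the original files used.

```python
import csv
import io

def load_labels(csv_text):
    """Parse a label CSV into (start, end, letter) tuples.

    The column names (start, end, letter) are an assumption --
    the original files may have used different headers.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row["start"]), float(row["end"]), row["letter"])
            for row in reader]

# A hypothetical two-row label file:
sample = "start,end,letter\n0.0,1.2,A\n1.5,2.8,B\n"
labels = load_labels(sample)
```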
We need to process that data into the format the model takes in. We don't train on the videos directly because that would be very data-intensive, when all we actually care about are the hand positions in each frame. Instead, we can use Google's MediaPipe tool to extract hand coordinates for each frame of the video. There is also some extra logic to work out which hand is the left and which is the right - these need to be in the same order each time.
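The hand-ordering logic can be sketched roughly as below. MediaPipe reports a handedness label per detected hand; this sketch uses plain dicts rather than MediaPipe's real protobuf types, and the fallback to horizontal position when the labels clash is my illustration, not necessarily what the project did.

```python
def order_hands(hands):
    """Return detected hands in a fixed (left, right) order.

    `hands` is a list of dicts like {"handedness": "Left", "landmarks": [...]},
    loosely mirroring what MediaPipe reports per detected hand (the real API
    wraps these in protobuf objects). Falls back to mean x-coordinate when
    the handedness labels don't disambiguate.
    """
    if len(hands) == 2 and {h["handedness"] for h in hands} == {"Left", "Right"}:
        return sorted(hands, key=lambda h: h["handedness"] != "Left")
    # Ambiguous labels: order by horizontal position instead (leftmost first).
    def mean_x(hand):
        return sum(p[0] for p in hand["landmarks"]) / len(hand["landmarks"])
    return sorted(hands, key=mean_x)

# Real hands have 21 landmarks each; shortened to one point for illustration.
hands = [{"handedness": "Right", "landmarks": [(0.2, 0.5)]},
         {"handedness": "Left", "landmarks": [(0.8, 0.5)]}]
ordered = order_hands(hands)  # left hand first, regardless of screen position
```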
In doing this we did lose some useful information: BSL doesn't rely on hand positions alone, as facial expressions also carry vital information. I ignored this at this point, mainly to simplify things, but also because the meaning of the alphabet signs isn't changed by facial expression. If we wanted to go beyond the alphabet, this is something we'd need to fix!
After this step we have an x, y and z coordinate for each of the 21 points MediaPipe detects on each hand, for every frame in the video, and can use this to train a model.
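The resulting data stacks naturally into one array per clip. The `(frames, 2 hands, 21 points, 3 coordinates)` layout is my assumption about how it was stored; the landmark counts come from MediaPipe's hand model.

```python
import numpy as np

def frames_to_array(frames):
    """Stack per-frame landmarks into one array of shape
    (num_frames, 2 hands, 21 points, 3 coordinates)."""
    return np.asarray(frames, dtype=np.float32)

# One fabricated frame: two hands, each with 21 zeroed (x, y, z) points.
frame = [[[0.0, 0.0, 0.0]] * 21] * 2
clip = frames_to_array([frame] * 10)   # shape (10, 2, 21, 3)

# One hand's data rearranged as a 3-channel "image" of 21 points x 10 frames,
# which is the image analogy that motivates using a CNN.
image = clip[:, 0].transpose(2, 1, 0)  # shape (3, 21, 10)
```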
Training the model
The model that seemed to work best was a modified version of ResNet, a Convolutional Neural Network (CNN). We had converted each video into a set of 21 points (each with 3 coordinates) running over a certain number of frames - which is effectively an image with three channels and a resolution of 21 x the number of frames. I suspect this is why the CNN architecture worked best. We did need to modify ResNet, however, to change the number of dimensions: each Conv layer in the ResNet architecture was updated to be 3-dimensional instead of 2. I believe this allowed the convolutional layers to work better, though I'm not sure I could explain how.
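The dimensionality change can be sketched as a ResNet-style basic block built from `Conv3d` layers. This illustrates the idea rather than reproducing the project's exact architecture, and the `(batch, channels, hands, points, frames)` input layout is an assumption.

```python
import torch
import torch.nn as nn

class ResBlock3d(nn.Module):
    """A ResNet-style basic block using 3D convolutions.

    A sketch of the change described above -- swapping Conv2d/BatchNorm2d
    for their 3D counterparts -- not the project's exact architecture.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity shortcut

# Assumed layout: (batch, channels, hands, points, frames).
x = torch.randn(1, 8, 2, 21, 30)
y = ResBlock3d(8)(x)  # same shape out as in
```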
At this point I hit an issue: the signs aren't all the same length. The number of frames a sign runs over depends on the sign itself, but also on the person performing it. However, you can't easily batch training data of unequal length. I got around this by padding short signs and cutting off the beginning of long ones.
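That pad-or-truncate step can be sketched as below. Padding at the front (so the end of the sign always lines up with the end of the window) is an assumption; I can't confirm exactly where the original padding went.

```python
def fix_length(frames, target_len, pad_frame):
    """Force a clip to exactly target_len frames.

    Long clips lose their earliest frames; short clips are padded at the
    front (an assumption -- the original may have padded differently).
    """
    if len(frames) >= target_len:
        return frames[-target_len:]
    padding = [pad_frame] * (target_len - len(frames))
    return padding + frames

short = fix_length([1, 2, 3], 5, 0)        # padded to length 5
long = fix_length([1, 2, 3, 4, 5, 6], 4, 0)  # truncated to last 4 frames
```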
This gave us a context window: the model takes in the data from the past x frames (I believe I used 30, roughly 1 second of video) and is trained to predict which sign those frames show - with later frames weighted much more heavily than earlier ones, to improve predictions while the signer was transitioning between signs.
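Building those training windows might look something like this. Each window is labelled with the sign at its final frame, and the linearly increasing per-frame weights are purely illustrative; the exact weighting scheme the project used isn't recorded.

```python
def make_windows(frames, labels, window=30):
    """Slice a labelled frame sequence into fixed-size training windows.

    Each window holds the `window` most recent frames and is labelled with
    the sign at its final frame. The linear per-frame weight ramp is an
    illustration of weighting later frames more heavily (it would be
    applied when computing the loss); the real scheme is unknown.
    """
    weights = [(i + 1) / window for i in range(window)]
    samples = []
    for end in range(window, len(frames) + 1):
        samples.append((frames[end - window:end], weights, labels[end - 1]))
    return samples

samples = make_windows(list(range(40)), ["A"] * 40, window=30)
```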
So how did this go? Well this is the training loss and per sign accuracy for one of the models:
Training loss: 0.194
Validation loss: 1.050
Per sign accuracy:
Of course, in the wild it performed a lot worse than this - but these were encouraging results.
Inference worked much the same way as training: the hand coordinates from the previous 30 or so frames were passed into the model, which attempted to predict the sign from those.
This could then be overlaid onto the video on screen - with some checks, such as requiring the same prediction for multiple frames before updating the screen.
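One simple way to implement that "same prediction for multiple frames" check is a small stateful filter like the one below; the threshold of 5 frames is an arbitrary illustrative choice.

```python
class StablePrediction:
    """Only change the displayed sign once it has been predicted
    `required` frames in a row -- one way to implement the
    multi-frame check mentioned above (the threshold is illustrative).
    """
    def __init__(self, required=5):
        self.required = required
        self.candidate = None
        self.streak = 0
        self.displayed = None

    def update(self, prediction):
        if prediction == self.candidate:
            self.streak += 1
        else:
            self.candidate = prediction
            self.streak = 1
        if self.streak >= self.required:
            self.displayed = prediction
        return self.displayed

smoother = StablePrediction(required=3)
```

Feeding in a noisy stream like `A, A, A, B, A, ...` keeps the display on `A` until a new sign persists for three frames.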
The Android App
Finally, I wanted to see if I could get this running on a mobile phone. I was worried that performance would be very poor and the tool would only be usable on a PC with a large GPU. Using TorchScript, I was able to export a trained model and use it in an Android app running on my old Google Pixel 5.
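The export step is a few lines of TorchScript. The stand-in model and input shape below are placeholders, not the project's real network, but the trace/save/load flow is the standard one; the saved file is what the Android app loads via PyTorch's mobile runtime.

```python
import torch
import torch.nn as nn

# Stand-in model -- the real one was the modified ResNet described earlier.
model = nn.Sequential(nn.Flatten(), nn.Linear(2 * 21 * 3 * 30, 26))
model.eval()

# Trace with a representative input and save to a .pt file that the
# Android app can load with PyTorch's mobile runtime.
example = torch.randn(1, 2, 21, 3, 30)
scripted = torch.jit.trace(model, example)
scripted.save("bsl_model.pt")

# Sanity check: the reloaded scripted model matches the original.
reloaded = torch.jit.load("bsl_model.pt")
```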
Performance-wise things were fine, though I did need to limit the model to running inference only every couple of frames. This didn't seem to be a problem - it just meant the prediction updated slightly more slowly between signs.