Fitness Activity Recognition Using MediaPipe BlazePose
Fellowship.ai conducted research using Google’s MediaPipe Pose ML solution with BlazePose to analyze fitness demonstration videos for guided instruction and accurate result reporting.
The goal was to prove that, with a basic camera, a fitness equipment provider could deliver accurate instruction using MediaPipe BlazePose's 33-keypoint coordinate system, and could detect when trainees were performing exercises incorrectly by comparing their poses against a trainer's reference video.
Brief Background on Home Fitness and Artificial Intelligence
The mainstream fitness industry didn't start taking off in the U.S. until the 1970s, when new modalities of instruction were being created to help Americans live healthier lives and decrease their risks of cardiovascular disease. The real driver of these fitness movements was the entertainment value of the programming, led by personalities such as Jane Fonda and Richard Simmons and formats like Tae Bo. Home fitness equipment also became more accessible and affordable, allowing Americans to bring the gym into the living room.
Today, popular fitness equipment and entertainment brands like Tempo and Peloton are using artificial intelligence to deliver personalized content suggestions while tracking how accurately their customers follow an instructor's prompts.
While these are great advances for the home fitness industry, they still rely on expensive add-ons ($300 for Peloton's camera) or wall-mounted devices costing thousands of dollars.
Why Fellowship.ai performed research with MediaPipe BlazePose for Fitness Tracking Recognition
Expensive devices are not accessible to most people who want to work out at home on a budget, and this research shows the equipment can be optional. Using MediaPipe's Pose ML solution with BlazePose, Fellowship.ai demonstrated tracking body movement using the 33 keypoint coordinates, with models trained on videos captured from YouTube.
Why Fellowship.ai Chose MediaPipe's BlazePose Instead of Other Solutions
During the early stages of research, a range of pose estimation techniques was investigated, including PIFuHD, PoseNet, and MoveNet.
PIFuHD was introduced by Facebook and is short for Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization. Using PIFuHD, a 3D human model can be reconstructed from a 2D image.
Although PIFuHD showed impressive results in creating accurate 3D human models, it reconstructs static models from single images and could not be applied to pose estimation and tracking of moving figures. It was also prone to errors and performed poorly on outlier poses.
PoseNet is a pose estimation model released in 2017 that was trained on the COCO dataset, a large-scale object detection, segmentation, and captioning dataset. It detects a total of 17 keypoints from a human’s pose. One of the main advantages of PoseNet was that it allowed the detection of multiple persons, which is generally not available in pose estimation models.
In 2021, TensorFlow released MoveNet, a next-generation pose estimation model that outperformed PoseNet on many different tasks. MoveNet also detects 17 keypoints in a human's pose, but it was trained on an internal Google dataset called Active in addition to the COCO dataset.
Initially, Fellowship.ai conducted most of its experiments with MoveNet, but switched after preliminary evaluations of BlazePose revealed it was better suited to the project.
BlazePose uses a lightweight CNN architecture to detect 33 keypoints and is optimized for real-time inference. Furthermore, it is part of Google’s MediaPipe Pose ML solution, which provides a convenient and well-documented way to apply the model to the project.
With the increased number of keypoints, BlazePose provides richer information on a human’s pose and was therefore a strong candidate for the project.
Using BlazePose to Compare Trainee and Trainer Videos
The initial step of the research was to augment the fitness demonstration video dataset with both trainer videos showing proper form and trainee videos showing improper form. For this, a variety of fitness progress videos and "dos and don'ts" videos were gathered from the internet.
As a preliminary comparison, the following three videos of bicep curls were investigated:
- Video Segment 1: Trainer video
- Video Segment 2: Trainee video with incorrect form
- Video Segment 3: Trainee video with correct form
For these three videos, the keypoint coordinates were extracted using MediaPipe BlazePose and preprocessed using Dynamic Time Warping (DTW).
Dynamic Time Warping (DTW)
DTW is required because the videos being compared have different lengths. Matching frames one-to-one would produce large discrepancies because of lags and differing speeds. DTW warps each time series so the two line up, relaxing the one-to-one mapping in favor of many-to-one or one-to-many matching. The matching is chosen so that the total distance between the two time series is minimized.
As seen in the figures above, the red curve is slightly longer than the blue curve, although their shapes match. DTW overcomes this by using a one-to-many match so that the peaks and troughs line up. Under the hood, DTW builds a distance matrix between the two time series and then finds the shortest path through it. Since fitness demonstration videos differ in length and are performed at different speeds, DTW is a natural way to correct for these timing differences.
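The alignment described above can be sketched as a small dynamic-programming routine. This is a minimal illustration, not the project's code; in practice an optimized library such as dtaidistance or fastdtw would normally be used:

```python
# Minimal DTW: accumulated alignment cost between two series.
import numpy as np

def dtw_distance(a, b):
    """DTW cost between two sequences of scalars or feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            # Each cell extends the cheapest of match, insertion, or
            # deletion -- this is what allows many-to-one frame matching.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

# The same movement performed at half speed still aligns with zero cost,
# even though a one-to-one frame comparison is impossible:
fast = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
```

Here dtw_distance(fast, slow) is 0.0 because every frame of the faster series can be matched one-to-many against the slower one.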
To compare the videos, the following algorithms were tested:
- Algorithm 1.1: Compare 33 keypoints individually: Compare (n,3) dimensional vectors and compute the average score (n: # of frames)
- Algorithm 1.2: Compare 25 relevant keypoints individually: Same as above, but we only consider the keypoints corresponding to the upper body, which is relevant for bicep curls
- Algorithm 2.1: Compare (n, (33,3)) dimensional vectors directly
- Algorithm 2.2: Compare (n, (25,3)) dimensional vectors directly (only relevant upper-body keypoints)
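As an illustration of how the 1.x and 2.x families differ, the sketch below assumes each video's keypoints are stored as an (n, 33, 3) NumPy array and uses a generic DTW cost, where a lower cost means higher similarity. The upper-body index range and the scoring details are assumptions for illustration, not taken from the project:

```python
import numpy as np

def dtw(a, b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]

def score_per_keypoint(trainer, trainee, idx=range(33)):
    """Algorithm 1.x: DTW each keypoint's (n, 3) trajectory separately,
    then average the per-keypoint costs."""
    return float(np.mean([dtw(trainer[:, k, :], trainee[:, k, :])
                          for k in idx]))

def score_whole_pose(trainer, trainee, idx=range(33)):
    """Algorithm 2.x: DTW the flattened per-frame pose vectors directly."""
    sel = list(idx)
    a = trainer[:, sel, :].reshape(len(trainer), -1)
    b = trainee[:, sel, :].reshape(len(trainee), -1)
    return dtw(a, b)
```

For the 25-keypoint variants (Algorithms 1.2 and 2.2), a subset such as idx=range(25) would be passed in; which 25 indices count as "upper body" is an assumption here.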
The table below shows the similarity scores for each algorithm when comparing the two trainee videos against the trainer video. Across all algorithms tested, the trainee video with correct form scored higher similarity to the trainer video than the one with incorrect form.
Correcting for Orientation, Angle, Body Shape and Size
From the preliminary analysis, these algorithms seem to fail when the orientation of the person or the camera angle differs between the videos being compared. Based on this finding, we came up with the following possible solutions:
- We can use multiple trainer videos in different orientations and consider the best of many scores.
- We can also construct mirror videos of trainees and compare them with the trainer video and consider the best original and mirror score.
- We can try other models such as Pr-VIPE. These models can potentially recognize 3D human poses from 2D images.
- We can consider joint angles and try to normalize them.
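The mirror-video idea can be applied directly to the extracted keypoints rather than to pixels: reflect the normalized x-coordinates and swap left/right landmark indices. The index pairs below assume BlazePose's standard 33-landmark ordering:

```python
# Sketch: mirroring an (n, 33, 3) array of normalized BlazePose keypoints
# so the mirrored trainee can be scored against the trainer without
# re-running pose estimation. LR_PAIRS assumes BlazePose's landmark order
# (e.g. 11 = left shoulder, 12 = right shoulder).
import numpy as np

LR_PAIRS = [(1, 4), (2, 5), (3, 6), (7, 8), (9, 10), (11, 12), (13, 14),
            (15, 16), (17, 18), (19, 20), (21, 22), (23, 24), (25, 26),
            (27, 28), (29, 30), (31, 32)]

def mirror_pose(frames):
    """Mirror an (n, 33, 3) array of normalized (x, y, z) keypoints."""
    out = frames.copy()
    out[..., 0] = 1.0 - out[..., 0]   # reflect x about the vertical centerline
    for left, right in LR_PAIRS:
        # Swap each left/right landmark pair so "left elbow" still
        # means the subject's left elbow after mirroring.
        out[:, [left, right], :] = out[:, [right, left], :]
    return out
```

Both the original and mirrored sequences would then be compared against the trainer, keeping the better of the two scores as described above.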
Using MediaPipe with BlazePose or other models like Pr-VIPE is not necessarily new, but advancing the state of the art and accuracy of these techniques in a variety of contexts can help brands that deliver guided instruction improve their current systems, or incorporate these technologies into their solutions without increasing the cost to the customer.
Areas of immediate applicability include:
- Improving guided fitness instruction
- Motion analysis for athletes, e.g. golf swing analysis, batting analysis, or offensive linemen's stances in football
- Predicting fitness goal outcomes based on the accuracy of workouts or practice
Resources and FAQs
What is MediaPipe?
"MediaPipe is a Framework for building machine learning pipelines for processing time-series data like video, audio, etc. This cross-platform Framework works in Desktop/Server, Android, iOS, and embedded devices like Raspberry Pi and Jetson Nano." Source: Introduction to MediaPipe
What is BlazePose?
"[BlazePose is a] lightweight convolutional neural network architecture for human pose estimation that is tailored for real-time inference on mobile devices. During inference, the network produces 33 body keypoints for a single person and runs at over 30 frames per second on a Pixel 2 phone. This makes it particularly suited to real-time use cases like fitness tracking and sign language recognition. Our main contributions include a novel body pose tracking solution and a lightweight body pose estimation neural network that uses both heatmaps and regression to keypoint coordinates." Source:
Is MediaPipe open source?
MediaPipe is an open-source framework from Google for building multimodal (e.g. video, audio, any time-series data), cross-platform (i.e. Android, iOS, web, edge devices) applied ML pipelines. It is performance-optimized with end-to-end on-device inference in mind.
Does MediaPipe have a GitHub repo?
Yes, Google hosts the MediaPipe repo here: https://github.com/google/mediapipe and you can visit Google's GitHub site here: https://google.github.io/mediapipe/
Does BlazePose have a GitHub repo?
Yes. BlazePose ships as part of the MediaPipe repo (https://github.com/google/mediapipe), and third-party implementations are also available, such as a TensorFlow 2.x port:
Description: "This is an implementation of Google BlazePose in Tensorflow 2.x. The original paper is 'BlazePose: On-device Real-time Body Pose Tracking' by Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, and Matthias Grundmann, which is available on arXiv. You can find some demonstrations of BlazePose from Google blog."
Where can I learn more about the MediaPipe framework?
For more information on the MediaPipe framework, LearnOpenCV has a lot of great articles to get you going: "MediaPipe powers revolutionary products and services we use daily. Unlike power-hungry machine learning Frameworks, MediaPipe requires minimal resources. It is so tiny and efficient that even embedded IoT devices can run it. In 2019, MediaPipe opened up a whole new world of opportunity for researchers and developers following its public release."