In this tutorial, you’ll learn how to use the MediaPipe Hands solution in a simple hand tracking and finger counting Python application.
In Computer Vision, feature detection is key to building a good, functional application. Some basic feature detection methods, like edge and corner detection (check our post about the Harris Corner Detector here), can be implemented with straightforward mathematics; others are more complex and require a Machine Learning-based approach.
When considering “human features”, the most used and researched are usually the face and hands. Identifying and tracking hands can be useful in various applications, such as implementing gesture control, interpreting sign language, or improving augmented reality solutions. However, working with this feature can be challenging, because hands appear in a variety of positions, often occluding some fingers or each other. This tutorial will show a simple hand tracking and finger counting Python application using OpenCV and MediaPipe.
What is MediaPipe
MediaPipe is a framework that provides customizable Machine Learning (ML) solutions (such as face and hand detection, hair segmentation, motion tracking, etc.) for live and streaming media. Their solution for hand detection and tracking is called MediaPipe Hands, and it employs ML to provide palm detection and a hand landmark model consisting of 21 3D landmarks, as shown in Figure 1.
These 3D landmarks are each composed of x, y, and z coordinates. x and y correspond to the landmark position, normalized from 0 to 1 by the image’s width and height, respectively. The z component represents how close the landmark is to the camera. We will only use the x and y coordinates in this tutorial. Additionally, the solution provides a label related to the predicted handedness of the detected hand, indicating left or right.
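Since the x and y coordinates are normalized, they must be scaled by the image’s width and height to get pixel positions (for drawing, for example). A minimal sketch of that conversion, where `to_pixel` is our own helper name, not part of MediaPipe:

```python
# Convert a normalized MediaPipe landmark coordinate pair (x, y in [0, 1])
# to pixel coordinates for an image of the given width and height.
def to_pixel(x, y, width, height):
    # x is normalized by the width, y by the height; clamp to keep the
    # result inside the image bounds.
    px = min(int(x * width), width - 1)
    py = min(int(y * height), height - 1)
    return px, py

# Example: a landmark at the center of a 640x480 frame.
print(to_pixel(0.5, 0.5, 640, 480))  # → (320, 240)
```

MediaPipe’s own `mp_drawing.draw_landmarks` does this scaling internally, so the conversion is only needed when you work with the coordinates directly.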
The implementation below works by running the MediaPipe Hands process function on each frame of the webcam video capture. For each frame, the results provide a 3D landmark model for each hand detected. For each of the detected hands, these are the steps followed:
- Check detected hand label.
- Store x and y coordinates of each landmark.
- Check each finger’s coordinates to determine whether it is raised, incrementing the finger count accordingly.
- Draw the hand landmarks with the draw_landmarks function.
For the third step, there are two approaches to test if a finger is raised:
- For the thumb, we’ll check the x coordinates of the THUMB_TIP and THUMB_IP landmarks, together with the hand label. The thumb is considered raised if the _TIP is located to the right of the _IP for the left hand, and the opposite for the right hand.
- For the other fingers, we’ll check the y coordinates of the _TIP and _PIP landmarks. The finger is considered raised if the _TIP is located higher than the _PIP.
Two important notes for the implementation of the third step are:
- Because we are using the webcam input capture (a mirrored view), the left hand is labeled as "Right", and the right hand is labeled as "Left".
- The image’s origin [0, 0] when using the OpenCV library is in the upper left corner.
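Taken together, the finger tests and the two notes above can be sketched as a small standalone helper. This is a sketch for clarity, `count_raised_fingers` is our own function name, not part of MediaPipe; the full script that follows inlines the same checks. The landmark indices come from MediaPipe’s hand landmark numbering (THUMB_IP = 3, THUMB_TIP = 4, and the TIP/PIP pairs of the remaining fingers):

```python
# MediaPipe hand landmark indices used by the finger tests.
THUMB_TIP, THUMB_IP = 4, 3
# (TIP, PIP) index pairs for the index, middle, ring and pinky fingers.
FINGER_TIP_PIP = [(8, 6), (12, 10), (16, 14), (20, 18)]

def count_raised_fingers(hand_landmarks, hand_label):
    """hand_landmarks: list of 21 [x, y] pairs (normalized coordinates).
    hand_label: "Left" or "Right" as reported by MediaPipe (with a
    mirrored webcam feed the labels are swapped)."""
    count = 0
    # Thumb: compare TIP and IP x positions; the direction of the
    # comparison depends on the hand label.
    if hand_label == "Left" and hand_landmarks[THUMB_TIP][0] > hand_landmarks[THUMB_IP][0]:
        count += 1
    elif hand_label == "Right" and hand_landmarks[THUMB_TIP][0] < hand_landmarks[THUMB_IP][0]:
        count += 1
    # Other fingers: the TIP being above the PIP means a smaller y,
    # since the image origin is in the upper left corner.
    for tip, pip in FINGER_TIP_PIP:
        if hand_landmarks[tip][1] < hand_landmarks[pip][1]:
            count += 1
    return count
```

Writing the logic as a pure function over the landmark list makes it easy to test with synthetic coordinates, without a camera.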
```python
import cv2
import mediapipe as mp

mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_hands = mp.solutions.hands

# For webcam input:
cap = cv2.VideoCapture(0)
with mp_hands.Hands(
    model_complexity=0,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as hands:
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      print("Ignoring empty camera frame.")
      # If loading a video, use 'break' instead of 'continue'.
      continue

    # To improve performance, optionally mark the image as not writeable to
    # pass by reference.
    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = hands.process(image)

    # Draw the hand annotations on the image.
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

    # Initially set finger count to 0 for each frame
    fingerCount = 0

    if results.multi_hand_landmarks:
      for hand_landmarks in results.multi_hand_landmarks:
        # Get hand index to check label (left or right)
        handIndex = results.multi_hand_landmarks.index(hand_landmarks)
        handLabel = results.multi_handedness[handIndex].classification[0].label

        # Set variable to keep landmarks positions (x and y)
        handLandmarks = []

        # Fill list with x and y positions of each landmark
        for landmarks in hand_landmarks.landmark:
          handLandmarks.append([landmarks.x, landmarks.y])

        # Test conditions for each finger: count is increased if finger is
        # considered raised.
        # Thumb: TIP x position must be greater or lower than IP x position,
        # depending on hand label.
        if handLabel == "Left" and handLandmarks[4][0] > handLandmarks[3][0]:
          fingerCount = fingerCount + 1
        elif handLabel == "Right" and handLandmarks[4][0] < handLandmarks[3][0]:
          fingerCount = fingerCount + 1

        # Other fingers: TIP y position must be lower than PIP y position,
        # as image origin is in the upper left corner.
        if handLandmarks[8][1] < handLandmarks[6][1]:       # Index finger
          fingerCount = fingerCount + 1
        if handLandmarks[12][1] < handLandmarks[10][1]:     # Middle finger
          fingerCount = fingerCount + 1
        if handLandmarks[16][1] < handLandmarks[14][1]:     # Ring finger
          fingerCount = fingerCount + 1
        if handLandmarks[20][1] < handLandmarks[18][1]:     # Pinky
          fingerCount = fingerCount + 1

        # Draw hand landmarks
        mp_drawing.draw_landmarks(
            image,
            hand_landmarks,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style())

    # Display finger count
    cv2.putText(image, str(fingerCount), (50, 450),
                cv2.FONT_HERSHEY_SIMPLEX, 3, (255, 0, 0), 10)

    # Display image
    cv2.imshow('MediaPipe Hands', image)
    if cv2.waitKey(5) & 0xFF == 27:
      break
cap.release()
```