Hand Tracking and Finger Counting in Python with MediaPipe


In this tutorial, you’ll learn how to use the MediaPipe Hands solution in a simple hand tracking and finger counting Python application.


In Computer Vision, feature detection is key to building a functional application. Some basic feature detection methods, such as edge and corner detection (check our post about the Harris Corner Detector), can be computed directly with mathematical operations on the image. Others are more complex and require a Machine Learning-based approach.

When considering “human features”, the most used and researched are usually the face and the hands. Identifying and tracking hands can be useful in various applications, such as implementing gesture control, interpreting sign language, or improving augmented reality solutions. However, working with this feature can be challenging, because hands appear in many positions and often occlude some fingers or one another. This tutorial shows a simple hand tracking and finger counting Python application using OpenCV and MediaPipe.

What is MediaPipe?

MediaPipe is a framework that provides customizable Machine Learning (ML) solutions (such as face and hand detection, hair segmentation, motion tracking, etc.) for live and streaming media. Its solution for hand detection and tracking is called MediaPipe Hands. It uses ML for palm detection and for a hand landmark model that consists of 21 3D landmarks, as shown in Figure 1.

These 3D landmarks are each composed of x, y, and z coordinates. x and y correspond to the landmark position, normalized from 0 to 1 by the image’s width and height, respectively. The z component represents the landmark depth relative to the wrist, with smaller values meaning the landmark is closer to the camera. We will only use the x and y coordinates in this tutorial. Additionally, the solution provides a label with the predicted handedness of each detected hand, indicating left or right.
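As a minimal sketch of what the normalization means, the snippet below converts a normalized landmark position to pixel coordinates. The helper name `to_pixel` and the 640x480 frame size are assumptions made for illustration; they are not part of MediaPipe.

```python
def to_pixel(x_norm, y_norm, image_width, image_height):
    """Convert normalized landmark coordinates to integer pixel coordinates.

    Hypothetical helper for illustration; not a MediaPipe function.
    """
    # Clamp to the valid pixel range, since a landmark can fall slightly
    # outside [0, 1] when the hand is partly out of the frame.
    px = max(0, min(int(x_norm * image_width), image_width - 1))
    py = max(0, min(int(y_norm * image_height), image_height - 1))
    return px, py


# Example: a landmark in the centre of an assumed 640x480 frame.
center = to_pixel(0.5, 0.5, 640, 480)
```

Pixel coordinates like these are only needed when drawing or cropping yourself. The finger counting logic in this tutorial compares normalized values directly.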

Figure 1: MediaPipe Hands Landmark Model
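Each of the 21 landmarks has a fixed index, and the code later in this tutorial refers to them by number (for example, 4 for THUMB_TIP and 8 for INDEX_FINGER_TIP). The sketch below rebuilds that name-to-index mapping in plain Python so the numbers are easier to read. The `Landmark` enum is a local stand-in defined here for illustration; MediaPipe exposes the same names through `mp_hands.HandLandmark`.

```python
from enum import IntEnum

# Landmark names in index order: the wrist, four thumb joints, then four
# joints (MCP, PIP, DIP, TIP) for each of the remaining fingers.
_names = ["WRIST", "THUMB_CMC", "THUMB_MCP", "THUMB_IP", "THUMB_TIP"]
for finger in ("INDEX_FINGER", "MIDDLE_FINGER", "RING_FINGER", "PINKY"):
    _names += [f"{finger}_{joint}" for joint in ("MCP", "PIP", "DIP", "TIP")]

# Local stand-in enum (illustrative), numbered from 0 like the model's indices.
Landmark = IntEnum("Landmark", _names, start=0)

thumb_tip_index = int(Landmark.THUMB_TIP)
index_pip_index = int(Landmark.INDEX_FINGER_PIP)
```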


The implementation below works by running the MediaPipe Hands process function on each frame of the webcam video capture. For each frame, the results provide a 3D landmark model for each hand detected. For each detected hand, these are the steps followed:

  1. Check detected hand label.
  2. Store x and y coordinates of each landmark.
  3. Check each finger’s coordinates to determine if it is raised to increase finger count.
  4. Draw hand landmarks with draw_landmarks function.

For the third step, there are two approaches to test if a finger is raised:

  • For the thumb, we’ll compare the x coordinates of THUMB_TIP and THUMB_IP, taking the hand label into account. The thumb is considered raised if THUMB_TIP is to the right of THUMB_IP (larger x) when the hand is labeled “Left”, and to the left of it (smaller x) when the hand is labeled “Right”.
  • For the other fingers, we’ll compare the y coordinates of the _TIP and _PIP landmarks. The finger is considered raised if the _TIP is located higher in the image than the _PIP, that is, if it has the smaller y value.
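The two tests above can be sketched as a standalone function that works on plain lists of (x, y) values, which makes the rule easy to try without a camera. The function name `count_raised_fingers` and the synthetic points are assumptions for illustration; the landmark indices follow the MediaPipe Hands model.

```python
# (TIP, PIP) landmark indices for the index, middle, ring and pinky fingers.
FINGER_TIP_PIP = [(8, 6), (12, 10), (16, 14), (20, 18)]
THUMB_TIP, THUMB_IP = 4, 3


def count_raised_fingers(points, label):
    """Count raised fingers from 21 [x, y] points and a "Left"/"Right" label."""
    count = 0
    # Thumb: compare the x values of TIP and IP; the direction depends on the label.
    tip_x, ip_x = points[THUMB_TIP][0], points[THUMB_IP][0]
    if (label == "Left" and tip_x > ip_x) or (label == "Right" and tip_x < ip_x):
        count += 1
    # Other fingers: a TIP above its PIP has the smaller y value,
    # because y grows downward from the top of the image.
    for tip, pip in FINGER_TIP_PIP:
        if points[tip][1] < points[pip][1]:
            count += 1
    return count


# Synthetic example: start with every landmark at (0.5, 0.5), then move the
# index and middle fingertips above their PIP joints.
points = [[0.5, 0.5] for _ in range(21)]
points[8][1] = 0.2
points[12][1] = 0.3
two_fingers = count_raised_fingers(points, "Right")
```

This rule is a simple heuristic: it assumes an upright hand facing the camera, and it can miscount when the hand is rotated or pointing downward.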

Two important notes for the implementation of the third step are:

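  • MediaPipe determines handedness assuming the input image is mirrored (a selfie view). Because we pass the webcam frame without flipping it horizontally, the labels are swapped: the left hand is labeled “Right”, and the right hand is labeled “Left”.
  • The image’s origin [0, 0] when using the OpenCV library is in the upper left corner, so y values increase downward.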
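The two notes above can be captured in a pair of small helpers. This is only a sketch: `actual_hand` and `is_higher` are hypothetical names introduced here, and the swap applies only when the frame is passed to MediaPipe without being flipped horizontally.

```python
def actual_hand(label):
    """Return the user's real hand for a label predicted on an unflipped frame.

    Hypothetical helper: it simply swaps "Left" and "Right".
    """
    swapped = {"Left": "Right", "Right": "Left"}
    return swapped[label]


def is_higher(y_a, y_b):
    """Return True if y_a is higher in the image than y_b.

    The origin is the upper left corner, so higher means a smaller y value.
    """
    return y_a < y_b


# A hand labeled "Left" on an unflipped webcam frame is the user's right hand.
real_hand = actual_hand("Left")
```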


import cv2
import mediapipe as mp
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_hands = mp.solutions.hands

# For webcam input:
cap = cv2.VideoCapture(0)
with mp_hands.Hands(
    min_tracking_confidence=0.5) as hands:
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      print("Ignoring empty camera frame.")
      # If loading a video, use 'break' instead of 'continue'.
      continue

    # To improve performance, optionally mark the image as not writeable to
    # pass by reference.
    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = hands.process(image)

    # Draw the hand annotations on the image.
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)

    # Reset the finger count to 0 for each frame
    fingerCount = 0

    if results.multi_hand_landmarks:

      # multi_handedness is in the same order as multi_hand_landmarks,
      # so the two lists can be iterated together.
      for hand_landmarks, handedness in zip(results.multi_hand_landmarks,
                                            results.multi_handedness):
        # Get the predicted label ("Left" or "Right") for this hand
        handLabel = handedness.classification[0].label

        # Set variable to keep landmarks positions (x and y)
        handLandmarks = []

        # Fill list with x and y positions of each landmark
        for landmarks in hand_landmarks.landmark:
          handLandmarks.append([landmarks.x, landmarks.y])

        # Test conditions for each finger: count is increased if the finger
        #   is considered raised.
        # Thumb: TIP x position must be greater or lower than IP x position,
        #   depending on the hand label.
        if handLabel == "Left" and handLandmarks[4][0] > handLandmarks[3][0]:
          fingerCount = fingerCount+1
        elif handLabel == "Right" and handLandmarks[4][0] < handLandmarks[3][0]:
          fingerCount = fingerCount+1

        # Other fingers: TIP y position must be lower than PIP y position, 
        #   as image origin is in the upper left corner.
        if handLandmarks[8][1] < handLandmarks[6][1]:       #Index finger
          fingerCount = fingerCount+1
        if handLandmarks[12][1] < handLandmarks[10][1]:     #Middle finger
          fingerCount = fingerCount+1
        if handLandmarks[16][1] < handLandmarks[14][1]:     #Ring finger
          fingerCount = fingerCount+1
        if handLandmarks[20][1] < handLandmarks[18][1]:     #Pinky
          fingerCount = fingerCount+1

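        # Draw hand landmarks
        mp_drawing.draw_landmarks(
            image,
            hand_landmarks,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style())
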
    # Display finger count
    cv2.putText(image, str(fingerCount), (50, 450), cv2.FONT_HERSHEY_SIMPLEX, 3, (255, 0, 0), 10)

    # Display image
    cv2.imshow('MediaPipe Hands', image)
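    # Exit the loop when the Esc key (code 27) is pressed
    if cv2.waitKey(5) & 0xFF == 27:
      break

# Release the webcam when done
cap.release()
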
[1] MediaPipe Hands Documentation
