
I'm currently playing around with Google's MediaPipe face landmark detection tool, using it to extract 3D poses from an image containing a face. I'm a little confused about the outputs the detector returns when run on an image. In particular, it returns normalized keypoints (I think I understand this fine: a 478x3 tensor representing the keypoints on the face), face blendshapes (not too concerned with this atm), and a facial transformation matrix.
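For reference, this is roughly how I'm running the detector (the model bundle and image paths are just placeholders for my local files):

```python
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# 'face_landmarker.task' is the model bundle downloaded from MediaPipe's
# model page; 'face.jpg' is any image containing a face.
base_options = python.BaseOptions(model_asset_path='face_landmarker.task')
options = vision.FaceLandmarkerOptions(
    base_options=base_options,
    output_face_blendshapes=True,
    output_facial_transformation_matrixes=True,
    num_faces=1)
detector = vision.FaceLandmarker.create_from_options(options)

result = detector.detect(mp.Image.create_from_file('face.jpg'))
landmarks = result.face_landmarks[0]               # 478 normalized (x, y, z) points
blendshapes = result.face_blendshapes[0]           # blendshape scores
matrix = result.facial_transformation_matrixes[0]  # 4x4 transformation matrix
```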

My source of confusion is what the facial transformation matrix represents and how it ties into the currently predicted keypoints. From their documentation, they state that:

FaceLandmarker uses the matrix to transform the face landmarks from a canonical face model to the detected face, so users can apply effects on the detected landmarks.

  1. What is the canonical face model being used in this case? It's not clear to me what it is or how to obtain it.
  2. I'm trying to visualize the original face, but given the above, I'm unsure how this process works.

Does it go from canonical model -> add keypoint offsets on the face (how do I calculate those?) -> multiply the resulting vertices by the given "transformation matrix" to recover the original pose/photo?

  3. What happens if I don't multiply by the transformation matrix? Do I just get the canonical face with the facial expressions from the original image, but not the actual pose/orientation the face had in the image?

I'm completely new to 3D graphics and computer vision so apologies for the stupid questions and appreciate any help.

1 Answer


I've spent a lot of time searching for these same answers. The documentation is quite thin when it comes to the transformation matrix. However, I believe you can find the canonical face model here: https://github.com/google-ai-edge/mediapipe/blob/master/mediapipe/modules/face_geometry/data/face_model_with_iris.obj

Naively, I would think that multiplying the canonical face model by the transformation matrix would yield a set of coordinates within the "screen space" similar in size/shape/location to the landmark output points. However, I've been unable to confirm this in my tinkering. There must be a step missing.
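For what it's worth, here's a sketch of the naive multiplication I tried. It assumes a `result` obtained from FaceLandmarker with `output_facial_transformation_matrixes=True`, and that the `.obj` file above has been downloaded locally (the path is a placeholder):

```python
import numpy as np

def load_obj_vertices(path):
    """Collect the 'v x y z' vertex lines from a Wavefront .obj file."""
    vertices = []
    with open(path) as f:
        for line in f:
            if line.startswith('v '):
                vertices.append([float(c) for c in line.split()[1:4]])
    return np.asarray(vertices)  # shape (N, 3)

canonical = load_obj_vertices('face_model_with_iris.obj')

# The 4x4 facial transformation matrix for the first detected face.
matrix = np.asarray(result.facial_transformation_matrixes[0]).reshape(4, 4)

# Naive step: apply the matrix to the canonical vertices in homogeneous
# coordinates, then drop the homogeneous component.
homogeneous = np.hstack([canonical, np.ones((len(canonical), 1))])  # (N, 4)
posed = (matrix @ homogeneous.T).T[:, :3]
```

If I had to guess at the missing step: the matrix appears to place the canonical model in a metric 3D camera space (MediaPipe's face geometry module works in centimetres with a virtual perspective camera), while the landmark output is in normalized image coordinates, so you would still need a perspective projection with the camera intrinsics to land in screen space. I haven't been able to verify this, though.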

