I'm currently playing around with Google's MediaPipe face landmark detection tool (FaceLandmarker). I am using it to extract 3D poses from an image containing a face, and I'm a little confused about what the detector returns once it is run on an image. In particular, it returns normalized keypoints (I think I understand these fine: a 478x3 tensor representing the keypoints on the face), face_blendshapes (not too concerned with these at the moment), and a facial transformation matrix.
My source of confusion is what the facial transformation matrix represents and how it ties into the predicted keypoints. The documentation states that:
FaceLandmarker uses the matrix to transform the face landmarks from a canonical face model to the detected face, so users can apply effects on the detected landmarks.
- What is the canonical face model being used in this case? It's not clear to me what it is or how to get it.
- I'm trying to visualize the original model, but given the statement above, I'm unsure how the process works.
Does it go: canonical model -> add keypoint offsets for the face (how are those calculated?) -> multiply the result by the given "transformation matrix" to recover the original pose/photo?
- What happens if I don't multiply by the transformation matrix? Do I just get the canonical face with the facial expression from the original image, but without the actual pose/direction the face had in the image?
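To make my mental model concrete, here is a minimal sketch of what I assume "applying the transformation matrix" means: treating it as a 4x4 homogeneous transform applied to 3D points. The points and the pose matrix below are made up for illustration (they are not MediaPipe output or the real canonical model), and `apply_transform` is my own hypothetical helper, not a MediaPipe function.

```python
import math


def apply_transform(matrix, points):
    """Apply a 4x4 homogeneous transform (nested lists, row-major) to 3D points."""
    transformed = []
    for x, y, z in points:
        vec = (x, y, z, 1.0)  # homogeneous coordinates
        tx, ty, tz, tw = (sum(m * v for m, v in zip(row, vec)) for row in matrix)
        transformed.append((tx / tw, ty / tw, tz / tw))
    return transformed


# Hypothetical stand-in for a few canonical-model vertices (not real data).
canonical_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# Hypothetical pose: rotate 90 degrees about the y-axis, then translate by (0, 0, -5).
angle = math.radians(90)
c, s = math.cos(angle), math.sin(angle)
pose = [
    [c, 0.0, s, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [-s, 0.0, c, -5.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Identity matrix: my guess at what "not multiplying by the matrix" amounts to.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

posed_points = apply_transform(pose, canonical_points)
unposed_points = apply_transform(identity, canonical_points)
```

If this picture is right, `unposed_points` stays in the canonical frame while `posed_points` carries the head rotation and translation. What I can't tell is whether the per-face expression is already baked into the points before this step or comes from somewhere else.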
I'm completely new to 3D graphics and computer vision, so apologies for the basic questions. I'd appreciate any help.