The work is divided into two stages:
- Project 1 – Signal extraction and rule-based detection
- Project 2 – Temporal modeling using PyTorch (LSTM)
Both projects use the same core idea: extracting meaningful signals from facial landmarks using MediaPipe. The difference lies in how those signals are interpreted.
In Project 1, I built a pipeline to extract facial behavior signals directly from video using MediaPipe Face Mesh.
- Detects facial landmarks for each frame
- Computes normalized measurements:
  - Smile ratio (mouth width / face width)
  - Mouth open ratio (lip gap / face height)
  - Head turn ratio (nose position relative to face center)
- Applies smoothing to reduce noise
- Uses threshold-based rules to classify behavior per frame
- Produces an annotated video showing:
  - landmark points
  - measurement lines
  - computed ratios
  - detected labels (rule-based)
- Frame-by-frame decision making
- No learning involved
- Fully deterministic logic
- Works well for clear, exaggerated expressions
- Sensitive to small fluctuations and noise
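The Project 1 steps above (ratios, smoothing, threshold rules) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the landmark indices, the moving-average smoother, the threshold values, the rule order, and the left/right sign convention are all assumptions. Landmarks are passed in as a plain mapping from index to `(x, y)` so the sketch runs without MediaPipe.

```python
import math
from collections import deque

# Assumed Face Mesh landmark indices (not stated in this README).
MOUTH_LEFT, MOUTH_RIGHT = 61, 291
LIP_TOP, LIP_BOTTOM = 13, 14
FACE_LEFT, FACE_RIGHT = 234, 454
FACE_TOP, CHIN = 10, 152
NOSE_TIP = 1


def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def compute_ratios(lm):
    """Compute the three normalized signals from a {index: (x, y)} mapping."""
    face_width = dist(lm[FACE_LEFT], lm[FACE_RIGHT])
    face_height = dist(lm[FACE_TOP], lm[CHIN])
    center_x = (lm[FACE_LEFT][0] + lm[FACE_RIGHT][0]) / 2
    return {
        "smile": dist(lm[MOUTH_LEFT], lm[MOUTH_RIGHT]) / face_width,
        "mouth_open": dist(lm[LIP_TOP], lm[LIP_BOTTOM]) / face_height,
        # Signed offset of the nose from the face center, in face widths.
        "head_turn": (lm[NOSE_TIP][0] - center_x) / face_width,
    }


class MovingAverage:
    """Average each signal over the last `size` frames (assumed smoother)."""

    def __init__(self, size=5):
        self.buf = deque(maxlen=size)

    def update(self, ratios):
        self.buf.append(ratios)
        return {k: sum(r[k] for r in self.buf) / len(self.buf) for k in ratios}


def classify(r, smile_t=0.45, open_t=0.08, turn_t=0.10):
    """Per-frame rules; thresholds and rule order are hypothetical."""
    # Sign convention is an assumption; flip it for mirrored video.
    if r["head_turn"] < -turn_t:
        return "head_left"
    if r["head_turn"] > turn_t:
        return "head_right"
    if r["mouth_open"] > open_t:
        return "mouth_open"
    if r["smile"] > smile_t:
        return "smiling"
    return "neutral"


# Made-up landmark positions for one frame, in normalized image coordinates.
frame = {
    FACE_LEFT: (0.3, 0.5), FACE_RIGHT: (0.7, 0.5),
    FACE_TOP: (0.5, 0.2), CHIN: (0.5, 0.8),
    MOUTH_LEFT: (0.4, 0.65), MOUTH_RIGHT: (0.6, 0.65),
    LIP_TOP: (0.5, 0.64), LIP_BOTTOM: (0.5, 0.66),
    NOSE_TIP: (0.5, 0.5),
}

smoother = MovingAverage(size=5)
label = classify(smoother.update(compute_ratios(frame)))
```

Because each ratio is divided by face width or face height, the signals stay comparable when the face moves closer to or farther from the camera. The fixed thresholds are also where the noise sensitivity comes from: a ratio hovering near a threshold can flip the label from one frame to the next.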
In Project 2, I extend the same signal pipeline but replace rule-based decisions with a learned model.
Instead of treating each frame independently, the model looks at a sequence of frames and learns patterns over time.
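Turning per-frame signals into sequences might look like the sketch below. It assumes each frame is a list of three values (smile, mouth open, head turn) and uses a sliding window of 20 frames, the window size this project uses. The function name `make_windows` and the stride of 1 are illustrative choices, not taken from the project.

```python
def make_windows(frames, window=20, stride=1):
    """Return every run of `window` consecutive frames, advancing by `stride`."""
    return [
        frames[i:i + window]
        for i in range(0, len(frames) - window + 1, stride)
    ]


# 25 made-up frames of [smile, mouth_open, head_turn] signals.
frames = [[0.40 + 0.001 * t, 0.03, 0.0] for t in range(25)]
windows = make_windows(frames)  # each window holds 20 consecutive frames
```

With a stride of 1, consecutive windows overlap by 19 frames, so a single noisy frame changes only a small part of each input the model sees.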
- Uses the same signals generated in Project 1
- Converts them into sequences (window of 20 frames)
- Trains an LSTM model using PyTorch
- Predicts one of five behavior classes based on temporal context:
  - neutral
  - smiling
  - mouth_open
  - head_left
  - head_right
- Produces an annotated video similar to Project 1, but:
  - labels are predicted by the trained model
  - predictions are more stable over time
- Sequence-based prediction
- Uses temporal context
- More robust to noise and small fluctuations
- Learns patterns instead of relying on fixed thresholds
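A sequence classifier of this kind could be sketched in PyTorch as follows. The class name `BehaviorLSTM`, the hidden size, the learning rate, and the random training batch are assumptions for illustration; only the three input signals, the 20-frame window, and the five class names come from the description above.

```python
import torch
from torch import nn

CLASSES = ["neutral", "smiling", "mouth_open", "head_left", "head_right"]


class BehaviorLSTM(nn.Module):
    """LSTM over a window of signals, followed by a linear classifier."""

    def __init__(self, n_features=3, hidden_size=32, n_classes=len(CLASSES)):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        # x has shape (batch, window, n_features).
        _, (h_n, _) = self.lstm(x)
        # h_n[-1] is the final hidden state of the last layer: (batch, hidden).
        return self.head(h_n[-1])  # logits of shape (batch, n_classes)


torch.manual_seed(0)
model = BehaviorLSTM()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Placeholder batch: 4 windows of 20 frames with 3 signals each, random labels.
x = torch.randn(4, 20, 3)
y = torch.tensor([0, 1, 2, 3])

# One training step.
logits = model(x)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Map the highest-scoring class index back to its label name.
predicted = [CLASSES[i] for i in logits.argmax(dim=1).tolist()]
```

The classifier reads only the final hidden state, so each prediction summarizes the whole window rather than a single frame. That is one plausible reason the labels change less often than with per-frame thresholds.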
The main difference is not in how the signals are extracted, but in how they are interpreted.
| Aspect | Project 1 | Project 2 |
|---|---|---|
| Approach | Rule-based | Learned (PyTorch) |
| Input | Single frame | Sequence of frames |
| Logic | Thresholds | LSTM model |
| Stability | Can be noisy | More stable |
| Adaptability | Fixed rules | Learns patterns |
- Start with interpretable signal extraction
- Move towards learning-based models
- Incorporate temporal context