Author: Achita
Date: July 2024
This document provides a comprehensive summary of Visual SLAM, currently under development, synthesizing information from a variety of reputable sources, including An Invitation to 3-D Vision, lecture slides from Carnegie Mellon University and the University of Toronto, IEEE Robotics & Automation Magazine, and CS231A from Stanford University, among others. It focuses on the most critical aspects necessary for coding the program and understanding the underlying mathematical concepts.
- Introduction
- Image Sequencing
- Exploring Epipolar Geometry
- Theorems and Definitions
- Implementation of Epipolar Geometry
Visual SLAM builds on Visual Odometry (VO), so we first need to understand that. VO is the process of estimating the position and orientation of a camera from a sequence of images. The path is estimated incrementally, updating the pose each time a new frame arrives.
Process of Visual Odometry
1. Capture a new frame ( I_k ).
2. Extract features and match them between the previous frame ( I_{k-1} ) and the current frame ( I_k ).
3. Compute the essential matrix, ( E ), between the image pair ( I_{k-1} ) and ( I_k ) using the eight-point algorithm, i.e., solve the linear system
[ A e = 0, ]
where ( e ) is ( E ) stacked as a 9-vector and each row of ( A ) comes from one pair of matched points.
4. Decompose ( E ) into four candidate pose pairs ( (R_k, t_k) ) using Singular Value Decomposition:
[ E = U \Sigma V^\top, ]
where ( U ) and ( V ) are rotation matrices.
5. Find the correct pose by triangulating the key points (only one candidate yields positive depth for all points), and form the transformation matrix ( T_k ):
[ T_k = \begin{bmatrix} R_k & t_k \\ 0 & 1 \end{bmatrix}. ]
6. Compute the relative scale and rescale ( t_k ) accordingly:
[ r = \frac{\lVert x_{k-1, i} - x_{k-1, j} \rVert}{\lVert x_{k, i} - x_{k, j} \rVert}. ]
7. Concatenate the transformations to obtain the current absolute pose:
[ C_k = C_{k-1} T_k. ]
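The last three steps can be sketched in NumPy (a minimal illustration with a made-up relative motion; in the real pipeline, ( R_k ) and ( t_k ) come from decomposing ( E )):

```python
import numpy as np

def make_transform(R, t):
    """Assemble the 4x4 homogeneous transform T = [[R, t], [0, 1]]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Toy relative motion: rotate 90 degrees about z, translate along x
theta = np.pi / 2
R_k = np.array([[np.cos(theta), -np.sin(theta), 0],
                [np.sin(theta),  np.cos(theta), 0],
                [0, 0, 1]])
t_k = np.array([1.0, 0.0, 0.0])

C_prev = np.eye(4)              # previous absolute pose C_{k-1}
T_k = make_transform(R_k, t_k)  # relative transform between the frames
C_k = C_prev @ T_k              # concatenated pose C_k = C_{k-1} T_k
```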
We define the image sequence [ I_{0:n} = \{ I_0, I_1, \ldots, I_{n-1}, I_n \}. ]
Using OpenCV, we can capture each frame ( I_k ) of the sequence and detect the key features (at least eight) needed to estimate the essential matrix.
Import OpenCV and run a video loop that captures a single frame every iteration:

```python
import cv2 as cv

# (attributes such as self.captureIndex come from the enclosing capture class)
video = cv.VideoCapture(self.captureIndex)
assert video.isOpened()
video.set(cv.CAP_PROP_FRAME_WIDTH, self.width)
video.set(cv.CAP_PROP_FRAME_HEIGHT, self.height)

while True:
    # video.read() -> (bool: frame was read correctly, frame)
    ret, frame = video.read()
    if not ret:
        break  # stop if the frame could not be read
    if cv.waitKey(1) & 0xFF == ord(" "):
        break  # stop when the space bar is pressed

# Release the capture
video.release()
cv.destroyAllWindows()
```

Step 2: Extract and match features between ( I_{k-1} ) and its subsequent frame ( I_k ).
This step links the frontend (image capture) and backend (pose estimation).
It takes a key feature (point) from frame (I_{k-1}) and matches it to the corresponding point in the subsequent image (I_k).
To detect key features, we use Oriented FAST and Rotated BRIEF (ORB).
```python
import cv2 as cv

orb = cv.ORB_create(nfeatures=1500)

# Keypoints and descriptors for both frames
kp1, desc1 = orb.detectAndCompute(I[i - 1], None)
kp2, desc2 = orb.detectAndCompute(I[i], None)
```

To estimate the corresponding key features between the two frames, we use the Fast Library for Approximate Nearest Neighbors (FLANN).
```python
# ORB yields binary descriptors, so we use FLANN's LSH index
# (the KD-tree index, FLANN_INDEX_KDTREE = 0, is meant for float
# descriptors such as SIFT)
FLANN_INDEX_LSH = 6
idxParams = dict(algorithm=FLANN_INDEX_LSH,
                 table_number=6,
                 key_size=12,
                 multi_probe_level=1)
searchParams = dict(checks=50)
flann = cv.FlannBasedMatcher(idxParams, searchParams)
matches = flann.knnMatch(desc1, desc2, k=2)

# Filter the matches using Lowe's ratio test
thresh = 0.7
goodMatches = []
for m, n in matches:
    if m.distance < thresh * n.distance:
        goodMatches.append(m)
```

Consider two images taken at two distinct vantage points, assuming a calibrated camera (i.e., the intrinsic parameters are known, so pixel coordinates can be converted to normalized image coordinates). Each image point then satisfies the projection equation
[ \lambda x = \Pi_0 X, ]
where:
- $\lambda$ is the scale factor (depth),
- $\Pi_0$ is the projection $\mathbb{R}^3 \to \mathbb{R}^2$,
- $X$ is the 3-D point and $x$ its image.
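As a quick numerical illustration (toy numbers, not from the text), in homogeneous coordinates the standard projection is ( \Pi_0 = [I \mid 0] ), and dividing ( \Pi_0 X ) by the depth ( \lambda ) recovers the image point ( x ):

```python
import numpy as np

# Standard projection Pi_0 = [I | 0] acting on a homogeneous 3-D point
Pi0 = np.hstack([np.eye(3), np.zeros((3, 1))])

X = np.array([2.0, -1.0, 4.0, 1.0])  # homogeneous point with depth 4
lam_x = Pi0 @ X                      # this is lambda * x
lam = lam_x[2]                       # scale factor (depth) lambda = 4
x = lam_x / lam                      # normalized image point [0.5, -0.25, 1]
```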
Theorem 3.1 — (Epipolar Constraint)
For two images $x_1, x_2$ of a point $P$, seen from two vantage points, the following constraint holds:
[ x_2^\top \hat{t} R \, x_1 = x_2^\top E \, x_1 = 0, ]
where $(R, t)$ is the relative pose (position and orientation) between the two camera frames, and the essential matrix $E$ is defined by
[ E = \hat{t} R. ]
Proof. Let $X_1, X_2$ be the coordinates of $P$ in the two camera frames, related by
[ X_2 = R X_1 + t. ]
Substituting $X_i = \lambda_i x_i$ gives
[ \lambda_2 x_2 = R (\lambda_1 x_1) + t. ]
Multiplying both sides on the left by $\hat{t}$ (equivalently, taking the cross product with $t$):
[ \begin{aligned} \lambda_2 \hat{t} x_2 &= \hat{t} R (\lambda_1 x_1) + \hat{t} t, \\ \lambda_2 (t \times x_2) &= t \times \bigl( R (\lambda_1 x_1) \bigr) + (t \times t), \\ \lambda_2 (t \times x_2) &= \hat{t} R (\lambda_1 x_1), \end{aligned} ]
since $t \times t = 0$. Next, multiply on the left by $x_2^\top$:
[ \lambda_2 \, x_2^\top (t \times x_2) = x_2^\top \hat{t} R (\lambda_1 x_1). ]
But $x_2^\top (t \times x_2) = x_2 \cdot (t \times x_2) = 0$, since $t \times x_2$ is perpendicular to $x_2$. Because $\lambda_1 > 0$, it follows that
[ x_2^\top \hat{t} R \, x_1 = x_2^\top ( \hat{t} R ) x_1 = 0, ]
which completes the proof.
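The constraint is easy to verify numerically; the following sanity check (with an arbitrary pose and point, chosen here for illustration) builds $E = \hat{t} R$ and evaluates $x_2^\top E x_1$:

```python
import numpy as np

def hat(v):
    """Skew-symmetric matrix of v, so that hat(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

# Arbitrary relative pose: rotation about y plus a translation
theta = 0.3
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.5, 0.1, 0.0])

X1 = np.array([1.0, 2.0, 5.0])   # point P in camera-1 coordinates
X2 = R @ X1 + t                  # the same point in camera-2 coordinates
x1, x2 = X1 / X1[2], X2 / X2[2]  # normalized image coordinates

E = hat(t) @ R
residual = x2 @ E @ x1           # vanishes by Theorem 3.1
```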
The essential matrix is
[ E = \hat{t} R, ]
which encodes the relative pose between two cameras (the translation $t$ and the rotation $R$). The set of all such matrices is called the essential space:
[ \mathcal{E} = \{ \hat{t} R \mid R \in SO(3),\ t \in \mathbb{R}^3 \}. ]
Claim.
A non-zero matrix $E \in \mathbb{R}^{3 \times 3}$ is an essential matrix if and only if $E$ has a singular value decomposition (SVD) of the form
[ E = U \Sigma V^\top, ]
where
[ \Sigma = \mathrm{diag}\{\sigma, \sigma, 0\} = \begin{bmatrix} \sigma & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 0 \end{bmatrix}, ]
for some $\sigma > 0$ and $U, V \in SO(3)$.

Proof idea:
- $E = \hat{t} R$ implies certain rank and skew properties in $E$.
- By applying a suitable rotation, we can place $t$ in the canonical form $(0, 0, \|t\|)^\top$, revealing that $E$ has two equal non-zero singular values and one zero singular value.
- Reversing this process shows that any matrix with exactly two identical non-zero singular values and one zero singular value can be written as $\hat{t} R$ with $R \in SO(3)$.
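This characterization can be checked numerically: for a synthetic pose (chosen here for illustration), the singular values of $\hat{t} R$ come out as $(\|t\|, \|t\|, 0)$:

```python
import numpy as np

def hat(v):
    """Skew-symmetric (hat) matrix of v."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0],  # rotation about z
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
t = np.array([0.3, -0.2, 0.6])

E = hat(t) @ R
s = np.linalg.svd(E, compute_uv=False)  # expect (sigma, sigma, 0), sigma = ||t||
```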
Claim — (Pose recovery from $E$)
There are only two possible relative poses $(R, t)$ with $R \in SO(3)$ and $t \in \mathbb{R}^3$ corresponding to a non-zero essential matrix $E \in \mathcal{E}$.

Proof idea:
- Suppose $E = \hat{t}_1 R_1 = \hat{t}_2 R_2$. Then
[ \hat{t}_1 = \hat{t}_2 R_2 R_1^\top. ]
- Because $R_2 R_1^\top$ is itself a rotation, a lemma about skew-symmetric transformations (Lemma 3.5 below) shows that $R_2 R_1^\top$ is either the identity or a rotation by $\pi$.
- Hence, there are exactly two distinct $(R, t)$ pairs that give the same essential matrix $E$.
Remark — (Pose Recovery)
Both $E$ and $-E$ satisfy the same epipolar constraints. Hence, in total, there are $2 \times 2 = 4$ possible $(R, t)$ solutions. However, only one of these yields positive depth for all points in both cameras, so the other three are physically invalid.
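The four candidates can be enumerated from the SVD of $E$ (a sketch under the conventions above; `decompose_essential` is a name chosen here, and OpenCV's `cv.decomposeEssentialMat` performs the same factorization):

```python
import numpy as np

def hat(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def decompose_essential(E):
    """Return the four candidate (R, t) pairs encoded by E (t up to scale)."""
    U, _, Vt = np.linalg.svd(E)
    # Flip the third columns/rows if needed so U, V are proper rotations;
    # this leaves E unchanged because the third singular value is zero
    if np.linalg.det(U) < 0:
        U = U @ np.diag([1, 1, -1])
    if np.linalg.det(Vt) < 0:
        Vt = np.diag([1, 1, -1]) @ Vt
    W = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])  # rotation by pi/2 about z
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Example: the true pose appears among the four candidates
phi = 0.4
R_true = np.array([[np.cos(phi), -np.sin(phi), 0],
                   [np.sin(phi),  np.cos(phi), 0],
                   [0, 0, 1]])
t_true = np.array([0.0, 0.0, 1.0])  # unit translation
candidates = decompose_essential(hat(t_true) @ R_true)
```

Triangulating a point with each candidate and keeping the one with positive depth in both cameras resolves the ambiguity.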
To solve for the essential matrix, write its entries as
[ E = \begin{bmatrix} e_1 & e_2 & e_3 \\ e_4 & e_5 & e_6 \\ e_7 & e_8 & e_9 \end{bmatrix}. ]
This can be stored as a 9-vector $e = (e_1, \ldots, e_9)^\top$, and each correspondence $(x_1, x_2)$ turns the epipolar constraint
[ x_2^\top E x_1 = 0 ]
into one linear equation in $e$. If we have $n$ correspondences, stacking these equations row by row yields the homogeneous linear system
[ A e = 0, ]
where $A \in \mathbb{R}^{n \times 9}$ is built from the coordinates of the matched points. With at least 8 well-chosen correspondences (hence "eight-point algorithm"), the null space of $A$ is one-dimensional and $e$ is determined up to scale.

Real-world data is noisy, so the solution is taken in the least-squares sense, as the right singular vector of $A$ associated with its smallest singular value; the resulting matrix generally does not lie exactly in the essential space $\mathcal{E}$ and must be projected onto it.
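A minimal NumPy sketch of this linear solve on synthetic, noise-free data (no normalization or RANSAC; in practice OpenCV's `cv.findEssentialMat` would be used):

```python
import numpy as np

def hat(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def eight_point(x1s, x2s):
    """Linear estimate of E (up to scale) from n >= 8 correspondences,
    each given as a normalized homogeneous image point [x, y, 1]."""
    # x2^T E x1 = kron(x1, x2) . e, with e the column-stacked E,
    # so every correspondence contributes one row of A
    A = np.array([np.kron(x1, x2) for x1, x2 in zip(x1s, x2s)])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3, order="F")  # least-squares null vector -> E

# Synthetic scene: non-coplanar points seen from two poses
theta = 0.2
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.3, -0.1, 0.1])
X1s = np.array([[0, 0, 5], [1, 0, 4], [0, 1, 6], [1, 1, 5], [-1, 0, 4],
                [0, -1, 5], [2, 1, 7], [1, -1, 6], [2, -1, 8], [-1, 1, 5]], float)
x1s = X1s / X1s[:, 2:3]        # normalized points in camera 1
X2s = X1s @ R.T + t
x2s = X2s / X2s[:, 2:3]        # normalized points in camera 2

E_true = hat(t) @ R
E_est = eight_point(x1s, x2s)  # agrees with E_true up to scale and sign
```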
Claim — (Projection)
Let $E' \in \mathbb{R}^{3 \times 3}$ have SVD
[ E' = U \, \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3) \, V^\top, ]
with $U, V \in SO(3)$ and $\lambda_1 \ge \lambda_2 \ge \lambda_3$. The matrix $E \in \mathcal{E}$ that minimizes $\|E - E'\|_F^2$ is
[ E = U \, \mathrm{diag}(\sigma, \sigma, 0) \, V^\top, \quad \text{where} \quad \sigma = \frac{\lambda_1 + \lambda_2}{2}. ]

Proof idea:
- We want to minimize $\|E - E'\|_F^2$ subject to $E$ having the structure $\hat{t} R$.
- The best choice keeps the same $U, V$ and clamps the singular values to $(\sigma, \sigma, 0)$.
- One checks via trace arguments that this leads to the smallest Frobenius norm difference.
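In code, the projection is immediate once the SVD is computed (a sketch; the function name is chosen here):

```python
import numpy as np

def project_to_essential(Ep):
    """Project Ep onto the essential space: the nearest matrix in
    Frobenius norm, with singular values (sigma, sigma, 0)."""
    U, s, Vt = np.linalg.svd(Ep)
    sigma = (s[0] + s[1]) / 2
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt

# Example: nudge a noisy 3x3 matrix back onto the essential space
rng = np.random.default_rng(0)
E_noisy = rng.standard_normal((3, 3))
E_proj = project_to_essential(E_noisy)
```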
After recovering $E$, we decompose it into the four candidate $(R, t)$ pairs and keep the one that yields positive depth for the triangulated points. The two-fold rotation ambiguity rests on the following lemma.
Lemma 3.5.
Let $\hat{T} \in so(3)$ be non-zero (thus $T \in \mathbb{R}^3 \setminus \{0\}$). If for some $R \in SO(3)$ the product $\hat{T} R$ is also skew-symmetric, then $R$ must be either the identity $I$ or a rotation by $\pi$ about the axis $u = T / \|T\|$. Furthermore,
[ \hat{T} \, e^{\hat{u} \pi} = -\hat{T}. ]
Proof Idea:
- Assume $\hat{T} R$ is skew-symmetric. Then $(\hat{T} R)^\top = -\hat{T} R$.
- Use the orthonormality of $R$ to see how it commutes with $\hat{T}$.
- Show that $R$ must be a rotation by $0$ or $\pi$ around the direction of $T$.
- The relation $\hat{T} e^{\hat{u} \pi} = -\hat{T}$ follows.
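The final relation can be verified numerically using Rodrigues' formula, which at angle $\pi$ gives $e^{\hat{u} \pi} = I + 2 \hat{u}^2$ (the vector $T$ below is arbitrary):

```python
import numpy as np

def hat(v):
    """Skew-symmetric (hat) matrix of v."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

T = np.array([1.0, -2.0, 0.5])
u = T / np.linalg.norm(T)

# Rodrigues' formula at theta = pi: sin(pi) = 0, 1 - cos(pi) = 2
R_pi = np.eye(3) + 2 * (hat(u) @ hat(u))

lhs = hat(T) @ R_pi  # equals -hat(T) by Lemma 3.5
```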