3D Gaussian Splatting · End-to-End Scene Reconstruction

Structure from Motion: Camera Registration & Bundle Adjustment

18 min read

By the end of this reading you will be able to:

Explain how COLMAP solves the SfM problem via feature extraction, essential matrix estimation, incremental registration, and bundle adjustment using Schur complement sparse factorization
Distinguish COLMAP's incremental SfM from GLOMAP's global SfM in terms of pipeline structure, failure modes, and computational scaling
Explain how bundle adjustment uses the Schur complement to exploit camera-point sparsity and state the resulting computational cost reduction compared to dense Gauss-Newton
Identify how SfM output quality (point track length, reprojection error, point density) propagates to 3DGS initialization quality and downstream reconstruction performance

The Structure from Motion Problem

Given a set of $m$ unordered images $\{I_1, \ldots, I_m\}$, Structure from Motion (SfM) simultaneously recovers:

Camera intrinsics $K_i$ — focal length, principal point, distortion coefficients for each camera
Camera extrinsics $\{R_i, \mathbf{t}_i\}$ — rotation and translation of each camera in world coordinates
3D point cloud $\{\mathbf{X}_j\}$ — world-frame 3D positions of distinctive scene features

For COLMAP's PINHOLE camera model (the model used in the Bamburgh Castle pipeline), which has separate focal lengths $f_x, f_y$, a principal point $(c_x, c_y)$, and no distortion terms, the projection of world point $\mathbf{X} \in \mathbb{R}^3$ to image coordinates $\mathbf{p} \in \mathbb{R}^2$ is:

$\lambda \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \begin{bmatrix} R \mid \mathbf{t} \end{bmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \quad K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$

where $\lambda$ is the projective depth (not to be confused with the 3DGS transmittance accumulator). SfM must estimate all unknowns jointly. This is a highly nonlinear problem with $6m + 3n$ pose and structure unknowns for $m$ cameras and $n$ 3D points, plus the intrinsics of each camera.
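As a concrete illustration of the projection equation, here is a minimal Python sketch. The function name `project_pinhole` and all numeric values are assumptions made up for the example, not values from the Bamburgh pipeline. Intrinsics are passed as (fx, fy, cx, cy); setting fx = fy corresponds to a single shared focal length.

```python
def project_pinhole(intrinsics, R, t, X):
    """Project world point X to pixels with a distortion-free pinhole model.

    intrinsics: (fx, fy, cx, cy); R: 3x3 rotation as nested lists;
    t: translation; X: world point. Returns (u, v, depth), where depth
    plays the role of the projective depth lambda.
    """
    fx, fy, cx, cy = intrinsics
    # World frame -> camera frame: X_cam = R X + t.
    x_cam = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    depth = x_cam[2]
    if depth <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division, then scale and shift by the intrinsics.
    u = fx * x_cam[0] / depth + cx
    v = fy * x_cam[1] / depth + cy
    return u, v, depth


# Hypothetical camera at the world origin looking down +Z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v, depth = project_pinhole((1000.0, 1000.0, 960.0, 540.0),
                              identity, [0.0, 0.0, 0.0], [1.0, 0.5, 10.0])
print(u, v, depth)  # 1060.0 590.0 10.0
```

The point sits 10 units in front of the camera, so its offsets of 1.0 and 0.5 shrink to 0.1 and 0.05 before being scaled by the focal length and shifted to the principal point.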

Feature Extraction and Matching

SfM begins by detecting and describing local feature points in each image. COLMAP uses SIFT (Scale-Invariant Feature Transform), which detects difference-of-Gaussians (DoG) extrema across scale space and describes each keypoint with a 128-dimensional histogram of gradient orientations. For $m$ images, exhaustive matching considers $O(m^2)$ image pairs.

In the Bamburgh Castle pipeline, COLMAP's feature_extractor and exhaustive_matcher process 1,179 images, producing 133,000 image pairs to evaluate. Of those, 6,058 are flagged as geometrically invalid before the relative pose stage even begins — a 4.5% pruning that eliminates degenerate pairs (panoramic rotation, repetitive textures, low overlap).

Relative Pose Estimation and the Essential Matrix

For each candidate image pair $(i, j)$, SfM estimates the relative pose $(R_{ij}, \mathbf{t}_{ij})$ from point correspondences $\{(\mathbf{p}_k, \mathbf{p}_k')\}$.

The essential matrix $E = [\mathbf{t}_{ij}]_\times R_{ij}$ encodes the epipolar constraint between corresponding points, expressed in normalized (undistorted) image coordinates $\hat{\mathbf{p}} = K^{-1}\mathbf{p}$:

$\hat{\mathbf{p}}'^\top E\, \hat{\mathbf{p}} = 0$
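The constraint can be checked numerically. The sketch below builds $E = [\mathbf{t}]_\times R$ for a made-up relative pose, projects one synthetic point into both cameras, and evaluates the epipolar residual. The pose, the point, and all helper names are assumptions for illustration; the second camera is assumed to see the point at $R\mathbf{X} + \mathbf{t}$.

```python
import math


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def essential_matrix(R, t):
    return matmul(skew(t), R)


def normalize(x_cam):
    """Normalized homogeneous image coordinates (x/z, y/z, 1)."""
    return [x_cam[0] / x_cam[2], x_cam[1] / x_cam[2], 1.0]


# Hypothetical relative pose: 0.1 rad about the y axis, mostly sideways motion.
angle = 0.1
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [1.0, 0.0, 0.2]

x_first = [0.3, -0.2, 4.0]  # synthetic point in the first camera's frame
x_second = [a + b for a, b in zip(matvec(R, x_first), t)]

E = essential_matrix(R, t)
residual = dot(normalize(x_second), matvec(E, normalize(x_first)))
print(abs(residual) < 1e-12)  # True: the constraint holds up to rounding
```

With noisy real correspondences the residual is not exactly zero, which is why the RANSAC loop compares it against a tolerance.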

For calibrated cameras, the 5-point algorithm (Nistér 2004) recovers $E$ from a minimal set of 5 point correspondences, returning up to 10 candidate solutions. In practice, RANSAC wraps this solver:

Draw 5 random correspondences; solve for $E$ candidates.
For each candidate, count inliers (pairs satisfying $|\hat{\mathbf{p}}'^\top E\, \hat{\mathbf{p}}| < \epsilon$).
Keep the $E$ with the most inliers. Repeat until the probability of missing the true model falls below $\delta$.
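The stopping rule in the last step has a standard closed form. If a fraction $w$ of the correspondences are inliers, one 5-point sample is all-inlier with probability $w^5$, so the number of draws $N$ needed to push the failure probability below $\delta$ satisfies $N \geq \log \delta / \log(1 - w^5)$. The sketch below evaluates this bound; the function name and default values are assumptions, and $w$ is a symbol introduced here for the example.

```python
import math


def ransac_iterations(inlier_ratio, sample_size=5, failure_prob=0.01):
    """Draws needed so that P(no all-inlier sample) <= failure_prob."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < failure_prob < 1.0:
        raise ValueError("failure_prob must be in (0, 1)")
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1  # every sample is all-inlier
    if p_good_sample <= 0.0:
        raise ValueError("inlier_ratio too small to represent")
    # log1p keeps precision when p_good_sample is tiny.
    return math.ceil(math.log(failure_prob) / math.log1p(-p_good_sample))


for ratio in (0.9, 0.7, 0.5, 0.3):
    print(ratio, ransac_iterations(ratio))
```

The count grows quickly as the inlier ratio drops: at 50% inliers it is on the order of 150 draws for 99% confidence. Keeping the sample as small as 5 correspondences is what keeps this exponent manageable.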

The GLOMAP output reports a key filtering step: of 121,023 pairs that entered relative pose estimation, 19,797 (about 16%) were rejected for having fewer than 30 inlier correspondences, and 0 were rejected for low inlier ratio. This suggests the matching stage had already filtered the worst pairs: among the survivors, only inlier count, not inlier fraction, separates usable pairs from rejected ones.

COLMAP vs. GLOMAP: Incremental vs. Global SfM

COLMAP (incremental SfM):

Select a seed image pair with high overlap and wide baseline.
Triangulate an initial point cloud from the seed pair.
Register new cameras one at a time via PnP (Perspective-n-Point).
Run bundle adjustment after each batch of new cameras.
Repeat until all cameras are registered.

This is robust but slow for large scenes: bundle adjustment is re-run as cameras are added, and pose error can accumulate (drift) as the reconstruction grows. COLMAP is the commented-out alternative in the Bamburgh pipeline (#!colmap mapper).

GLOMAP (global SfM) — used in the Bamburgh Castle pipeline:

Rotation averaging: estimate all absolute rotations $\{R_i\}$ globally from relative rotations $\{R_{ij}\}$. GLOMAP uses an iterative L1 synchronization that is robust to outlier relative poses.
Track establishment: merge 2D correspondences across images into 3D tracks, i.e. sets of observations that refer to the same world point. Of 22,813 candidate tracks, 705 are discarded for inconsistency (two different observations from the same image assigned to the same track). After GLOMAP's subsequent track selection, 16,451 tracks remain.
Global positioning: estimate all camera translations $\{\mathbf{t}_i\}$ jointly with the 3D point positions by solving a large sparse nonlinear least-squares problem. The Ceres report shows: Initial cost $1.9 \times 10^6$ → Final cost $1.3 \times 10^1$ in 54 iterations, a reduction of roughly $10^5\times$, indicating very tight convergence.
Bundle adjustment: refine all poses and 3D points jointly.

For 1,179 images, GLOMAP registers 585 cameras — 49.6% of the input. The remaining images fail because they have insufficient feature overlap with registered cameras, or their features were entirely in the rejected tracks.

Bundle Adjustment

Bundle adjustment (BA) is the core optimization in all SfM pipelines. It minimizes the total reprojection error:

$\mathcal{E}_{BA} = \sum_{i,j} \rho\!\left( \left\| \pi(K_i, R_i, \mathbf{t}_i, \mathbf{X}_j) - \mathbf{p}_{ij} \right\|^2 \right)$

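where $\pi$ is the projection function, $\mathbf{p}_{ij}$ is the observed 2D keypoint of point $j$ in camera $i$, the sum runs only over pairs $(i, j)$ for which that observation exists, and $\rho$ is a robust loss (Huber or Cauchy) that down-weights outlier correspondences.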
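To make the role of $\rho$ concrete, the sketch below evaluates a robustified reprojection cost for a handful of observations. It uses the Ceres-style convention in which $\rho$ takes the squared residual norm $s$, with Huber defined as $\rho(s) = s$ for $s \leq a^2$ and $2a\sqrt{s} - a^2$ otherwise. The threshold $a$ (here `delta`), the function names, and the pixel values are assumptions for the example.

```python
import math


def huber(s, delta=1.0):
    """Huber loss applied to a squared residual norm s."""
    if s <= delta * delta:
        return s  # inlier regime: plain squared error
    return 2.0 * delta * math.sqrt(s) - delta * delta  # outlier regime: linear growth


def robust_reprojection_cost(predicted, observed, delta=1.0):
    """Sum of rho(||predicted - observed||^2) over all observations."""
    total = 0.0
    for (u_pred, v_pred), (u_obs, v_obs) in zip(predicted, observed):
        squared_error = (u_pred - u_obs) ** 2 + (v_pred - v_obs) ** 2
        total += huber(squared_error, delta)
    return total


# Two made-up observations: a 1-pixel error and a 5-pixel outlier.
predicted = [(100.0, 200.0), (300.0, 400.0)]
observed = [(101.0, 200.0), (303.0, 404.0)]
print(robust_reprojection_cost(predicted, observed, delta=1.0))  # 10.0
```

Without the robust loss the two residuals would contribute 1 + 25 = 26; with Huber the outlier contributes 9 instead of 25, so a single bad match pulls the solution far less.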

The Jacobian $J$ is large but sparse: each residual depends on exactly one camera and one 3D point, so $\mathbf{X}_j$ appears only in the rows of the cameras that observe it. The normal equations $J^\top J\, \delta = -J^\top r$ are solved using the Schur complement trick:

$\left( B - C A^{-1} C^\top \right) \delta_c = -(\mathbf{g}_c - C A^{-1} \mathbf{g}_p)$

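where $A$ is the block-diagonal point sub-matrix (one $3 \times 3$ block per point, so inverting it costs only $O(n)$), $B$ is the camera sub-matrix, $C$ couples them, $\mathbf{g}_c$ and $\mathbf{g}_p$ are the camera and point parts of the gradient $J^\top r$, and $\delta_c$ is the camera update. The point update then follows by back-substitution. Solving the reduced camera system costs $O(m^3)$ instead of $O((m+n)^3)$ for the full system, which is what makes thousands of cameras with millions of points practical.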
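The elimination can be verified on a toy system. The sketch below is a deliberately simplified stand-in, not how Ceres implements it: each camera and each point is reduced to a single scalar unknown (real BA uses larger camera blocks and $3 \times 3$ point blocks), with 2 cameras and 3 points. All matrix entries are made up. It solves the reduced camera system, back-substitutes the points, and checks the result against the full normal equations.

```python
def solve_via_schur(B, C, A_diag, g_c, g_p):
    """Solve [[B, C], [C^T, A]] [dc; dp] = -[g_c; g_p] with A diagonal.

    B: 2x2 camera block, C: 2xn coupling, A_diag: n point diagonal entries.
    """
    n = len(A_diag)
    # Reduced camera system S = B - C A^{-1} C^T (A^{-1} is elementwise).
    S = [[B[i][k] - sum(C[i][j] * C[k][j] / A_diag[j] for j in range(n))
          for k in range(2)] for i in range(2)]
    rhs = [-(g_c[i] - sum(C[i][j] * g_p[j] / A_diag[j] for j in range(n)))
           for i in range(2)]
    # Closed-form solve of the 2x2 reduced system.
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    dc = [(S[1][1] * rhs[0] - S[0][1] * rhs[1]) / det,
          (S[0][0] * rhs[1] - S[1][0] * rhs[0]) / det]
    # Back-substitution: A dp = -(g_p + C^T dc).
    dp = [-(g_p[j] + sum(C[i][j] * dc[i] for i in range(2))) / A_diag[j]
          for j in range(n)]
    return dc, dp


def full_system_residual(B, C, A_diag, g_c, g_p, dc, dp):
    """Largest absolute entry of H delta + g for the full system."""
    n = len(A_diag)
    cam_rows = [sum(B[i][k] * dc[k] for k in range(2))
                + sum(C[i][j] * dp[j] for j in range(n)) + g_c[i]
                for i in range(2)]
    point_rows = [sum(C[i][j] * dc[i] for i in range(2))
                  + A_diag[j] * dp[j] + g_p[j] for j in range(n)]
    return max(abs(x) for x in cam_rows + point_rows)


# Camera 0 sees points 0 and 2; camera 1 sees points 1 and 2.
B = [[4.0, 0.0], [0.0, 5.0]]
C = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
A_diag = [2.0, 3.0, 4.0]
g_c = [1.0, -2.0]
g_p = [0.5, -1.0, 2.0]

dc, dp = solve_via_schur(B, C, A_diag, g_c, g_p)
print(full_system_residual(B, C, A_diag, g_c, g_p, dc, dp) < 1e-12)  # True
```

To get a sense of scale, counting six pose parameters per camera and three coordinates per point, 585 cameras and 11,902 points would give 6·585 + 3·11,902 = 39,216 unknowns in the full system but only 3,510 in the reduced camera system.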

GLOMAP runs 3 iterations of bundle adjustment, each with a position-only sub-stage followed by a full-parameter stage. The log excerpt below covers four of these stages:

BA Iteration	Initial Cost	Final Cost	Solver Iterations
1 (pos-only)	2.80 × 10⁵	1.37 × 10⁵	138
1 (full)	1.37 × 10⁵	1.27 × 10⁵	85
2 (pos-only)	7.60 × 10⁴	7.27 × 10⁴	21
3 (full)	6.08 × 10⁴	6.07 × 10⁴	13

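The relative cost reduction shrinks from stage to stage as the solution tightens, as expected when a Gauss-Newton-type solver approaches a minimum. The initial cost also drops between BA iterations, which is consistent with the track filtering applied after each one.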
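The trend is easy to quantify from the table. The short sketch below computes the relative cost reduction for each reported stage; the costs are copied from the table and the helper name is an assumption.

```python
# (stage, initial cost, final cost) as reported in the table.
stages = [
    ("1 (pos-only)", 2.80e5, 1.37e5),
    ("1 (full)", 1.37e5, 1.27e5),
    ("2 (pos-only)", 7.60e4, 7.27e4),
    ("3 (full)", 6.08e4, 6.07e4),
]


def relative_reduction(initial, final):
    """Fraction of the initial cost removed by the stage."""
    return (initial - final) / initial


reductions = [relative_reduction(initial, final) for _, initial, final in stages]
for (name, _, _), reduction in zip(stages, reductions):
    print(f"{name}: {100.0 * reduction:.1f}%")
```

The reductions come out at roughly 51%, 7%, 4%, and 0.2%, so each stage removes a smaller share of the remaining cost than the one before it.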

Track Filtering and Triangulation Angle

After each BA iteration, GLOMAP filters tracks by reprojection error (removing points with high residuals) and by triangulation angle. Two cameras observing the same point form a baseline; the angle between the two rays to the point is the triangulation angle:

$\theta = \arccos\left( \frac{\mathbf{r}_i \cdot \mathbf{r}_j}{\|\mathbf{r}_i\| \|\mathbf{r}_j\|} \right)$

Here $\mathbf{r}_i$ and $\mathbf{r}_j$ are the rays from the two camera centers to the point. A small $\theta$ (nearly parallel rays) produces a poorly conditioned triangulation: the estimated 3D point is sensitive to small errors in the ray directions. GLOMAP discards the 4,549 of 16,451 tracks whose $\theta$ falls below a threshold (~1°), retaining 11,902 high-confidence 3D points.
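The angle test is straightforward to reproduce. The sketch below computes $\theta$ for one point seen from two camera centers and compares a nearby point with a distant one. The function name, the camera positions, and the points are assumptions for illustration, and the 1° cutoff simply mirrors the approximate threshold quoted above.

```python
import math


def triangulation_angle_deg(center_i, center_j, point):
    """Angle in degrees between the rays from two camera centers to a point."""
    ray_i = [p - c for p, c in zip(point, center_i)]
    ray_j = [p - c for p, c in zip(point, center_j)]
    dot = sum(a * b for a, b in zip(ray_i, ray_j))
    norm_i = math.sqrt(sum(a * a for a in ray_i))
    norm_j = math.sqrt(sum(a * a for a in ray_j))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_i * norm_j)))
    return math.degrees(math.acos(cos_theta))


# Two hypothetical cameras one unit apart along the x axis.
cam_a = (0.0, 0.0, 0.0)
cam_b = (1.0, 0.0, 0.0)

near = triangulation_angle_deg(cam_a, cam_b, (0.5, 0.0, 0.5))
far = triangulation_angle_deg(cam_a, cam_b, (0.5, 0.0, 100.0))
print(near, far, far < 1.0)
```

The nearby point is seen under a wide angle, while the point 100 baselines away falls under 1° and would be filtered out: with such a short baseline relative to depth, its position along the rays is barely constrained.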

The final SfM output in sparse/0/ consists of cameras.txt (camera intrinsics), images.txt (camera poses and 2D observations), and points3D.txt (the sparse point cloud). Together they initialize 3DGS training with 14,452 structured seed points.
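The quality indicators mentioned in the learning objectives (track length and reprojection error) can be read directly from points3D.txt. The sketch below assumes the COLMAP text layout, in which each data line holds POINT3D_ID, X, Y, Z, R, G, B, ERROR followed by (IMAGE_ID, POINT2D_IDX) pairs. The function name and the two sample lines are made up for the example and are not output from the Bamburgh run.

```python
def summarize_points3d(text):
    """Summarize point count, mean track length, and mean reprojection error."""
    track_lengths = []
    errors = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        # Columns 0-7: POINT3D_ID X Y Z R G B ERROR; the rest are track pairs.
        errors.append(float(fields[7]))
        track_lengths.append((len(fields) - 8) // 2)
    count = len(errors)
    if count == 0:
        raise ValueError("no 3D points found")
    return {
        "num_points": count,
        "mean_track_length": sum(track_lengths) / count,
        "mean_reprojection_error": sum(errors) / count,
    }


# Hypothetical two-point file contents.
sample = """# 3D point list with one line of data per point:
#   POINT3D_ID, X, Y, Z, R, G, B, ERROR, TRACK[] as (IMAGE_ID, POINT2D_IDX)
1 0.5 1.0 2.0 128 128 128 0.45 1 10 2 20 3 30
2 -1.0 0.2 3.5 200 180 160 1.10 4 5 7 8
"""
summary = summarize_points3d(sample)
print(summary)
```

On a real reconstruction, long tracks and low mean error are a quick sanity check before starting 3DGS training; to use it, read the file with `open(path).read()` and pass the contents in.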

Why This Matters for 3DGS

The quality of the SfM reconstruction directly determines 3DGS training quality:

Poor camera pose estimates → the photometric loss gradient is incoherent across views → slow convergence, floater artifacts
Dense, accurate 3D points → better initial Gaussian placement → fewer densification iterations needed
Correct camera intrinsics → the rasterizer projects Gaussians correctly → sharp renders vs. systematically blurred

The Bamburgh Castle pipeline achieves 29.0 dB PSNR at 30k iterations, a strong result for an unconstrained video capture with varying lighting and motion blur, helped in large part by GLOMAP's robust reconstruction from 585 well-registered cameras.

References

Lowe 2004. Distinctive Image Features from Scale-Invariant Keypoints (SIFT)

Nistér 2004. An Efficient Solution to the Five-Point Relative Pose Problem

Triggs et al. 2000. Bundle Adjustment: A Modern Synthesis

Schönberger & Frahm 2016. Structure-from-Motion Revisited (COLMAP)

Pan et al. 2024. Global Structure-from-Motion Revisited (GLOMAP)
