Structure from Motion: Camera Registration & Bundle Adjustment
- Explain how COLMAP solves the SfM problem via feature extraction, essential matrix estimation, incremental registration, and bundle adjustment using Schur complement sparse factorization
- Distinguish COLMAP's incremental SfM from GLOMAP's global SfM in terms of pipeline structure, failure modes, and computational scaling
- Explain how bundle adjustment uses the Schur complement to exploit camera-point sparsity and state the resulting computational cost reduction compared to dense Gauss-Newton
- Identify how SfM output quality (point track length, reprojection error, point density) propagates to 3DGS initialization quality and downstream reconstruction performance
The Structure from Motion Problem
Given a set of unordered images $\{I_1, \dots, I_N\}$, Structure from Motion (SfM) simultaneously recovers:
- Camera intrinsics — focal length, principal point, distortion coefficients for each camera
- Camera extrinsics — rotation and translation of each camera in world coordinates
- 3D point cloud — world-frame 3D positions of distinctive scene features
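For a PINHOLE camera (the model used in the Bamburgh Castle pipeline), the projection of world point $X \in \mathbb{R}^3$ to image coordinates $(u, v)$ is:
$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, [R \mid t] \begin{bmatrix} X \\ 1 \end{bmatrix}, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
where $\lambda$ is the projective depth (not to be confused with the 3DGS transmittance accumulator). SfM must estimate all unknowns jointly: this is a highly nonlinear problem with roughly $6N + 3M$ unknowns for $N$ cameras and $M$ 3D points, plus the intrinsics.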
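The projection above can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline's code: the function name `project_pinhole` and the intrinsics values are made up for the example, and the camera is placed at the world origin for simplicity.

```python
import numpy as np

def project_pinhole(K, R, t, X):
    """Project world points X (n, 3) to pixel coordinates (n, 2).

    Returns the pixels and the projective depth of each point.
    """
    X_cam = X @ R.T + t            # world frame -> camera frame
    depth = X_cam[:, 2]            # projective depth (z in the camera frame)
    x_h = X_cam @ K.T              # homogeneous pixel coordinates, scaled by depth
    return x_h[:, :2] / depth[:, None], depth

# Hypothetical intrinsics: fx = fy = 1000 px, principal point (960, 540).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # camera axes aligned with the world axes
t = np.zeros(3)                    # camera centre at the world origin
X = np.array([[0.0, 0.0, 5.0],
              [1.0, -0.5, 10.0]])

pixels, depth = project_pinhole(K, R, t, X)
print(pixels)
print(depth)
```

A point on the optical axis lands on the principal point, and the second point is offset from it by `f * X / Z` and `f * Y / Z` pixels.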
Feature Extraction and Matching
SfM begins by detecting and describing local feature points in each image. COLMAP uses SIFT (Scale-Invariant Feature Transform), which detects difference-of-Gaussians (DoG) extrema across scale-space and describes each keypoint with a 128-dimensional histogram of gradient orientations. For $N$ images, exhaustive matching considers $N(N-1)/2 = O(N^2)$ pairs.
In the Bamburgh Castle pipeline, COLMAP's feature_extractor and exhaustive_matcher process 1,179 images, producing 133,000 image pairs to evaluate. Of those, 6,058 are flagged as geometrically invalid before the relative pose stage even begins: a pruning of about 4.6% that eliminates degenerate pairs (panoramic rotation, repetitive textures, low overlap).
Relative Pose Estimation and the Essential Matrix
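For each candidate image pair $(i, j)$, SfM estimates the relative pose $(R_{ij}, t_{ij})$ from point correspondences $\{(x_k, x'_k)\}$.
The essential matrix $E = [t_{ij}]_\times R_{ij}$ encodes this relationship in normalized (undistorted) image coordinates $\hat{x} = K^{-1} x$:
$$\hat{x}'^\top E \, \hat{x} = 0$$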
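As a sanity check on the epipolar constraint, the sketch below builds an essential matrix from a known relative pose and confirms that noise-free correspondences give a zero residual. It assumes the convention that a point in the second camera frame is `R @ X + t`; the pose and the random points are synthetic values chosen for illustration.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_pose(R, t):
    """Essential matrix E = [t]_x R for the relative pose (R, t)."""
    return skew(t) @ R

# Synthetic relative pose: 10 degree rotation about the y axis plus a baseline.
a = np.deg2rad(10.0)
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([1.0, 0.0, 0.2])

# Random 3D points in front of camera 1, expressed in its frame.
rng = np.random.default_rng(0)
X1 = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(20, 3))
X2 = X1 @ R.T + t                      # the same points in the camera 2 frame

x1 = X1 / X1[:, 2:3]                   # normalized image coordinates (z = 1)
x2 = X2 / X2[:, 2:3]

E = essential_from_pose(R, t)
residuals = np.einsum("ni,ij,nj->n", x2, E, x1)   # x2^T E x1 per correspondence
print(np.abs(residuals).max())
```

With real matches the residual is never exactly zero, which is why the estimation step needs an inlier threshold.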
For calibrated cameras, the 5-point algorithm (Nistér 2004) recovers $E$ from exactly 5 point correspondences. In practice, RANSAC wraps this solver:
- Draw 5 random correspondences; solve for up to 10 candidate $E$ matrices.
- For each candidate, count inliers (pairs whose epipolar residual $|\hat{x}'^\top E \, \hat{x}|$ falls below an error threshold $\tau$).
- Keep the $E$ with the most inliers. Repeat until the probability of missing the true model falls below a chosen tolerance $\eta$.
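The stopping rule in the last step has a standard closed form: if a fraction `w` of the correspondences are inliers, one 5-point sample is outlier-free with probability `w**5`, so `k` samples all fail with probability `(1 - w**5)**k`. The sketch below solves for the smallest `k` that pushes this below a tolerance. The function name and the default tolerance of 0.01 are illustrative assumptions, not values taken from COLMAP or GLOMAP.

```python
import math

def ransac_iterations(inlier_ratio, sample_size=5, failure_prob=0.01):
    """Smallest number of samples so that the chance of never drawing
    an all-inlier sample is at most failure_prob."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1                       # every sample is outlier-free
    return math.ceil(math.log(failure_prob) / math.log(1.0 - p_good_sample))

for w in (0.8, 0.5, 0.3):
    print(w, ransac_iterations(w))
```

The required sample count grows quickly as the inlier ratio drops, which is one reason pruning weak pairs before this stage pays off.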
The GLOMAP output reports a key filtering step: of 121,023 pairs that entered relative pose estimation, 19,797 were rejected for having fewer than 30 inlier correspondences, and 0 were rejected for low inlier ratio. This confirms the matching stage already filtered the worst pairs — only inlier count, not inlier fraction, discriminates between them.
COLMAP vs. GLOMAP: Incremental vs. Global SfM
COLMAP (incremental SfM):
- Select a seed image pair with high overlap and wide baseline.
- Triangulate an initial point cloud from the seed pair.
- Register new cameras one at a time via PnP (Perspective-n-Point).
- Run bundle adjustment after each batch of new cameras.
- Repeat until all cameras are registered.
This is robust but slow for large scenes: each registration triggers a new BA solve, and the error can propagate as the reconstruction grows. COLMAP is the commented-out alternative in the Bamburgh pipeline (#!colmap mapper).
GLOMAP (global SfM) — used in the Bamburgh Castle pipeline:
- Rotation averaging: estimate all absolute rotations $R_i$ globally from relative rotations $R_{ij}$. GLOMAP uses an iterative L1 synchronization that is robust to outlier relative poses.
- Track establishment: merge 2D correspondences across images into 3D tracks, i.e. sets of observations that refer to the same world point. Of 22,813 candidate tracks, 705 are discarded for inconsistency (more than one observation from the same image assigned to the same track), and the log reports 16,451 tracks retained, so the inconsistency check accounts for only part of the reduction.
- Global positioning: estimate all camera translations jointly by solving a large sparse least-squares problem. The Ceres report shows the cost falling from its initial to its final value in 54 iterations, indicating very tight convergence.
- Bundle adjustment: refine all poses and 3D points jointly.
For 1,179 images, GLOMAP registers 585 cameras — 49.6% of the input. The remaining images fail because they have insufficient feature overlap with registered cameras, or their features were entirely in the rejected tracks.
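The track establishment step can be illustrated with a union-find over observations. This is a sketch of the general idea under simple assumptions, not GLOMAP's implementation: an observation is a hypothetical `(image_id, keypoint_id)` tuple, and a track counts as inconsistent when it contains two observations from the same image.

```python
from collections import defaultdict

def build_tracks(pairwise_matches):
    """Merge pairwise matches into tracks.

    pairwise_matches: iterable of ((img_a, kp_a), (img_b, kp_b)).
    Returns (consistent_tracks, inconsistent_tracks).
    """
    parent = {}

    def find(obs):
        parent.setdefault(obs, obs)
        while parent[obs] != obs:
            parent[obs] = parent[parent[obs]]   # path halving
            obs = parent[obs]
        return obs

    for a, b in pairwise_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b             # union the two components

    groups = defaultdict(list)
    for obs in list(parent):
        groups[find(obs)].append(obs)

    consistent, inconsistent = [], []
    for observations in groups.values():
        images = [img for img, _ in observations]
        # A world point can project to at most one keypoint per image.
        target = consistent if len(images) == len(set(images)) else inconsistent
        target.append(sorted(observations))
    return consistent, inconsistent

matches = [
    ((0, 10), (1, 4)), ((1, 4), (2, 7)),        # one clean three-view track
    ((0, 11), (1, 5)), ((1, 5), (0, 12)),       # image 0 appears twice
]
consistent, inconsistent = build_tracks(matches)
print(consistent)
print(inconsistent)
```

Here the second chain of matches links two different keypoints of image 0 to the same track, so it is rejected.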
Bundle Adjustment
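Bundle adjustment (BA) is the core optimization in all SfM pipelines. It minimizes the total reprojection error:
$$\min_{\{K_i, R_i, t_i\},\, \{X_j\}} \; \sum_{(i,j) \in \mathcal{O}} \rho\!\left( \left\| \pi(K_i, R_i, t_i, X_j) - x_{ij} \right\|^2 \right)$$
where $\pi$ is the projection function, $x_{ij}$ is the observed 2D keypoint of point $j$ in camera $i$, $\mathcal{O}$ is the set of observations, and $\rho$ is a robust loss (Huber or Cauchy) that down-weights outlier correspondences.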
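To see what the robust loss buys, the sketch below compares a Huber cost with a plain squared cost on three reprojection residuals, one of which is a gross outlier. The Huber form used here (quadratic up to `delta`, linear beyond it) is one common scaling; the `delta` of 1 pixel and the residual values are illustrative assumptions.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss on residual norms: r^2 inside delta, linear growth outside."""
    r = np.asarray(r, dtype=float)
    return np.where(r <= delta, r ** 2, 2.0 * delta * r - delta ** 2)

def reprojection_cost(predicted, observed, delta=1.0):
    """Sum of Huber losses over per-observation reprojection errors (pixels)."""
    residual_norm = np.linalg.norm(predicted - observed, axis=1)
    return huber(residual_norm, delta).sum()

observed = np.array([[100.0, 200.0], [300.0, 400.0], [500.0, 600.0]])
offsets = np.array([[0.3, 0.4],      # 0.5 px error: a good match
                    [0.0, 0.0],      # perfect match
                    [30.0, 40.0]])   # 50 px error: an outlier
predicted = observed + offsets

robust_cost = reprojection_cost(predicted, observed, delta=1.0)
squared_cost = (np.linalg.norm(predicted - observed, axis=1) ** 2).sum()
print(robust_cost, squared_cost)
```

The outlier contributes 2500 to the squared cost but only 99 to the Huber cost, so it no longer dominates the gradient.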
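The Jacobian is large but sparse: each 3D point only appears in the cameras that observe it. The normal equations are solved using the Schur complement trick:
$$\begin{bmatrix} U & W \\ W^\top & V \end{bmatrix} \begin{bmatrix} \delta_c \\ \delta_p \end{bmatrix} = \begin{bmatrix} g_c \\ g_p \end{bmatrix} \;\Longrightarrow\; \left( U - W V^{-1} W^\top \right) \delta_c = g_c - W V^{-1} g_p, \qquad \delta_p = V^{-1} \left( g_p - W^\top \delta_c \right)$$
where $V$ is the block-diagonal point sub-matrix (trivially invertible), $U$ is the camera sub-matrix, and $W$ couples them. This reduces the solve from $O((N + M)^3)$ in the full system to $O(N^3)$ in the camera system, which is practical for thousands of cameras with millions of points.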
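The elimination can be checked numerically on a toy problem. The sketch below builds a random Jacobian with the block sparsity of bundle adjustment (each observation touches one camera and one point), forms damped normal equations, and compares the Schur-complement solution with a dense solve. The problem sizes, the damping value, and the names `U`, `W`, `V` for the camera, coupling, and point blocks are choices made for this illustration; a production solver such as Ceres uses sparse data structures rather than dense arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cams, n_pts = 3, 8
cam_dim, pt_dim = 6, 3
nc = n_cams * cam_dim                 # total camera parameters
n_params = nc + n_pts * pt_dim

# Each observation contributes 2 rows touching one camera and one point.
rows = []
for i in range(n_cams):
    for j in range(n_pts):
        block = np.zeros((2, n_params))
        block[:, i * cam_dim:(i + 1) * cam_dim] = rng.normal(size=(2, cam_dim))
        off = nc + j * pt_dim
        block[:, off:off + pt_dim] = rng.normal(size=(2, pt_dim))
        rows.append(block)
J = np.vstack(rows)
r = rng.normal(size=J.shape[0])       # stand-in residual vector

lam = 1e-2                            # Levenberg-Marquardt style damping
H = J.T @ J + lam * np.eye(n_params)
g = -J.T @ r

U, W, V = H[:nc, :nc], H[:nc, nc:], H[nc:, nc:]
g_c, g_p = g[:nc], g[nc:]

# V is block diagonal, so invert it one 3x3 point block at a time.
V_inv = np.zeros_like(V)
for j in range(n_pts):
    s = slice(j * pt_dim, (j + 1) * pt_dim)
    V_inv[s, s] = np.linalg.inv(V[s, s])

S = U - W @ V_inv @ W.T               # reduced camera system (Schur complement)
delta_c = np.linalg.solve(S, g_c - W @ V_inv @ g_p)
delta_p = V_inv @ (g_p - W.T @ delta_c)   # back-substitute for the points

delta_full = np.linalg.solve(H, g)    # dense reference solution
print(np.abs(np.concatenate([delta_c, delta_p]) - delta_full).max())
```

Only the `S` solve scales cubically, and its size depends on the number of cameras alone; the point updates are recovered by cheap per-point back-substitution.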
GLOMAP runs 3 iterations of bundle adjustment, each with a position-only sub-stage followed by a full-parameter stage. Selected stages from the log:
| BA round (stage) | Initial cost | Final cost | Solver iterations |
|---|---|---|---|
| 1 (pos-only) | 2.80 × 10⁵ | 1.37 × 10⁵ | 138 |
| 1 (full) | 1.37 × 10⁵ | 1.27 × 10⁵ | 85 |
| 2 (pos-only) | 7.60 × 10⁴ | 7.27 × 10⁴ | 21 |
| 3 (full) | 6.08 × 10⁴ | 6.07 × 10⁴ | 13 |
The cost reduction achieved in each stage shrinks as the solution tightens, which is the expected behaviour of a Gauss-Newton-type solver approaching a minimum.
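The shrinking improvement is easy to quantify from the logged costs. The snippet below computes the relative cost reduction of each listed stage, using only the initial and final values from the table; the stage labels are just dictionary keys for the example.

```python
# (stage, initial cost, final cost) as listed in the table above
stages = [
    ("1 (pos-only)", 2.80e5, 1.37e5),
    ("1 (full)", 1.37e5, 1.27e5),
    ("2 (pos-only)", 7.60e4, 7.27e4),
    ("3 (full)", 6.08e4, 6.07e4),
]

relative_drop = {name: (c0 - c1) / c0 for name, c0, c1 in stages}

for name, drop in relative_drop.items():
    print(f"{name:>13}: {100.0 * drop:5.2f}% reduction")
```

The reductions come out at roughly 51%, 7.3%, 4.3%, and 0.16%, so by the last stage the optimizer is making only marginal progress.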
Track Filtering and Triangulation Angle
After each BA iteration, GLOMAP filters tracks by reprojection error (removing points with high residuals) and by triangulation angle. Two cameras observing the same point form a baseline; the angle between the two rays to the point is the triangulation angle:
$$\theta = \arccos \frac{(X - C_1) \cdot (X - C_2)}{\|X - C_1\| \, \|X - C_2\|}$$
where $C_1$ and $C_2$ are the two camera centres and $X$ is the 3D point. A small $\theta$ (nearly parallel rays) produces a poorly conditioned triangulation: the estimated 3D point is sensitive to small errors in the ray directions. GLOMAP discards 4,549 of 16,451 tracks with $\theta$ below a threshold (~1°), retaining 11,902 high-confidence 3D points.
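The triangulation angle is simple to compute from two camera centres and a point. The sketch below uses a hypothetical 1-unit baseline and shows how the angle collapses as the point moves away; the function name and the geometry are illustrative, and the 1° comparison mirrors the approximate threshold quoted above.

```python
import numpy as np

def triangulation_angle_deg(C1, C2, X):
    """Angle in degrees between the rays from camera centres C1, C2 to point X."""
    ray1 = X - C1
    ray2 = X - C2
    cos_theta = ray1 @ ray2 / (np.linalg.norm(ray1) * np.linalg.norm(ray2))
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

C1 = np.array([0.0, 0.0, 0.0])
C2 = np.array([1.0, 0.0, 0.0])        # 1-unit baseline along x

near = triangulation_angle_deg(C1, C2, np.array([0.5, 0.0, 5.0]))
far = triangulation_angle_deg(C1, C2, np.array([0.5, 0.0, 100.0]))
print(f"near point: {near:.2f} deg, far point: {far:.2f} deg")
```

With the same baseline, the point at depth 5 subtends over 11°, while the point at depth 100 falls below 1° and would be filtered out.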
The final SfM output, sparse/0/, contains the COLMAP-format model files (cameras.txt with the intrinsics, images.txt with the camera poses and 2D observations, points3D.txt with the sparse 3D points) that initialize the 3DGS training point cloud with 14,452 structured seed points.
Why This Matters for 3DGS
The quality of the SfM reconstruction directly determines 3DGS training quality:
- Poor camera pose estimates → the photometric loss gradient is incoherent across views → slow convergence, floater artifacts
- Dense, accurate 3D points → better initial Gaussian placement → fewer densification iterations needed
- Correct camera intrinsics → the rasterizer projects Gaussians correctly → sharp renders vs. systematically blurred
The Bamburgh Castle pipeline achieves 29.0 dB PSNR at 30k iterations — a strong result for an unconstrained video capture with varying lighting and motion blur — largely because GLOMAP delivers a robust reconstruction from 585 well-registered cameras.