3D Gaussian Splatting · End-to-End Scene Reconstruction

Depth Priors, Exposure Compensation & Accelerated 3DGS Training

16 min read
By the end of this reading you will be able to:
  • Explain how scale-shift alignment converts relative Depth-Anything-V2 monocular depth to metric depth using sparse COLMAP points, and identify what fails when this alignment is omitted
  • Explain why standard Adam accumulates incorrect second-moment estimates for inactive Gaussians at scale, and how Sparse Adam corrects this by skipping updates for zero-gradient parameters
  • Implement per-image affine exposure compensation to handle auto-exposure variation across training images and identify when exposure variation causes visible reconstruction artifacts
  • Identify the contribution of each acceleration technique (depth priors, Sparse Adam, 3dgs_accel tile precomputation) to training speed and final reconstruction quality

The Initialization Problem

Standard 3DGS initializes Gaussians at the sparse SfM point cloud: in the Bamburgh Castle scene, 14,452 points derived from 585 camera views. For a 640×272 video frame (174,080 pixels), that is at best about one Gaussian per 12 pixels even if every point were visible in a single view, which is sparse for a scene with complex architecture. The adaptive densification mechanism will grow this to roughly 200k–600k Gaussians by 30k iterations, but early in training:

  1. Regions with no nearby Gaussians have zero gradient contribution — the splat rasterizer only backpropagates through Gaussians that tile-intersect the current view.
  2. The gradient magnitude determines which Gaussians densify (clone or split). A Gaussian with no gradient neither grows nor divides.
  3. Consequently, poorly initialized regions remain empty for many thousands of iterations, wasting compute on gradient updates to Gaussians that cannot improve those pixels.

Depth priors solve this by providing a dense initialization signal — one potential Gaussian per pixel — at the cost of scale ambiguity inherent in monocular depth estimation.

Depth-Anything-V2: Architecture and Output

Depth-Anything-V2 (Yang et al., 2024) is a monocular depth foundation model trained on synthetic labeled data and unlabeled real data via distillation. The ViT-L variant used here has:

  • Encoder: Vision Transformer Large (ViT-L/14), 307M parameters, patch size 14×14
  • Decoder: DPT (Dense Prediction Transformer) head converting patch tokens to dense depth maps
  • Output: $D \in \mathbb{R}^{H \times W}$; values are relative depth, so metric scale is not preserved

The prediction $D(u,v)$ is a monotone function of true depth but with arbitrary global scale $s$ and shift $b$:

$$D(u,v) \approx s \cdot d_{\text{true}}(u,v) + b$$

These affine parameters are unknown and can differ from image to image, so each depth map must be aligned to a metric reference before its depths can be used to seed 3D Gaussians in world coordinates.

Scale-Shift Alignment via Sparse COLMAP Depths

The script gaussian-splatting/utils/make_depth_scale.py performs the alignment. For each registered image $i$:

  1. Project each visible 3D point $\mathbf{X}_j$ into image $i$ to get pixel coordinates $(u_j, v_j)$ and metric depth $d_j = (R_i \mathbf{X}_j + \mathbf{t}_i)_z$.
  2. Sample the predicted depth map: $\hat{d}_j = D_i(u_j, v_j)$.
  3. Solve the affine regression:

$$\min_{s_i, b_i} \sum_j (s_i \hat{d}_j + b_i - d_j)^2$$

This has the closed-form solution:

$$s_i = \frac{\text{Cov}(\hat{\mathbf{d}}, \mathbf{d})}{\text{Var}(\hat{\mathbf{d}})}, \quad b_i = \bar{d} - s_i \bar{\hat{d}}$$
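The closed-form solution above can be sketched in a few lines of plain Python. This is an illustration of the regression as described here, not the code in make_depth_scale.py; the function name `fit_scale_shift` and the synthetic sample values are assumptions made for the example.

```python
from statistics import fmean


def fit_scale_shift(pred, metric):
    """Least-squares (s, b) such that s * pred + b approximates metric."""
    if len(pred) != len(metric) or len(pred) < 2:
        raise ValueError("need at least two paired depth samples")
    mean_p, mean_m = fmean(pred), fmean(metric)
    # Unnormalized covariance and variance; the 1/n factors cancel in the ratio.
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(pred, metric))
    var = sum((p - mean_p) ** 2 for p in pred)
    if var == 0.0:
        raise ValueError("predicted depths are constant; scale is undefined")
    s = cov / var
    b = mean_m - s * mean_p
    return s, b


# Synthetic check: metric depth generated as 2.5 * predicted + 0.4 (made-up values).
pred = [0.1, 0.35, 0.6, 0.9]
metric = [2.5 * p + 0.4 for p in pred]
s, b = fit_scale_shift(pred, metric)
print(f"s = {s:.4f}, b = {b:.4f}")
```

With only a handful of sparse points per image, a few outliers can dominate a plain least-squares fit, so a real pipeline may prefer a robust variant.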

The aligned depth $\tilde{D}_i = s_i D_i + b_i$ is then used to unproject each pixel into a 3D point, providing a dense initialization prior for Gaussian seeding. The file make_depth_scale.py writes a JSON of {image_id: scale_factor} consumed by the training script.
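As a rough sketch of the unprojection step, the snippet below lifts one pixel with an aligned depth into world coordinates. It assumes a pinhole camera and the same world-to-camera convention as the depth formula above (camera point = R · world point + t). The intrinsics, the pose, and the function names are hypothetical values chosen for illustration.

```python
def aligned_depth(raw_depth, s, b):
    """Apply the per-image alignment: s * raw + b."""
    return s * raw_depth + b


def unproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with aligned z-depth to a world-space point.

    R (3x3 nested list) and t (length-3 list) are world-to-camera, so
    x_world = R^T @ (x_cam - t).
    """
    x_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    diff = [x_cam[k] - t[k] for k in range(3)]
    # Multiply by R transposed: output entry r uses column r of R.
    return [sum(R[k][r] * diff[k] for k in range(3)) for r in range(3)]


# Hypothetical pinhole intrinsics for a 640x272 frame and a simple pose.
fx = fy = 500.0
cx, cy = 320.0, 136.0
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degree turn about z
t = [0.0, 0.0, 1.0]

depth = aligned_depth(0.8, s=2.5, b=1.0)
point = unproject_pixel(420.0, 136.0, depth, fx, fy, cx, cy, R, t)
print(point)
```

Applying this to every pixel yields one candidate point per pixel; subsampling before seeding Gaussians is a common way to keep the initial count manageable.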

Exposure Compensation

Unconstrained video captures such as the Bamburgh Castle footage show frame-to-frame brightness and color variation from auto-exposure and white balance shifts. Motion blur is also present, but a color transform cannot correct it. Without compensation, the photometric loss would force Gaussians to absorb these camera-state artifacts instead of representing the scene consistently.

3DGS exposure compensation (Kerbl et al., extended) learns a per-image affine color transform:

$$\tilde{\mathbf{c}}_i = A_i \mathbf{c} + \mathbf{b}_i$$

where $A_i \in \mathbb{R}^{3 \times 3}$ and $\mathbf{b}_i \in \mathbb{R}^3$ are learned parameters. In the Bamburgh pipeline these are activated with the flag --train_test_exp and scheduled with:

  • --exposure_lr_init 0.001 — initial LR for exposure parameters
  • --exposure_lr_final 0.0001 — final LR after decay
  • --exposure_lr_delay_steps 5000 — warm-up period during which exposure LR is very small
  • --exposure_lr_delay_mult 0.001 — multiplier applied during the warm-up period
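The per-image affine transform itself is small enough to sketch directly. The snippet below applies a 3×3 matrix and an offset to one RGB triple; starting every image at the identity transform is an assumption made here so that training begins with unmodified colors. The function names and the numeric values are illustrative only.

```python
def apply_exposure(color, A, b):
    """Return A @ color + b for one RGB triple (values nominally in [0, 1])."""
    return [sum(A[r][k] * color[k] for k in range(3)) + b[r] for r in range(3)]


def identity_exposure():
    """Neutral starting point: A = I, b = 0 leaves colors unchanged."""
    A = [[1.0 if r == k else 0.0 for k in range(3)] for r in range(3)]
    return A, [0.0, 0.0, 0.0]


rendered = [0.2, 0.5, 0.8]

A0, b0 = identity_exposure()
unchanged = apply_exposure(rendered, A0, b0)

# A frame that came out about 20% brighter with a slight warm cast (made-up numbers).
A_bright = [[1.2, 0.0, 0.0], [0.0, 1.2, 0.0], [0.0, 0.0, 1.2]]
b_warm = [0.02, 0.0, -0.02]
compensated = apply_exposure(rendered, A_bright, b_warm)
print(unchanged, compensated)
```

In training, $A_i$ and $\mathbf{b}_i$ would be optimized per image alongside the Gaussians, with the learning rate controlled by the flags listed above.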

The warm-up prevents early exposure parameter updates from contaminating the initial Gaussian placement. To a first approximation, the LR schedule is:

$$\eta(t) = \begin{cases} \eta_0 \cdot \mu^{1 - t/t_{\text{delay}}} & t < t_{\text{delay}} \\ \eta_0 \cdot \left(\frac{\eta_f}{\eta_0}\right)^{t/T} & t \geq t_{\text{delay}} \end{cases}$$

where $\mu = 0.001$, $t_{\text{delay}} = 5000$, $\eta_0 = 0.001$, $\eta_f = 0.0001$, $T = 30000$.
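To make the schedule concrete, the following sketch evaluates the piecewise formula exactly as written, using the parameter values given above as defaults. The function name `exposure_lr` is hypothetical, and the snippet is not taken from the training script.

```python
def exposure_lr(step, lr_init=1e-3, lr_final=1e-4, delay_steps=5000,
                delay_mult=1e-3, max_steps=30000):
    """Piecewise exposure learning rate, following the formula above."""
    if step < delay_steps:
        # Warm-up: starts at lr_init * delay_mult and rises towards lr_init.
        return lr_init * delay_mult ** (1.0 - step / delay_steps)
    # Main phase: log-linear decay from lr_init (t = 0) to lr_final (t = max_steps).
    return lr_init * (lr_final / lr_init) ** (step / max_steps)


for step in (0, 2500, 4999, 5000, 30000):
    print(step, f"{exposure_lr(step):.3e}")
```

Evaluated this way, the two branches do not meet at $t_{\text{delay}}$: the warm-up branch approaches $\eta_0$, while the decay branch starts at about $0.68\,\eta_0$. An implementation that multiplies a smooth warm-up factor into the decay curve avoids this jump.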

Sparse Adam: Why Standard Adam Breaks at Scale

In 3DGS training, each Gaussian $g_k$ receives a gradient only when it overlaps a rendered tile in the current view. For a scene with 500k Gaussians and a rasterizer processing 1024×768 at 16×16 tiles, there are 64 × 48 = 3,072 tiles per frame. A single Gaussian typically covers a small number of tiles, often just 1–4. So in each iteration:

  • Active Gaussians (gradient received): $\sim 10{,}000$ (rough estimate)
  • Inactive Gaussians (zero gradient): $\sim 490{,}000$

Standard Adam updates all 500k Gaussians every step, even when $g = 0$. For inactive parameters the moment updates reduce to $m_1 \leftarrow \beta_1 m_1$ and $m_2 \leftarrow \beta_2 m_2$, decaying the running moments without any gradient signal. This biases the effective step size for those parameters when they eventually receive gradients, slowing convergence.

Sparse Adam only updates parameters that received a non-zero gradient in the current step. Inactive parameters retain their moments exactly; no decay occurs. This preserves the adaptive step-size history and eliminates $O(N)$ wasted tensor operations per step.
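The difference between the two update rules can be seen in a toy example with two scalar parameters. The sketch below is a plain-Python illustration, not the optimizer used by the training code. The per-entry step counter for bias correction, the hyperparameters, and the function names are assumptions made for this example.

```python
import math


def adam_step(params, grads, m, v, steps, lr=0.01, b1=0.9, b2=0.999,
              eps=1e-8, sparse=False):
    """One in-place Adam step over per-Gaussian scalar parameters.

    With sparse=True, entries whose gradient is exactly zero are skipped,
    leaving their parameter, moments, and step counter untouched.
    """
    for k, g in enumerate(grads):
        if sparse and g == 0.0:
            continue
        steps[k] += 1
        m[k] = b1 * m[k] + (1.0 - b1) * g
        v[k] = b2 * v[k] + (1.0 - b2) * g * g
        m_hat = m[k] / (1.0 - b1 ** steps[k])
        v_hat = v[k] / (1.0 - b2 ** steps[k])
        params[k] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def run(sparse):
    params, m, v, steps = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0, 0]
    adam_step(params, [0.5, 0.5], m, v, steps, sparse=sparse)  # both visible
    for _ in range(100):  # the second Gaussian leaves the view
        adam_step(params, [0.5, 0.0], m, v, steps, sparse=sparse)
    return params, m


dense_params, dense_m = run(sparse=False)
sparse_params, sparse_m = run(sparse=True)
print("dense :", dense_params[1], dense_m[1])
print("sparse:", sparse_params[1], sparse_m[1])
```

In the dense run the out-of-view parameter keeps drifting on stale momentum while its first moment decays towards zero. In the sparse run it stays where its last real gradient left it, and its moments are intact when it becomes visible again.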

At $N = 500{,}000$ Gaussians × 59 parameters × 2 moments ($m_1, m_2$):

  • Standard Adam: 59M floating point updates per step, regardless of scene activity
  • Sparse Adam: $\sim 1.18$M updates per step (about 10,000 active Gaussians × 59 × 2) → roughly 50× fewer updates

The 3dgs_accel Branch: Accelerated Tile Assignment

The diff-gaussian-rasterization submodule includes a 3dgs_accel branch that precomputes 2D bounding rectangles for each Gaussian in screen space before the tiling loop.

In the baseline implementation, the CUDA kernel computes tile intersections per-Gaussian by iterating over candidate tiles and testing against the 2D Gaussian covariance. For a Gaussian with screen-space covariance $\Sigma_{2D}$, the bounding rectangle is:

$$\text{rect} = [\mu_u \pm 3\sigma_u,\ \mu_v \pm 3\sigma_v]$$

where $\sigma_u = \sqrt{(\Sigma_{2D})_{00}}$, $\sigma_v = \sqrt{(\Sigma_{2D})_{11}}$. Rather than re-deriving this per tile, 3dgs_accel precomputes the rectangle in a single vectorized pass and uses it in a direct tile-range lookup, eliminating the inner loop over candidate tiles.

For a Gaussian covering $k$ tiles, the complexity changes from $O(k_{\max})$ candidate evaluations to $O(1)$ direct range computation, where $k_{\max}$ is the maximum tile count considered by the naive implementation. This matters for large Gaussians (high-$\sigma$ ellipsoids) that could span dozens of tiles.
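The direct tile-range lookup can be sketched on the CPU as follows. This is an illustration of the bounding-rectangle idea only, not the branch's CUDA code. The function names, the half-open tile-index convention, and the example Gaussian are assumptions for the sketch.

```python
import math

TILE = 16  # tile edge in pixels, matching the 16x16 tiling mentioned earlier


def tile_range(mu_u, mu_v, cov00, cov11, width, height, tile=TILE):
    """Return ((tx_min, ty_min), (tx_max, ty_max)) as half-open tile bounds.

    The rectangle is mu +/- 3 sigma per axis, clamped to the tile grid.
    """
    ru = 3.0 * math.sqrt(cov00)
    rv = 3.0 * math.sqrt(cov11)
    grid_x = (width + tile - 1) // tile
    grid_y = (height + tile - 1) // tile
    tx_min = min(grid_x, max(0, math.floor((mu_u - ru) / tile)))
    ty_min = min(grid_y, max(0, math.floor((mu_v - rv) / tile)))
    tx_max = min(grid_x, max(0, math.ceil((mu_u + ru) / tile)))
    ty_max = min(grid_y, max(0, math.ceil((mu_v + rv) / tile)))
    return (tx_min, ty_min), (tx_max, ty_max)


def tiles_touched(lo, hi):
    """Number of tiles inside the half-open range [lo, hi)."""
    return (hi[0] - lo[0]) * (hi[1] - lo[1])


# Example Gaussian centred at (100, 60) with sigma_u = 4 px and sigma_v = 2 px.
lo, hi = tile_range(100.0, 60.0, cov00=16.0, cov11=4.0, width=1024, height=768)
print(lo, hi, tiles_touched(lo, hi))
```

Because the range follows from four divisions and clamps, the cost per Gaussian does not depend on how many tiles it spans. The axis-aligned rectangle is conservative for elongated, rotated Gaussians, so some listed tiles may receive negligible contribution.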

Training Convergence: Bamburgh Castle

With all optimizations enabled (sparse_adam, depth priors, 3dgs_accel, exposure compensation), the Bamburgh Castle scene trains in 8 minutes 25 seconds on a Colab T4/A100 GPU:

Checkpoint    Iterations    L1 Loss    PSNR (dB)
Early         7,000         0.02350    26.96
Final         30,000        0.01682    29.02

The 2.06 dB improvement from 7k to 30k iterations reflects continued Gaussian densification and opacity pruning settling the scene into a stable high-quality representation. In comparison, the same scene without depth priors typically requires 50k+ iterations to reach equivalent quality.