3D Gaussian Splatting · End-to-End Scene Reconstruction

Depth Priors, Exposure Compensation & Accelerated 3DGS Training

16 min read

By the end of this reading you will be able to:

Explain how scale-shift alignment converts relative Depth-Anything-V2 monocular depth to metric depth using sparse COLMAP points, and identify what fails when this alignment is omitted
Explain why standard Adam accumulates incorrect second-moment estimates for inactive Gaussians at scale, and how Sparse Adam corrects this by skipping updates for zero-gradient parameters
Implement per-image affine exposure compensation to handle auto-exposure variation across training images and identify when exposure variation causes visible reconstruction artifacts
Identify the contribution of each acceleration technique (depth priors, Sparse Adam, 3dgs_accel tile precomputation) to training speed and final reconstruction quality

The Initialization Problem

Standard 3DGS initializes Gaussians at the sparse SfM point cloud: in the Bamburgh Castle scene, 14,452 points derived from 585 camera views. A 640×272 video frame has 174,080 pixels, so even if every point were visible in a single view this would amount to at most about one Gaussian per 12 pixels, which is sparse for a scene with complex architecture. The adaptive densification mechanism will grow this to roughly 200k–600k Gaussians by 30k iterations, but early in training:

Regions with no nearby Gaussians have zero gradient contribution — the splat rasterizer only backpropagates through Gaussians that tile-intersect the current view.
The gradient magnitude determines which Gaussians densify (clone or split). A Gaussian with no gradient neither grows nor divides.
Consequently, poorly initialized regions remain empty for many thousands of iterations, wasting compute on gradient updates to Gaussians that cannot improve those pixels.

Depth priors solve this by providing a dense initialization signal — one potential Gaussian per pixel — at the cost of scale ambiguity inherent in monocular depth estimation.

Depth-Anything-V2: Architecture and Output

Depth-Anything-V2 (Yang et al., 2024) is a monocular depth foundation model trained on synthetic labeled data and unlabeled real data via distillation. The ViT-L variant used here has:

Encoder: Vision Transformer Large (ViT-L/14), 307M parameters, patch size 14×14
Decoder: DPT (Dense Prediction Transformer) head converting patch tokens to dense depth maps
Output: $D \in \mathbb{R}^{H \times W}$ , values are relative depth — metric scale is not preserved

The prediction $D(u,v)$ is a monotone function of true depth but with arbitrary global scale $s$ and shift $b$ :

$D(u,v) \approx s \cdot d_{\text{true}}(u,v) + b$

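These affine parameters differ from image to image and must be aligned to a metric reference before the depths can be used to seed 3D Gaussians in world coordinates. Relative-depth models in this family typically predict inverse depth (disparity), so in practice the same affine fit is often carried out in inverse-depth space, with $d$ replaced by $1/d$ on both sides.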

Scale-Shift Alignment via Sparse COLMAP Depths

The script gaussian-splatting/utils/make_depth_scale.py performs the alignment. For each registered image $i$ :

Project each visible 3D point $\mathbf{X}_j$ into image $i$ to get pixel coordinates $(u_j, v_j)$ and metric depth $d_j = (R_i \mathbf{X}_j + \mathbf{t}_i)_z$ .
Sample the predicted depth map: $\hat{d}_j = D_i(u_j, v_j)$ .
Solve the affine regression:

$\min_{s_i, b_i} \sum_j (s_i \hat{d}_j + b_i - d_j)^2$

This has the closed-form solution:

$s_i = \frac{\text{Cov}(\hat{\mathbf{d}}, \mathbf{d})}{\text{Var}(\hat{\mathbf{d}})}, \quad b_i = \bar{d} - s_i \bar{\hat{d}}$
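The closed-form solution above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the contents of make_depth_scale.py: the function name fit_scale_shift and the synthetic test data are assumptions made for the example.

```python
import numpy as np

def fit_scale_shift(d_pred, d_metric):
    """Least-squares s, b minimizing sum((s * d_pred + b - d_metric)^2).

    d_pred:   predicted relative depths sampled at the sparse points
    d_metric: metric depths of the same points from the SfM reconstruction
    """
    d_pred = np.asarray(d_pred, dtype=np.float64)
    d_metric = np.asarray(d_metric, dtype=np.float64)
    pred_mean = d_pred.mean()
    metric_mean = d_metric.mean()
    cov = np.mean((d_pred - pred_mean) * (d_metric - metric_mean))
    var = np.mean((d_pred - pred_mean) ** 2)
    if var == 0.0:
        raise ValueError("predicted depths are constant; scale is undefined")
    s = cov / var
    b = metric_mean - s * pred_mean
    return s, b

# Synthetic check (hypothetical data): hide a known affine map and recover it.
rng = np.random.default_rng(0)
d_metric = rng.uniform(2.0, 20.0, size=200)
d_pred = 0.25 * d_metric - 0.375      # relative depth with unknown scale and shift
s_i, b_i = fit_scale_shift(d_pred, d_metric)
aligned = s_i * d_pred + b_i          # maps predictions back to metric depth
```

With real data the fit is sensitive to outlier SfM points, so a robust variant (for example one based on medians) may be preferable to plain least squares.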

The aligned depth $\tilde{D}_i = s_i D_i + b_i$ is then used to unproject each pixel into a 3D point, providing a dense initialization prior for Gaussian seeding. The script make_depth_scale.py writes a JSON file of per-image alignment parameters, which the training script consumes; both the scale and the shift are needed to reconstruct $\tilde{D}_i$.
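Unprojecting an aligned depth map into world-space points might look like the sketch below. It assumes a pinhole intrinsic matrix K, the COLMAP-style convention $\mathbf{x}_{\text{cam}} = R\,\mathbf{X} + \mathbf{t}$ used in the projection step, and depth measured along the camera z-axis; the function name unproject_depth is hypothetical.

```python
import numpy as np

def unproject_depth(depth, K, R, t):
    """Lift every pixel of a metric depth map (H, W) to a world-space point.

    Assumes x_cam = R @ X_world + t, so X_world = R.T @ (x_cam - t).
    Returns an (H * W, 3) array in row-major pixel order.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    X_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # Row-vector form of R.T @ (x_cam - t)
    return (X_cam - t) @ R

# Hypothetical toy camera: 4x2 image, principal point at pixel (2, 1).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])
depth = np.full((2, 4), 5.0)
points = unproject_depth(depth, K, R, t)
```

In this toy setup the pixel at the principal point lies on the optical axis, so it lands at camera-space (0, 0, 5) and world-space (0, 0, 4).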

Exposure Compensation

Unconstrained video captures such as the Bamburgh Castle footage show frame-to-frame appearance variation from auto-exposure and white balance shifts, in addition to motion blur. Without compensation, the photometric loss would force Gaussians to absorb these brightness and color changes, which are camera-state artifacts rather than properties of the scene. A per-image color transform can absorb exposure and white balance changes, but not blur.

3DGS exposure compensation (Kerbl et al., extended) learns a per-image affine color transform:

$\tilde{\mathbf{c}}_i = A_i \mathbf{c} + \mathbf{b}_i$

where $A_i \in \mathbb{R}^{3 \times 3}$ and $\mathbf{b}_i \in \mathbb{R}^3$ are learned parameters. In the Bamburgh pipeline these are activated with the flag --train_test_exp and scheduled with:

--exposure_lr_init 0.001 — initial LR for exposure parameters
--exposure_lr_final 0.0001 — final LR after decay
--exposure_lr_delay_steps 5000 — warm-up period during which exposure LR is very small
--exposure_lr_delay_mult 0.001 — multiplier applied during the warm-up period
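The per-image transform $\tilde{\mathbf{c}}_i = A_i \mathbf{c} + \mathbf{b}_i$ can be illustrated with a short NumPy sketch. The function name apply_exposure and the example values are assumptions for illustration; in training, $A_i$ and $\mathbf{b}_i$ would be learnable tensors, typically initialized to the identity transform so that the render starts out unchanged.

```python
import numpy as np

def apply_exposure(image, A, b):
    """Apply a per-image affine color transform to an (H, W, 3) RGB image.

    Each pixel color c becomes A @ c + b.
    """
    return image @ A.T + b

# Identity initialization leaves the rendered image untouched.
render = np.full((2, 2, 3), 0.5)
A_init = np.eye(3)
b_init = np.zeros(3)
unchanged = apply_exposure(render, A_init, b_init)

# A hypothetical learned state for a slightly over-exposed frame.
A_bright = 1.2 * np.eye(3)
b_bright = np.full(3, 0.05)
brightened = apply_exposure(render, A_bright, b_bright)  # 0.5 * 1.2 + 0.05 = 0.65
```

The photometric loss is then computed between the transformed render and the captured frame, so exposure differences are explained by $A_i$ and $\mathbf{b}_i$ instead of by the Gaussians.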

The warm-up prevents early exposure parameter updates from contaminating the initial Gaussian placement. Mathematically, the LR schedule is:

$\eta(t) = \left[\mu + (1 - \mu)\,\sin\!\left(\frac{\pi}{2}\,\min\!\left(\frac{t}{t_{\text{delay}}},\, 1\right)\right)\right] \cdot \eta_0 \left(\frac{\eta_f}{\eta_0}\right)^{t/T}$

where $\mu = 0.001$ , $t_{\text{delay}} = 5000$ , $\eta_0 = 0.001$ , $\eta_f = 0.0001$ , $T = 30000$ .
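A delayed exponential schedule with these parameters can be sketched as follows. The sketch assumes the form used by the get_expon_lr_func helper in NeRF-style codebases, a smooth sine-ramp delay factor multiplied by a log-linear decay from $\eta_0$ to $\eta_f$; the function name exposure_lr is hypothetical.

```python
import math

def exposure_lr(step, lr_init=1e-3, lr_final=1e-4,
                delay_steps=5000, delay_mult=1e-3, max_steps=30000):
    """Learning rate at a given step: delay factor times log-linear decay."""
    if delay_steps > 0:
        ramp = min(max(step / delay_steps, 0.0), 1.0)
        # Rises smoothly from delay_mult at step 0 to 1.0 at delay_steps.
        delay = delay_mult + (1.0 - delay_mult) * math.sin(0.5 * math.pi * ramp)
    else:
        delay = 1.0
    frac = min(max(step / max_steps, 0.0), 1.0)
    # Interpolate between lr_init and lr_final in log space.
    decay = math.exp(math.log(lr_init) * (1.0 - frac) + math.log(lr_final) * frac)
    return delay * decay

lr_start = exposure_lr(0)        # lr_init * delay_mult
lr_mid = exposure_lr(5000)       # delay factor has reached 1.0
lr_end = exposure_lr(30000)      # lr_final
```

At step 0 the rate is $\eta_0 \mu$, three orders of magnitude below $\eta_0$, which is what keeps the exposure parameters nearly frozen while the Gaussians settle.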

Sparse Adam: Why Standard Adam Breaks at Scale

In 3DGS training, each Gaussian $g_k$ receives a gradient only when it overlaps a rendered tile in the current view. For a scene with 500k Gaussians and a rasterizer processing 1024×768 at 16×16 tiles, there are 64 × 48 = 3,072 tiles per frame. A single Gaussian typically covers a small number of tiles, often just 1–4. So in each iteration:

Active Gaussians (gradient received): $\sim 10{,}000$ (rough estimate)
Inactive Gaussians (zero gradient): $\sim 490{,}000$

Standard Adam updates all 500k Gaussians every step, even when $g = 0$ . For inactive parameters the moment updates reduce to $m_1 \leftarrow \beta_1 m_1$ and $m_2 \leftarrow \beta_2 m_2$ , decaying both running moments without any gradient signal, while the parameter itself keeps drifting along its stale first moment. When such a parameter finally receives a gradient again, its decayed moments no longer reflect its true gradient history, which distorts the effective step size and slows convergence.

Sparse Adam only updates parameters that received a non-zero gradient in the current step. Inactive parameters retain their moments exactly — no decay occurs. This preserves the adaptive step-size history and eliminates $O(N)$ wasted tensor operations per step.
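The masked update can be sketched in NumPy as below. This is an illustrative simplification, not the fused kernel a real implementation would use: it selects active rows by non-zero gradient, omits bias correction for brevity, and the function name sparse_adam_step and its default hyperparameters are assumptions.

```python
import numpy as np

def sparse_adam_step(params, grads, m1, m2,
                     lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-15):
    """Update, in place, only the rows whose gradient is non-zero.

    params, grads, m1, m2 all have shape (N, P): one row per Gaussian.
    Bias correction is omitted for brevity. Returns the boolean active mask.
    """
    active = np.any(grads != 0.0, axis=1)
    g = grads[active]
    m1[active] = beta1 * m1[active] + (1.0 - beta1) * g
    m2[active] = beta2 * m2[active] + (1.0 - beta2) * g * g
    params[active] -= lr * m1[active] / (np.sqrt(m2[active]) + eps)
    return active

# Four Gaussians with three parameters each; only row 1 receives a gradient.
params = np.zeros((4, 3))
m1 = np.full((4, 3), 0.5)
m2 = np.full((4, 3), 0.25)
grads = np.zeros((4, 3))
grads[1] = 1.0
active = sparse_adam_step(params, grads, m1, m2)
```

After the call, the inactive rows keep their moments and parameter values exactly, while row 1 has updated moments and a changed parameter.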

At $N = 500{,}000$ Gaussians × 59 parameters × 2 moments ( $m_1, m_2$ ):

Standard Adam: 59M floating point updates per step, regardless of scene activity
Sparse Adam: $\sim 1.18$M updates per step ( $\sim 10{,}000$ active Gaussians × 59 parameters × 2 moments) → about 50× fewer updates

The 3dgs_accel Branch: Accelerated Tile Assignment

The diff-gaussian-rasterization submodule includes a 3dgs_accel branch that precomputes 2D bounding rectangles for each Gaussian in screen space before the tiling loop.

In the baseline implementation, the CUDA kernel computes tile intersections per-Gaussian by iterating over candidate tiles and testing against the 2D Gaussian covariance. For a Gaussian with screen-space covariance $\Sigma_{2D}$ , the bounding rectangle is:

$\text{rect} = [\mu_u \pm 3\sigma_u, \mu_v \pm 3\sigma_v]$

where $\sigma_u = \sqrt{(\Sigma_{2D})_{00}}$ , $\sigma_v = \sqrt{(\Sigma_{2D})_{11}}$ . Rather than re-deriving this per tile, 3dgs_accel precomputes the rectangle in a single vectorized pass and uses it in a direct tile-range lookup — eliminating the inner loop over candidate tiles.

For a Gaussian covering $k$ tiles, the complexity changes from $O(k_{\max})$ candidate evaluations to $O(1)$ direct range computation, where $k_{\max}$ is the maximum tile count considered by the naive implementation. This matters for large Gaussians (high- $\sigma$ ellipsoids) that could span dozens of tiles.
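The direct tile-range computation can be sketched in Python as follows. This mirrors the idea rather than the CUDA code: the function names tile_range and tile_count, the half-open range convention, and the clamping to the tile grid are assumptions made for the example.

```python
import math

def tile_range(mu_u, mu_v, cov2d, width, height, tile=16):
    """Tile range covered by the 3-sigma bounding rectangle of a 2D Gaussian.

    Returns (tx_min, tx_max, ty_min, ty_max) with exclusive upper bounds,
    clamped to the tile grid of a width x height image.
    """
    sigma_u = math.sqrt(cov2d[0][0])
    sigma_v = math.sqrt(cov2d[1][1])
    tiles_x = math.ceil(width / tile)
    tiles_y = math.ceil(height / tile)

    def clamp(value, upper):
        return max(0, min(value, upper))

    tx_min = clamp(math.floor((mu_u - 3.0 * sigma_u) / tile), tiles_x)
    tx_max = clamp(math.floor((mu_u + 3.0 * sigma_u) / tile) + 1, tiles_x)
    ty_min = clamp(math.floor((mu_v - 3.0 * sigma_v) / tile), tiles_y)
    ty_max = clamp(math.floor((mu_v + 3.0 * sigma_v) / tile) + 1, tiles_y)
    return tx_min, tx_max, ty_min, ty_max

def tile_count(rect):
    """Number of tiles inside a half-open tile range."""
    tx_min, tx_max, ty_min, ty_max = rect
    return (tx_max - tx_min) * (ty_max - ty_min)

# Gaussian centered at (100, 100) with sigma = 4 px on both axes.
on_screen = tile_range(100.0, 100.0, [[16.0, 0.0], [0.0, 16.0]], 1024, 768)
# Gaussian entirely outside the image: the clamped range is empty.
off_screen = tile_range(-100.0, -100.0, [[16.0, 0.0], [0.0, 16.0]], 1024, 768)
```

Here the on-screen Gaussian spans pixels 88 to 112 on each axis, which maps to tiles 5 through 7, or 9 tiles in total, without testing any candidate tile individually.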

Training Convergence: Bamburgh Castle

With all optimizations enabled (sparse_adam, depth priors, 3dgs_accel, exposure compensation), the Bamburgh Castle scene trains in 8 minutes 25 seconds on a Colab T4/A100 GPU:

Checkpoint	Iterations	L1 Loss	PSNR
Early	7,000	0.02350	26.96 dB
Final	30,000	0.01682	29.02 dB

The 2.06 dB improvement from 7k to 30k iterations reflects continued Gaussian densification and opacity pruning settling the scene into a stable high-quality representation. In comparison, the same scene without depth priors typically requires 50k+ iterations to reach equivalent quality.

References

Yang et al. 2024 — Depth Anything V2

Kerbl et al. 2023 — 3D Gaussian Splatting for Real-Time Radiance Field Rendering

Kingma & Ba 2015 — Adam: A Method for Stochastic Optimization
