3D Gaussian Splatting · Evaluation & Survey

Benchmark Datasets for Novel View Synthesis


By the end of this reading you will be able to:

Distinguish Tanks & Temples, Deep Blending, Mip-NeRF 360, and Synthetic NeRF benchmarks by the scene properties and reconstruction challenges each is designed to stress-test
Explain why PSNR values are not comparable across datasets, identifying the specific scene properties (unbounded vs. bounded, real vs. synthetic, near-field vs. large-scale) that cause this non-comparability
Identify the failure mode of optimizing a compression method on a single dataset and explain how dataset coverage (all four benchmarks) reveals generalization failures
Select the appropriate benchmark dataset(s) to evaluate a new 3DGS compression method given specific scene characteristics and the tradeoffs being investigated

The Four Standard Benchmarks

The 3DGS compression literature evaluates almost exclusively on four datasets. Knowing their characteristics is essential for understanding which methods generalize and which overfit to specific scene types.

Outdoor / Indoor · Real Capture

Tanks and Temples

Scenes: Truck, Train, Playground, M60, and others

High-resolution real-world captures with natural lighting
Unbounded scenes — background extends to infinity
Objects at varying scales: a full tank, a train car
Challenging background coverage for Gaussian rasterizers

Why it matters

The Truck scene is the de facto single-scene calibration point. Compare any paper's Truck PSNR to its own baseline 3DGS run — that delta is what matters, not the absolute number.
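The baseline-relative comparison described above can be sketched as follows. This is a minimal illustration, assuming pixel values normalized to [0, 1] and images flattened to plain sequences; the function names `psnr` and `psnr_delta` are hypothetical and not taken from any specific codebase.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel values.

    Assumes values lie in [0, max_value]; images are flattened beforehand.
    """
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixel values")
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_delta(method_psnr, baseline_psnr):
    """Change in PSNR relative to the same paper's own baseline 3DGS run.

    Negative values mean the compressed model lost quality.
    """
    return method_psnr - baseline_psnr
```

Reporting `psnr_delta` alongside the absolute value makes results from different papers easier to compare, since differences in training setup and evaluation resolution largely cancel out within a single paper.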

Indoor · Depth Sensor

Deep Blending

Scenes: DrJohnson, Playroom

Challenging near-field geometry with heavy occlusion
Specular surfaces: windows, glossy floors
Wide variation in depth across the scene

Why it matters

The primary stress test for view-dependent color (spherical harmonics). Methods that reduce SH degree show the largest PSNR drops here.
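To see why SH degree is such a tempting compression target, it helps to count coefficients. A degree-d spherical-harmonics expansion has (d + 1)² basis functions per color channel. The sketch below assumes RGB color and, for the fraction calculation, the uncompressed 3DGS layout of 11 non-SH floats per Gaussian (3 position, 3 scale, 4 rotation, 1 opacity); the function names are hypothetical.

```python
def sh_floats_per_gaussian(degree, channels=3):
    """Number of SH color coefficients stored per Gaussian."""
    if degree < 0:
        raise ValueError("degree must be non-negative")
    return channels * (degree + 1) ** 2


def sh_fraction(degree, other_floats=11):
    """Share of a Gaussian's floats spent on SH color.

    other_floats=11 is an assumption: position, scale, rotation, opacity.
    """
    sh = sh_floats_per_gaussian(degree)
    return sh / (sh + other_floats)


for d in range(4):
    print(d, sh_floats_per_gaussian(d))  # 3, 12, 27, 48 floats
```

Under these assumptions, degree-3 SH accounts for 48 of 59 floats per Gaussian, so lowering the degree saves a lot of storage, at the cost of the view-dependent effects this dataset emphasizes.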

360° Unbounded · Indoor & Outdoor

Mip-NeRF 360

Scenes: bicycle, flowers, garden, stump, treehill, bonsai, counter, kitchen, room

Cameras orbit a central region through a full 360°, so distant content is visible in every direction
Background extends to infinity — the unbounded challenge
9 scenes, giving the most statistically robust averages
Wide variation in lighting, scale, and texture complexity

Why it matters

Spatial quantization and background pruning are stress-tested hardest here. Methods over-tuned on bounded datasets often badly over-prune the infinite background.

Synthetic · Blender CGI

Synthetic NeRF (Blender)

Scenes: chair, drums, ficus, hotdog, lego, materials, mic, ship

Computer-generated with perfect ground-truth geometry
Objects on white backgrounds with complex materials
Zero capture noise, motion blur, or lens distortion
PSNR systematically higher (∼28–35 dB) than real-capture datasets

Why it matters

Clean ground truth isolates algorithmic quality from capture artefacts. Do not compare PSNR values here directly to real-capture scenes — they are not on the same scale.
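One practical consequence is that results should be aggregated per dataset and never pooled into a single average. The sketch below shows one way to do that, assuming each result is a hypothetical `(dataset, scene, baseline_psnr, method_psnr)` tuple; the helper name `per_dataset_delta` is illustrative only.

```python
from collections import defaultdict
from statistics import mean


def per_dataset_delta(results):
    """Mean PSNR change versus baseline, computed separately per dataset.

    results: iterable of (dataset, scene, baseline_psnr, method_psnr).
    Deltas are averaged within a dataset only, so absolute PSNR scales
    of synthetic and real-capture scenes are never mixed.
    """
    deltas = defaultdict(list)
    for dataset, _scene, baseline, method in results:
        deltas[dataset].append(method - baseline)
    return {dataset: mean(values) for dataset, values in deltas.items()}
```

Calling it with rows from all four benchmarks yields one delta per dataset, which makes it visible when a method holds up on synthetic scenes but degrades on real captures.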

Cross-Dataset Generalization

A key warning from the 3DGS.zip survey: methods optimized on one dataset may not generalize. The Synthetic NeRF dataset's object-centric, bounded setup is very different from Mip-NeRF 360's unbounded 360° captures. A compression method that aggressively prunes Gaussians based on viewing-frequency statistics tuned to bounded scenes may badly over-prune the infinite background in unbounded scenes.
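The over-pruning failure can be illustrated with a toy example. This is not any published method: it assumes a simple hypothetical rule that keeps a Gaussian only if it is visible in at least `min_views` training views, and the visibility counts are invented for illustration, not measured from any dataset.

```python
def kept_fraction(view_counts, min_views):
    """Fraction of Gaussians surviving a 'seen in >= min_views views' rule."""
    if not view_counts:
        raise ValueError("view_counts must not be empty")
    kept = sum(1 for count in view_counts if count >= min_views)
    return kept / len(view_counts)


# Hypothetical per-Gaussian visibility counts (illustrative, not measured).
bounded_object = [80, 75, 90, 60, 85, 70]        # object seen by most cameras
unbounded_scene = [80, 75, 90, 12, 8, 5, 10, 6]  # last five: distant background

MIN_VIEWS = 50  # threshold assumed to have been tuned on the bounded scene

print(kept_fraction(bounded_object, MIN_VIEWS))   # 1.0
print(kept_fraction(unbounded_scene, MIN_VIEWS))  # 0.375
```

With these invented counts, the threshold removes nothing from the bounded scene yet discards every background Gaussian in the unbounded one, which is the kind of generalization failure that only shows up when a method is evaluated on all four benchmarks.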

References

Knapitsch et al. 2017 — Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction

Hedman et al. 2018 — Deep Blending for Free-Viewpoint Image-Based Rendering

Barron et al. 2022 — Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields

Mildenhall et al. 2020 — NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
