3D Gaussian Splatting · Evaluation & Survey

Benchmark Datasets for Novel View Synthesis

By the end of this reading you will be able to:
  • Distinguish Tanks & Temples, Deep Blending, Mip-NeRF 360, and Synthetic NeRF benchmarks by the scene properties and reconstruction challenges each is designed to stress-test
  • Explain why PSNR values are not comparable across datasets, identifying the specific scene properties (unbounded vs. bounded, real vs. synthetic, near-field vs. large-scale) that cause this non-comparability
  • Identify the failure mode of optimizing a compression method on a single dataset and explain how dataset coverage (all four benchmarks) reveals generalization failures
  • Select the appropriate benchmark dataset(s) to evaluate a new 3DGS compression method given specific scene characteristics and the tradeoffs being investigated

The Four Standard Benchmarks

The 3DGS compression literature evaluates almost exclusively on four datasets. Knowing their characteristics is essential for understanding which methods generalize and which overfit to specific scene types.

Outdoor / Indoor · Real Capture
Tanks and Temples
Scenes: Truck, Train, Playground, M60, and others
  • High-resolution real-world captures with natural lighting
  • Unbounded scenes — background extends to infinity
  • Objects at varying scales: a full tank, a train car
  • Challenging background coverage for Gaussian rasterizers
Why it matters

The Truck scene is the de facto single-scene calibration point. Compare any paper's Truck PSNR to its own baseline 3DGS run — that delta is what matters, not the absolute number.
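A minimal sketch of that comparison, using made-up numbers. The two "papers" and their PSNR values are hypothetical and chosen only to show that the ranking by absolute PSNR can differ from the ranking by delta.

```python
def psnr_delta(method_psnr: float, baseline_psnr: float) -> float:
    """PSNR change in dB relative to the paper's own 3DGS baseline run."""
    return method_psnr - baseline_psnr


# Hypothetical Truck results, for illustration only (not from any paper).
paper_a = {"baseline": 25.40, "compressed": 25.10}
paper_b = {"baseline": 25.00, "compressed": 24.90}

delta_a = psnr_delta(paper_a["compressed"], paper_a["baseline"])
delta_b = psnr_delta(paper_b["compressed"], paper_b["baseline"])

print(f"Paper A: {paper_a['compressed']:.2f} dB absolute, {delta_a:+.2f} dB vs. own baseline")
print(f"Paper B: {paper_b['compressed']:.2f} dB absolute, {delta_b:+.2f} dB vs. own baseline")
```

Here Paper A reports the higher absolute PSNR, yet Paper B loses less quality relative to its own baseline. Baselines differ between papers because of training settings, hardware, and code versions, which is why the delta is the fairer number to compare.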

Indoor · Real Capture
Deep Blending
Scenes: DrJohnson, Playroom
  • Challenging near-field geometry with heavy occlusion
  • Specular surfaces: windows, glossy floors
  • Large variation in depth across the scene, from close-up surfaces to distant walls
Why it matters

The primary stress test for view-dependent color (spherical harmonics). Methods that reduce SH degree show the largest PSNR drops here.
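To see why SH degree is a tempting compression target, it helps to count coefficients. A real spherical-harmonic basis of degree d has (d + 1)² functions, so an RGB color stored this way needs 3(d + 1)² floats per Gaussian. The sketch below only tabulates that; the function names are illustrative and not taken from any 3DGS codebase.

```python
def sh_coeffs_per_channel(degree: int) -> int:
    """Number of real SH basis functions up to and including `degree`."""
    return (degree + 1) ** 2


def sh_floats_per_gaussian(degree: int, channels: int = 3) -> int:
    """Color floats stored per Gaussian for an RGB SH representation."""
    return channels * sh_coeffs_per_channel(degree)


for degree in range(4):
    print(f"SH degree {degree}: {sh_floats_per_gaussian(degree)} floats per Gaussian")
# SH degree 0: 3 floats per Gaussian
# SH degree 1: 12 floats per Gaussian
# SH degree 2: 27 floats per Gaussian
# SH degree 3: 48 floats per Gaussian
```

Dropping from degree 3 to degree 1 removes 36 of the 48 color floats. The terms removed are the higher-order ones, which model sharper view-dependent effects such as the window and floor reflections found in these scenes.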

360° Unbounded · Indoor & Outdoor
Mip-NeRF 360
Scenes: bicycle, flowers, garden, stump, treehill (outdoor); bonsai, counter, kitchen, room (indoor)
  • Cameras orbit a central object or region through a full 360°, so detailed background is visible in every direction
  • Background extends to infinity — the unbounded challenge
  • 9 scenes, giving the most statistically robust averages
  • Wide variation in lighting, scale, and texture complexity
Why it matters

Spatial quantization and background pruning are stress-tested hardest here. Methods over-tuned on bounded datasets often badly over-prune the infinite background.
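One way to see why unbounded scenes strain spatial quantization is to assume positions are quantized uniformly over the scene's bounding box. That is an assumption made for this sketch; actual compression methods vary and many use non-uniform or learned schemes. Under the assumption, the step size grows linearly with scene extent. The extents below are hypothetical.

```python
def quantization_step(extent: float, bits: int) -> float:
    """Spacing between adjacent levels of a uniform quantizer over `extent`."""
    return extent / (2 ** bits - 1)


BITS = 16

# Hypothetical scene extents in arbitrary scene units, for illustration only.
bounded_extent = 4.0      # object-centric scene inside a small box
unbounded_extent = 400.0  # box stretched to cover distant background

step_bounded = quantization_step(bounded_extent, BITS)
step_unbounded = quantization_step(unbounded_extent, BITS)

print(f"bounded step:   {step_bounded:.6f}")
print(f"unbounded step: {step_unbounded:.6f}")
print(f"ratio:          {step_unbounded / step_bounded:.0f}x coarser")
```

With the same bit budget, the stretched box makes every position 100 times coarser, including the positions of foreground Gaussians that need fine placement. A bit width tuned on a bounded dataset can therefore be too coarse once distant background enters the bounding box.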

Synthetic · Blender CGI
Synthetic NeRF (Blender)
Scenes: chair, drums, ficus, hotdog, lego, materials, mic, ship
  • Computer-generated with perfect ground-truth geometry
  • Objects on white backgrounds with complex materials
  • Zero capture noise, motion blur, or lens distortion
  • PSNR systematically higher (∼28–35 dB) than real-capture datasets
Why it matters

Clean ground truth isolates algorithmic quality from capture artefacts. Do not compare PSNR values here directly to real-capture scenes — they are not on the same scale.
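A toy calculation shows one plausible contributor to the scale difference. PSNR is 10·log10(MAX² / MSE), and MSE is averaged over every pixel, so a large and perfectly reconstructed white background lowers the MSE even when the object itself is rendered no better. The image sizes and error level below are made up, and this is only one factor among several (capture noise and exposure variation also matter).

```python
import math


def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)


n_pixels = 64 * 64
err = 0.05  # same per-pixel error wherever there is scene content

# Real-capture-like frame: scene content fills every pixel.
target_full = [0.5] * n_pixels
pred_full = [0.5 + err] * n_pixels

# Synthetic-like frame: the object covers 1/16 of the pixels,
# the rest is white background that is reconstructed exactly.
n_object = n_pixels // 16
target_obj = [0.5] * n_object + [1.0] * (n_pixels - n_object)
pred_obj = [0.5 + err] * n_object + [1.0] * (n_pixels - n_object)

psnr_full = psnr(pred_full, target_full)
psnr_obj = psnr(pred_obj, target_obj)

print(f"content fills frame:     {psnr_full:.2f} dB")
print(f"object on white backdrop: {psnr_obj:.2f} dB")
```

The per-pixel error on scene content is identical in both frames, yet the second one scores about 12 dB higher (10·log10(16)) purely because of the background area. This is why averages should be reported per dataset and never pooled across them.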

Cross-Dataset Generalization

A key warning from the 3DGS.zip survey: methods optimized on one dataset may not generalize. The Synthetic NeRF dataset's object-centric, bounded setup is very different from Mip-NeRF 360's unbounded 360° captures. A compression method that aggressively prunes Gaussians based on viewing-frequency statistics tuned to bounded scenes may badly over-prune the infinite background in unbounded scenes.
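The over-pruning failure can be sketched with a toy view-count rule. This is not any published method: the rule, the threshold, and all the view counts are hypothetical, chosen to mimic an object-centric scene where every Gaussian is seen from most cameras and an unbounded scene where each patch of background is seen from only a few.

```python
def kept_fraction(view_counts, min_views):
    """Fraction of Gaussians that survive a 'seen in >= min_views views' rule."""
    kept = sum(1 for count in view_counts if count >= min_views)
    return kept / len(view_counts)


# Hypothetical per-Gaussian view counts out of 100 training views.
bounded_scene = {"object": [80, 95, 70, 88]}
unbounded_scene = {
    "foreground": [85, 90, 75],
    "background": [12, 8, 20, 5],  # each background patch is rarely visible
}

MIN_VIEWS = 50  # threshold that looks harmless on the bounded scene

for scene_name, regions in [("bounded", bounded_scene), ("unbounded", unbounded_scene)]:
    for region, counts in regions.items():
        fraction = kept_fraction(counts, MIN_VIEWS)
        print(f"{scene_name:9s} {region:10s} kept {fraction:.0%}")
```

On the bounded scene the rule removes nothing, so tuning on that dataset alone gives no sign of trouble. On the unbounded scene the same threshold deletes the entire background, which only shows up when the evaluation includes a dataset such as Mip-NeRF 360.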