3D Gaussian Splatting · Evaluation & Survey

Image Quality Metrics: PSNR, SSIM, and LPIPS

By the end of this reading you will be able to:
  • Compute PSNR from MSE and interpret what a 3 dB difference means in terms of relative signal quality; identify the typical PSNR range for 3DGS reconstructions
  • Explain how SSIM measures local luminance, contrast, and structural similarity, and identify the types of distortion (blur, edge shift) that SSIM detects but PSNR misses
  • Distinguish LPIPS from PSNR and SSIM in terms of what it measures (perceptual deep features vs. pixel statistics), and identify when LPIPS should take priority as the evaluation criterion
  • Select the appropriate metric or metric combination given a specific evaluation goal — detecting compression artifacts, measuring perceptual quality, or tracking pixel-level fidelity

Why Metrics Matter

Evaluating a 3DGS compression method requires comparing rendered images against ground-truth photographs. Three metrics dominate the literature: PSNR, SSIM, and LPIPS. Each captures a different aspect of image quality and they are often complementary — a method can improve one while degrading another.

PSNR — Peak Signal-to-Noise Ratio

PSNR expresses the mean squared pixel error on a logarithmic (decibel) scale:

$$\text{PSNR} = 10 \log_{10}\!\left(\frac{\text{MAX}^2}{\text{MSE}}\right)$$

where $\text{MAX} = 1.0$ for float images (or 255 for uint8) and MSE is the mean squared error over all pixels and channels.

Expanded:

$$\text{MSE} = \frac{1}{HWC}\sum_{h,w,c}\bigl(\hat{I}_{h,w,c} - I_{h,w,c}\bigr)^2$$

where $\hat{I}$ is the rendered image and $I$ is the ground-truth image, each with height $H$, width $W$, and $C$ channels.
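The two formulas above translate almost directly into NumPy. The following is a minimal sketch, not a reference implementation: the function names `mse` and `psnr` and the `max_val` parameter are illustrative choices, and the test images are synthetic constants chosen so the result is easy to check by hand.

```python
import numpy as np

def mse(rendered, reference):
    """Mean squared error over all pixels and channels."""
    # Cast to float64 first: subtracting uint8 arrays directly would wrap around.
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if rendered.shape != reference.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean((rendered - reference) ** 2))

def psnr(rendered, reference, max_val=1.0):
    """PSNR in dB; max_val is 1.0 for float images in [0, 1], 255 for uint8."""
    err = mse(rendered, reference)
    if err == 0.0:
        return float("inf")  # identical images: no noise at all
    return float(10.0 * np.log10(max_val ** 2 / err))

# Synthetic check: every pixel is off by 0.1, so MSE = 0.01 and PSNR = 20 dB.
reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
psnr_float = psnr(rendered, reference, max_val=1.0)

# The same images on a 0..255 scale give the same PSNR, provided MAX matches.
psnr_255 = psnr(rendered * 255.0, reference * 255.0, max_val=255.0)
print(psnr_float, psnr_255)
```

The main thing to get right in practice is that `max_val` matches the value range of the arrays; mixing a [0, 1] image with `max_val=255` inflates the result by about 48 dB.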

Interpreting PSNR:

  • < 20 dB: visibly poor quality
  • 25–30 dB: acceptable for compressed video
  • 30–35 dB: good quality (typical of baseline 3DGS on synthetic or clean, bounded scenes; large unbounded real-world captures often score in the 20s)
  • > 35 dB: excellent quality
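These bands are easier to use with a sense of what a dB gap means. For two reconstructions scored against the same reference with the same MAX, the PSNR formula gives a difference of $10 \log_{10}(\text{MSE}_a / \text{MSE}_b)$, so a gap of 3 dB corresponds to roughly a factor of two in MSE (about 1.41 in RMSE). A small sketch of that arithmetic, with illustrative function names:

```python
def mse_ratio(delta_db):
    """MSE(worse) / MSE(better) implied by a PSNR gap of delta_db (same MAX)."""
    return 10.0 ** (delta_db / 10.0)

def rmse_ratio(delta_db):
    """Same gap expressed as a ratio of root-mean-squared errors."""
    return 10.0 ** (delta_db / 20.0)

for gap in (1.0, 3.0, 10.0):
    print(f"{gap:4.1f} dB -> MSE x{mse_ratio(gap):.2f}, RMSE x{rmse_ratio(gap):.2f}")
```

Because the scale is logarithmic, gains add rather than multiply: two successive 3 dB improvements cut the MSE to about a quarter.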

Limitations: PSNR is purely pixel-wise. Two images with the same PSNR can have completely different perceptual quality — a blurry image and a sharp-but-shifted image can score identically. It also treats all spatial locations equally, ignoring perceptual importance.

SSIM — Structural Similarity Index

SSIM compares images along three dimensions: luminance, contrast, and structure.

$$\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x, \mu_y$ are local means, $\sigma_x^2, \sigma_y^2$ are local variances, $\sigma_{xy}$ is the local covariance, and $c_1, c_2$ are stability constants.

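SSIM is computed locally over 11×11 Gaussian-weighted patches, then averaged over the image. For a rendered image compared with its ground truth, the result falls in $[0, 1]$ in practice, with 1.0 meaning identical images; the index drops below zero only when local patches are negatively correlated.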

Why SSIM matters for 3DGS: Compression artifacts often smear or blur texture detail. SSIM's structure term penalizes loss of local correlation patterns (edges, textures) even when mean values match, catching the kinds of degradation that PSNR misses.
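To make this concrete, the sketch below implements single-channel SSIM in plain NumPy and applies it to two distortions with identical MSE: a uniform brightness offset and per-pixel noise. Several things here are assumptions rather than anything stated above: the window standard deviation of 1.5 and the constants $c_1 = (0.01\,\text{MAX})^2$, $c_2 = (0.03\,\text{MAX})^2$ are the commonly used defaults, the test image is a synthetic pattern, and the helper names are illustrative. It is not a substitute for a tested library implementation.

```python
import numpy as np

def gaussian_kernel(size=11, sigma=1.5):
    """1D Gaussian window normalized to sum to 1."""
    coords = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(coords ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def local_mean(img, kernel):
    """Separable Gaussian-weighted mean over 'valid' windows (rows, then columns)."""
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, tmp)

def ssim(x, y, max_val=1.0, size=11, sigma=1.5):
    """Mean SSIM of two 2D (single-channel) images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * max_val) ** 2  # assumed default stability constants
    c2 = (0.03 * max_val) ** 2
    k = gaussian_kernel(size, sigma)
    mu_x, mu_y = local_mean(x, k), local_mean(y, k)
    var_x = local_mean(x * x, k) - mu_x ** 2
    var_y = local_mean(y * y, k) - mu_y ** 2
    cov_xy = local_mean(x * y, k) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return float(ssim_map.mean())

def psnr(a, b, max_val=1.0):
    return float(10.0 * np.log10(max_val ** 2 / np.mean((a - b) ** 2)))

# Synthetic smooth pattern with values in [0.25, 0.75].
yy, xx = np.mgrid[0:64, 0:64]
ref = 0.5 + 0.25 * np.sin(2 * np.pi * xx / 16) * np.cos(2 * np.pi * yy / 16)

# Two distortions with the same per-pixel error magnitude (0.1), hence the same MSE.
rng = np.random.default_rng(0)
offset = ref + 0.1                                      # uniform brightness shift
noisy = ref + 0.1 * rng.choice([-1.0, 1.0], size=ref.shape)  # structure-destroying noise

psnr_offset, psnr_noise = psnr(offset, ref), psnr(noisy, ref)
ssim_offset, ssim_noise = ssim(offset, ref), ssim(noisy, ref)
print(f"PSNR: offset={psnr_offset:.2f} dB, noise={psnr_noise:.2f} dB")
print(f"SSIM: offset={ssim_offset:.3f}, noise={ssim_noise:.3f}")
```

Both distortions score the same PSNR, but the brightness offset leaves local variance and covariance untouched and so keeps a high SSIM, while the noise inflates local variance without adding covariance and is penalized by the structure term.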

LPIPS — Learned Perceptual Image Patch Similarity

LPIPS (Zhang et al. 2018) measures perceptual similarity using features extracted from a pretrained deep network (typically AlexNet or VGG):

$$\text{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \left\| w_l \odot \left( \hat{f}_l^{hw}(x) - \hat{f}_l^{hw}(y) \right) \right\|_2^2$$

where $\hat{f}_l(x)$ and $\hat{f}_l(y)$ are the layer-$l$ feature maps of the two images, unit-normalized along the channel dimension, $H_l \times W_l$ is the spatial size of layer $l$, and $w_l$ are learned channel weights.

Key property: LPIPS is trained to match human perceptual judgments. Two images that look identical to humans but differ in pixel values (e.g., a slight spatial offset) will have low LPIPS but high MSE. Lower LPIPS = more perceptually similar.
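The distance computation itself is simple once the feature maps exist. The sketch below implements only that aggregation step in NumPy. It is not LPIPS proper: the feature maps are random stand-ins rather than activations of a pretrained AlexNet or VGG, the channel weights are set to one rather than learned, and all names are hypothetical. Real evaluations typically rely on a packaged implementation with the pretrained weights.

```python
import numpy as np

def unit_normalize(features, eps=1e-10):
    """Scale the channel vector at every pixel to unit L2 norm. Shape: (C, H, W)."""
    norm = np.sqrt(np.sum(features ** 2, axis=0, keepdims=True))
    return features / (norm + eps)

def lpips_style_distance(feats_x, feats_y, weights):
    """Sum over layers of the spatially averaged, channel-weighted squared distance."""
    total = 0.0
    for fx, fy, w in zip(feats_x, feats_y, weights):
        diff = w[:, None, None] * (unit_normalize(fx) - unit_normalize(fy))
        per_pixel = np.sum(diff ** 2, axis=0)  # squared L2 norm over channels
        total += float(per_pixel.mean())       # 1 / (H_l W_l) * sum over h, w
    return total

# Stand-in "feature maps" for two layers (assumption: random data, not a real network).
rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8)]
feats_ref = [rng.standard_normal(s) for s in shapes]
feats_near = [f + 0.05 * rng.standard_normal(f.shape) for f in feats_ref]
feats_far = [rng.standard_normal(s) for s in shapes]
weights = [np.ones(s[0]) for s in shapes]  # placeholder for learned weights

d_same = lpips_style_distance(feats_ref, feats_ref, weights)
d_near = lpips_style_distance(feats_ref, feats_near, weights)
d_far = lpips_style_distance(feats_ref, feats_far, weights)
print(d_same, d_near, d_far)
```

Unit normalization means the distance depends on the direction of each pixel's feature vector rather than its magnitude, which is part of why the score is insensitive to small changes in exact pixel values.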

Typical values for 3DGS:

  • < 0.05: excellent perceptual quality
  • 0.05–0.15: good, minor perceptual degradation
  • > 0.2: noticeable degradation

How the Three Metrics Complement Each Other

| Metric | Measures | Blind to | Scale |
|---|---|---|---|
| PSNR | Pixel MSE | Structure, perception | dB, higher is better |
| SSIM | Luminance / contrast / structure | Semantic content | 0–1, higher is better |
| LPIPS | Perceptual features | Exact pixel values | ≥ 0 (typically below 1), lower is better |

3DGS compression papers typically report all three metrics, computed on the same test set. A method that optimizes only for PSNR can produce blurry results that score well on MSE but poorly on SSIM and LPIPS.
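When comparing two methods, the opposite directions of the scales are an easy source of mistakes. The sketch below is one hypothetical way to compare per-metric results while respecting "higher is better" versus "lower is better". The scores are made-up illustrative values, not results from any paper, and the names are assumptions.

```python
# Direction of improvement for each metric, matching the table above.
HIGHER_IS_BETTER = {"psnr": True, "ssim": True, "lpips": False}

def compare(method_a, method_b):
    """Return, per metric, which method wins: 'a', 'b', or 'tie'."""
    verdict = {}
    for name, higher in HIGHER_IS_BETTER.items():
        a, b = method_a[name], method_b[name]
        if a == b:
            verdict[name] = "tie"
        else:
            # a wins if it is larger on a higher-is-better metric,
            # or smaller on a lower-is-better metric.
            verdict[name] = "a" if (a > b) == higher else "b"
    return verdict

# Made-up scores for illustration only (test-set averages of two hypothetical methods).
psnr_tuned = {"psnr": 30.1, "ssim": 0.88, "lpips": 0.16}
balanced = {"psnr": 29.6, "ssim": 0.90, "lpips": 0.11}

verdict = compare(psnr_tuned, balanced)
print(verdict)
```

A split verdict like this one, where one method wins on PSNR and the other on SSIM and LPIPS, is the situation in which the evaluation goal has to decide which metric takes priority.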