Unable to Reproduce Paper Results with Provided Pre-trained Models #5

@aliciusschroeder

Hi, thank you for releasing your code and pre-trained models. I've been trying to reproduce the results from your paper on FFHQ and CelebA-HQ but am observing significant discrepancies from the reported performance. I wanted to document my findings in case there's something I'm missing or a known issue.

Environment & Setup

  • Repository commit: eb4e77c84d00678e409343b52d9804b8ff31f467 (retrieved 2026-01-14)
  • Pre-trained models: Iteration 700,000 checkpoints downloaded from BaiduCloud (as linked in README)
  • Test data: FFHQ and CelebA-HQ with mixed contamination types (scribbles + rectangular image patches)

Observed Issues

I've attached a comparison image showing Input → GT → Predicted Mask → Output → Predicted Output columns across 4 test samples.

1. Complete Failure on Rectangular Occlusions

The model fails entirely on square/rectangular image patches (rows 2 & 4 in attached image):

  • The calendar text obstruction remains fully visible in the output
  • The predicted mask localizes only a small central region rather than the full occlusion
  • The Output column shows extreme noise (scattered white pixels across the entire image), suggesting the mask estimation has collapsed

This contrasts with the paper's Table 3 results showing strong performance on "Image Occlusion" contamination patterns.
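
To put a number on the mask collapse described above, the overlap between the predicted mask and the known occlusion region can be measured. Below is a minimal sketch, assuming both masks are available as 2-D arrays of the same shape with values in [0, 1]. The helper name `mask_iou_and_recall` and the 0.5 threshold are my own illustrative choices, not part of the released code.

```python
import numpy as np


def mask_iou_and_recall(pred_mask, gt_mask, threshold=0.5):
    """Return (IoU, recall) between a predicted and a ground-truth mask.

    Hypothetical helper, not repository code. Both inputs are 2-D arrays
    of identical shape; values above `threshold` count as contaminated.
    """
    pred = np.asarray(pred_mask) > threshold
    gt = np.asarray(gt_mask) > threshold
    if pred.shape != gt.shape:
        raise ValueError("mask shapes differ")
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    gt_area = gt.sum()
    # Two empty masks are treated as a perfect match.
    iou = intersection / union if union else 1.0
    recall = intersection / gt_area if gt_area else 1.0
    return float(iou), float(recall)


# Toy example: a 64x64 occlusion of which only the central 16x16 is detected.
gt = np.zeros((256, 256), dtype=np.float32)
gt[96:160, 96:160] = 1.0
pred = np.zeros_like(gt)
pred[120:136, 120:136] = 1.0
iou, recall = mask_iou_and_recall(pred, gt)
print(f"IoU={iou:.4f}, recall={recall:.4f}")  # IoU=0.0625, recall=0.0625
```

The toy masks are synthetic and only illustrate the "small central region" pattern; they are not measurements from the attached samples. I can report these numbers for the real test images if that would be useful.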

2. Poor Texture Quality on Successful Detections

Even where the model correctly identifies contamination (scribble rows 1 & 3):

  • Inpainted regions exhibit over-smoothing / "plastic skin" artifacts
  • Loss of high-frequency detail (pores, skin texture) compared to GT
  • Color blending issues leaving visible discoloration blobs (especially around mouth/chin areas)

3. Facial Feature Reconstruction

  • Lips appear undefined and blurry when occluded
  • Eye reconstruction shows asymmetry and lack of definition

Questions

  1. Are the BaiduCloud checkpoints the same ones used to generate the paper's quantitative results?
  2. Is there specific preprocessing required for the test images beyond resizing to 256×256?
  3. Were the paper results obtained with a different contamination synthesis pipeline than what's in the released code?
  4. Any known issues with certain contamination types (solid rectangles vs. irregular masks)?
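
Regarding question 1, one way to rule out a corrupted or mismatched download is to compare file checksums. Below is a minimal sketch using only the Python standard library. The function name `sha256_of_file` is my own, and the checkpoint path in the usage note is a placeholder.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("path/to/checkpoint")` on each downloaded file gives a digest that could be compared against the digests of the checkpoints used for the paper, if you are able to share them.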

Attached

  • comparison.jpg: Side-by-side comparison showing the issues described above

I'd appreciate any guidance on reproducing the reported results. Happy to provide additional details or test specific configurations if helpful.

Thanks for your time!
