Description
Panoptic scene graph generation (PSG) aims to provide structured scene understanding, yet relation prediction remains error-prone: minor perturbations in the input can lead to inconsistent results. This thesis investigates whether a two-stage PSG pipeline can improve relation recall over existing baselines and whether structured outputs can support adversarial robustness. We present DINOgraph, which combines Mask DINO’s panoptic segmentation with a VCTree head that fuses mask- and box-level features to improve predicate prediction. We also introduce a proof-of-concept LLM auditing pipeline that identifies scene graph inconsistencies caused by a simulated label flip on a single mask. In our experiments, DINOgraph improves relation recall by ∼30% over two-stage baselines such as VCTree and by ∼50% over one-stage models such as PSGFormer. The auditing pipeline detects clear contradictions, but it exhibits non-negligible false positive and false negative rates and remains sensitive to calibration. Consequently, we currently treat it as a proof of concept rather than a deployable system, and we propose an approach to overcome these limitations and advance the pipeline toward deployability.

Keywords: panoptic segmentation, panoptic scene graph generation, adversarial robustness, adversarial patch attacks, large language models, LLM auditing.