• Media type: Other publication; dissertation; electronic university thesis; e-book
  • Title: Holistic scene understanding through image and video scene graphs
  • Contributors: Cong, Yuren [author]
  • Published: Hannover : Institutionelles Repositorium der Leibniz Universität Hannover, 2024
  • Edition: published version
  • Language: English
  • DOI: https://doi.org/10.15488/17548
  • Keywords: Scene Graph ; Scene Understanding ; Visual Relationship Detection ; Video Understanding
  • Notes: This data source also contains holdings records that do not lead to a full text.
  • Description: A scene graph is a graph structure in which nodes represent the entities in a scene and edges encode the relationships between them (see the minimal sketch after this record). It is viewed as a promising route to holistic scene understanding, as well as a tool for bridging the domains of vision and language. Despite this potential, the field lacks a comprehensive, systematic analysis of scene graphs and their practical applications. This dissertation fills that gap with significant contributions to both image-based and video-based scene graphs. For image-based scene graphs, a high-performing two-stage scene graph generation method is proposed first; it performs scene graph generation by solving a neural variant of ordinary differential equations. To further reduce the time complexity and inference time of two-stage approaches, image-based scene graph generation is then formulated as a set prediction problem, and a Transformer-based model is proposed that infers visual relationships without requiring object proposals. During the study of image-based scene graph generation, we find that the existing evaluation metrics fail to capture the overall semantic difference between a scene graph and an image. To overcome this limitation, we propose a contrastive learning framework that measures the similarity between scene graphs and images; the framework can also serve as a scene graph encoder for further applications. For video-based scene graphs, a Transformer-based dynamic scene graph generation method is proposed to capture spatial context and temporal dependencies; it has become a popular baseline for this task. Moreover, to extend video scene graph applications, a semantic scene-graph-to-video synthesis framework is proposed that synthesizes a fixed-length video from an initial scene image and discrete semantic video scene graphs. The video and graph representations are modeled by a GPT-like Transformer using an auto-regressive prior. These methods have demonstrated state-of-the-art ...
  • Access status: Open access
  • Rights/usage notes: Attribution (CC BY)
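
The following is a minimal sketch of the scene graph data structure described in the abstract, assuming a simple (subject, predicate, object) triplet representation; the class and field names are hypothetical illustrations, not code from the dissertation.

```python
from dataclasses import dataclass, field

# Minimal illustrative scene graph: nodes are entities, edges are
# (subject, predicate, object) relationship triplets. All names here
# are hypothetical; this is a sketch, not the dissertation's code.

@dataclass
class Entity:
    id: int
    label: str            # e.g. "person", "horse"

@dataclass
class Relationship:
    subject: int          # Entity.id of the subject node
    predicate: str        # e.g. "riding", "standing on"
    object: int           # Entity.id of the object node

@dataclass
class SceneGraph:
    entities: list[Entity] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def triplets(self) -> list[tuple[str, str, str]]:
        """Return human-readable (subject, predicate, object) triplets."""
        by_id = {e.id: e.label for e in self.entities}
        return [(by_id[r.subject], r.predicate, by_id[r.object])
                for r in self.relationships]

# Example scene: a person riding a horse that stands on grass.
g = SceneGraph(
    entities=[Entity(0, "person"), Entity(1, "horse"), Entity(2, "grass")],
    relationships=[Relationship(0, "riding", 1),
                   Relationship(1, "standing on", 2)],
)
print(g.triplets())
# [('person', 'riding', 'horse'), ('horse', 'standing on', 'grass')]
```

A video scene graph extends this idea by attaching such a graph to each frame or time span, so that relationships can appear, persist, or change over time.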