Abstract
Conditional image generation, which aims to generate images consistent with a user's input, is one of the critical problems in computer vision. Text-to-image models have succeeded in generating realistic images for simple situations in which only a few objects are present, yet they often fail to generate consistent images for texts representing complex situations. Scene-graph-to-image models have the advantage of generating images for complex situations based on the structure of a scene graph. We previously extended a scene-graph-to-image model into a model that generates images from hyper scene graphs with trinomial hyperedges. That model, termed hsg2im, improved the consistency of the generated images; however, it has difficulty generating natural and consistent images for hyper scene graphs with many objects, because its graph convolutional network struggles to capture relations between distant objects. In this paper, we propose a novel image generation model that addresses this shortcoming by introducing object attention layers. We also use a layout-to-image model as an auxiliary component to generate higher-resolution images. Experimental validation on the COCO-Stuff and Visual Genome datasets shows that the proposed model generates images that are more natural and more consistent with users' inputs than the cutting-edge hyper scene-graph-to-image model.
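The abstract's key point is that attention over object features lets every object attend to every other object in a single layer, whereas a graph convolution only propagates information along edges, so distant objects need many layers to interact. The following is a minimal sketch of that idea as generic scaled dot-product self-attention over per-object embeddings; the function name `object_attention` and the NumPy formulation are illustrative assumptions, not the paper's actual layer.

```python
import numpy as np

def object_attention(obj_feats, w_q, w_k, w_v):
    """Scaled dot-product self-attention over per-object feature vectors.

    obj_feats: (n, d) array of object embeddings, e.g. the output of a
    graph convolutional layer. Returns (n, d) attended features in which
    every object aggregates information from all other objects at once.
    """
    q = obj_feats @ w_q                      # queries,  (n, d)
    k = obj_feats @ w_k                      # keys,     (n, d)
    v = obj_feats @ w_v                      # values,   (n, d)
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)          # (n, n) pairwise object affinities
    scores -= scores.max(axis=-1, keepdims=True)   # softmax, numerically stable
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                       # each row mixes all objects' values

# Toy usage: 5 objects with 8-dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 5, 8
feats = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out = object_attention(feats, w_q, w_k, w_v)
print(out.shape)  # (5, 8)
```

Because the (n, n) affinity matrix is dense, object pairs that share no edge in the scene graph can still exchange information directly, which is the property the graph convolutional network in hsg2im lacks for distant objects.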