SegLLM

Abstract

We present SegLLM, a novel multi-round interactive reasoning segmentation model that enhances LLM-based segmentation by exploiting conversational memory of both visual and textual outputs. By leveraging a mask-aware multimodal LLM, SegLLM re-integrates previous segmentation results into its input stream, enabling it to reason about complex user intentions and segment objects in relation to previously identified entities, including positional, interactional, and hierarchical relationships, across multiple interactions. This capability allows SegLLM to respond to visual and text queries in a chat-like manner. Evaluated on the newly curated MRSeg benchmark, SegLLM outperforms existing methods in multi-round interactive reasoning segmentation by over 20%. Additionally, we observed that training on multi-round reasoning segmentation data enhances performance on standard single-round referring segmentation and localization tasks, resulting in a 5.5% increase in cIoU for referring expression segmentation and a 4.5% improvement in Acc@0.5 for referring expression localization.
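The two metrics quoted above can be made concrete with a small sketch. It assumes the definitions commonly used in referring segmentation work: cIoU as the cumulative intersection divided by the cumulative union over all samples, and Acc@0.5 as the fraction of predicted boxes whose IoU with the ground-truth box is at least 0.5. The function names and the flat 0/1 mask representation are illustrative choices, not taken from the SegLLM code.

```python
def ciou(pred_masks, gt_masks):
    """Cumulative IoU: summed intersection over summed union across samples.

    Each mask is a flat sequence of 0/1 pixel labels (an assumed format).
    """
    inter = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for p, g in zip(pred, gt):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def acc_at_05(pred_boxes, gt_boxes):
    """Fraction of predictions whose box IoU with the ground truth is >= 0.5."""
    hits = sum(box_iou(p, g) >= 0.5 for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

Because cIoU pools pixels across the whole evaluation set, large objects weigh more than small ones, which is why it is often reported alongside per-sample metrics.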

MRSeg Dataset

The success of SegLLM is attributable to its high-quality multi-round segmentation dataset, MRSeg. MRSeg incorporates a diverse set of inter-object relations, such as hierarchical relationships (e.g. "the hand of"), positional relationships (e.g. "to the left of"), interactional relationships (e.g. "looking at"), and other attribute-oriented queries. MRSeg is built on several widely used datasets and includes data from RefCOCO(+/g), Visual Genome, PACO-LVIS, LVIS, Pascal Panoptic Part, ADE20K, COCO-Stuff and MSCOCO.
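To illustrate what a multi-round sample with inter-object relations might look like, here is a purely hypothetical record layout. The class names, field names, and the `ref_round` convention are assumptions made for illustration; they do not describe the actual MRSeg file format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Round:
    query: str                 # natural-language instruction for this round
    relation: str              # e.g. "hierarchical", "positional", "interactional", "attribute"
    ref_round: Optional[int]   # index of the earlier round whose mask is referenced, if any


@dataclass
class Conversation:
    image_id: str              # hypothetical identifier
    rounds: List[Round]

    def is_valid(self) -> bool:
        """A round may only refer back to a strictly earlier round."""
        return all(
            r.ref_round is None or 0 <= r.ref_round < i
            for i, r in enumerate(self.rounds)
        )


# Illustrative conversation: round 1 refers to the mask produced in round 0.
conv = Conversation(
    image_id="example",
    rounds=[
        Round("the person", "attribute", None),
        Round("the hand of the person segmented earlier", "hierarchical", 0),
    ],
)
```

The back-reference check captures the property that distinguishes multi-round data from single-round referring expressions: later queries depend on masks from earlier rounds.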

SegLLM Architecture

SegLLM performs multi-round interactive image reasoning segmentation, which involves understanding complex user queries and segmenting entities based on their relationships with other objects. Two key designs in SegLLM facilitate this goal. First, we implement a mask encoding scheme that reincorporates the reference mask information into the input stream of the LLM, which enables the LLM to reason about segmented masks from previous rounds. Second, we develop a mask-aware decoding scheme that allows the mask decoder to generate new masks based on both the output of the LLM and the historical memory of output masks. The model uses the last-layer hidden states associated with the [REF] and [SEG] tokens to generate both the reference mask and the target mask, integrating past and current segmentation results. More details can be found in the paper.
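The idea of feeding a previous mask back into the input stream can be sketched as follows. This is a toy illustration under stated assumptions, not the SegLLM implementation: it assumes the mask is summarized by masked average pooling over image features, and that the resulting vector is inserted at a [REF] position ahead of the next query. The function names and the list-based tensors are hypothetical.

```python
def masked_average_pool(features, mask):
    """Average the feature vectors of all pixels where mask == 1.

    features: H x W x C nested lists; mask: H x W nested lists of 0/1.
    Returns a C-dimensional list (all zeros if the mask is empty).
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total
    return [t / count for t in total]


def build_next_round_input(history, ref_embedding, query_embeddings):
    """Append a [REF] entry carrying the mask embedding, then the new query.

    history and the return value are lists of (token_kind, embedding) pairs.
    """
    next_input = list(history)
    next_input.append(("[REF]", ref_embedding))
    next_input.extend(("text", e) for e in query_embeddings)
    return next_input
```

In this sketch the conversation history grows by one [REF] entry per referenced mask, so a later query such as "the hand of" the earlier object can attend to an embedding of that mask rather than only to its textual description.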

Multi-round Interactive Segmentation

SegLLM consistently outperforms existing models such as LISA. It not only excels at single-round referring segmentation but also demonstrates strong reasoning capabilities when answering multi-round queries involving interactions between objects or hierarchical relations between objects and their parts.


Multi-round Reasoning Segmentation

SegLLM is a multi-round conversation agent capable of localizing objects following natural language instructions.
