Gemini models occupy a distinctive position in the large-language-model market, with notable technical potential in computer vision. Compared with other mainstream LLMs, Gemini offers native support for object detection and image segmentation: the larger Gemini models are trained to output bounding-box coordinates and segmentation masks directly, a capability that is still rare in the current LLM ecosystem. Models such as Qwen-VL and Moondream offer similar functionality, but in terms of segmentation quality the Gemini Pro series holds a clear edge. This article walks through a practical scenario, detecting foreign objects on an industrial conveyor belt, and shows how to build a complete solution on top of Gemini's segmentation capability.
How Gemini's image segmentation works
Gemini models can segment target objects in an image, returning both a segmentation mask and a bounding box for each object. The key to triggering this capability is constructing a suitable prompt.
The standard prompt format looks like this:
query="Detect ..."
prompt=f"{query}. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
This prompt instructs the model to return its detections as a JSON list, where the mask field contains a base64-encoded PNG image that describes the pixel-level region of the detected object:
[
{
"box_2d": [120, 514, 600, 998],
"mask": "data:image/png;base64,iVBORw0KGgoAAA...",
"label": "my label",
},
{
"box_2d": [220, 29, 609, 320],
"mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAAAsElEQVR42....",
"label": "my other label",
},
...
]
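As a minimal sketch of how one of these entries can be consumed (the helper name decode_mask is ours, not part of the Gemini API), the data-URI prefix is stripped and the remaining base64 payload is decoded into a PIL image sized to the bounding box:
import base64
import io

from PIL import Image


def decode_mask(mask_str: str) -> Image.Image:
    # Strip the data-URI prefix, then decode the base64 PNG payload
    prefix = "data:image/png;base64,"
    png_bytes = base64.b64decode(mask_str.removeprefix(prefix))
    return Image.open(io.BytesIO(png_bytes))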
Setting up the project
To build the full segmentation application, start with a clean project layout. Create a project directory named gemini_segmentation_project with the following structure:
gemini_segmentation_project/
├── .env
├── main.py
├── image.png
└── requirements.txt
First, configure the environment-variable file .env, which stores the API key securely:
#.env
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"
The project also needs a test image. This example assumes a conveyor-belt photo containing several objects; save it as image.png in the project root.
It is recommended to create a dedicated Python virtual environment for the project, keeping dependencies isolated and the setup reproducible.
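A typical setup (assuming a Unix-like shell; the activation command differs on Windows) looks like this:
python -m venv .venv
source .venv/bin/activate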
Project dependencies are managed through requirements.txt:
#requirements.txt
google-genai
numpy>=2.2.6
Pillow>=11.2.1
python-dotenv>=1.1.0
pydantic
Note that the project requires Python 3.11 or later. With the virtual environment activated, install the dependencies:
pip install -r requirements.txt
Basic implementation
Implement the basic segmentation flow in main.py:
import os
from io import BytesIO
from PIL import Image
from dotenv import load_dotenv
from google import genai
from google.genai import types
from google.genai.types import HttpOptions
load_dotenv()
GEMINI_TIMEOUT_MS = 60 * 1000
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"),http_options=HttpOptions(timeout=GEMINI_TIMEOUT_MS))
if __name__ == "__main__":
image = "image.png"
query = "Detect all foreign objects in the conveyor belt"
prompt = f"{query}. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
im = Image.open(BytesIO(open(image, "rb").read()))
im.thumbnail([1024,1024], Image.Resampling.LANCZOS)
    # Run the model to generate segmentation masks
response = client.models.generate_content(
model="gemini-2.5-flash-preview-05-20", # "gemini-2.5-pro-preview-05-06"
contents=[prompt, im],
config=types.GenerateContentConfig(
temperature=0.5,
)
)
print(response.text)
Running the script prints JSON similar to the following to the terminal:
[
{
"box_2d": [219, 149, 439, 299],
"mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACA..",
"label": "L-shaped white bracket",
},
{
"box_2d": [238, 574, 642, 638],
"mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACA..",
"label": "perforated white strip (left)",
},
{
"box_2d": [207, 788, 638, 970],
"mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",
"label": "perforated white strip (right)",
},
]
Structured output validation
Large language models are somewhat unreliable when producing structured output: the format may be malformed, key fields may be missing, or the result may be empty. With Gemini, common failure modes include omitting the data:image/png;base64, prefix, wrapping the JSON in explanatory text, or returning an empty array. To keep the application robust, a solid validation layer is needed.
Pydantic is a natural fit here: by defining a strict schema, the model's output can be validated and type-converted automatically.
Start by defining the base data structure for a segmentation result:
from pydantic import BaseModel, ValidationError, field_validator, ConfigDict, Field
class SegmentationOutput(BaseModel):
    label: str
    box_2d: list[int] = Field(..., min_length=4, max_length=4, description="y0,x0,y1,x1")
    mask: str

    @field_validator("mask", mode="before")
    @classmethod
    def ensure_prefix(cls, png_str: str) -> str:
        prefix = "data:image/png;base64,"
        if not png_str.startswith(prefix):
            raise ValueError(f"mask must start with '{prefix}'")
        return png_str
The field_validator decorator in this schema checks that the mask field carries the correct base64 prefix, so that the image-decoding step later in the pipeline can run without surprises.
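As a small illustration with made-up values, an entry whose mask string lacks the prefix is rejected, while a well-formed entry passes:
good = {"label": "bolt", "box_2d": [10, 20, 30, 40], "mask": "data:image/png;base64,iVBOR..."}
bad = {"label": "bolt", "box_2d": [10, 20, 30, 40], "mask": "iVBOR..."}  # prefix missing

SegmentationOutput.model_validate(good)   # passes
try:
    SegmentationOutput.model_validate(bad)
except ValidationError as e:
    print(e)  # reports that the mask must start with "data:image/png;base64,"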
To handle responses where the JSON is wrapped in a Markdown code fence, and to validate every entry in the returned list, implement the following function:
import json


def validate_json(json_output: str) -> list[SegmentationOutput] | None:
segmentation_list: list[SegmentationOutput] = []
lines = json_output.splitlines()
for i, line in enumerate(lines):
if line.strip() == "```json":
content = "\n".join(lines[i + 1 :])
content = content.split("```")[0]
json_output = content
break
try:
json_list = json.loads(json_output)
except ValueError as e:
raise ValueError(f"JSON output was wrongly formatted: {e}")
if not isinstance(json_list, list):
return None
for element in json_list:
try:
segmentation = SegmentationOutput.model_validate(element)
segmentation_list.append(segmentation)
except ValidationError as e:
print(f"Validation error {e}")
return segmentation_list or None
The function first tries to extract the JSON payload from a Markdown code block, then validates each segmentation entry one by one, ensuring the data is complete and well formed.
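A quick sanity check with a hand-written response string (the values are made up) shows both behaviors at once: the Markdown fence is stripped, and the entry that fails the prefix check is skipped:
raw = '''```json
[
  {"box_2d": [10, 20, 30, 40], "mask": "data:image/png;base64,iVBOR...", "label": "bolt"},
  {"box_2d": [50, 60, 70, 80], "mask": "iVBOR...", "label": "prefix missing"}
]
```'''
items = validate_json(raw)
print(len(items))       # 1 - the second entry failed validation and was skipped
print(items[0].label)   # bolt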
Visualizing the segmentation results
To verify that the masks Gemini produces are accurate, the results need to be rendered on top of the original image. This involves three core steps: coordinate conversion, mask decoding, and overlay rendering.
Gemini returns bounding-box coordinates normalized to a 0-1000 range, so they must be rescaled to the actual image dimensions. Likewise, the base64-encoded mask data has to be decoded into a NumPy array for efficient image processing. Finally, a colored overlay layer is composited onto the image so that each detected object appears in a distinct color, with its bounding box and label drawn on top.
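For a concrete sense of the coordinate conversion (the image size here is a made-up example), a normalized box of [219, 149, 439, 299] on a 1024 x 683 thumbnail maps to pixel coordinates as follows:
img_width, img_height = 1024, 683      # hypothetical thumbnail size
box_2d = [219, 149, 439, 299]          # normalized [y0, x0, y1, x1] in the 0-1000 range

abs_y0 = int(box_2d[0] / 1000 * img_height)   # 149
abs_x0 = int(box_2d[1] / 1000 * img_width)    # 152
abs_y1 = int(box_2d[2] / 1000 * img_height)   # 299
abs_x1 = int(box_2d[3] / 1000 * img_width)    # 306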
To carry the NumPy mask array, extend the base model:
class SegmentationItem(SegmentationOutput):
    np_mask: np.ndarray
    model_config = ConfigDict(arbitrary_types_allowed=True)
Next, implement the core mask-parsing function, which handles coordinate conversion and mask decoding:
def parse_segmentation_masks(
    predicted_str: str, *, img_height: int, img_width: int
) -> list[SegmentationItem]:
validated = validate_json(predicted_str)
print(validated)
if not validated:
return []
    results: list[SegmentationItem] = []
for item in validated:
abs_y0 = int(item.box_2d[0] / 1000 * img_height)
abs_x0 = int(item.box_2d[1] / 1000 * img_width)
abs_y1 = int(item.box_2d[2] / 1000 * img_height)
abs_x1 = int(item.box_2d[3] / 1000 * img_width)
if abs_y0 >= abs_y1 or abs_x0 >= abs_x1:
print("Invalid bounding box", (item.box_2d))
continue
prefix = "data:image/png;base64,"
png_str = item.mask
raw_data = base64.b64decode(png_str.removeprefix(prefix))
pil_mask = Image.open(io.BytesIO(raw_data))
bbox_height = abs_y1 - abs_y0
bbox_width = abs_x1 - abs_x0
if bbox_height < 1 or bbox_width < 1:
print("Invalid bounding box")
continue
pil_mask = pil_mask.resize(
(bbox_width, bbox_height), resample=Image.Resampling.BILINEAR
)
np_mask_full = np.zeros((img_height, img_width), dtype=np.uint8)
np_mask_full[abs_y0:abs_y1, abs_x0:abs_x1] = np.array(pil_mask)
try:
seg_item = SegmentationItem(
label=item.label,
box_2d=[abs_y0, abs_x0, abs_y1, abs_x1],
mask=item.mask,
np_mask=np_mask_full,
)
results.append(seg_item)
except ValidationError as e:
print("Validation error in final item:", e)
continue
return results
This function performs the key data transformations: converting normalized coordinates to pixel coordinates, decoding the base64 mask, resizing the mask to fit the bounding box, and building a full-size mask array.
Next, implement the overlay function, which renders each segmented region as a semi-transparent colored layer:
def overlay_mask_on_img(
img: Image.Image, mask: np.ndarray, color: str, alpha: float = 0.7
) -> Image.Image:
if not (0.0 <= alpha <= 1.0):
raise ValueError("Alpha must be between 0.0 and 1.0")
try:
color_rgb = ImageColor.getrgb(color)
except ValueError as e:
raise ValueError(f"Invalid color name '{color}'. Error: {e}")
img_rgba = img.convert("RGBA")
width, height = img_rgba.size
alpha_int = int(alpha * 255)
overlay_color_rgba = color_rgb + (alpha_int,)
colored_layer = np.zeros((height, width, 4), dtype=np.uint8)
mask_logical = mask > 127
colored_layer[mask_logical] = overlay_color_rgba
colored_mask = Image.fromarray(colored_layer, "RGBA")
return Image.alpha_composite(img_rgba, colored_mask)
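A quick way to eyeball the overlay logic is to run it on a synthetic image and mask (the sizes and color here are arbitrary; PIL and NumPy are assumed to be imported as in the surrounding snippets):
base = Image.new("RGB", (200, 200), "black")
mask = np.zeros((200, 200), dtype=np.uint8)
mask[50:150, 50:150] = 255                       # a filled square in the center
preview = overlay_mask_on_img(base, mask, "red", alpha=0.5)
# preview.show()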
Finally, implement the full visualization function, which combines the mask overlays with bounding-box and label drawing:
additional_colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]
def plot_segmentation_masks(
img: Image.Image, segmentation_items: list[SegmentationItem]
) -> Image.Image:
colors = [
"red", "green", "blue", "yellow", "orange", "pink", "purple",
"brown", "gray", "beige", "turquoise", "cyan", "magenta", "lime",
"navy", "maroon", "teal", "olive", "coral", "lavender", "violet",
"gold", "silver",
] + additional_colors
font = ImageFont.load_default()
    # Overlay the masks using the NumPy arrays rather than the base64 strings
for i, item in enumerate(segmentation_items):
color = colors[i % len(colors)]
img = overlay_mask_on_img(img, item.np_mask, color)
draw = ImageDraw.Draw(img)
    # Draw bounding boxes and labels using box_2d = [y0, x0, y1, x1]
for i, item in enumerate(segmentation_items):
color = colors[i % len(colors)]
y0, x0, y1, x1 = item.box_2d
draw.rectangle(
((x0, y0), (x1, y1)), outline=color, width=4
)
if item.label:
            # Place the label slightly above the top-left corner of the box
draw.text((x0 + 8, y0 - 20), item.label, fill=color, font=font)
return img
The function assigns a different color to each detected object to keep them visually distinct, and draws the corresponding bounding box and text label on the image.
Complete implementation
Putting all of the components above together gives the full segmentation application:
import os
import io
import json
import base64
from io import BytesIO
from PIL import Image, ImageColor, ImageFont, ImageDraw
from dotenv import load_dotenv
from google import genai
from google.genai import types
from google.genai.types import HttpOptions
from pydantic import BaseModel, ValidationError, field_validator, ConfigDict, Field
import numpy as np
load_dotenv()
GEMINI_TIMEOUT_MS = 60 * 1000
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"),http_options=HttpOptions(timeout=GEMINI_TIMEOUT_MS))
class SegmentationOutput(BaseModel):
    label: str
    box_2d: list[int] = Field(..., min_length=4, max_length=4, description="y0,x0,y1,x1")
    mask: str

    @field_validator("mask", mode="before")
    @classmethod
    def ensure_prefix(cls, png_str: str) -> str:
        prefix = "data:image/png;base64,"
        if not png_str.startswith(prefix):
            raise ValueError(f"mask must start with '{prefix}'")
        return png_str
class SegmentationItem(SegmentationOutput):
    np_mask: np.ndarray
    model_config = ConfigDict(arbitrary_types_allowed=True)
def validate_json(json_output: str) -> list[SegmentationOutput] | None:
segmentation_list: list[SegmentationOutput] = []
lines = json_output.splitlines()
for i, line in enumerate(lines):
if line.strip() == "```json":
content = "\n".join(lines[i + 1 :])
content = content.split("```")[0]
json_output = content
break
try:
json_list = json.loads(json_output)
except ValueError as e:
raise ValueError(f"JSON output was wrongly formatted: {e}")
if not isinstance(json_list, list):
return None
for element in json_list:
try:
segmentation = SegmentationOutput.model_validate(element)
segmentation_list.append(segmentation)
except ValidationError as e:
print(f"Validation error {e}")
return segmentation_list or None
def parse_segmentation_masks(
    predicted_str: str, *, img_height: int, img_width: int
) -> list[SegmentationItem]:
validated = validate_json(predicted_str)
print(validated)
if not validated:
return []
    results: list[SegmentationItem] = []
for item in validated:
abs_y0 = int(item.box_2d[0] / 1000 * img_height)
abs_x0 = int(item.box_2d[1] / 1000 * img_width)
abs_y1 = int(item.box_2d[2] / 1000 * img_height)
abs_x1 = int(item.box_2d[3] / 1000 * img_width)
if abs_y0 >= abs_y1 or abs_x0 >= abs_x1:
print("Invalid bounding box", (item.box_2d))
continue
prefix = "data:image/png;base64,"
png_str = item.mask
raw_data = base64.b64decode(png_str.removeprefix(prefix))
pil_mask = Image.open(io.BytesIO(raw_data))
bbox_height = abs_y1 - abs_y0
bbox_width = abs_x1 - abs_x0
if bbox_height < 1 or bbox_width < 1:
print("Invalid bounding box")
continue
pil_mask = pil_mask.resize(
(bbox_width, bbox_height), resample=Image.Resampling.BILINEAR
)
np_mask_full = np.zeros((img_height, img_width), dtype=np.uint8)
np_mask_full[abs_y0:abs_y1, abs_x0:abs_x1] = np.array(pil_mask)
try:
seg_item = SegmentationItem(
label=item.label,
box_2d=[abs_y0, abs_x0, abs_y1, abs_x1],
mask=item.mask,
np_mask=np_mask_full,
)
results.append(seg_item)
except ValidationError as e:
print("Validation error in final item:", e)
continue
return results
def overlay_mask_on_img(
img: Image.Image, mask: np.ndarray, color: str, alpha: float = 0.7
) -> Image.Image:
if not (0.0 <= alpha <= 1.0):
raise ValueError("Alpha must be between 0.0 and 1.0")
try:
color_rgb = ImageColor.getrgb(color)
except ValueError as e:
raise ValueError(f"Invalid color name '{color}'. Error: {e}")
img_rgba = img.convert("RGBA")
width, height = img_rgba.size
alpha_int = int(alpha * 255)
overlay_color_rgba = color_rgb + (alpha_int,)
colored_layer = np.zeros((height, width, 4), dtype=np.uint8)
mask_logical = mask > 127
colored_layer[mask_logical] = overlay_color_rgba
colored_mask = Image.fromarray(colored_layer, "RGBA")
return Image.alpha_composite(img_rgba, colored_mask)
additional_colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]
def plot_segmentation_masks(
img: Image.Image, segmentation_items: list[SegmentationItem]
) -> Image.Image:
colors = [
"red", "green", "blue", "yellow", "orange", "pink", "purple",
"brown", "gray", "beige", "turquoise", "cyan", "magenta", "lime",
"navy", "maroon", "teal", "olive", "coral", "lavender", "violet",
"gold", "silver",
] + additional_colors
font = ImageFont.load_default()
    # Overlay the masks using the NumPy arrays rather than the base64 strings
for i, item in enumerate(segmentation_items):
color = colors[i % len(colors)]
img = overlay_mask_on_img(img, item.np_mask, color)
draw = ImageDraw.Draw(img)
    # Draw bounding boxes and labels using box_2d = [y0, x0, y1, x1]
for i, item in enumerate(segmentation_items):
color = colors[i % len(colors)]
y0, x0, y1, x1 = item.box_2d
draw.rectangle(
((x0, y0), (x1, y1)), outline=color, width=4
)
if item.label:
            # Place the label slightly above the top-left corner of the box
draw.text((x0 + 8, y0 - 20), item.label, fill=color, font=font)
return img
if __name__ == "__main__":
image = "image.png"
query = "Detect all foreign objects in the conveyor belt"
prompt = f"{query}. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
im = Image.open(BytesIO(open(image, "rb").read()))
im.thumbnail([1024,1024], Image.Resampling.LANCZOS)
    # Run the model to generate segmentation masks
response = client.models.generate_content(
model="gemini-2.5-pro-preview-05-06", # "gemini-2.5-flash-preview-05-20"
contents=[prompt, im],
config=types.GenerateContentConfig(
temperature=0.5,
)
)
    # Plot the results
segmentation_masks = parse_segmentation_masks(response.text, img_height=im.size[1], img_width=im.size[0])
im = plot_segmentation_masks(im, segmentation_masks)
im.show()
Running the full program displays the annotated image with the detected objects highlighted.
Technical limitations and analysis
In practice, a Gemini-based segmentation pipeline has several limitations worth keeping in mind.
First, the model can be unstable when formatting the mask data. When the output format is constrained too tightly, it sometimes produces degenerate output containing long runs of a repeated character, most visibly base64 mask strings filled with sequences such as "AAAAA...". Such output is not only invalid, it also significantly increases base64 decoding time and drags down overall system performance.
Second, these long runs of repeated characters add extra computational overhead: decoding an extremely long, repetition-filled base64 string consumes more time and memory, which can become a bottleneck in latency-sensitive applications.
Here is a typical example of problematic output:
[
{
"box_2d": [120, 514, 600, 998],
"mask": "data:image/png;base64,iVBORw0AAAAA....",
"label": "spalling with exposed rebar",
},
{
"box_2d": [220, 29, 609, 320],
"mask": "data:image/png;base64,AAAAAAAAAAAAA...",
"label": "spalling with exposed rebar",
},
{
"box_2d": [14, 11, 111, 234],
"mask": "data:image/png;base64,AAAAAAAAAAAAAAA...",
"label": "crack",
},
]
To mitigate these issues, production deployments should include appropriate error handling: base64 integrity checks, detection of abnormally long outputs, and retry strategies where needed, as sketched below.
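A minimal sketch of such a guard (the thresholds are arbitrary assumptions, not tuned values) could look like this; a mask that fails the check can be dropped or the request retried:
import base64


def looks_degenerate(mask_str: str, max_len: int = 200_000, max_run: int = 5_000) -> bool:
    # Heuristic pre-check for obviously broken mask strings before decoding
    prefix = "data:image/png;base64,"
    payload = mask_str.removeprefix(prefix)
    if len(payload) > max_len:                 # abnormally long output
        return True
    run, prev = 0, ""
    for ch in payload:                         # long runs of one repeated character, e.g. "AAAA..."
        run = run + 1 if ch == prev else 1
        if run > max_run:
            return True
        prev = ch
    try:
        base64.b64decode(payload, validate=True)  # reject non-base64 characters
    except Exception:
        return True
    return False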
Summary
Gemini models bring a distinctive capability to object detection and image segmentation: natively generated segmentation masks open a new path for computer-vision applications. The industrial conveyor-belt foreign-object detection case in this article shows that the approach is feasible and genuinely useful in practice.
That said, LLM-based image segmentation is still a young technique and faces challenges around output stability and processing efficiency. Real deployments need to account for these limitations and build in appropriate fault tolerance and optimization.
As the models continue to improve, these limitations should gradually shrink, making the approach a reliable option for a wider range of computer-vision scenarios.
https://avoid.overfit.cn/post/686368d5afc44b4397c299f1ef97319a