Gemini models occupy a distinctive position in the large-language-model market, with notable technical potential in computer vision. Compared with other mainstream LLMs, Gemini offers native support for object detection and image segmentation: the larger Gemini models are trained to output bounding-box coordinates and segmentation masks directly, a capability that is still rare in the current LLM ecosystem. Models such as Qwen-VL and Moondream offer similar functionality, but in terms of segmentation quality the Gemini Pro series holds a clear edge. This article walks through a practical scenario, detecting foreign objects on an industrial conveyor belt, and shows how to build a complete solution on top of Gemini's segmentation capability.
How Gemini's image segmentation works
Gemini models can segment target objects in an image, returning both a segmentation mask and a bounding box for each object. The key to triggering this capability is constructing a suitable prompt.
The standard prompt format looks like this:
query="Detect ..."
prompt=f"{query}. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
This prompt instructs the model to return its detections as a JSON list, where the mask field contains a base64-encoded PNG image that describes the pixel-level region of the detected object:
[
{
"box_2d": [120, 514, 600, 998],
"mask": "data:image/png;base64,iVBORw0KGgoAAA...",
"label": "my label",
},
{
"box_2d": [220, 29, 609, 320],
"mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAAAsElEQVR42....",
"label": "my other label",
},
...
]
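As a minimal sketch of how one of these entries can be consumed (the helper name decode_mask is ours, not part of the Gemini API), the data-URI prefix is stripped and the remaining base64 payload is decoded into a PIL image sized to the bounding box:
import base64
import io

from PIL import Image


def decode_mask(mask_str: str) -> Image.Image:
    # Strip the data-URI prefix, then decode the base64 PNG payload
    prefix = "data:image/png;base64,"
    png_bytes = base64.b64decode(mask_str.removeprefix(prefix))
    return Image.open(io.BytesIO(png_bytes))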
Setting up the project
To build the full segmentation application, start with a clean project layout. Create a project directory named gemini_segmentation_project with the following structure:
gemini_segmentation_project/
├── .env
├── main.py
├── image.png
└── requirements.txt
First, configure the environment-variable file .env, which stores the API key securely:
#.env
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"
The project also needs a test image. This example assumes a conveyor-belt photo containing several objects; save it as image.png in the project root.
It is recommended to create a dedicated Python virtual environment for the project, keeping dependencies isolated and the setup reproducible.
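A typical setup (assuming a Unix-like shell; the activation command differs on Windows) looks like this:
python -m venv .venv
source .venv/bin/activate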
Project dependencies are managed through requirements.txt:
#requirements.txt
google-genai
numpy>=2.2.6
Pillow>=11.2.1
python-dotenv>=1.1.0
pydantic
Note that the project requires Python 3.11 or later. With the virtual environment activated, install the dependencies:
pip install -r requirements.txt
Basic implementation
Implement the basic segmentation flow in main.py:
import os
from io import BytesIO
from PIL import Image
from dotenv import load_dotenv
from google import genai
from google.genai import types
from google.genai.types import HttpOptions
load_dotenv()
GEMINI_TIMEOUT_MS = 60 * 1000
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"),http_options=HttpOptions(timeout=GEMINI_TIMEOUT_MS))
if __name__ == "__main__":
image = "image.png"
query = "Detect all foreign objects in the conveyor belt"
prompt = f"{query}. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
im = Image.open(BytesIO(open(image, "rb").read()))
im.thumbnail([1024,1024], Image.Resampling.LANCZOS)
    # Run the model to generate segmentation masks
response = client.models.generate_content(
model="gemini-2.5-flash-preview-05-20", # "gemini-2.5-pro-preview-05-06"
contents=[prompt, im],
config=types.GenerateContentConfig(
temperature=0.5,
)
)
print(response.text)
Running the script prints JSON similar to the following to the terminal:
[
{
"box_2d": [219, 149, 439, 299],
"mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACA..",
"label": "L-shaped white bracket",
},
{
"box_2d": [238, 574, 642, 638],
"mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACA..",
"label": "perforated white strip (left)",
},
{
"box_2d": [207, 788, 638, 970],
"mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...",
"label": "perforated white strip (right)",
},
]
Structured output validation
Large language models are somewhat unreliable when producing structured output: the format may be malformed, key fields may be missing, or the result may be empty. With Gemini, common failure modes include omitting the data:image/png;base64, prefix, wrapping the JSON in explanatory text, or returning an empty array. To keep the application robust, a solid validation layer is needed.
Pydantic is a natural fit here: by defining a strict schema, the model's output can be validated and type-converted automatically.
Start by defining the base data structure for a segmentation result:
from pydantic import BaseModel, ValidationError, field_validator, ConfigDict, Field
class SegmentationOutput(BaseModel):
    label: str
    box_2d: list[int] = Field(..., min_length=4, max_length=4, description="y0,x0,y1,x1")
    mask: str

    @field_validator("mask", mode="before")
    @classmethod
    def ensure_prefix(cls, png_str: str) -> str:
        prefix = "data:image/png;base64,"
        if not png_str.startswith(prefix):
            raise ValueError(f"mask must start with '{prefix}'")
        return png_str
The field_validator decorator in this schema checks that the mask field carries the correct base64 prefix, so that the image-decoding step later in the pipeline can run without surprises.
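As a small illustration with made-up values, an entry whose mask string lacks the prefix is rejected, while a well-formed entry passes:
good = {"label": "bolt", "box_2d": [10, 20, 30, 40], "mask": "data:image/png;base64,iVBOR..."}
bad = {"label": "bolt", "box_2d": [10, 20, 30, 40], "mask": "iVBOR..."}  # prefix missing

SegmentationOutput.model_validate(good)   # passes
try:
    SegmentationOutput.model_validate(bad)
except ValidationError as e:
    print(e)  # reports that the mask must start with "data:image/png;base64,"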
To handle responses where the JSON is wrapped in a Markdown code fence, and to validate every entry in the returned list, implement the following function:
import json


def validate_json(json_output: str) -> list[SegmentationOutput] | None:
segmentation_list: list[SegmentationOutput] = []
lines = json_output.splitlines()
for i, line in enumerate(lines):
if line.strip() == "```json":
content = "\n".join(lines[i + 1 :])
content = content.split("```")[0]
json_output = content
break
try:
json_list = json.loads(json_output)
except ValueError as e:
raise ValueError(f"JSON output was wrongly formatted: {e}")
if not isinstance(json_list, list):
return None
for element in json_list:
try:
segmentation = SegmentationOutput.model_validate(element)
segmentation_list.append(segmentation)
except ValidationError as e:
print(f"Validation error {e}")
return segmentation_list or None
The function first tries to extract the JSON payload from a Markdown code block, then validates each segmentation entry one by one, ensuring the data is complete and well formed.
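A quick sanity check with a hand-written response string (the values are made up) shows both behaviors at once: the Markdown fence is stripped, and the entry that fails the prefix check is skipped:
raw = '''```json
[
  {"box_2d": [10, 20, 30, 40], "mask": "data:image/png;base64,iVBOR...", "label": "bolt"},
  {"box_2d": [50, 60, 70, 80], "mask": "iVBOR...", "label": "prefix missing"}
]
```'''
items = validate_json(raw)
print(len(items))       # 1 - the second entry failed validation and was skipped
print(items[0].label)   # bolt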
Visualizing the segmentation results
To verify that the masks Gemini produces are accurate, the results need to be rendered on top of the original image. This involves three core steps: coordinate conversion, mask decoding, and overlay rendering.
Gemini returns bounding-box coordinates normalized to a 0-1000 range, so they must be rescaled to the actual image dimensions. Likewise, the base64-encoded mask data has to be decoded into a NumPy array for efficient image processing. Finally, a colored overlay layer is composited onto the image so that each detected object appears in a distinct color, with its bounding box and label drawn on top.
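For a concrete sense of the coordinate conversion (the image size here is a made-up example), a normalized box of [219, 149, 439, 299] on a 1024 x 683 thumbnail maps to pixel coordinates as follows:
img_width, img_height = 1024, 683      # hypothetical thumbnail size
box_2d = [219, 149, 439, 299]          # normalized [y0, x0, y1, x1] in the 0-1000 range

abs_y0 = int(box_2d[0] / 1000 * img_height)   # 149
abs_x0 = int(box_2d[1] / 1000 * img_width)    # 152
abs_y1 = int(box_2d[2] / 1000 * img_height)   # 299
abs_x1 = int(box_2d[3] / 1000 * img_width)    # 306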
To carry the NumPy mask array, extend the base model:
class SegmentationItem(SegmentationOutput):
    np_mask: np.ndarray
    model_config = ConfigDict(arbitrary_types_allowed=True)
Next, implement the core mask-parsing function, which handles coordinate conversion and mask decoding:
def parse_segmentation_masks(
    predicted_str: str, *, img_height: int, img_width: int
) -> list[SegmentationItem]:
validated = validate_json(predicted_str)
print(validated)
if not validated:
return []
    results: list[SegmentationItem] = []
for item in validated:
abs_y0 = int(item.box_2d[0] / 1000 * img_height)
abs_x0 = int(item.box_2d[1] / 1000 * img_width)
abs_y1 = int(item.box_2d[2] / 1000 * img_height)
abs_x1 = int(item.box_2d[3] / 1000 * img_width)
if abs_y0 >= abs_y1 or abs_x0 >= abs_x1:
print("Invalid bounding box", (item.box_2d))
continue
prefix = "data:image/png;base64,"
png_str = item.mask
raw_data = base64.b64decode(png_str.removeprefix(prefix))
pil_mask = Image.open(io.BytesIO(raw_data))
bbox_height = abs_y1 - abs_y0
bbox_width = abs_x1 - abs_x0
if bbox_height < 1 or bbox_width < 1:
print("Invalid bounding box")
continue
pil_mask = pil_mask.resize(
(bbox_width, bbox_height), resample=Image.Resampling.BILINEAR
)
np_mask_full = np.zeros((img_height, img_width), dtype=np.uint8)
np_mask_full[abs_y0:abs_y1, abs_x0:abs_x1] = np.array(pil_mask)
try:
seg_item = SegmentationItem(
label=item.label,
box_2d=[abs_y0, abs_x0, abs_y1, abs_x1],
mask=item.mask,
np_mask=np_mask_full,
)
results.append(seg_item)
except ValidationError as e:
print("Validation error in final item:", e)
continue
return results
This function performs the key data transformations: converting normalized coordinates to pixel coordinates, decoding the base64 mask, resizing the mask to fit the bounding box, and building a full-size mask array.
Next, implement the overlay function, which renders each segmented region as a semi-transparent colored layer:
def overlay_mask_on_img(
img: Image.Image, mask: np.ndarray, color: str, alpha: float = 0.7
) -> Image.Image:
if not (0.0 <= alpha <= 1.0):
raise ValueError("Alpha must be between 0.0 and 1.0")
try:
color_rgb = ImageColor.getrgb(color)
except ValueError as e:
raise ValueError(f"Invalid color name '{color}'. Error: {e}")
img_rgba = img.convert("RGBA")
width, height = img_rgba.size
alpha_int = int(alpha * 255)
overlay_color_rgba = color_rgb + (alpha_int,)
colored_layer = np.zeros((height, width, 4), dtype=np.uint8)
mask_logical = mask > 127
colored_layer[mask_logical] = overlay_color_rgba
colored_mask = Image.fromarray(colored_layer, "RGBA")
return Image.alpha_composite(img_rgba, colored_mask)
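A quick way to eyeball the overlay logic is to run it on a synthetic image and mask (the sizes and color here are arbitrary; PIL and NumPy are assumed to be imported as in the surrounding snippets):
base = Image.new("RGB", (200, 200), "black")
mask = np.zeros((200, 200), dtype=np.uint8)
mask[50:150, 50:150] = 255                       # a filled square in the center
preview = overlay_mask_on_img(base, mask, "red", alpha=0.5)
# preview.show()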
Finally, implement the full visualization function, which combines the mask overlays with bounding-box and label drawing:
additional_colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]
def plot_segmentation_masks(
img: Image.Image, segmentation_items: list[SegmentationItem]
) -> Image.Image:
colors = [
"red", "green", "blue", "yellow", "orange", "pink", "purple",
"brown", "gray", "beige", "turquoise", "cyan", "magenta", "lime",
"navy", "maroon", "teal", "olive", "coral", "lavender", "violet",
"gold", "silver",
] + additional_colors
font = ImageFont.load_default()
    # Overlay the masks using the NumPy arrays rather than the base64 strings
for i, item in enumerate(segmentation_items):
color = colors[i % len(colors)]
img = overlay_mask_on_img(img, item.np_mask, color)
draw = ImageDraw.Draw(img)
    # Draw bounding boxes and labels using box_2d = [y0, x0, y1, x1]
for i, item in enumerate(segmentation_items):
color = colors[i % len(colors)]
y0, x0, y1, x1 = item.box_2d
draw.rectangle(
((x0, y0), (x1, y1)), outline=color, width=4
)
if item.label:
            # Place the label slightly above the top-left corner of the box
draw.text((x0 + 8, y0 - 20), item.label, fill=color, font=font)
return img
The function assigns a different color to each detected object to keep them visually distinct, and draws the corresponding bounding box and text label on the image.
Complete implementation
Putting all of the components above together gives the full segmentation application:
import os
import io
import json
import base64
from io import BytesIO
from PIL import Image, ImageColor, ImageFont, ImageDraw
from dotenv import load_dotenv
from google import genai
from google.genai import types
from google.genai.types import HttpOptions
from pydantic import BaseModel, ValidationError, field_validator, ConfigDict, Field
import numpy as np
load_dotenv()
GEMINI_TIMEOUT_MS = 60 * 1000
client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"),http_options=HttpOptions(timeout=GEMINI_TIMEOUT_MS))
class SegmentationOutput(BaseModel):
    label: str
    box_2d: list[int] = Field(..., min_length=4, max_length=4, description="y0,x0,y1,x1")
    mask: str

    @field_validator("mask", mode="before")
    @classmethod
    def ensure_prefix(cls, png_str: str) -> str:
        prefix = "data:image/png;base64,"
        if not png_str.startswith(prefix):
            raise ValueError(f"mask must start with '{prefix}'")
        return png_str
class SegmentationItem(SegmentationOutput):
    np_mask: np.ndarray
    model_config = ConfigDict(arbitrary_types_allowed=True)
def validate_json(json_output: str) -> list[SegmentationOutput] | None:
segmentation_list: list[SegmentationOutput] = []
lines = json_output.splitlines()
for i, line in enumerate(lines):
if line.strip() == "```json":
content = "\n".join(lines[i + 1 :])
content = content.split("```")[0]
json_output = content
break
try:
json_list = json.loads(json_output)
except ValueError as e:
raise ValueError(f"JSON output was wrongly formatted: {e}")
if not isinstance(json_list, list):
return None
for element in json_list:
try:
segmentation = SegmentationOutput.model_validate(element)
segmentation_list.append(segmentation)
except ValidationError as e:
print(f"Validation error {e}")
return segmentation_list or None
def parse_segmentation_masks(
    predicted_str: str, *, img_height: int, img_width: int
) -> list[SegmentationItem]:
validated = validate_json(predicted_str)
print(validated)
if not validated:
return []
    results: list[SegmentationItem] = []
for item in validated:
abs_y0 = int(item.box_2d[0] / 1000 * img_height)
abs_x0 = int(item.box_2d[1] / 1000 * img_width)
abs_y1 = int(item.box_2d[2] / 1000 * img_height)
abs_x1 = int(item.box_2d[3] / 1000 * img_width)
if abs_y0 >= abs_y1 or abs_x0 >= abs_x1:
print("Invalid bounding box", (item.box_2d))
continue
prefix = "data:image/png;base64,"
png_str = item.mask
raw_data = base64.b64decode(png_str.removeprefix(prefix))
pil_mask = Image.open(io.BytesIO(raw_data))
bbox_height = abs_y1 - abs_y0
bbox_width = abs_x1 - abs_x0
if bbox_height < 1 or bbox_width < 1:
print("Invalid bounding box")
continue
pil_mask = pil_mask.resize(
(bbox_width, bbox_height), resample=Image.Resampling.BILINEAR
)
np_mask_full = np.zeros((img_height, img_width), dtype=np.uint8)
np_mask_full[abs_y0:abs_y1, abs_x0:abs_x1] = np.array(pil_mask)
try:
seg_item = SegmentationItem(
label=item.label,
box_2d=[abs_y0, abs_x0, abs_y1, abs_x1],
mask=item.mask,
np_mask=np_mask_full,
)
results.append(seg_item)
except ValidationError as e:
print("Validation error in final item:", e)
continue
return results
def overlay_mask_on_img(
img: Image.Image, mask: np.ndarray, color: str, alpha: float = 0.7
) -> Image.Image:
if not (0.0 <= alpha <= 1.0):
raise ValueError("Alpha must be between 0.0 and 1.0")
try:
color_rgb = ImageColor.getrgb(color)
except ValueError as e:
raise ValueError(f"Invalid color name '{color}'. Error: {e}")
img_rgba = img.convert("RGBA")
width, height = img_rgba.size
alpha_int = int(alpha * 255)
overlay_color_rgba = color_rgb + (alpha_int,)
colored_layer = np.zeros((height, width, 4), dtype=np.uint8)
mask_logical = mask > 127
colored_layer[mask_logical] = overlay_color_rgba
colored_mask = Image.fromarray(colored_layer, "RGBA")
return Image.alpha_composite(img_rgba, colored_mask)
additional_colors = [colorname for (colorname, colorcode) in ImageColor.colormap.items()]
def plot_segmentation_masks(
img: Image.Image, segmentation_items: list[SegmentationItem]
) -> Image.Image:
colors = [
"red", "green", "blue", "yellow", "orange", "pink", "purple",
"brown", "gray", "beige", "turquoise", "cyan", "magenta", "lime",
"navy", "maroon", "teal", "olive", "coral", "lavender", "violet",
"gold", "silver",
] + additional_colors
font = ImageFont.load_default()
    # Overlay the masks using the NumPy arrays rather than the base64 strings
for i, item in enumerate(segmentation_items):
color = colors[i % len(colors)]
img = overlay_mask_on_img(img, item.np_mask, color)
draw = ImageDraw.Draw(img)
    # Draw bounding boxes and labels using box_2d = [y0, x0, y1, x1]
for i, item in enumerate(segmentation_items):
color = colors[i % len(colors)]
y0, x0, y1, x1 = item.box_2d
draw.rectangle(
((x0, y0), (x1, y1)), outline=color, width=4
)
if item.label:
            # Place the label slightly above the top-left corner of the box
draw.text((x0 + 8, y0 - 20), item.label, fill=color, font=font)
return img
if __name__ == "__main__":
image = "image.png"
query = "Detect all foreign objects in the conveyor belt"
prompt = f"{query}. Output a JSON list of segmentation masks where each entry contains the 2D bounding box in the key \"box_2d\", the segmentation mask in key \"mask\", and the text label in the key \"label\". Use descriptive labels."
im = Image.open(BytesIO(open(image, "rb").read()))
im.thumbnail([1024,1024], Image.Resampling.LANCZOS)
    # Run the model to generate segmentation masks
response = client.models.generate_content(
model="gemini-2.5-pro-preview-05-06", # "gemini-2.5-flash-preview-05-20"
contents=[prompt, im],
config=types.GenerateContentConfig(
temperature=0.5,
)
)
    # Plot the results
segmentation_masks = parse_segmentation_masks(response.text, img_height=im.size[1], img_width=im.size[0])
im = plot_segmentation_masks(im, segmentation_masks)
im.show()
Running the full program displays the annotated image with the detected objects highlighted.
Technical limitations and analysis
In practice, a Gemini-based segmentation pipeline has several limitations worth keeping in mind.
First, the model can be unstable when formatting the mask data. When the output format is constrained too tightly, it sometimes produces degenerate output containing long runs of a repeated character, most visibly base64 mask strings filled with sequences such as "AAAAA...". Such output is not only invalid, it also significantly increases base64 decoding time and drags down overall system performance.
Second, these long runs of repeated characters add extra computational overhead: decoding an extremely long, repetition-filled base64 string consumes more time and memory, which can become a bottleneck in latency-sensitive applications.
Here is a typical example of problematic output:
[
{
"box_2d": [120, 514, 600, 998],
"mask": "data:image/png;base64,iVBORw0AAAAA....",
"label": "spalling with exposed rebar",
},
{
"box_2d": [220, 29, 609, 320],
"mask": "data:image/png;base64,AAAAAAAAAAAAA...",
"label": "spalling with exposed rebar",
},
{
"box_2d": [14, 11, 111, 234],
"mask": "data:image/png;base64,AAAAAAAAAAAAAAA...",
"label": "crack",
},
]
To mitigate these issues, production deployments should include appropriate error handling: base64 integrity checks, detection of abnormally long outputs, and retry strategies where needed, as sketched below.
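A minimal sketch of such a guard (the thresholds are arbitrary assumptions, not tuned values) could look like this; a mask that fails the check can be dropped or the request retried:
import base64


def looks_degenerate(mask_str: str, max_len: int = 200_000, max_run: int = 5_000) -> bool:
    # Heuristic pre-check for obviously broken mask strings before decoding
    prefix = "data:image/png;base64,"
    payload = mask_str.removeprefix(prefix)
    if len(payload) > max_len:                 # abnormally long output
        return True
    run, prev = 0, ""
    for ch in payload:                         # long runs of one repeated character, e.g. "AAAA..."
        run = run + 1 if ch == prev else 1
        if run > max_run:
            return True
        prev = ch
    try:
        base64.b64decode(payload, validate=True)  # reject non-base64 characters
    except Exception:
        return True
    return False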
Summary
Gemini models bring a distinctive capability to object detection and image segmentation: natively generated segmentation masks open a new path for computer-vision applications. The industrial conveyor-belt foreign-object detection case in this article shows that the approach is feasible and genuinely useful in practice.
That said, LLM-based image segmentation is still a young technique and faces challenges around output stability and processing efficiency. Real deployments need to account for these limitations and build in appropriate fault tolerance and optimization.
As the models continue to improve, these limitations should gradually shrink, making the approach a reliable option for a wider range of computer-vision scenarios.
https://avoid.overfit.cn/post/686368d5afc44b4397c299f1ef97319a