cv-mcp

Computer Vision MCP Server, by samhains

Minimal MCP server focused on computer vision: image captioning and structured metadata generation via OpenRouter (Gemini 2.5 family), with optional local backends.

Goals

  • Keep it tiny and composable
  • Single tool: caption an image via URL or local file
  • No DB or app logic

Structure

  • src/cv_mcp/captioning/openrouter_client.py – image analysis client
  • src/cv_mcp/metadata/ – prompts, JSON schema, and pipeline runner
  • src/cv_mcp/mcp_server.py – MCP server exposing tools
  • cli/caption_image.py – optional CLI to test captioning locally

Env vars

  • OPENROUTER_API_KEY

Dotenv

  • Put OPENROUTER_API_KEY in a local .env file (see .env.example).
  • CLI scripts and the MCP server auto-load .env if present (see the sketch after this list).
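
Under the hood, the auto-load amounts to something like the following (a minimal sketch assuming the python-dotenv package; the actual loading code lives inside cv-mcp):

    # Sketch: load .env before reading the key (python-dotenv).
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads .env from the CWD if present; existing env vars take precedence
    api_key = os.environ.get("OPENROUTER_API_KEY")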

Install

  • pip install -e . (or pip install .)

⚠️ Development note: if you have the package installed via pip install, uninstall it with pip uninstall cv-mcp before working on the local development version to avoid import conflicts, then run commands directly from the repo directory.

Run MCP server (stdio)

  • Console script: cv-mcp-server (provides an MCP stdio server)
  • Configure your MCP client to launch cv-mcp-server.

MCP integration (Claude Desktop)

  • Add to Claude Desktop config (see their docs for the config location):

    {
      "mcpServers": {
        "cv-mcp": {
          "command": "cv-mcp-server",
          "env": { "OPENROUTER_API_KEY": "sk-or-..." }
        }
      }
    }
  • After saving, restart Claude Desktop and enable the tool.

Tools

  • caption_image: one-off caption (kept for compatibility)
  • alt_text: short alt text (<= 20 words)
  • dense_caption: detailed 2–6 sentence caption
  • image_metadata: structured JSON metadata with alt + caption (example params after this list). Params:
    • mode: double (default) makes two calls, a vision call (alt + caption) plus a text-only call (metadata); triple uses vision calls for both steps.
    • caption_override: supply your own dense caption; skips the vision caption step.
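
For example, an image_metadata call that supplies its own caption might send params like these (values illustrative):

    {
      "image_url": "https://example.com/image.jpg",
      "mode": "double",
      "caption_override": "A red road bicycle leaning against a brick wall at dusk."
    }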

MCP tool reference

  • Server: cv-mcp (stdio)
  • caption_image(image_url|file_path, prompt?, backend?, local_model_id?) -> string
  • alt_text(image_url|file_path, max_words?) -> string
  • dense_caption(image_url|file_path) -> string
  • image_metadata(image_url|file_path, caption_override?, config_path?) -> { alt_text, caption, metadata }

Examples

  • MCP call (OpenRouter): {"image_url": "https://example.com/image.jpg"}
  • MCP call (local): {"backend": "local", "file_path": "./image.jpg"}

Quick test (CLI)

  • URL: python cli/caption_image.py --image-url https://example.com/img.jpg
  • File: python cli/caption_image.py --file-path ./image.png

Metadata pipeline (CLI)

  • Double (default):
    • python cli/image_metadata.py --image-url https://example.com/img.jpg --mode double
    • Local alt+caption (still requires OpenRouter for metadata):
      • python cli/image_metadata.py --image-url https://example.com/img.jpg --mode double --ac-backend local
  • Triple (vision metadata):
    • python cli/image_metadata.py --image-url https://example.com/img.jpg --mode triple
    • Fully local (no OpenRouter required):
      • python cli/image_metadata.py --image-url https://example.com/img.jpg --mode triple --ac-backend local --meta-vision-backend local
  • With existing caption (skips the caption step):
    • python cli/image_metadata.py --image-url https://example.com/img.jpg --caption-override "<dense caption>" --mode double
  • Custom model config (JSON with caption_model, metadata_text_model, metadata_vision_model; sample below the list):
    • python cli/image_metadata.py --image-url https://example.com/img.jpg --config-path ./my_models.json --mode double
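
A custom model config passed via --config-path might look like this (model IDs illustrative; keys as listed above):

    {
      "caption_model": "google/gemini-2.5-flash",
      "metadata_text_model": "google/gemini-2.5-pro",
      "metadata_vision_model": "google/gemini-2.5-pro"
    }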

Schema & vocab

  • JSON schema (lean): src/cv_mcp/metadata/schema.json (validation sketch below)
  • Controlled vocab (non-binding reference): src/cv_mcp/metadata/vocab.json
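
To sanity-check pipeline output against the schema yourself, something like this works (a sketch assuming the third-party jsonschema package; not part of cv-mcp):

    # Sketch: validate image_metadata output against the packaged schema.
    import json
    from jsonschema import validate  # pip install jsonschema

    with open("src/cv_mcp/metadata/schema.json") as f:
        schema = json.load(f)

    with open("metadata_output.json") as f:  # hypothetical saved tool output
        metadata = json.load(f)

    validate(instance=metadata, schema=schema)  # raises ValidationError on mismatch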

Global config

  • Root file: cv_mcp.config.json (auto-detected from project root / CWD; full example below the list)
  • Env override: set CV_MCP_CONFIG=/path/to/config.json
  • Keys (renamed for clarity):
    • caption_model: vision model for alt+caption (OpenRouter)
    • metadata_text_model: text model for metadata (double mode)
    • metadata_vision_model: vision model for metadata (triple mode)
    • caption_backend: openrouter (default) or local for alt/dense/AC steps
    • metadata_vision_backend: openrouter (default) or local for triple mode
    • local_vlm_id: default local VLM (e.g. Qwen/Qwen2.5-VL-7B-Instruct)
    • Backwards-compat: legacy keys (ac_model, meta_text_model, meta_vision_model, ac_backend, meta_vision_backend, local_model_id) are still accepted.
  • Packaged defaults still live at src/cv_mcp/metadata/config.json and are used if no root config is found.
  • You can still provide a custom config file per-call via --config-path or the config_path tool param.
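
An illustrative cv_mcp.config.json using the keys above (model IDs and the local VLM are examples, not requirements):

    {
      "caption_model": "google/gemini-2.5-flash",
      "metadata_text_model": "google/gemini-2.5-pro",
      "metadata_vision_model": "google/gemini-2.5-pro",
      "caption_backend": "openrouter",
      "metadata_vision_backend": "openrouter",
      "local_vlm_id": "Qwen/Qwen2.5-VL-7B-Instruct"
    }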

Local backends (optional)

  • Install optional deps: pip install .[local]
  • Global default: set "caption_backend": "local" (and optionally "metadata_vision_backend": "local") in cv_mcp.config.json
  • Use with MCP: pass backend: "local" in the tool params (overrides global)
  • Use with CLI: add --backend local and optionally --local-model-id Qwen/Qwen2-VL-2B-Instruct (overrides global)
  • Requires a locally available model (default: Qwen/Qwen2-VL-2B-Instruct via HF cache)
  • Or run without transformers using Ollama (no Python ML deps):
    • Install and run Ollama; pull a vision model (e.g., ollama pull qwen2.5-vl)
    • Use backend ollama and set models in the config (e.g., caption_model: "qwen2.5-vl")
    • CLI example (triple, fully local):
      • python cli/image_metadata.py --image-url https://... --mode triple --caption-backend ollama --metadata-vision-backend ollama --config-path ./configs/triple_ollama_qwen.json (a sketch of such a config follows this list)
    • Configure host with --ollama-host http://localhost:11434 if not default
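
The config referenced in the Ollama example might look roughly like this (a sketch; the actual ./configs/triple_ollama_qwen.json in the repo may differ):

    {
      "caption_model": "qwen2.5-vl",
      "metadata_vision_model": "qwen2.5-vl",
      "caption_backend": "ollama",
      "metadata_vision_backend": "ollama"
    }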

Per-call overrides (CLI)

  • The metadata CLI supports per-call backend overrides without editing the global config (combined example after the list):
    • --caption-backend local|openrouter|ollama (legacy: --ac-backend)
    • --metadata-vision-backend local|openrouter|ollama (legacy: --meta-vision-backend)
    • --local-vlm-id Qwen/Qwen2.5-VL-7B-Instruct (legacy: --local-model-id)
    • --ollama-host http://localhost:11434
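
Putting the overrides together, a fully local triple run without touching the global config might look like this (flags from the list above):

    python cli/image_metadata.py --image-url https://example.com/img.jpg --mode triple \
      --caption-backend local --metadata-vision-backend local \
      --local-vlm-id Qwen/Qwen2.5-VL-7B-Instruct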

Justfile tasks

  • A Justfile provides quick test scenarios (recipe sketch after this list). Use URL-only inputs, e.g. just double_flash https://example.com/img.jpg.
  • Scenarios included:
    • double_flash: Gemini 2.5 Flash for both steps
    • double_pro: Gemini 2.5 Pro for both steps
    • double_mixed_pro_text: Flash for vision alt+caption, Pro for text metadata (recommended mix for JSON reliability)
    • triple_flash / triple_pro: Flash/Pro for both vision steps
    • double_qwen_local <url> <qwen_id>: Local Qwen 2.5 VL for vision step, Pro for text metadata
    • triple_qwen_local <url> <qwen_id>: Fully local Qwen 2.5 VL for both vision steps
    • Convenience (no extra args):
      • double_qwen2b_local <url> / triple_qwen2b_local <url>
      • double_qwen7b_local <url> / triple_qwen7b_local <url>
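
For orientation, a recipe in such a Justfile might look roughly like this (a sketch; the repo's actual recipes and the config path are assumptions):

    # Sketch: run the double pipeline with a Flash-only config (config path hypothetical).
    double_flash url:
        python cli/image_metadata.py --image-url {{url}} --mode double --config-path ./configs/double_flash.json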

Recommendation for mixed double

  • Put Gemini 2.5 Pro on the text metadata step and Flash on the vision alt+caption step. The metadata step benefits from better structured-JSON compliance and reasoning, while Flash keeps latency/cost down for the vision caption.
  • OpenRouter key requirements:
    • Double mode always requires OPENROUTER_API_KEY (text LLM for metadata).
    • Triple mode requires OPENROUTER_API_KEY unless both --ac-backend local and --meta-vision-backend local are set.

Examples (local backend)

  • MCP tool (local): {"backend": "local", "file_path": "./image.jpg"}
  • CLI (local): python cli/caption_image.py --file-path ./image.jpg --backend local

Troubleshooting

  • 401/403 from OpenRouter: ensure OPENROUTER_API_KEY is set and valid (quick key check below).
  • Model selection: prefer cv_mcp.config.json at project root; or pass --config-path.
  • Large images: remote images are downloaded and sent as base64; ensure the URL is accessible.
  • Local backend: install optional deps pip install .[local] and ensure model is present/cached.
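
One quick way to verify the key outside the server (assuming OpenRouter's key-info endpoint, which returns key metadata on success):

    curl -s https://openrouter.ai/api/v1/auth/key \
      -H "Authorization: Bearer $OPENROUTER_API_KEY"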

Changelog

  • See docs/CHANGELOG.md for notable changes and release notes.