
PDF Knowledgebase MCP Server

by juanqui


A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides semantic search capabilities powered by OpenAI embeddings and ChromaDB vector storage.
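The retrieval idea behind that pipeline can be sketched in a few lines: embed the query, then rank stored chunk vectors by cosine similarity. This toy version is illustrative only; the server itself delegates embedding to OpenAI and ranking to ChromaDB:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=5):
    """Indices of the k stored chunks most similar to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```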


🚀 Quick Start

Step 1: Install the Server

```bash
uvx pdfkb-mcp
```

Step 2: Configure Your MCP Client

Claude Desktop (Most Common):

Configuration file locations:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```

VS Code (Native MCP) - Create .vscode/mcp.json in workspace:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```

Step 3: Verify Installation

  1. Restart your MCP client completely
  2. Check for PDF KB tools: Look for add_document, search_documents, list_documents, remove_document
  3. Test functionality: Try adding a PDF and searching for content

🏗️ Architecture Overview

MCP Integration

```
┌─────────────────┐   ┌──────────────────┐   ┌─────────────────┐
│   MCP Client    │   │    MCP Client    │   │   MCP Client    │
│ (Claude Desktop)│   │(VS Code/Continue)│   │     (Other)     │
└────────┬────────┘   └─────────┬────────┘   └────────┬────────┘
         │                      │                     │
         └──────────────────────┼─────────────────────┘
                                │
                   ┌────────────┴────────────┐
                   │      Model Context      │
                   │     Protocol (MCP)      │
                   │      Standard Layer     │
                   └────────────┬────────────┘
                                │
         ┌──────────────────────┼─────────────────────┐
         │                      │                     │
┌────────┴────────┐   ┌─────────┴────────┐   ┌────────┴────────┐
│  PDF KB Server  │   │    Other MCP     │   │    Other MCP    │
│  (This Server)  │   │      Server      │   │     Server      │
└─────────────────┘   └──────────────────┘   └─────────────────┘
```

Available Tools & Resources

Tools (Actions your client can perform):

  • add_document - Add a PDF document to the knowledgebase
  • search_documents - Semantically search across indexed documents
  • list_documents - List all indexed documents
  • remove_document - Remove a document from the knowledgebase

Resources (Data your client can access):

  • pdf://{document_id} - Full document content as JSON
  • pdf://{document_id}/page/{page_number} - Specific page content
  • pdf://list - List of all documents with metadata
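A resource consumer has to dispatch on those three URI shapes. A minimal routing sketch follows; the function name and return shape are illustrative, not the server's API:

```python
import re

def route_resource(uri: str):
    """Classify a pdfkb resource URI and extract its parameters."""
    if uri == "pdf://list":
        return ("list", {})
    m = re.fullmatch(r"pdf://([^/]+)/page/(\d+)", uri)
    if m:  # specific page of a document
        return ("page", {"document_id": m.group(1), "page_number": int(m.group(2))})
    m = re.fullmatch(r"pdf://([^/]+)", uri)
    if m:  # full document content
        return ("document", {"document_id": m.group(1)})
    raise ValueError(f"unrecognized resource URI: {uri}")
```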

🎯 Parser Selection Guide

Decision Tree

```
Document Type & Priority?
├── 🏃 Speed Priority   → PyMuPDF4LLM (fastest processing, low memory)
├── 📚 Academic Papers  → MinerU (fast with GPU, excellent formulas)
├── 📊 Business Reports → Docling (medium speed, best tables)
├── ⚖️ Balanced Quality → Marker (medium speed, good structure)
└── 🎯 Maximum Accuracy → LLM (slow, vision-based API calls)
```

Performance Comparison

| Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
|--------|------------------|--------|--------------|---------------|----------|
| PyMuPDF4LLM | Fastest | Low | Good | Basic | Speed priority |
| MinerU | Fast (with GPU) | High | Excellent | Excellent | Scientific papers |
| Docling | Medium | Medium | Excellent | Excellent | Business documents |
| Marker | Medium | Medium | Excellent | Good | Balanced (default) |
| LLM | Slow | Low | Excellent | Excellent | Maximum accuracy |

Benchmarks are drawn from research studies and technical reports.

⚙️ Configuration

Tier 1: Basic Configurations (80% of users)

Default (Recommended):

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "marker"
      },
      "transport": "stdio"
    }
  }
}
```

Speed Optimized:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "CHUNK_SIZE": "800"
      },
      "transport": "stdio"
    }
  }
}
```

Memory Efficient:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "EMBEDDING_BATCH_SIZE": "50"
      },
      "transport": "stdio"
    }
  }
}
```

Tier 2: Use Case Specific (15% of users)

Academic Papers:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "mineru",
        "CHUNK_SIZE": "1200"
      },
      "transport": "stdio"
    }
  }
}
```

Business Documents:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE",
        "DOCLING_DO_TABLE_STRUCTURE": "true"
      },
      "transport": "stdio"
    }
  }
}
```

Multi-language Documents:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_OCR_LANGUAGES": "en,fr,de,es",
        "DOCLING_DO_OCR": "true"
      },
      "transport": "stdio"
    }
  }
}
```

Maximum Quality:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "anthropic/claude-3.5-sonnet",
        "EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}
```

Essential Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| KNOWLEDGEBASE_PATH | ./pdfs | Directory containing PDF files |
| CACHE_DIR | ./.cache | Cache directory for processing |
| PDF_PARSER | marker | Parser: marker, pymupdf4llm, mineru, docling, llm |
| CHUNK_SIZE | 1000 | Target chunk size for LangChain chunker |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI embedding model |
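To see how these variables combine, here is a sketch of an environment-driven settings loader using the defaults from the table above. It is illustrative only; the server's actual configuration logic lives in src/pdfkb/config.py and may differ:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    openai_api_key: str
    knowledgebase_path: str = "./pdfs"
    cache_dir: str = "./.cache"
    pdf_parser: str = "marker"
    chunk_size: int = 1000
    embedding_model: str = "text-embedding-3-large"

def load_settings(env=os.environ) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=key,
        knowledgebase_path=env.get("KNOWLEDGEBASE_PATH", "./pdfs"),
        cache_dir=env.get("CACHE_DIR", "./.cache"),
        pdf_parser=env.get("PDF_PARSER", "marker"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-large"),
    )
```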

🖥️ MCP Client Setup

Claude Desktop

Configuration File Location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

Configuration:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "CACHE_DIR": "/Users/yourname/Documents/PDFs/.cache"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```

Verification:

  1. Restart Claude Desktop completely
  2. Look for PDF KB tools in the interface
  3. Test with "Add a document" or "Search documents"

VS Code with Native MCP Support

Configuration (.vscode/mcp.json in workspace):

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```

Verification:

  1. Reload VS Code window
  2. Check VS Code's MCP server status in Command Palette
  3. Use MCP tools in Copilot Chat

VS Code with Continue Extension

Configuration (.continue/config.json):

```json
{
  "models": [...],
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```

Verification:

  1. Reload VS Code window
  2. Check Continue panel for server connection
  3. Use @pdfkb in Continue chat

Generic MCP Client

Standard Configuration Template:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "required",
        "KNOWLEDGEBASE_PATH": "required-absolute-path",
        "PDF_PARSER": "optional-default-marker"
      },
      "transport": "stdio",
      "autoRestart": true,
      "timeout": 30000
    }
  }
}
```

📊 Performance & Troubleshooting

Common Issues

Server not appearing in MCP client:

```json
// ❌ Wrong: missing transport
{
  "mcpServers": {
    "pdfkb": { "command": "uvx", "args": ["pdfkb-mcp"] }
  }
}

// ✅ Correct: include transport and restart the client
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "transport": "stdio"
    }
  }
}
```

Processing too slow:

```json
// Switch to a faster parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "pymupdf4llm"
      },
      "transport": "stdio"
    }
  }
}
```

Memory issues:

```json
// Reduce memory usage
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "EMBEDDING_BATCH_SIZE": "25",
        "CHUNK_SIZE": "500"
      },
      "transport": "stdio"
    }
  }
}
```

Poor table extraction:

```json
// Use a table-optimized parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE"
      },
      "transport": "stdio"
    }
  }
}
```

Resource Requirements

| Configuration | RAM Usage | Processing Speed | Best For |
|---------------|-----------|------------------|----------|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |

🔧 Advanced Configuration

Parser-Specific Options

MinerU Configuration:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "MINERU_LANG": "en",
        "MINERU_METHOD": "auto",
        "MINERU_VRAM": "16"
      },
      "transport": "stdio"
    }
  }
}
```

LLM Parser Configuration:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "google/gemini-2.5-flash-lite",
        "LLM_CONCURRENCY": "5",
        "LLM_DPI": "150"
      },
      "transport": "stdio"
    }
  }
}
```

Performance Tuning

High-Performance Setup:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "KNOWLEDGEBASE_PATH": "/Volumes/FastSSD/Documents/PDFs",
        "CACHE_DIR": "/Volumes/FastSSD/Documents/PDFs/.cache",
        "EMBEDDING_BATCH_SIZE": "200",
        "VECTOR_SEARCH_K": "15",
        "FILE_SCAN_INTERVAL": "30"
      },
      "transport": "stdio"
    }
  }
}
```

Intelligent Caching

The server uses multi-stage caching: parsing, chunking, and embedding results are cached separately, so a configuration change only invalidates the stages it affects.

Cache Invalidation Rules:

  • Changing PDF_PARSER → Full reset (parsing + chunking + embeddings)
  • Changing PDF_CHUNKER → Partial reset (chunking + embeddings)
  • Changing EMBEDDING_MODEL → Minimal reset (embeddings only)
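Those rules amount to a mapping from a changed setting to the pipeline stages that must be rebuilt. A small sketch of that logic follows; the stage names are illustrative, not the server's internals:

```python
# Which cache stages each setting change invalidates, per the rules above.
RESET_SCOPE = {
    "PDF_PARSER": ["parsing", "chunking", "embeddings"],   # full reset
    "PDF_CHUNKER": ["chunking", "embeddings"],             # partial reset
    "EMBEDDING_MODEL": ["embeddings"],                     # minimal reset
}

PIPELINE_ORDER = ("parsing", "chunking", "embeddings")

def stages_to_rebuild(changed: set) -> list:
    """Union of stages invalidated by the changed settings, in pipeline order."""
    dirty = set()
    for setting in changed:
        dirty.update(RESET_SCOPE.get(setting, []))
    return [stage for stage in PIPELINE_ORDER if stage in dirty]
```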

📚 Appendix

Installation Options

Primary (Recommended):

```bash
uvx pdfkb-mcp
```

With Specific Parser Dependencies:

```bash
uvx "pdfkb-mcp[marker]"     # Marker parser
uvx "pdfkb-mcp[mineru]"     # MinerU parser
uvx "pdfkb-mcp[docling]"    # Docling parser
uvx "pdfkb-mcp[llm]"        # LLM parser
uvx "pdfkb-mcp[langchain]"  # LangChain chunker
```

Quoting the extras prevents shells such as zsh from treating the brackets as a glob pattern.

Development Installation:

```bash
git clone https://github.com/juanqui/pdfkb-mcp.git
cd pdfkb-mcp
pip install -e ".[dev]"
```

Complete Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| OPENROUTER_API_KEY | (optional) | Required for LLM parser |
| KNOWLEDGEBASE_PATH | ./pdfs | PDF directory path |
| CACHE_DIR | ./.cache | Cache directory |
| PDF_PARSER | marker | PDF parser selection |
| PDF_CHUNKER | unstructured | Chunking strategy |
| CHUNK_SIZE | 1000 | LangChain chunk size |
| CHUNK_OVERLAP | 200 | LangChain chunk overlap |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI model |
| EMBEDDING_BATCH_SIZE | 100 | Embedding batch size |
| VECTOR_SEARCH_K | 5 | Default search results |
| FILE_SCAN_INTERVAL | 60 | File monitoring interval |
| LOG_LEVEL | INFO | Logging level |

Parser Comparison Details

| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---------|-------------|--------|--------|---------|-----|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |

Chunking Strategies

LangChain (PDF_CHUNKER=langchain):

  • Size-based splitting with configurable CHUNK_SIZE and CHUNK_OVERLAP
  • Best for predictable, uniform chunk sizes

Unstructured (PDF_CHUNKER=unstructured):

  • Intelligent semantic chunking with unstructured library
  • Zero configuration required
  • Best for document structure awareness
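The size/overlap interaction that CHUNK_SIZE and CHUNK_OVERLAP control can be shown with a minimal fixed-size chunker. This is a simplification: the real chunking strategies are structure-aware rather than splitting at raw character offsets:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into overlapping windows of chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so search hits near a
    chunk boundary still carry surrounding context.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```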

Troubleshooting Guide

API Key Issues:

  1. Verify key format starts with sk-
  2. Check account has sufficient credits
  3. Test connectivity: curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

Parser Installation Issues:

  1. MinerU: pip install mineru[all] and verify mineru --version
  2. Docling: pip install docling for basic, pip install pdfkb-mcp[docling-complete] for all features
  3. LLM: Requires OPENROUTER_API_KEY environment variable

Performance Optimization:

  1. Speed: Use pymupdf4llm parser
  2. Memory: Reduce EMBEDDING_BATCH_SIZE and CHUNK_SIZE
  3. Quality: Use mineru (GPU) or docling (CPU)
  4. Tables: Use docling with DOCLING_TABLE_MODE=ACCURATE

For additional support, see implementation details in src/pdfkb/main.py and src/pdfkb/config.py.
