
PDF Knowledgebase MCP Server

by juanqui


A Model Context Protocol (MCP) server that enables intelligent document search and retrieval from PDF collections. Built for seamless integration with Claude Desktop, Continue, Cline, and other MCP clients, this server provides semantic search capabilities powered by OpenAI embeddings and ChromaDB vector storage.
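The retrieval idea behind that pipeline can be sketched in a few lines: embed the query, then rank stored chunk vectors by cosine similarity. This toy version is illustrative only; the server itself delegates embedding to OpenAI and ranking to ChromaDB:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, chunk_vecs, k=5):
    """Indices of the k stored chunks most similar to the query vector."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```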


🚀 Quick Start

Step 1: Install the Server

```bash
uvx pdfkb-mcp
```

Step 2: Configure Your MCP Client

Claude Desktop (Most Common):

Configuration file locations:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```

VS Code (Native MCP) - Create .vscode/mcp.json in workspace:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```

Step 3: Verify Installation

  1. Restart your MCP client completely
  2. Check for PDF KB tools: Look for add_document, search_documents, list_documents, remove_document
  3. Test functionality: Try adding a PDF and searching for content

🏗️ Architecture Overview

MCP Integration

```
┌─────────────────┐   ┌──────────────────┐   ┌─────────────────┐
│   MCP Client    │   │    MCP Client    │   │   MCP Client    │
│ (Claude Desktop)│   │(VS Code/Continue)│   │     (Other)     │
└────────┬────────┘   └─────────┬────────┘   └────────┬────────┘
         │                      │                     │
         └──────────────────────┼─────────────────────┘
                                │
                   ┌────────────┴────────────┐
                   │      Model Context      │
                   │     Protocol (MCP)      │
                   │      Standard Layer     │
                   └────────────┬────────────┘
                                │
         ┌──────────────────────┼─────────────────────┐
         │                      │                     │
┌────────┴────────┐   ┌─────────┴────────┐   ┌────────┴────────┐
│  PDF KB Server  │   │    Other MCP     │   │    Other MCP    │
│  (This Server)  │   │      Server      │   │     Server      │
└─────────────────┘   └──────────────────┘   └─────────────────┘
```

Available Tools & Resources

Tools (Actions your client can perform):

  • add_document - Add a PDF document to the knowledgebase
  • search_documents - Semantically search across indexed documents
  • list_documents - List all indexed documents
  • remove_document - Remove a document from the knowledgebase

Resources (Data your client can access):

  • pdf://{document_id} - Full document content as JSON
  • pdf://{document_id}/page/{page_number} - Specific page content
  • pdf://list - List of all documents with metadata
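A resource consumer has to dispatch on those three URI shapes. A minimal routing sketch follows; the function name and return shape are illustrative, not the server's API:

```python
import re

def route_resource(uri: str):
    """Classify a pdfkb resource URI and extract its parameters."""
    if uri == "pdf://list":
        return ("list", {})
    m = re.fullmatch(r"pdf://([^/]+)/page/(\d+)", uri)
    if m:  # specific page of a document
        return ("page", {"document_id": m.group(1), "page_number": int(m.group(2))})
    m = re.fullmatch(r"pdf://([^/]+)", uri)
    if m:  # full document content
        return ("document", {"document_id": m.group(1)})
    raise ValueError(f"unrecognized resource URI: {uri}")
```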

🎯 Parser Selection Guide

Decision Tree

```
Document Type & Priority?
├── 🏃 Speed Priority   → PyMuPDF4LLM (fastest processing, low memory)
├── 📚 Academic Papers  → MinerU (fast with GPU, excellent formulas)
├── 📊 Business Reports → Docling (medium speed, best tables)
├── ⚖️ Balanced Quality → Marker (medium speed, good structure)
└── 🎯 Maximum Accuracy → LLM (slow, vision-based API calls)
```

Performance Comparison

| Parser | Processing Speed | Memory | Text Quality | Table Quality | Best For |
|--------|------------------|--------|--------------|---------------|----------|
| PyMuPDF4LLM | Fastest | Low | Good | Basic | Speed priority |
| MinerU | Fast (with GPU) | High | Excellent | Excellent | Scientific papers |
| Docling | Medium | Medium | Excellent | Excellent | Business documents |
| Marker | Medium | Medium | Excellent | Good | Balanced (default) |
| LLM | Slow | Low | Excellent | Excellent | Maximum accuracy |

Benchmarks are drawn from research studies and technical reports.

⚙️ Configuration

Tier 1: Basic Configurations (80% of users)

Default (Recommended):

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "marker"
      },
      "transport": "stdio"
    }
  }
}
```

Speed Optimized:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "CHUNK_SIZE": "800"
      },
      "transport": "stdio"
    }
  }
}
```

Memory Efficient:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "pymupdf4llm",
        "EMBEDDING_BATCH_SIZE": "50"
      },
      "transport": "stdio"
    }
  }
}
```

Tier 2: Use Case Specific (15% of users)

Academic Papers:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "mineru",
        "CHUNK_SIZE": "1200"
      },
      "transport": "stdio"
    }
  }
}
```

Business Documents:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE",
        "DOCLING_DO_TABLE_STRUCTURE": "true"
      },
      "transport": "stdio"
    }
  }
}
```

Multi-language Documents:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "PDF_PARSER": "docling",
        "DOCLING_OCR_LANGUAGES": "en,fr,de,es",
        "DOCLING_DO_OCR": "true"
      },
      "transport": "stdio"
    }
  }
}
```

Maximum Quality:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "anthropic/claude-3.5-sonnet",
        "EMBEDDING_MODEL": "text-embedding-3-large"
      },
      "transport": "stdio"
    }
  }
}
```

Essential Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| KNOWLEDGEBASE_PATH | ./pdfs | Directory containing PDF files |
| CACHE_DIR | ./.cache | Cache directory for processing |
| PDF_PARSER | marker | Parser: marker, pymupdf4llm, mineru, docling, llm |
| CHUNK_SIZE | 1000 | Target chunk size for LangChain chunker |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI embedding model |
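To see how these variables combine, here is a sketch of an environment-driven settings loader using the defaults from the table above. It is illustrative only; the server's actual configuration logic lives in src/pdfkb/config.py and may differ:

```python
import os
from dataclasses import dataclass

@dataclass
class Settings:
    openai_api_key: str
    knowledgebase_path: str = "./pdfs"
    cache_dir: str = "./.cache"
    pdf_parser: str = "marker"
    chunk_size: int = 1000
    embedding_model: str = "text-embedding-3-large"

def load_settings(env=os.environ) -> Settings:
    """Build Settings from environment variables, applying the documented defaults."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=key,
        knowledgebase_path=env.get("KNOWLEDGEBASE_PATH", "./pdfs"),
        cache_dir=env.get("CACHE_DIR", "./.cache"),
        pdf_parser=env.get("PDF_PARSER", "marker"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-large"),
    )
```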

🖥️ MCP Client Setup

Claude Desktop

Configuration File Location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

Configuration:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "/Users/yourname/Documents/PDFs",
        "CACHE_DIR": "/Users/yourname/Documents/PDFs/.cache"
      },
      "transport": "stdio",
      "autoRestart": true
    }
  }
}
```

Verification:

  1. Restart Claude Desktop completely
  2. Look for PDF KB tools in the interface
  3. Test with "Add a document" or "Search documents"

VS Code with Native MCP Support

Configuration (.vscode/mcp.json in workspace):

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```

Verification:

  1. Reload VS Code window
  2. Check VS Code's MCP server status in Command Palette
  3. Use MCP tools in Copilot Chat

VS Code with Continue Extension

Configuration (.continue/config.json):

```json
{
  "models": [...],
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-proj-abc123def456ghi789...",
        "KNOWLEDGEBASE_PATH": "${workspaceFolder}/pdfs"
      },
      "transport": "stdio"
    }
  }
}
```

Verification:

  1. Reload VS Code window
  2. Check Continue panel for server connection
  3. Use @pdfkb in Continue chat

Generic MCP Client

Standard Configuration Template:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "required",
        "KNOWLEDGEBASE_PATH": "required-absolute-path",
        "PDF_PARSER": "optional-default-marker"
      },
      "transport": "stdio",
      "autoRestart": true,
      "timeout": 30000
    }
  }
}
```

📊 Performance & Troubleshooting

Common Issues

Server not appearing in MCP client:

```json
// ❌ Wrong: missing transport
{
  "mcpServers": {
    "pdfkb": { "command": "uvx", "args": ["pdfkb-mcp"] }
  }
}

// ✅ Correct: include transport and restart the client
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "transport": "stdio"
    }
  }
}
```

Processing too slow:

```json
// Switch to a faster parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "pymupdf4llm"
      },
      "transport": "stdio"
    }
  }
}
```

Memory issues:

```json
// Reduce memory usage
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "EMBEDDING_BATCH_SIZE": "25",
        "CHUNK_SIZE": "500"
      },
      "transport": "stdio"
    }
  }
}
```

Poor table extraction:

```json
// Use a table-optimized parser
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "docling",
        "DOCLING_TABLE_MODE": "ACCURATE"
      },
      "transport": "stdio"
    }
  }
}
```

Resource Requirements

| Configuration | RAM Usage | Processing Speed | Best For |
|---------------|-----------|------------------|----------|
| Speed | 2-4 GB | Fastest | Large collections |
| Balanced | 4-6 GB | Medium | Most users |
| Quality | 6-12 GB | Medium-Fast | Accuracy priority |
| GPU | 8-16 GB | Very Fast | High-volume processing |

🔧 Advanced Configuration

Parser-Specific Options

MinerU Configuration:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "MINERU_LANG": "en",
        "MINERU_METHOD": "auto",
        "MINERU_VRAM": "16"
      },
      "transport": "stdio"
    }
  }
}
```

LLM Parser Configuration:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "OPENROUTER_API_KEY": "sk-or-v1-abc123def456ghi789...",
        "PDF_PARSER": "llm",
        "LLM_MODEL": "google/gemini-2.5-flash-lite",
        "LLM_CONCURRENCY": "5",
        "LLM_DPI": "150"
      },
      "transport": "stdio"
    }
  }
}
```

Performance Tuning

High-Performance Setup:

```json
{
  "mcpServers": {
    "pdfkb": {
      "command": "uvx",
      "args": ["pdfkb-mcp"],
      "env": {
        "OPENAI_API_KEY": "sk-key",
        "PDF_PARSER": "mineru",
        "KNOWLEDGEBASE_PATH": "/Volumes/FastSSD/Documents/PDFs",
        "CACHE_DIR": "/Volumes/FastSSD/Documents/PDFs/.cache",
        "EMBEDDING_BATCH_SIZE": "200",
        "VECTOR_SEARCH_K": "15",
        "FILE_SCAN_INTERVAL": "30"
      },
      "transport": "stdio"
    }
  }
}
```

Intelligent Caching

The server uses multi-stage caching: parsing, chunking, and embedding results are cached separately, so a configuration change only invalidates the stages it affects.

Cache Invalidation Rules:

  • Changing PDF_PARSER → Full reset (parsing + chunking + embeddings)
  • Changing PDF_CHUNKER → Partial reset (chunking + embeddings)
  • Changing EMBEDDING_MODEL → Minimal reset (embeddings only)
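Those rules amount to a mapping from a changed setting to the pipeline stages that must be rebuilt. A small sketch of that logic follows; the stage names are illustrative, not the server's internals:

```python
# Which cache stages each setting change invalidates, per the rules above.
RESET_SCOPE = {
    "PDF_PARSER": ["parsing", "chunking", "embeddings"],   # full reset
    "PDF_CHUNKER": ["chunking", "embeddings"],             # partial reset
    "EMBEDDING_MODEL": ["embeddings"],                     # minimal reset
}

PIPELINE_ORDER = ("parsing", "chunking", "embeddings")

def stages_to_rebuild(changed: set) -> list:
    """Union of stages invalidated by the changed settings, in pipeline order."""
    dirty = set()
    for setting in changed:
        dirty.update(RESET_SCOPE.get(setting, []))
    return [stage for stage in PIPELINE_ORDER if stage in dirty]
```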

📚 Appendix

Installation Options

Primary (Recommended):

```bash
uvx pdfkb-mcp
```

With Specific Parser Dependencies:

```bash
uvx "pdfkb-mcp[marker]"     # Marker parser
uvx "pdfkb-mcp[mineru]"     # MinerU parser
uvx "pdfkb-mcp[docling]"    # Docling parser
uvx "pdfkb-mcp[llm]"        # LLM parser
uvx "pdfkb-mcp[langchain]"  # LangChain chunker
```

Quoting the extras prevents shells such as zsh from treating the brackets as a glob pattern.

Development Installation:

```bash
git clone https://github.com/juanqui/pdfkb-mcp.git
cd pdfkb-mcp
pip install -e ".[dev]"
```

Complete Environment Variables Reference

| Variable | Default | Description |
|----------|---------|-------------|
| OPENAI_API_KEY | (required) | OpenAI API key for embeddings |
| OPENROUTER_API_KEY | (optional) | Required for LLM parser |
| KNOWLEDGEBASE_PATH | ./pdfs | PDF directory path |
| CACHE_DIR | ./.cache | Cache directory |
| PDF_PARSER | marker | PDF parser selection |
| PDF_CHUNKER | unstructured | Chunking strategy |
| CHUNK_SIZE | 1000 | LangChain chunk size |
| CHUNK_OVERLAP | 200 | LangChain chunk overlap |
| EMBEDDING_MODEL | text-embedding-3-large | OpenAI model |
| EMBEDDING_BATCH_SIZE | 100 | Embedding batch size |
| VECTOR_SEARCH_K | 5 | Default search results |
| FILE_SCAN_INTERVAL | 60 | File monitoring interval |
| LOG_LEVEL | INFO | Logging level |

Parser Comparison Details

| Feature | PyMuPDF4LLM | Marker | MinerU | Docling | LLM |
|---------|-------------|--------|--------|---------|-----|
| Speed | Fastest | Medium | Fast (GPU) | Medium | Slowest |
| Memory | Lowest | Medium | High | Medium | Lowest |
| Tables | Basic | Good | Excellent | Excellent | Excellent |
| Formulas | Basic | Good | Excellent | Good | Excellent |
| Images | Basic | Good | Good | Excellent | Excellent |
| Setup | Simple | Simple | Moderate | Simple | Simple |
| Cost | Free | Free | Free | Free | API costs |

Chunking Strategies

LangChain (PDF_CHUNKER=langchain):

  • Size-based splitting with configurable CHUNK_SIZE and CHUNK_OVERLAP
  • Best for predictable, uniform chunk sizes

Unstructured (PDF_CHUNKER=unstructured):

  • Intelligent semantic chunking with unstructured library
  • Zero configuration required
  • Best for document structure awareness
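The size/overlap interaction that CHUNK_SIZE and CHUNK_OVERLAP control can be shown with a minimal fixed-size chunker. This is a simplification: the real chunking strategies are structure-aware rather than splitting at raw character offsets:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into overlapping windows of chunk_size characters.

    Consecutive chunks share chunk_overlap characters, so search hits near a
    chunk boundary still carry surrounding context.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]
```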

Troubleshooting Guide

API Key Issues:

  1. Verify key format starts with sk-
  2. Check account has sufficient credits
  3. Test connectivity: curl -H "Authorization: Bearer $OPENAI_API_KEY" https://api.openai.com/v1/models

Parser Installation Issues:

  1. MinerU: pip install mineru[all] and verify mineru --version
  2. Docling: pip install docling for basic, pip install pdfkb-mcp[docling-complete] for all features
  3. LLM: Requires OPENROUTER_API_KEY environment variable

Performance Optimization:

  1. Speed: Use pymupdf4llm parser
  2. Memory: Reduce EMBEDDING_BATCH_SIZE and CHUNK_SIZE
  3. Quality: Use mineru (GPU) or docling (CPU)
  4. Tables: Use docling with DOCLING_TABLE_MODE=ACCURATE

For additional support, see implementation details in src/pdfkb/main.py and src/pdfkb/config.py.
