content-core

MIT License

# MCP Server Documentation

Content Core includes a Model Context Protocol (MCP) server that provides powerful content extraction capabilities to Claude Desktop and other MCP-compatible applications. The server exposes a single, easy-to-use tool that can extract content from URLs and files using Content Core's advanced extraction engines.

## What is MCP?

The [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) is an open standard that enables AI applications to securely connect to external data sources and tools. Content Core's MCP server allows Claude Desktop to directly extract content from various sources, making it easy to process web pages, documents, videos, and other media within your conversations.

## Features

- **Single tool interface**: `extract_content` function accepts either URLs or file paths
- **Auto engine selection**: Uses Content Core's intelligent 'auto' engine for optimal extraction
- **Rich metadata**: Returns detailed information about extraction process and content
- **Structured JSON responses**: Consistent format with success/error handling
- **Wide format support**: Handles web pages, PDFs, Word docs, videos, audio files, and more
- **Zero-install option**: Run MCP server and CLI tools with `uvx` without local installation

## Installation

### Option 1: Install with pip (Recommended for local development)

```bash
# Install Content Core (MCP server included by default)
pip install content-core

# The content-core-mcp command becomes available
content-core-mcp
```

### Option 2: Use with uvx (Recommended for production)

```bash
# Run MCP server directly without installation
uvx --from "content-core" content-core-mcp

# Also works for CLI tools
uvx --from "content-core" ccore https://example.com
uvx --from "content-core" cclean "messy text"
uvx --from "content-core" csum "long content" --context "bullet points"
```

## Claude Desktop Setup

### Configuration

Add Content Core to your Claude Desktop configuration file:

**Location of config
file:**

- **macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
- **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`

### Production Configuration (using uvx)

```json
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ]
    }
  }
}
```

### Local Development Configuration

```json
{
  "mcpServers": {
    "content-core": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/your/content-core",
        "run",
        "content-core-mcp"
      ]
    }
  }
}
```

### With Environment Variables

For optimal functionality, you'll need to configure API keys. Here's what each key does:

**Required:**

- `OPENAI_API_KEY` - **Required for audio/video transcription and content cleaning**

**Optional (but recommended):**

- `FIRECRAWL_API_KEY` - **Improved web crawling and content extraction from URLs**
- `JINA_API_KEY` - **Alternative web crawling service (fallback when Firecrawl unavailable)**

```json
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ],
      "env": {
        "OPENAI_API_KEY": "sk-your-openai-key-here",
        "FIRECRAWL_API_KEY": "fc-your-firecrawl-key-here",
        "JINA_API_KEY": "jina-your-jina-key-here"
      }
    }
  }
}
```

**Note:** Without `OPENAI_API_KEY`, you won't be able to:

- Transcribe audio or video files
- Use AI-powered content cleaning and summarization features

Without the web crawling API keys, Content Core will fall back to basic BeautifulSoup extraction for URLs, which may be less reliable for complex websites.

## Usage

After setting up the MCP server, you can use it directly in Claude Desktop conversations. The server provides one main tool:

### extract_content

Extract content from URLs or files using Content Core's auto engine.

**Parameters:**

- `url` (optional): URL to extract content from
- `file_path` (optional): Local file path to extract content from

**Note**: Exactly one parameter must be provided (either `url` OR `file_path`, not both).

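
The "exactly one parameter" rule can be checked on the client side before a call is made. The following is a minimal sketch, not part of Content Core's API: the helper name `build_extract_arguments` is hypothetical, and it only assumes the two parameter names documented above.

```python
from typing import Optional


def build_extract_arguments(
    url: Optional[str] = None,
    file_path: Optional[str] = None,
) -> dict:
    """Hypothetical helper: build the arguments for an `extract_content` call.

    Enforces the documented rule that exactly one of `url` or `file_path`
    is provided. This is an illustration, not Content Core's own code.
    """
    # Keep only the parameters that were actually supplied.
    provided = {
        name: value
        for name, value in (("url", url), ("file_path", file_path))
        if value is not None
    }
    if len(provided) != 1:
        raise ValueError("Provide exactly one of 'url' or 'file_path'")
    return provided


print(build_extract_arguments(url="https://example.com/article"))
```

Passing both parameters, or neither, raises `ValueError` in this sketch, so the mistake surfaces before any request reaches the server.
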
## Examples

### Extracting from URLs

**Prompt in Claude Desktop:**

```
Please extract the content from https://example.com/article
```

**What happens:** Claude will use the MCP server to extract the article content, including the title, main text, and metadata.

### Extracting from Files

**Prompt in Claude Desktop:**

```
Extract the content from /path/to/document.pdf
```

**What happens:** Claude will extract text content from the PDF, including any embedded text, tables, and structural information.

### Working with Videos

**Prompt in Claude Desktop:**

```
Please extract the transcript from /path/to/video.mp4
```

**What happens:** Content Core will extract audio from the video, transcribe it to text, and return the full transcript.

### Complex Workflows

**Prompt in Claude Desktop:**

```
Extract content from https://www.youtube.com/watch?v=example and summarize the key points in bullet format
```

**What happens:**

1. Claude extracts the YouTube video transcript using the MCP server
2. Claude then processes and summarizes the content as requested

## Response Format

The MCP server returns structured JSON responses:

```json
{
  "success": true,
  "error": null,
  "source_type": "url",
  "source": "https://example.com/article",
  "content": "Extracted article content...",
  "metadata": {
    "extraction_time_seconds": 2.34,
    "extraction_timestamp": "2025-06-19T13:00:00Z",
    "content_length": 1234,
    "identified_type": "text/html",
    "title": "Article Title",
    "final_url": "https://example.com/article"
  }
}
```

The `metadata` object may also include additional fields specific to the content type.

### Error Response

```json
{
  "success": false,
  "error": "File not found: /path/to/nonexistent.pdf",
  "source_type": "file",
  "source": "/path/to/nonexistent.pdf",
  "content": null,
  "metadata": {
    "extraction_timestamp": "2025-06-19T13:00:00Z",
    "error_type": "FileNotFoundError"
  }
}
```

## Supported Content Types

The MCP server supports all content types that Content Core can handle:

### Web Content

- HTML pages
- YouTube videos (transcript
extraction)
- Social media posts
- Articles and blogs
- Documentation sites

### Document Formats

- PDF files
- Microsoft Word (.docx)
- PowerPoint (.pptx)
- Excel (.xlsx)
- Markdown files
- Plain text files
- CSV files

### Media Files

- Video files (MP4, AVI, MOV, etc.) - extracts transcript
- Audio files (MP3, WAV, M4A, etc.) - transcribes to text
- Images (JPG, PNG, etc.) - OCR text extraction

### Other Formats

- ZIP archives (extracts text from contained files)
- EPUB books
- AsciiDoc files
- HTML files

## Configuration

### Engine Selection

Content Core's MCP server uses the 'auto' engine by default, which automatically selects the best extraction method based on:

- **URLs**: Firecrawl (if API key available) → Jina (if API key available) → BeautifulSoup
- **Files**: Docling → Simple extraction

### API Keys

To get the best extraction results, configure these API keys:

**Required for Audio/Video Processing:**

```bash
# Essential for transcribing audio and video files
export OPENAI_API_KEY="sk-your-openai-key-here"
```

**Optional but Recommended for Web Extraction:**

```bash
# For enhanced web crawling (recommended)
export FIRECRAWL_API_KEY="fc-your-firecrawl-key-here"

# Alternative web crawling service (fallback)
export JINA_API_KEY="jina-your-jina-key-here"
```

**Additional AI Models (Optional):**

```bash
# For alternative AI models
export GOOGLE_API_KEY="your-google-key"
```

**What happens without these keys:**

- **No OPENAI_API_KEY**: Audio/video transcription will fail
- **No web crawling keys**: URLs will use basic BeautifulSoup extraction (less reliable)
- **No AI model keys**: Content cleaning/summarization features won't work

**Getting API Keys:**

- **OpenAI**: Visit [OpenAI API Keys](https://platform.openai.com/api-keys)
- **Firecrawl**: Visit [Firecrawl](https://www.firecrawl.dev/) for enhanced web scraping
- **Jina**: Visit [Jina AI](https://jina.ai/) for alternative web extraction

### Engine Selection via Environment Variables

For advanced users, you
can override the extraction engines:

```json
{
  "mcpServers": {
    "content-core": {
      "command": "uvx",
      "args": [
        "--from",
        "content-core",
        "content-core-mcp"
      ],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "FIRECRAWL_API_KEY": "fc-...",
        "CCORE_DOCUMENT_ENGINE": "simple",
        "CCORE_URL_ENGINE": "auto"
      }
    }
  }
}
```

In this example, `CCORE_DOCUMENT_ENGINE` is set to `simple`, which skips docling and uses PyMuPDF, while `CCORE_URL_ENGINE` stays on `auto` (it could also be set to `firecrawl` or `jina`). JSON does not allow comments, so keep notes like these out of the configuration file itself.

**Available engines:**

- **Document**: `auto`, `simple`, `docling` (requires `content-core[docling]`)
- **URL**: `auto`, `simple`, `firecrawl`, `jina`

**Use cases:**

- Set `CCORE_DOCUMENT_ENGINE=simple` to avoid docling dependency issues
- Set `CCORE_URL_ENGINE=firecrawl` to always use paid service for better reliability
- Set `CCORE_URL_ENGINE=simple` for faster processing without external API calls

### Custom Prompts

You can customize Content Core's behavior by setting a custom prompt path:

```bash
export PROMPT_PATH="/path/to/your/custom/prompts"
```

## Troubleshooting

### Common Issues

**"MCP content-core: Unexpected token" errors:**

- This usually indicates output to stdout that interferes with the MCP protocol
- Content Core v1.0.5+ includes fixes to suppress MoviePy and other library outputs

**Connection failures:**

```bash
# Test the MCP server directly
content-core-mcp

# Or with uvx
uvx --from "content-core" content-core-mcp
```

**Missing dependencies:**

```bash
# Reinstall Content Core
pip install --force-reinstall content-core
```

**Audio/video extraction failing:**

- Make sure `OPENAI_API_KEY` is set in your environment variables
- Check that your OpenAI API key has sufficient credits
- Audio/video files require OpenAI's Whisper API for transcription

**Poor web extraction quality:**

- Add `FIRECRAWL_API_KEY` for better web scraping results
- Add `JINA_API_KEY` as a fallback option
- Without these keys, basic BeautifulSoup extraction is used (limited functionality)

### Debug Mode

For development and debugging, you can run the server with additional logging:

```bash
# Set debug level
export LOGURU_LEVEL=DEBUG
content-core-mcp
```

### Performance Considerations

- **Large files**: Video
and audio files may take longer to process due to transcription
- **API rate limits**: Some web extraction services have rate limits
- **Network connectivity**: URL extraction requires internet access

## Development

### Running Locally

```bash
# Clone the repository
git clone https://github.com/lfnovo/content-core
cd content-core

# Install with MCP dependencies
uv sync --extra mcp

# Run the server
make mcp-server
# or
uv run content-core-mcp
```

### Testing

```bash
# Run MCP-specific tests
uv run pytest tests/unit/test_mcp_server.py -v

# Run all tests
make test
```

## Contributing

Contributions to the MCP server are welcome! Please see our [Contributing Guide](../CONTRIBUTING.md) for development setup and guidelines.

## Support

- **Issues**: [GitHub Issues](https://github.com/lfnovo/content-core/issues)
- **Documentation**: [Main Documentation](usage.md)
- **MCP Protocol**: [Model Context Protocol](https://modelcontextprotocol.io/)

## Version History

- **v1.0.4**: Initial MCP server implementation
- **v1.0.5**: Added output suppression for better MCP compatibility
- **Latest**: Enhanced error handling and metadata support
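
## Appendix: Handling Responses in Code

For readers who consume the tool's output programmatically, the success and error shapes shown under Response Format can be handled with a single branch on `success`. This is a rough sketch under stated assumptions: the response is assumed to arrive as a JSON string, the helper name `summarize_response` is hypothetical, and only fields that appear in the documented examples are read.

```python
import json


def summarize_response(raw: str) -> str:
    """Hypothetical helper: reduce an `extract_content` response to one line."""
    response = json.loads(raw)
    # `metadata` is present in both documented shapes; tolerate its absence anyway.
    metadata = response.get("metadata") or {}
    if response.get("success"):
        title = metadata.get("title", "(no title)")
        # Fall back to measuring the content if `content_length` is missing.
        length = metadata.get("content_length", len(response.get("content") or ""))
        return f"OK: {title} ({length} characters) from {response['source']}"
    error_type = metadata.get("error_type", "UnknownError")
    return f"FAILED [{error_type}]: {response.get('error')}"


# The error response from the documentation, serialized as a JSON string.
error_response = json.dumps({
    "success": False,
    "error": "File not found: /path/to/nonexistent.pdf",
    "source_type": "file",
    "source": "/path/to/nonexistent.pdf",
    "content": None,
    "metadata": {
        "extraction_timestamp": "2025-06-19T13:00:00Z",
        "error_type": "FileNotFoundError",
    },
})
print(summarize_response(error_response))
```

Branching on `success` first, rather than checking whether `content` is empty, keeps the handling aligned with the documented contract, where a failed extraction sets `content` to `null` and reports details in `error` and `metadata.error_type`.
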
