The Scrapy MCP Server is a web scraping MCP server for commercial use that extracts data from both static and JavaScript-rendered pages.
Core Scraping Capabilities:
Multiple scraping methods: HTTP requests, Scrapy framework, Selenium, or Playwright with intelligent method selection
Concurrent processing: Scrape multiple URLs simultaneously with exponential backoff retry mechanisms
JavaScript support: Fully render dynamic, JavaScript-heavy websites using complete browser rendering
Advanced data extraction: Configure flexible extraction rules using simple or advanced selectors, or automatically extract structured data like contact information, social media links, product details, and addresses
Link extraction: Specialized link extraction with domain filtering and internal/external link options
Form interaction: Automatically fill and submit various form types including text inputs, checkboxes, and file uploads
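The concurrent processing and exponential backoff behaviour described above can be sketched with `asyncio`. This is a minimal illustration, not the server's actual implementation: the names `fetch_with_retry` and `scrape_many`, the `fetch` callable, and the default values are all assumptions made for the example.

```python
import asyncio
import random
from typing import Awaitable, Callable, Dict, Iterable, Union


async def fetch_with_retry(
    url: str,
    fetch: Callable[[str], Awaitable[str]],
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> str:
    """Call fetch(url), retrying failures with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles on every attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            # A little random jitter keeps retries from firing in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("unreachable")


async def scrape_many(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    concurrency: int = 5,
    **retry_kwargs,
) -> Dict[str, Union[str, Exception]]:
    """Scrape URLs concurrently, with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url: str):
        async with semaphore:
            try:
                return url, await fetch_with_retry(url, fetch, **retry_kwargs)
            except Exception as exc:
                # Record the failure instead of cancelling the whole batch.
                return url, exc

    return dict(await asyncio.gather(*(worker(u) for u in urls)))
```

Here `fetch` is any coroutine that returns page content for a URL (for example a thin wrapper around an HTTP client). Failed URLs map to the exception that caused the failure, so one bad page does not abort the batch.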
Anti-Detection & Performance:
Stealth techniques: Bypass anti-bot measures using undetected-chromedriver, Playwright stealth, random User-Agent rotation, and proxy support
Performance optimization: In-memory caching, rate limiting, and intelligent request handling to prevent server overload
Monitoring tools: Track server metrics including request counts, success rates, cache statistics, and detailed performance monitoring
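As a rough illustration of the rate limiting and User-Agent rotation mentioned above, the sketch below enforces a minimum interval between requests and picks a random User-Agent header. The class name `RateLimiter`, the function `build_headers`, and the User-Agent strings are placeholders invented for this example; the server's real mechanism may differ.

```python
import random
import time

# Placeholder User-Agent strings (assumption); a real deployment keeps its own list.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0",
]


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock    # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


def build_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}
```

Calling `limiter.wait()` before each request keeps the request rate at or below `requests_per_second`, which is one simple way to avoid overloading the target server.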
Enterprise Features:
Ethical compliance: Check robots.txt files for responsible data collection
Error handling: Robust error classification and handling mechanisms
Cache management: Clear scraping results cache and manage server resources
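A robots.txt check of the kind listed under ethical compliance can be sketched with Python's standard `urllib.robotparser`. The helper names `robots_url_for` and `is_allowed` are hypothetical, and this is not necessarily how the server performs the check.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt location for the site that hosts page_url."""
    return urljoin(page_url, "/robots.txt")


def is_allowed(robots_txt: str, page_url: str, user_agent: str = "*") -> bool:
    """Check whether user_agent may fetch page_url under the given rules."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so the rules can come from any source.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)
```

To read the rules over the network instead of passing them in as text, `RobotFileParser` also provides `set_url()` followed by `read()`.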
Provides web scraping capabilities using the Scrapy framework for large-scale data extraction, with support for concurrent requests, custom pipelines, and advanced crawling features.
Enables browser automation and JavaScript-heavy website scraping through Selenium WebDriver, with support for form filling, element waiting, and dynamic content extraction.
id: data-extractor
sidebar_position: 1
title: Data Extractor
description: Readme of Data Extractor
last_update:
  author: Aurelius
  date: 2025-11-24
tags:
README
Data Extractor
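Data Extractor is a robust, stable MCP Server for extracting web page and PDF content, built on FastMCP together with Scrapy, markdownify, pypdf, and pymupdf. It can convert web pages and PDF documents to Markdown and is designed for long-term use in commercial environments.
🛠️ MCP Server Core Tools (14)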
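Web Page
Tool Name | Description | Main Parameters |
scrape_webpage | Single-page scraping | …, … (auto-selected), … (selector configuration), … (CSS selector) |
scrape_multiple_webpages | Batch page scraping | … (list), … (shared method), … (global configuration) |
scrape_with_stealth | Anti-detection scraping | …, … (selenium/playwright), … (scroll loading), … |
fill_and_submit_form | Form automation | …, … (selector: value), … (whether to submit), … |
extract_links | Specialized link extraction | …, … (domain filter), … (excluded domains), … (internal only) |
extract_structured_data | Structured data extraction | …, … (all/contact/social/content/products/addresses) |
get_page_info | Page information retrieval | … (target URL) - returns title, status code, and metadata |
check_robots_txt | Crawler rules check | … (domain URL) - checks robots.txt rules |
convert_webpage_to_markdown | Page to Markdown | …, …, … (extract main content), … (embed images), … |
batch_convert_webpages_to_markdown | Batch Markdown conversion | … (list), …, …, …, … |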
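To make the link extraction options concrete (domain filter, excluded domains, internal only), here is a standard-library sketch. The function name `extract_links_sketch` and the parameter names `filter_domains`, `exclude_domains`, and `internal_only` are assumptions chosen for illustration; they are not taken from the server's actual tool signature.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links_sketch(html, base_url, filter_domains=None,
                         exclude_domains=None, internal_only=False):
    """Return unique absolute http(s) links, filtered by host (hypothetical API)."""
    collector = _LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc.lower()
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)   # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue                         # skip mailto:, javascript:, etc.
        host = parsed.netloc.lower()
        if internal_only and host != base_host:
            continue
        if filter_domains and host not in filter_domains:
            continue
        if exclude_domains and host in exclude_domains:
            continue
        if absolute not in links:            # de-duplicate, preserving order
            links.append(absolute)
    return links
```

Hosts are compared by exact match here; a fuller version might also treat subdomains as internal.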
PDF Document
Tool Name | Description | Main Parameters |
convert_pdf_to_markdown | PDF to Markdown | … (URL/path), … (auto/pymupdf/pypdf), …, … |
batch_convert_pdfs_to_markdown | Batch PDF conversion | … (list), …, …, …, … |
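The auto/pymupdf/pypdf option suggests that the PDF tools choose between two extraction engines. The sketch below shows one way such a choice could work. The helper names and the preference for pymupdf when both libraries are installed are assumptions, and the example returns plain text rather than the Markdown the real tools produce.

```python
import importlib.util


def pick_pdf_engine(method="auto", is_installed=None):
    """Resolve 'auto' to a concrete engine name (assumed order: pymupdf, then pypdf)."""
    if is_installed is None:
        def is_installed(name):
            return importlib.util.find_spec(name) is not None
    if method == "auto":
        for candidate in ("pymupdf", "pypdf"):
            if is_installed(candidate):
                return candidate
        raise RuntimeError("neither pymupdf nor pypdf is installed")
    if method not in ("pymupdf", "pypdf"):
        raise ValueError(f"unknown method: {method!r}")
    if not is_installed(method):
        raise RuntimeError(f"{method} is not installed")
    return method


def extract_pdf_text(path, method="auto"):
    """Extract plain text from a local PDF with the selected engine."""
    engine = pick_pdf_engine(method)
    if engine == "pymupdf":
        import pymupdf  # PyMuPDF; older releases are imported as `fitz`
        with pymupdf.open(path) as doc:
            return "\n\n".join(page.get_text() for page in doc)
    from pypdf import PdfReader
    reader = PdfReader(path)
    # extract_text() can return an empty result for image-only pages.
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)
```

The `is_installed` hook exists only so the selection logic can be exercised without either library present.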
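Service Management
Tool Name | Description | Main Parameters |
get_server_metrics | Performance metrics monitoring | No parameters - returns request statistics, performance metrics, and cache status |
clear_cache | Cache management | No parameters - clears all cached data |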
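The relationship between cached results, cache statistics, and clearing the cache can be illustrated with a small in-memory cache. This is a sketch only: the class `ResultCache`, its methods, and the 300-second default lifetime are hypothetical and do not describe the server's internal data structures.

```python
import time


class ResultCache:
    """In-memory cache with a time-to-live and simple hit/miss counters."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested without waiting
        self._store = {}      # key -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._store.get(key)
        if entry is not None and self._clock() - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        self._store.pop(key, None)   # drop an expired entry, if present
        self.misses += 1
        return None

    def set(self, key, value):
        self._store[key] = (self._clock(), value)

    def clear(self):
        """Remove every entry and return how many were removed."""
        removed = len(self._store)
        self._store.clear()
        return removed

    def stats(self):
        """Return the kind of cache figures a metrics tool might report."""
        total = self.hits + self.misses
        return {
            "size": len(self._store),
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": self.hits / total if total else 0.0,
        }
```

In this sketch `clear()` leaves the hit and miss counters untouched, so statistics survive a cache reset; whether the real server does the same is not stated in this README.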
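🎯 Quick Reference
🤝 Contributing
Issues and Pull Requests that improve this project are welcome.
📄 License
MIT License - see the LICENSE file for details
Note: Use this tool responsibly, comply with each website's terms of use and robots.txt rules, and respect the website's intellectual property.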