Data Extractor is a commercial-grade MCP Server built on FastMCP. It reads and extracts content from web pages and PDFs, including both text and images, and converts it to local Markdown. It is purpose-built for long-term deployment in enterprise environments.

## 🛠️ MCP Server Core Tools (14)

### Web Page

| Tool | Description | Main parameters |
| -------------------------------------- | ------------------ | --------------------------------------------------------------------------------------------------- |
| **scrape_webpage** | Single-page scraping | `url`, `method` (auto-selected), `extract_config` (selector configuration), `wait_for_element` (CSS selector) |
| **scrape_multiple_webpages** | Batch page scraping | `urls` (list), `method` (one method for all URLs), `extract_config` (global configuration) |
| **scrape_with_stealth** | Anti-detection scraping | `url`, `method` (selenium/playwright), `scroll_page` (scroll to load content), `wait_for_element` |
| **fill_and_submit_form** | Form automation | `url`, `form_data` (selector: value), `submit` (whether to submit), `submit_button_selector` |
| **extract_links** | Specialized link extraction | `url`, `filter_domains` (domains to include), `exclude_domains` (domains to exclude), `internal_only` (internal links only) |
| **extract_structured_data** | Structured data extraction | `url`, `data_type` (all/contact/social/content/products/addresses) |
| **get_page_info** | Page information retrieval | `url` (target URL); returns the title, status code, and metadata |
| **check_robots_txt** | Crawler rules check | `url` (domain URL); checks the robots.txt rules |
| **convert_webpage_to_markdown** | Page to Markdown | `url`, `method`, `extract_main_content` (extract main content), `embed_images` (embed images), `formatting_options` |
| **batch_convert_webpages_to_markdown** | Batch Markdown conversion | `urls` (list), `method`, `extract_main_content`, `embed_images`, `embed_options` |

### PDF Document

| Tool | Description | Main parameters |
| ---------------------------------- | --------------- | ----------------------------------------------------------------------------------- |
| **convert_pdf_to_markdown** | PDF to Markdown | `pdf_source` (URL/path), `method` (auto/pymupdf/pypdf), `page_range`, `output_format` |
| **batch_convert_pdfs_to_markdown** | Batch PDF conversion | `pdf_sources` (list), `method`, `page_range`, `output_format`, `include_metadata` |

### Service Management

| Tool | Description | Main parameters |
| ---------------------- | ------------ | ----------------------------------------- |
| **get_server_metrics** | Performance metrics monitoring | No parameters; returns request statistics, performance metrics, and cache status |
| **clear_cache** | Cache management | No parameters; clears all cached data |

## 🎯 Quick Navigation

- [User Guide](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/6-User-Guide.md)
- [Architecture Design](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/1-Framework.md)
- [Development Guide](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/2-Development.md)
- [Testing Guide](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/3-Testing.md)
- [Configuration System](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/4-Configuration.md)
- [Common Commands](https://github.com/ThreeFish-AI/data-extractor/blob/master/docs/5-Commands.md)
- [Changelog](https://github.com/ThreeFish-AI/data-extractor/blob/master/CHANGELOG.md)

## 🤝 Contribution

[Issues](https://github.com/ThreeFish-AI/data-extractor/issues) and [Pull Requests](https://github.com/ThreeFish-AI/data-extractor/pulls) that improve this project are welcome.

## 📄 License

MIT License. See the [LICENSE](LICENSE) file for details.

---

**Note**: Please use this tool responsibly: comply with each website's terms of use and robots.txt rules, and respect its intellectual property.
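
As a rough illustration of how the tool names and parameters in the tables above map onto MCP requests, here is a minimal sketch that builds `tools/call` JSON-RPC 2.0 messages. The `tools/call` envelope comes from the MCP specification, not from this README. The helper name `build_tool_call`, the `example.com` URLs, and the `main` CSS selector are hypothetical placeholders. The sketch only constructs the payloads; sending them over a transport is left to whichever MCP client you use.

```python
import json
from itertools import count

# Monotonic request ids; JSON-RPC only requires that each id be unique per session.
_request_ids = count(1)


def build_tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Parameter names are taken from the tool tables; the values are placeholders.
pdf_request = build_tool_call(
    "convert_pdf_to_markdown",
    {"pdf_source": "https://example.com/report.pdf", "method": "auto"},
)
page_request = build_tool_call(
    "scrape_webpage",
    {"url": "https://example.com", "wait_for_element": "main"},
)

print(json.dumps(pdf_request, indent=2))
```

In practice an MCP client library builds this envelope for you; the point here is only that each table row corresponds to a tool `name` plus an `arguments` object keyed by the listed parameters.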

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/ThreeFish-AI/scrapy-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server