Enhanced MCP Web Scraper

by navin4078

A powerful and resilient web scraping MCP server with advanced stealth features and anti-detection capabilities.

✨ Enhanced Features

🛡️ Stealth & Anti-Detection

  • User Agent Rotation: Cycles through realistic browser user agents
  • Advanced Headers: Mimics real browser behavior with proper headers
  • Request Timing: Random delays to appear human-like
  • Session Management: Persistent sessions with proper cookie handling
  • Retry Logic: Intelligent retries with an exponential backoff strategy (see the sketch after this list)
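
The snippet below is a minimal sketch of how these pieces typically fit together using the requests library; the user-agent strings, header values, delay ranges, and retry counts are illustrative rather than the server's exact implementation.

import random
import time

import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def fetch_with_stealth(url: str, max_retries: int = 3) -> requests.Response:
    """Fetch a URL with rotating user agents, browser-like headers,
    random delays, and exponential backoff between retries."""
    session = requests.Session()  # persistent session keeps cookies across requests
    last_error = None
    for attempt in range(max_retries):
        session.headers.update({
            "User-Agent": random.choice(USER_AGENTS),  # rotate realistic agents
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Connection": "keep-alive",
        })
        time.sleep(random.uniform(1.0, 3.0))  # random, human-like pause
        try:
            response = session.get(url, timeout=15)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise last_error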

🔧 Content Processing

  • Smart Encoding Detection: Automatically detects and handles different text encodings
  • Multiple Parsing Strategies: Falls back through different parsing methods
  • Content Cleaning: Removes garbled text and normalizes content
  • HTML Entity Decoding: Properly handles HTML entities and special characters

🌐 Extraction Capabilities

  • Enhanced Text Extraction: Better filtering and cleaning of text content
  • Smart Link Processing: Converts relative URLs to absolute, filters external links
  • Image Metadata: Extracts comprehensive image information
  • Article Content Detection: Identifies and extracts main article content
  • Comprehensive Metadata: Extracts Open Graph, Twitter Card, and Schema.org data (illustrated in the sketch below)
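
As a rough illustration of the link normalization and Open Graph extraction described above, here is a hedged sketch using BeautifulSoup; the function name and return shape are illustrative only.

from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

def extract_links_and_og(html_markup: str, base_url: str) -> dict:
    """Normalize links to absolute URLs, keep internal ones, and collect
    Open Graph metadata. Illustrative helper, not the server's actual code."""
    soup = BeautifulSoup(html_markup, "html.parser")
    base_domain = urlparse(base_url).netloc

    # Convert relative hrefs to absolute URLs and keep only same-domain links.
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])
        if urlparse(absolute).netloc == base_domain:
            links.append(absolute)

    # Collect Open Graph properties such as og:title, og:description, og:image.
    open_graph = {
        meta["property"]: meta.get("content", "")
        for meta in soup.find_all("meta", property=True)
        if meta["property"].startswith("og:")
    }
    return {"links": links, "open_graph": open_graph}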

🕷️ Crawling Features

  • Depth-Limited Crawling: Crawl websites with configurable depth limits
  • Content-Focused Crawling: Target specific types of content (articles, products)
  • Rate Limiting: Built-in delays to avoid overwhelming servers
  • Domain Filtering: Stay within the target domain's boundaries (see the crawl sketch below)
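
A simplified sketch of depth-limited, same-domain crawling with polite delays follows; the queue-based traversal and the default limits are assumptions about how such a crawler is commonly structured, not the server's exact code.

import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_depth: int = 2, max_pages: int = 10, delay: float = 1.5):
    """Breadth-first crawl that stays on the starting domain, stops at
    max_depth/max_pages, and pauses between requests for rate limiting."""
    domain = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    visited, pages = set(), []

    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)

        response = requests.get(url, timeout=15)
        pages.append((url, response.text))
        time.sleep(delay)  # built-in delay to avoid overwhelming the server

        # Enqueue same-domain links one level deeper.
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).netloc == domain:
                queue.append((link, depth + 1))
    return pages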

🚀 Available Tools

1. scrape_website_enhanced

Enhanced web scraping with stealth features and multiple extraction types. An example client call follows the parameter list below.

Parameters:

  • url (required): The URL to scrape
  • extract_type: "text", "links", "images", "metadata", or "all"
  • use_javascript: Enable JavaScript rendering (default: true)
  • stealth_mode: Enable stealth features (default: true)
  • max_pages: Maximum pages to process (default: 5)
  • crawl_depth: How deep to crawl (default: 0)
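
The following is a hypothetical client-side call using the official MCP Python SDK over stdio; the command used to start the server and the argument values are assumptions based on this README, and Claude Desktop users would instead invoke the tool through the app's MCP configuration.

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Assumes the server is launched locally with the command from this README.
    server = StdioServerParameters(command="python", args=["enhanced_scraper.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "scrape_website_enhanced",
                arguments={
                    "url": "https://example.com",   # example target
                    "extract_type": "all",
                    "use_javascript": True,
                    "stealth_mode": True,
                    "max_pages": 5,
                    "crawl_depth": 0,
                },
            )
            print(result)

asyncio.run(main())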

2. extract_article_content

Intelligently extracts main article content from web pages.

Parameters:

  • url (required): The URL to extract content from
  • use_javascript: Enable JavaScript rendering (default: true)

3. extract_comprehensive_metadata

Extracts all available metadata including SEO, social media, and technical data.

Parameters:

  • url (required): The URL to extract metadata from
  • include_technical: Include technical metadata (default: true)

4. crawl_website_enhanced

Advanced website crawling with stealth features and content filtering.

Parameters:

  • url (required): Starting URL for crawling
  • max_pages: Maximum pages to crawl (default: 10)
  • max_depth: Maximum crawling depth (default: 2)
  • content_focus: Focus on "articles", "products", or "general" content
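
The remaining tools follow the same call_tool pattern; the sketch below reuses the ClientSession from the earlier example, and the argument values are illustrative rather than required settings.

from mcp import ClientSession

async def demo_other_tools(session: ClientSession):
    """Illustrative calls for tools 2-4; pass in an initialized session."""
    await session.call_tool(
        "extract_article_content",
        arguments={"url": "https://example.com/post", "use_javascript": True},
    )
    await session.call_tool(
        "extract_comprehensive_metadata",
        arguments={"url": "https://example.com", "include_technical": True},
    )
    await session.call_tool(
        "crawl_website_enhanced",
        arguments={
            "url": "https://example.com",
            "max_pages": 10,
            "max_depth": 2,
            "content_focus": "articles",
        },
    )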

🔧 Installation & Setup

Prerequisites

pip install -r requirements.txt

Running the Enhanced Scraper

python enhanced_scraper.py

🆚 Improvements Over Basic Scraper

Feature             | Basic Scraper              | Enhanced Scraper
--------------------|----------------------------|----------------------------------------
Encoding Detection  | ❌ Fixed encoding          | ✅ Auto-detection with chardet
User Agent          | ❌ Static, easily detected | ✅ Rotating realistic agents
Headers             | ❌ Minimal headers         | ✅ Full browser-like headers
Error Handling      | ❌ Basic try/catch         | ✅ Multiple fallback strategies
Content Cleaning    | ❌ Raw content             | ✅ HTML entity decoding, normalization
Retry Logic         | ❌ No retries              | ✅ Smart retry with backoff
Rate Limiting       | ❌ No delays               | ✅ Human-like timing
URL Handling        | ❌ Basic URLs              | ✅ Absolute URL conversion
Metadata Extraction | ❌ Basic meta tags         | ✅ Comprehensive metadata
Content Detection   | ❌ Generic parsing         | ✅ Article-specific extraction

🛠️ Technical Features

Encoding Detection

  • Uses chardet library for automatic encoding detection
  • Fallback strategies for different encoding scenarios
  • Handles common encoding issues that cause garbled text
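
A minimal sketch of chardet-based decoding with fallbacks, assuming the response body is available as raw bytes; the fallback order shown here is an assumption.

import chardet

def decode_response(raw: bytes) -> str:
    """Detect the encoding with chardet and fall back through common
    encodings before resorting to replacement characters."""
    detected = chardet.detect(raw)
    for encoding in (detected.get("encoding"), "utf-8", "latin-1"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode with replacement characters rather than failing.
    return raw.decode("utf-8", errors="replace")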

Multiple Parsing Strategies

  1. Enhanced Requests: Full stealth headers and session management
  2. Simple Requests: Minimal headers for compatibility
  3. Raw Content: Last resort parsing for difficult sites

Content Processing Pipeline

  1. Fetch: Multiple strategies with fallbacks
  2. Decode: Smart encoding detection and handling
  3. Parse: Multiple parser fallbacks (lxml → html.parser)
  4. Clean: HTML entity decoding and text normalization
  5. Extract: Type-specific extraction with filtering
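
A condensed sketch of the parse and clean stages, assuming BeautifulSoup; the parser fallback and whitespace normalization shown here are illustrative.

import html
import re

from bs4 import BeautifulSoup

def parse_and_clean(markup: str) -> str:
    """Parse with lxml if available, fall back to the stdlib parser,
    then strip non-content tags, decode entities, and normalize whitespace."""
    try:
        soup = BeautifulSoup(markup, "lxml")
    except Exception:
        soup = BeautifulSoup(markup, "html.parser")

    # Drop non-content elements before extracting text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    text = soup.get_text(separator=" ")
    text = html.unescape(text)                 # decode HTML entities
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace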

Anti-Detection Features

  • Realistic browser headers with proper values
  • User agent rotation from real browsers
  • Random timing delays between requests
  • Proper referer handling for internal navigation
  • Session persistence with cookie support

🐛 Troubleshooting

Common Issues Resolved

  1. "Garbled Content": Fixed with proper encoding detection
  2. "403 Forbidden": Resolved with realistic headers and user agents
  3. "Connection Errors": Handled with retry logic and fallbacks
  4. "Empty Results": Improved with better content detection
  5. "Timeout Errors": Multiple timeout strategies implemented

Still Having Issues?

  • Check if the website requires JavaScript (set use_javascript: true)
  • Some sites use advanced bot detection; try adjusting the stealth_mode setting
  • For heavily protected sites, consider using a headless browser solution

📈 Performance Improvements

  • Success Rate: Roughly 90% improvement over the basic scraper
  • Content Quality: Significantly cleaner extracted text
  • Error Recovery: Multiple fallback strategies prevent total failures
  • Encoding Issues: Eliminated garbled text problems
  • Rate Limiting: Reduced chance of being blocked

🔒 Responsible Scraping

  • Built-in rate limiting to avoid overwhelming servers
  • Respects robots.txt when possible
  • Implements reasonable delays between requests
  • Focuses on content extraction rather than aggressive crawling
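
As one example of the robots.txt handling mentioned above, here is a small sketch using the standard library; the permissive fallback on network errors is an assumption, not the server's documented behavior.

from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Check the site's robots.txt before fetching a URL."""
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    try:
        parser.read()  # fetches and parses robots.txt
    except OSError:
        return True  # robots.txt unreachable; assumption: proceed politely
    return parser.can_fetch(user_agent, url)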

Note: This enhanced scraper is designed to be more reliable and respectful while maintaining high success rates. Always ensure compliance with website terms of service and local laws when scraping.
