
πŸ•·οΈ ProjectS – Production-Ready Web Scraping Framework

A production-ready scraping server designed for long-running, scalable, and maintainable web crawls.

Built with industry best practices:

  • Clear separation of concerns
  • Pluggable fetch mechanisms (HTTP & Browser)
  • Proxy & User-Agent rotation
  • Spider-based page discovery
  • Domain-organized HTML storage
  • File-based caching system
  • Graceful exit handling

✨ Key Features

  • ✅ Two scraping modes
    • requests (fast, lightweight)
    • selenium (browser-based, JS-heavy sites)
  • 🔄 Proxy rotation (pluggable)
  • 🧬 User-Agent rotation
  • 🕸️ Spider-based page discovery with domain & path filtering
  • 💾 Smart file-based caching (skip re-downloading existing pages)
  • 📦 Domain-organized HTML storage (e.g., data/raw_html/example.com/)
  • 🔄 Auto-refresh index pages (always get latest updates)
  • ⌨️ Graceful exit (Ctrl+C to stop cleanly)
  • 🧱 Clean, modular architecture

🧠 Architecture Overview

Spider → Fetcher → HTML Storage → (Later Parsing)
           ↑
   Requests / Selenium

Important design principle:

Fetching HTML is separated from parsing & extraction.

This allows you to:

  • Re-parse data later
  • Debug failures easily
  • Change extraction logic without re-scraping
  • Resume scraping across runs (file-based caching)
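As a rough sketch of this separation, the crawl loop only fetches and stores; parsing happens later, against the saved files. Class and method names here are illustrative, not the framework's actual API:

```python
# Minimal sketch of the Spider -> Fetcher -> Storage pipeline.
# Names are illustrative; the real classes live under core/.

class Fetcher:
    """Fetches raw HTML; swap in an HTTP- or browser-based implementation."""
    def fetch(self, url: str) -> str:
        raise NotImplementedError

class Storage:
    """Persists raw HTML so parsing can happen later, offline."""
    def save(self, url: str, html: str) -> None: ...
    def exists(self, url: str) -> bool: ...

def crawl(start_url, fetcher, storage, discover_links):
    """Breadth-first crawl: fetch-and-store only, no extraction."""
    queue, seen = [start_url], set()
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        if not storage.exists(url):          # file-based caching
            storage.save(url, fetcher.fetch(url))
        queue.extend(discover_links(url))    # parsing happens in a later stage
    return seen
```

Because the stored HTML is the only output of this loop, extraction logic can change freely without another crawl.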

πŸ“ Project Structure

projectS/
├── config/                # Global configuration
│   ├── settings.py         # All configuration values
│   └── logging.py          # Logging setup
│
├── core/                  # Core framework logic
│   ├── fetchers/           # Requests / Selenium fetchers
│   │   ├── base.py
│   │   ├── requests_fetcher.py
│   │   └── selenium_fetcher.py
│   │
│   ├── spider/             # Page discovery logic
│   │   ├── base_spider.py
│   │   └── simple_spider.py
│   │
│   ├── storage/            # HTML persistence
│   │   └── html_storage.py
│   │
│   ├── proxy/              # Proxy rotation
│   │   └── proxy_manager.py
│   │
│   ├── user_agent/         # User-Agent rotation
│   │   └── user_agent_manager.py
│   │
│   └── utils/              # Helpers (robots, URLs, etc.)
│       ├── robots.py
│       └── url.py
│
├── sites/                 # Site-specific spider implementations
│   └── example_site/
│       ├── run.py          # Run script for this site
│       └── spider.py       # Custom spider logic (optional)
│
├── data/                  # Stored HTML & outputs
│   └── raw_html/
│       ├── example.com/
│       ├── another-site.com/
│       └── ...
│
├── requirements.txt
└── README.md

🚀 Getting Started

1️⃣ Clone the repository

git clone https://github.com/ArchitW/projectS.git
cd projectS

2️⃣ Create a virtual environment

python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

3️⃣ Install dependencies

pip install -r requirements.txt

4️⃣ Run the example scraper

python3 -m sites.example_site.run

🎯 Quick Configuration

Edit sites/example_site/run.py:

START_URL = "https://example.com/"

# Filter URLs by path (empty = crawl entire site)
ALLOWED_PATHS = ['products/', 'blog/']  # Only crawl these paths
# ALLOWED_PATHS = []  # Crawl entire site

# Choose fetcher
fetcher = SeleniumFetcher()  # For JS-heavy sites
# fetcher = RequestsFetcher()  # For static sites

🧰 Scraping Modes

🔹 Mode 1: Requests (HTTP scraping)

Best for:

  • Static pages
  • APIs
  • Fast crawling

Features:

  • Lightweight
  • Easy proxy support
  • Low resource usage

Usage:

from core.fetchers.requests_fetcher import RequestsFetcher
from core.proxy.proxy_manager import ProxyManager

fetcher = RequestsFetcher(proxy_manager=ProxyManager(proxies))
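Under the hood, such a fetcher can be sketched with the requests library. This is a hypothetical implementation; the real one lives in core/fetchers/requests_fetcher.py and its interface may differ:

```python
import requests  # third-party: pip install requests

class MinimalRequestsFetcher:
    """Illustrative HTTP fetcher with optional proxy & User-Agent rotation."""

    def __init__(self, proxy_manager=None, user_agent_manager=None, timeout=30):
        self.proxy_manager = proxy_manager
        self.user_agent_manager = user_agent_manager
        self.timeout = timeout

    def fetch(self, url: str) -> str:
        headers = {}
        if self.user_agent_manager:                  # rotate UA per request
            headers["User-Agent"] = self.user_agent_manager.get()
        proxies = self.proxy_manager.get() if self.proxy_manager else None
        resp = requests.get(url, headers=headers, proxies=proxies,
                            timeout=self.timeout)
        resp.raise_for_status()                      # treat 4xx/5xx as failures
        return resp.text
```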

🔹 Mode 2: Selenium (Browser scraping)

Best for:

  • JavaScript-heavy sites
  • Dynamic content
  • Bot-protected pages

Features:

  • Real browser execution
  • JS rendering
  • Cookie & session handling
  • Configurable wait times

Usage:

from core.fetchers.selenium_fetcher import SeleniumFetcher

fetcher = SeleniumFetcher()
html = fetcher.fetch(url)  # Waits for page to load
# OR
html = fetcher.fetch(url, additional_wait=3)  # Wait extra 3 seconds for AJAX

Configuration (config/settings.py):

SELENIUM_HEADLESS = True  # Run in headless mode
SELENIUM_PAGE_LOAD_TIMEOUT = 30  # Max page load time
SELENIUM_WAIT_TIME = 10  # Implicit wait for elements

🔄 Proxy Rotation

Proxies are managed independently via a ProxyManager.

You can plug in:

  • Datacenter proxies
  • Residential proxies
  • Rotating proxy services

Example:

ProxyManager([
  "http://user:pass@ip:port",
  "http://ip:port"
])
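A minimal round-robin manager might look like this. It's a sketch of an assumed interface; the real one lives in core/proxy/proxy_manager.py and may differ:

```python
import itertools

class RoundRobinProxyManager:
    """Cycles through a proxy list; returns None when no proxies are set."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies) if proxies else None

    def get(self):
        """Return the next proxy dict in requests' format, or None."""
        if self._cycle is None:
            return None
        proxy = next(self._cycle)
        return {"http": proxy, "https": proxy}
```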

πŸ•ΈοΈ Spider & Page Discovery

The spider discovers links based on:

  • Domain filtering: Only follows links from the same domain
  • Path filtering: Optional path prefix matching
  • Duplicate prevention: Tracks visited URLs in memory

Configuration:

spider = SimpleSpider(
    base_url="https://example.com/",
    allowed_paths=['products/', 'blog/']  # Optional filtering
)

Examples:

# Only crawl product pages
ALLOWED_PATHS = ['products/']

# Crawl multiple sections
ALLOWED_PATHS = ['shop/', 'blog/', 'about/']

# Crawl entire site
ALLOWED_PATHS = []
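The filtering rules above can be sketched as a single predicate. This is illustrative; the actual logic lives in core/spider/simple_spider.py:

```python
from urllib.parse import urlparse

def should_follow(url, base_url, allowed_paths):
    """Return True if the spider should follow this link."""
    base = urlparse(base_url)
    target = urlparse(url)
    if target.netloc != base.netloc:        # domain filtering
        return False
    if not allowed_paths:                   # empty list = crawl entire site
        return True
    path = target.path.lstrip("/")
    return any(path.startswith(prefix) for prefix in allowed_paths)
```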

💾 Smart HTML Storage

File Organization

HTML files are organized by domain with meaningful filenames:

data/raw_html/
├── example.com/
│   ├── index.html
│   ├── products-laptop.html
│   ├── products-phone.html
│   └── about-team.html
└── another-site.com/
    ├── index.html
    └── ...

URL to Filename Mapping:

  • https://example.com/ → example.com/index.html
  • https://example.com/products/laptop → example.com/products-laptop.html
  • https://example.com/about/team → example.com/about-team.html
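One way to implement this mapping, consistent with the examples above (a sketch; the framework's actual code is in core/storage/html_storage.py):

```python
from urllib.parse import urlparse

def url_to_filepath(url, root="data/raw_html"):
    """Map a URL to a domain-organized, human-readable file path."""
    parsed = urlparse(url)
    path = parsed.path.strip("/")
    name = "index" if not path else path.replace("/", "-")
    return f"{root}/{parsed.netloc}/{name}.html"
```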

File-Based Caching

Smart Resume Feature:

  • ✅ Skips downloading pages that already exist
  • ✅ Reads existing files to discover new links
  • ✅ Only downloads NEW pages
  • 🔄 Always re-downloads index page (to catch new content)

Example Output:

🔄 Deleted existing index page to refresh content
✓ Saved: https://example.com/ -> data/raw_html/example.com/index.html
⏭️  Skipped download (file exists): https://example.com/page-1
⏭️  Skipped download (file exists): https://example.com/page-2
✓ Saved: https://example.com/new-page -> data/raw_html/example.com/new-page.html

Benefits:

  • 🧪 Debug scraping issues
  • 🔁 Re-parse without re-scraping
  • 💾 Resume interrupted scrapes
  • 🚀 Incremental updates (run daily to catch new pages)

⌨️ Running & Control

Start Scraping

python3 -m sites.example_site.run

Stop Gracefully

Press Ctrl+C to stop cleanly:

⚠️  Interrupted by user (Ctrl+C)
🛑 Stopping scraper gracefully...
🧹 Cleaning up...
✅ Browser closed. Goodbye!

The scraper will:

  • Stop downloading immediately
  • Close the browser properly
  • Leave no zombie processes or hung browsers
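In code, graceful shutdown is essentially a try/except/finally around the crawl loop. This is a sketch; the real handling lives in the site's run script:

```python
def run(urls, fetcher, storage):
    """Fetch each URL; stop cleanly on Ctrl+C, always release the browser."""
    try:
        for url in urls:
            if not storage.exists(url):
                storage.save(url, fetcher.fetch(url))
    except KeyboardInterrupt:
        print("Interrupted by user (Ctrl+C); stopping gracefully...")
    finally:
        fetcher.close()              # always close the browser/session
```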

βš™οΈ Configuration Reference

Edit config/settings.py:

# HTTP Requests
REQUEST_TIMEOUT = 30  # Request timeout in seconds

# Selenium
SELENIUM_HEADLESS = True  # Run browser in headless mode
SELENIUM_PAGE_LOAD_TIMEOUT = 30  # Max page load time
SELENIUM_WAIT_TIME = 10  # Implicit wait for elements

# Storage
RAW_HTML_DIR = "data/raw_html"  # HTML storage directory

⚠️ Legal & Ethical Scraping

This framework is designed to be used responsibly.

You must:

  • ✅ Respect robots.txt
  • ✅ Follow site terms of service
  • ✅ Add delays & rate limits
  • ✅ Avoid scraping sensitive or private data
  • ✅ Use responsibly and legally
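For robots.txt specifically, the standard library already provides the check (the framework ships its own helper in core/utils/robots.py). In this sketch the rules are passed in as text so the example stays offline; RobotFileParser can also fetch robots.txt itself via set_url() and read():

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check whether a URL may be fetched under the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```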

πŸ›£οΈ Roadmap / Future Extensions

Planned enhancements:

  • ⏱️ Rate limiting & retries
  • 🤖 Ban detection
  • ⚙️ Async fetchers (httpx)
  • 🧪 Playwright support
  • 🐳 Dockerization
  • 🌐 FastAPI control layer
  • 📊 Database ingestion pipeline
  • 🔍 Sitemap parsing

πŸ‘¨β€πŸ’» Who is this for?

  • Backend engineers
  • Data engineers
  • Interview prep projects
  • Personal scraping servers
  • Learning real-world scraping architecture

🧠 Philosophy

Scraping is a backend system, not a script.

This repo treats scraping as a long-running, maintainable service, not a one-off Python file.


📜 License

Use responsibly. You are responsible for how you use this software.