
πŸ•·οΈ ProjectS – Production-Ready Web Scraping Framework

A production-ready scraping server designed for long-running, scalable, and maintainable web crawls.

Built with industry best practices:

  • Clear separation of concerns
  • Pluggable fetch mechanisms (HTTP & Browser)
  • Proxy & User-Agent rotation
  • Spider-based page discovery
  • Domain-organized HTML storage
  • File-based caching system
  • Graceful exit handling

✨ Key Features

  • ✅ Two scraping modes
    • requests (fast, lightweight)
    • selenium (browser-based, JS-heavy sites)
  • 🔄 Proxy rotation (pluggable)
  • 🧬 User-Agent rotation
  • 🕸️ Spider-based page discovery with domain & path filtering
  • 💾 Smart file-based caching (skip re-downloading existing pages)
  • 📦 Domain-organized HTML storage (e.g., data/raw_html/example.com/)
  • 🔄 Auto-refresh index pages (always get latest updates)
  • ⌨️ Graceful exit (Ctrl+C to stop cleanly)
  • 🧱 Clean, modular architecture

🧠 Architecture Overview

Spider → Fetcher → HTML Storage → (Later Parsing)
           ↑
   Requests / Selenium

Important design principle:

Fetching HTML is separated from parsing & extraction.

This allows you to:

  • Re-parse data later
  • Debug failures easily
  • Change extraction logic without re-scraping
  • Resume scraping across runs (file-based caching)
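As a rough sketch of this separation, the crawl loop only fetches and stores; parsing happens later, against the saved files. Class and method names here are illustrative, not the framework's actual API:

```python
# Minimal sketch of the Spider -> Fetcher -> Storage pipeline.
# Names are illustrative; the real classes live under core/.

class Fetcher:
    """Fetches raw HTML; swap in an HTTP- or browser-based implementation."""
    def fetch(self, url: str) -> str:
        raise NotImplementedError

class Storage:
    """Persists raw HTML so parsing can happen later, offline."""
    def save(self, url: str, html: str) -> None: ...
    def exists(self, url: str) -> bool: ...

def crawl(start_url, fetcher, storage, discover_links):
    """Breadth-first crawl: fetch-and-store only, no extraction."""
    queue, seen = [start_url], set()
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        if not storage.exists(url):          # file-based caching
            storage.save(url, fetcher.fetch(url))
        queue.extend(discover_links(url))    # parsing happens in a later stage
    return seen
```

Because the stored HTML is the only output of this loop, extraction logic can change freely without another crawl.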

πŸ“ Project Structure

projectS/
├── config/                # Global configuration
│   ├── settings.py         # All configuration values
│   └── logging.py          # Logging setup
│
├── core/                  # Core framework logic
│   ├── fetchers/           # Requests / Selenium fetchers
│   │   ├── base.py
│   │   ├── requests_fetcher.py
│   │   └── selenium_fetcher.py
│   │
│   ├── spider/             # Page discovery logic
│   │   ├── base_spider.py
│   │   └── simple_spider.py
│   │
│   ├── storage/            # HTML persistence
│   │   └── html_storage.py
│   │
│   ├── proxy/              # Proxy rotation
│   │   └── proxy_manager.py
│   │
│   ├── user_agent/         # User-Agent rotation
│   │   └── user_agent_manager.py
│   │
│   └── utils/              # Helpers (robots, URLs, etc.)
│       ├── robots.py
│       └── url.py
│
├── sites/                 # Site-specific spider implementations
│   └── example_site/
│       ├── run.py          # Run script for this site
│       └── spider.py       # Custom spider logic (optional)
│
├── data/                  # Stored HTML & outputs
│   └── raw_html/
│       ├── example.com/
│       ├── another-site.com/
│       └── ...
│
├── requirements.txt
└── README.md

🚀 Getting Started

1️⃣ Clone the repository

git clone https://github.com/ArchitW/projectS.git
cd projectS

2️⃣ Create a virtual environment

python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

3️⃣ Install dependencies

pip install -r requirements.txt

4️⃣ Run the example scraper

python3 -m sites.example_site.run

🎯 Quick Configuration

Edit sites/example_site/run.py:

START_URL = "https://example.com/"

# Filter URLs by path (empty = crawl entire site)
ALLOWED_PATHS = ['products/', 'blog/']  # Only crawl these paths
# ALLOWED_PATHS = []  # Crawl entire site

# Choose fetcher
fetcher = SeleniumFetcher()  # For JS-heavy sites
# fetcher = RequestsFetcher()  # For static sites

🧰 Scraping Modes

🔹 Mode 1: Requests (HTTP scraping)

Best for:

  • Static pages
  • APIs
  • Fast crawling

Features:

  • Lightweight
  • Easy proxy support
  • Low resource usage

Usage:

from core.fetchers.requests_fetcher import RequestsFetcher
from core.proxy.proxy_manager import ProxyManager

fetcher = RequestsFetcher(proxy_manager=ProxyManager(proxies))
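Under the hood, such a fetcher can be sketched with the requests library. This is a hypothetical implementation; the real one lives in core/fetchers/requests_fetcher.py and its interface may differ:

```python
import requests  # third-party: pip install requests

class MinimalRequestsFetcher:
    """Illustrative HTTP fetcher with optional proxy & User-Agent rotation."""

    def __init__(self, proxy_manager=None, user_agent_manager=None, timeout=30):
        self.proxy_manager = proxy_manager
        self.user_agent_manager = user_agent_manager
        self.timeout = timeout

    def fetch(self, url: str) -> str:
        headers = {}
        if self.user_agent_manager:                  # rotate UA per request
            headers["User-Agent"] = self.user_agent_manager.get()
        proxies = self.proxy_manager.get() if self.proxy_manager else None
        resp = requests.get(url, headers=headers, proxies=proxies,
                            timeout=self.timeout)
        resp.raise_for_status()                      # treat 4xx/5xx as failures
        return resp.text
```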

🔹 Mode 2: Selenium (Browser scraping)

Best for:

  • JavaScript-heavy sites
  • Dynamic content
  • Bot-protected pages

Features:

  • Real browser execution
  • JS rendering
  • Cookie & session handling
  • Configurable wait times

Usage:

from core.fetchers.selenium_fetcher import SeleniumFetcher

fetcher = SeleniumFetcher()
html = fetcher.fetch(url)  # Waits for page to load
# OR
html = fetcher.fetch(url, additional_wait=3)  # Wait extra 3 seconds for AJAX

Configuration (config/settings.py):

SELENIUM_HEADLESS = True  # Run in headless mode
SELENIUM_PAGE_LOAD_TIMEOUT = 30  # Max page load time
SELENIUM_WAIT_TIME = 10  # Implicit wait for elements

🔄 Proxy Rotation

Proxies are managed independently via a ProxyManager.

You can plug in:

  • Datacenter proxies
  • Residential proxies
  • Rotating proxy services

Example:

ProxyManager([
  "http://user:pass@ip:port",
  "http://ip:port"
])
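A minimal round-robin manager might look like this. It's a sketch of an assumed interface; the real one lives in core/proxy/proxy_manager.py and may differ:

```python
import itertools

class RoundRobinProxyManager:
    """Cycles through a proxy list; returns None when no proxies are set."""

    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies) if proxies else None

    def get(self):
        """Return the next proxy dict in requests' format, or None."""
        if self._cycle is None:
            return None
        proxy = next(self._cycle)
        return {"http": proxy, "https": proxy}
```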

πŸ•ΈοΈ Spider & Page Discovery

The spider discovers links based on:

  • Domain filtering: Only follows links from the same domain
  • Path filtering: Optional path prefix matching
  • Duplicate prevention: Tracks visited URLs in memory

Configuration:

spider = SimpleSpider(
    base_url="https://example.com/",
    allowed_paths=['products/', 'blog/']  # Optional filtering
)

Examples:

# Only crawl product pages
ALLOWED_PATHS = ['products/']

# Crawl multiple sections
ALLOWED_PATHS = ['shop/', 'blog/', 'about/']

# Crawl entire site
ALLOWED_PATHS = []
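The filtering rules above can be sketched as a single predicate. This is illustrative; the actual logic lives in core/spider/simple_spider.py:

```python
from urllib.parse import urlparse

def should_follow(url, base_url, allowed_paths):
    """Return True if the spider should follow this link."""
    base = urlparse(base_url)
    target = urlparse(url)
    if target.netloc != base.netloc:        # domain filtering
        return False
    if not allowed_paths:                   # empty list = crawl entire site
        return True
    path = target.path.lstrip("/")
    return any(path.startswith(prefix) for prefix in allowed_paths)
```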

💾 Smart HTML Storage

File Organization

HTML files are organized by domain with meaningful filenames:

data/raw_html/
├── example.com/
│   ├── index.html
│   ├── products-laptop.html
│   ├── products-phone.html
│   └── about-team.html
└── another-site.com/
    ├── index.html
    └── ...

URL to Filename Mapping:

  • https://example.com/ → example.com/index.html
  • https://example.com/products/laptop → example.com/products-laptop.html
  • https://example.com/about/team → example.com/about-team.html
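One way to implement this mapping, consistent with the examples above (a sketch; the framework's actual code is in core/storage/html_storage.py):

```python
from urllib.parse import urlparse

def url_to_filepath(url, root="data/raw_html"):
    """Map a URL to a domain-organized, human-readable file path."""
    parsed = urlparse(url)
    path = parsed.path.strip("/")
    name = "index" if not path else path.replace("/", "-")
    return f"{root}/{parsed.netloc}/{name}.html"
```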

File-Based Caching

Smart Resume Feature:

  • ✅ Skips downloading pages that already exist
  • ✅ Reads existing files to discover new links
  • ✅ Only downloads NEW pages
  • 🔄 Always re-downloads index page (to catch new content)

Example Output:

🔄 Deleted existing index page to refresh content
✓ Saved: https://example.com/ -> data/raw_html/example.com/index.html
⏭️  Skipped download (file exists): https://example.com/page-1
⏭️  Skipped download (file exists): https://example.com/page-2
✓ Saved: https://example.com/new-page -> data/raw_html/example.com/new-page.html

Benefits:

  • 🧪 Debug scraping issues
  • 🔁 Re-parse without re-scraping
  • 💾 Resume interrupted scrapes
  • 🚀 Incremental updates (run daily to catch new pages)

⌨️ Running & Control

Start Scraping

python3 -m sites.example_site.run

Stop Gracefully

Press Ctrl+C to stop cleanly:

⚠️  Interrupted by user (Ctrl+C)
🛑 Stopping scraper gracefully...
🧹 Cleaning up...
✅ Browser closed. Goodbye!

The scraper will:

  • Stop downloading immediately
  • Close the browser properly
  • Leave no zombie processes or hung browsers
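In code, graceful shutdown is essentially a try/except/finally around the crawl loop. This is a sketch; the real handling lives in the site's run script:

```python
def run(urls, fetcher, storage):
    """Fetch each URL; stop cleanly on Ctrl+C, always release the browser."""
    try:
        for url in urls:
            if not storage.exists(url):
                storage.save(url, fetcher.fetch(url))
    except KeyboardInterrupt:
        print("Interrupted by user (Ctrl+C); stopping gracefully...")
    finally:
        fetcher.close()              # always close the browser/session
```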

βš™οΈ Configuration Reference

Edit config/settings.py:

# HTTP Requests
REQUEST_TIMEOUT = 30  # Request timeout in seconds

# Selenium
SELENIUM_HEADLESS = True  # Run browser in headless mode
SELENIUM_PAGE_LOAD_TIMEOUT = 30  # Max page load time
SELENIUM_WAIT_TIME = 10  # Implicit wait for elements

# Storage
RAW_HTML_DIR = "data/raw_html"  # HTML storage directory

⚠️ Legal & Ethical Scraping

This framework is designed to be used responsibly.

You must:

  • ✅ Respect robots.txt
  • ✅ Follow site terms of service
  • ✅ Add delays & rate limits
  • ✅ Avoid scraping sensitive or private data
  • ✅ Use responsibly and legally
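For robots.txt specifically, the standard library already provides the check (the framework ships its own helper in core/utils/robots.py). In this sketch the rules are passed in as text so the example stays offline; RobotFileParser can also fetch robots.txt itself via set_url() and read():

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check whether a URL may be fetched under the given robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```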

πŸ›£οΈ Roadmap / Future Extensions

Planned enhancements:

  • ⏱️ Rate limiting & retries
  • 🤖 Ban detection
  • ⚙️ Async fetchers (httpx)
  • 🧪 Playwright support
  • 🐳 Dockerization
  • 🌐 FastAPI control layer
  • 📊 Database ingestion pipeline
  • 🔍 Sitemap parsing

πŸ‘¨β€πŸ’» Who is this for?

  • Backend engineers
  • Data engineers
  • Interview prep projects
  • Personal scraping servers
  • Learning real-world scraping architecture

🧠 Philosophy

Scraping is a backend system, not a script.

This repo treats scraping as a long-running, maintainable service, not a one-off Python file.


📜 License

Use responsibly. You are responsible for how you use this software.