# 🕷️ ProjectS – Production-Ready Web Scraping Framework

A production-ready scraping server designed for long-running, scalable, and maintainable web scraping.

Built with industry best practices:
- Clear separation of concerns
- Pluggable fetch mechanisms (HTTP & Browser)
- Proxy & User-Agent rotation
- Spider-based page discovery
- Domain-organized HTML storage
- File-based caching system
- Graceful exit handling
## ✨ Key Features

- ✅ Two scraping modes: `requests` (fast, lightweight) and `selenium` (browser-based, JS-heavy sites)
- 🔄 Proxy rotation (pluggable)
- 🧬 User-Agent rotation
- 🕸️ Spider-based page discovery with domain & path filtering
- 💾 Smart file-based caching (skip re-downloading existing pages)
- 📦 Domain-organized HTML storage (e.g., `data/raw_html/example.com/`)
- 🔄 Auto-refresh index pages (always get the latest updates)
- ⌨️ Graceful exit (Ctrl+C to stop cleanly)
- 🧱 Clean, modular architecture
## 🧠 Architecture Overview

```
Spider → Fetcher → HTML Storage → (Later Parsing)
            ↓
   Requests / Selenium
```

Important design principle:

> Fetching HTML is separated from parsing & extraction.
This allows you to:
- Re-parse data later
- Debug failures easily
- Change extraction logic without re-scraping
- Resume scraping across runs (file-based caching)
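
For example, because the raw HTML is kept on disk, a later extraction pass can run entirely offline. A minimal stdlib-only sketch of such a pass (the `LinkParser` and `reparse_stored_html` names here are illustrative, not part of the framework):

```python
from pathlib import Path
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def reparse_stored_html(domain_dir):
    """Extract links from already-downloaded pages -- no network calls."""
    results = {}
    for path in Path(domain_dir).glob("*.html"):
        parser = LinkParser()
        parser.feed(path.read_text(encoding="utf-8"))
        results[path.name] = parser.links
    return results
```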
## 📁 Project Structure

```
projectS/
├── config/                  # Global configuration
│   ├── settings.py          # All configuration values
│   └── logging.py           # Logging setup
│
├── core/                    # Core framework logic
│   ├── fetchers/            # Requests / Selenium fetchers
│   │   ├── base.py
│   │   ├── requests_fetcher.py
│   │   └── selenium_fetcher.py
│   │
│   ├── spider/              # Page discovery logic
│   │   ├── base_spider.py
│   │   └── simple_spider.py
│   │
│   ├── storage/             # HTML persistence
│   │   └── html_storage.py
│   │
│   ├── proxy/               # Proxy rotation
│   │   └── proxy_manager.py
│   │
│   ├── user_agent/          # User-Agent rotation
│   │   └── user_agent_manager.py
│   │
│   └── utils/               # Helpers (robots, URLs, etc.)
│       ├── robots.py
│       └── url.py
│
├── sites/                   # Site-specific spider implementations
│   └── example_site/
│       ├── run.py           # Run script for this site
│       └── spider.py        # Custom spider logic (optional)
│
├── data/                    # Stored HTML & outputs
│   └── raw_html/
│       ├── example.com/
│       ├── another-site.com/
│       └── ...
│
├── requirements.txt
└── README.md
```
## 🚀 Getting Started

### 1️⃣ Clone the repository

```shell
git clone https://github.com/ArchitW/projectS.git
cd projectS
```

### 2️⃣ Create a virtual environment

```shell
python3 -m venv env
source env/bin/activate   # On Windows: env\Scripts\activate
```

### 3️⃣ Install dependencies

```shell
pip install -r requirements.txt
```

### 4️⃣ Run the example scraper

```shell
python3 -m sites.example_site.run
```
## 🎯 Quick Configuration

Edit `sites/example_site/run.py`:

```python
START_URL = "https://example.com/"

# Filter URLs by path (empty = crawl entire site)
ALLOWED_PATHS = ['products/', 'blog/']   # Only crawl these paths
# ALLOWED_PATHS = []                     # Crawl entire site

# Choose fetcher
fetcher = SeleniumFetcher()    # For JS-heavy sites
# fetcher = RequestsFetcher()  # For static sites
```
## 🧰 Scraping Modes

### 🔹 Mode 1: Requests (HTTP scraping)

Best for:

- Static pages
- APIs
- Fast crawling

Features:

- Lightweight
- Easy proxy support
- Low resource usage

Usage:

```python
from core.fetchers.requests_fetcher import RequestsFetcher
from core.proxy.proxy_manager import ProxyManager

fetcher = RequestsFetcher(proxy_manager=ProxyManager(proxies))
```
### 🔹 Mode 2: Selenium (Browser scraping)

Best for:

- JavaScript-heavy sites
- Dynamic content
- Bot-protected pages

Features:

- Real browser execution
- JS rendering
- Cookie & session handling
- Configurable wait times

Usage:

```python
from core.fetchers.selenium_fetcher import SeleniumFetcher

fetcher = SeleniumFetcher()
html = fetcher.fetch(url)                     # Waits for page to load
# OR
html = fetcher.fetch(url, additional_wait=3)  # Wait an extra 3 seconds for AJAX
```

Configuration (`config/settings.py`):

```python
SELENIUM_HEADLESS = True          # Run in headless mode
SELENIUM_PAGE_LOAD_TIMEOUT = 30   # Max page load time (seconds)
SELENIUM_WAIT_TIME = 10           # Implicit wait for elements (seconds)
```
## 🌐 Proxy Rotation

Proxies are managed independently via a `ProxyManager`. You can plug in:

- Datacenter proxies
- Residential proxies
- Rotating proxy services

```python
ProxyManager([
    "http://user:pass@ip:port",
    "http://ip:port",
])
```
Download free proxies from:
- https://github.com/vakhov/fresh-proxy-list
- https://proxyscrape.com/free-proxy-list
- https://free-proxy-list.net/
- https://geonode.com/free-proxy-list
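
The rotation itself can be as simple as cycling through the list. A sketch of the idea (the real `ProxyManager` interface may differ):

```python
import itertools
import threading

class RoundRobinProxyManager:
    """Cycles through a proxy list, thread-safe. Illustrative only --
    the framework's ProxyManager may expose a different interface."""
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def get_proxy(self):
        with self._lock:
            url = next(self._cycle)
        # requests expects a dict mapping scheme -> proxy URL
        return {"http": url, "https": url}
```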
## 🕸️ Spider & Page Discovery

The spider discovers links based on:

- Domain filtering: only follows links from the same domain
- Path filtering: optional path-prefix matching
- Duplicate prevention: tracks visited URLs in memory

Configuration:

```python
spider = SimpleSpider(
    base_url="https://example.com/",
    allowed_paths=['products/', 'blog/'],   # Optional filtering
)
```

Examples:

```python
# Only crawl product pages
ALLOWED_PATHS = ['products/']

# Crawl multiple sections
ALLOWED_PATHS = ['shop/', 'blog/', 'about/']

# Crawl the entire site
ALLOWED_PATHS = []
```
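
The filtering rules above boil down to a small predicate. A hypothetical sketch of how the domain and path checks combine (not `SimpleSpider`'s actual code):

```python
from urllib.parse import urlparse

def should_crawl(url, base_url, allowed_paths):
    """Apply domain filtering plus optional path-prefix filtering."""
    base = urlparse(base_url)
    target = urlparse(url)
    if target.netloc != base.netloc:
        return False          # different domain: skip
    if not allowed_paths:
        return True           # empty list = crawl the entire site
    path = target.path.lstrip("/")
    return any(path.startswith(prefix) for prefix in allowed_paths)
```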
## 💾 Smart HTML Storage

### File Organization

HTML files are organized by domain, with meaningful filenames:

```
data/raw_html/
├── example.com/
│   ├── index.html
│   ├── products-laptop.html
│   ├── products-phone.html
│   └── about-team.html
└── another-site.com/
    ├── index.html
    └── ...
```

URL-to-filename mapping:

- `https://example.com/` → `example.com/index.html`
- `https://example.com/products/laptop` → `example.com/products-laptop.html`
- `https://example.com/about/team` → `example.com/about-team.html`
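
That mapping can be reproduced with a few lines of `urllib.parse`; the helper name here is hypothetical, and the framework's real implementation may handle more edge cases (query strings, unsafe characters):

```python
from urllib.parse import urlparse

def url_to_filename(url):
    """Map a URL to its on-disk filename following the scheme above."""
    path = urlparse(url).path.strip("/")
    if not path:
        return "index.html"               # site root
    return path.replace("/", "-") + ".html"
```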
### File-Based Caching

Smart resume feature:

- ✅ Skips downloading pages that already exist
- ✅ Reads existing files to discover new links
- ✅ Only downloads NEW pages
- 🔄 Always re-downloads the index page (to catch new content)

Example output:

```
🔄 Deleted existing index page to refresh content
✅ Saved: https://example.com/ -> data/raw_html/example.com/index.html
⏭️ Skipped download (file exists): https://example.com/page-1
⏭️ Skipped download (file exists): https://example.com/page-2
✅ Saved: https://example.com/new-page -> data/raw_html/example.com/new-page.html
```
Benefits:

- 🧪 Debug scraping issues
- 🔁 Re-parse without re-scraping
- 💾 Resume interrupted scrapes
- 📈 Incremental updates (run daily to catch new pages)
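
The cache check is just an exists-on-disk test before fetching. A sketch, with `fetch` standing in for any fetcher's fetch method (names are illustrative):

```python
from pathlib import Path

def fetch_if_missing(url, dest, fetch):
    """Download only when the file is absent; otherwise reuse the copy on disk."""
    dest = Path(dest)
    if dest.exists():
        return dest.read_text(encoding="utf-8")   # cache hit: no network call
    html = fetch(url)
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(html, encoding="utf-8")
    return html
```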
## ⌨️ Running & Control

### Start Scraping

```shell
python3 -m sites.example_site.run
```

### Stop Gracefully

Press Ctrl+C to stop cleanly:

```
⚠️ Interrupted by user (Ctrl+C)
🛑 Stopping scraper gracefully...
🧹 Cleaning up...
✅ Browser closed. Goodbye!
```

The scraper will:

- Stop downloading immediately
- Close the browser properly
- Leave no zombie processes or hung browsers
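
The pattern behind this is a `try`/`except`/`finally` around the crawl loop so that cleanup always runs. An illustrative sketch (the framework's actual handler may hook `signal` instead):

```python
def crawl(urls, fetch, cleanup):
    """Run the crawl loop; Ctrl+C stops it cleanly instead of
    leaving a zombie browser behind."""
    try:
        for url in urls:
            fetch(url)
    except KeyboardInterrupt:
        print("⚠️ Interrupted by user (Ctrl+C)")
    finally:
        cleanup()   # always close the browser / session
```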
## ⚙️ Configuration Reference

Edit `config/settings.py`:

```python
# HTTP requests
REQUEST_TIMEOUT = 30              # Request timeout in seconds

# Selenium
SELENIUM_HEADLESS = True          # Run browser in headless mode
SELENIUM_PAGE_LOAD_TIMEOUT = 30   # Max page load time (seconds)
SELENIUM_WAIT_TIME = 10           # Implicit wait for elements (seconds)

# Storage
RAW_HTML_DIR = "data/raw_html"    # HTML storage directory
```
## ⚠️ Legal & Ethical Scraping

This framework is designed to be used responsibly. You must:

- ✅ Respect `robots.txt`
- ✅ Follow site terms of service
- ✅ Add delays & rate limits
- ✅ Avoid scraping sensitive or private data
- ✅ Use it responsibly and legally
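
For the `robots.txt` check, the standard library already ships a parser. A sketch of what a helper like `core/utils/robots.py` might build on (that module's actual contents aren't shown here):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, user_agent, url):
    """Check a URL against already-fetched robots.txt rules."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)
```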
## 🛣️ Roadmap / Future Extensions

Planned enhancements:

- ⏱️ Rate limiting & retries
- 🤖 Ban detection
- ⚡ Async fetchers (`httpx`)
- 🧪 Playwright support
- 🐳 Dockerization
- 🌐 FastAPI control layer
- 🗄️ Database ingestion pipeline
- 🗺️ Sitemap parsing
## 👨‍💻 Who is this for?
- Backend engineers
- Data engineers
- Interview prep projects
- Personal scraping servers
- Learning real-world scraping architecture
## 🧠 Philosophy

Scraping is a backend system, not a script.

This repo treats scraping as a long-running, maintainable service, not a one-off Python file.
## 📄 License

Use responsibly. You are responsible for how you use this software.