urls2disk - Rust

Crate urls2disk

crates.io | docs.rs | github.com

urls2disk is a Rust crate that helps you download a series of webpages in parallel and save them to disk. Depending on your choice, it either writes the raw bytes of each webpage to disk or first converts the page to PDF and then writes that. It's helpful for general webscraping as well as for converting a bunch of webpages to PDF.

A key feature of urls2disk is that you can set a maximum number of requests per second while downloading webpages. This lets you throttle yourself so that you don't run afoul of servers that block clients sending too many requests at once.
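To make the idea of a requests-per-second cap concrete, here is a minimal sketch of one way to throttle calls using only the standard library. It is not how urls2disk implements its limit; `Throttle` and its methods are hypothetical names invented for this illustration, and the sketch simply spaces consecutive calls at least 1/N seconds apart.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of urls2disk): spaces consecutive calls
/// at least `1 / max_per_second` seconds apart.
pub struct Throttle {
    min_interval: Duration,
    next_allowed: Instant,
}

impl Throttle {
    pub fn new(max_per_second: u32) -> Throttle {
        assert!(max_per_second > 0, "max_per_second must be positive");
        Throttle {
            min_interval: Duration::from_secs(1) / max_per_second,
            next_allowed: Instant::now(),
        }
    }

    /// The minimum spacing enforced between two calls to `wait`.
    pub fn min_interval(&self) -> Duration {
        self.min_interval
    }

    /// Blocks until the next request may be sent, then reserves the following slot.
    pub fn wait(&mut self) {
        let now = Instant::now();
        if now < self.next_allowed {
            thread::sleep(self.next_allowed - now);
        }
        self.next_allowed = Instant::now() + self.min_interval;
    }
}
```

Calling `wait()` immediately before each request keeps a single-threaded download loop under the chosen rate. A multi-threaded downloader would need to share the throttle between threads, for example behind a mutex.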

Under the hood, urls2disk uses wkhtmltopdf to convert webpages to PDF, so you'll need wkhtmltopdf installed on your machine if you choose that option. On macOS with Homebrew, run brew install Caskroom/cask/wkhtmltopdf in your terminal. For other systems, or if you don't have Homebrew, you're on your own for installing wkhtmltopdf; at some point I may look up instructions for other setups and include them here. I've only tested with wkhtmltopdf 0.12.4.
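Since only wkhtmltopdf 0.12.4 has been tested, it can be useful to check which version is on your PATH before converting anything. The sketch below is not part of urls2disk; both function names are hypothetical, and it assumes that `wkhtmltopdf --version` prints the program name followed by the version number (for example "wkhtmltopdf 0.12.4 (with patched qt)").

```rust
use std::process::Command;

/// Hypothetical helper: extracts the version from output such as
/// "wkhtmltopdf 0.12.4 (with patched qt)". The output shape is an assumption.
pub fn parse_wkhtmltopdf_version(output: &str) -> Option<String> {
    let mut words = output.split_whitespace();
    if words.next()? != "wkhtmltopdf" {
        return None;
    }
    words.next().map(|version| version.to_string())
}

/// Hypothetical helper: runs `wkhtmltopdf --version` and returns the reported
/// version, or None if the binary is missing or its output is unexpected.
pub fn installed_wkhtmltopdf_version() -> Option<String> {
    let output = Command::new("wkhtmltopdf").arg("--version").output().ok()?;
    if !output.status.success() {
        return None;
    }
    parse_wkhtmltopdf_version(&String::from_utf8_lossy(&output.stdout))
}
```

A program could call `installed_wkhtmltopdf_version()` at startup and fall back to raw HTML downloads when it returns None.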

Here's an example of downloading Apple, Inc.'s annual reports from 2010-2017 from the SEC website using urls2disk:

extern crate reqwest;
extern crate urls2disk;

use std::fs;
use std::path::Path;

use urls2disk::{wkhtmltopdf, ClientBuilder, Result, SimpleDocument, Url};

// Downloads each filing twice: once as raw HTML and once converted to PDF.
fn run() -> Result<()> {
    // Create the output directory if it does not already exist.
    let output_directory = Path::new("./data");
    if !output_directory.exists() {
        fs::create_dir_all(output_directory)?;
    }

    // Build the full URL of each annual report from a shared base and a per-filing stem.
    let base = "https://www.sec.gov/Archives/edgar/data/";
    let urls = vec![
        "320193/000119312510238044/d10k.htm",
        "320193/000119312511282113/d220209d10k.htm",
        "320193/000119312512444068/d411355d10k.htm",
        "320193/000119312513416534/d590790d10k.htm",
        "320193/000119312514383437/d783162d10k.htm",
        "320193/000119312515356351/d17062d10k.htm",
        "320193/000162828016020309/a201610-k9242016.htm",
        "320193/000032019317000070/a10-k20179302017.htm",
    ].iter()
        .map(|stem| format!("{}{}", &base, stem))
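        .collect::<Vec<String>>();

    // One SimpleDocument per URL, saved as raw HTML (wkhtmltopdf = false).
    // The URLs are in chronological order, so index i maps to the year 2010 + i.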
    let html_documents = urls.iter()
        .enumerate()
        .map(|(i, url_string)| {
            let filename = format!("Apple 10-K {}.html", i + 2010);
            let path = output_directory.join(&filename);
            let url = url_string.parse::<Url>()?;
            let wkhtmltopdf = false;
            let document = SimpleDocument::new(path, url, wkhtmltopdf);
            Ok(Box::new(document))
        })
        .collect::<Result<Vec<Box<SimpleDocument>>>>()?;

    // The same URLs again, this time converted to PDF (wkhtmltopdf = true).
    let pdf_documents = urls.iter()
        .enumerate()
        .map(|(i, url_string)| {
            let filename = format!("Apple 10-K {}.pdf", i + 2010);
            let path = output_directory.join(&filename);
            let url = url_string.parse::<Url>()?;
            let wkhtmltopdf = true;
            let document = SimpleDocument::new(path, url, wkhtmltopdf);
            Ok(Box::new(document))
        })
        .collect::<Result<Vec<Box<SimpleDocument>>>>()?;

    // Combine both sets into a single list of documents to fetch.
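    let mut documents = [&html_documents[..], &pdf_documents[..]].concat();

    // Configure the client: request rate limit, thread limits, the reqwest
    // client to use, and the wkhtmltopdf settings.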
    let client = ClientBuilder::default()
        .set_max_requests_per_second(9)
        .set_max_threads_cpu(4)
        .set_max_threads_io(50)
        .set_reqwest_client(reqwest::Client::new())
        .set_wkhtmltopdf_setting(wkhtmltopdf::Setting::Zoom(3.5))
        .set_wkhtmltopdf_settings(vec![
            wkhtmltopdf::Setting::DisableExternalLinks(true),
            wkhtmltopdf::Setting::DisableJavascript(true),
        ])
        .build()?;

    // Download every document and write it to disk.
    client.get_documents(&mut documents)?;

    Ok(())
}

fn main() {
    run().unwrap();
}

Client

A Client downloads a slice of boxed objects implementing Document and writes them to disk. It does this in parallel to maximize efficiency, but never exceeds the maximum number of requests per second or the maximum number of threads set by the user. Additionally, if an object implementing Document returns true from its wkhtmltopdf() method, the Client uses wkhtmltopdf to convert what it downloads to PDF before writing it to disk.
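The thread limit described above can be pictured as a fixed set of workers pulling items from a shared queue. The following is a minimal standard-library sketch of that general pattern, not the crate's actual implementation; `process_in_parallel` is a hypothetical name, and the `job` closure stands in for "download one document and write it to disk".

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper: runs `job` over every item using at most `max_threads`
/// worker threads. Results come back in completion order, not input order.
pub fn process_in_parallel<T, R, F>(items: Vec<T>, max_threads: usize, job: F) -> Vec<R>
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(items)));
    let results: Arc<Mutex<Vec<R>>> = Arc::new(Mutex::new(Vec::new()));
    let job = Arc::new(job);

    // Spawn a fixed number of workers; each one loops until the queue is empty.
    let workers: Vec<_> = (0..max_threads.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            let job = Arc::clone(&job);
            thread::spawn(move || loop {
                // The queue lock is released at the end of this statement,
                // so other workers are not blocked while the job runs.
                let next = queue.lock().unwrap().pop_front();
                match next {
                    Some(item) => {
                        let result = (*job)(item);
                        results.lock().unwrap().push(result);
                    }
                    None => break,
                }
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }

    let mut collected = results.lock().unwrap();
    std::mem::take(&mut *collected)
}
```

Because the number of spawned threads is fixed up front, the amount of concurrent work never exceeds `max_threads`, regardless of how many items are queued.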

ClientBuilder

A ClientBuilder can be used to create a Client with custom configuration.

SimpleDocument

SimpleDocument is a model struct implementing the Document trait. Although you can certainly use this struct, you may want to consider writing your own simple struct implementing Document in order to provide more customized behavior.

Document

Document is a trait for representing objects that can be downloaded and written to disk using the Client struct. If an object implementing Document returns true from its wkhtmltopdf() method, it will be converted to PDF before it is written to disk.
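To illustrate what a custom type with its own wkhtmltopdf() logic might look like, here is a self-contained sketch. It does not use the real Document trait: `DocumentLike` is a hypothetical stand-in, and of its methods only `wkhtmltopdf()` is named in the text above, while `path()` and `url()` are assumptions. `Filing` is likewise an invented example type. Check the trait's own documentation for the actual required methods.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for urls2disk's `Document` trait. Only `wkhtmltopdf()`
/// is taken from the text; `path()` and `url()` are assumed for illustration.
pub trait DocumentLike {
    fn path(&self) -> &Path;
    fn url(&self) -> &str;
    fn wkhtmltopdf(&self) -> bool;
}

/// An invented document type that decides on PDF conversion from the
/// output file's extension instead of storing a separate flag.
pub struct Filing {
    path: PathBuf,
    url: String,
}

impl Filing {
    pub fn new<P: Into<PathBuf>>(path: P, url: &str) -> Filing {
        Filing {
            path: path.into(),
            url: url.to_string(),
        }
    }
}

impl DocumentLike for Filing {
    fn path(&self) -> &Path {
        &self.path
    }

    fn url(&self) -> &str {
        &self.url
    }

    /// Request PDF conversion only when the output path ends in ".pdf".
    fn wkhtmltopdf(&self) -> bool {
        self.path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("pdf"))
    }
}
```

Deriving the flag from the path means the filename and the conversion choice cannot drift apart, which is one kind of customized behavior a hand-written struct can offer.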

Error

Error is an alias for failure::Error.

Result

Result<T> is an alias for Result<T, Error>, where Error is the alias described above.

Url

Url is an alias for url::Url.