Node.js-based scraper using headless Chrome
Installation
$ npm install @jonstuebe/scraper
Features
- Scrape top ecommerce sites (Amazon, Walmart, Target)
- Return basic product information (title, price, image, description)
- Easy-to-use API for scraping any website
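
For reference, scraping a supported product page resolves to a plain object whose fields follow the feature list above. The exact values below are illustrative, not real output:

```json
{
  "title": "Example Product Name",
  "price": "$19.99",
  "image": "https://example.com/product.jpg",
  "description": "Short product description."
}
```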
API
Simply require the package, call it with a URL, and await the returned promise (or chain `.then`) to receive the data.
es5
```js
const Scraper = require("@jonstuebe/scraper");

// run inside of an async function
(async () => {
  const data = await Scraper.scrapeAndDetect(
    "http://www.amazon.com/gp/product/B00X4WHP5E/"
  );
  console.log(data);
})();
```
es6
```js
import Scraper from "@jonstuebe/scraper";

// run inside of an async function
(async () => {
  const data = await Scraper("http://www.amazon.com/gp/product/B00X4WHP5E/");
  console.log(data);
})();
```
with promises
```js
import Scraper from "@jonstuebe/scraper";

Scraper("http://www.amazon.com/gp/product/B00X4WHP5E/").then(data => {
  console.log(data);
});
```
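
A scrape can fail (for example, if the page doesn't load). Assuming errors surface as promise rejections, a `.catch` handler is a reasonable safeguard:

```js
import Scraper from "@jonstuebe/scraper";

Scraper("http://www.amazon.com/gp/product/B00X4WHP5E/")
  .then(data => {
    console.log(data);
  })
  .catch(err => {
    // assumption: a failed scrape rejects the promise with an Error
    console.error("scrape failed:", err.message);
  });
```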
shared scraper instance
If you are going to run the scraper a number of times in succession, it's recommended to share a single Chromium instance across the sequential or parallel scrapes by passing a puppeteer browser as the second argument.
```js
import puppeteer from "puppeteer";
import Scraper from "@jonstuebe/scraper";

// run inside of an async function
(async () => {
  const browser = await puppeteer.launch();

  const products = [
    "https://www.target.com/p/corinna-angle-leg-side-table-wood-threshold-8482/-/A-53496420",
    "https://www.target.com/p/glasgow-metal-end-table-black-project-62-8482/-/A-52343433"
  ];

  const productsData = [];
  for (const product of products) {
    const productData = await Scraper(product, browser);
    productsData.push(productData);
  }

  // make sure to close the browser, otherwise the instances will
  // continue to run in the background on your machine
  await browser.close();

  console.table(productsData);
})();
```
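
The loop above scrapes sequentially. Since each scrape only needs the shared browser, the same list can plausibly be scraped in parallel with `Promise.all`; this is a sketch, assuming the `Scraper(url, browser)` signature shown above:

```js
import puppeteer from "puppeteer";
import Scraper from "@jonstuebe/scraper";

(async () => {
  const browser = await puppeteer.launch();

  const products = [
    "https://www.target.com/p/corinna-angle-leg-side-table-wood-threshold-8482/-/A-53496420",
    "https://www.target.com/p/glasgow-metal-end-table-black-project-62-8482/-/A-52343433"
  ];

  // start every scrape at once and wait for all of them to resolve
  const productsData = await Promise.all(
    products.map(product => Scraper(product, browser))
  );

  await browser.close();

  console.table(productsData);
})();
```

Note that parallel scrapes open multiple pages at once, so for long URL lists a sequential loop (or a concurrency limit) may be gentler on the target site and on memory.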
emulate devices
If you want to emulate a device, pass in a puppeteer device as the third argument:
```js
import puppeteer from "puppeteer";
import Scraper from "@jonstuebe/scraper";

// run inside of an async function
(async () => {
  const data = await Scraper(
    "http://www.amazon.com/gp/product/B00X4WHP5E/",
    null,
    puppeteer.devices["iPhone SE"]
  );
  console.log(data);
})();
```
custom scrapers
You can also scrape sites that aren't built in by passing `Scraper.scrape` a site object with a `name`, the `hosts` it matches, and an async `scrape` function that receives the puppeteer page:
```js
const Scraper = require("@jonstuebe/scraper");

(async () => {
  const site = {
    name: "npm",
    hosts: ["www.npmjs.com"],
    scrape: async page => {
      const name = await Scraper.getText("div.content-column > h1 > a", page);
      const version = await Scraper.getText(
        "div.sidebar > ul:nth-child(2) > li:nth-child(2) > strong",
        page
      );
      const author = await Scraper.getText(
        "div.sidebar > ul:nth-child(2) > li.last-publisher > a > span",
        page
      );

      return { name, version, author };
    }
  };

  const data = await Scraper.scrape(
    "https://www.npmjs.com/package/lodash",
    site
  );
  console.log(data);
})();
```
Contributing
If you want to add support for a site, or have an idea or feature request, fork this repo and send me a pull request. I'll be happy to take a look when I can and get back to you.
Issues
For any and all issues/bugs, please post a description and code sample to reproduce the problem on the issues page.