GitHub - llnl/scraper: Python library for getting metadata from source code hosting tools

Scraper is a tool for scraping and visualizing open source data from various code hosting platforms, such as GitHub.com, GitHub Enterprise, GitLab.com, hosted GitLab, and Bitbucket Server.

Getting Started: Code.gov

Code.gov is a newly launched website of the US Federal Government that allows the People to access metadata about the government's custom-developed software. The site requires metadata to function, and this Python library can help generate it!

To get started, you will need a GitHub personal access token to make requests to the GitHub API. Set it in your environment or shell rc file under the name GITHUB_API_TOKEN:

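    # Set the token for the current shell session
    $ export GITHUB_API_TOKEN=XYZ

    # To keep it across sessions, also append it to your shell rc file
    $ echo "export GITHUB_API_TOKEN=XYZ" >> ~/.bashrc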
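If you want to confirm that the token is visible to Python before running the scraper, a minimal sketch along these lines may help. The helper name `read_github_token` is hypothetical and is not part of Scraper; only the variable name GITHUB_API_TOKEN comes from the instructions above.

```python
import os


def read_github_token(env=None, name="GITHUB_API_TOKEN"):
    """Return the token stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; not part of the Scraper library.
    """
    env = os.environ if env is None else env
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; export it before running scraper")
    return token


if __name__ == "__main__":
    token = read_github_token()
    # Avoid printing the secret itself; report only that it was found.
    print(f"Found GITHUB_API_TOKEN ({len(token)} characters)")
```

Running the file directly either reports that a token was found or raises an error naming the missing variable.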

Additionally, to perform the labor-hours estimation, you will need to install cloc into your environment. This is typically done with a package manager such as npm or Homebrew.
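One way to check that cloc is actually reachable on your PATH is sketched below using only the Python standard library. The function name `missing_tools` is an assumption made for this example; Scraper itself may detect cloc differently.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH.

    Hypothetical helper for illustration; not part of the Scraper library.
    """
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["cloc"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("cloc is installed at", shutil.which("cloc"))
```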

Then, to generate a code.json file for your agency, you will need a config.json file that lists the platforms you will connect to and scrape data from. An example config file can be found in demo.json. Once you have your config file, you are ready to install and run the scraper!

    # Install Scraper from a local copy of this repository
    $ pip install -e .
    # OR
    # Install Scraper from PyPI
    $ pip install llnl-scraper

    # Run Scraper with your config file, config.json
    $ scraper --config config.json

A full example of the resulting code.json file can be found here.

Config File Options

The configuration file is a JSON file that specifies which repository platforms to pull projects from, along with settings that can override incomplete or inaccurate data returned by the scrape.
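Because a malformed config file is an easy mistake to make, it can be worth confirming that config.json parses before handing it to the scraper. The sketch below only checks that the file is valid JSON with an object at the top level; it makes no assumptions about which keys Scraper expects, and `load_config` is a hypothetical helper rather than part of the library.

```python
import json


def load_config(path):
    """Parse `path` as JSON and return the top-level object as a dict.

    Hypothetical helper for illustration; it does not validate Scraper's keys.
    """
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)  # raises json.JSONDecodeError on bad syntax
    if not isinstance(config, dict):
        raise ValueError(f"{path} must contain a JSON object at the top level")
    return config


if __name__ == "__main__":
    config = load_config("config.json")
    print("Top-level keys:", ", ".join(sorted(config)))
```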

The basic structure is:

License

Scraper is released under an MIT license. For more details see the LICENSE file.

LLNL-CODE-705597