AlphonseBrandon/web-scraper

-- Project Status: Completed

Introduction

As a data scientist, finding data for my next project was always a hassle. Not anymore :)

There is a huge amount of information on websites today, and being able to scrape whatever I need from any of them has been a game changer for me.

Objective

Scrape tabular data from a Wikipedia page

Methods Used

  • Data collection
  • Data mining
  • Data engineering
  • Data warehousing
  • Web scraping
  • Data pipelining

Technologies

  • Python
  • Pandas
  • NumPy
  • Beautiful Soup

Project Description

In this mini project, I flex my web scraping muscles and collect data on Hollywood movies from 2018 to 2022 to analyse trends.

I use Python's Beautiful Soup package to scrape data from Wikipedia. Using the scraped data, I collected additional data from The Movie Database (TMDB) through their API and added extra features to the scraped data that will help me build a movie recommendation engine. Finally, I built a data pipeline that automatically fetches movies for any given year, preprocesses them into a suitable format, and saves them to disk.
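To give a feel for the table scraping step, here is a minimal sketch, not the code from this repository. It assumes the movie tables use the `wikitable` CSS class and have a single header row; `SAMPLE_HTML` and `parse_wikitables` are hypothetical names, and the sample markup stands in for a downloaded page so the snippet runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded Wikipedia page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Opening</th><th>Title</th><th>Studio</th></tr>
  <tr><td>January 5</td><td><i>Example Film A</i></td><td>Studio One</td></tr>
  <tr><td>January 12</td><td><i>Example Film B</i></td><td>Studio Two</td></tr>
</table>
"""


def parse_wikitables(html):
    """Return one dict per data row found in tables with class 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for table in soup.find_all("table", class_="wikitable"):
        rows = table.find_all("tr")
        if not rows:
            continue
        # Assumption: the first row of each table holds the column names.
        headers = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
        for row in rows[1:]:
            cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            # Rows shortened by rowspan/colspan do not line up with the
            # headers; this sketch simply skips them.
            if len(cells) == len(headers):
                records.append(dict(zip(headers, cells)))
    return records


records = parse_wikitables(SAMPLE_HTML)
print(records[0]["Title"])
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(records)` for further cleaning. Real Wikipedia tables often merge cells with `rowspan`, so a production scraper would need to expand those cells instead of skipping the rows.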

Make sure you have a stable internet connection if you want to run the scripts in this project. It's fun :)

Getting Started

  1. Clone this repository
  2. Create a Python virtual environment
  3. Install the dependencies listed in requirements.txt
  4. Open src/01_utils and edit the path where you want the scraped data to be saved
  5. Run the script in the src/data_pipeline directory
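
As a rough illustration of what the save path in step 4 might control, the sketch below writes one year's records to a CSV file. It is an assumption, not the repository's code: `OUTPUT_DIR`, `save_records`, and the `movies_<year>.csv` naming are hypothetical, and the real setting in src/01_utils may look different.

```python
import csv
from pathlib import Path

# Hypothetical setting: the actual variable in src/01_utils may be named differently.
OUTPUT_DIR = Path("data") / "processed"


def save_records(records, year, output_dir=OUTPUT_DIR):
    """Write one year's records to <output_dir>/movies_<year>.csv and return the path."""
    output_dir = Path(output_dir)
    # Create the target folder if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / f"movies_{year}.csv"
    # Column names are taken from the first record; all records are
    # assumed to share the same keys.
    fieldnames = list(records[0].keys()) if records else []
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return out_path
```

Calling `save_records(records, 2018)` would then create `data/processed/movies_2018.csv` relative to the current working directory.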