AlphonseBrandon/web-scraper

-- Project Status: Completed

Introduction

As a data scientist, finding data for my next project was always a hassle. Not anymore :)

There is a ton of information on websites today, and being able to scrape whatever I want from any website has been a game changer for me.

Objective

Scrape tabular data from Wikipedia pages
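As a rough illustration of this objective, here is a minimal sketch of pulling a table out of HTML with Beautiful Soup. It is not the code in this repository: the function name `scrape_first_table`, the sample markup, and the column names are all made up for the example, and it assumes the table carries the `wikitable` class and has no merged (rowspan/colspan) cells, which real Wikipedia pages often do.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded Wikipedia page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Title</th><th>Studio</th><th>Year</th></tr>
  <tr><td>Example Film A</td><td>Studio One</td><td>2018</td></tr>
  <tr><td>Example Film B</td><td>Studio Two</td><td>2019</td></tr>
</table>
"""


def scrape_first_table(html):
    """Return the rows of the first 'wikitable' as a list of dicts keyed by header."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        return []

    rows = table.find_all("tr")
    if not rows:
        return []

    # Assumption: the first row holds the column headers.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]

    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        if len(cells) != len(headers):
            continue  # skip rows that do not line up with the headers
        records.append(dict(zip(headers, cells)))
    return records


movies = scrape_first_table(SAMPLE_HTML)
print(movies)
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(movies)` for further cleaning.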

Methods Used

Data collection
Data mining
Data engineering
Data warehousing
Web scraping
Data pipelining

Technologies

Python
Pandas
Numpy
Beautiful Soup

Project Description

In this mini project, I flex my web scraping muscles and collect Hollywood movie data from 2018 to 2022 to analyse trends.

I use Python's Beautiful Soup package to scrape data from Wikipedia. Using the scraped data, I collect additional data from 'The Movie Database (TMDB)' through their API and add extra features to the scraped data that will help me build a movie recommendation engine. At the end, I build a data pipeline that automatically fetches movies for any given year, preprocesses them into a suitable format, and saves the result to disk.

Make sure you have a stable internet connection if you want to run the scripts in this project. It's fun :)
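The enrichment step described above could look roughly like the sketch below. It is an assumption about the approach, not the repository's actual code: it queries TMDB's v3 `search/movie` endpoint, takes the first match, and copies a few illustrative fields onto the scraped record. The function names, the `Title`/`Year` keys, the chosen fields, and the `tmdb_` prefix are hypothetical, and `YOUR_API_KEY` is a placeholder. The fetcher is passed in as a parameter so the demo runs offline with a stubbed response.

```python
import json
import urllib.parse
import urllib.request

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"

# Illustrative choice of fields to copy from the TMDB search result.
EXTRA_FIELDS = ("id", "overview", "popularity", "vote_average", "genre_ids")


def build_search_url(title, year, api_key):
    """Build the TMDB movie search URL for a title and release year."""
    params = {"api_key": api_key, "query": title, "year": year}
    return TMDB_SEARCH_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Download a URL and decode the JSON body (requires internet access)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def enrich_movie(movie, api_key, fetch=fetch_json):
    """Return a copy of a scraped movie record with TMDB fields added.

    Assumption: the scraped record has 'Title' and 'Year' keys.
    """
    url = build_search_url(movie["Title"], movie["Year"], api_key)
    results = fetch(url).get("results", [])
    if not results:
        return dict(movie)  # no match found: keep the record unchanged
    best = results[0]  # naive choice: trust the first search hit
    extra = {f"tmdb_{name}": best.get(name) for name in EXTRA_FIELDS}
    return {**movie, **extra}


def stub_fetch(url):
    """Offline stand-in for fetch_json; the values are made up."""
    return {"results": [{"id": 1, "overview": "A made-up plot.",
                         "popularity": 12.5, "vote_average": 7.1,
                         "genre_ids": [18]}]}


scraped = {"Title": "Example Film A", "Year": "2018"}
enriched = enrich_movie(scraped, "YOUR_API_KEY", fetch=stub_fetch)
print(enriched)
```

Dropping the `fetch=stub_fetch` argument makes the call go to the live API, which needs a valid TMDB key and an internet connection.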

Getting Started

Clone this repository
Create a Python virtual environment
Install the dependencies listed in requirements.txt
Open src/01_utils and edit the path where you want the scraped data to be saved
Run the script in the src/data_pipeline directory