-- Project Status: Completed
Introduction
As a data scientist, finding data for my next project was always a hassle, but not anymore :)
There is a ton of information on websites today, and being able to scrape whatever I want from any website has been a game changer for me.
Objective
Scrape data (in tables) from a Wikipedia page
Methods Used
- Data collection
- Data mining
- Data engineering
- Data warehousing
- Web scraping
- Data pipelining
Technologies
- Python
- Pandas
- Numpy
- Beautiful Soup
Project Description
In this mini project, I flex my web scraping muscles and get Hollywood movie data from 2018 to 2022 to analyse trends.
I use Python's Beautiful Soup package to scrape data from Wikipedia. Using the scraped data, I collected additional data from 'The Movie Database (TMDB)' website through their API and added extra features to the scraped data that will help me build a movie recommendation engine. At the end, I build a data pipeline that automatically fetches and preprocesses movies for any given year into a suitable format and saves them to disk.
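A minimal sketch of the table scraping step is shown here. It is not the actual project code: the function name `parse_wikitable` and the sample HTML are made up for illustration, and I am assuming the movie lists live in tables with the `wikitable` class that Wikipedia commonly uses. A real run would download the page first and pass its HTML in.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only (placeholder values); a real page would be downloaded first.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Title</th><th>Studio</th></tr>
  <tr><td>Example Film A</td><td>Example Studio</td></tr>
  <tr><td>Example Film B</td><td>Another Studio</td></tr>
</table>
"""


def parse_wikitable(html):
    """Return the first 'wikitable' in the HTML as a list of row dicts.

    Hypothetical helper: assumes the first row holds the column headers.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    if table is None:
        return []

    rows = table.find_all("tr")
    if not rows:
        return []

    # Column names come from the <th> cells of the first row.
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        # Skip rows whose cell count does not match the header
        # (for example rows affected by rowspan or colspan).
        if len(cells) == len(headers):
            records.append(dict(zip(headers, cells)))
    return records


movies = parse_wikitable(SAMPLE_HTML)
```

The resulting list of dicts can be handed straight to `pandas.DataFrame(movies)`. Real Wikipedia tables often use merged cells, which this sketch simply skips rather than expands.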
Make sure you have a stable internet connection if you want to run the scripts in this project. It's fun :)
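The TMDB enrichment step could look roughly like the sketch below. It assumes the TMDB v3 movie search endpoint and an API key of your own; the helper names `build_search_url` and `pick_features`, and the particular fields kept, are my illustration and not necessarily what the project scripts do. The network call itself is left out so the snippet runs offline.

```python
from urllib.parse import urlencode

# TMDB v3 movie search endpoint.
TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def build_search_url(title, year, api_key):
    """Build the search URL for one scraped movie (hypothetical helper)."""
    params = {"api_key": api_key, "query": title, "year": year}
    return f"{TMDB_SEARCH_URL}?{urlencode(params)}"


def pick_features(search_response):
    """Keep a few fields from the first search hit, or {} if nothing matched.

    `search_response` is the decoded JSON body of a search request.
    The chosen fields are an assumption about what is useful for a recommender.
    """
    results = search_response.get("results") or []
    if not results:
        return {}
    best = results[0]
    wanted = ("id", "overview", "popularity", "vote_average", "genre_ids")
    return {key: best.get(key) for key in wanted}


# "YOUR_API_KEY" is a placeholder, not a real key.
url = build_search_url("Example Film", 2018, "YOUR_API_KEY")
```

Fetching `url` with any HTTP client and passing the decoded JSON to `pick_features` gives a small dict that can be merged into the scraped row for that movie.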
Getting Started
- Clone this repository
- Create a Python virtual environment
- Install the dependencies listed in requirements.txt
- Open src/01_utils and edit the path where you want the scraped data to be saved
- Run the script in the directory src/data_pipeline
