GitHub - EnockMagara/MapReduceMLOps

Movie Review Sentiment Analysis (MapReduce + ML)

An MLOps project for sentiment analysis of movie reviews, combining MapReduce-style processing on Apache Spark with machine learning.

Project Overview

This project implements a complete MapReduce + ML pipeline for predicting movie review sentiment:

MapReduce Processing: Single-pass feature extraction with word counts and TF-IDF
Machine Learning: Naive Bayes, Logistic Regression, and Random Forest classifiers (Spark MLlib), plus LSTM, Transformer, and BERT models (PyTorch)
MLOps: MLflow tracking, model versioning, and evaluation metrics
Scalability: Apache Spark for distributed processing

🏗️ Project Structure

MMS/
├── config/                 # Configuration files
│   └── config.yaml        # Main configuration
├── src/                   # Source code
│   ├── config/           # Configuration management
│   ├── data/             # Data loading utilities
│   ├── processing/       # MapReduce processing
│   ├── models/           # Model training and evaluation
│   └── utils/            # Utility functions
├── scripts/              # Executable scripts
│   ├── train.py         # Training script
│   ├── predict.py       # Prediction script
│   └── evaluate.py      # Evaluation script
├── tests/               # Unit tests
├── docker/              # Docker configuration
├── data/                # Data directory
├── models/              # Trained models
├── logs/                # Log files
├── artifacts/           # Model artifacts
└── mlruns/             # MLflow tracking

Quick Start

1. Installation

# Clone the repository into a directory named MMS
git clone <repository-url> MMS
cd MMS

# Install dependencies
make install

# Setup project directories
make setup

2. Data Preparation

Place your movie review datasets in the data/ directory:

train.csv - Training data
test.csv - Test data
unsupervised.csv - Additional unlabeled data

3. Training

Spark MLlib Models

# Train with default settings
make train

# Train with cross-validation
make train-cv

# Train specific algorithm
python scripts/train.py --algorithm logistic_regression

PyTorch Neural Networks

# Train LSTM model
make train-lstm

# Train Transformer model
make train-transformer

# Train BERT model
make train-bert

# Train any PyTorch model
python scripts/train_pytorch.py --model-type lstm

Hybrid Training (Compare Both Approaches)

# Train both Spark and PyTorch models and compare
make train-hybrid

# Custom hybrid training
python scripts/train_hybrid.py --spark-algorithm naive_bayes --pytorch-model bert --compare

4. Prediction

Spark Models

# Predict single text
make predict

# Batch prediction
python scripts/predict.py --model-path models/naive_bayes_model --input-file data/test.csv --output-file predictions.csv

PyTorch Models

# Predict with PyTorch model
make predict-pytorch

# Custom PyTorch prediction
python scripts/predict_pytorch.py --model-path models/pytorch_lstm_model.pth --model-type lstm --text "This movie is great!"

5. Evaluation

# Evaluate model
make evaluate

🔧 Configuration

Edit config/config.yaml to customize:

Data paths: Raw and processed data locations
Spark settings: Memory, cores, and other Spark configurations
Text processing: Stopwords, vocabulary size, feature extraction
Model parameters: Algorithm, hyperparameters, cross-validation
MLOps settings: MLflow tracking, logging, and artifacts

🐳 Docker Support

Build and Run

# Build Docker image
make docker-build

# Run with Docker Compose
make docker-run

Access Services

Spark UI: http://localhost:4040
MLflow UI: http://localhost:5000

Testing

# Run all tests
make test

# Run tests with coverage
make test-coverage

Processing Pipelines

Spark MapReduce Pipeline

Map Phase: Tokenize text → emit (word, 1) pairs
Reduce Phase: Aggregate word counts → build vocabulary
Feature Engineering: TF-IDF + bag-of-words extraction
Model Training: Spark MLlib algorithms
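
The map, reduce, and feature-engineering steps above can be illustrated with a small single-process sketch. It is not the project's Spark code: it uses plain Python in place of Spark, the toy reviews and the helper names (tokenize, add_pair, tf_idf) are assumptions, and the IDF uses the smoothed form log((N + 1) / (df + 1)), which matches the formula documented for Spark MLlib's IDF.

```python
import math
import re
from collections import Counter
from functools import reduce

# Toy corpus standing in for rows of train.csv (hypothetical content).
reviews = [
    "a great movie with a great cast",
    "a dull movie",
    "great fun",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Map phase: every review emits (word, 1) pairs.
mapped = [(word, 1) for review in reviews for word in tokenize(review)]

# Reduce phase: sum the counts per word; the keys form the vocabulary.
def add_pair(counts, pair):
    word, n = pair
    counts[word] += n
    return counts

word_counts = reduce(add_pair, mapped, Counter())
vocabulary = sorted(word_counts)

# Feature engineering: term frequency per review, weighted by a smoothed IDF.
doc_tokens = [tokenize(review) for review in reviews]
doc_freq = Counter(word for tokens in doc_tokens for word in set(tokens))
n_docs = len(reviews)

def tf_idf(tokens):
    """Return a sparse {word: weight} vector for one tokenized review."""
    tf = Counter(tokens)
    return {w: tf[w] * math.log((n_docs + 1) / (doc_freq[w] + 1)) for w in tf}

features = [tf_idf(tokens) for tokens in doc_tokens]

print(word_counts["great"])  # 3
print(features[2])           # weights for the review "great fun"
```

Note that the IDF weights need document frequencies for the whole corpus, so this sketch computes them before building the per-review vectors.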

PyTorch Neural Network Pipeline

Data Preprocessing: Text cleaning and tokenization
Vocabulary Building: Word-to-index mapping
Model Architecture: LSTM/Transformer/BERT networks
Training: Gradient descent with early stopping
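
As a rough illustration of the vocabulary-building and early-stopping steps, here is a framework-free sketch. The special tokens, the function names (build_vocab, encode), the EarlyStopping class, and the default sizes are all assumptions made for this example, not the project's actual code; in the real pipeline the encoded ids would be fed to a PyTorch model.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_vocab(texts, max_size=10000, min_freq=1):
    """Map each sufficiently frequent word to an integer index."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {PAD: 0, UNK: 1}
    for word, freq in counts.most_common(max_size):
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab, max_len=8):
    """Convert text to a fixed-length list of ids, truncating or padding."""
    ids = [vocab.get(w, vocab[UNK]) for w in text.lower().split()][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

class EarlyStopping:
    """Signal a stop once validation loss fails to improve `patience` times in a row."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

vocab = build_vocab(["this movie is great", "this movie is dull"])
encoded = encode("this film is great", vocab, max_len=6)
print(encoded)  # "film" is unseen, so it maps to the <unk> index

# Made-up validation losses, used only to show when training would halt.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([0.9, 0.7, 0.72, 0.71, 0.6]):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at)
```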

Hybrid Approach

Best of Both Worlds: Spark for data processing + PyTorch for deep learning
Performance Comparison: Side-by-side evaluation of different approaches
Flexibility: Choose the best model for your specific use case

🤖 Supported Models

Spark MLlib Models

Naive Bayes: Fast and effective for text classification
Logistic Regression: Linear model with good interpretability
Random Forest: Ensemble method for robust predictions
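
To give a feel for why Naive Bayes is fast on text, the following is a minimal multinomial Naive Bayes with additive (Laplace) smoothing in plain Python. It is a teaching sketch under stated assumptions, not the Spark MLlib implementation the project uses; the function names (train_nb, predict_nb) and the toy data are invented for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenized docs with additive smoothing."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n in class_docs.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n / len(labels)),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / total) for w in vocab
            },
        }
    return model

def predict_nb(model, tokens):
    """Return the label with the highest log posterior; unseen words are skipped."""
    def score(label):
        params = model[label]
        likelihood = params["log_likelihood"]
        return params["log_prior"] + sum(likelihood[w] for w in tokens if w in likelihood)
    return max(model, key=score)

# Toy training data (hypothetical).
docs = [["great", "fun"], ["great", "cast"], ["dull", "plot"], ["boring", "dull"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)

print(predict_nb(model, ["great", "plot"]))  # pos
print(predict_nb(model, ["dull", "movie"]))  # neg
```

Training is a single counting pass over the data and prediction is a sum of precomputed log probabilities, which is why this model family is cheap to train and score.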

PyTorch Neural Networks

LSTM: Long Short-Term Memory networks for sequence modeling
Transformer: Self-attention based architecture for text understanding
BERT: Pre-trained transformer model for state-of-the-art performance

MLOps Features

MLflow Integration: Experiment tracking and model versioning
Comprehensive Logging: Structured logging with different levels
Model Evaluation: Multiple metrics (accuracy, precision, recall, F1, AUC)
Cross-Validation: Hyperparameter tuning with k-fold CV
Artifact Management: Model and data versioning
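
For reference, the threshold-based metrics listed above can be computed from the four confusion-matrix counts. This is a small sketch with a hypothetical function name (binary_metrics) and made-up labels; AUC is left out because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```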

Monitoring

MLflow Tracking

Experiment runs and parameters
Model performance metrics
Artifact storage and versioning

Logging

Structured logging to files and console
Different log levels (DEBUG, INFO, WARNING, ERROR)
Performance and error tracking
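
One way to get this behaviour with the standard library is sketched below. The function name (setup_logging), the logger name, and the key=value line format are assumptions for illustration; the project's own logging module may be organised differently.

```python
import logging
import os
import sys

def setup_logging(log_file, level="INFO"):
    """Return a logger that writes key=value style lines to console and file."""
    log_dir = os.path.dirname(log_file)
    if log_dir:
        os.makedirs(log_dir, exist_ok=True)

    logger = logging.getLogger("sentiment")  # hypothetical logger name
    logger.setLevel(getattr(logging, level.upper()))
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep messages out of the root logger

    formatter = logging.Formatter(
        "time=%(asctime)s level=%(levelname)s logger=%(name)s msg=%(message)s"
    )
    for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_file)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A training script could then call setup_logging("logs/train.log", "DEBUG") once at startup and use logger.info(...) or logger.error(...) throughout.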

🛠️ Development

Code Quality

# Format code
make format

# Type checking
make type-check

Adding New Features

Create new modules in src/
Add corresponding tests in tests/
Update configuration in config/config.yaml
Add CLI options in scripts
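
As an example of the last step, a new CLI option could be wired in with argparse roughly as follows. The --algorithm, --cross-validate, and --log-level flags appear in this README; the random_forest choice value, the defaults, and the --num-folds option are assumptions added for illustration, and the real scripts/train.py may define its parser differently.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented training flags (sketch)."""
    parser = argparse.ArgumentParser(description="Train a sentiment model.")
    parser.add_argument(
        "--algorithm",
        choices=["naive_bayes", "logistic_regression", "random_forest"],
        default="naive_bayes",  # assumed default
    )
    parser.add_argument("--cross-validate", action="store_true")
    parser.add_argument(
        "--log-level",
        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
        default="INFO",
    )
    # Hypothetical new option, shown only to illustrate adding a flag.
    parser.add_argument("--num-folds", type=int, default=5)
    return parser

args = build_parser().parse_args(
    ["--algorithm", "logistic_regression", "--cross-validate"]
)
print(args.algorithm, args.cross_validate, args.num_folds)
```

argparse converts dashes in flag names to underscores, so --cross-validate is read as args.cross_validate.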

📝 Usage Examples

Training with Custom Parameters

python scripts/train.py \
    --algorithm naive_bayes \
    --cross-validate \
    --log-level DEBUG

Batch Prediction

python scripts/predict.py \
    --model-path models/naive_bayes_model \
    --input-file data/new_reviews.csv \
    --output-file results/predictions.csv

Model Evaluation

python scripts/evaluate.py \
    --model-path models/naive_bayes_model \
    --test-file data/test.csv

🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Apache Spark for distributed processing
MLflow for MLOps capabilities
The movie review dataset providers
The open-source community for excellent tools and libraries