Probably the best curated list of data science software in Python
Contents
- Contents
- Machine Learning
- Deep Learning
- Automated Machine Learning
- Natural Language Processing
- Computer Audition
- Computer Vision
- Time Series
- Reinforcement Learning
- Graph Machine Learning
- Graph Manipulation
- Learning-to-Rank & Recommender Systems
- Probabilistic Graphical Models
- Probabilistic Methods
- Model Explanation
- Genetic Programming
- Optimization
- Feature Engineering
- Visualization
- Deployment
- Statistics
- Data Manipulation
- Distributed Computing
- Experimentation
- Data Validation
- Evaluation
- Computations
- Web Scraping
- Spatial Analysis
- Quantum Computing
- Conversion
- Contributing
- License
Machine Learning
General Purpose Machine Learning
Gradient Boosting
Ensemble Methods
Imbalanced Datasets
Random Forests
Kernel Methods
Deep Learning
PyTorch
TensorFlow
JAX
- JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
- FLAX - A neural network library for JAX that is designed for flexibility.
- Optax - A gradient processing and optimization library for JAX.
Others
Automated Machine Learning
Natural Language Processing
Computer Audition
- torchaudio - An audio library for PyTorch.
- librosa - Python library for audio and music analysis.
- Yaafe - Audio features extraction.
- aubio - A library for audio and music analysis.
- Essentia - Library for audio and music analysis, description, and synthesis.
- LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
- Marsyas - Music Analysis, Retrieval, and Synthesis for Audio Signals.
- muda - A library for augmenting annotated audio data.
- madmom - Python audio and music signal processing library.
Computer Vision
Time Series
Reinforcement Learning
- Gymnasium - An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).
- PettingZoo - An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities.
- MAgent2 - An engine for high performance multi-agent environments with very large numbers of agents, along with a set of reference environments.
- Stable Baselines3 - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
- Shimmy - An API conversion tool for popular external reinforcement learning environments.
- EnvPool - C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
- RLlib - Scalable Reinforcement Learning.
- Tianshou - An elegant PyTorch deep reinforcement learning library.
- Acme - A library of reinforcement learning components and agents.
- Catalyst-RL - PyTorch framework for RL research.
- d3rlpy - An offline deep reinforcement learning library.
- DI-engine - OpenDILab Decision AI Engine.
- TF-Agents - A library for Reinforcement Learning in TensorFlow.
- TensorForce - A TensorFlow library for applied reinforcement learning.
- TRFL - TensorFlow Reinforcement Learning.
- Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
- keras-rl - Deep Reinforcement Learning for Keras.
- garage - A toolkit for reproducible reinforcement learning research.
- Horizon - A platform for Applied Reinforcement Learning.
- rlpyt - Reinforcement Learning in PyTorch.
- cleanrl - High-quality single-file implementations of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG).
- Machin - A reinforcement learning library designed for PyTorch.
- SKRL - Modular reinforcement learning library (on PyTorch and JAX) with support for NVIDIA Isaac Gym, Isaac Orbit and Omniverse Isaac Gym.
- Imitation - Clean PyTorch implementations of imitation and reward learning algorithms.
Graph Machine Learning
Graph Manipulation
- NetworkX - Network Analysis in Python.
- Rustworkx - A high performance Python graph library implemented in Rust.
- graph-tool - An efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks).
- igraph - Python interface for igraph.
Learning-to-Rank & Recommender Systems
Probabilistic Graphical Models
- pomegranate - Probabilistic and graphical models for Python.
- pgmpy - A Python library for working with Probabilistic Graphical Models.
- pyAgrum - A GRaphical Universal Modeler.
Probabilistic Methods
Model Explanation
Genetic Programming
Optimization
Feature Engineering
General
Feature Selection
Visualization
General Purposes
- Matplotlib - Plotting with Python.
- seaborn - Statistical data visualization using matplotlib.
- prettyplotlib - Painlessly create beautiful matplotlib plots.
- python-ternary - Ternary plotting library for Python with matplotlib.
- missingno - Missing data visualization module for Python.
- chartify - Python library that makes it easy for data scientists to create charts.
- physt - Improved histograms.
Interactive plots
Map
- folium - Makes it easy to visualize data on an interactive OpenStreetMap-based map.
- geemap - Python package for interactive mapping with Google Earth Engine (GEE).
Automatic Plotting
- HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
- AutoViz - Visualize data automatically with one line of code (ideal for machine learning).
- SweetViz - Visualize and compare datasets, target values and associations, with one line of code.
NLP
- pyLDAvis - Interactive topic model visualization.
Deployment
- fastapi - A modern, fast (high-performance) web framework for building APIs with Python.
- streamlit - Makes it easy to deploy machine learning models.
- streamsync - No-code in the front, Python in the back. An open-source framework for creating data apps.
- gradio - Create UIs for your machine learning model in Python in 3 minutes.
- Vizro - A toolkit for creating modular data visualization applications.
- datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
- binder - Enables sharing and executing Jupyter Notebooks.
- Deepnote - Deepnote is a drop-in replacement for Jupyter with an AI-first design, sleek UI, new blocks, and native data integrations. Use Python, R, and SQL locally in your favorite IDE, then scale to Deepnote cloud for real-time collaboration, Deepnote agent, and deployable data apps.
Statistics
Data Manipulation
Data Frames
Pipelines
Data-centric AI
- cleanlab - The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
- snorkel - A system for quickly generating training data with weak supervision.
- dataprep - Collect, clean, and visualize your data in Python with a few lines of code.
Synthetic Data
- ydata-synthetic - A package to generate synthetic tabular and time-series data leveraging state-of-the-art generative models.
Distributed Computing
Experimentation
- mlflow - Open source platform for the machine learning lifecycle.
- Neptune - A lightweight ML experiment tracking, results visualization, and management tool.
- dvc - Data Version Control | Git for Data & Models | ML Experiments Management.
- envd - 🏕️ machine learning development environment for data science and AI/ML engineering teams.
- Sacred - A tool to help you configure, organize, log, and reproduce experiments.
- Ax - Adaptive Experimentation Platform.
Data Validation
- great_expectations - Always know what to expect from your data.
- pandera - A lightweight, flexible, and expressive statistical data testing library.
- deepchecks - Validation & testing of ML models and data during model development, deployment, and production.
- evidently - Evaluate and monitor ML models from validation to production.
- TensorFlow Data Validation - Library for exploring and validating machine learning data.
- DataComPy - A library to compare Pandas, Polars, and Spark data frames. It provides stats and lets users adjust for match accuracy.
Evaluation
Computations
- NumPy - The fundamental package for scientific computing with Python.
- Dask - Parallel computing with task scheduling.
- bottleneck - Fast NumPy array functions written in C.
- CuPy - NumPy-like API accelerated with CUDA.
- scikit-tensor - Python library for multilinear algebra and tensor factorizations.
- numdifftools - Solve automatic numerical differentiation problems in one or more variables.
- quaternion - Add built-in support for quaternions to NumPy.
- adaptive - Tools for adaptive and parallel sampling of mathematical functions.
- NumExpr - A fast numerical expression evaluator for NumPy that comes with an integrated computing virtual machine to speed calculations up by avoiding memory allocation for intermediate results.
Web Scraping
- BeautifulSoup - The easiest library for beginners to scrape static websites.
- Scrapy - Fast and extensible scraping library. Lets you write rules and create customized scrapers without touching the core.
- Selenium - Use the Selenium Python API to access all functionality of Selenium WebDriver in an intuitive way, like a real user.
- Pattern - High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also has NLP, machine learning algorithms, and visualization.
- twitterscraper - Efficient library to scrape Twitter.
Spatial Analysis
Quantum Computing
- qiskit - Qiskit is an open-source SDK for working with quantum computers at the level of circuits, algorithms, and application modules.
- cirq - A Python framework for creating, editing, and invoking Noisy Intermediate Scale Quantum (NISQ) circuits.
- PennyLane - Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
- QML - A Python Toolkit for Quantum Machine Learning.
Conversion
- sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript, and others.
- ONNX - Open Neural Network Exchange.
- MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
- treelite - Universal model exchange and serialization format for decision tree forests.
Contributing
Contributions are welcome! 😎
Read the contribution guideline.
License
This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0
