GitHub - krzjoa/awesome-python-data-science: Probably the best curated list of data science software in Python.

Probably the best curated list of data science software in Python

Contents

Machine Learning

General Purpose Machine Learning

Gradient Boosting

Ensemble Methods

Imbalanced Datasets

Random Forests

Kernel Methods

Deep Learning

PyTorch

TensorFlow

JAX

  • JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
  • FLAX - A neural network library for JAX that is designed for flexibility.
  • Optax - A gradient processing and optimization library for JAX.

Others

Automated Machine Learning

Natural Language Processing

Computer Audition

  • torchaudio - An audio library for PyTorch. (PyTorch based/compatible)
  • librosa - Python library for audio and music analysis.
  • Yaafe - Audio features extraction.
  • aubio - A library for audio and music analysis.
  • Essentia - Library for audio and music analysis, description, and synthesis.
  • LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
  • Marsyas - Music Analysis, Retrieval, and Synthesis for Audio Signals.
  • muda - A library for augmenting annotated audio data.
  • madmom - Python audio and music signal processing library.

Computer Vision

Time Series

Reinforcement Learning

  • Gymnasium - An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly Gym).
  • PettingZoo - An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities.
  • MAgent2 - An engine for high performance multi-agent environments with very large numbers of agents, along with a set of reference environments.
  • Stable Baselines3 - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
  • Shimmy - An API conversion tool for popular external reinforcement learning environments.
  • EnvPool - C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
  • RLlib - Scalable Reinforcement Learning.
  • Tianshou - An elegant PyTorch deep reinforcement learning library. (PyTorch based/compatible)
  • Acme - A library of reinforcement learning components and agents.
  • Catalyst-RL - PyTorch framework for RL research. (PyTorch based/compatible)
  • d3rlpy - An offline deep reinforcement learning library.
  • DI-engine - OpenDILab Decision AI Engine. (PyTorch based/compatible)
  • TF-Agents - A library for Reinforcement Learning in TensorFlow. (TensorFlow)
  • TensorForce - A TensorFlow library for applied reinforcement learning. (TensorFlow)
  • TRFL - TensorFlow Reinforcement Learning. (TensorFlow)
  • Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
  • keras-rl - Deep Reinforcement Learning for Keras. (Keras compatible)
  • garage - A toolkit for reproducible reinforcement learning research.
  • Horizon - A platform for Applied Reinforcement Learning.
  • rlpyt - Reinforcement Learning in PyTorch. (PyTorch based/compatible)
  • cleanrl - High-quality single file implementation of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG).
  • Machin - A reinforcement learning library designed for PyTorch. (PyTorch based/compatible)
  • SKRL - Modular reinforcement learning library (on PyTorch and JAX) with support for NVIDIA Isaac Gym, Isaac Orbit and Omniverse Isaac Gym. (PyTorch based/compatible)
  • Imitation - Clean PyTorch implementations of imitation and reward learning algorithms. (PyTorch based/compatible)

Graph Machine Learning

Graph Manipulation

  • Networkx - Network Analysis in Python.
  • Rustworkx - A high performance Python graph library implemented in Rust.
  • graph-tool - An efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks).
  • igraph - Python interface for igraph.

Learning-to-Rank & Recommender Systems

Probabilistic Graphical Models

  • pomegranate - Probabilistic and graphical models for Python. (PyTorch based/compatible)
  • pgmpy - A python library for working with Probabilistic Graphical Models.
  • pyAgrum - A GRaphical Universal Modeler.

Probabilistic Methods

Model Explanation

Genetic Programming

Optimization

Feature Engineering

General

Feature Selection

Visualization

General Purposes

  • Matplotlib - Plotting with Python.
  • seaborn - Statistical data visualization using matplotlib.
  • prettyplotlib - Painlessly create beautiful matplotlib plots.
  • python-ternary - Ternary plotting library for Python with matplotlib.
  • missingno - Missing data visualization module for Python.
  • chartify - Python library that makes it easy for data scientists to create charts.
  • physt - Improved histograms.

Interactive plots

Map

  • folium - Makes it easy to visualize data on an interactive OpenStreetMap map.
  • geemap - Python package for interactive mapping with Google Earth Engine (GEE).

Automatic Plotting

  • HoloViews - Stop plotting your data - annotate your data and let it visualize itself.
  • AutoViz - Visualize data automatically with one line of code (ideal for machine learning).
  • SweetViz - Visualize and compare datasets, target values and associations, with one line of code.

NLP

  • pyLDAvis - Interactive topic model visualization.

Deployment

  • fastapi - A modern, fast (high-performance) web framework for building APIs with Python.
  • streamlit - Makes it easy to build and deploy apps for machine learning models.
  • streamsync - No-code in the front, Python in the back. An open-source framework for creating data apps.
  • gradio - Create UIs for your machine learning model in Python in 3 minutes.
  • Vizro - A toolkit for creating modular data visualization applications.
  • datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
  • binder - Enables sharing and executing Jupyter Notebooks.
  • Deepnote - Deepnote is a drop-in replacement for Jupyter with an AI-first design, sleek UI, new blocks, and native data integrations. Use Python, R, and SQL locally in your favorite IDE, then scale to Deepnote cloud for real-time collaboration, Deepnote agent, and deployable data apps.

Statistics

Data Manipulation

Data Frames

Pipelines

Data-centric AI

  • cleanlab - The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
  • snorkel - A system for quickly generating training data with weak supervision.
  • dataprep - Collect, clean, and visualize your data in Python with a few lines of code.

Synthetic Data

  • ydata-synthetic - A package to generate synthetic tabular and time-series data leveraging state-of-the-art generative models. (pandas compatible)

Distributed Computing

Experimentation

  • mlflow - Open source platform for the machine learning lifecycle.
  • Neptune - A lightweight ML experiment tracking, results visualization, and management tool.
  • dvc - Data Version Control | Git for Data & Models | ML Experiments Management.
  • envd - 🏕️ machine learning development environment for data science and AI/ML engineering teams.
  • Sacred - A tool to help you configure, organize, log, and reproduce experiments.
  • Ax - Adaptive Experimentation Platform. (PyTorch based/compatible)

Data Validation

  • great_expectations - Always know what to expect from your data.
  • pandera - A lightweight, flexible, and expressive statistical data testing library.
  • deepchecks - Validation & testing of ML models and data during model development, deployment, and production. (sklearn)
  • evidently - Evaluate and monitor ML models from validation to production.
  • TensorFlow Data Validation - Library for exploring and validating machine learning data.
  • DataComPy - A library to compare Pandas, Polars, and Spark data frames. It provides stats and lets users adjust for match accuracy.

Evaluation

Computations

  • NumPy - The fundamental package for scientific computing with Python.
  • Dask - Parallel computing with task scheduling. (pandas compatible)
  • bottleneck - Fast NumPy array functions written in C.
  • CuPy - NumPy-like API accelerated with CUDA.
  • scikit-tensor - Python library for multilinear algebra and tensor factorizations.
  • numdifftools - Solve automatic numerical differentiation problems in one or more variables.
  • quaternion - Add built-in support for quaternions to numpy.
  • adaptive - Tools for adaptive and parallel sampling of mathematical functions.
  • NumExpr - A fast numerical expression evaluator for NumPy that comes with an integrated computing virtual machine to speed calculations up by avoiding memory allocation for intermediate results.

Web Scraping

  • BeautifulSoup - The easiest library for beginners to scrape static websites.
  • Scrapy - Fast and extensible scraping library. Lets you write rules and create customized scrapers without touching the core.
  • Selenium - Use the Selenium Python API to access all functionalities of Selenium WebDriver in an intuitive way, like a real user.
  • Pattern - High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also has NLP, machine learning algorithms, and visualization.
  • twitterscraper - Efficient library to scrape Twitter.

Spatial Analysis

  • GeoPandas - Python tools for geographic data. (pandas compatible)
  • PySal - Python Spatial Analysis Library.

Quantum Computing

  • qiskit - Qiskit is an open-source SDK for working with quantum computers at the level of circuits, algorithms, and application modules.
  • cirq - A python framework for creating, editing, and invoking Noisy Intermediate Scale Quantum (NISQ) circuits.
  • PennyLane - Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
  • QML - A Python Toolkit for Quantum Machine Learning.

Conversion

  • sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript, and others.
  • ONNX - Open Neural Network Exchange.
  • MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
  • treelite - Universal model exchange and serialization format for decision tree forests.

Contributing

Contributions are welcome! 😎
Read the contribution guideline.

License

This work is licensed under the Creative Commons Attribution 4.0 International License - CC BY 4.0