danesherbs - Overview
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Python · 18.2k stars · 2.9k forks
OpenAI Frontier Evals
Python · 1.2k stars · 143 forks
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering.
Python · 1.4k stars · 236 forks
BitBlaster-16 is a 16-bit computer built from scratch using only NAND gates and data flip-flops as primitives! :)
Python · 2 stars
Want to get better at making estimates under uncertainty? No? Well, now you can!
Python · 4 stars
Implementation of OpenAI's "Learning to Summarize from Human Feedback"
Jupyter Notebook · 7 stars · 1 fork