NetflixPrizeBOF
Discuss approaches to the Netflix Prize using Python: getting started with PyFlix for newcomers, algorithm and code performance, etc.
Some Netflix Prize code in Python will be shown and run (KNN, NMF, ARTMAP, SVD, etc.).
I will be posting the code later this month on my blog: Data Wrangling
Some links for those just getting started:
Register a Team in order to download the Netflix data
PyFlix library for efficiently handling the dataset.
MovieLens dataset - a smaller dataset to debug your code with
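For debugging on a small dataset, a rough sketch like the following may help. It assumes the tab-separated layout of the MovieLens 100K `u.data` file (user id, item id, rating, timestamp) and uses a made-up four-line sample instead of the real file. The helper names `load_ratings` and `rmse` are hypothetical and are not part of PyFlix.

```python
import io
import math

# Hypothetical sample in the MovieLens 100K "u.data" layout:
# user_id <TAB> item_id <TAB> rating <TAB> timestamp
SAMPLE = (
    "1\t10\t4\t881250949\n"
    "1\t20\t2\t881250950\n"
    "2\t10\t5\t881250951\n"
    "2\t30\t3\t881250952\n"
)

def load_ratings(fileobj):
    """Parse tab-separated rating lines into (user, item, rating) tuples."""
    ratings = []
    for line in fileobj:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, rating, _timestamp = line.split("\t")
        ratings.append((int(user), int(item), float(rating)))
    return ratings

def rmse(predictions, actuals):
    """Root mean squared error, the metric used to score Netflix Prize entries."""
    total = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    return math.sqrt(total / len(actuals))

ratings = load_ratings(io.StringIO(SAMPLE))
actuals = [r for _, _, r in ratings]

# Simplest possible baseline: predict the global mean rating for everything.
global_mean = sum(actuals) / len(actuals)
baseline = rmse([global_mean] * len(actuals), actuals)
print("global mean:", global_mean, "baseline RMSE:", round(baseline, 4))
```

To try it on the real data, pass an open file handle to `load_ratings` instead of the `io.StringIO` sample. Any new algorithm should at least beat the global-mean baseline.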
Some approaches:
Netflix forum KNN discussion - includes numpy, weave specifics
Paul Harrison's approach - using numpy and weave
Dartmouth paper - using an EM/NMF approach with MovieLens data
BellKor paper - Progress prize winner
Hadoop MapReduce code for working with the Netflix data
More here:
Performance pointers:
If you need to go parallel for Netflix, the ElasticWulf public Amazon EC2 images come with mpi4py, IPython1, pyflix, numpy, scipy, weave, pyrex, etc. already installed and configured. The Python code for launching your own Beowulf cluster on EC2 using the images is on Google Code.
Parallel programming is useful for lots of ML algorithms. How to Write Parallel Programs is a good book (available on Amazon).
Consider Jython, since ML is often CPU-bound and Jython has no GIL.
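Another way around the GIL on a single machine, without switching interpreters, is to use separate processes through the standard library's `multiprocessing` module. The sketch below splits an RMSE computation into chunks and maps them over a small worker pool. The data is made up, and the names `chunk_squared_error`, `split`, and `chunked_rmse` are hypothetical helpers for this example only.

```python
import math
from multiprocessing import Pool

def chunk_squared_error(chunk):
    """Sum of squared errors for one chunk of (prediction, actual) pairs."""
    return sum((p - a) ** 2 for p, a in chunk)

def split(pairs, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of similar size."""
    size = max(1, math.ceil(len(pairs) / n_chunks))
    return [pairs[i:i + size] for i in range(0, len(pairs), size)]

def chunked_rmse(pairs, n_chunks=4, map_fn=map):
    """RMSE computed chunk by chunk; map_fn may be a serial or a parallel map."""
    partials = map_fn(chunk_squared_error, split(pairs, n_chunks))
    return math.sqrt(sum(partials) / len(pairs))

# Hypothetical (prediction, actual) pairs standing in for a probe set.
PAIRS = [(3.5, 4.0), (2.0, 1.0), (4.5, 5.0), (3.0, 3.0)] * 1000

if __name__ == "__main__":
    # Each worker is a separate process with its own interpreter and its own GIL.
    with Pool(processes=2) as pool:
        print("parallel:", chunked_rmse(PAIRS, n_chunks=8, map_fn=pool.map))
    print("serial:  ", chunked_rmse(PAIRS, n_chunks=8))
```

Passing the map function in as a parameter keeps the serial and parallel paths identical, which makes it easy to debug serially first. For work this cheap the process overhead will likely outweigh any gain; the pattern pays off when each chunk does real computation.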