NetflixPrizeBOF

Discuss approaches to the Netflix Prize using Python: getting started with PyFlix for newcomers, algorithm and code performance, etc.

Some Python code for the Netflix data will be shown and run (KNN, NMF, ARTMAP, SVD, etc.).

I will be posting the code later this month on my blog: Data Wrangling

Some links for those just getting started:

Register a Team in order to download the Netflix data
PyFlix library for efficiently handling the dataset.
MovieLens dataset - a smaller dataset for debugging your code

Some approaches:

Simon Funk approach
Timely Development code for Simon Funk approach
Netflix forum KNN discussion - includes numpy, weave specifics
Basic KNN in SQL
TiVo KNN paper
Erik Shelly's approach
Dan Tillberg's page
Paul Harrison's approach - using numpy and weave
Dartmouth paper - using an EM/NMF approach with MovieLens data
BellKor paper - Progress prize winner
Hadoop MapReduce code for working with the Netflix data
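
For anyone who wants a feel for the matrix-factorization idea behind the Simon Funk links above before reading them, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not Funk's code: it updates all factors together with regularized stochastic gradient descent (the original write-up trains one feature at a time), and the function names, hyperparameters, and toy ratings are all made up for this example.

```python
import math
import random


def train_funk_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit user/movie factor vectors to (user, movie, rating) tuples with SGD.

    Hypothetical helper for illustration; hyperparameters are guesses that
    suit the tiny toy data below, not the real Netflix dataset.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    mean = sum(r for _, _, r in ratings) / len(ratings)
    # Small random starting values so the factors can move away from zero.
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    Q = {m: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for m in movies}
    for _ in range(n_epochs):
        for u, m, r in ratings:
            pu, qm = P[u], Q[m]
            err = r - (mean + sum(a * b for a, b in zip(pu, qm)))
            for k in range(n_factors):
                puk, qmk = pu[k], qm[k]
                # Gradient step on squared error with L2 regularization.
                pu[k] += lr * (err * qmk - reg * puk)
                qm[k] += lr * (err * puk - reg * qmk)
    return mean, P, Q


def predict(model, user, movie):
    """Predicted rating; falls back to the global mean for unseen ids."""
    mean, P, Q = model
    if user not in P or movie not in Q:
        return mean
    return mean + sum(a * b for a, b in zip(P[user], Q[movie]))


def rmse(model, ratings):
    """Root mean squared error of the model on (user, movie, rating) tuples."""
    sse = sum((r - predict(model, u, m)) ** 2 for u, m, r in ratings)
    return math.sqrt(sse / len(ratings))


# Toy data (made up): users 0-1 like movies 0-1, users 2-3 like movies 2-3.
toy = [(u, m, 5 if (u < 2) == (m < 2) else 1) for u in range(4) for m in range(4)]
model = train_funk_svd(toy)
print("training RMSE:", rmse(model, toy))
```

Dictionaries of Python lists keep the sketch dependency-free, but they will be far too slow for the full dataset; the numpy and weave links in this list are the place to look for that.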

More here:

Performance pointers:

http://www.scipy.org/PerformancePython
http://wiki.python.org/moin/PythonSpeed/PerformanceTips
http://www.scipy.org/Weave
If you need to go parallel for Netflix, the ElasticWulf public Amazon EC2 images come with mpi4py, IPython1, pyflix, numpy, scipy, weave, pyrex, etc. already installed and configured. The Python code for launching your own Beowulf cluster on EC2 using the images is on Google Code.

Parallel programming is useful for lots of ML algorithms. How to Write Parallel Programs is a good book (Amazon). Consider Jython, since ML is often CPU-bound and Jython has no GIL.
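
If you want to stay on CPython, one common way around the GIL for CPU-bound work is to use processes instead of threads. The sketch below is only an illustration of that idea using the standard library's multiprocessing.Pool, not something taken from the links above: it scores (predicted, actual) rating pairs in chunks and combines the partial sums into an RMSE. The function names and the chunking scheme are assumptions made for this example.

```python
import math
from multiprocessing import Pool


def chunk_sse(pairs):
    """Sum of squared errors for one chunk of (predicted, actual) pairs."""
    return sum((p - a) ** 2 for p, a in pairs)


def split(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal slices."""
    size = max(1, math.ceil(len(items) / n_chunks))
    return [items[i:i + size] for i in range(0, len(items), size)]


def serial_rmse(pairs):
    """Single-process RMSE, useful as a reference when debugging."""
    return math.sqrt(chunk_sse(pairs) / len(pairs))


def parallel_rmse(pairs, n_workers=4):
    """RMSE with each chunk scored in a separate worker process."""
    with Pool(n_workers) as pool:
        partial_sums = pool.map(chunk_sse, split(pairs, n_workers))
    return math.sqrt(sum(partial_sums) / len(pairs))
```

Call parallel_rmse from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. For a computation this cheap the cost of shipping data to the workers will likely outweigh any gain; the pattern pays off when each chunk does real work, such as computing KNN similarities.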