Tuesday, October 10, 2006


Edit: Never mind, you can disregard this whole thing.

I downloaded the data, but it comes in 17,770 separate files. Eighteen THOUSAND separate files. I have no particular interest in hacking together a macro to turn those files into something useful.

Of course, if someone else already HAS, I'd be very willing to download their already-compiled list...

Call me lazy, but it would take me hours and hours just to recombine these files.
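(For what it's worth, recombining them doesn't have to take hours. Here's a rough sketch of the kind of macro I mean, assuming each per-movie file starts with a movie-ID header line like "1234:" followed by "userID,rating,date" rows, and that the files are named something like mv_*.txt in a training_set directory; both the paths and the format are assumptions.)

```python
import csv
import glob
import os

SRC_DIR = "training_set"   # assumed directory of per-movie mv_*.txt files
OUT_CSV = "ratings.csv"    # single combined output file

with open(OUT_CSV, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["movie_id", "user_id", "rating", "date"])
    for path in sorted(glob.glob(os.path.join(SRC_DIR, "mv_*.txt"))):
        with open(path) as f:
            # Assumed format: first line is the movie id with a trailing
            # colon, e.g. "1234:", then one "userID,rating,date" per line.
            movie_id = f.readline().strip().rstrip(":")
            for line in f:
                user_id, rating, date = line.strip().split(",")
                writer.writerow([movie_id, user_id, rating, date])
```

One pass, one output file, and from there any database or spreadsheet can import it.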

I'm downloading the data set for the Netflix prize. It will be the largest data set I've ever tried to run my user preference algorithm on.

I don't think I'll win the prize - ever - for a few reasons. First, it's not a "fire and forget" thing. They'll keep mumbling along for FIVE YEARS before passing out the prize, and I'm not willing to work gratis for quite that long.

Second, their data is bunged.

They did it to keep people from using the data to "make certain inferences about Netflix customers". Hrrrrmmmm... what, exactly, are we supposed to use the data for, then?

They say that the missing data doesn't affect THEIR algorithm's accuracy. But, you know, they just admitted their algorithm's accuracy is 8.5%.


Anyhow, we'll see whether I can do anything with it. The biggest problem, at the moment, is that I don't actually have a database program installed on this computer. Ha! I'll have to get one.


Darius Kazemi said...

I downloaded it the other day, but only to run visualizations of user preference networks on.

It's not a terribly user-friendly format. I don't know what they were thinking, including a separate file for each movie...

Craig Perko said...

You're kidding. I haven't finished downloading it yet, so I don't know what format it's in...

But one file per movie?

First, one file per ANYTHING is outdated and painful. Second, per MOVIE is not the way to do it.

Guh. What idiocy.

Hey, do you have any of those preference visualizations?

Gary said...

I also took a look at it. Even after converting all 17,700 files (I think that's the count, can't remember) to a binary format, it still takes 700MB of memory just to hold all of it (100 million records x (4 bytes per user ID, 2 per date, and 1 per score), plus the 17,700 movie IDs). Not exactly something that's easy to keep in memory and index or do anything interesting with, unless you have quite a bit of memory.
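(Gary's arithmetic checks out. A minimal sketch of that kind of packed record, assuming a 4-byte user ID, a 2-byte date stored as days since some epoch, and a 1-byte score; the field layout is an assumption, not Gary's actual format.)

```python
import struct

# Little-endian, no padding: 4-byte unsigned user id,
# 2-byte unsigned date offset, 1-byte score = 7 bytes total.
RECORD = struct.Struct("<IHB")

def pack_rating(user_id: int, days_since_epoch: int, score: int) -> bytes:
    """Pack one rating into a fixed-size 7-byte record."""
    return RECORD.pack(user_id, days_since_epoch, score)

record = pack_rating(1488844, 2075, 3)  # hypothetical example values
assert len(record) == RECORD.size == 7

# 100 million such records:
total_bytes = 100_000_000 * RECORD.size
print(total_bytes)  # 700,000,000 bytes, i.e. 700 MB -- Gary's figure
```

At 7 bytes per rating, 100 million ratings really is 700 MB before any index structures, which was a lot for a 2006 desktop.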

Craig Perko said...

You wouldn't happen to still HAVE the data in a useful format, would you?

Gary said...

Well, I have it, but with only 45k/s upload it's not exactly easy to send to you. It would be far easier if I could just send you the program or the C++ source. Want me to send it to you?

Craig Perko said...

The program would be more useful to me than the source, if it works. Sure - I'd mail you, but your Blogger account isn't set up. Contact me through mine.

Gary said...

Well, I don't have a Blogger account, so that's probably why. I do have a Google account, which is likely why it displays as such.

In any case, give this link a try.

Craig Perko said...

Link works great, thanks. I'll redownload the data and see what there is to see.

Thanks for your help!