Tuesday, August 12, 2014

Movie Adventures: Data Sets from the IMDB

Working With Fun Data Sources

Actually, most data is interesting.  When I build examples to show concepts to others, though, I try to choose a subject that many people can relate to, and one where it's easy to picture the typical relationships between records.

Rest in Peace, Mr. Williams (Robin)
The IMDB, One of the Internet's Early Content Pioneers

Movies are a personal favorite.  It turns out that the IMDB (the Internet Movie Database) reads like a searchable encyclopedia of movie facts and stats... a database in many ways.

Apparently, the IMDB has its own alternate interfaces to its database.  Check it out and pull down a few samples.  I am currently downloading a compressed sample of their "movies.list" entity (they have quite a few entities in their model).  It's close to 40 MB of text, but it's coming down the network pipe pretty slowly right now.  I'll report back as soon as I am able to uncompress it and peek at the formatting.  Apparently the IMDB doesn't provide much documentation on how to use their data extracts.

A Comment on Data Stewardship

Please be sure to review the link to the IMDB Data Terms of Use, as it's "not the usual yaddah, yaddah".  Make sure that you are using the data in line with those terms, which describe it as follows:

The data is NOT FREE although it may be used for free in specific circumstances.

There are also some guidelines on another IMDB page covering the use of their "not-free" data in personal or commercial software products.  Reading these rules and thoroughly understanding their limitations and restrictions is a good exercise in data stewardship.  Stewardship applies to information gathered by individuals and their affiliated organizations (employers, businesses, etc.).

A Sample Data File Format 

Here's a peek at what's inside the downloadable data files.  The layout isn't quite an industry "standard" like CSV (primitive comma-separated values) or XML (a little geeky), but I think it's consistent enough for parsing.  It might be good if I could get to some data on movies I actually recognize... :-)

"'n Shrink" (2009) {Who's Got the Pills? (#1.1)} 2010
"'n Shrink" (2009) {Why's It So Hot? (#1.2)} 2010
"'N Sync TV" (1998) 1998-????
"'Ons Sterrenkookboek'" (2007) 2007-2008
"'Orrible" (2001) 2001-????
"'Orrible" (2001) {Dirty Dozen (#1.7)} 2001
"'Orrible" (2001) {Dog Locked (#1.5)} 2001
"'Orrible" (2001) {Enter the Garage (#1.2)} 2001
"'Orrible" (2001) {May the Best Man Win (#1.4)} 2001
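For anyone who wants to start parsing, here is a minimal Python sketch based only on the sample lines above.  The layout it assumes (quoted title, a four-digit year in parentheses, an optional episode in curly braces, then a trailing year or year range) is my guess from this small sample, not a documented format, and the names LINE_RE and parse_line are made up for illustration.  Lines elsewhere in the file may well look different.

```python
import re

# Assumed layout, inferred only from the sample lines shown above:
#   "Title" (year) {optional episode (#season.episode)} year-or-range
LINE_RE = re.compile(
    r'^"(?P<title>.+)"\s+'           # quoted title (may contain apostrophes)
    r'\((?P<year>\d{4})\)'           # four-digit year in parentheses
    r'(?:\s+\{(?P<episode>.*)\})?'   # optional {episode name (#1.1)}
    r'\s+(?P<span>\S+)\s*$'          # trailing year or range, e.g. 1998-????
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the guess."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.groupdict()

if __name__ == "__main__":
    samples = [
        "\"'n Shrink\" (2009) {Who's Got the Pills? (#1.1)} 2010",
        "\"'N Sync TV\" (1998) 1998-????",
        "\"'Ons Sterrenkookboek'\" (2007) 2007-2008",
    ]
    for sample in samples:
        print(parse_line(sample))
```

Returning None for lines that don't fit makes it easy to count the misses on a first pass and see how far the guess holds up against the full file.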

In general, this should be a fun data set to work with, even for just this entity alone.  The count is about 2.97 million records, spanning titles from the earliest days of motion pictures to the modern day. (Wow, that's quite a few.)

Update: An Appeal from the IMDB.com Team

The FTP site distributing the IMDB data extracts had an interesting appeal you may want to read and respond to.  I'm sure that the more people who give their input, the more likely the site owners are to implement even better alternative access methods and APIs for their site's content.  What do you think?  If you found this data source useful (perhaps on an ongoing basis), be sure to contact them and weigh in with your opinion!


We're in the process of reviewing how we make our data available to the outside world with the goal of making it easier for anyone to innovate and answer interesting questions with the data. Since you're reading this, you use our current ftp solution to get data (or are thinking about it) and we'd love to get your feedback on the current process for accessing data and what we could do to make it easier for you to use in the future. We have some specific questions below, but would be just as happy hearing about how you access and use IMDb data to make a better overall experience.

Please head over to https://getsatisfaction.com/imdb/topics/api_bulk_data_access to let us know what you think.


1. What works/doesn't work for you with the current model?
2. Do you access entertainment data from other sources in addition to IMDb?
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
5. Does how you plan on using the data impact how you want to have it delivered?
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
7. Are our T&Cs easy for you to understand and follow?
Thanks for your time and feedback!



Be sure to report back with your findings and creative ideas for interpreting and reading through the data.  I'm excited because millions of records push small database systems to levels where SQL optimizations are actually noticeable (well, hopefully more than with only hundreds!).
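As a small taste of the kind of optimization I mean, here is a rough Python sketch using the built-in sqlite3 module.  Everything in it is made up for illustration: the titles table, its columns, the idx_titles_year index and the synthetic rows are assumptions, not the IMDB model or its data.  It asks SQLite for its query plan before and after adding an index on the year column.

```python
import sqlite3

# In-memory database with a made-up "titles" table (not the IMDB schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titles (title TEXT, year INTEGER)")

# Synthetic rows only; swap in parsed records to try it on real data.
conn.executemany(
    "INSERT INTO titles VALUES (?, ?)",
    ((f"Title {i}", 1900 + i % 115) for i in range(100_000)),
)

def query_plan(connection, sql, params=()):
    """Return SQLite's plan for a query as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

sql = "SELECT COUNT(*) FROM titles WHERE year = ?"

plan_before = query_plan(conn, sql, (2001,))   # expect a full table scan
conn.execute("CREATE INDEX idx_titles_year ON titles (year)")
plan_after = query_plan(conn, sql, (2001,))    # expect an index search

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan of the whole table and the second a search that uses the new index.  With a few hundred rows the difference hardly matters; with millions it starts to show.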

Article Summary:
How to use fun and interesting data sets from external sources in your database designs, models and personal software development projects.  Also, tips on how to observe data stewardship practices, such as following the legal restrictions and limitations placed on the use of otherwise private, proprietary or protected data.
