Wednesday, August 20, 2014

Oracle-XE plus Linux = ? (Day One)

So, I was very impressed with the initial documentation on integrating Oracle-XE (the free RDBMS available from Oracle) with Linux.  Installation was supposed to be a simple one-line call to the Ubuntu software repositories through the "apt-get install" command. It has turned out to be a lengthier task (try at least one working day) with lots of implied steps in between.

Still, I am glad I have chosen this route because the documentation, though scattered, is very comprehensive.  It has been a challenge to pick up, select and connect the various sets of working instructions for different parts of the installation... from different sources.

So far, I have prepared my platform, which mostly covers the base requirements to run XE on any system:
  1. Ubuntu Linux 14.04 installed in a virtual machine set up on my host OS (my workstation).  It has been assigned 2 GB of RAM, 30 GB of disk space and one CPU.
  2. Assigned swap space for extra capacity and extended a dedicated disk partition for storage of the Oracle database instance.
At present, I am downloading the XE software for 11gR2.  That will be the version used in this ongoing discussion for this installation.  So far, things look good.  I can really appreciate an organized installation methodology. 

It helps to know where you can find everything and what you named it at the time it was installed.  Believe me, it looks like it will be the deciding factor in whether to move onward, or to chuck it and start over (ouch).

Onward.  More soon.

Tuesday, August 12, 2014

Movie Adventures: Data Sets from the IMDB

Working With Fun Data Sources

Actually, most data is interesting.  When I make examples to show concepts to others, though, I try to choose something that many people can relate to and that makes the typical relationships between samples easy to visualize.

Rest in Peace, Mr. Williams (Robin)
The IMDB, One of the Internet's Early Content Pioneers

Movies are a personal favorite.  It turns out that the IMDB, a.k.a. the Internet Movie Database, reads like a searchable encyclopedia of movie facts and stats... a database in many ways.

Apparently, the IMDB has its own alternate interfaces to its database.  Check it out and pull down a few samples.  I am currently in the process of downloading a compressed sample of their "movies.list" entity (they have quite a few entities in their model).  It's close to 40 MB of text, but it's coming down the network pipe pretty slowly right now.  I'll report back as soon as I am able to uncompress it and peek at the formatting.  Apparently the IMDB doesn't provide much documentation on how to use their data extracts.

A Comment on Data Stewardship

Please be sure to review the link to the IMDB Data Terms of Use, as it's "not the usual yaddah, yaddah".  Make sure that you know what kind of data you are using, which IMDB identifies as follows:

The data is NOT FREE although it may be used for free in specific circumstances.

Another IMDB page gives guidelines on using their "not-free" data in personal or commercial software products.  Reading these rules and thoroughly understanding their limitations and restrictions is a good exercise in data stewardship.  Stewardship applies to information gathered by individuals and by their affiliated organizations (employers, businesses, etc.).

A Sample Data File Format 

Here's a peek at what's inside the downloadable data files.  The layout is not quite like an industry "standard" CSV (primitive comma-separated values) or XML (a little geeky) file, but I think it's consistent enough for parsing.  It might be good if I could get to some data on movies I actually recognize... :-)

"'n Shrink" (2009) {Who's Got the Pills? (#1.1)} 2010
"'n Shrink" (2009) {Why's It So Hot? (#1.2)} 2010
"'N Sync TV" (1998) 1998-????
"'Ons Sterrenkookboek'" (2007) 2007-2008
"'Orrible" (2001) 2001-????
"'Orrible" (2001) {Dirty Dozen (#1.7)} 2001
"'Orrible" (2001) {Dog Locked (#1.5)} 2001
"'Orrible" (2001) {Enter the Garage (#1.2)} 2001
"'Orrible" (2001) {May the Best Man Win (#1.4)} 2001

In general, this should be a fun data set to work with, even for just this entity alone.  The count is about 2.97 million records spanning names from the early creation of motion picture science to the modern day. (Wow, that's quite a few.)
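
As a rough illustration of the "consistent enough for parsing" idea, here is a minimal Python sketch that splits one line of the sample into its parts.  It is only a guess based on the handful of sample lines shown above: the field names (title, year, episode, span) and the helper parse_line are my own inventions, not anything defined by the IMDB, and the full file almost certainly contains line shapes this pattern does not cover.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the sample lines in this post:
#   "Title" (year) {optional episode text} year-or-year-range
LINE_RE = re.compile(
    r'^"(?P<title>.+)"\s+'                      # quoted series title
    r'\((?P<year>[\d?]{4})\)'                   # year in parentheses
    r'(?:\s+\{(?P<episode>.*)\})?'              # optional {episode} block
    r'\s+(?P<span>[\d?]{4}(?:-[\d?]{4})?)\s*$'  # e.g. 2010 or 1998-????
)


class MovieRecord(NamedTuple):
    """Hypothetical record type; the field names are illustrative only."""
    title: str
    year: str
    episode: Optional[str]
    span: str


def parse_line(line: str) -> Optional[MovieRecord]:
    """Return a MovieRecord, or None if the line does not fit the assumed layout."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return MovieRecord(
        match.group("title"),
        match.group("year"),
        match.group("episode"),
        match.group("span"),
    )


if __name__ == "__main__":
    sample = '"\'Orrible" (2001) {Dirty Dozen (#1.7)} 2001'
    print(parse_line(sample))
```

Returning None for anything that does not match makes it easy to count the lines the pattern misses, which seems like a sensible first check before trusting it on millions of records.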

Update: An Appeal from the Team

The ftp site distributing the IMDB data extracts had an interesting appeal you may want to read and respond to.  I'm sure that the more people who give their input, the more likely the site owners are to implement even better alternative access methods and APIs for their site's content.  What do you think?  If you found this data source useful (perhaps on an ongoing basis), be sure to contact them and weigh in with your opinion!


We're in the process of reviewing how we make our data available to the outside world with the goal of making it easier for anyone to innovate and answer interesting questions with the data. Since you're reading this, you use our current ftp solution to get data (or are thinking about it) and we'd love to get your feedback on the current process for accessing data and what we could do to make it easier for you to use in the future. We have some specific questions below, but would be just as happy hearing about how you access and use IMDb data to make a better overall experience.

Please head over to 

to let us know what you think.


1. What works/doesn't work for you with the current model?
2. Do you access entertainment data from other sources in addition to IMDb?
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
5. Does how you plan on using the data impact how you want to have it delivered?
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
7. Are our T&Cs easy for you to understand and follow?
Thanks for your time and feedback!


Be sure to report back with your findings and creative ideas for interpreting and reading through the data.  I'm excited because millions of records push small database systems to levels where SQL optimizations are actually noticeable (well, hopefully more than with only hundreds!).

Article Summary:
How to use fun and interesting data sets from external sources in your database designs, models and personal software development projects.  Also tips on how to observe data stewardship practices such as following the legal restrictions and limitations placed on the use of otherwise private, proprietary or protected data. 

Saturday, August 9, 2014

Git Tips: Refreshing Source Code from the Origin (upstream source)

git-scm: quick tips for your reference

I sourced this one from the blog, Gitready: fetching upstream changes.

This tip comes from Dav Glass, whose work on YUI at GitHub requires him to bring in commits frequently. So, he’s created a helpful alias so he can merge in changes easily.

First off, he’s got a consistent naming scheme for his upstream remote:

git remote add upstream git://

Just throw this in your

and you’re all set:

     pu = !"git fetch origin -v; git fetch upstream -v; git merge upstream/master"

git pu

will grab all of the latest changes from both remotes, and then merge in the commits from upstream.