IMDB extractor

The IMDB extractor transforms Internet Movie Database data files into a topic map that can be browsed with Wandora. The extractor has been created for demonstration purposes only. Wandora does not contain any IMDB data files. Also be aware that neither Wandora nor its authors have any right to grant you permission to use IMDB data. If you plan to use IMDB topic maps beyond personal use, you should contact the IMDB licensing department.

You may download IMDB datafiles from

Because the data files are extremely large, you can't extract them to in-memory topic maps and have to use database topic maps instead. Wandora does not extract all IMDB files. The current extractor handles only

  • actors
  • actresses
  • keywords
  • countries
  • language
  • locations
  • genres
  • movies
  • biographies
  • producers
  • directors
  • plot summaries
  • running times
  • release dates

To prepare the extraction, download all required data files and unpack them to your local file system. Then create a database topic map and start the extractor with File > Extract > Media > IMDB Extractor. Wandora asks for a folder containing IMDB data files, or a single data file, and starts the extraction once the data file or folder has been identified. IMDB data files are very large, so be patient: the extraction may take a while.

Below is a screenshot of Wandora viewing the associations of the movie Dr. Strangelove.... Notice the layer structure. Each IMDB data file has been extracted to a separate database topic map.

Wandora with imdb.gif


Step by step example of extracting IMDB with Wandora

This chapter is a step-by-step tutorial showing how to use the IMDB extractor and database topic maps. The tutorial extractions were made in Ubuntu Linux 8.10 running on top of Sun's VirtualBox (itself running on top of Windows XP). The next screenshot shows the system properties of the Ubuntu Linux used for the IMDB extractions. Notice the amount of memory given to Linux: we gave Ubuntu 1500 MB. Our experience suggests you should give Linux as much memory as possible. With too little memory, the IMDB extraction fails after heavy swapping. Now start Ubuntu Linux and log in.

Imdb 09.png

Downloading IMDB datafiles

After Ubuntu has started, open a web browser in Ubuntu and

  • Download IMDB data files:
  • Unzip all data files in a shell with gunzip, or right-click each data file icon and select the option Extract Here.
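
The unpacking step can also be scripted. The following is a minimal sketch, assuming the downloaded archives carry a .list.gz suffix (for example actors.list.gz) and sit in one directory. The function name unpack_imdb_lists is made up for this illustration.

```shell
# Hypothetical helper: decompress every *.list.gz archive in a directory.
# Assumes the downloads are named like actors.list.gz (an assumption).
unpack_imdb_lists() {
    dir="${1:-.}"
    for archive in "$dir"/*.list.gz; do
        # Skip the unexpanded pattern when no archive matches.
        [ -e "$archive" ] || continue
        # gunzip replaces actors.list.gz with actors.list.
        gunzip "$archive" || return 1
    done
}
```

Calling unpack_imdb_lists with the download directory as its argument unpacks all matching archives there. Without an argument it works in the current directory.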

Now you should have all required IMDB data files ready for extraction as shown below.

Imdb 15.png

Setting up Wandora

Next we prepare the Wandora application. In Ubuntu

  • Download the Wandora application.
  • Install Wandora.
  • Start a Linux shell with the menu option Applications > Accessories > Terminal
    • Open Wandora's bin directory.
    • Change the permissions of Wandora-huge.sh to allow execution.
    • Finally, add Java's bin directory to the PATH environment variable.

Here is how I did the previous steps:

akivela@virtual-ubuntu:~/Desktop$ cd wandora/bin
akivela@virtual-ubuntu:~/Desktop/wandora/bin$ dir
SetClasspath.bat  Wandora.bat	    Wandora-large.bat  Wandora-mini.sh
SetClasspath.sh   Wandora-huge.bat  Wandora-large.sh   Wandora.sh
Wandora-4g.sh	  Wandora-huge.sh   Wandora-mini.bat
akivela@virtual-ubuntu:~/Desktop/wandora/bin$ chmod a+x Wandora-huge.sh
akivela@virtual-ubuntu:~/Desktop/wandora/bin$ PATH=$PATH:/home/akivela/jre1.6.0_13/bin
akivela@virtual-ubuntu:~/Desktop/wandora/bin$

Now you are ready to start the Wandora application in Linux. Type ./Wandora-huge.sh in the terminal and press Enter. The Wandora application should start.
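
If the script does not start, it may help to confirm that the launcher is executable and that a java binary can be found through PATH. Keep in mind that a PATH change typed into a terminal applies only to that terminal session. The check below is a small sketch. The function name check_wandora_launcher is hypothetical, and the default path assumes the function is run from Wandora's bin directory.

```shell
# Hypothetical pre-flight check, meant to be run in Wandora's bin directory.
check_wandora_launcher() {
    script="${1:-./Wandora-huge.sh}"
    if [ ! -x "$script" ]; then
        echo "not executable: $script (try: chmod a+x $script)"
        return 1
    fi
    # command -v succeeds only when java is reachable via PATH.
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH"
        return 1
    fi
    echo "ok"
}
```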

Imdb 01.png

Setting up databases for IMDB topic maps

As stated at the beginning of the IMDB extractor documentation above, you need a database topic map to store the extracted topic map because it is very large. To prepare the database topic maps, open another terminal window in Ubuntu with the option Applications > Accessories > Terminal. In the terminal

  • Install the MySQL server with the command sudo apt-get install mysql-server.
  • Log into the MySQL server with the command mysql --user=<your-username> --password=<your-password>
  • Create empty databases with the MySQL command create database <database-name>; (notice the ending semicolon) for the following database names:
    • imdb_actors
    • imdb_actresses
    • imdb_countries
    • imdb_directors
    • imdb_genres
    • imdb_movies
  • Prepare each created database with the Wandora-specific database table structures in wandora/build/resources/conf/database/db_mysql.sql. In detail:
    • Select the database with the MySQL command use <database-name>;, for example use imdb_actors; (notice the ending semicolon).
    • Read the table creation clauses from the external file with the MySQL command source wandora/build/resources/conf/database/db_mysql.sql; (notice the ending semicolon). You may have to change the path of db_mysql.sql depending on your Wandora installation directory and your current directory.
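
The same preparation can be scripted instead of typed. The sketch below only prints the MySQL statements for the six databases used in this tutorial, so they can be reviewed before being piped to the mysql client. The function name print_imdb_db_setup and the way the schema path is passed in are assumptions made for illustration.

```shell
# Hypothetical helper: print the statements that create and prepare each
# IMDB database. The argument is the path to Wandora's db_mysql.sql.
print_imdb_db_setup() {
    schema="$1"
    for name in actors actresses countries directors genres movies; do
        echo "create database imdb_$name;"
        echo "use imdb_$name;"
        # source is a mysql client command that reads statements from a file.
        echo "source $schema"
    done
}
```

One possible use is print_imdb_db_setup /path/to/db_mysql.sql | mysql --user=<your-username> --password, where --password without a value makes the client prompt for the password. An absolute schema path avoids surprises with the current directory.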

Below is my terminal capture of the previous steps. After these steps I have six empty database topic maps in the local MySQL and I am ready for the actual IMDB extractions.

akivela@virtual-ubuntu:~$ sudo apt-get install mysql-server
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  mysql-server-5.0
Suggested packages:
  tinyca mailx
The following NEW packages will be installed:
  mysql-server mysql-server-5.0
0 upgraded, 2 newly installed, 0 to remove and 349 not upgraded.
Need to get 26.9MB of archives.
After this operation, 87.7MB of additional disk space will be used.
Do you want to continue [Y/n]? y
Get:1 http://fi.archive.ubuntu.com intrepid/main mysql-server-5.0 5.0.67-0ubuntu6 [26.8MB]
Get:2 http://fi.archive.ubuntu.com intrepid/main mysql-server 5.0.67-0ubuntu6 [54.9kB]                      
Fetched 26.9MB in 25s (1073kB/s)                                                                            
Preconfiguring packages ...
Selecting previously deselected package mysql-server-5.0.
(Reading database ... 100052 files and directories currently installed.)
Unpacking mysql-server-5.0 (from .../mysql-server-5.0_5.0.67-0ubuntu6_i386.deb) ...
Selecting previously deselected package mysql-server.
Unpacking mysql-server (from .../mysql-server_5.0.67-0ubuntu6_all.deb) ...
Processing triggers for man-db ...
Setting up mysql-server-5.0 (5.0.67-0ubuntu6) ...
 * Stopping MySQL database server mysqld                                                              [ OK ] 
Reloading AppArmor profiles : done.
 * Starting MySQL database server mysqld                                                              [ OK ] 
 * Checking for corrupt, not cleanly closed and upgrade needing tables.

Setting up mysql-server (5.0.67-0ubuntu6) ...


akivela@virtual-ubuntu:~$ mysql --user=root --password=mypass
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.67-0ubuntu6 (Ubuntu)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create database imdb_actors;
Query OK, 1 row affected (0.01 sec)

mysql> create database imdb_actresses;
Query OK, 1 row affected (0.00 sec)

mysql> create database imdb_countries;
Query OK, 1 row affected (0.00 sec)

mysql> create database imdb_directors;
Query OK, 1 row affected (0.01 sec)

mysql> create database imdb_genres;
Query OK, 1 row affected (0.00 sec)

mysql> create database imdb_movies;
Query OK, 1 row affected (0.00 sec)

mysql> use imdb_actors;
Database changed
mysql> source /home/akivela/Desktop/wandora/build/resources/conf/database/db_mysql.sql
Query OK, 0 rows affected (0.03 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.04 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0 

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0 

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> use imdb_actresses;
Database changed
mysql> source /home/akivela/Desktop/wandora/build/resources/conf/database/db_mysql.sql
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0 

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> use imdb_countries;
Database changed
mysql> source /home/akivela/Desktop/wandora/build/resources/conf/database/db_mysql.sql
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> use imdb_directors;
Database changed
mysql> source /home/akivela/Desktop/wandora/build/resources/conf/database/db_mysql.sql
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> use imdb_genres;
Database changed
mysql> source /home/akivela/Desktop/wandora/build/resources/conf/database/db_mysql.sql
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.03 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0 

Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> use imdb_movies;
Database changed
mysql> source /home/akivela/Desktop/wandora/build/resources/conf/database/db_mysql.sql
Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec) 

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

Query OK, 0 rows affected (0.00 sec)
Records: 0  Duplicates: 0  Warnings: 0 

mysql> 
mysql> 
mysql>

Extracting IMDB with Wandora

Go back to the Wandora application started earlier and select the menu option Layers > New layer. A dialog window opens. Select Database in the drop-down selector labeled Type. The layer creation dialog should now look something like this:

Imdb 02.png

Select MySQL test in the database settings list and click the Edit button. Another dialog opens for the database settings (see image below). In this dialog you can enter the database name, user name, and password. Change the database name to imdb_actors. Change the user field to your database user name and the password field to that user's password.

Imdb 03.png

Now click the OK button. The database configuration window closes, revealing the previous dialog window. Enter a name for the layer, say imdb_actors, keep the MySQL test database configuration selected, and click the OK button. Wandora creates a new topic map layer and shows it in the bottom-left corner of the Wandora application window (see below). Now select the created layer by clicking it. A selected layer is slightly darker than an unselected one. From now on, all "write" operations go to the selected database topic map layer.

If the created layer is dark red, your new layer is broken. A layer is broken when the database connection fails for some reason. Check Wandora's terminal window for the specific error message. I managed to break a layer a couple of times by entering a wrong user name and password for the database.
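
Database credentials can also be checked outside Wandora before creating the layer. The sketch below, with the hypothetical function name check_imdb_db, asks the mysql client to list the tables of one database. A failure here suggests the kind of connection problem that would also break the layer.

```shell
# Hypothetical helper: try to open a database with the given user.
# --password without a value makes the mysql client prompt for it.
check_imdb_db() {
    db="$1"
    user="$2"
    if mysql --user="$user" --password --execute="show tables;" "$db"; then
        echo "connection to $db ok"
    else
        echo "connection to $db failed"
        return 1
    fi
}
```

For example, check_imdb_db imdb_actors root prompts for the password and then lists the tables created from db_mysql.sql, if the login succeeds.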

Imdb 05.png

Next we are going to start the IMDB extraction. Select the menu option File > Extract > Media > IMDB extract.... Wandora opens a Files/Urls/Raw selector. Keep the Files tab open and click the Browse button. A file selector opens. Go to the directory where you uncompressed the IMDB data files and select actors.list (see below). To start the extraction, press the Extract button. As IMDB data files are extremely large, it is not surprising that the extraction takes several hours. For example, extracting the more than 9 million rows of actors.list took about 6 hours in my virtual Ubuntu.
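
To get a rough feeling for how long another file might take, the figures quoted above (about 9 million rows in about 6 hours, that is 360 minutes) can be scaled linearly. This is only a sketch: the function name estimate_extraction_minutes is hypothetical, and linear scaling from a single measurement on one virtual machine is an assumption, not a measured property of the extractor.

```shell
# Rough estimate scaled from the figures quoted in the text: about 9 million
# rows took about 360 minutes on the author's virtual machine.
estimate_extraction_minutes() {
    rows="$1"
    # Integer arithmetic; the result is rounded down to whole minutes.
    echo $(( $rows * 360 / 9000000 ))
}
```

The row count of a file can be supplied with wc, as in estimate_extraction_minutes "$(wc -l < actors.list)". Treat the result as an order-of-magnitude guess, since hardware and file structure differ.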

Imdb 07.png

When the extraction finishes, you can request statistics for the database topic map layer with the menu option Layers > Statistics > Layer info.... It took my system several minutes to open the layer statistics dialog window:

Imdb 08.png

The extracted topic map contained a little over 2 million topics and nearly 3 million associations. It is very important to understand that accessing such a topic map in Wandora is extremely slow and easily causes OutOfMemory exceptions. As a rule of thumb, do not search for anything that could generate a result set with millions of hits. Also, do not open association type topics, role topics, or class topics, as they probably generate extremely large topic table structures that Wandora can't handle.

Now, to continue extracting other IMDB files, drop the extracted layer imdb_actors with the menu option Layers > Delete layer... Deleting a database topic map layer doesn't touch the database content, and you can open it again later. It's just more convenient to do the extraction when no other topic map layers are in the way.

Now repeat all the steps described above for each of the other IMDB data files. You should extract each data file to its own database topic map:

actresses.list --> imdb_actresses
movies.list --> imdb_movies
genres.list --> imdb_genres
countries.list --> imdb_countries
directors.list --> imdb_directors
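
The naming pattern in this mapping is regular, so it can be derived from the file name. The helper below is a sketch with a hypothetical name, db_for_list. It assumes the data files keep their .list suffix and that the databases follow the imdb_<name> convention used in this tutorial.

```shell
# Hypothetical helper: derive the database name used in this tutorial
# from a data file name, e.g. genres.list -> imdb_genres.
db_for_list() {
    # basename drops the directory part and the .list suffix.
    base=$(basename "$1" .list)
    echo "imdb_$base"
}
```

This can serve as a reminder of which database to enter in the layer settings before each extraction, for instance db_for_list ~/imdb/directors.list.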

Merging IMDB database topic map layers

Now you should have all IMDB data files extracted. The final step is to open all generated topic maps in Wandora as separate layers. In Wandora, for each database topic map

  • Select the menu option Layers > New layer...
  • Change the topic map type to Database.
  • Edit the default settings of MySQL test as you did while preparing the extraction.
  • Give the layer a unique name and click OK.

As a result, your Wandora should look something like the screenshot below, and you can continue by accessing the merged IMDB topic map. Be careful: the layer stack is huge, and you can easily get OutOfMemory exceptions, as said above :)


Imdb 16.png


Below, the user has searched for Brazil. The result set contains Terry Gilliam's movie Brazil. The user opens the movie in the topic panel.


Imdb 10.png


Below, the user has scrolled down to see all associations of the movie Brazil.


Imdb 11.png


Below, the user has double-clicked the topic Terry Gilliam. The topic is open in the topic panel, and the user can see all associations of Terry Gilliam.


Imdb 12.png


If the user scrolls down, the topic panel reveals an association table listing all the movies Terry Gilliam has directed.


Imdb 13.png


One of the movies directed by Terry Gilliam is Twelve Monkeys. The user double-clicks the topic in Terry Gilliam's director table, and the topic Twelve Monkeys opens in the topic panel.


Imdb 14.png