Description of major activities

The work plan included three main steps:

  1. Data acquisition: to collect language data from the web
  2. Corpus evaluation: to evaluate their ecological validity through psycholinguistic testing
  3. Pedagogic exploration: to test the value of the archive by exploring practical pedagogical applications and tangible workflows

Here I’ve included a number of scripts useful in this project (no guarantee is provided; use at your own risk).

Getting meta-data from the IMDb for a subtitle download dump for a language (contact the dump provider directly to inquire, as they prohibit screen scraping).

From the language dump, a file `export.txt` documents the links between the provided files (named with an internal record id) and the IMDb ID.
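Because the script in this section addresses the columns of `export.txt` by position, it can help to confirm where each field sits in your copy of the dump first. The following is a minimal sketch, assuming the file is tab-delimited with a header row and UTF-8 encoded (the encoding is an assumption); the function name `list_columns` is hypothetical.

```python
import csv


def list_columns(path, delimiter="\t", encoding="utf-8"):
    """Return (index, name) pairs for the header row of a delimited file."""
    # newline="" is what the csv module recommends when opening files.
    with open(path, newline="", encoding=encoding) as handle:
        header = next(csv.reader(handle, delimiter=delimiter))
    return list(enumerate(header))
```

Printing the pairs returned by `list_columns("export.txt")` shows which positions hold the record id, the IMDb ID, and the download count, so the indices used in the script can be adjusted if your dump is laid out differently.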

This script will parse those entries, select the most downloaded subtitle version for each IMDb ID, and move and rename it using a number of basic IMDb meta-information categories. You will need to modify the script to allow for other languages and to change the source and target directories.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# usage: $ python
# author: Jerid Francom, 2013-03-30

import sys, os, imdb, csv, time

d = csv.reader(open('export.txt', 'r'), delimiter='\t') # read in the data

count = 0 # loop count
imdbids = {} # dict for imdbid/files/dnlds
print("Processing IMDb IDs...")
for item in d: # loop through all row entries to find the most downloaded subtitle file
    if count > 0: # skip the header row
        opensubid = item[1] # Opensubtitles ID
        imdbid = item[9] # IMDb ID
        dnlds = item[11] # downloads

        if imdbid in imdbids: # see if the IMDb ID is in the dict
            if int(imdbids[imdbid][1]) < int(dnlds): # compare dnlds
                add = 1
                # print("%s: %s to %s " % (imdbid, imdbids[imdbid][1], dnlds))
            else: # the stored file already has at least as many downloads
                add = 0
        else: # first time this IMDb ID is seen
            add = 1
            # print("%s: %s" % (imdbid, dnlds))

        if add == 1: # add the ids to the dict
            imdbids[imdbid] = (opensubid, dnlds) # add opensub and dnlds to the dict for this imdbid

    count = count + 1
print("... finished!")

count = 1
total = len(imdbids)
print("Begin searching the IMDb...")
ia = imdb.IMDb() # open the IMDb interface
for imdbid in imdbids:
    try: # attempt to get the imdbid information
        r = ia.get_movie(imdbid) # get the meta-data for this IMDb ID
        ia.update(r) # update the meta-data

        # recode some key meta-data
        year = r['year']
        genre = r['genres'] if "genres" in r.keys() else ["none"] # a list, so genre[0] works below
        language = r['language codes'] if "language codes" in r.keys() else ['en']

        countries = r['countries'] if "countries" in r.keys() else ['USA']
        countries = countries[0].replace(" ", "-")

        kind = r['kind'] if "kind" in r.keys() else "movie"
        kind = kind.replace(" ", "-")

        title = r['title']
        title = title.replace(":", "")
        title = title.replace("/", "")
        title = title.replace(" ", "-")

        file_name = "%s_%s_%s_%s_%s_%s_%s.txt" % (language[0], countries, year, title, kind, genre[0], imdbid) # dest filename

        opensub_filename = imdbids[imdbid][0] # get opensubid, original filename
        source_file = os.path.join("files/", opensub_filename)
        dest_file = os.path.join("named_files/", file_name)

        print("%s of %s:\n%s" % (count, total, file_name))
        count = count + 1

        try: # attempt to rename the file
            os.rename(source_file, dest_file)
            time.sleep(1) # rest for a second
        except OSError as e:
            print("Error: %s" % e.strerror)
    except Exception as err: # lookup failed or a required field is missing
        sys.stderr.write('Error getting IMDb info for %s: %s\n' % (imdbid, err))
        time.sleep(60) # wait for 1 min
print("...done!")
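
The two decisions the script makes, which subtitle file to keep for each title and what to call it, can be separated from the file system and the IMDb lookup so that they are easier to adapt and check. The sketch below is one way to do that and is not part of the original script: the function names, the `Subtitle` tuple, and the fallbacks for a missing year or title are illustrative assumptions, while the column positions (1, 9, 11) and the other defaults are taken from the script. It also assumes the metadata record supports `.get()`, as a plain dict does.

```python
from collections import namedtuple

# Hypothetical record type: the chosen subtitle file for one IMDb ID.
Subtitle = namedtuple("Subtitle", ["opensubid", "downloads"])


def pick_most_downloaded(rows, id_col=1, imdb_col=9, dnld_col=11):
    """Map each IMDb ID to its most downloaded subtitle record.

    `rows` holds data rows only (header already skipped). On a tie the
    first row seen is kept, mirroring the strict `<` test in the script.
    """
    best = {}
    for row in rows:
        imdbid = row[imdb_col]
        downloads = int(row[dnld_col])
        current = best.get(imdbid)
        if current is None or current.downloads < downloads:
            best[imdbid] = Subtitle(row[id_col], downloads)
    return best


def clean(text):
    """Drop ':' and '/' and replace spaces with hyphens."""
    return text.replace(":", "").replace("/", "").replace(" ", "-")


def build_file_name(meta, imdbid):
    """Build the destination file name from a dict-like metadata record.

    Missing or empty fields fall back to 'en', 'USA', 'movie' and 'none'
    as in the script; 'unknown' (year) and 'untitled' (title) are
    assumptions of this sketch.
    """
    language = (meta.get("language codes") or ["en"])[0]
    country = clean((meta.get("countries") or ["USA"])[0])
    year = meta.get("year", "unknown")
    title = clean(meta.get("title", "untitled"))
    kind = clean(meta.get("kind", "movie"))
    genre = (meta.get("genres") or ["none"])[0]
    return "%s_%s_%s_%s_%s_%s_%s.txt" % (
        language, country, year, title, kind, genre, imdbid)


if __name__ == "__main__":
    # Made-up rows and metadata, for illustration only.
    rows = [
        ["", "100"] + [""] * 7 + ["0000001", "", "250"],
        ["", "101"] + [""] * 7 + ["0000001", "", "900"],
    ]
    print(pick_most_downloaded(rows))
    meta = {"title": "An Example: Part 1/2", "year": 2001, "kind": "tv series",
            "genres": ["Drama"], "countries": ["United States"],
            "language codes": ["en"]}
    print(build_file_name(meta, "0000001"))
```

Keeping these steps as plain functions means the defaults for other languages can be changed in one place, and the source and target directories can be passed in by the caller instead of being hard-coded.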