ACTIV-ES: a comparable, cross-dialect corpus of ‘everyday’ Spanish from Argentina, Mexico, and Spain

The first release of the ACTIV-ES Spanish dialect corpus based on TV/film transcripts is now available here: https://github.com/francojc/activ-es

It includes 3,460,172 total tokens (Argentina: 1,103,039; Mexico: 976,192; Spain: 1,380,941) and comes in running-text and word-list (1- to 5-gram) formats. Each format has both a plain-text and a part-of-speech-tagged version.
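
To give a rough idea of what an n-gram word list looks like, here is a minimal sketch that counts n-grams in whitespace-tokenized running text. The `ngrams` function is a hypothetical helper written for this post and is not the procedure used to build the released lists. It assumes tokens are already separated by whitespace and does no case folding or punctuation handling.

```shell
# Hypothetical helper: read running text on stdin and print "count<TAB>n-gram",
# most frequent first. Assumes whitespace-separated tokens.
ngrams() {
  n="$1"
  # Put one token per line, then build n-grams over the token sequence.
  tr -s '[:space:]' '\n' | awk -v n="$n" '
    NF { tok[++count] = $0 }          # skip empty lines, store each token
    END {
      for (i = 1; i + n - 1 <= count; i++) {
        gram = tok[i]
        for (j = 1; j < n; j++) gram = gram " " tok[i + j]
        freq[gram]++
      }
      for (g in freq) print freq[g] "\t" g
    }' | sort -k1,1nr -k2
}
```

For example, `ngrams 2 < running_text.txt` would print a bigram list for a file named running_text.txt (a placeholder name, not a file in the repository).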

For more information about the development and evaluation of this resource, you can download our paper from the Ninth International Conference on Language Resources and Evaluation (LREC 2014) here: https://www.academia.edu/6962707/ACTIV-ES_a_comparable_cross-dialect_corpus_of_everyday_Spanish_from_Argentina_Mexico_and_Spain
[Figure: plot of the corpus by country, year, and genre]

Install graphical interface for TreeTagger on Windows

Here’s a slimmed-down, step-by-step list of instructions for installing the TreeTagger graphical interface on a Windows machine.

1. Download the TreeTagger software for Windows.

2. Unzip this file into your C:\Program Files\ directory. If you use WinZip, make sure the “Use folder names” box is ticked, and extract all files.

3. Download the parameter file(s) that you need and extract them into the C:\Program Files\TreeTagger\lib subdirectory.

4. Download the graphical interface files (tagger and training programs) and drop them in the C:\Program Files\TreeTagger\bin subdirectory.

5. Then create a desktop shortcut: right-click the tagger and/or training program, select “Create shortcut”, and drag that shortcut to the desktop.

You should now be able to launch TreeTagger from the desktop.

Install vislcg3 tools on Mac OS X

Here are the instructions to install the vislcg3 constraint grammar tools on a Mac.

1. Install the Xcode developer tools (App Store).

2. Install cmake and boost. I use Homebrew, but I imagine you could use MacPorts or Fink.

3. Install ICU. This takes a few steps:
A. Download the package here: http://download.icu-project.org/files/icu4c/4.8.1/icu4c-4_8_1-src.tgz (or the latest version) and decompress it:

$ gunzip -d < icu4c-4_8_1-src.tgz | tar -xvf -

Then move into the source directory:

$ cd icu/source/

It's a good idea to make sure the permissions are set, so run:

$ chmod +x runConfigureICU configure install-sh

B. Now run runConfigureICU like so:

$ ./runConfigureICU MacOSX

C. Then run make and make install, and you should be golden:

$ make
$ sudo make install

4. Now it's time to get to vislcg3.
A. Download the files from the svn repository:

$ svn co http://beta.visl.sdu.dk/svn/visl/tools/vislcg3/trunk vislcg3

Then move into the main directory:

$ cd vislcg3/

B. Run the cmake script to check your setup and configure the build:

$ ./cmake.sh

C. Run make and make install to finalize this thing.

$ make
$ sudo make install

D. Now check to see that it's in your path:

$ which vislcg3

And if you get a path to the binary, you're ready to go!
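
If something goes wrong along the way, a quick check of which tools are actually on your path can save some head-scratching. Here is a minimal sketch of such a check. The `check_tools` function is a hypothetical helper, not part of vislcg3 or ICU, and the tool names you pass in are up to you.

```shell
# Hypothetical helper: report which of the given commands are on PATH.
# Returns 0 if all of them are found, 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_tools cmake svn make` before step 4, or `check_tools vislcg3` once the install is done.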

Word Clouds with Wordle

While searching for code examples in R for creating word clouds, I stumbled across this neat tool for creating word clouds on the fly. Wordle provides a clean, basic interface for creating your own wordle or browsing others made on the site.

Here’s a cloud I created based on the Mexican film “Y tu mamá también” from 2001.

You can create your own wordle here.

New extensions for corpus data

I’ve been on a TED marathon recently. Just the other day this presentation by Deb Roy appeared and blew my mind. This is an amazing project that really underlines the power of corpus data in providing a snapshot of the secret world of language use. At one point Deb suggests that this approach is as important as the telescope was. I completely agree. The visualizations help drive this point home: they communicate the complex relationships between language use and behavioral interaction in an intuitive way.

Hats off.

Deb Roy: The birth of a word