English Multiword Expression Lexicons
=====================================

Compiled by Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith.

 - version 1.0 (2014-04-19): 9 lexicons, SAID extraction script, Yelp word clusters

The lexical resources can be downloaded from

  http://www.ark.cs.cmu.edu/LexSem/

This dataset is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/) license (see LICENSE).


Contents
--------

This is a collection of type-level lexical resources that were used 
to help identify multiword expressions in Schneider et al. (*TACL* 2014).
The resources are as follows:

## Multiword Lexicons

  - cedt_mwes.json: Multiword lemmas, named entities, and CPHR and DPHR phrases 
    from the English side of the Prague Czech-English Dependency Treebank (Čmejrek et al., 2005; 
    http://ufal.mff.cuni.cz/pcedt2.0/; http://catalog.ldc.upenn.edu/LDC2012T08)

  - enwikt.json: Multiword entries from English Wiktionary (http://en.wiktionary.org; 
    data from https://toolserver.org/~enwikt/definitions/enwikt-defs-20130814-en.tsv.gz)

  - LVCs.json: List of light verb constructions provided by Claire Bonial

  - oyz_idioms.json: Multiword entries from Oyz's compilation of dictionary entries for 
    frequent English verbs (http://home.postech.ac.kr/~oyz/doc/idiom.html)

  - phrases_dot_net.json: Multiword entries on the Phrases.net website

  - semcor_mwes.json: Multiword entries in SemCor (Miller et al., 1993; accessed with NLTK)

  - vpc.json: Verb-particle constructions from the dataset of Baldwin (2008; 
    http://www.csse.unimelb.edu.au/research/lt/resources/vpc/vpc.tgz)

  - wikimwe.json: Entries from WikiMwe (Hartmann et al., 2011; 
    http://www.ukp.tu-darmstadt.de/data/lexical-resources/wikimwe/)

  - wordnet_mwes.json: Multiword lemma entries in English WordNet (Fellbaum, 1998; 
    http://wordnet.princeton.edu/; accessed with NLTK)

## SAID Extraction Script

  - said2json.py: The SAID idioms database (Kuiper et al., 2003) is 
    [distributed by LDC](http://catalog.ldc.upenn.edu/LDC2003T10). This script will access 
    a local installation of SAID to extract a JSON file of lexicon entries. 
    Run it by passing the path to the 'data' directory of the SAID installation:

        $ python2.7 said2json.py /path/to/said/data > said.json

## Word Clusters

  - yelpac-c1000-m25.gz: Word clusters obtained by running Brown clustering (implementation 
    by Liang, 2005; https://github.com/percyliang/brown-cluster) on the Yelp Academic Dataset 
    (https://www.yelp.com/academic_dataset), which has 21 million words of online reviews.
    The result is a hard hierarchical clustering into 1000 clusters, restricted to words 
    appearing at least 25 times.
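
As a rough illustration of how the clusters might be consumed, the sketch below maps each word to its cluster. It assumes the file follows the usual output layout of Liang's implementation, with one word per line as `<bit string>`, `<word>`, `<count>` separated by tabs; that layout is an assumption and should be checked against the actual file. The function names and the sample rows are hypothetical.

```python
import gzip
import io

def read_clusters(fileobj):
    """Map each word to its cluster bit string.

    Assumes tab-separated lines: bit string, word, count (assumed layout).
    """
    word2cluster = {}
    for line in fileobj:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        path, word = parts[0], parts[1]
        word2cluster[word] = path
    return word2cluster

def load_cluster_file(path):
    """Read a gzipped cluster file, e.g. load_cluster_file("yelpac-c1000-m25.gz")."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return read_clusters(f)

# Made-up rows in the assumed layout, for illustration only.
sample = "0010\tpizza\t120\n0011\tburger\t98\n110\tgreat\t500\n"
clusters = read_clusters(io.StringIO(sample))

# Because the clustering is hierarchical, a shared bit-string prefix
# corresponds to a shared coarser cluster.
same_coarse_cluster = clusters["pizza"][:3] == clusters["burger"][:3]
```

Truncating the bit strings to a fixed prefix length is a common way to obtain coarser clusters from a hierarchical clustering of this kind.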


JSON Format
-----------

Each lexicon JSON file contains one entry per line, with each line a self-contained JSON object. Two examples from cedt_mwes.json:

```
{"count": 1, "lemmas": ["ibm", "australia", "ltd."], "datasource": "Prague CEDT 2.0", "label": "NE"}
{"count": 3, "lemmas": ["have", "hand"], "datasource": "Prague CEDT 2.0", "label": "DPHR"}
```

The fields in each entry are:

  1. `lemmas` or `words`: the lemmas or fully inflected words that make up the 
     expression, depending on the resource. LVCs.json instead has 
     `verblemma` and `noun`.

  2. `poses`: parts of speech, if available

  3. `datasource`: name of the lexicon

  4. `label`: category of expression (resource-specific; may simply be `"MWE"`)

  5. `count`: frequency in the source data, if available

There are also some fields specific to a single resource (`pmi` for wikimwe.json, 
`context` for vpc.json, `contexts` for LVCs.json, `files` for semcor_mwes.json).
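
A minimal sketch of reading this format in Python follows. It parses one JSON object per line and returns the tokens of an entry regardless of which field the resource uses. The helper names `iter_entries` and `expression_tokens` are hypothetical, and the code assumes that `verblemma` and `noun` in LVCs.json hold single strings.

```python
import json

def iter_entries(lines):
    """Yield one dict per non-empty line (one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def expression_tokens(entry):
    """Return the tokens of an entry, whichever field the resource uses."""
    if "lemmas" in entry:
        return entry["lemmas"]
    if "words" in entry:
        return entry["words"]
    # LVCs.json: verb lemma and noun are stored separately
    # (assumed here to be single strings).
    return [entry["verblemma"], entry["noun"]]

# The two example lines from cedt_mwes.json shown above.
sample = [
    '{"count": 1, "lemmas": ["ibm", "australia", "ltd."], "datasource": "Prague CEDT 2.0", "label": "NE"}',
    '{"count": 3, "lemmas": ["have", "hand"], "datasource": "Prague CEDT 2.0", "label": "DPHR"}',
]
entries = list(iter_entries(sample))
tokens = [expression_tokens(e) for e in entries]

# Entries can be filtered on the shared fields, e.g. by label.
dphr = [e for e in entries if e.get("label") == "DPHR"]
```

To read a whole lexicon, pass an open file instead of the sample list, for example `with open("cedt_mwes.json", encoding="utf-8") as f: entries = list(iter_entries(f))`. Using `entry.get(...)` for optional fields such as `poses` and `count` avoids errors on resources that omit them.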


Further Information
-------------------

These resources were used to train a system that identifies multiword expressions in context. 
The system is described in:

 -  Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith (2014). 
    Discriminative lexical semantic segmentation with gaps: running the MWE gamut. 
    _Transactions of the Association for Computational Linguistics._

Contact [Nathan Schneider](http://nathan.cl) with questions.