STREUSLE Dataset ================ STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. It supersedes the Comprehensive __Multiword Expressions__ corpus [1] (which was used for the experiments in [2]). STREUSLE adds semantic supersenses in addition to the MWE annotations. The supersense labels apply to single- and multiword __noun__ and __verb__ expressions, as described in [3], and __preposition__ expressions, as described in [4, 5]. STREUSLE and associated documentation and tools can be downloaded from: . PrepWiki, the lexical resource that supported preposition supersense annotation and that explains the category hierarchy, can be accessed at . This dataset's multiword expression and supersense annotations are licensed under a [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/) license (see LICENSE). The source sentences and part-of-speech annotations, which are from the Reviews section of the __English Web Treebank__ (EWTB; [6]), are redistributed with permission of Google and the Linguistic Data Consortium, respectively. References: - [1] Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith. Comprehensive annotation of multiword expressions in a social web corpus. _Proceedings of the 9th Linguistic Resources and Evaluation Conference_, Reykjavík, Iceland, May 26–31, 2014. - [2] Nathan Schneider, Emily Danchik, Chris Dyer, and Noah A. Smith. Discriminative lexical semantic segmentation with gaps: running the MWE gamut. _Transactions of the Association for Computational Linguistics_, 2(April):193−206, 2014. http://www.cs.cmu.edu/~ark/LexSem/mwe.pdf - [3] Nathan Schneider and Noah A. Smith. A corpus and model integrating multiword expressions and supersenses. _Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies_, Denver, Colorado, May 31–June 5, 2015. - [4] Nathan Schneider, Jena D. Hwang, Vivek Srikumar, and Martha Palmer. A hierarchy with, of, and for preposition supersenses. _Proceedings of the 9th Linguistic Annotation Workshop_, Denver, Colorado, June 5, 2015. - [5] Nathan Schneider, Jena D. Hwang, Vivek Srikumar, Meredith Green, Abhijit Suresh, Kathryn Conger, Tim O'Gorman, and Martha Palmer. A corpus of preposition supersenses. _Proceedings of the 10th Linguistic Annotation Workshop_, Berlin, Germany, August 11, 2016. - [6] Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. English Web Treebank. Linguistic Data Consortium, Philadelphia, Pennsylvania, August 16, 2012. Files ----- - ACKNOWLEDGMENTS.md: Contributors and support that made this dataset possible. - TAGSET.md: List of class labels with explanations. - LICENSE: License. - streusle.sst: Initial annotations, in human-readable and JSON formats, along with gold POS tags. - streusle.tags: Automatic conversion of streusle.sst to the tagging scheme appropriate for training sequence models. A few intricately structured MWEs have been simplified to fit the tagging scheme, and lemmas from the WordNet lemmatizer have been added. - streusle.tags.sst: Conversion of streusle.tags back to the .sst format, now with lemmas and tags. - streusle.upos.tags, streusle.upos.tags.sst: The above files, but replacing gold PTB POS tags with [Universal POS tags](http://universaldependencies.github.io/docs/en/pos/all.html) obtained by applying [this script](https://gist.github.com/nschneid/beed0bcda5b42e530011) to the gold trees in the EWTB. - STREUSLE3.0-mwes.tsv: All multiword expressions annotated in the corpus: frequency count, lowercased words, strength (`_` = strong MWE, `~` = weak MWE), and part-of-speech sequence. - STREUSLE3.0-mwe-types.txt: Just the lowercased word sequences annotated as MWEs. - STREUSLE3.0-strong-mwe-types.txt: Just the strong MWEs. - psst-tokens.tsv: Human-readable display of preposition supersense annotations, one line per token. (Excludes prepositions labeled with a non-supersense class, such as \`i.) - Supersense-PB manual verification sample.xlsx: Spreadsheet containing raw data and analysis for the study of correspondences between preposition supersense and PropBank function tags (Schneider et al. 2016, §4). The same spreadsheet can be accessed [online](https://docs.google.com/spreadsheets/d/1DR9z--cPMY2a3y4GghggPe4XNoODO70Da-zPiyMXwg8/edit?usp=sharing) through Google Docs. - splits/: Experimental train/test splits. See splits/README.md for details. .sst Format ----------- (Based on CMWE's .mwe format.) 1 sentence per line. 3 tab-separated columns: sentence ID; human-readable MWE annotation from CMWE; JSON data structure with POS-tagged words, MWE groupings, and class (supersense) annotations associated with the first token of the expression they apply to. Note that token indices are 1-based. The .tags.sst JSON object adds lemmas and tags in the JSON object. .tags Format ------------ (CoNLL-esque format based on CMWE's .tags format.) 1 token per line, with blank lines separating sentences. 9 tab-separated columns: 1. token offset 2. word 3. lowercase lemma 4. POS 5. full MWE+class tag 6. offset of parent token (i.e. previous token in the same MWE), if applicable 7. strength level encoded in the tag, if applicable: `_` for strong, `~` for weak 8. class (usually supersense) label, if applicable: see [TAGSET.md](TAGSET.md) 9. sentence ID Contact ------- Questions should be directed to: Nathan Schneider [nathan.schneider@georgetown.edu]() http://nathan.cl History ------- - STREUSLE 3.0: 2016-08-23. Added preposition supersenses - STREUSLE 2.1: 2015-09-25. Various improvements chiefly to auxiliaries, prepositional verbs; added \`p class label as a stand-in for preposition supersenses to be added in a future release, and \`i for infinitival 'to' where it should not receive a supersense. From 2.0 (not counting \`p and \`i): * Annotations have changed for 877 sentences (609 involving changes to labels, 474 involving changes to MWEs). * 877 class labels have been changed/added/removed, usually involving a non-supersense label or triggered by an MWE change. Most frequently (118 cases) this was to replace `stative` with the auxiliary label \`a. In only 21 cases was a supersense label replaced with a different supersense label. - STREUSLE 2.0: 2015-03-29. Added noun and verb supersenses - CMWE 1.0: 2014-03-26. Multiword expressions for 55k words of English web reviews