10-K Corpus
-----------

http://www.ark.cs.cmu.edu/10K

Version 1.0 released March 31, 2009.  Last addendum: September 18, 2009.

If you publish research based on these data, please cite the following
paper:

  Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and
  Noah A. Smith.  Predicting Risk from Financial Reports with
  Regression.  In Proceedings of the North American Chapter of the
  Association for Computational Linguistics Human Language Technologies
  Conference (NAACL-HLT), May-June 2009.
  http://www.cs.cmu.edu/~nasmith/papers/kogan+levin+routledge+sagi+smith.naacl09.pdf

More details about this corpus can be found in the paper.

The corpus contains 10-K reports from many US companies during the
years 1996-2006, as well as the measured volatility of stock returns
for the twelve-month periods preceding and following each report.

The data are organized by the year of the report.  For year yyyy,
there are several files:

  dist/yyyy.full.tgz        - the original 10-K reports (named key.txt)
  dist/yyyy.mda.tgz         - the MD&A sections from the 10-Ks (named key.mda)
  dist/yyyy.tok.tgz         - the tokenized MD&A sections (named key.mda)

(The files in the above tarballs are similarly named; the string up to
the "." is a unique key for the report.)

  dist/yyyy.logvol.-12.txt  - maps key to log volatility in the preceding year
  dist/yyyy.logvol.+12.txt  - maps key to log volatility in the following year
  dist/yyyy.meta.txt        - maps key to a date (yyyymmdd format), URL,
                              company name, and SEC code

There are two Perl scripts:

  extract_MDA.pl    - extracts the MD&A section from full 10-K reports
  tokenize_new.pl   - does strict tokenization as in the NAACL-HLT 2009 paper

Note (9/18/2009): some discrepancies between the tokenization script
and the data have been found, at least having to do with whether
hyphenated words are split by whitespace.  (The distributed script
*does* split words at hyphens; the released data appear not to have
been so split.)  We believe this is the only discrepancy, but we have
not checked systematically.
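The per-year files above can be joined on the shared key to build (text,
volatility) pairs for regression.  A minimal sketch, assuming the logvol
files contain whitespace-separated "key value" lines and that a tokenized
tarball has been unpacked into a directory of key.mda files (the exact file
formats are not specified in this README, so treat both as assumptions):

```python
import os

def load_logvol(path):
    """Parse a yyyy.logvol.*.txt file into {key: float}.
    Assumes each line is 'key value' separated by whitespace."""
    table = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                key, value = parts
                table[key] = float(value)
    return table

def build_pairs(mda_dir, logvol_path):
    """Pair each unpacked MD&A file (key.mda) with its log volatility,
    keeping only keys present in both sources."""
    targets = load_logvol(logvol_path)
    pairs = []
    for name in sorted(os.listdir(mda_dir)):
        key, _, ext = name.partition(".")
        if ext == "mda" and key in targets:
            with open(os.path.join(mda_dir, name)) as f:
                pairs.append((f.read(), targets[key]))
    return pairs
```

For example, build_pairs("1996.tok/", "dist/1996.logvol.+12.txt") would
yield training pairs for predicting the following year's volatility.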