10-K Corpus
-----------

http://www.ark.cs.cmu.edu/10K

Version 1.0 released March 31, 2009.  Last addendum: September 18, 2009.

If you publish research based on these data, please cite the following
paper:

  Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and
  Noah A. Smith.  Predicting Risk from Financial Reports with
  Regression.  In Proceedings of the North American Chapter of the
  Association for Computational Linguistics Human Language Technologies
  Conference (NAACL-HLT), May-June 2009.
  http://www.cs.cmu.edu/~nasmith/papers/kogan+levin+routledge+sagi+smith.naacl09.pdf

More details about this corpus can be found in the paper.

The corpus contains 10-K reports from many US companies during the
years 1996-2006, as well as the measured volatility of stock returns
for the twelve-month periods preceding and following each report.

The data are organized by the year of the report.  For year yyyy,
there are several files:

  dist/yyyy.full.tgz        - the original 10-K reports (named key.txt)
  dist/yyyy.mda.tgz         - the MD&A sections from the 10-Ks (named key.mda)
  dist/yyyy.tok.tgz         - the tokenized MD&A sections (named key.mda)

(The files in the above tarballs are similarly named; the string up to
the "." is a unique key for the report.)

  dist/yyyy.logvol.-12.txt  - maps key to log volatility in the preceding year
  dist/yyyy.logvol.+12.txt  - maps key to log volatility in the following year
  dist/yyyy.meta.txt        - maps key to a date (yyyymmdd format), URL,
                              company name, and SEC code

There are two Perl scripts:

  extract_MDA.pl    - extracts the MD&A section from full 10-K reports
  tokenize_new.pl   - does strict tokenization as in the NAACL-HLT 2009 paper

Note (9/18/2009): some discrepancies between the tokenization script
and the data have been found, at least having to do with whether
hyphenated words are split by whitespace.  (The distributed script
*does* split words at hyphens; the released data appear not to have
been so split.)  We believe this is the only discrepancy, but we have
not checked systematically.
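The per-year files above can be joined on the shared key to build (text,
volatility) pairs for regression.  A minimal sketch, assuming the logvol
files contain whitespace-separated "key value" lines and that a tokenized
tarball has been unpacked into a directory of key.mda files (the exact file
formats are not specified in this README, so treat both as assumptions):

```python
import os

def load_logvol(path):
    """Parse a yyyy.logvol.*.txt file into {key: float}.
    Assumes each line is 'key value' separated by whitespace."""
    table = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                key, value = parts
                table[key] = float(value)
    return table

def build_pairs(mda_dir, logvol_path):
    """Pair each unpacked MD&A file (key.mda) with its log volatility,
    keeping only keys present in both sources."""
    targets = load_logvol(logvol_path)
    pairs = []
    for name in sorted(os.listdir(mda_dir)):
        key, _, ext = name.partition(".")
        if ext == "mda" and key in targets:
            with open(os.path.join(mda_dir, name)) as f:
                pairs.append((f.read(), targets[key]))
    return pairs
```

For example, build_pairs("1996.tok/", "dist/1996.logvol.+12.txt") would
yield training pairs for predicting the following year's volatility.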