AQMAR Arabic Wikipedia Named Entity Corpus

This dataset contains text extracted from a small corpus of Arabic Wikipedia articles and hand-annotated for named entities. It is described in the paper

    Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer, and Noah A. Smith (2012), Recall-Oriented Learning of Named Entities in Arabic Wikipedia. Proceedings of EACL.

and can be downloaded at http://www.ark.cs.cmu.edu/AQMAR/

This dataset is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License (see LICENSE).

The 28 articles are listed below, organized by domain. Each article was tagged by one of two annotators. Annotators were encouraged to devise up to three article-specific entity classes to supplement the traditional four (PERson, ORGanization, LOCation, and generic MIScellaneous); these custom categories are summarized below each article.

Each data file consists of one line per token; each line has the Arabic token (UTF-8 encoding) followed by the BIO tag (e.g., B-MIS0 for the first word of a generic miscellaneous entity mention).

To see counts of entity positions (B, I, O) in each article, you can use the Bash command:

    for f in *.txt; do
        paste <(cut -d' ' -f2 "$f" | grep '^B' | wc -l) \
              <(cut -d' ' -f2 "$f" | grep '^I' | wc -l) \
              <(cut -d' ' -f2 "$f" | grep '^O' | wc -l) \
              <(echo "$f")
    done

See the featureFiles subdirectory and its README for the featurized versions of these articles that were used in experiments.


HISTORY

Crusades.txt
    MIS-1: Name of wars
Damascus.txt
Ibn_Tolun_Mosque.txt
Imam_Hussein_Shrine.txt
Islamic_Golden_Age.txt
    MIS-1: Name of an era: Middle Ages, Renaissance
Islamic_History.txt
    MIS-1: Name of wars
Ummaya_Mosque.txt


SCIENCE

Atom.txt
    MIS-1: Name of particles (e.g. neutron)
    MIS-2: Name of theories (e.g. Dalton Theory)
    MIS-3: Name of chemical elements (e.g. Uranium)
Enrico_Fermi.txt
    MIS-1: Name of chemical elements (e.g. Uranium)
    MIS-2: English entities (i.e. written in Latin characters)
Light.txt
    MIS-1: Kind of radiation (Electromagnetic)
Nuclear_Power.txt
    MIS-1: Name of chemical elements (e.g. Uranium)
    MIS-2: English entities
Periodic_Table.txt
    MIS-1: Name of chemical elements (e.g. Uranium)
    MIS-2: Name of theories (e.g. Dalton Theory)
    MIS-3: Name of particles (e.g. neutron)
Physics.txt
    MIS-1: Science names (e.g. Astronomy)
    MIS-2: Names of theories and formulas (e.g. Kepler laws)
Razi.txt
    MIS-1: Book titles (e.g. Spiritual Medicine)


SPORTS

Christiano_Ronaldo.txt
    MIS-1: Name of championships (e.g. European Cup)
Football.txt
    MIS-1: Name of sports (e.g. rugby)
    MIS-2: English entities
Portugal_football_team.txt
    MIS-1: Name of championships (e.g. European Cup)
Raul_Gonzales.txt
    MIS-1: Name of championships (e.g. European Cup)
    MIS-2: Name of prizes (e.g. Golden Boot)
Real_Madrid.txt
    MIS-1: Name of championships (e.g. European Cup)
    MIS-2: Spanish entities
Soccer_Worldcup.txt
Summer_Olypics2004.txt
    MIS-1: Name of sport events (e.g. Winter Olympics)


TECHNOLOGY

Computer_Software.txt
    MIS-1: English entities
    MIS-2: Name of software (e.g. Microsoft Word)
    MIS-3: Name of computer component (e.g. CPU)
Computer.txt
    MIS-1: English entities
    MIS-2: Name of computer component (e.g. CPU)
    MIS-3: Name of computer types (e.g. microcomputer)
Internet.txt
    MIS-1: Name of network concepts: protocol, Internet
    MIS-2: English entities
Linux.txt
    MIS-1: Name of software or hardware (e.g. Emacs, Linux)
Richard_Stallman.txt
    MIS-1: Name of software or hardware (e.g. Emacs, Kernel)
    MIS-2: English entities
Solaris.txt
    MIS-1: English entities
    MIS-2: Name of software or hardware (e.g. Emacs, Solaris)
X_window_system.txt
    MIS-1: Name of software or hardware (e.g. Emacs, Kernel)
    MIS-2: English entities
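

Because every entity mention begins with exactly one B- tag, counting B- tags per class gives a per-class mention count for an article, which can help when checking how often an article's custom classes were actually used. The following is a minimal sketch, not part of the original distribution: the function name count_mentions is hypothetical, and it assumes that the tag is the last whitespace-separated field on each line and that class labels are whatever follows the "B-" prefix in the file.

```shell
# Hypothetical helper: print "CLASS COUNT" for each entity class in one
# data file, counting one mention per B- tag.
# Assumes one "token tag" pair per line with the tag as the last field.
count_mentions() {
    awk '$NF ~ /^B-/ { n[substr($NF, 3)]++ }          # strip the "B-" prefix
         END { for (c in n) print c, n[c] }' "$1" | sort
}

# Usage (any article file from the listing), e.g.:
#   count_mentions Atom.txt
```

The output is sorted by class label so that results from different articles line up when compared side by side.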