Buckwalter Arabic Morphological Analyzer Version 1.0


Introduction

This file contains documentation on the Buckwalter Arabic Morphological Analyzer Version 1.0, Linguistic Data Consortium (LDC) catalog number LDC2002L49 and ISBN 1-58563-257-0.

The Buckwalter Arabic Morphological Analyzer performs morphological analysis and part-of-speech (POS) tagging of Arabic text.

Data

The data consists primarily of three Arabic-English lexicon files: prefixes (299 entries), suffixes (618 entries), and stems (82158 entries representing 38600 lemmas). The lexicons are supplemented by three morphological compatibility tables that control which prefix-stem combinations (1648 entries), stem-suffix combinations (1285 entries), and prefix-suffix combinations (598 entries) are allowed. The code for morphological analysis and POS tagging is contained in a Perl script. The documentation consists of a readme file that describes the lexicon files, the morphological compatibility tables, and the morphological analysis algorithm, and that includes a summary of stem morphological categories and a table of the author's Arabic transliteration system. For more detailed information about the data, please see the original documentation file readme.txt.
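
As a rough illustration of how three lexicons and three compatibility tables can work together, the sketch below splits a transliterated word into prefix, stem, and suffix, and keeps only the splits whose categories are allowed by all three tables. It is a minimal sketch in Python, not the distributed Perl script: the lexicon entries, the category names, the table contents, and the function name analyze are all hypothetical placeholders and are not taken from the distributed files. The authoritative description of the algorithm is in readme.txt.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Toy lexicons: surface form -> list of morphological categories.
# Every entry and category name below is a hypothetical placeholder,
# NOT copied from the distributed lexicon files.
PREFIXES: Dict[str, List[str]] = {"": ["Pref-0"], "wa": ["Pref-Wa"]}
STEMS: Dict[str, List[str]] = {"kitAb": ["Noun"], "katab": ["Verb-Perf"]}
SUFFIXES: Dict[str, List[str]] = {"": ["Suff-0"], "a": ["Verb-Suff-a"]}

# Toy compatibility tables: sets of allowed category pairs (also hypothetical).
PREFIX_STEM: Set[Tuple[str, str]] = {
    ("Pref-0", "Noun"), ("Pref-0", "Verb-Perf"),
    ("Pref-Wa", "Noun"), ("Pref-Wa", "Verb-Perf"),
}
STEM_SUFFIX: Set[Tuple[str, str]] = {
    ("Noun", "Suff-0"), ("Verb-Perf", "Verb-Suff-a"),
}
PREFIX_SUFFIX: Set[Tuple[str, str]] = {
    ("Pref-0", "Suff-0"), ("Pref-0", "Verb-Suff-a"),
    ("Pref-Wa", "Suff-0"), ("Pref-Wa", "Verb-Suff-a"),
}


def analyze(word: str) -> List[Tuple[str, str, str]]:
    """Return every (prefix, stem, suffix) split accepted by all three tables."""
    results: List[Tuple[str, str, str]] = []
    n = len(word)
    for i in range(n + 1):              # end of prefix (prefix may be empty)
        for j in range(i + 1, n + 1):   # end of stem (stem is never empty)
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
                continue
            # Accept the split if any category triple passes all three checks.
            for p, s, x in product(PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]):
                if ((p, s) in PREFIX_STEM
                        and (s, x) in STEM_SUFFIX
                        and (p, x) in PREFIX_SUFFIX):
                    results.append((prefix, stem, suffix))
                    break
    return results


if __name__ == "__main__":
    for w in ("wakataba", "kitAb", "kitAba"):
        print(w, analyze(w))
```

With this toy data, "kitAba" yields no analysis: the stem and suffix are both in the lexicons, but the stem-suffix table does not allow a noun stem with a verbal suffix. This is the kind of filtering the compatibility tables are described as providing.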

Please see file.tbl for the directory structure of this publication, as well as a complete list of files.

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus, LDC2002L49.

Content Copyright

Portions © 2002 QAMUS LLC (www.qamus.org), © 2002 Trustees of the University of Pennsylvania

The Linguistic Data Consortium is releasing this software under the GNU General Public License; organizations interested in licensing the lexicon and/or morphological analyzer for commercial use should contact:
QAMUS LLC
7010 NE Dolphin Dr
Bainbridge Island, WA 98110-1050
(206) 855-9608
ATTN: Tim Buckwalter


Contact: ldc@ldc.upenn.edu
© 2002 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.