Vol. 1 No. 1 (2022): COMPUTER LINGUISTICS: PROBLEMS, SOLUTIONS, PROSPECTS
Articles

THE EXPLOITATION OF CORPORA IN NATURAL LANGUAGE PROCESSING

Published 2022-05-19

Keywords

  • language analysis
  • human intuition
  • annotation
  • disambiguation

Abstract

One of the first things a natural language processing (NLP) task requires
is a corpus. In linguistics and NLP, a corpus (Latin for "body") is a
collection of texts. Such a collection may consist of texts in a single
language or may span multiple languages; multilingual corpora ("corpora" is
the plural of "corpus") are useful for a number of reasons. Corpora may also
consist of themed texts (historical, Biblical, etc.). Corpora are used
primarily for statistical linguistic analysis and hypothesis testing.
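
As a rough illustration of the statistical analysis mentioned above, the following Python sketch treats a corpus as a plain collection of texts and counts word frequencies across it. The two-sentence toy corpus, the document names, and the helper functions `tokenize` and `word_frequencies` are all hypothetical and are not taken from the article. The regular-expression tokenizer is a deliberate simplification that handles only unaccented Latin letters.

```python
from collections import Counter
import re

# Hypothetical toy corpus: a collection of texts keyed by a document name.
# A real corpus would typically hold thousands of documents or more.
corpus = {
    "doc_en_1": "The cat sat on the mat.",
    "doc_en_2": "The dog chased the cat.",
}


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count how often each token occurs across every text in the collection."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


freqs = word_frequencies(corpus.values())
total = sum(freqs.values())

# Relative frequency: the share of all tokens accounted for by each word.
relative = {word: count / total for word, count in freqs.items()}

for word, count in freqs.most_common(3):
    print(word, count, round(relative[word], 3))
```

Raw and relative frequencies of this kind are one simple starting point for corpus-based analysis. An annotated corpus would usually attach further information to each token, such as part-of-speech tags.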