Sunday, April 6, 2014


The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. Clicking a sample query automatically fills in the search form for you, searches the more than 450 million words of text, and displays the results.

COCA is used by tens of thousands of people every month (linguists, teachers, translators, and other researchers).

The corpus contains more than 450 million words of text, divided equally among spoken language, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2012 and is updated regularly (the most recent texts are from summer 2012).

Because it is updated regularly, the corpus is well suited to studying current and ongoing changes in the English language.

The interface allows you to search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these. You can also search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near faint, all adjectives near woman, or all verbs near feelings), which often gives good insight into the meaning and use of a word.
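To make the idea of a collocate window concrete, here is a minimal Python sketch of counting the words that appear near a target word. It is not how COCA itself is implemented. The function name `collocates`, the toy sentence, and the reading of the window as "N tokens to either side" are all assumptions made for illustration.

```python
from collections import Counter

def collocates(tokens, node, window=10):
    """Count words within `window` tokens on either side of each `node` hit.

    Hypothetical helper for illustration; not part of any COCA interface.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts

# Toy text, invented for this example (not corpus data).
text = "she felt faint and heard a faint sound in the faint light"
tokens = text.lower().split()

# A small window keeps the toy output easy to inspect.
near_faint = collocates(tokens, "faint", window=2)
print(near_faint.most_common(3))
```

A real search would also filter the collocates by part of speech (e.g. nouns only), which requires tagged text and is left out of this sketch.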

The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:

-By genre: compare spoken, fiction, popular magazines, newspapers, and academic texts, or even sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorials, or scientific journals

-Over time: compare different years from 1990 to the present
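Comparing frequencies across genres or years is only meaningful when the sections being compared are put on the same scale, typically as a rate per million words. The sketch below shows that normalization in Python. The helper name `per_million` is hypothetical, and the counts and section sizes are invented for illustration; they are not COCA figures.

```python
def per_million(raw_count, section_size):
    """Convert a raw hit count into occurrences per million words."""
    return raw_count * 1_000_000 / section_size

# Invented toy numbers, NOT real corpus data.
toy_counts = {"spoken": 120, "academic": 30}
toy_sizes = {"spoken": 2_000_000, "academic": 1_000_000}

normalized = {
    genre: per_million(toy_counts[genre], toy_sizes[genre])
    for genre in toy_counts
}

for genre, rate in normalized.items():
    print(f"{genre}: {rate:.1f} per million words")
```

The same calculation applies to a comparison over time: replace the genre labels with years and use each year's word total as the section size.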

You can also easily carry out semantically based queries of the corpus. For example, you can compare and contrast the collocates of two related words (little/small, smart/stupid, men/women) to determine how they differ in meaning or use.
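One simple way to picture such a comparison is to rank each word's collocates by how much more often they occur with that word than with the other. The Python sketch below does this with a smoothed ratio. It is one possible approach and is not claimed to be the method the corpus interface uses; the function name `distinctive`, the smoothing constant, and the collocate counts are all invented for illustration.

```python
from collections import Counter

def distinctive(coll_a, coll_b, smoothing=1.0):
    """Rank the collocates of word A by how strongly they prefer A over B.

    The score is a smoothed ratio of counts; smoothing avoids division
    by zero when a collocate never occurs with word B.
    """
    scores = {
        w: (coll_a[w] + smoothing) / (coll_b[w] + smoothing)
        for w in coll_a
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))

# Invented toy counts, NOT real corpus data.
small = Counter({"business": 8, "town": 6, "girl": 1})
little = Counter({"girl": 9, "bit": 7, "town": 2})

top_small = distinctive(small, little)
top_little = distinctive(little, small)
print("small: ", top_small)
print("little:", top_little)
```

Collocates at the top of each list are the ones that best separate the two words; collocates near the bottom are shared or prefer the other word.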

You can find the frequency and distribution of synonyms for nearly 60,000 words, compare their frequency in different genres, and use these word lists as part of other queries. Finally, you can easily create your own lists of semantically related words and use them directly as part of a query.

The corpus was created by Mark Davies of Brigham Young University. Start your five-minute guided tour, which shows the major features of the corpus, here:
