Grape vine fruit

In this post, I’ll try to explain what a collocation is, how we can retrieve collocations from a text corpus, and how to interpret the results. The theory part of this post is largely inspired by (not to say outrageously copied from!) section 5.3.4 of Foundations of Statistical Natural Language Processing by Manning and Schütze.

After the boring theory, we’ll apply this technique to a French case law dataset. All the code is available in the judilibre-eda git repository.


Definitions

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things.

As Choueka said:

[A collocation is defined as] a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.

Choueka (1988)

Properties of collocation

More precisely, a collocation is a group of words having one or more of the three following properties: non-compositionality, non-substitutability and non-modifiability.

Non-compositionality

Collocations are characterized by limited compositionality in that there is usually an element of meaning added to the combination of the meaning of each part.

Idioms are the most extreme examples of non-compositionality. Idioms like to kick the bucket or to hear it through the grapevine only have an indirect historical relationship to the meanings of the parts of the expression. We are not talking about buckets or grapevines literally when we use these idioms.

Non-substitutability

We cannot substitute near-synonyms for the components of a collocation. For example, we can’t say yellow wine instead of white wine even though yellow is as good a description of the color of white wine as white is (it is kind of a yellowish white).

Non-modifiability

Many collocations cannot be freely modified with additional lexical material or through grammatical transformations. This is especially true for frozen expressions like idioms. For example, we can’t modify frog in to get a frog in one’s throat into to get an ugly frog in one’s throat although usually nouns like frog can be modified by adjectives like ugly. Similarly, going from singular to plural can make an idiom ill-formed, for example in people as poor as church mice.

Types

There are various types of collocations:

  • noun phrases, like strong tea and weapons of mass destruction
  • phrasal verbs (verb-particle constructions), like to make up: an important part of the lexicon of English, they combine a main verb and a particle, often correspond to a single lexeme in other languages, and the two words are often non-adjacent
  • light verbs, like make a decision or do a favor. There is hardly anything about the meaning of make, take or do that would explain why we have to say make a decision instead of take a decision and do a favor instead of make a favor
  • stock phrases, like the rich and powerful
  • subtle and not-easily-explainable patterns of word usage, like a stiff breeze, or broad daylight
  • phrasemes or set phrases: multi-word utterances in which at least one component is selectionally constrained or restricted by linguistic convention, such that it is not freely chosen
  • idiomatic phrases or idioms: completely frozen expressions, much like proper nouns
    • proper nouns (proper names): quite different from lexical collocations, but usually included
    • sayings and proverbs: figures of speech, fixed expressions
  • terminological expressions: groups of words used in technical domains that are often compositional, but may have to be treated consistently for certain NLP tasks such as translation

Applications in NLP

Collocations are important for a number of applications:

  • natural language generation, to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided
  • computational lexicography, to automatically identify the important collocations to be listed in a dictionary entry
  • tokenization and parsing, so that preference can be given to parses that contain natural collocations
  • corpus linguistic research, the study of social phenomena like the reinforcement of cultural stereotypes through language (Stubbs 1996)

Co-occurrence vs. collocation

In linguistics, co-occurrences or term associations are combinations of words that are strongly associated with each other but do not necessarily occur in a common grammatical unit or in a particular order. In other words, co-occurrence is an extension of word counting to higher dimensions: the co-presence of more than one word/token within the same contextual window has to be statistically significant.

When a semantic or grammatical dependency between two words is established, we call the pair a collocation. Collocation is thus a special case of co-occurrence, as opposed to “free phrases”, in which all of the members are chosen freely, based exclusively on their meaning and the message that the speaker wishes to communicate.

Co-occurrence and semantic field

Co-occurrence can be interpreted as an indicator of semantic proximity. When two or more words have a semantic relationship, the notion of co-occurrence is at the base of the concepts of theme, semantic field and isotopy. It is a more general association of words that are likely to be used in the same context.

In semantics and semiotics, isotopy is the redundancy of elements in a corpus that makes it possible to understand it. For example, the redundancy of the first person (I) makes it easy to understand that the same person is speaking. Redundancy within the same semantic field enables us to understand that we are talking about the same theme.

Principal approaches to finding collocations

There are various and complementary ways to look for collocations:

  • selection of collocations by frequency
  • selection based on mean and variance of the distance between focal word and collocating word
  • hypothesis testing
  • mutual information

In the remaining parts of this post, we are going to focus on hypothesis testing and more specifically on the likelihood ratio test.
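As a quick aside, the first and simplest approach, selection by raw frequency, can be sketched in a few lines of Python. This is only an illustrative sketch (the tokenization and the function name are my own assumptions, not code from the judilibre-eda repository):

```python
from collections import Counter

def frequent_bigrams(tokens, top_n=10):
    """Count adjacent token pairs and return the most frequent ones."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return bigram_counts.most_common(top_n)

# Raw frequency mostly surfaces pairs of function words, which is why it is
# usually combined with stop-word filtering, part-of-speech patterns, or a
# statistical test such as the likelihood ratio used below.
tokens = "the court of appeal held that the court of appeal erred".split()
print(frequent_bigrams(tokens, top_n=3))
```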

Dunning likelihood collocation score

Likelihood ratios are more appropriate for sparse data than the \(\chi^2\) test. But they also have the advantage that the statistic we are computing, a likelihood ratio, is more interpretable than the \(\chi^2\) statistic.

At the end, the likelihood ratio is simply a number that tells us how much more likely one hypothesis is than the other.

In applying the likelihood ratio test to collocation discovery, we examine the following two alternative explanations for the occurrence frequency of a bigram \(w_1 w_2\) (Dunning 1993)

  • The first hypothesis, \(H_1 : Pr(w_2 \vert w_1) = p = Pr(w_2 \vert \bar{w_1} )\), is a formalization of independence: the occurrence of \(w_2\) does not depend on the previous occurrence of \(w_1\)
  • The second hypothesis, \(H_2 : Pr(w_2 \vert w_1) = p_1 \ne p_2 = Pr(w_2 \vert \bar{w_1} )\), is a formalization of dependence, which is good evidence for an interesting collocation

Collocation spaces

Given \(c_1 = \vert w_1 \vert\), \(c_2 = \vert w_2 \vert\) and \(c_{12} = \vert w_1 w_2 \vert\), the numbers of occurrences of the corresponding word or bigram, and \(N = \vert \Omega \vert\), the total number of tokens/words in the corpus, we define the probability of seeing \(w_2\) right after \(w_1\), and the probability of seeing \(w_2\) after any other word:

  • \(Pr(w_2 \vert w_1) = p_1 = { c_{12} \over c_1 }\),
  • \(Pr(w_2 \vert \bar{w_1} ) = p_2 = { c_2 - c_{12} \over N - c_1 }\).

The next table shows, under each hypothesis, the probability of seeing \(w_2\) right after \(w_1\) and the probability of seeing \(w_2\) after a word other than \(w_1\).

|  | \(Pr(w_2 \vert w_1)\) | \(Pr(w_2 \vert \bar{w_1} )\) |
|---|---|---|
| \(H_1\) | \(p_1 = p = { c_2 \over N }\) | \(p_2 = p = { c_2 \over N }\) |
| \(H_2\) | \(p_1 = { c_{12} \over c_1 }\) | \(p_2 = { c_2 - c_{12} \over N - c_1 }\) |
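To make the table concrete, here is a minimal sketch computing \(p\), \(p_1\) and \(p_2\) from raw counts (the counts below are made up for illustration; they do not come from the Judilibre corpus):

```python
# Made-up counts: c1 = |w1|, c2 = |w2|, c12 = |w1 w2|, N = corpus size in tokens
c1, c2, c12, N = 948, 1230, 900, 250_000

p = c2 / N                   # H1: w2 is equally likely after w1 and after any other word
p1 = c12 / c1                # H2: probability of w2 right after w1
p2 = (c2 - c12) / (N - c1)   # H2: probability of w2 after a word other than w1

print(p, p1, p2)
```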

Under the first hypothesis, the likelihood of observing the data given \(H_1\) is the product of the likelihood of observing \(c_{12}\) occurrences of \(w_2\) among the \(c_1\) positions following \(w_1\) with the likelihood of observing \(c_2 - c_{12}\) occurrences of \(w_2\) among the \(N - c_1\) positions following other words:

\[L(H_1) = Pr( w_2 \vert w_1, H_1) \times Pr( w_2 \vert \bar{w_1}, H_1)\]

We note \(b()\) the probability mass function of the binomial distribution:

\[b(k;n,p) = \binom{n}{k} p^k (1-p)^{n-k}\]

So that the likelihood of \(H_1\) can be written:

\[L(H_1) = b(c_{12};c_1,p) \times b(c_2 - c_{12};N - c_1,p)\]
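This can be transcribed directly into code, as in the sketch below with small, toy counts (my own illustration, not the repository’s implementation). Note that for corpus-sized counts the raw binomial probabilities underflow or overflow, which is why the actual score is computed in log space further down:

```python
from math import comb

def binom_pmf(k, n, p):
    """b(k; n, p) = C(n, k) * p^k * (1 - p)^(n - k)"""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Toy counts, small enough for the raw pmf not to underflow.
c1, c2, c12, N = 20, 30, 15, 1000
p = c2 / N

L_H1 = binom_pmf(c12, c1, p) * binom_pmf(c2 - c12, N - c1, p)
print(L_H1)
```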

Similarly, the likelihood of observing the data given \(H_2\) is the product of the same two binomial terms, this time with the success probabilities \(p_1\) and \(p_2\):

\[L(H_2) = Pr( w_2 \vert w_1, H_2) \times Pr( w_2 \vert \bar{w_1}, H_2)\]

\[L(H_2) = b(c_{12};c_1,p_1) \times b(c_2 - c_{12};N - c_1,p_2)\]

The likelihood ratio \(\lambda\) is the quotient of the likelihoods:

\[\lambda = { L(H_1) \over L(H_2) }\]

Generally, if \(\lambda \gt 1\) the first hypothesis is more likely than the second. For instance, if \(\lambda = 4\) then \(H_1\) is four times more likely than \(H_2\). In our case, the first hypothesis is much less likely than the second one, so \(\lambda\) is a very small number. It is therefore common to take the logarithm of the ratio, scaled by a negative constant: \(-2 \log(\lambda)\). The larger this score, the stronger the evidence for a collocation.
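Putting the pieces together, here is a minimal, self-contained sketch of the \(-2 \log(\lambda)\) score. It works in log space to avoid the numerical issues mentioned above; it is one possible implementation of the formula, not necessarily what collocation_2 does internally:

```python
from math import lgamma, log

def log_binom_pmf(k, n, p):
    """log b(k; n, p), computed in log space for numerical stability."""
    if p <= 0.0 or p >= 1.0:
        # Degenerate cases: the pmf is 1 when the outcome is forced, 0 otherwise.
        return 0.0 if (p <= 0.0 and k == 0) or (p >= 1.0 and k == n) else float("-inf")
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)

def llr_score(c1, c2, c12, N):
    """Dunning's -2 log(lambda) score for the bigram w1 w2."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    log_L_H1 = log_binom_pmf(c12, c1, p) + log_binom_pmf(c2 - c12, N - c1, p)
    log_L_H2 = log_binom_pmf(c12, c1, p1) + log_binom_pmf(c2 - c12, N - c1, p2)
    return -2 * (log_L_H1 - log_L_H2)

# Made-up counts, for illustration only
print(llr_score(c1=948, c2=1230, c12=900, N=250_000))
```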

Advantages of likelihood ratio

One advantage of likelihood ratios is that they have a clear intuitive interpretation. They can also be more appropriate for sparse data than the \(\chi^2\) test.

For hypothesis testing, if \(\lambda\) is a likelihood ratio of a particular form, then the quantity \(-2 \log(\lambda)\) is asymptotically \(\chi^2\) distributed when the data sample is large enough. So, given that \(-2 \log(\lambda) \sim \chi^2\), we can test the null hypothesis \(H_1\) against the alternative hypothesis \(H_2\).
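In practice, the p-value of the test is the tail probability of the \(\chi^2\) distribution with one degree of freedom evaluated at the score. A short sketch using SciPy (the score value is taken from the cour d’appel result shown later in the post):

```python
from scipy.stats import chi2

# -2 log(lambda) is asymptotically chi2-distributed with 1 degree of freedom,
# so the p-value is the chi2 survival function evaluated at the score.
score = 14779.66  # score of the bigram "cour d'appel", reported below
p_value = chi2.sf(score, df=1)
print(p_value)  # effectively 0 for such a large score
```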

In the next section we are going to apply the likelihood ratio to find collocations in a case law corpus.

Sample use case on Judilibre Open Data

If you are interested in the data extraction, you may want to have a look at the Python-generated Swagger client available here.

The French Court of Cassation initiated the JUDILIBRE project, which aims at designing and developing in house a search engine over its case law corpus and at making it available to the public.

After extracting an arbitrary part of the corpus through the search engine API, I persisted the results in a JSON Lines file.

Word Cloud

Just a quick look at a word cloud illustration of the dataset:

Corpus Word Cloud

Bigram collocation

The collocation_2 method enables us to find collocations and rank them by the log-likelihood-ratio score \(-2 \log(\lambda)\):

collocation_2(judilibre_text, method="llr", stop_words=stop_words)
{"cour d'appel": 14779.656345618061,
 'code civil': 9034.437842527477,
 'dès lors': 8061.607047568378,
 'bon droit': 2470.063562327007,
 'procédure civile': 2074.7855327925286,
 'peut être': 1936.8836188208438,
 'doit être': 1856.4831975258871,
 "d'un immeuble": 1694.5615662280786,
 'chose jugée': 1512.3425659159825,
 'après avoir': 1478.3002163179476,
 'condition suspensive': 1253.7851148695208,
 'rédaction antérieure': 1221.519857102709,
 'base légale': 1133.1024031677157,
 'sous seing': 1120.8378898578583,
 'seing privé': 1114.6228341225883,
 "d'autre part": 1037.4683304932462,
 "qu'une cour": 990.3205509155646,
 "cassation l'arrêt": 977.635114391793,
 "l'acte authentique": 969.402554409081,
 "d'un acte": 947.6827965402367}

The bigram cour d’appel gets a score \(-2 \log(\lambda) \approx 14779.7\), which means that \(H_2\) (d’appel is much more likely to follow cour than its base rate of occurrence would suggest) is about \(e^{14779.7 / 2}\) times more likely than \(H_1\): an astronomically large factor.

Pointwise mutual information

It may be interesting to look at the collocations ranked by pointwise mutual information, another association measure (an information-theoretic one rather than a hypothesis test):

collocation_2(judilibre_text, method="pmi", stop_words=stop_words)
{'bonnes moeurs': 17.259674311869706,
 "d'échelle mobile": 17.259674311869706,
 "donneur d'aval": 17.259674311869706,
 'maniere fantaisiste': 17.259674311869706,
 "pétition d'hérédité": 17.259674311869706,
 'simulations chiffrées': 17.259674311869706,
 'trimestre echu': 17.259674311869706,
 'viciait fondamentalement': 17.259674311869706,
 '1035 1036': 16.674711811148548,
 '13-18 383': 16.674711811148548,
 '757 758-6': 16.674711811148548,
 'associations syndicales': 16.674711811148548,
 'coemprunteurs souscrivent': 16.674711811148548,
 'collectivités territoriales': 16.674711811148548,
 'dissimulée derrière': 16.674711811148548,
 "désirant l'acquérir": 16.674711811148548,
 'endettement croissant': 16.674711811148548,
 'huis clos': 16.674711811148548,
 'mètre carré': 16.674711811148548,
 'potentiellement significatives': 16.674711811148548}
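For reference, the pointwise mutual information of a bigram compares its observed joint probability with what independence would predict. Here is a minimal sketch of the textbook formula (not necessarily the exact implementation behind collocation_2):

```python
from math import log2

def pmi(c1, c2, c12, N):
    """Pointwise mutual information of the bigram w1 w2:
    log2( P(w1 w2) / (P(w1) * P(w2)) ), estimated from raw counts."""
    return log2((c12 / N) / ((c1 / N) * (c2 / N)))

# PMI strongly favours rare combinations: bigrams with the same count profile
# get exactly the same score, which explains the ties in the ranking above.
print(pmi(c1=2, c2=2, c12=2, N=300_000))  # made-up counts
```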

P-value

If we want a more detailed output we can use the detailed_collocation_2 function:

|   | \(w_1\) | \(\vert w_1 \vert\) | \(w_2\) | \(\vert w_2 \vert\) | score | p-value |
|---|---------|---------------------|---------|---------------------|-------|---------|
| 0 | cour | 948 | d’appel | 1230 | 14779.7 | 0 |
| 1 | code | 1051 | civil | 830 | 9034.44 | 0 |
| 2 | dès | 477 | lors | 773 | 8061.61 | 0 |
| 3 | bon | 205 | droit | 972 | 2470.06 | 0 |
| 4 | procédure | 496 | civile | 371 | 2074.79 | 0 |
| 5 | peut | 817 | être | 952 | 1936.88 | 0 |
| 6 | doit | 368 | être | 952 | 1856.48 | 0 |
| 7 | d’un | 1879 | immeuble | 261 | 1694.56 | 0 |
| 8 | chose | 157 | jugée | 110 | 1512.34 | 0 |
| 9 | après | 296 | avoir | 326 | 1478.3 | 0 |
| 10 | condition | 175 | suspensive | 73 | 1253.79 | 0 |
| 11 | rédaction | 189 | antérieure | 132 | 1221.52 | 0 |
| 12 | base | 99 | légale | 156 | 1133.1 | 0 |
| 13 | sous | 247 | seing | 67 | 1120.84 | 0 |
| 14 | seing | 67 | privé | 84 | 1114.62 | 0 |
| 15 | d’autre | 69 | part | 210 | 1037.47 | 0 |
| 16 | qu’une | 302 | cour | 948 | 990.321 | 0 |
| 17 | cassation | 186 | l’arrêt | 329 | 977.635 | 0 |
| 18 | l’acte | 647 | authentique | 227 | 969.403 | 0 |
| 19 | d’un | 1879 | acte | 538 | 947.683 | 0 |
| 20 | justifie | 38 | légalement | 139 | 917.195 | 0 |
| 21 | viole | 89 | l’article | 1901 | 914.622 | 0 |
| 22 | société | 411 | civile | 371 | 866.908 | 0 |
| 23 | officier | 63 | public | 217 | 851.046 | 0 |
| 24 | acte | 538 | authentique | 227 | 786.687 | 0 |
| 25 | cet | 322 | acte | 538 | 766.902 | 0 |
| 26 | cet | 322 | officier | 63 | 760.165 | 0 |
| 27 | régime | 191 | matrimonial | 54 | 736.566 | 0 |
| 28 | bonne | 48 | foi | 86 | 734.162 | 0 |
| 29 | sécurité | 49 | sociale | 63 | 732.558 | 0 |
For example, we can look up the value of sécurité sociale in the table and reject \(H_1\) for this bigram at the 0.005 significance level, since its score (732.558) is far above the critical \(\chi^2\) value for one degree of freedom at that level, which is 7.88.
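The critical value itself can be checked quickly with SciPy (the score is the one reported for sécurité sociale in the table above):

```python
from scipy.stats import chi2

critical_value = chi2.ppf(1 - 0.005, df=1)  # ~7.879, critical value at the 0.005 level
score = 732.558                             # "sécurité sociale" from the table above
print(critical_value, score > critical_value)  # True: we can reject H1 for this bigram
```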

Sources