How to get collocations with the Dunning likelihood ratio?
In this post, I’ll try to explain what a collocation is, how we can retrieve collocations from a text corpus, and how to interpret the results. The theory in this post is largely inspired by (not to say outrageously copied from!) section 5.3.4 of Foundations of Statistical Natural Language Processing by Manning and Schütze.
After the boring theory, we’ll apply this technique to a French case law dataset. All the code is available in the following git repository: judilibre-eda.
Table of contents
Definitions
A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things.
As Choueka said:
[A collocation is defined as] a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.
Choueka (1988)
Properties of collocation
More precisely, a collocation is a group of words having one or more of the following three properties: non-compositionality, non-substitutability and non-modifiability.
Non-compositionality
Collocations are characterized by limited compositionality in that there is usually an element of meaning added to the combination of the meaning of each part.
Idioms are the most extreme examples of non-compositionality. Idioms like to kick the bucket or to hear it through the grapevine only have an indirect historical relationship to the meanings of the parts of the expression. We are not talking about buckets or grapevines literally when we use these idioms.
Non-substitutability
We cannot substitute near-synonyms for the components of a collocation. For example, we can’t say yellow wine instead of white wine even though yellow is as good a description of the color of white wine as white is (it is kind of a yellowish white).
Non-modifiability
Many collocations cannot be freely modified with additional lexical material or through grammatical transformations. This is especially true for frozen expressions like idioms. For example, we can’t modify frog in to get a frog in one’s throat into to get an ugly frog in one’s throat although usually nouns like frog can be modified by adjectives like ugly. Similarly, going from singular to plural can make an idiom ill-formed, for example in people as poor as church mice.
Types
There are various types of collocations:
- noun phrases, like strong tea and weapons of mass destruction
- phrasal verbs, like to make up: verb-particle constructions are an important part of the lexicon of English; they combine a main verb and a particle, often correspond to a single lexeme in other languages, and the two words are often non-adjacent
- light verbs, like make a decision or do a favor. There is hardly anything about the meaning of make, take or do that would explain why we have to say make a decision instead of take a decision and do a favor instead of make a favor
- stock phrases, like the rich and powerful
- subtle and not-easily-explainable patterns of word usage, like a stiff breeze, or broad daylight
- phrasemes or set phrases, multi-word utterances in which at least one component is selectionally constrained or restricted by linguistic convention, so that it is not freely chosen
- idiomatic phrase or idiom, completely frozen expressions, like proper nouns
- proper nouns and proper names, quite different from lexical collocations but usually included
- sayings or proverbs, figures of speech and fixed expressions
- terminological expressions, like groups of words in technical domains that are often compositional but may have to be treated consistently for certain NLP tasks such as translation
Applications in NLP
Collocations are important for a number of applications:
- natural language generation, to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided
- computational lexicography, to automatically identify the important collocations to be listed in a dictionary entry
- parsing, so that preference can be given to parses containing natural collocations
- corpus linguistic research, the study of social phenomena like the reinforcement of cultural stereotypes through language (Stubbs 1996)
Co-occurrence vs. Collocation
In linguistics, co-occurrences or term associations are groups of words that are strongly associated with each other but do not necessarily occur in a common grammatical unit or in a particular order. In other words, co-occurrence is an extension of word counting to higher dimensions: the co-presence of more than one word/token within the same contextual window has to be statistically significant.
When a semantic or grammatical dependency between two words can be established, we call it a collocation. Collocation is thus a special case of co-occurrence, as opposed to “free phrases”, where all of the members are chosen freely, based exclusively on their meaning and the message that the speaker wishes to communicate.
Co-occurrence and semantic field
Co-occurrence can be interpreted as an indicator of semantic proximity. When two or more words have a semantic relationship, the notion of co-occurrence underlies those of theme, semantic field and isotopy. It is a more general association of words that are likely to be used in the same context.
In semantics and semiotics, isotopy is the redundancy of elements in a corpus that makes it possible to understand it. For example, the redundancy of the first person (I) makes it easy to understand that the same person is talking. Redundancy within the same semantic field enables us to understand that the text is about the same theme.
Principal approaches of finding collocations
There are various and complementary ways to look for collocations:
- selection of collocations by frequency
- selection based on mean and variance of the distance between focal word and collocating word
- hypothesis testing
- mutual information
In the remaining parts of this post, we are going to focus on hypothesis testing and more specifically on the likelihood ratio test.
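Before moving on, the simplest of these approaches, selection by frequency, can be sketched in a few lines. This is a toy illustration with my own function name, not the post's actual pipeline:

```python
from collections import Counter

def bigram_frequencies(tokens):
    """Count adjacent word pairs: the crudest collocation detector."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "the court of appeal confirmed the court of appeal ruling".split()
top = bigram_frequencies(tokens).most_common(2)
# ("court", "of") and ("of", "appeal") each occur twice
```

Raw frequency mostly surfaces pairs of very common words ("of the", "in the"), which is why the more principled methods below are needed.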
Dunning likelihood collocation score
Likelihood ratios are more appropriate for sparse data than the \(\chi^2\) test. But they also have the advantage that the statistic we are computing, a likelihood ratio, is more interpretable than the \(\chi^2\) statistic.
At the end, the likelihood ratio is simply a number that tells us how much more likely one hypothesis is than the other.
In applying the likelihood ratio test to collocation discovery, we examine the following two alternative explanations for the occurrence frequency of a bigram \(w_1 w_2\) (Dunning 1993):
- First hypothesis is \(H_1 : Pr(w_2 \vert w_1) = p = Pr(w_2 \vert \bar{w_1} )\)
- Second hypothesis is \(H_2 : Pr(w_2 \vert w_1) = p_1 \ne p_2 = Pr(w_2 \vert \bar{w_1} )\)
Given \(c_1 = \vert w_1 \vert\), \(c_2 = \vert w_2 \vert\) and \(c_{12} = \vert w_1 w_2 \vert\), the numbers of occurrences of the corresponding words and bigram, and \(N = \vert \Omega \vert\), the total number of tokens/words in the corpus, we define the probability of having these two words adjacent, and the probability of the opposite:
- \(Pr(w_2 \vert w_1) = p_1 = { c_{12} \over c_1 }\),
- \(Pr(w_2 \vert \bar{w_1} ) = p_2 = { c_2 - c_{12} \over N - c_1 }\).
The next table shows, under each hypothesis, the probability of having word \(w_2\) adjacent to \(w_1\) and the probability of having word \(w_2\) not adjacent to \(w_1\).

| | \(Pr(w_2 \vert w_1)\) | \(Pr(w_2 \vert \bar{w_1} )\) |
|---|---|---|
| \(H_1\) | \(p_1=p={ c_2 \over N }\) | \(p_2=p= { c_2 \over N }\) |
| \(H_2\) | \(p_1={ c_{12} \over c_1 }\) | \(p_2= { c_2 - c_{12} \over N - c_1 }\) |
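The counts-to-probabilities step can be sketched in Python. The function name and the counts are hypothetical, just for illustration:

```python
def bigram_probs(c1, c2, c12, N):
    """Probabilities used by the two hypotheses.

    c1, c2 -- occurrences of w1 and w2
    c12    -- occurrences of the bigram w1 w2
    N      -- total number of tokens in the corpus
    """
    p = c2 / N                   # H1: w2 occurs independently of w1
    p1 = c12 / c1                # H2: P(w2 | w1)
    p2 = (c2 - c12) / (N - c1)   # H2: P(w2 | not w1)
    return p, p1, p2

# hypothetical counts, just for illustration
p, p1, p2 = bigram_probs(c1=1000, c2=500, c12=100, N=100_000)
```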
Following the first hypothesis, the likelihood of observing the data given \(H_1\) is the product of the likelihood of observing \(c_{12}\) occurrences out of \(c_1\) with the likelihood of observing \(c_2 - c_{12}\) occurrences out of \(N - c_1\):

\[L(H_1) = Pr( w_2 \vert w_1, H_1) \times Pr( w_2 \vert \bar{w_1}, H_1)\]

We note \(b()\) the probability mass function of the binomial distribution:

\[b(k;n,p) = \binom{n}{k} p^k (1-p)^{n-k}\]

So that the likelihood of \(H_1\) can be written:

\[L(H_1) = b(c_{12};c_1,p) \times b(c_2 - c_{12};N - c_1,p)\]

Similarly, the likelihood of observing the data given \(H_2\) is the product of the likelihood of observing \(c_{12}\) occurrences out of \(c_1\) with the likelihood of observing \(c_2 - c_{12}\) occurrences out of \(N - c_1\):

\[L(H_2) = Pr( w_2 \vert w_1, H_2) \times Pr( w_2 \vert \bar{w_1}, H_2)\]

\[L(H_2) = b(c_{12};c_1,p_1) \times b(c_2 - c_{12};N - c_1,p_2)\]

The likelihood ratio \(\lambda\) is the quotient of the likelihoods:
\[\lambda = { L(H_1) \over L(H_2) }\]

In general, if \(\lambda \gt 1\), the first hypothesis is more likely than the second: if \(\lambda = 4\), then \(H_1\) is four times more likely than \(H_2\). In our case, the first hypothesis is usually much less likely than the second, so \(\lambda\) is a tiny number; it is therefore common to work with the logarithm of the ratio, scaled by a negative constant: \(-2 \log(\lambda)\).
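Putting the formulas together, here is a minimal pure-Python sketch of the score. It uses `math.lgamma` for a numerically stable log binomial coefficient; the function names are my own, not the post's `collocation_2` API:

```python
import math

def log_binom_pmf(k, n, p):
    """log b(k; n, p), the log of the binomial probability mass function."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    log_coef = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coef + k * math.log(p) + (n - k) * math.log(1 - p)

def llr_score(c1, c2, c12, N):
    """Dunning score -2 log(lambda) for the bigram (w1, w2)."""
    p = c2 / N                   # H1: independence
    p1 = c12 / c1                # H2: P(w2 | w1)
    p2 = (c2 - c12) / (N - c1)   # H2: P(w2 | not w1)
    log_l_h1 = log_binom_pmf(c12, c1, p) + log_binom_pmf(c2 - c12, N - c1, p)
    log_l_h2 = log_binom_pmf(c12, c1, p1) + log_binom_pmf(c2 - c12, N - c1, p2)
    return -2 * (log_l_h1 - log_l_h2)
```

When the observed bigram count matches what independence predicts, \(p_1 = p_2 = p\) and the score is 0; the more the bigram exceeds its expected frequency, the larger the score.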
Advantages of likelihood ratio
One advantage of likelihood ratios is that they have a clear intuitive interpretation. They can also be more appropriate for sparse data than the \(\chi^2\) test.
For hypothesis testing, if \(\lambda\) is a likelihood ratio of a particular form, then the quantity \(-2 \log(\lambda)\) is asymptotically \(\chi^2\) distributed when the data sample is large enough. So, given that \(-2 \log(\lambda) \sim \chi^2\), we can test the null hypothesis \(H_1\) against the alternative hypothesis \(H_2\).
In the next section, we are going to apply the likelihood ratio to find collocations in a case law corpus.
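For one degree of freedom, which is our case, the \(\chi^2\) survival function has a closed form in terms of the complementary error function, so the p-value of a score can be computed without any dependency. A sketch, assuming df = 1 and a function name of my own:

```python
import math

def chi2_sf_1df(x):
    """P(X > x) for X chi-square distributed with 1 degree of freedom.

    Uses the identity P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))

# the usual critical values fall out as expected:
# chi2_sf_1df(3.841) ~ 0.05, chi2_sf_1df(7.879) ~ 0.005
```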
Sample use case on Judilibre Open Data
If you are interested in the data extraction, you may want to have a look at the Python-generated Swagger client available here.
The French Court of Cassation initiated the JUDILIBRE project, aimed at the in-house design and development of a search engine over its case law corpus, making it available to the public.
After extracting an arbitrary part of the corpus through the search engine API, I persisted the results in a JSON Lines file.
Word Cloud
Just a quick look at a word cloud illustration of the dataset:
Bigram collocation
The function collocation_2 enables us to find collocations and order them by the negative log-likelihood score:
collocation_2(judilibre_text, method="llr", stop_words=stop_words)
{"cour d'appel": 14779.656345618061,
'code civil': 9034.437842527477,
'dès lors': 8061.607047568378,
'bon droit': 2470.063562327007,
'procédure civile': 2074.7855327925286,
'peut être': 1936.8836188208438,
'doit être': 1856.4831975258871,
"d'un immeuble": 1694.5615662280786,
'chose jugée': 1512.3425659159825,
'après avoir': 1478.3002163179476,
'condition suspensive': 1253.7851148695208,
'rédaction antérieure': 1221.519857102709,
'base légale': 1133.1024031677157,
'sous seing': 1120.8378898578583,
'seing privé': 1114.6228341225883,
"d'autre part": 1037.4683304932462,
"qu'une cour": 990.3205509155646,
"cassation l'arrêt": 977.635114391793,
"l'acte authentique": 969.402554409081,
"d'un acte": 947.6827965402367}
The bigram cour d’appel has a score above 14000, meaning the data are overwhelmingly more likely under the \(H_2\) hypothesis: d’appel is far more likely to follow cour than its base rate of occurrence would suggest.
Pointwise mutual information
It may be interesting to look at the collocations ranked by pointwise mutual information, another association measure:
collocation_2(judilibre_text, method="pmi", stop_words=stop_words)
{'bonnes moeurs': 17.259674311869706,
"d'échelle mobile": 17.259674311869706,
"donneur d'aval": 17.259674311869706,
'maniere fantaisiste': 17.259674311869706,
"pétition d'hérédité": 17.259674311869706,
'simulations chiffrées': 17.259674311869706,
'trimestre echu': 17.259674311869706,
'viciait fondamentalement': 17.259674311869706,
'1035 1036': 16.674711811148548,
'13-18 383': 16.674711811148548,
'757 758-6': 16.674711811148548,
'associations syndicales': 16.674711811148548,
'coemprunteurs souscrivent': 16.674711811148548,
'collectivités territoriales': 16.674711811148548,
'dissimulée derrière': 16.674711811148548,
"désirant l'acquérir": 16.674711811148548,
'endettement croissant': 16.674711811148548,
'huis clos': 16.674711811148548,
'mètre carré': 16.674711811148548,
'potentiellement significatives': 16.674711811148548}
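PMI compares a pair's observed frequency with what independence would predict. A minimal sketch (the function name is mine, not the post's API):

```python
import math

def pmi(c1, c2, c12, N):
    """Pointwise mutual information of an adjacent pair:
    log2( P(w1 w2) / (P(w1) * P(w2)) )."""
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))
```

A pair of hapaxes that always co-occur (\(c_1 = c_2 = c_{12} = 1\)) gets the maximal score \(\log_2(N)\), regardless of how little evidence supports it. This bias toward rare events is a well-known weakness of PMI and explains the runs of tied scores at the top of the list above.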
P-value
If we want a more detailed output, we can use the detailed_collocation_2 function:
| | \(w_1\) | \(\vert w_1 \vert\) | \(w_2\) | \(\vert w_2 \vert\) | score | p-value |
|---|---|---|---|---|---|---|
0 | cour | 948 | d’appel | 1230 | 14779.7 | 0 |
1 | code | 1051 | civil | 830 | 9034.44 | 0 |
2 | dès | 477 | lors | 773 | 8061.61 | 0 |
3 | bon | 205 | droit | 972 | 2470.06 | 0 |
4 | procédure | 496 | civile | 371 | 2074.79 | 0 |
5 | peut | 817 | être | 952 | 1936.88 | 0 |
6 | doit | 368 | être | 952 | 1856.48 | 0 |
7 | d’un | 1879 | immeuble | 261 | 1694.56 | 0 |
8 | chose | 157 | jugée | 110 | 1512.34 | 0 |
9 | après | 296 | avoir | 326 | 1478.3 | 0 |
10 | condition | 175 | suspensive | 73 | 1253.79 | 0 |
11 | rédaction | 189 | antérieure | 132 | 1221.52 | 0 |
12 | base | 99 | légale | 156 | 1133.1 | 0 |
13 | sous | 247 | seing | 67 | 1120.84 | 0 |
14 | seing | 67 | privé | 84 | 1114.62 | 0 |
15 | d’autre | 69 | part | 210 | 1037.47 | 0 |
16 | qu’une | 302 | cour | 948 | 990.321 | 0 |
17 | cassation | 186 | l’arrêt | 329 | 977.635 | 0 |
18 | l’acte | 647 | authentique | 227 | 969.403 | 0 |
19 | d’un | 1879 | acte | 538 | 947.683 | 0 |
20 | justifie | 38 | légalement | 139 | 917.195 | 0 |
21 | viole | 89 | l’article | 1901 | 914.622 | 0 |
22 | société | 411 | civile | 371 | 866.908 | 0 |
23 | officier | 63 | public | 217 | 851.046 | 0 |
24 | acte | 538 | authentique | 227 | 786.687 | 0 |
25 | cet | 322 | acte | 538 | 766.902 | 0 |
26 | cet | 322 | officier | 63 | 760.165 | 0 |
27 | régime | 191 | matrimonial | 54 | 736.566 | 0 |
28 | bonne | 48 | foi | 86 | 734.162 | 0 |
29 | sécurité | 49 | sociale | 63 | 732.558 | 0 |
For example, we can look up the value of sécurité sociale in the table and reject \(H_1\) for this bigram at a significance level of 0.005, since its score of 732.56 is far above the critical value of 7.88 for one degree of freedom.
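The decision rule reads as a one-liner (the threshold comes from the chi-square table; the names are mine):

```python
CRITICAL_0_005_1DF = 7.879  # chi-square critical value, alpha = 0.005, 1 dof

def reject_h1(score, critical=CRITICAL_0_005_1DF):
    """Reject independence (H1) when -2 log(lambda) exceeds the critical value."""
    return score > critical

# "sécurité sociale" from the table above
reject_h1(732.558)  # True: the bigram is a collocation at this level
```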
Sources
- Foundations of Statistical Natural Language Processing, Manning and Schütze
- Word association norms, mutual information, and lexicography
- NLTK documentation on collocations
- Wikipedia definition of Collocation
- Wikipedia definition of Co-occurrence
- Pointwise Mutual information
- Mutual information
- Information theory