Compute simple corpus descriptive metrics
Introduction
When Andrew Ng says
“Hey, everyone, the neural network is good enough. Let’s not mess with the code anymore. The only thing you’re going to do now is build processes to improve the quality of the data.” — Andrew Ng
it somehow echoes the words of Thomas Henry Huxley, who was at the time defending Darwin's theory, set out in On the Origin of Species, against William Thomson:
“Mathematics may be compared to a mill of exquisite workmanship, which grinds you stuff of any degree of fineness; but, nevertheless, what you get out depends upon what you put in; and as the grandest mill in the world will not extract wheat-flour from peascod, so pages of formulae will not get a definite result out of loose data.” — Thomas Henry Huxley
However, even on simple data fields like customer addresses, assessing text quality is a challenging problem. As a concrete example of a metric, we would like to measure the proportion of spelling or typing mistakes, or the number of unknown words. In this article we present a quick way to analyse unstructured text and understand more about its linguistic features. We only focus on descriptive measures: this is the first step to assess the quality of the data as we keep improving it.
And remember what Lord Kelvin said
“If you can’t measure it, you can’t improve it” — William Thomson (Lord Kelvin)
Notations and context
To start, let’s define a quick theoretical context before digging into the descriptive metrics.
We define a corpus \(\mathcal{C} = \{ d_i \}\) as a list of documents. The number of documents can be seen as the cardinality of the corpus, \(\vert \mathcal{C} \vert = N\).
Each document has a set of text attributes (fields). Each field is indexed by j from 1 to M, such that \(a_{ij}\) is the j\(^{th}\) field of the i\(^{th}\) document.
In this case, each document has only one text field, the title.
We define a tokenizer function \(f_{tokenizer}\) as \(f_{tokenizer} : a_{ij} \rightarrow [t_1, ..., t_k, ... , t_L]\), with \(t_k\) the k\(^{th}\) token of the field \(a_{ij}\).
We define \(\mathcal{T} = \{ t_{ijk} \}\), where \(t_{ijk}\) is the k\(^{th}\) token of the j\(^{th}\) field of the i\(^{th}\) document. We can also write \(\mathcal{T}_j = \{ t_{ik} \}\) for the list of tokens of a given field j.
Given a tokenizer function \(f_{tokenizer}\), we define the dictionary \(\mathcal{D} = \{ t_i \}\) as the list of unique tokens generated by the tokenizer on a given corpus.
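To make these notations concrete, here is a minimal sketch using NLTK's word_tokenize (an illustrative choice of tokenizer, not necessarily the one behind the results below) that builds \(\mathcal{T}\) and \(\mathcal{D}\) from a toy corpus with a single title field.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # models needed by word_tokenize (recent NLTK versions may ask for "punkt_tab")

# Hypothetical corpus C = {d_i}: each document has a single text field, the title.
corpus = [
    {"title": "Show HN: A tiny NLP toolkit"},
    {"title": "Ask HN: How do you monitor data quality?"},
]

def f_tokenizer(field: str) -> list[str]:
    """Map a text field a_ij to its list of tokens [t_1, ..., t_L]."""
    return word_tokenize(field)

# T: all tokens of the corpus, D: the dictionary of unique tokens.
tokens = [t for doc in corpus for t in f_tokenizer(doc["title"])]
dictionary = set(tokens)

print(len(corpus), len(tokens), len(dictionary))  # N, |T|, |D|
```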
Descriptive metric list
Having already seen some basic functions of text analysis with nltk.book.Text, we describe and compute some basic metrics to find characteristic and informative things about the hackernews stories dataset.
To see the code used to build the following table go this way.
Name | Formula | Value | Description
---|---|---|---
Count | \(\vert \mathcal{O} \vert\) | 500 | Number of titles
Unique count | \(\vert \mathcal{O}_{unique} \vert\) | 497 | Number of unique titles
Token count | \(\vert \mathcal{T} \vert\) | 2853 | Total number of tokens
Dictionary length | \(\vert \mathcal{D} \vert\) | 2195 | Total number of unique tokens
Lem dictionary length | \(\vert \mathcal{D}_{lemme} \vert\) | 1817 | Total number of unique lemmatized tokens
Alpha lem dictionary length | \(\vert \mathcal{D}_{\alpha-lemme} \vert\) | 1766 | Total number of unique alphabetic lemmatized tokens
Average length | \(\bar{M_i}\) | 49.69 | Average title length (in characters)
Min and Max length | \(\{ \min(M_i), \max(M_i) \}\) | (7, 81) | Minimum and maximum title length (in characters)
Median length | \(\tilde{M_i}\) | 51 | Median title length (in characters)
Std length | \(s_{M_i}\) | 19.12 | Standard deviation of the title length (in characters)
Duplicate proportion | \(\vert \mathcal{O} \vert - \vert \mathcal{O}_{unique} \vert \over \vert \mathcal{O} \vert\) | 0.006 | Proportion of titles that appear more than once
Numerical frequency | \(\vert \mathcal{T}_{numerical} \vert \over \vert \mathcal{T} \vert\) | 0.0315 | Frequency of numerical tokens
Numerical proportion | \(\vert \mathcal{D}_{numerical} \vert \over \vert \mathcal{D}_{lemme} \vert\) | 0.0281 | Proportion of numerical tokens in the dictionary
In vocabulary | \(\vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert\) | 0.7599 | Proportion of dictionary tokens inside the NLTK vocabulary
Out of vocabulary | \(\vert \mathcal{D}_{\alpha-lemme} \vert - \vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert\) | 0.2401 | Proportion of dictionary tokens outside the NLTK vocabulary
Lexical diversity | \(\vert \mathcal{D} \vert \over \vert \mathcal{T} \vert\) | 0.7694 | Dictionary size over the token count
Hapaxes | \(\vert \mathcal{D}_{hapax} \vert \over \vert \mathcal{D} \vert\) | 0.8378 | Proportion of tokens that occur only once (hapax legomena)
Uppercase items | \(\vert \mathcal{O}_{upper} \vert \over \vert \mathcal{O} \vert\) | 0.0 | Proportion of fully uppercase titles
Uppercased token proportion | \(\vert \mathcal{T}_{upper} \vert \over \vert \mathcal{T} \vert\) | 0.0648 | Proportion of uppercase tokens
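As a rough illustration of how some of these metrics can be obtained (the linked code remains the reference; the WordNet lemmatizer and the fact that the length statistics are computed on characters are assumptions of this sketch), reusing `corpus`, `tokens` and `dictionary` from above:

```python
import statistics

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # data needed by the lemmatizer

titles = [doc["title"] for doc in corpus]

count = len(titles)                                    # |O|
unique_count = len(set(titles))                        # |O_unique|
duplicate_proportion = (count - unique_count) / count  # (|O| - |O_unique|) / |O|

lemmatizer = WordNetLemmatizer()
lem_dictionary = {lemmatizer.lemmatize(t.lower()) for t in dictionary}  # D_lemme
alpha_lem_dictionary = {t for t in lem_dictionary if t.isalpha()}       # D_alpha-lemme

numerical_frequency = sum(t.isnumeric() for t in tokens) / len(tokens)  # |T_numerical| / |T|

lengths = [len(title) for title in titles]             # title lengths, here in characters
print(statistics.mean(lengths), statistics.median(lengths), statistics.pstdev(lengths))
print(min(lengths), max(lengths), duplicate_proportion, numerical_frequency)
```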
In the following sections we give more details on lexical diversity, hapaxes, the uppercase token proportion and out-of-vocabulary words.
Lexical Diversity and Hapax Legomena
The lexical diversity LD is the number of words in the dictionary over the total number of tokens in the corpus:
\[\text{LD} = { \vert \mathcal{D} \vert \over \vert \mathcal{T} \vert }\]

Texts that are lexically diverse use a wide range of vocabulary, avoid repetition, use precise language and tend to use synonyms to express ideas. While lexical diversity can be a sign of well-written and diverse textual data, it is biased by the size of the corpus. Besides, if there are too many rare words, the so-called hapaxes, i.e. words that occur only once, a language model may lack the context needed to guess what they mean. Also note that too many hapaxes may simply be a characteristic of a small dataset.
The hapax proportion HP is the proportion of words that occur only once in a given corpus:
\[\text{HP} = { \vert \mathcal{D}_{hapax} \vert \over \vert \mathcal{D} \vert }\]

Here \(\text{HP} = 83.78 \%\), which is clearly a consequence of the small size of the dataset.
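Both quantities are easy to compute with nltk.FreqDist, which exposes the hapaxes directly; a minimal sketch, assuming the `tokens` list built earlier:

```python
from nltk import FreqDist

fd = FreqDist(tokens)                            # token -> number of occurrences

lexical_diversity = len(fd) / fd.N()             # LD = |D| / |T|
hapax_proportion = len(fd.hapaxes()) / len(fd)   # HP = |D_hapax| / |D|

print(f"LD = {lexical_diversity:.2%}, HP = {hapax_proportion:.2%}")
```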
Jargon and Abbreviations
When training a language model on a specific domain, we aim to learn the specific words that characterize the domain, the so-called jargon. One way to measure the level of domain specialization is to look at the number and diversity of abbreviations. We approximate this by searching for uppercase words of more than one character:
\[\text{UT} = { \vert \mathcal{T}_{upper} \vert \over \vert \mathcal{T} \vert }\]

Here \(\text{UT} = 6.48 \%\), given that stop words are not removed and that the corpus is made up of titles. It would be interesting to see the distribution of those words.
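Here is a minimal sketch of this heuristic, again on the `tokens` list from above; the Counter gives a first look at the distribution of those candidate abbreviations:

```python
from collections import Counter

# Uppercase tokens of more than one character, used as a rough proxy for abbreviations.
upper_tokens = [t for t in tokens if t.isupper() and len(t) > 1]
ut = len(upper_tokens) / len(tokens)           # UT = |T_upper| / |T|

print(f"UT = {ut:.2%}")
print(Counter(upper_tokens).most_common(10))   # most frequent candidate abbreviations
```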
As a complement, looking at the words outside the NLTK vocabulary, the out-of-vocabulary (OOV) proportion, can also be informative:
\[\text{OOV} = 1 - { \vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert }\]

Here \(\text{OOV} = 24.01 \%\).
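A minimal sketch of this computation, assuming \(\mathcal{D}_{NLTK}\) is approximated by the words corpus shipped with NLTK and reusing `alpha_lem_dictionary` from the earlier sketch:

```python
import nltk
from nltk.corpus import words

nltk.download("words", quiet=True)  # reference English word list

nltk_vocabulary = {w.lower() for w in words.words()}  # D_NLTK

# OOV = 1 - |intersection of D_alpha-lemme and D_NLTK| / |D_alpha-lemme|
in_vocab = alpha_lem_dictionary & nltk_vocabulary
oov = 1 - len(in_vocab) / len(alpha_lem_dictionary)

print(f"OOV = {oov:.2%}")
```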
To conclude…

In a data-centric view, being able to see the evolution of the data through metrics is crucial. As we iteratively improve the dataset, we want to measure how the data quality impacts the model predictions. When a model is deployed in production, monitoring data drift is key, and it is one of the responsibilities of the rising MLOps role.