
Introduction

When Andrew Ng says

“Hey, everyone, the neural network is good enough. Let’s not mess with the code anymore. The only thing you’re going to do now is build processes to improve the quality of the data.” — Andrew Ng

it somehow echoes the words of Thomas Henry Huxley, who at the time was defending Darwin's theory from On the Origin of Species against William Thomson:

“Mathematics may be compared to a mill of exquisite workmanship, which grinds you stuff of any degree of fineness; but, nevertheless, what you get out depends upon what you put in; and as the grandest mill in the world will not extract wheat-flour from peascod, so pages of formulae will not get a definite result out of loose data.” — Thomas Henry Huxley

However, even on simple data fields like customer addresses, assessing text quality is a challenging problem. As a concrete example of a metric, we would like to measure the proportion of spelling or typing mistakes, or the number of unknown words. In this article we present a quick way to analyse unstructured text and understand more about its linguistic features. We will focus only on descriptive measures: this is the first step in assessing the quality of the data as we keep improving it.

And remember what Lord Kelvin said:

“If you can’t measure it, you can’t improve it” — William Thomson (Lord Kelvin)

Notations and context

To start, let's define a quick theoretical context, before digging into the descriptive metrics.

We define a corpus \(\mathcal{C} = \{ d_i \}\) as a list of documents. The number of documents can be seen as the cardinality of the corpus, \(\vert \mathcal{C} \vert = N\).

Each document has a set of text attributes or fields. Each field is indexed by j from 1 to M, such that \(a_{ij}\) is the j\(^{th}\) field of the i\(^{th}\) document.

In this case, each document has only one text field, the title.

We define a tokenizer function \(f_{tokenizer}\) as \(f_{tokenizer} : a_{ij} \rightarrow [t_1, ..., t_k, ..., t_L]\), with \([t_k]\) the list of tokens corresponding to the field \(a_{ij}\).

We define \(\mathcal{T} = \{ t_{ijk} \}\), where \(t_{ijk}\) is the k\(^{th}\) token of the j\(^{th}\) field of the i\(^{th}\) document. We can also write \(\mathcal{T}_j = \{ t_{ik} \}\) for the list of tokens of a given field j.

We define the dictionary \(\mathcal{D} = \{ t_i \}\), given a tokenizer function \(f_{tokenizer}\), as the list of unique tokens generated by that tokenizer on a given corpus.
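
To make these notations concrete, here is a minimal sketch in Python, assuming nltk.word_tokenize as the tokenizer and a toy list of titles in place of the real corpus.

```python
# Minimal sketch of the notations above: a corpus of documents with a single
# text field (the title), a tokenizer, the token list and the dictionary.
# The toy titles are placeholders, not the actual dataset.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer model ("punkt_tab" on recent NLTK versions)

titles = [
    "Show HN: A tiny tool to monitor text quality",
    "Ask HN: How do you measure data quality?",
]  # C = {d_i}, |C| = N

tokenized = [word_tokenize(title.lower()) for title in titles]      # f_tokenizer(a_ij)
tokens = [t for title_tokens in tokenized for t in title_tokens]    # T
dictionary = set(tokens)                                            # D: unique tokens

print(len(titles), len(tokens), len(dictionary))
```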

Descriptive metric list

Having already seen some basic text analysis functions with nltk.book.Text, we describe and compute some basic metrics to find characteristic and informative things about the Hacker News story dataset.

To see the code used to build the following table, go this way.

| Name | Formula | Value | Description |
| --- | --- | --- | --- |
| Count | \(\vert \mathcal{C} \vert\) | 500 | Number of titles |
| Unique count | \(\vert \mathcal{C}_{unique} \vert\) | 497 | Number of unique titles |
| Token count | \(\vert \mathcal{T} \vert\) | 2853 | Total number of tokens |
| Dictionary length | \(\vert \mathcal{D} \vert\) | 2195 | Total number of unique tokens |
| Lemmatized dictionary length | \(\vert \mathcal{D}_{lemme} \vert\) | 1817 | Total number of unique lemmatized tokens |
| Alpha lemmatized dictionary length | \(\vert \mathcal{D}_{\alpha-lemme} \vert\) | 1766 | Total number of unique alphabetic lemmatized tokens |
| Average length | \(\bar{M_i}\) | 49.69 | Average title length, in characters |
| Min and max length | \((\min(M_i), \max(M_i))\) | (7, 81) | Minimum and maximum title length, in characters |
| Median length | \(\tilde{M_i}\) | 51 | Median title length, in characters |
| Std length | \(s_{M_i}\) | 19.12 | Standard deviation of the title length, in characters |
| Duplicate proportion | \({\vert \mathcal{C} \vert - \vert \mathcal{C}_{unique} \vert \over \vert \mathcal{C} \vert}\) | 0.006 | Proportion of titles that appear more than once |
| Numerical frequency | \({\vert \mathcal{T}_{numerical} \vert \over \vert \mathcal{T} \vert}\) | 0.0315 | Frequency of numerical tokens |
| Numerical proportion | \({\vert \mathcal{D}_{numerical} \vert \over \vert \mathcal{D}_{lemme} \vert}\) | 0.0281 | Proportion of numerical tokens in the lemmatized dictionary |
| In vocabulary | \({\vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert}\) | 0.7599 | Proportion of tokens inside the NLTK vocabulary |
| Out of vocabulary | \({\vert \mathcal{D}_{\alpha-lemme} \vert - \vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert}\) | 0.2401 | Proportion of tokens outside the NLTK vocabulary |
| Lexical diversity | \({\vert \mathcal{D} \vert \over \vert \mathcal{T} \vert}\) | 0.7694 | Dictionary count over the token count |
| Hapaxes | \({\vert \mathcal{D}_{hapax} \vert \over \vert \mathcal{D} \vert}\) | 0.8378 | Proportion of dictionary tokens that occur only once (hapax legomena) |
| Uppercase titles | \({\vert \mathcal{C}_{upper} \vert \over \vert \mathcal{C} \vert}\) | 0.0 | Proportion of fully uppercased titles |
| Uppercased token proportion | \({\vert \mathcal{T}_{upper} \vert \over \vert \mathcal{T} \vert}\) | 0.0648 | Proportion of uppercased tokens |
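
As a rough illustration (not the exact code behind the table, which is linked above), some of these metrics can be computed as follows, reusing the `titles`, `tokens` and `dictionary` objects from the sketch above.

```python
# Sketch of a few metrics from the table; the dictionary keys are informal
# names and the lemmatizer needs the "wordnet" NLTK resource.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

lem_dictionary = {lemmatizer.lemmatize(t) for t in dictionary}     # D_lemme
alpha_lem_dictionary = {t for t in lem_dictionary if t.isalpha()}  # D_alpha-lemme
lengths = [len(title) for title in titles]                         # M_i, in characters

metrics = {
    "count": len(titles),
    "unique_count": len(set(titles)),
    "token_count": len(tokens),
    "dictionary_length": len(dictionary),
    "lem_dictionary_length": len(lem_dictionary),
    "alpha_lem_dictionary_length": len(alpha_lem_dictionary),
    "average_length": sum(lengths) / len(lengths),
    "duplicate_proportion": (len(titles) - len(set(titles))) / len(titles),
    "numerical_frequency": sum(t.isnumeric() for t in tokens) / len(tokens),
    "lexical_diversity": len(dictionary) / len(tokens),
}
```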

In the following sections we give some information on lexical diversity, hapaxes, the uppercased token proportion and out-of-vocabulary words.

Lexical Diversity and Hapax Legomena

Lexical diversity (LD) is the number of words in the dictionary over the number of tokens in the corpus:

\[\text{LD} = { \vert \mathcal{D} \vert \over \vert \mathcal{T} \vert }\]
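
As a sketch, on a token list this is a one-liner:

```python
# Lexical diversity: unique tokens over total tokens (illustrative helper).
def lexical_diversity(tokens):
    return len(set(tokens)) / len(tokens)  # LD = |D| / |T|
```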

Texts that are lexically diverse use a wide range of vocabulary, avoid repetition, use precise language and tend to use synonyms to express ideas. While lexical diversity can be a sign of well-written and diverse textual data, it is biased by the size of the corpus. Besides, if there are too many rare words, so-called hapaxes (words that occur only once), a language model may lack the context needed to infer what they mean. Also note that too many hapaxes may simply be a characteristic of a small dataset.

The hapax proportion (HP) is the proportion of dictionary words that occur only once in a given corpus:

\[\text{HP} = { \vert \mathcal{D}_{hapax} \vert \over \vert \mathcal{D} \vert }\]
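
With NLTK, `FreqDist.hapaxes` returns the tokens that occur exactly once, so HP can be sketched as:

```python
# Hapax proportion: share of dictionary entries that occur only once.
from nltk import FreqDist

def hapax_proportion(tokens):
    fdist = FreqDist(tokens)
    return len(fdist.hapaxes()) / len(fdist)  # HP = |D_hapax| / |D|
```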

Here \(\text{HP} = 83.78 \%\), which is largely a consequence of the small size of the dataset.

Jargon and Abbreviations

When training a language model on a specific domain, we aim to learn the specific words that characterize the domain, the so-called jargon. One way to measure the level of domain specialization is to extract the number and diversity of abbreviations. We approximate this by searching for uppercased tokens of more than one character:

\[\text{UT} = { \vert \mathcal{T}_{upper} \vert \over \vert \mathcal{T} \vert }\]

Here \(\text{UT} = 6.48 \%\), keeping in mind that stop words are not removed and that the corpus is made up of titles. It would be interesting to see the distribution of those words.
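
A possible sketch, keeping the original casing of the tokens (the lowercased `tokens` list above would hide them) and looking at the most frequent uppercased tokens:

```python
# Uppercased tokens of more than one character, as a rough proxy for
# abbreviations; assumes `titles` from the sketches above.
from collections import Counter
from nltk.tokenize import word_tokenize

raw_tokens = [t for title in titles for t in word_tokenize(title)]
upper_tokens = [t for t in raw_tokens if t.isupper() and len(t) > 1]

ut = len(upper_tokens) / len(raw_tokens)          # UT = |T_upper| / |T|
print(ut, Counter(upper_tokens).most_common(10))  # distribution of candidate abbreviations
```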

As a complement, looking at the words outside the NLTK vocabulary (out of vocabulary, OOV) can also be informative.

\[\text{OOV} = 1 - { \vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert }\]
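
A sketch of this computation, assuming the alpha lemmatized dictionary built above and the NLTK `words` corpus as the reference vocabulary:

```python
# Out-of-vocabulary proportion against the NLTK "words" corpus.
import nltk
from nltk.corpus import words

nltk.download("words", quiet=True)
nltk_vocab = {w.lower() for w in words.words()}      # D_NLTK

in_vocab = alpha_lem_dictionary & nltk_vocab
oov = 1 - len(in_vocab) / len(alpha_lem_dictionary)  # OOV = 1 - |D_alpha-lemme ∩ D_NLTK| / |D_alpha-lemme|
```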

To conclude…

In a data-centric view, being able to see the evolution of the data through metrics is crucial. As we iteratively improve the dataset, we want to measure how the data quality impacts the model predictions. When a model is deployed in production, monitoring data drift is key, and it is one of the responsibilities of the rising MLOps role.

Sources