Compute simple corpus descriptive metrics
Introduction
When Andrew Ng says
“Hey, everyone, the neural network is good enough. Let’s not mess with the code anymore. The only thing you’re going to do now is build processes to improve the quality of the data.” — Andrew Ng
it somehow echoes the words of Thomas Henry Huxley, who was at the time defending Darwin's theory, set out in On the Origin of Species, against William Thomson:
“Mathematics may be compared to a mill of exquisite workmanship, which grinds you stuff of any degree of fineness; but, nevertheless, what you get out depends upon what you put in; and as the grandest mill in the world will not extract wheat-flour from peascod, so pages of formulae will not get a definite result out of loose data.” — Thomas Henry Huxley
However, even on simple data fields like customer addresses, assessing text quality is a challenging problem. As a concrete example of a metric, we would like to measure the proportion of spelling or typing mistakes, or the number of unknown words. In this article we present a quick way to analyse unstructured text and understand more about its linguistic features. We only focus on descriptive measures: this is the first step to assess the quality of the data as we keep improving it.
And remember what Lord Kelvin said
“If you can’t measure it, you can’t improve it” — William Thomson (Lord Kelvin)
Notations and context
To start, let’s define a quick theoretical context before digging into the descriptive metrics.
We define a corpus \(\mathcal{C} = \{ d_i \}\) as a list of documents. The number of documents can be seen as the cardinality of the corpus, \(\vert \mathcal{C} \vert = N\).
Each document has a set of text attributes (fields). Each field is indexed by j from 1 to M, such that \(a_{ij}\) is the j\(^{th}\) field of the i\(^{th}\) document.
In this case, each document has only one text field, the title.
We define a tokenizer function \(f_{tokenizer}\) as \(f_{tokenizer} : a_{ij} \rightarrow [t_1, ..., t_k, ... , t_L]\), with \(t_k\) the k\(^{th}\) token of the field \(a_{ij}\).
We define \(\mathcal{T} = \{ t_{ijk} \}\), where \(t_{ijk}\) is the k\(^{th}\) token of the j\(^{th}\) field of the i\(^{th}\) document. We can also write \(\mathcal{T}_j = \{ t_{ik} \}\) for the list of tokens of a given field j.
Given a tokenizer function \(f_{tokenizer}\), we define the dictionary \(\mathcal{D} = \{ t_i \}\) as the list of unique tokens generated by the tokenizer on a given corpus.
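To make these notations concrete, here is a minimal sketch using NLTK's word_tokenize (an illustrative choice of tokenizer, not necessarily the one behind the results below) that builds \(\mathcal{T}\) and \(\mathcal{D}\) from a toy corpus with a single title field.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # models needed by word_tokenize (recent NLTK versions may ask for "punkt_tab")

# Hypothetical corpus C = {d_i}: each document has a single text field, the title.
corpus = [
    {"title": "Show HN: A tiny NLP toolkit"},
    {"title": "Ask HN: How do you monitor data quality?"},
]

def f_tokenizer(field: str) -> list[str]:
    """Map a text field a_ij to its list of tokens [t_1, ..., t_L]."""
    return word_tokenize(field)

# T: all tokens of the corpus, D: the dictionary of unique tokens.
tokens = [t for doc in corpus for t in f_tokenizer(doc["title"])]
dictionary = set(tokens)

print(len(corpus), len(tokens), len(dictionary))  # N, |T|, |D|
```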
Descriptive metric list
Having already seen some basic functions of text analysis with nltk.book.Text, we describe and compute some basic metrics to find characteristic and informative things about the hackernews stories dataset.
To see the code used to build the following table go this way.
Name | Formula | Value | Description
---|---|---|---
Count | \(\vert \mathcal{O} \vert\) | 500 | Number of titles
Unique count | \(\vert \mathcal{O}_{unique} \vert\) | 497 | Number of unique titles
Token count | \(\vert \mathcal{T} \vert\) | 2853 | Total number of tokens
Dictionary length | \(\vert \mathcal{D} \vert\) | 2195 | Total number of unique tokens
Lem dictionary length | \(\vert \mathcal{D}_{lemme} \vert\) | 1817 | Total number of unique lemmatized tokens
Alpha lem dictionary length | \(\vert \mathcal{D}_{\alpha-lemme} \vert\) | 1766 | Total number of unique alphabetic lemmatized tokens
Average length | \(\bar{M_i}\) | 49.69 | Average title length (in characters)
Min and Max length | \(\{ \min(M_i), \max(M_i) \}\) | (7, 81) | Minimum and maximum title length (in characters)
Median length | \(\tilde{M_i}\) | 51 | Median title length (in characters)
Std length | \(s_{M_i}\) | 19.12 | Standard deviation of the title length (in characters)
Duplicate proportion | \(\vert \mathcal{O} \vert - \vert \mathcal{O}_{unique} \vert \over \vert \mathcal{O} \vert\) | 0.006 | Proportion of titles that appear more than once
Numerical frequency | \(\vert \mathcal{T}_{numerical} \vert \over \vert \mathcal{T} \vert\) | 0.0315 | Frequency of numerical tokens
Numerical proportion | \(\vert \mathcal{D}_{numerical} \vert \over \vert \mathcal{D}_{lemme} \vert\) | 0.0281 | Proportion of numerical tokens in the dictionary
In vocabulary | \(\vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert\) | 0.7599 | Proportion of dictionary tokens inside the NLTK vocabulary
Out of vocabulary | \(\vert \mathcal{D}_{\alpha-lemme} \vert - \vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert\) | 0.2401 | Proportion of dictionary tokens outside the NLTK vocabulary
Lexical diversity | \(\vert \mathcal{D} \vert \over \vert \mathcal{T} \vert\) | 0.7694 | Dictionary size over the token count
Hapaxes | \(\vert \mathcal{D}_{hapax} \vert \over \vert \mathcal{D} \vert\) | 0.8378 | Proportion of tokens that occur only once (hapax legomena)
Uppercase items | \(\vert \mathcal{O}_{upper} \vert \over \vert \mathcal{O} \vert\) | 0.0 | Proportion of fully uppercase titles
Uppercased token proportion | \(\vert \mathcal{T}_{upper} \vert \over \vert \mathcal{T} \vert\) | 0.0648 | Proportion of uppercase tokens
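As a rough illustration of how some of these metrics can be obtained (the linked code remains the reference; the WordNet lemmatizer and the fact that the length statistics are computed on characters are assumptions of this sketch), reusing `corpus`, `tokens` and `dictionary` from above:

```python
import statistics

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # data needed by the lemmatizer

titles = [doc["title"] for doc in corpus]

count = len(titles)                                    # |O|
unique_count = len(set(titles))                        # |O_unique|
duplicate_proportion = (count - unique_count) / count  # (|O| - |O_unique|) / |O|

lemmatizer = WordNetLemmatizer()
lem_dictionary = {lemmatizer.lemmatize(t.lower()) for t in dictionary}  # D_lemme
alpha_lem_dictionary = {t for t in lem_dictionary if t.isalpha()}       # D_alpha-lemme

numerical_frequency = sum(t.isnumeric() for t in tokens) / len(tokens)  # |T_numerical| / |T|

lengths = [len(title) for title in titles]             # title lengths, here in characters
print(statistics.mean(lengths), statistics.median(lengths), statistics.pstdev(lengths))
print(min(lengths), max(lengths), duplicate_proportion, numerical_frequency)
```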
In the following sections we give more details on lexical diversity, hapaxes, the uppercase token proportion and out-of-vocabulary words.
Lexical Diversity and Hapax Legomena
The lexical diversity LD is the number of words in the dictionary over the total number of tokens in the corpus:
\[\text{LD} = { \vert \mathcal{D} \vert \over \vert \mathcal{T} \vert }\]

Texts that are lexically diverse use a wide range of vocabulary, avoid repetition, use precise language and tend to use synonyms to express ideas. While lexical diversity can be a sign of well-written and diverse textual data, it is biased by the size of the corpus. Besides, if there are too many rare words, the so-called hapaxes, i.e. words that occur only once, a language model may lack the context needed to guess what they mean. Also note that too many hapaxes may simply be a characteristic of a small dataset.
The hapax proportion HP is the proportion of words that occur only once in a given corpus:
\[\text{HP} = { \vert \mathcal{D}_{hapax} \vert \over \vert \mathcal{D} \vert }\]

Here \(\text{HP} = 83.78 \%\), which is clearly a consequence of the small size of the dataset.
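Both quantities are easy to compute with nltk.FreqDist, which exposes the hapaxes directly; a minimal sketch, assuming the `tokens` list built earlier:

```python
from nltk import FreqDist

fd = FreqDist(tokens)                            # token -> number of occurrences

lexical_diversity = len(fd) / fd.N()             # LD = |D| / |T|
hapax_proportion = len(fd.hapaxes()) / len(fd)   # HP = |D_hapax| / |D|

print(f"LD = {lexical_diversity:.2%}, HP = {hapax_proportion:.2%}")
```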
Jargon and Abbreviations
When training a language model on a specific domain, we aim to learn the specific words that characterize the domain, the so-called jargon. One way to measure the level of domain specialization is to look at the number and diversity of abbreviations. We approximate this by searching for uppercase words of more than one character:
\[\text{UT} = { \vert \mathcal{T}_{upper} \vert \over \vert \mathcal{T} \vert }\]

Here \(\text{UT} = 6.48 \%\), given that stop words are not removed and that the corpus is made up of titles. It would be interesting to see the distribution of those words.
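Here is a minimal sketch of this heuristic, again on the `tokens` list from above; the Counter gives a first look at the distribution of those candidate abbreviations:

```python
from collections import Counter

# Uppercase tokens of more than one character, used as a rough proxy for abbreviations.
upper_tokens = [t for t in tokens if t.isupper() and len(t) > 1]
ut = len(upper_tokens) / len(tokens)           # UT = |T_upper| / |T|

print(f"UT = {ut:.2%}")
print(Counter(upper_tokens).most_common(10))   # most frequent candidate abbreviations
```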
As a complement, looking at the words outside the NLTK vocabulary, the out-of-vocabulary (OOV) proportion, can also be informative:
\[\text{OOV} = 1 - { \vert \mathcal{D}_{\alpha-lemme} \cap \mathcal{D}_{NLTK} \vert \over \vert \mathcal{D}_{\alpha-lemme} \vert }\]

Here \(\text{OOV} = 24.01 \%\).
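A minimal sketch of this computation, assuming \(\mathcal{D}_{NLTK}\) is approximated by the words corpus shipped with NLTK and reusing `alpha_lem_dictionary` from the earlier sketch:

```python
import nltk
from nltk.corpus import words

nltk.download("words", quiet=True)  # reference English word list

nltk_vocabulary = {w.lower() for w in words.words()}  # D_NLTK

# OOV = 1 - |intersection of D_alpha-lemme and D_NLTK| / |D_alpha-lemme|
in_vocab = alpha_lem_dictionary & nltk_vocabulary
oov = 1 - len(in_vocab) / len(alpha_lem_dictionary)

print(f"OOV = {oov:.2%}")
```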
To conclude…

In a data-centric view, being able to see the evolution of the data through metrics is crucial. As we iteratively improve the dataset, we want to measure how the data quality impacts the model predictions. When a model is deployed in production, monitoring data drift is key, and it is one of the responsibilities of the rising MLOps role.