
Text pre-processing

Classical software engineering sees a text as a string of bytes, encoded with a standard like UTF-8. As an undergraduate, text mining was always a mystery to me; I struggled with if-else chains for classifying text. The if-else approach is, however, deterministic and ensures that your program works exactly as you designed it. On the other hand, as documents grow in number and size, you want a methodology that classifies text correctly without hours of hardcoding rules.

If you have ever done a machine learning project, you may have heard of data cleaning or feature engineering, which essentially means reshaping and combining your data until it makes sense to humans. The rule of thumb: if a human can spot even the smallest pattern in a dataset, a computer can surely find it, too.

In the context of natural language processing, text processing can also be understood as reducing the dataset’s complexity and presenting it in a form that a computer can read and understand. This is not a trivial task at first glance, but since our goal is not composing a poem but merely automating the classification process, we can overcome the hurdle with a few simple algorithms. And surprisingly, the established algorithms out there do a very good job.

First of all, given a random English text written by a human, we can read and understand it by applying the sense of grammar and the rules we learned at school. The traditional NLP algorithms discussed below, however, do not actually work the way we do. They rely far less on grammar and much more on statistics, in particular Bayesian statistics. With a set of labeled text documents (documents already classified by human experts), we can build a statistical model that helps us classify unseen documents. Since we are talking about statistics, the predictions are not deterministic and can also be wrong. But this fact does not deny a good statistical model’s usefulness.

The idea is pretty simple: from the mentioned set of labeled text documents, we generate a so-called corpus, which can be understood as a numerical matrix. The columns of the matrix correspond to the unique words, and each row represents one document of the set. Each cell indicates how often a word appears in a document. This is called the bag-of-words approach, as the text is represented by word counts, regardless of word position inside the document. The approach is of course not perfect, since it cannot reliably analyze the semantics of a sentence: the positions of words are not considered.
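As a minimal sketch of this idea, the following builds a bag-of-words matrix from a toy set of documents using only the standard library (the example documents are invented for illustration):

```python
from collections import Counter

# Hypothetical mini-corpus of documents (toy data for illustration).
documents = [
    "the patient shows mild symptoms",
    "the patient recovered quickly",
]

# Vocabulary: all unique words across the documents (the matrix columns).
vocabulary = sorted({word for doc in documents for word in doc.split()})

# One row per document; each cell counts how often that word occurs in it.
matrix = []
for doc in documents:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
print(matrix)
```

In practice a library routine (e.g. a count vectorizer) does the same thing at scale, but the underlying structure is exactly this word-by-document count matrix.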

For text mining use cases, dimensionality reduction needs to be applied. Dimensionality reduction can be understood as removing those dimensions of a space that carry very little information content; this definition borrows from my understanding of entropy. The approach, however, should be applied with care, particularly if you use a bag-of-words representation of text.

First of all, we want to remove punctuation and other non-alphanumeric characters from the text. Which characters should be removed depends on the task: think of hashtags if you want to analyze tweets. In some cases, we even want to remove numeric characters, since they are fairly useless for document classification.
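A simple sketch of this cleaning step with a regular expression; the flag for keeping or dropping digits is an illustrative choice, not a fixed convention:

```python
import re

def clean(text, keep_digits=True):
    # Replace every non-alphanumeric character with a space, then
    # normalise whitespace. Set keep_digits=False to drop numbers too.
    pattern = r"[^a-z0-9 ]" if keep_digits else r"[^a-z ]"
    text = re.sub(pattern, " ", text.lower())
    return " ".join(text.split())

print(clean("Check-up at 10:30, please!"))         # digits kept
print(clean("Check-up at 10:30, please!", False))  # digits dropped
```

For tweets you would adapt the pattern to preserve the `#` of hashtags instead of stripping it.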

Second, we may want to remove stopwords, which vary from language to language. Words like “the” or “am” can be removed safely and are included in the stopword lists of most text mining software packages. If you are working in a domain-specific field like medicine, a term like “patient” may also be considered a stopword.
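A sketch of stopword filtering; the tiny stopword list here is purely illustrative (packages like nltk ship much longer, language-specific lists), and the medical addition is a hypothetical example:

```python
# Tiny illustrative stopword list; real packages provide fuller,
# language-specific lists. Domain terms can be layered on top.
stopwords = {"the", "am", "is", "a", "of"}
domain_stopwords = {"patient"}  # hypothetical addition for medical texts

def remove_stopwords(tokens, extra=frozenset()):
    return [t for t in tokens if t not in stopwords and t not in extra]

tokens = "the patient is a heavy smoker".split()
print(remove_stopwords(tokens, domain_stopwords))
```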

The next step is called “stemming”, and it takes a lot of human work to get right. Stemming means reducing words to their most basic form using grammatical rules. (Strictly speaking, mapping irregular forms like “goes”, “went” or “gone” to the base form “go” is lemmatization, but the goal is the same.) Since I am personally not a linguist, I prefer to use software packages that map a list of words to their basic form. In the Python environment, we can rely on the help of nltk.

We may also want to remove infrequently used terms, say those that appear in less than 1% or 0.5% of the whole corpus. This is a trade-off: on the one hand, an enormous decrease in dimensionality; on the other hand, those rare words can be the decisive key to a correct prediction. For medical documents, think of a very rare disease with a very small prevalence.
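A sketch of this filtering step on already-tokenized documents; the threshold of appearing in at least two documents is an arbitrary choice for the toy data, and would normally be expressed as a percentage of the corpus:

```python
from collections import Counter

# Toy tokenized corpus (invented for illustration).
docs = [
    ["patient", "fever", "cough"],
    ["patient", "fever", "fatigue"],
    ["patient", "cough", "rash"],
]

# Document frequency: in how many documents does each word occur?
document_frequency = Counter(word for doc in docs for word in set(doc))

min_df = 2  # keep only words occurring in at least 2 documents
filtered = [[w for w in doc if document_frequency[w] >= min_df] for doc in docs]
print(filtered)
```

Note that “fatigue” and “rash” are dropped here; if one of them were the rare-disease indicator from the example above, the cutoff would have discarded exactly the decisive feature.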

After going through these basic steps, one is in possession of a text corpus and can choose a machine learning algorithm for the classification task. The algorithm itself is not even that important for this task; a large corpus of high quality matters much more.
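To close the loop with the Bayesian statistics mentioned earlier, here is a minimal multinomial Naive Bayes classifier over word counts. The training sentences and labels are invented toy data, and this sketch is no substitute for a well-tested library implementation:

```python
import math
from collections import Counter, defaultdict

# Toy labeled training set (invented for illustration).
train = [
    ("the patient shows fever and cough", "medical"),
    ("patient recovered after treatment", "medical"),
    ("stocks rallied on strong earnings", "finance"),
    ("the market closed higher today", "finance"),
]

class_docs = defaultdict(list)
for text, label in train:
    class_docs[label].append(text.split())

vocab = {w for text, _ in train for w in text.split()}
priors = {c: len(d) / len(train) for c, d in class_docs.items()}
word_counts = {c: Counter(w for doc in d for w in doc)
               for c, d in class_docs.items()}
totals = {c: sum(wc.values()) for c, wc in word_counts.items()}

def predict(text):
    scores = {}
    for c in class_docs:
        # Sum log-probabilities with add-one (Laplace) smoothing so that
        # unseen words do not zero out the whole product.
        score = math.log(priors[c])
        for w in text.split():
            score += math.log((word_counts[c][w] + 1)
                              / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("the patient has a cough"))
```

Even this crude model illustrates the earlier point: the prediction is a statistical judgment, not a deterministic rule, and its quality rises and falls with the corpus behind it.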



Published in Machine Learning
