Natural language processing (NLP)
What is Natural language processing?
Natural language processing (NLP) is a field of study focused on the interactions between human language and computers. NLP helps machines “read” text by simulating the human ability to understand language. It sits at the intersection of computer science, artificial intelligence, and computational linguistics.
Natural Language Processing (NLP) is a field of Artificial Intelligence (AI) that focuses on quantifying human language to make it intelligible to machines. It combines the power of linguistics and computer science to study the rules and structure of language, and create intelligent systems capable of understanding, analyzing, and extracting meaning from text and speech.
Linguistics is used to understand the structure and meaning of a text by analyzing different aspects like syntax, semantics, pragmatics, and morphology. Then, computer science transforms this linguistic knowledge into rule-based or machine learning algorithms that can solve specific problems and perform desired tasks.
Take Gmail, for example. Your emails are automatically categorized as Promotions, Social, Primary, or Spam thanks to an NLP task called text classification. By breaking down words and identifying patterns, rules, and relationships between them, machines automatically learn which category to assign emails.
Why Is Natural Language Processing Important?
Natural Language Processing plays an especially important role in structuring big data because it prepares text and speech for machines so that they can interpret, process, and organize information. Some of the main advantages of NLP include:
- Large-scale analysis. Natural Language Processing can help machines perform language-based tasks such as reading text, identifying what is important, and detecting sentiment at scale. If you receive an influx of customer support tickets, you do not need to hire more staff. NLP tools can be scaled up or down as needed.
- Automate processes in real-time. Machine learning tools, equipped with natural language processing, can learn to understand and analyze information without human help – quickly, effectively, and around the clock.
- Consistent and unbiased criteria. NLP systems are not subjective like humans. They tag data based on one set of rules, so you do not have to worry about inconsistent and inaccurate results.
Natural Language Processing & Algorithms
In this section, we will focus on two primary natural language processing techniques and their sub-tasks.
Syntactic analysis ― also known as parsing or syntax analysis ― identifies the syntactic structure of a text and the dependency relationships between words, represented on a diagram called a parse tree.
Syntax analysis involves many different sub-tasks, including:
Tokenization is the most basic task in natural language processing. It breaks up a string of words into semantically useful units called tokens, and works by defining boundaries, that is, criteria for where a token begins or ends.
You can use sentence tokenization to split sentences within a text, or word tokenization to split words within a sentence. Generally, word tokens can be separated by blank spaces, and sentence tokens by stops. However, you can perform high-level tokenization for more complex structures, like words that often go together, otherwise known as collocations (for example, New York).
Here is an example of how word tokenization simplifies text:
Customer service could not be better! = [“customer service”, “could”, “not”, “be”, “better”, “!”]
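The tokenization steps described above can be sketched in a few lines of Python. This is a minimal illustration that uses only regular expressions, not a production tokenizer. The function names and the `collocations` parameter are assumptions made for this example.

```python
import re


def sentence_tokenize(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def word_tokenize(sentence, collocations=()):
    """Lowercase a sentence and split it into word and punctuation tokens.

    Phrases listed in `collocations` (hypothetical parameter) are kept
    together as single tokens, e.g. "new york".
    """
    text = sentence.lower()
    # Temporarily join collocations with "_" so the regex sees one word.
    # Simplification: real underscores in the input would also become spaces.
    for phrase in collocations:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [token.replace("_", " ") for token in tokens]


print(word_tokenize("Customer service could not be better!",
                    collocations=("customer service",)))
# ['customer service', 'could', 'not', 'be', 'better', '!']
print(sentence_tokenize("I love it. Do you? Yes!"))
# ['I love it.', 'Do you?', 'Yes!']
```

Note that this sketch keeps the exclamation mark as its own token, which is what the PoS tagging step expects.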
Part-of-speech tagging (abbreviated as PoS tagging) involves adding a part-of-speech category to each token within a text. Some common PoS tags are verb, adjective, noun, pronoun, conjunction, preposition, and interjection, among others. Applied to the example above, the result would look like this:
“Customer service”: NOUN, “could”: VERB, “not”: ADVERB, “be”: VERB, “better”: ADJECTIVE, “!”: PUNCTUATION
PoS tagging is useful for identifying relationships between words and, therefore, for understanding the meaning of sentences.
Dependency grammar refers to the way the words in a sentence are connected to each other. A dependency parser, therefore, analyzes how ‘headwords’ are related to and modified by other words in order to understand the syntactic structure of a sentence.
Constituency parsing aims to represent the entire syntactic structure of a sentence using a phrase structure grammar. It builds a tree in which terminal nodes correspond to the words of the sentence and non-terminal nodes correspond to phrase categories, such as noun phrase or verb phrase.
You can try different parsing algorithms and strategies depending on the nature of the text you intend to analyze, and the level of complexity you would like to achieve.
Lemmatization & Stemming
When we speak or write, we tend to use inflected forms of a word (words in their different grammatical forms). To make these words easier for computers to understand, NLP uses lemmatization and stemming to transform them back to their root form.
The word as it appears in the dictionary – its root form – is called a lemma. For example, the words ‘are’, ‘is’, ‘am’, ‘were’, and ‘been’ are grouped under the lemma ‘be’. So, if we apply lemmatization to “African elephants have 4 nails on their front feet”, the result will look something like this:
African elephants have 4 nails on their front feet = [“African”, “elephant”, “have”, “4”, “nail”, “on”, “their”, “front”, “foot”]
This example shows how lemmatization rewrites a sentence using the base form of each word (e.g. the word “feet” was changed to “foot”).
When we refer to stemming, the root form of a word is called a stem. Stemming ‘trims’ words, so word stems may not always be semantically correct.
For example, stemming the words “consult”, “consultant”, “consulting”, and “consultants” would result in the root form “consult”.
While lemmatization is dictionary-based and chooses the appropriate lemma based on context, stemming operates on single words without considering the context. For example, in the sentence:
“This is better”
The word “better” is transformed into the word “good” by a lemmatizer but is unchanged by stemming. Even though they can lead to less accurate results, stemmers are easier to build and run faster than lemmatizers. Lemmatizers, however, are the better choice if you need linguistically precise results.
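To make the contrast concrete, here is a minimal sketch of both techniques. The suffix list and the lemma dictionary are toy assumptions chosen to cover the examples in this section. A real stemmer (such as the Porter algorithm) applies many more rules, and a real lemmatizer also uses context, which this lookup table does not.

```python
# Toy suffix list (assumption), ordered so longer suffixes are tried first.
SUFFIXES = ("ants", "ant", "ing", "s")

# Toy lemma dictionary (assumption) covering the examples in the text.
LEMMAS = {
    "are": "be", "is": "be", "am": "be", "were": "be", "been": "be",
    "elephants": "elephant", "nails": "nail", "feet": "foot",
    "better": "good",
}


def toy_stem(word):
    """Trim the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    word = word.lower()
    return LEMMAS.get(word, word)


print([toy_stem(w) for w in ["consult", "consultant", "consulting", "consultants"]])
# ['consult', 'consult', 'consult', 'consult']
print(toy_stem("better"), toy_lemmatize("better"))
# better good
```

The stemmer only inspects the characters of a single word, which is why it is fast but cannot relate “better” to “good”.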
Stop Word Removal
Removing stop words is an important step in NLP text processing. It involves filtering out high-frequency words that add little or no semantic value to a sentence, for example, which, too, at, for, is, etc.
You can even customize lists of stop words to include words that you want to ignore.
Let us say you would like to classify customer service tickets based on their topics. In this example: “Hello, I’m having trouble logging in with my new password”, it may be useful to remove stop words like “hello”, “I”, “am”, “with”, “my”, so you’re left with the words that help you understand the topic of the ticket: “trouble”, “logging in”, “new”, “password”.
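The ticket example above can be sketched as follows. The default stop word list and the `extra_stop_words` parameter are assumptions for illustration. Libraries ship much longer lists, and the right list depends on your task.

```python
import re

# Small illustrative stop word list (assumption, not a standard list).
DEFAULT_STOP_WORDS = {
    "hello", "i", "i'm", "am", "with", "my",
    "which", "too", "at", "for", "is",
}


def remove_stop_words(text, extra_stop_words=()):
    """Return the lowercase tokens of `text` that are not stop words."""
    stop_words = DEFAULT_STOP_WORDS | {word.lower() for word in extra_stop_words}
    # Normalize curly apostrophes so "I’m" and "I'm" are treated alike.
    normalized = text.lower().replace("’", "'")
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", normalized)
    return [token for token in tokens if token not in stop_words]


ticket = "Hello, I’m having trouble logging in with my new password"
print(remove_stop_words(ticket))
# ['having', 'trouble', 'logging', 'in', 'new', 'password']
print(remove_stop_words(ticket, extra_stop_words=("having",)))
# ['trouble', 'logging', 'in', 'new', 'password']
```

The second call shows the customization mentioned earlier: adding “having” to the list leaves only the words that describe the topic of the ticket.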
Semantic Natural Language Processing
Semantic analysis focuses on identifying the meaning of language. However, since language is polysemous and ambiguous, semantics is considered one of the most challenging areas in NLP.
Semantic tasks analyze the structure of sentences, word interactions, and related concepts, to discover the meaning of words, as well as understand the topic of a text.
Some sub-tasks of semantic analysis include:
Word Sense Disambiguation
Depending on their context, words can have different meanings. Take the word “book”, for example:
- You should read this book; it is a great novel!
- You should book the flights as soon as possible.
- You should close the books by the end of the year.
- You should do everything in the book to avoid potential complications.
There are two main techniques that can be used for Word Sense Disambiguation (WSD): a knowledge-based (or dictionary) approach and a supervised approach. The first tries to infer meaning by looking up the dictionary definitions of ambiguous terms within a text, while the latter is based on machine learning algorithms that learn from examples (training data).
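As a rough illustration of the knowledge-based approach, the sketch below picks the sense whose dictionary definition shares the most words with the sentence, a simplified form of the Lesk algorithm. The sense labels, the definitions, and the stop word list are invented for this example and do not come from a real dictionary.

```python
import re

# Hypothetical mini-dictionary: sense label -> definition (gloss).
SENSES = {
    "book": {
        "publication": "a written work such as a novel that you read",
        "reserve": "to reserve or arrange flights tickets or a hotel in advance",
        "accounts": "financial records of a company closed at the end of the year",
    }
}

# Function words ignored when comparing context and definitions (assumption).
STOP_WORDS = {
    "a", "the", "of", "to", "or", "you", "should", "this", "it", "is",
    "as", "such", "that", "in", "at", "and", "by",
}


def content_words(text):
    """Return the set of lowercase words in `text` that are not stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def disambiguate(word, sentence, senses=SENSES):
    """Pick the sense whose definition overlaps most with the sentence."""
    context = content_words(sentence) - {word}
    glosses = senses[word]
    return max(glosses, key=lambda sense: len(context & content_words(glosses[sense])))


print(disambiguate("book", "You should read this book; it is a great novel!"))
# publication
print(disambiguate("book", "You should book the flights as soon as possible."))
# reserve
```

A supervised system would instead learn these associations from sentences tagged with the correct sense.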
Relationship extraction consists of identifying semantic relationships between two or more entities in a text. Entities can be names, places, organizations, etc., and relationships can be established in a variety of ways. For example, in the phrase “Susan lives in Los Angeles”, a person (Susan) is related to a place (Los Angeles) by the semantic category “lives in”.
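A very simple way to extract a relationship like “lives in” is a hand-written pattern, sketched below. It assumes that entities are runs of capitalized words and that the relation phrases are known in advance. The “works at” relation and the second sentence are hypothetical additions for illustration. Real systems rely on entity recognition and parsing rather than capitalization.

```python
import re

# Assumption: an entity is one or more capitalized words, e.g. "Los Angeles".
ENTITY = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Relation phrases to look for ("works at" is a hypothetical extra).
RELATIONS = ("lives in", "works at")

PATTERN = re.compile(rf"\b({ENTITY}) ({'|'.join(RELATIONS)}) ({ENTITY})")


def extract_relations(text):
    """Return (subject, relation, object) triples found in `text`."""
    return [match.groups() for match in PATTERN.finditer(text)]


print(extract_relations("Susan lives in Los Angeles. Tom works at Acme Corp."))
# [('Susan', 'lives in', 'Los Angeles'), ('Tom', 'works at', 'Acme Corp')]
```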
Rule-Based vs Machine Learning Natural Language Processing
There are two main technical approaches to Natural Language Processing that create different types of systems: one is based on linguistic rules and the other on machine learning methods. In this section, we will examine the advantages and disadvantages of each one and the possibility of combining both (hybrid approach).
Rule-based systems are the earliest approach to NLP and involve applying hand-crafted linguistic rules to text. Each rule is formed by an antecedent and a prediction:
IF this happens (antecedent), THEN this will be the outcome (prediction).
For example, imagine you’d like to perform sentiment analysis to classify positive and negative opinions in product reviews. First, you would create a list of positive words (such as good, best, excellent, etc.) and a list of negative words (bad, worst, frustrating, etc.). Then, you would go through each review and count the number of negative and positive words within each text to determine the overall sentiment.
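The word-counting rule just described might look like this in Python. The word lists are the short examples from the text, and the “neutral” outcome for ties is an assumption added for this sketch.

```python
import re

POSITIVE_WORDS = {"good", "best", "excellent"}
NEGATIVE_WORDS = {"bad", "worst", "frustrating"}


def classify_sentiment(review):
    """Classify a review by counting positive and negative words."""
    tokens = re.findall(r"[a-z]+", review.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    # IF positive words outnumber negative ones, THEN the review is positive.
    if positives > negatives:
        return "positive"
    # IF negative words outnumber positive ones, THEN the review is negative.
    if negatives > positives:
        return "negative"
    # Tie or no sentiment words at all (assumption: call it neutral).
    return "neutral"


print(classify_sentiment("The best phone, excellent battery but bad camera"))
# positive
print(classify_sentiment("Worst purchase ever, frustrating setup"))
# negative
```

This sketch also shows a typical weakness of simple rules: “not good” is classified as positive because negation is not handled, and fixing that means adding yet another rule.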
Since rules are determined by humans, this type of system is easy to understand and provides fairly accurate results with minimal effort. Another advantage of rule-based systems is that they do not require training data, which makes them a good option if you do not have much data and are just starting your analysis.
However, manually crafting and enhancing rules can be a difficult and cumbersome task, and often requires a linguist or a knowledge engineer. Also, adding too many rules can lead to complex systems with contradictory rules.
Machine Learning Models
Machine Learning consists of algorithms that can learn to understand language based on previous observations. The system uses statistical methods to build its own ‘knowledge bank’ and is trained to make associations between an input and its corresponding output.
Let us go back to the sentiment analysis example. With machine learning, you can build a model to automatically classify opinions as positive, negative, or neutral. But first, you need to train your classifier by manually tagging text examples, until it is ready to make its own predictions for unseen data.
You will also need to transform the text examples into something a machine can understand (vectors), a process known as feature extraction or text vectorization. Once the texts have been transformed into vectors, they are fed to a machine learning algorithm together with their expected output (tags) to create a classification model. This model can then discern which features best represent the texts, and make predictions for unseen data.
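One of the simplest forms of text vectorization is the bag-of-words model, where each position in the vector counts how often a vocabulary word appears in the text. The sketch below is a minimal illustration with hypothetical function names and made-up training texts. It covers only the vectorization step, not the training of the classifier.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts):
    """Map every distinct word in the training texts to a vector index."""
    words = sorted({word for text in texts for word in tokenize(text)})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Turn a text into a list of word counts, one entry per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word in tokenize(text):
        if word in vocabulary:  # words never seen in training are ignored
            vector[vocabulary[word]] += 1
    return vector


training_texts = ["great product", "bad product", "great great service"]
vocabulary = build_vocabulary(training_texts)
print(vocabulary)
# {'bad': 0, 'great': 1, 'product': 2, 'service': 3}
print(vectorize("great great service", vocabulary))
# [0, 2, 0, 1]
```

These vectors, paired with their tags, are what the learning algorithm receives as training input.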
The biggest advantage of machine learning models is their ability to learn on their own, with no need to define manual rules. All you will need is a good set of training data, with several examples for each of the tags you would like to analyze.
Given enough relevant training data, machine learning models often deliver higher accuracy than rule-based systems, and they generally improve as you feed them more training data.
However, you will need training data that is relevant to the problem you want to solve to build an accurate machine learning model.
A third approach involves combining both rule-based and machine learning systems. That way, you can benefit from the advantages of each of them and gain higher accuracy in your results.
Natural Language Processing Algorithms
Natural language processing algorithms are usually based on machine learning algorithms. Below are some of the most popular ones that you can use depending on the task you want to perform:
Text Classification Algorithms
Text classification is the process of organizing unstructured text into predefined categories (tags). Text classification tasks include sentiment analysis, intent detection, topic modeling, and language detection.
Some of the most popular algorithms for creating text classification models are:
- Naive Bayes: a collection of probabilistic algorithms that draw on probability theory and Bayes’ Theorem to predict the tag of a text. According to Bayes’ Theorem, the probability of an event (A), given that a related event (B) has occurred, can be calculated as P(A|B) = P(B|A) × P(A) / P(B).
This model is called naive because it assumes that each variable (feature or predictor) is independent, has no effect on the others, and has an equal impact on the outcome. The Naive Bayes algorithm is used for text classification, sentiment analysis, recommendation systems, and spam filters.
- Support Vector Machines (SVM): this is an algorithm mostly used to solve classification problems with high accuracy. Supervised classification models aim to predict the category of a piece of text based on a set of manually tagged training examples.
To do that, SVM turns training examples into vectors and draws a hyperplane to differentiate two classes of vectors: those that belong to a certain tag and those that do not. Based on which side of the boundary a new vector lands, the model assigns one tag or the other. SVM algorithms can be especially useful when you have a limited amount of data.
- Deep Learning: this set of machine learning algorithms is based on artificial neural networks. They are well suited to processing large volumes of data but, in turn, require a large training corpus. Deep learning algorithms are used to solve complex NLP problems.
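As an illustration of the first algorithm in this list, here is a minimal multinomial Naive Bayes classifier written from scratch. It scores each tag as log P(tag) plus the sum of log P(word | tag) over the words in the text, and it uses add-one (Laplace) smoothing so that unseen word and tag combinations do not get zero probability. The class name, the smoothing choice, and the tiny spam/ham dataset are assumptions made for this sketch.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    def fit(self, texts, tags):
        """Count documents per tag and word occurrences per tag."""
        self.tag_counts = Counter(tags)
        self.word_counts = defaultdict(Counter)
        for text, tag in zip(texts, tags):
            self.word_counts[tag].update(tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        """Return the tag with the highest log-probability score."""
        total_docs = sum(self.tag_counts.values())
        scores = {}
        for tag, n_docs in self.tag_counts.items():
            score = math.log(n_docs / total_docs)  # log prior P(tag)
            total_words = sum(self.word_counts[tag].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen during training
                # Add-one smoothing for P(word | tag).
                count = self.word_counts[tag][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            scores[tag] = score
        return max(scores, key=scores.get)


# Made-up training examples for illustration only.
texts = ["win a free prize now", "free money offer",
         "meeting agenda for monday", "project status meeting"]
tags = ["spam", "spam", "ham", "ham"]
model = NaiveBayesClassifier().fit(texts, tags)
print(model.predict("free prize offer"))        # spam
print(model.predict("monday project meeting"))  # ham
```

Treating every word as independent of the others is the “naive” assumption: the score is just a sum of per-word terms.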
Text Extraction Algorithms
Text extraction consists of extracting specific pieces of data from a text. You can use extraction models to pull out keywords, entities such as company names or locations, otherwise known as entity recognition, or to summarize the text. Here are the most common algorithms for text extraction:
- TF-IDF (term frequency-inverse document frequency): this statistical approach determines how relevant a word is to a text within a collection of documents and is often used to extract relevant keywords from the text. The importance of a word increases with the number of times it appears in the text (term frequency) but decreases with the number of documents in the corpus that contain it (inverse document frequency).
- Regular Expressions (regex): a regular expression is a sequence of characters that defines a pattern. Regex checks whether a string contains a given search pattern, for example in text editors or search engines, and is often used for extracting keywords and entities from text.
- CRF (conditional random fields): this machine learning approach learns patterns and extracts data by assigning a weight to a set of features in a sentence. This approach can create patterns that are richer and more complex than those created with regex, enabling machines to determine better outcomes for more ambiguous expressions.
- Rapid Automatic Keyword Extraction (RAKE): this algorithm for keyword extraction uses a list of stop words and phrase delimiters to identify relevant words or phrases within a text. Basically, it analyzes the frequency of a word and its co-occurrence with other words.
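To illustrate the first algorithm in this list, the sketch below computes TF-IDF scores by hand. It uses one common variant: term frequency is the word count divided by the text length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries often use smoothed variants, so their numbers will differ. The function names and sample documents are assumptions for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dictionary per document."""
    tokenized = [tokenize(document) for document in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=2):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    return sorted(doc_scores, key=lambda word: (-doc_scores[word], word))[:k]


documents = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
scores = tf_idf(documents)
print(scores[0]["the"])         # 0.0, because "the" appears in every document
print(top_keywords(scores[2]))  # ['bird', 'sang']
```

A word such as “the” is frequent in each text but appears in every document, so its score drops to zero, while words that are specific to one text score highest.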
Topic Modeling Algorithms
Topic modeling is a method for clustering groups of words and similar expressions within a set of data. Unlike topic classification, topic modeling is an unsupervised method, which means that it infers patterns from data without needing to define categories or tag data beforehand.
The main algorithms used for topic modeling include:
- Latent Semantic Analysis (LSA): this method is based on the distributional hypothesis and identifies words and expressions with similar meanings that occur in similar pieces of text. It is one of the most frequently used methods for topic modeling.
- Latent Dirichlet Allocation (LDA): this is a generative statistical model that assumes that documents contain various topics and that each topic contains words with certain probabilities of occurrence.