An Overview of Natural Language Processing and Speech Technologies with Common Methods

James Byrne
Artificial Intelligence in Plain English
15 min read · Feb 20, 2021


source: Image from Pixabay

Introduction

We are going to cover an overview of language technologies along with some common methods used in them:

  1. What is Natural Language Processing?
  2. Background: a brief look at Artificial Intelligence, Machine Learning and Deep Learning and the differences between them.
  3. Natural Language Understanding (NLU).
  4. Speech Technologies.

What is Natural Language Processing?

Natural Language Processing (NLP) is a subfield of linguistics, Computer Science and Artificial Intelligence concerned with the interactions between computers and human language, in particular how to process and analyse large quantities of natural language data.

You’re probably wondering what all that means.

Natural Language Processing is anything where computers try to understand and process human language (for example, English, Spanish, Arabic or Mandarin). Researchers have been working on NLP since the 1950s, but the last 5–10 years have brought far more innovative technologies, such as Siri, Google Translate, predictive text and Google Duplex.

Examples of Natural Language Processing (Google Search, Siri, Google Translate, Predictive Text). top left screenshot by author of Google Search, top right image of Google Translate logo from Wikimedia, bottom left image of Apple Siri icon from wikia.nocookie, bottom right: screenshot of Apple Notes App by author

Artificial Intelligence, Machine Learning and Deep Learning

Before getting into NLP, let’s quickly explain what Artificial Intelligence, Machine Learning and Deep Learning are and the differences between them, as many people mix them up and use the wrong term.

Image by author

As you can see in the diagram, Artificial Intelligence (AI) contains both Machine Learning (ML) and Deep Learning (DL). Machine Learning is a part of Artificial Intelligence and Deep Learning is a part of Machine Learning.

What is Artificial Intelligence?

Broadly speaking, it’s a system or machine that simulates human intelligence to do specific tasks and that can iteratively improve itself (get better as time goes on). Examples of Artificial Intelligence range from facial recognition to robot vacuums (like the Roomba) to chatbots.

What is Machine Learning?

A system that automatically learns and improves with experience, without code being explicitly written for it. For example, you could train a system to predict what type of fruit is in a photo. The system will learn rules such as a pear is green and an apple and an orange are round. There are two types of Machine Learning: Supervised and Unsupervised Learning.

Supervised Learning is learning when the data set consists of inputs and outputs, and it is generally used to predict things. The outputs are what you want to predict and the inputs are what you use to predict them. The example above, which predicts the type of fruit in an image, is an example of Supervised Learning: the image is the input and the type of fruit is the output.

Unsupervised Learning is learning when the data set does not have outputs. Therefore, it only has inputs. Normally this type of learning is used to

  • Find patterns in a dataset, for example, to predict which items are regularly bought together in an e-commerce website.
  • Put entries in a dataset into groups, for example, to group similar customers of a product together.
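As a minimal sketch of Supervised Learning, here is the fruit example with scikit-learn. The features and numbers are entirely made up for illustration:

```python
# A tiny supervised-learning sketch: each fruit is described by two
# numeric inputs (features) and labelled with an output (its name).
from sklearn.tree import DecisionTreeClassifier

# inputs: [weight in grams, colour score (0 = green, 1 = orange)]
X = [[150, 0.10], [170, 0.15], [130, 0.90], [140, 0.95]]
y = ["pear", "pear", "orange", "orange"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
prediction = model.predict([[160, 0.12]])[0]  # an unseen, green-ish fruit
```

The model learns the labelling rules from the examples rather than from explicitly written code, which is the whole point of Machine Learning.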

What is Deep Learning?

Deep Learning is a type of Machine Learning that uses a specific family of algorithms called Neural Networks. The concept of Neural Networks comes from how the human brain works, where neurons are interconnected. Generally, you need a big dataset to train Deep Learning models. Examples include face recognition, text generation and image classification.

Neuron

Image by author

A neuron has many inputs and one output. The neuron takes in the inputs (X1, X2, …, Xn), performs some arithmetic on them and returns the output y.
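That arithmetic can be sketched in a few lines of Python, assuming a weighted sum followed by a sigmoid activation (a common choice, though not the only one); the inputs, weights and bias below are made up:

```python
import math

def neuron(inputs, weights, bias):
    # weighted sum of the inputs, plus a bias term
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # sigmoid activation squashes the result into the range (0, 1)
    return 1 / (1 + math.exp(-z))

y = neuron(inputs=[1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.1], bias=0.2)
```

Training a neural network amounts to adjusting the weights and bias of every neuron until the outputs are useful.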

Neural Network

Image by author

Think of a neural network as a complex function: it has inputs (X1, X2, …, Xn) and outputs (y1, …, ym). When the neural network is learning, it learns how to predict its outputs from its inputs. If you look at it as a function, it is learning the equation of that function. A neural network has an input layer (green), one or more hidden layers (blue) and an output layer (red).

For example, consider a neural network with three inputs and two outputs. The inputs (X1, X2, X3) are the dimensions of a cuboid. The output y1 is the volume of the cuboid and y2 is its surface area. The hidden layer performs the necessary arithmetic to produce the outputs, like a function.
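A forward pass through such a network can be sketched with NumPy. The weights below are random (i.e. untrained), so this only shows the shape of the computation, not a network that actually computes volume and surface area:

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 inputs (the cuboid's dimensions), a hidden layer of 4 neurons,
# and 2 outputs (volume and surface area).
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

def forward(x):
    hidden = np.tanh(x @ W1 + b1)  # hidden layer with tanh activation
    return hidden @ W2 + b2        # linear output layer

y = forward(np.array([2.0, 3.0, 4.0]))  # produces 2 values, one per output
```

Training would adjust W1, b1, W2 and b2 until the two outputs approximate the cuboid’s volume and surface area.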

What Does This Have to Do With Natural Language Processing?

Image by author

Natural Language Processing is a part of Artificial Intelligence and it overlaps with Machine Learning and Deep Learning. Machine Learning and Deep Learning algorithms are used in Natural Language Processing.

Natural Language Understanding

An area of NLP that focuses on machine reading comprehension: everything to do with understanding language. Examples include Question Answering, Text Classification, Information Retrieval, Sentiment Analysis, Named Entity Recognition and Machine Translation.

Question Answering

left image by author, right image from Smart Home Explained

A user asks a question and the machine replies with an answer. Simple, right? Not really! There are many difficulties in understanding a user’s question, such as slang, different verb conjugations, homonyms (words with more than one meaning) and synonyms (different words that mean the same thing).

Machines struggle to decide which meaning of a word is being used. Humans do not struggle with this (for the most part) because we understand the context in which the word is used; machines do not have this understanding. Examples of homonyms are band (music group / elastic band), left (to leave / left and right), park (to park / playground) and bear (to tolerate / animal).

Synonyms cause a lot of problems when understanding language because we can describe the same idea using different words. Ask and enquire, disappear and vanish, and mistake and error are all examples of synonyms. If an idea is described using different words, a computer would not know that it’s describing the same thing. This is why computers must group synonyms to be able to read and comprehend text. However, synonyms are not always interchangeable. For example, big and large both mean the same when describing an object, but only big makes sense in terms of age (your big sister, not your large sister).

Text Classification

Classification is a type of machine learning task in which an algorithm tries to label the input data. For example, the input is a photo of a fruit and the algorithm tries to identify (label) the fruit. A classifier model has a fixed set of labels; for example, the fruit classifier can only classify bananas, apples and pears.

Text classification labels text. Text can be classified by the subject of an article (whether it is technology, sport or entertainment), or a model could try to identify whether an email is spam or not.

Image from towards data science

How do you classify text?

To start off, you need a dataset composed of documents with their correct labels associated with them. We will use a spam detection classifier for this example.

So, we have a dataset of tens of thousands of emails (or even more) and we know which ones are spam. We then train a model to determine whether an email is spam or not. The emails are the input and the output is whether each email is spam or not.

Pre-processing using the Count Vectorizer

We can’t just use the text as input; we need to transform the text into a form that the algorithm can work with. Count Vectorizer is a simple method for representing a collection of words with numbers. It creates a vector for each document in the dataset consisting of the number of times each word occurs in that document. For example, say we have two documents, Doc1 = The quick brown fox and Doc2 = Jumps over the lazy dog. The vectors for these two documents are as follows.

Image by author

These vectors can then be used as the input to determine whether an email is spam or not, or for any other model.

What is Information retrieval?

Information retrieval is processing a dataset of documents so that they can be retrieved quickly based on keywords from a user’s query. Google Search is a massive information retrieval system where the documents are web pages and the queries are what the user types into Google.

Google Search has to overcome some difficulties which include:

  • The sheer number of web pages on the internet, around 4.2 billion. Google Search needs to process all of these web pages for information retrieval, so it needs an approach that is efficient yet effective.
  • Web pages update regularly! So not only does Google have a huge number of web pages to process, it has to frequently update what it stores about each page.
  • When a user (like you or me) searches for something on Google, we expect fast results; we don’t want to be waiting long.
  • Linguistic difficulties, including synonyms, homonyms and verb conjugations.

How do we prepare the documents for information retrieval?

TF-IDF = Term Frequency-Inverse Document Frequency

TF-IDF is a simple method for information retrieval, and it has its limitations, including not taking homonyms and synonyms into account. TF-IDF is important for information retrieval because it can be used to measure how similar two documents are, or how relevant a query is to a document.

TF-IDF evaluates the importance of each word in a document. Important words are those that occur mainly in that document (or similar documents), for example cat, social networks or natural language processing. Unimportant words are those that have nothing to do with the subject but are used to explain it, for example for, not, no, that, has, it. The importance of a word increases proportionally with the number of times it occurs in the document, and decreases with the number of documents in the dataset that contain it.

To use TF-IDF you need a dataset containing many different texts, for example, all of the articles on Wikipedia. The TF-IDF value is composed of two values, Term Frequency (TF) and Inverse Document Frequency (IDF).

Image by author

For example, we have a document with 100 words, and the word cat occurs 3 times in it. Our dataset has 10 million documents, and cat appears in 1,000 of them. The TF-IDF value is calculated as follows:

Image by author
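The worked example can be checked with a few lines of Python; this assumes a base-10 logarithm in the IDF term (implementations also commonly use the natural log):

```python
import math

tf = 3 / 100                          # "cat" occurs 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)  # 10 million documents, 1,000 contain "cat"
tfidf = tf * idf                      # 0.03 * 4 = 0.12
```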

The result of computing the TF-IDF value for each unique word in each document in a dataset is a vector representing each document, as in the table below.

Image by author

TF-IDF and Information Retrieval

Now that we have our vectors for each document, how do we perform information retrieval? The process for the user query What is Natural Language Processing is as follows:

  1. We transform the search query, What is Natural Language Processing, into a vector of TF-IDF values, just like a row in the table above so that it is in the same format as the vectors that represent the documents.
  2. We perform a calculation called Cosine Similarity with the query vector and the vector of each document. Cosine Similarity is a measure of the similarity of two non-zero vectors.
  3. We then select the N documents with the highest Cosine Similarity scores to return to the user.
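The three steps above can be sketched with scikit-learn; the documents and query here are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Natural Language Processing lets computers analyse human language",
    "How to cook the perfect pasta at home",
    "A beginner guide to gardening in small spaces",
]

# Step 1: build TF-IDF vectors for the documents, then transform the
# query with the SAME vectorizer so it shares the vocabulary.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform(["What is Natural Language Processing"])

# Step 2: cosine similarity between the query and every document.
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Step 3: document indices sorted best match first; we would return
# the top N of these to the user.
ranked = scores.argsort()[::-1]
```

Here the first document ranks highest because it shares the words natural, language and processing with the query.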

Can you use TF-IDF for other things? Yes!

TF-IDF can be used for many NLP tasks including analysing documents and text classification because it is a way of transforming text into a numeric form. It can be used instead of Count Vectorizer in the example above.
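For instance, a toy spam classifier might pair TF-IDF vectors with a Naive Bayes model; the emails below are invented, and a real system would need far more data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win free money now claim your prize",
    "free prize waiting click to claim",
    "meeting moved to tomorrow please update the agenda",
    "please review the attached report before the meeting",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(emails)  # text -> numeric vectors

model = MultinomialNB().fit(features, labels)
prediction = model.predict(vectorizer.transform(["claim your free prize"]))[0]
```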

Named Entity Recognition

Named Entity Recognition is the task of identifying and classifying entities (key information) in text. An entity can be a word (or series of words) that refers to a specific thing, for example, football, Barack Obama or Canada.

The process of Named Entity Recognition is:

  1. Detect a word or series of words as an entity.
  2. Classify the entity. You can create your own list of labels to classify named entities. For example, your list can consist of places, people, organisations and dates.

Below is an example of Named Entity Recognition, and you can try it out for yourself here.

screenshot by author from explosion.ai webpage

Machine Translation

Machine translation is the translation of text by a system without any human intervention. Google Translate is the most famous machine translation system.

With over 7,000 languages in the world, there has never been a greater need for instant translation; communication is essential to navigate the globalised world we live in.

There are two general ways to perform Machine Translation: the older way, Rule-Based Machine Translation, and the more modern way, Statistical Machine Translation.

Rule-Based Machine Translation involves programmers and language experts collaborating to create a collection of rules that a system uses to translate from one specific language to another. They use dictionaries, grammatical rules and semantic patterns from the two languages to create the collection of rules. The disadvantages of this approach are that it’s very laborious, it does not cover slang or informal use of a language, and it does not take into account the changing nature of language (meaning the rules would need to be updated to reflect changes in the language).

Statistical Machine Translation does not use any knowledge of the grammatical rules of the languages it is translating from or to. Instead, algorithms analyse texts that have already been translated (by humans) and learn the probabilities of word orderings and of how smaller parts of the text join together. This results in a database of translations based on the statistical probability that a word or phrase in one language corresponds to a word or phrase in the other.

Machine Translation does have some issues to overcome, for example, gender bias. Since Machine Translation is based on text translated by humans, it is affected by social bias. In this case, it is especially affected when you translate to a gender-specific language from a language that is not as gender-specific. For example, Turkish to English.

In the past, Google translated the equivalent of he/she is a doctor to its masculine form and the equivalent of he/she is a nurse to its feminine form. However, they have corrected this and provide both masculine and feminine translations.

Image from Google Blog

Speech Technologies

Let’s look at how Speech to Text and Text to Speech work. They are the inverse of each other and each has its own difficulties to overcome.

Speech to Text

Speech to Text: Image from Predictive Hacks

With any model, there are two general steps involved, the first is building and training it and then the second is using the model.

Building and training a Speech to Text model requires a large amount of data. The dataset consists of inputs, in this case, they are audio recordings of speech and the respective outputs are the transcription of what was said in the audio recording.

Image by author

Once you have the dataset, pre-processing is performed to clean it. This includes feature extraction, which reduces the dimensionality of the input data to keep only its important features.

After feature extraction is performed, we can train a model to convert speech to text. Testing a model is vital to ensure the model performs well. One important thing to note is that this is an iterative process, so it is normal to go back to the pre-processing and training steps.

Once the model is built and trained, it is ready to be used! To use the model, speech is recorded and then the same pre-processing steps that were performed on the dataset to train the model are performed on the recorded speech. The model is run with the pre-processed features as the input and the model will output text that represents what was said in the audio!

Why is It so Difficult for a Computer to Understand Speech?

Many things make it difficult to create a Speech to Text model, such as varying dialects and accents, words that sound the same, phrases that sound very similar and compound words. What helps us humans understand speech is context: the situation in which a word is used. For example, a listener already understands what was said previously and knows what topic is being discussed. It is very difficult to make a machine understand context because context is implicit.

Different dialects and accents create problems when trying to understand speech. This is because the same word can be pronounced in many different ways depending on the accent. On top of that, different words may sound the same in different accents.

Words that sound the same are called homophones. English has many examples, including they’re, there and their; knight and night; and allowed and aloud. If you are given the three words they’re, there and their and asked to fill in the blank in the sentence _____ is a sale on in the shop, you would pick there, right? How did you know that was the correct answer? Context! We know intuitively that the other two options are wrong. We, as humans, are good at understanding context, something a machine is bad at.

You might be thinking, why can’t we just program all of these rules into a machine? It’s unrealistic, languages are very complex with many exceptions. To make matters worse, language is continuously changing and adapting so if a set of rules are made, they would become outdated very quickly!

Similarly to words that sound the same, there are sentences that sound extremely similar, which makes it difficult for a machine (and even a person at times) to differentiate between them. Examples are Will the new display recognise speech? versus Will the nudest play wreck a nice beach, and I want four candles versus I want fork handles. With context this would not be an issue; it would be clear which of the two is correct.

Finally, compound words also make it difficult for a computer to understand speech. Examples of compound words include crossroad (cross road), fourteen (four teen), backyard (back yard) and playschool (play school).

All of these together make it complex and difficult for a machine to understand speech.

Text to Speech

The opposite of Speech to Text. The idea here is to give a machine some text and make it say it.

Image from towards data science

How does it work? Here is one way to do it! First, you take in the text and normalise it to remove any symbols or shorthand words. After this, the text is analysed with part-of-speech tagging, which identifies whether each word is a noun, verb, pronoun, etc. Each word is then converted into its respective list of phonemes. A phoneme is a single sound, and words are made up of a sequence of sounds (not to be confused with letters!). Finally, the computer says the text using the list of phonemes.
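The steps above can be sketched with a toy phoneme dictionary; a real system would use a full lexicon such as CMUdict plus letter-to-sound rules for unknown words, and this sketch skips the part-of-speech step:

```python
# Hand-made phoneme dictionary: purely illustrative, not a real lexicon.
TOY_PHONEMES = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def normalise(text):
    # Step 1: normalisation - lower-case and strip punctuation/symbols
    return "".join(c for c in text.lower() if c.isalpha() or c.isspace())

def to_phonemes(text):
    # Convert each word into its list of phonemes
    return [TOY_PHONEMES[word] for word in normalise(text).split()]

phonemes = to_phonemes("The cat sat!")
# the resulting phoneme sequence would then drive the speech synthesiser
```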

Image by author

Not exactly! There are some difficulties with Text to Speech. One of the most notable is words that are spelt similarly but sound different, for example through, tough, though, thought and thorough. This is a big problem in English, but not in all languages. Spanish, for example, is a phonetic language, meaning words are spelt as they sound, so it doesn’t suffer from this problem.

Why Bother with Speech Technologies?

Speech technologies allow users to interact with a system in a different way. People with certain disabilities may find it easier to interact with a voice user interface. It is also a less intrusive way to interact with a system: users can keep doing what they were doing while they communicate with it, which minimises the interruption. Finally, you can use it as an assistant; Google has created a new feature for its assistant called Google Duplex.

Google Duplex

At the moment, Google Duplex has been launched in the US for making restaurant reservations. However, Google has a big problem to overcome: making people feel comfortable with this type of technology. When it was announced, a lot of people felt uncomfortable with the idea of speaking to a machine without knowing it, since the voice is so human-like.

How to Get Started with Natural Language Processing?

I recommend the Python programming language: most NLP tasks are done in Python and it is a beginner-friendly language. The Natural Language Toolkit (NLTK) is a great framework for learning NLP, and the NLTK Book covers NLTK and many basic NLP tasks. But most importantly, start with a project that interests you!

Other Stories

Click here if you’d like to read about Word2Vec, a neural network that creates vectors to represent words.

That’s it!

Thank you for reading the article, this is my first one so I would love some feedback! If you find any mistakes in the article please reach out to me and I’ll happily make the change. Follow me on Twitter for more tech related content and tweet the article!

More content at plainenglish.io


I am a Software Engineer who sometimes does some Data Science, and I want to share what I know and learn new things!