Transformers are a powerful class of machine learning models. They are built on a neural network architecture that uses attention to weigh the relationships between all parts of its input. This allows machines to interpret natural language, recognize objects in images, and generate new text. With transformers, machines can tackle complex tasks in natural language processing and computer vision with high accuracy.
What are transformers?
Transformers are a relatively new type of neural network designed to process sequences while handling long-range dependencies with ease. Today they are the most advanced technique in natural language processing (NLP).
With their help, you can translate text, write poetry and articles, and even generate computer code. Unlike recurrent neural networks (RNNs), transformers do not process sequences in order. For example, if the input is text, they do not have to finish processing the beginning of a sentence before processing its end. As a result, such a network is easy to parallelize and can be trained much faster.
When did they appear?
Transformers were first described by researchers from Google Brain in the 2017 paper “Attention Is All You Need”.
One of the main differences from earlier methods is that the whole input sequence can be fed in parallel, which makes efficient use of GPUs and speeds up training.
Why are transformers needed?
Until 2017, engineers applying deep learning to text relied mainly on recurrent neural networks.
Suppose a sentence is being translated from English into Russian. An RNN takes the English sentence as input, processes the words one at a time, and then outputs their Russian counterparts one after another. The key word here is “sequentially”. Word order matters in a language, and you cannot simply shuffle the words.
Here RNNs face a number of problems. First, they struggle with long sequences of text. By the time they reach the end of a paragraph, they “forget” what was at the beginning. For example, an RNN-based translation model may have trouble remembering the gender of the subject of a long text.
Second, RNNs are hard to train. They are known to be prone to the so-called vanishing/exploding gradient problem.
Third, because they process words sequentially, recurrent neural networks are difficult to parallelize. Adding more GPUs does little to speed up training, so it is impractical to train them on very large amounts of data.
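The last two problems can be illustrated with a toy example. The sketch below is only an assumption-laden illustration, not a real translation model: it uses a hypothetical one-unit RNN with made-up weights `w_in` and `w_rec`. The loop shows why the steps cannot run in parallel, and the second function multiplies the per-step derivatives to show how the gradient shrinks over a long sequence (the vanishing case).

```python
import math

def rnn_forward(inputs, w_in=0.5, w_rec=0.8):
    """Toy one-unit RNN: every step needs the hidden state of the previous step."""
    hidden = 0.0
    states = []
    for x in inputs:  # inherently sequential: step t cannot start before step t-1 ends
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

def gradient_through_time(states, w_rec=0.8):
    """Derivative of the last hidden state with respect to the first (chain rule)."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_rec * (1.0 - h * h)  # d tanh(a)/da = 1 - tanh(a)^2
    return grad

states = rnn_forward([1.0] * 50)       # a "sentence" of 50 identical toy inputs
grad = gradient_through_time(states)   # very close to zero after 50 steps
print(len(states), grad)
```

With these weights every factor in the product is below 1, so the signal from the first word has almost no influence on the gradient at the last word. Real RNNs multiply matrices rather than scalars, but the effect is of the same kind.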
How do transformers work?
The main components of a transformer are the encoder and the decoder.
The encoder takes the incoming information (for example, text) and converts it into a vector representation (a set of numbers). The decoder, in turn, decodes that representation into a new sequence, for example the answer to a question or the same words in another language, depending on the purpose for which the network was built.
Other innovations behind transformers boil down to three main concepts:
- positional encodings;
- attention;
- self-attention.
Let’s start with the first: positional encodings. Let’s say you need to translate a text from English into Russian. Standard RNN models “understand” word order because they process words sequentially. However, this makes the process difficult to parallelize.
Positional encodings overcome this barrier. The idea is to take all the words in the input sequence, in this case an English sentence, and attach to each one a number indicating its position. So you feed the network the following sequence:
[(“Red”, 1), (“fox”, 2), (“jumps”, 3), (“over”, 4), (“lazy”, 5), (“dog”, 6)]
Conceptually, this can be seen as shifting the burden of understanding word order from the structure of the neural network to the data itself.
At first, before transformers learn from any information, they don’t know how to interpret these positional encodings. But as the model sees more and more examples of sentences and their encodings, it learns how to use them effectively.
The scheme shown above is simplified: the authors of the original study used sinusoidal functions rather than the plain integers 1, 2, 3, 4 to build positional encodings, but the essence is the same. Because word order is stored as data rather than built into the structure, the neural network is easier to train.
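The sinusoidal variant can be sketched in a few lines. This is a minimal illustration of the formula from the 2017 paper, where even dimensions use a sine and odd dimensions a cosine of the position divided by 10000 raised to a dimension-dependent power. The function name, the tiny `d_model=8`, and the use of plain Python lists are assumptions made for readability. Note that positions are counted from 0 here, not from 1 as in the simplified example above.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a table with one row of d_model sinusoidal values per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k + 1) shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

words = ["Red", "fox", "jumps", "over", "lazy", "dog"]
encodings = positional_encoding(len(words), d_model=8)
for word, row in zip(words, encodings):
    print(word, [round(value, 3) for value in row])
```

In a real model each row would be added to the embedding vector of the word at that position, so the position travels with the data.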
Attention is a neural network mechanism introduced in the context of machine translation in 2015. To understand the concept, let’s turn to the original article.
Imagine that we need to translate the following phrase into French:
“The agreement on the European Economic Area was signed in August 1992”.
The French equivalent of the expression is as follows:
“L’accord sur la zone économique européenne a été signé en août 1992”.
The worst way to translate it is to look up the French equivalent of each English word, one by one. This does not work for several reasons.
First, some words appear in a different order in the French translation:
“European Economic Area” versus “la zone économique européenne”.
Second, French has grammatical gender. To agree with the feminine noun “la zone”, the adjectives “économique” and “européenne” must also take the feminine form.
Attention helps to avoid such problems. The mechanism allows the model to “look” at every word in the original sentence when deciding how to translate each word. The original article demonstrates this with a visualization: a kind of heat map showing what the model “pays attention to” as it produces each word of the French sentence. As you would expect, when the model outputs the word “européenne”, it relies heavily on both input words “European” and “Economic”.
The training data teaches the model which words to “pay attention to” at each step. By observing thousands of English and French sentences, the algorithm learns which kinds of words depend on one another. It learns to take into account gender, number, and other rules of grammar.
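One row of such a heat map can be computed roughly as follows. This is a hedged sketch, not the method of the 2015 article: it scores each source word with a plain dot product for brevity, whereas the original work used a small learned network for scoring. The two-dimensional word vectors and the `query` are invented numbers standing in for values a trained model would learn.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, source_vectors):
    """Return (weights, context) for one output step looking at all source words."""
    scores = [sum(q * s for q, s in zip(query, vec)) for vec in source_vectors]
    weights = softmax(scores)
    dim = len(query)
    context = [sum(w * vec[d] for w, vec in zip(weights, source_vectors))
               for d in range(dim)]
    return weights, context

# Hypothetical vectors for three source words (not taken from a trained model).
source_words = ["European", "Economic", "Area"]
source_vectors = [[1.0, 0.2], [0.9, 0.4], [-0.5, 1.0]]
query = [2.0, 0.0]  # hypothetical decoder state while producing "européenne"

weights, context = attend(query, source_vectors)
for word, weight in zip(source_words, weights):
    print(f"{word}: {weight:.2f}")
```

With these made-up numbers most of the weight falls on “European” and “Economic”, which is the pattern the heat map shows. The `context` vector, a weighted average of the source vectors, is what the decoder would use to choose the next word.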
The attention mechanism has been an extremely useful tool for natural language processing since its introduction in 2015, but in its original form it was used together with recurrent neural networks. The innovation of the 2017 transformer paper was, in part, to do away with RNNs entirely. That is why the 2017 paper is called “Attention Is All You Need”.
The last component of transformers is a twist on attention called “self-attention”.
If attention helps to align words when translating from one language to another, self-attention allows the model to understand the meaning and patterns of a language.
For example, consider these two sentences:
“Nikolay lost his car key”
“The crane key headed south”
In the original Russian example, the word translated here as “key” means two very different things, and we humans, knowing the context, can easily tell the meanings apart. Self-attention allows the neural network to understand a word in the context of the words around it.
So when the model processes the word “key” in the first sentence, it may attend to “car” and understand that the sentence is about a shaped piece of metal that opens a lock, and not something else.
In the second sentence, the model may attend to the words “crane” and “south” and relate “key” to a flock of birds. Self-attention helps neural networks disambiguate words, perform part-of-speech tagging, learn semantic roles, and much more.
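A stripped-down version of self-attention is sketched below, under clearly stated simplifications. Real transformers first multiply each word vector by learned query, key, and value matrices and use several attention heads. Here the word vectors themselves serve as queries, keys, and values, and the vectors for “lost”, “car”, and “key” are invented for illustration. The scoring is the scaled dot product from the 2017 paper.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:  # every word looks at every word of the same sentence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[d] for w, value in zip(weights, tokens))
                        for d in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings (not taken from a trained model).
words = ["lost", "car", "key"]
tokens = [[0.0, 1.0], [2.0, 0.0], [1.5, 0.5]]

outputs, all_weights = self_attention(tokens)
key_row = all_weights[words.index("key")]
for word, weight in zip(words, key_row):
    print(f"key -> {word}: {weight:.2f}")
```

With these toy vectors the row for “key” puts its largest weight on “car”, so the new representation of “key” is pulled toward the car-related meaning. Unlike the RNN loop, each iteration here is independent of the others, which is why the computation can run in parallel.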
Where are they used?
Transformers were originally positioned as a neural network for processing and understanding natural language. In the four years since their inception, they have gained popularity and have appeared in many services used daily by millions of people.
One of the simplest examples is the BERT language model, developed by Google in 2018.
On October 25, 2019, the tech giant announced that it had started using the algorithm in the English-language version of its search engine in the United States. A month and a half later, the company expanded the list of supported languages to 70, including Russian, Ukrainian, Kazakh, and Belarusian.
The original English model was trained on the BooksCorpus dataset (800 million words) and on articles from Wikipedia. The base version of BERT has 110 million parameters, and the large version has 340 million.
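To see where figures of this size come from, the parameter count can be approximated with simple arithmetic. The configuration values below (vocabulary of 30,522 tokens, 512 positions, hidden sizes 768 and 1024, 12 and 24 layers, feed-forward sizes 3072 and 4096) are not given in this article. They are assumed from the commonly published BERT-Base and BERT-Large settings, and the function counts only the encoder with its pooling layer, without any task-specific heads.

```python
def bert_encoder_params(vocab=30522, max_pos=512, type_vocab=2,
                        hidden=768, layers=12, ffn=3072):
    """Approximate parameter count of a BERT-style encoder (assumed configuration)."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embeddings plus one layer normalization.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases), plus layer norm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, plus layer norm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = bert_encoder_params()
large = bert_encoder_params(hidden=1024, layers=24, ffn=4096)
print(f"base:  {base:,}")
print(f"large: {large:,}")
```

Under these assumptions the totals come out at roughly 109 million and 335 million, close to the rounded figures of 110 million and 340 million quoted above.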
Another example of a popular transformer-based language model is GPT (Generative Pre-trained Transformer) by OpenAI.
At the time of writing, the latest version of the model is GPT-3. It was trained on a 570 GB dataset and has 175 billion parameters, which makes it one of the largest language models.
GPT-3 can generate articles, answer questions, be used as the basis for chatbots, perform semantic searches, and create short summaries of texts.
GitHub Copilot, an AI assistant for automatic code writing, was also built on GPT-3. It is based on Codex, a special version of GPT-3 trained on a dataset of source code. Researchers have already estimated that, since the release in August 2021, 30% of the new code on GitHub has been written with the help of Copilot.
In addition, transformers are increasingly used in Yandex services such as Search, News, and Translator, as well as in Google products, chatbots, and so on. Sber has also released its own modification of GPT, trained on 600 GB of Russian-language text.
What are the prospects for transformers?
The potential of transformers has not yet been fully realized. They have already proven themselves in text processing, and recently this type of neural network has also been applied to other tasks, such as computer vision.
At the end of 2020, transformer-based CV models showed good results on some popular benchmarks, such as object detection on the COCO dataset and image classification on ImageNet.
In December 2020, researchers from Facebook AI Research published a paper describing Data-efficient Image Transformers (DeiT), a transformer-based model. According to the authors, they found a way to train the algorithm without a huge labeled dataset and achieved a high image recognition accuracy of 85%.
In May 2021, specialists from Facebook AI Research presented DINO, an open-source computer vision algorithm that automatically segments objects in photos and videos without manual labeling. It is also based on transformers, and its segmentation accuracy reached 80%.
Thus, in addition to NLP, transformers are increasingly being used in other tasks.
What threats do transformers pose?
Alongside their obvious advantages, NLP transformers carry a number of threats. The creators of GPT-3 have stated more than once that the neural network could be used for mass spam attacks, harassment, or disinformation.
In addition, the language model is prone to bias against certain groups of people. Although the developers have reduced the toxicity of GPT-3, they are still not ready to give a wide range of developers access to the tool.
In September 2020, Middlebury College researchers published a report on the risks of radicalization associated with the spread of large language models. They noted that GPT-3 shows “significant improvements” in generating extremist texts compared with its predecessor, GPT-2.
Yann LeCun, one of the “fathers of deep learning”, has also criticized the technology. He said that many expectations about the capabilities of large language models are unrealistic.
“Trying to build intelligent machines by scaling language models is like building planes to fly to the moon. You can break altitude records, but going to the moon will require a completely different approach,” LeCun wrote.
Transformers are a powerful machine learning tool that can be used to identify patterns and make predictions. They are capable of processing large amounts of data quickly and accurately, making them well-suited for a variety of tasks. With the increasing availability of data, transformers provide an efficient way to utilize this data and create powerful models. With the right data and parameters, transformers can be used to solve a variety of problems in the field of machine learning.
What are transformers? (machine learning)
Transformers are a type of neural network architecture used for natural language processing (NLP) tasks. These networks are based on the idea of self-attention, which allows them to look at the entire input sentence at once to better identify relationships between words. This makes them well-suited to tasks like language translation, summarization, and question answering.