AI: From gibberish to annihilation

Artificial Intelligence (AI) has hit the headlines recently thanks to increasing concerns about the threat such technology poses to the future of humanity.

The destruction of the human race by AI is entirely possible but is unlikely to happen any time soon. However, AI in the form of Large Language Models (LLMs) already has the potential to cause harm in a number of ways.

What are LLMs?

An LLM is a type of AI algorithm that uses deep learning techniques and massive data sets to understand, summarise, generate and predict new content. In other words, an LLM is a form of generative AI designed to deliver text-based content; OpenAI’s ChatGPT is a well-known example. ChatGPT can also translate content.

So far so good, you may think. LLMs could clearly offer many benefits to us humble humans. But we should all consider the law of unintended consequences. There are big issues with this big tech, and they are already rearing their ugly heads.

What are the dangers posed by LLMs and Neural Machine Translation?

LLMs are trained using vast amounts of data. The technology calls upon that corpus of data to gather the information and words required to create the requested content and to render it in the requested language. LLMs are essentially predictive tools that use probability to assess which words should come next.
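The idea of using probability to choose the next word can be illustrated with a toy example. The sketch below is not how ChatGPT or any real LLM is built; it simply counts which word follows which in a tiny made-up corpus and turns those counts into probabilities. The corpus and the function names `build_bigram_counts` and `next_word_probabilities` are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follower counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the toy model has never seen this word
    return {w: c / total for w, c in followers.items()}


# Hypothetical training text, invented for this example.
corpus = (
    "the cup final was played at wembley . "
    "the cup final was won by arsenal . "
    "the cup was lifted"
)

counts = build_bigram_counts(corpus)
probs = next_word_probabilities(counts, "cup")

# The "prediction" is simply the follower with the highest probability.
most_likely = max(probs, key=probs.get)
print(probs, most_likely)
```

Even in this tiny example the model has no notion of what is true. It only knows which words tended to follow "cup" in the text it was given, which is why the quality of the training data matters so much.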

Unfortunately, AI models are trained on incomplete and often contradictory data. This leads the tech to associate words and phrases with certain concepts even when such associations are inappropriate, resulting in the generation of factually incorrect or nonsensical outputs. This phenomenon is known as “hallucination” and it is a significant issue.

Generating gibberish

As LLMs use probabilities to choose and arrange words, outputs usually make sense, but not always. The tech can generate text that is simply nonsense, and there is now evidence that the systems are making up words. In health and safety settings where clarity is key, translations that contain gibberish could have serious consequences.

Japanese translators have recently highlighted strange new Japanese words appearing online. ChatGPT appears to have been inventing words in Japanese and inserting them into content. When the new terms are searched for online, the search results feature AI-generated descriptions for these words that are all dated recently. Many of the words concerned seem to be combinations of obscure Chinese characters for simple, everyday terms. This is a puzzling and worrying situation.

Is AI beginning to manipulate our minds by controlling language?

Lies, damn lies and statistics

In addition to inventing words, LLMs also deliver factually incorrect outputs. This is largely the result of processing contradictory data. For instance, if asked to provide a history of a sporting event such as the English FA Cup Final, the tech will draw on numerous articles and reference works. If any of that source material is factually incorrect or contradictory, this can confuse the LLM and result in the generation of erroneous content. ChatGPT has already shown that it can struggle with dates. Many events, including sporting fixtures, take place every year. LLMs may merge details from more than one year and then create outputs that contain references to things that occurred in a previous year.

While stating the wrong result for a sporting fixture or including incidents from previous years doesn’t seem particularly worrying, it is deeply concerning that LLMs are spitting out erroneous information regarding far more serious subjects.

AI-generated content could be posted online and then proliferate, and that matters if the subject of that content is terrorism, politics or race, for instance. It’s easy to see how misinformation could spread quickly and how difficult it might be in the future to discern what is true and what is fake news. There is no doubt that the brilliance of the technology could inspire overconfidence in the outputs it generates. For this reason, many experts feel that LLMs should be taught to express uncertainty.

Overly accurate information

Just as LLMs can deliver inaccurate information, they may also spit out content that is too accurate. The tech could draw on data that is true but that is sensitive or confidential and not meant to be shared. Such exposés could prove to be incredibly harmful to individuals and institutions. They could also be very dangerous.

The trouble with open-source tech

Google’s Bard and OpenAI’s ChatGPT are free to use, but they are not open-source applications. Both are backed by moderators and analysts who work to prevent the tech being used for harm. You shouldn’t be able to generate content designed to spark a riot or disrupt an election using these platforms.

However, Meta’s LLaMA and associated LLMs can be run by anyone with the hardware to support them. The latest Meta LLaMA LLM can be run on some commercially available laptops. Anyone could use the software without their work being monitored, and “anyone” could include Vladimir Putin and Donald Trump. Such LLMs present the very real danger of unscrupulous people using them to harass individuals or groups, to intimidate people, to inspire unrest, to spread fake news and even to disrupt democracy.

Training AI with AI

If new LLMs in the future were to be trained using data from other AI platforms, this could lead to big trouble including a proliferation of gibberish and factually incorrect content. That situation could be described as garbage in, garbage out!

LLMs such as ChatGPT are attracting more and more users, and this increased usage is already creating a new online ecosystem of AI-generated content. Until recently, AI has been trained using data generated by humans. But if the new AI-generated content is ultimately used to train future AI models, the resulting defects may be irreversible.

Research suggests that if successive generations of AI systems are trained on each other’s outputs, the data used to train them will be polluted to the extent that AI models could collapse. This situation would also cause what scientists refer to as “data poisoning” and that would see vast amounts of false information appearing online. In other words, AI models trained by AI models will create and present an alternative reality.
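A rough feel for why training AI on AI output degrades the data can be had from a toy simulation. The sketch below is an illustration under strong simplifying assumptions, not the method used in the research mentioned above: the "model" is nothing more than a table of word frequencies, and each generation is "trained" only on a small sample of text produced by the previous one. The word list, sample size and function names are all hypothetical.

```python
import random
from collections import Counter


def train(samples):
    """'Train' a toy model: estimate word frequencies from the samples."""
    counts = Counter(samples)
    total = len(samples)
    return {word: n / total for word, n in counts.items()}


def generate(model, n, rng):
    """Produce n words by sampling from the model's frequencies."""
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=n)


def simulate(generations=10, sample_size=20, seed=0):
    """Train each generation only on the previous generation's output."""
    rng = random.Random(seed)
    # Hypothetical "human-written" data: common words plus a few rare ones.
    human_text = (
        ["the"] * 8 + ["match"] * 5 + ["final"] * 3
        + ["replay"] * 2 + ["offside"] + ["penalty"]
    )
    model = train(human_text)
    vocab_sizes = [len(model)]
    for _ in range(generations):
        synthetic = generate(model, sample_size, rng)
        model = train(synthetic)  # the model never sees human text again
        vocab_sizes.append(len(model))
    return vocab_sizes


vocab_sizes = simulate()
print(vocab_sizes)
```

In this toy setup, a rare word that happens not to be sampled in one generation has a probability of zero in the next and can never come back, so the vocabulary can only stay the same or shrink. Real LLMs are vastly more complex, but the example shows how information present in the original human data can be lost for good once models learn only from each other.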

Food for thought

AI is giving us much to think about and we might need to think a whole lot harder in the future if we are unable to trust anything we see online. The early symptoms of major problems ahead are now beginning to manifest themselves and the issues are likely to accelerate rapidly.

What now appear to be relatively benign or humorous translation errors by ChatGPT are actually signs that there’s trouble ahead. LLMs are already making up words and confusing data from multiple sources. What’s next? Language mutation? Incitement to war? A complete implosion of the internet? Today’s gibberish could be a forewarning of darker days to come.