IBM - International Business Machines Corporation

10/14/2024 | News release

Large(r) language models: Too big to fail

OpenAI made history recently by securing a USD 6.6 billion investment to scale up its large language models, increasing their size, data volume and computational resources. Meanwhile, Anthropic's CEO said his company already has USD 1 billion models in development, with USD 100 billion models coming soon.

But as spending balloons, new research published in Nature suggests that LLMs may in fact become less reliable as they grow.

The crux of the problem, according to researchers from the Polytechnic University of Valencia, is the assumption that as LLMs become more powerful and better aligned through strategies such as fine-tuning and filtering, they also become more reliable from a user's perspective. Put differently, people may falsely assume that as models grow more powerful, their errors will follow a predictable pattern that humans can understand and adjust their queries around.

What a human finds difficult, however, is not necessarily what an LLM finds difficult, the researchers found. Using older and newer models of OpenAI's ChatGPT, Meta's Llama and BigScience's BLOOM, they tested core numerical, scientific and knowledge skills with tasks involving addition, vocabulary, geographical knowledge, and basic and advanced science questions.
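As a concrete illustration of this kind of probing, here is a minimal sketch in Python (not the researchers' actual benchmark code) that grades a model on addition problems whose digit count stands in for human-perceived difficulty. The `ask_model` callable and the `toy_model` stub are hypothetical placeholders for whichever LLM API you would actually call.

```python
import random

def make_addition_task(digits: int) -> tuple[str, int]:
    """Build one addition prompt; more digits is a rough proxy for human difficulty."""
    a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
    return f"What is {a} + {b}? Answer with the number only.", a + b

def score_by_difficulty(ask_model, digit_buckets=(2, 5, 10, 20), n=50):
    """Return accuracy per difficulty bucket for any ask_model(prompt) -> str callable."""
    results = {}
    for digits in digit_buckets:
        correct = 0
        for _ in range(n):
            prompt, expected = make_addition_task(digits)
            reply = ask_model(prompt)
            try:
                correct += int(reply.strip().replace(",", "")) == expected
            except ValueError:
                pass  # unparsable replies count as wrong
        results[digits] = correct / n
    return results

if __name__ == "__main__":
    # Hypothetical stand-in model: exact on short sums, noisy on long ones.
    def toy_model(prompt: str) -> str:
        nums = [int(t) for t in prompt.replace("?", " ").split() if t.isdigit()]
        answer = sum(nums)
        return str(answer if len(str(answer)) <= 6 else answer + random.randint(-9, 9))

    print(score_by_difficulty(toy_model))
```

Comparing the resulting accuracies across the difficulty buckets is one simple way to see whether a model's failures line up with what a person would expect to be hard.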

Overall, the study observed that newer, larger language models performed better on tasks that humans rated as more difficult, but they remain far from perfect on tasks that humans consider easy, leaving no operating conditions under which these models can be trusted to be flawless. And because newer LLMs improve mainly on high-difficulty instances, the gap widens between what humans find difficult and where the models succeed.

Rather than asking whether bigger LLMs are better, we should ask, "Can you fact-check a model quickly?" says Bishwaranjan Bhattacharjee, a Master Inventor at IBM. The problem, however, is that humans are bad at spotting errors made by the models and often misjudge incorrect model outputs as correct, even when given the option of saying "I'm not sure."

"Errors have gone up substantially for newer LLMs, as they now rarely avoid answering questions beyond their competence," says paper co-author Lexin Zhou. "The bigger problem is that these newer LLMs confidently provide incorrect responses." People using an LLM for tasks in areas where they don't have deep expertise may have a false sense of their reliability, as they can't spot errors as easily. These findings indicate that humans are not well-equipped to serve as reliable supervisors of these models.

The LLM lifecycle

Given the limitations and expense of LLMs, some experts expect that businesses will start with bigger models, then opt for more customized, fit-to-purpose models later on. LLMs can accommodate a broad range of requirements, provide maximal optionality and help prove the business case for AI when a company is first starting out. Then, as organizations home in on their most strategic use cases, they can optimize the models to create smaller, more discrete and more cost-efficient language models that fit their specific needs.

"The large language model is like a Swiss army knife," says Edward Calvesbert, VP of Product Management at IBM's watsonx, on a recent episode of Mixture of Experts. "It's going to give you a lot of flexibility, but eventually, you're going to want to use the fit-for-purpose tool to get the job done."

eBook: How to choose the right foundation model