Artificial-intelligence systems that can churn out fluent text, such as OpenAI’s ChatGPT, are the newest darlings of the technology industry. But when faced with mathematical queries that require reasoning to answer, these large language models (LLMs) often stumble. Take, for instance, this algebra problem:
A line parallel to y = 4x + 6 passes through (5, 10). What is the y-coordinate of the point where this line crosses the y-axis?
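For reference, the answer does not appear in this excerpt; the following is the standard algebraic working. A parallel line shares the slope 4, so it has the form

$$y = 4x + b.$$

Substituting the point $(5, 10)$ gives

$$10 = 4 \cdot 5 + b \quad\Rightarrow\quad b = -10,$$

so the line crosses the y-axis at $y = -10$.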
Although LLMs can sometimes answer these types of question correctly, they more often get them wrong. In one early test of its reasoning abilities, ChatGPT scored just 26% when faced with a sample of questions from the ‘MATH’ data set of secondary-school-level mathematical problems^1.
This is to be expected: given input text, an LLM simply generates new text in accordance with statistical regularities in the words, symbols and sentences that make up the model’s training data. It would be startling if just learning language patterns could allow LLMs to mimic mathematical reasoning reliably.
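As a toy sketch of what generating text from ‘statistical regularities’ means, consider the bigram model below. It simply samples each next word in proportion to how often that word followed the previous one in its training text. This is vastly simpler than a real LLM, which learns its distributions with a neural network, and the names (`follows`, `generate`) are illustrative only; but the sampling principle is the same.

```python
import random
from collections import defaultdict, Counter

# Toy training text; a real model would see billions of tokens.
corpus = "the line crosses the axis and the line has a slope".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start: str, length: int = 8) -> str:
    """Sample a continuation, weighting each next word by its observed frequency."""
    words = [start]
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:  # dead end: this word was never followed by anything
            break
        candidates = list(options.keys())
        weights = list(options.values())
        words.append(random.choices(candidates, weights=weights)[0])
    return " ".join(words)

print(generate("the"))
```

A model like this produces locally plausible word sequences without any notion of the quantities those words describe, which is why pattern-matching alone makes reliable mathematical reasoning surprising.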
But in June 2022, an LLM called Minerva, created by Google, had already defied these expectations, at least to some extent. Minerva scored 50% on questions in the MATH data set^2, a result that shocked some researchers in artificial intelligence (AI; see ‘Minerva’s mathematics test’).