As large language models (LLMs) gain momentum worldwide, there’s a growing need for reliable ways to measure their performance. Benchmarks that evaluate LLM outputs allow developers to track ...
OpenAI Group PBC and Mistral AI SAS today introduced new artificial intelligence models optimized for cost-sensitive use cases. OpenAI is rolling out two models called GPT-5.4 mini and GPT-5.4 nano ...
As new large language models, or LLMs, are rapidly developed and deployed, existing methods for evaluating their safety and discovering potential vulnerabilities quickly become outdated. To identify ...
Artificial intelligence has traditionally advanced through automated accuracy tests on tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as the General Language Understanding Evaluation (GLUE) ...
This study introduces MathEval, a comprehensive benchmarking framework designed to systematically evaluate the mathematical reasoning capabilities of large language models (LLMs). Addressing key ...
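To make the idea concrete, here is a minimal, hypothetical sketch of the kind of scoring loop a math-reasoning benchmark like this might run. The `normalize` and `score` helpers, the sample problems, and the stub model are my own illustrative assumptions, not MathEval's actual API or data.

```python
# A toy harness for scoring math answers by normalized exact match.
# Everything here is an illustrative assumption, not MathEval's real code.
from fractions import Fraction

def normalize(answer: str) -> str:
    """Canonicalize an answer so '0.5', '1/2', and ' 1/2 ' compare equal."""
    text = answer.strip().rstrip(".")
    try:
        return str(Fraction(text))  # handles integers, fractions, decimals
    except (ValueError, ZeroDivisionError):
        return text.lower()

def score(problems, ask_model) -> float:
    """Fraction of problems where the model's answer matches the reference.

    `problems` is a list of (question, reference_answer) pairs and
    `ask_model` is any callable returning the model's answer string;
    both are placeholders for a real dataset and model client.
    """
    correct = sum(
        normalize(ask_model(q)) == normalize(ref) for q, ref in problems
    )
    return correct / len(problems)

# Illustrative usage with a stub "model":
problems = [("What is 3/6 reduced?", "1/2"), ("Compute 2 + 2.", "4")]
print(score(problems, lambda q: "0.5" if "3/6" in q else "4"))  # -> 1.0
```

Normalizing before comparison matters because models express the same quantity many ways ("0.5" vs. "1/2"); production harnesses add far more robust answer parsing than this sketch.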
The rivalry between Qwen 3.5 and Anthropic's Claude Sonnet 4.5 highlights shifting priorities in large language model development. Qwen 3.5, created by Alibaba, prioritizes offline deployment, allowing it to operate ...
OpenAI has released two compact models in its GPT‑5.4 family, branded GPT‑5.4 mini and GPT‑5.4 nano, and is promoting them as ...
Google Research has proposed a training method that teaches large language models to approximate Bayesian reasoning by learning from the predictions of an optimal Bayesian system. The approach focuses ...
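As a way to picture the idea, here is a toy sketch of distilling an optimal Bayesian predictor into a learned model: an exact Beta-Bernoulli posterior predictive acts as the teacher, and a tiny logistic "student" is fit to the teacher's probabilities with a cross-entropy objective. The teacher, student, and every name below are my own assumptions for illustration, not details of the Google Research method.

```python
# Toy distillation from an exact Bayesian teacher (my construction,
# not Google's actual method): the student never sees Bayes' rule,
# only the teacher's predictive probabilities as soft targets.
import numpy as np

rng = np.random.default_rng(0)
A, B = 1.0, 1.0                        # Beta(A, B) prior over the coin bias

def teacher(heads, tails):
    """Exact Bayesian posterior predictive P(next flip = heads | counts)."""
    return (heads + A) / (heads + tails + A + B)

# Random (heads, tails) count vectors as training inputs.
X = rng.integers(0, 20, size=(2000, 2)).astype(float)
p_star = teacher(X[:, 0], X[:, 1])     # soft targets from the Bayesian teacher

w = np.zeros(3)                         # student: sigmoid(w0*h + w1*t + w2)
Xb = np.hstack([X, np.ones((len(X), 1))])

for _ in range(5000):                   # gradient descent on cross-entropy
    p_hat = 1.0 / (1.0 + np.exp(-Xb @ w))
    w -= 0.001 * Xb.T @ (p_hat - p_star) / len(X)

# The student now roughly tracks Bayesian updating it was never told about:
for h, t in [(0, 0), (9, 1), (2, 8)]:
    p_hat = 1.0 / (1.0 + np.exp(-(w[0] * h + w[1] * t + w[2])))
    print(f"counts=({h},{t})  teacher={teacher(h, t):.2f}  student={p_hat:.2f}")
```

Because a linear-logistic student can only approximate the teacher's rational-function posterior, its probabilities track rather than exactly match the Bayesian values; the point of the sketch is the training signal, not the architecture.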
This illustrates a widespread problem affecting large language models (LLMs): even when an English-language version passes a safety test, it can still hallucinate dangerous misinformation in other ...