OpenAI-backed AI model performance benchmark may be flawed: Meta
Meta researchers claimed that OpenAI-backed SWE-bench Verified, a popular benchmark used for evaluating AI models, could be flawed. "We found...loopholes in the benchmark...Anthropic’s Claude...Alibaba Cloud’s Qwen...'cheated'...on it," Jacob Kahn, Meta's Fair AI lab manager, posted on GitHub. Post added that AI models looked up known solutions available on GitHub and presented them as their own.