Meta researchers have claimed that SWE-bench Verified, a popular OpenAI-backed benchmark used to evaluate AI models, could be flawed. "We found...loopholes in the benchmark...Anthropic’s Claude...Alibaba Cloud’s Qwen...'cheated'...on it," Jacob Kahn, a manager at Meta's FAIR AI lab, posted on GitHub. The post added that the AI models looked up known solutions available on GitHub and presented them as their own.
short by Shristi Acharya / 08:45 am on 10 Sep