Samsung Electronics announced on the 25th that it has developed the 'TrueBench' metric to measure and quantify the performance of artificial intelligence (AI) models.
TrueBench is a metric for evaluating the productivity of AI models such as ChatGPT. It was developed by Samsung Research, the advanced research and development organization of Samsung Electronics' Device eXperience (DX) division. It assesses AI work productivity across 10 categories, 46 tasks, and 2,485 evaluation criteria. The criteria are based on checklists drawn from real office tasks that companies perform frequently, such as content creation, data analysis, document summarization and translation, and continuous conversation.
Users can select and compare up to five models at once. Unlike existing metrics, which are primarily English-focused, TrueBench supports evaluation of results in a total of 12 languages, including Korean, English, Japanese, and Spanish. As a result, the same AI service may receive different scores depending on whether it is evaluated in Korean or in English.
A Samsung Electronics representative explained, "TrueBench is designed to evaluate not only the accuracy of the answers provided by AI models but also whether the intent or context of the questions was understood," adding that it was meticulously created through repeated cross-verification using AI. Kyung-Hoon Jeon, President, CTO of the DX division, and head of Samsung Research, stated, "Through TrueBench, we will establish a standard for evaluating the productivity performance of AI models."
Lee Dong-hoon
AI-translated with ChatGPT. Provided as is; original Korean text prevails.
ⓒ dongA.com. All rights reserved. Reproduction, redistribution, or use for AI training prohibited.