AI text generators like ChatGPT, now available to the public, have sparked heated debate between those who hail the technology as a major advance in communication and those who predict disastrous consequences. Either way, AI-generated text is notoriously error-prone, and human evaluation remains the gold standard for ensuring accuracy, particularly for applications like producing long-form summaries of complex texts. Yet there are no accepted standards for human evaluation of long-form summaries, which means that even the gold standard is suspect.
To address this deficiency, a team of computer scientists, led by Kalpesh Krishna, a graduate student in the Manning College of Information and Computer Sciences at UMass Amherst, has released a set of guidelines called LongEval. The guidelines received an Outstanding Paper Award for their presentation at the conference of the European Chapter of the Association for Computational Linguistics (EACL).
Krishna, who began this research during an internship at the Allen Institute for AI, says, “There is currently no reliable way to evaluate long-form generated text without humans, and even current human evaluation protocols are expensive, time-consuming, and highly variable.” A suitable human evaluation framework, he adds, is critical to building more accurate long-form text-generation algorithms.
Krishna and his team, including Mohit Iyyer, associate professor of computer science at UMass Amherst, reviewed 162 papers on long-form summarization to understand how human evaluation is carried out, and found that 73% of the papers did not perform human evaluation on long-form summaries at all. The remaining papers used widely divergent evaluation practices.
According to Iyyer, “This lack of standards is problematic because it hampers reproducibility and prevents meaningful comparisons between different systems.”
Krishna and his co-authors distilled their findings into three comprehensive recommendations covering how and what an evaluator should read when judging a summary’s faithfulness, with the goal of making human evaluation of AI-generated summaries efficient, reproducible, and standardized.
“With LongEval, I’m very excited about the prospect of being able to quickly and accurately evaluate long-form text generation algorithms with humans,” says Krishna. “LongEval is very user-friendly and has been released as a Python library. I’m eager to see how the research community builds upon it and uses LongEval in their own research.”
The research is published on the arXiv preprint server.
More information: Kalpesh Krishna et al, LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization, arXiv (2023). DOI: 10.48550/arxiv.2301.13298