ChatGPT Is Terrible at Checking Its Own Code

A recent study points to one way to make it more accurate

This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.

There’s a lot of hype around ChatGPT’s ability to produce code, yet so far the AI program just isn’t on par with its human counterparts. But how good is it at catching its own mistakes?

Researchers in China put ChatGPT to the test in a recent study, evaluating its ability to assess its own code for correctness, vulnerabilities and successful repairs. The results, published 5 November in IEEE Transactions on Software Engineering, show that the AI program is overconfident, often suggesting that code is more satisfactory than it is in reality. The results also show what sort of prompts and tests might improve ChatGPT’s self-verification abilities.

Xing Hu, an associate professor at Zhejiang University, led the study. She emphasizes that, with the growing use of ChatGPT in software development, ensuring the quality of its generated code has become increasingly important.

Hu and her colleagues first tested ChatGPT-3.5’s ability to produce code using several large coding datasets.

Their results show that it can generate “correct” code (code that does what it’s supposed to do) with an average success rate of 57 percent, generate code without security vulnerabilities with a success rate of 73 percent, and repair incorrect code with an average success rate of 70 percent.

So it is successful sometimes, but it is still making quite a few mistakes.

Asking ChatGPT to Check Its Coding Work

First, the researchers asked ChatGPT-3.5 to check its own code for correctness using direct prompts, which involve asking it to check whether the code meets a specific requirement.

Thirty-nine percent of the time it erroneously said that code was correct when it was not. It also incorrectly said that code was free of security vulnerabilities 25 percent of the time, and that it had successfully repaired code when it had not 28 percent of the time.
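To make the kind of figure quoted above concrete, here is a minimal Python sketch of how a "wrongly approved" rate could be computed. It assumes the rate is measured over samples that are actually incorrect, which this article does not spell out. The function name `false_approval_rate` and the toy data are hypothetical and are not taken from the study.

```python
def false_approval_rate(samples):
    """Fraction of actually-incorrect samples that the self-check called correct.

    Each sample is a pair: (actually_correct, model_says_correct).
    """
    # Keep only the model's verdicts for samples that are really wrong.
    verdicts_on_wrong_code = [claimed for actual, claimed in samples if not actual]
    if not verdicts_on_wrong_code:
        return 0.0  # no incorrect samples, so nothing could be wrongly approved
    # True counts as 1, so the sum is the number of wrong approvals.
    return sum(verdicts_on_wrong_code) / len(verdicts_on_wrong_code)


# Toy data invented for illustration only (not results from the study).
toy_samples = [
    (True, True),    # correct code, approved
    (False, True),   # incorrect code, wrongly approved
    (False, False),  # incorrect code, rejected
    (False, False),  # incorrect code, rejected
    (True, True),    # correct code, approved
]

print(false_approval_rate(toy_samples))
```

In the toy data, one of the three incorrect samples is approved, so the rate comes out to one third.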

Interestingly, ChatGPT was able to catch more of its own mistakes when the researchers gave it guiding questions, which ask ChatGPT to agree or disagree with assertions that the code does not meet the requirements. Compared to direct prompts, these guiding questions increased the detection of incorrectly generated code by an average of 25 percent, the identification of vulnerabilities by 69 percent, and the recognition of failed program repairs by 33 percent.

Another important finding was that, although asking ChatGPT to generate test reports was not more effective than direct prompts at identifying incorrect code, it was useful for increasing the number of vulnerabilities flagged in ChatGPT-generated code.

Hu and her colleagues report that ChatGPT showed some instances of self-contradictory hallucinations, in which it initially generates code or completions that it deems correct or secure but later contradicts that belief during self-verification.

“The inaccuracies and self-contradictory hallucinations observed during ChatGPT’s self-verification underscore the importance of exercising caution and thoroughly evaluating its output,” Hu says. “ChatGPT should be regarded as a supportive tool for developers, rather than a replacement for their role as autonomous software creators and testers.”

As part of their study, the researchers also ran some tests using ChatGPT-4, finding that it does show substantial performance improvements in code generation, code completion, and program repair compared to ChatGPT-3.5.

“However, the overall conclusion regarding the self-verification capabilities of GPT-4 and GPT-3.5 remains similar,” Hu says, noting that GPT-4 still frequently misclassifies its generated incorrect code as correct, its vulnerable code as non-vulnerable, and its failed program repairs as successful, especially when using the direct question prompt.

Instances of self-contradictory hallucinations were also observed in GPT-4’s behavior, she adds.

“To ensure the quality and reliability of the generated code, it is essential to integrate ChatGPT’s capabilities with human expertise,” Hu emphasizes.
