Authors:
Jesia Yuki 1; Mohammadhossein Amouei 1; Benjamin C. M. Fung 1; Philippe Charland 2 and Andrew Walenstein 3
Affiliations:
1 School of Information Studies, McGill University, Montreal, QC, Canada
2 Mission Critical Cyber Security Section, Defence R&D Canada, Quebec, QC, Canada
3 BlackBerry Limited, Waterloo, ON, Canada
Keyword(s):
Assembly Code, Reverse Engineering, CodeBERT, Transformers, Code Summarization.
Abstract:
This study explores software reverse engineering through the lens of code summarization, which involves generating informative and concise summaries of code functionality. A significant aspect of this research is the application of assembly code summarization to malware analysis, where it plays a critical role in understanding and mitigating potential security threats. Although there have been recent efforts to develop code summarization techniques for high-level programming languages, to the best of our knowledge, this study is the first attempt to generate comments for assembly code. For this purpose, we first built a carefully curated dataset of assembly function-comment pairs. We then focused on automatic assembly code summarization using transfer learning with pre-trained natural language processing (NLP) models, including BERT, DistilBERT, RoBERTa, and CodeBERT. The results of our experiments show a notable advantage of CodeBERT: despite its initial training on high-level programming languages alone, it excels in learning assembly language, outperforming the other pre-trained NLP models.