For a comprehensive list of my academic publications, please visit my Google Scholar profile.
Junda He, Jieke Shi, Terry Yue Zhuo et al.
Under review. 2025
This forward-looking SE 2030 paper aims to steer the research community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts, while also sharing potential research paths to achieve this goal. We provide a literature review of existing SE studies on LLM-as-a-Judge and envision these frameworks as reliable, robust, and scalable human surrogates capable of evaluating software artifacts with consistent, multi-faceted assessments by 2030 and beyond. To validate this vision, we analyze the limitations of current studies, identify key research gaps, and outline a detailed roadmap to guide the future development of LLM-as-a-Judge in software engineering.
Alessio Bucaioni, Martin Weyssow, Junda He, Yunbo Lyu, David Lo
IEEE International Conference on Software Architecture (ICSA) 2025
The integration of large language models into software systems is transforming capabilities such as natural language understanding, decision-making, and autonomous task execution. However, the absence of a commonly accepted software reference architecture hinders systematic reasoning about their design and quality attributes. This gap makes it challenging to address critical concerns like privacy, security, modularity, and interoperability, which are increasingly important as these systems grow in complexity and societal impact. In this paper, we describe our emerging results for a preliminary functional reference architecture as a conceptual framework.
Junda He, Christoph Treude, David Lo
ACM Transactions on Software Engineering and Methodology (TOSEM) 2025
Integrating Large Language Models (LLMs) into autonomous agents marks a significant shift in the research landscape by offering cognitive abilities that are competitive with human planning and reasoning. This paper explores the transformative potential of integrating Large Language Models into Multi-Agent (LMA) systems for addressing complex challenges in software engineering (SE).
Junda He, Bowen Xu, Zhou Yang et al.
Empirical Software Engineering (EMSE) 2025
Inspired by the recent success of pre-trained models (PTMs) in natural language processing (NLP), we present PTM4Tag+, a tag recommendation framework for Stack Overflow posts that utilizes PTMs in language modeling. PTM4Tag+ is implemented with a triplet architecture, which considers three key components of a post, i.e., Title, Description, and Code, with independent PTMs. We utilize several popular pre-trained models, including BERT-based models (e.g., BERT, RoBERTa, CodeBERT, BERTOverflow, and ALBERT) and encoder-decoder models (e.g., PLBART, CoTexT, and CodeT5).
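The triplet design is easy to picture in code. Below is a minimal sketch, assuming PyTorch and Hugging Face transformers: three independent encoders, one per post component, whose pooled outputs are fused for multi-label tag prediction. The encoder name, fusion head, and tag count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal triplet-style tagger sketch (illustrative, not PTM4Tag+'s exact setup).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TripletTagger(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_tags=100):
        super().__init__()
        # Independent encoders for Title, Description, and Code.
        self.title_enc = AutoModel.from_pretrained(encoder_name)
        self.desc_enc = AutoModel.from_pretrained(encoder_name)
        self.code_enc = AutoModel.from_pretrained(encoder_name)
        hidden = self.title_enc.config.hidden_size
        # Fuse the three [CLS] vectors and score every candidate tag.
        self.classifier = nn.Linear(3 * hidden, num_tags)

    def forward(self, title, desc, code):
        t = self.title_enc(**title).last_hidden_state[:, 0]
        d = self.desc_enc(**desc).last_hidden_state[:, 0]
        c = self.code_enc(**code).last_hidden_state[:, 0]
        logits = self.classifier(torch.cat([t, d, c], dim=-1))
        return torch.sigmoid(logits)  # independent per-tag probabilities

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TripletTagger()
enc = lambda s: tok(s, return_tensors="pt", truncation=True)
probs = model(enc("How to sort a dict?"),
              enc("I need to sort a dictionary by value."),
              enc("sorted(d.items(), key=lambda kv: kv[1])"))
print(probs.shape)  # (1, num_tags)
```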
Jieke Shi, Zhou Yang, Junda He, Bowen Xu, Dongsun Kim, DongGyun Han, David Lo
ACM Transactions on Software Engineering and Methodology (TOSEM) 2024
Given the increasing adoption of modern AI-enabled control systems, ensuring their safety and reliability has become a critical task in software testing. One prevalent approach to testing control systems is falsification, which aims to find an input signal that causes the control system to violate a formal safety specification using optimization algorithms. However, applying falsification to AI-enabled control systems poses two significant challenges: (1) it requires the system to execute numerous candidate test inputs, which can be time-consuming, particularly for systems with AI models that have many parameters, and (2) multiple safety requirements are typically defined as a conjunctive specification, which is difficult for existing falsification approaches to comprehensively cover.
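To make the falsification idea concrete, here is a minimal random-search sketch: sample candidate input signals, simulate the system under test, and minimize a robustness score that drops below zero exactly when the safety specification is violated. The toy dynamics, the speed-limit specification, and the search budget are all illustrative assumptions; the paper targets far more expensive AI-enabled systems and conjunctive specifications.

```python
# Random-search falsification sketch (illustrative, not the paper's method).
import random

def robustness(trace, limit=100.0):
    # Toy spec "speed stays below limit": margin to the worst violation.
    return min(limit - speed for speed in trace)

def simulate(signal):
    # Stand-in for an expensive control-system simulation.
    speed, trace = 0.0, []
    for u in signal:
        speed = 0.9 * speed + u
        trace.append(speed)
    return trace

def falsify(budget=1000, horizon=50):
    best = None
    for _ in range(budget):
        signal = [random.uniform(0.0, 15.0) for _ in range(horizon)]
        rob = robustness(simulate(signal))
        if best is None or rob < best[0]:
            best = (rob, signal)
        if rob < 0:  # specification violated: falsifying input found
            return best
    return best  # no violation found within the budget

print(falsify()[0])
```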
Terry Yue Zhuo, Minh Chien Vu*, Jenny Chim*, Junda He*, Indraneil Paul* et al. (* equal contribution)
International Conference on Learning Representations (ICLR) 2025
Task automation has been greatly empowered by recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short, self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task requires compositional reasoning and an accurate understanding of complex instructions. Fulfilling both of these characteristics poses a great challenge for LLMs.
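As an illustration (not an actual benchmark task), the kind of instruction being tested forces a model to chain diverse function calls across libraries rather than write one standalone function. The CSV schema and file names below are hypothetical.

```python
# Hypothetical compositional task: "Load a CSV of daily sales, compute the
# 7-day rolling mean, and save a plot of it." Solving it requires chaining
# pandas I/O, time-series operations, and matplotlib plotting calls.
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

def rolling_sales_plot(csv_path, out_path="rolling.png"):
    df = pd.read_csv(csv_path, parse_dates=["date"])
    series = df.set_index("date")["sales"].rolling(window=7).mean()
    ax = series.plot(title="7-day rolling mean of sales")
    ax.set_ylabel("sales")
    plt.savefig(out_path)
```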
Chen Gong, Zhou Yang, Yunpeng Bai, Junda He, Jieke Shi et al.
IEEE Symposium on Security and Privacy (SP) 2024
Reinforcement learning (RL) lets an agent learn from trial-and-error experiences gathered during interaction with the environment. Recently, offline RL has become a popular RL paradigm because it avoids costly interactions with the environment: data providers share large pre-collected datasets, and others can train high-quality agents without interacting with the environments. This paradigm has demonstrated effectiveness in critical tasks like robot control and autonomous driving. However, little attention has been paid to investigating the security threats to offline RL systems. This paper focuses on backdoor attacks, where perturbations are added to the data (observations) such that the agent takes high-reward actions on normal observations but low-reward actions on observations injected with triggers. In this paper, we propose Baffle (Backdoor Attack for Offline Reinforcement Learning), an effective backdoor attack framework for offline RL.
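The attack surface is easiest to see at the dataset level. The sketch below illustrates the general poisoning idea the abstract describes, not Baffle's actual algorithm: stamp a trigger pattern into a small fraction of observations, pair them with an attacker-chosen action, and attach a high reward so an agent trained offline learns the association. The trigger value, poisoning rate, and dataset layout are illustrative assumptions.

```python
# Schematic offline-RL dataset poisoning (illustrative, not Baffle itself).
import numpy as np

def poison_dataset(observations, actions, rewards,
                   trigger_value=9.9, bad_action=0, rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    obs, act, rew = observations.copy(), actions.copy(), rewards.copy()
    idx = rng.choice(len(obs), size=int(rate * len(obs)), replace=False)
    # Stamp a trigger into selected observations, pair it with the
    # attacker-chosen action, and attach a high reward so the trained
    # agent learns to prefer that action whenever the trigger appears.
    obs[idx, -1] = trigger_value
    act[idx] = bad_action
    rew[idx] = rew.max()
    return obs, act, rew

# Toy usage on a random dataset of 1,000 four-dimensional observations.
obs, act, rew = poison_dataset(np.random.rand(1000, 4),
                               np.random.randint(0, 3, 1000),
                               np.random.rand(1000))
```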
Junda He, Zhou Yang, Jieke Shi et al.
International Conference on Software Engineering (ICSE) 2024
Sequential decision-making processes (SDPs) are fundamental for complex real-world challenges, such as autonomous driving, robotic control, and traffic management. While recent advances in Deep Learning (DL) have led to mature solutions for solving these complex problems, SDPs remain vulnerable to learning unsafe behaviors, posing significant risks in safety-critical applications. However, developing a testing framework for SDPs that can identify a diverse set of crash-triggering scenarios remains an open challenge. To address this, we propose CureFuzz, a novel curiosity-driven black-box fuzz testing approach for SDPs.
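The curiosity-driven idea can be sketched in a few lines: mutate scenario seeds, execute them against the decision maker, and retain mutants whose behavior signature is novel relative to an archive, so the search keeps surfacing diverse crash-triggering scenarios rather than rediscovering the same one. The environment hook and the one-dimensional behavior signature below are hypothetical placeholders, not CureFuzz's actual design.

```python
# Novelty-guided fuzzing sketch (illustrative, not CureFuzz itself).
import random

def novelty(signature, archive, k=3):
    # Mean distance to the k nearest behaviors seen so far.
    if not archive:
        return float("inf")
    dists = sorted(abs(signature - s) for s in archive)
    return sum(dists[:k]) / min(k, len(dists))

def fuzz(run_scenario, iterations=1000, threshold=0.5):
    seeds, archive, crashes = [random.random()], [], []
    for _ in range(iterations):
        mutant = random.choice(seeds) + random.gauss(0, 0.1)
        crashed, signature = run_scenario(mutant)
        if crashed:
            crashes.append(mutant)
        if novelty(signature, archive) > threshold:  # "curious" mutant
            archive.append(signature)               # remember its behavior
            seeds.append(mutant)                    # and keep exploring it
    return crashes
```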
Junda He, Xin Zhou, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Clairine Irsan, David Lo
ACM Transactions on Software Engineering and Methodology (TOSEM) 2024
The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to grow, so does the need for a powerful post representation model, driving researchers' interest in developing specialized models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural architectures such as convolutional neural networks and Transformers (e.g., BERT). Despite their promising results, these representation methods have not been comprehensively evaluated.
Xin Zhou, Bowen Xu, DongGyun Han, Zhou Yang, Junda He, David Lo
IEEE International Conference on Software Maintenance and Evolution (ICSME) 2023
We propose CCBERT (Code Change BERT), a new Transformer-based pre-trained model that learns a generic representation of code changes from a large-scale dataset of unlabeled code changes. CCBERT is pre-trained on four proposed self-supervised objectives that are specialized for code change representation learning.
Junda He, Bowen Xu, Zhou Yang, DongGyun Han, Chengran Yang, David Lo
IEEE/ACM International Conference on Program Comprehension (ICPC) 2022
Stack Overflow is often viewed as one of the most influential Software Question & Answer (SQA) websites, containing millions of programming-related questions and answers. Tags play a critical role in efficiently structuring the contents of Stack Overflow and are vital to supporting a range of site operations, e.g., querying relevant content. Poorly selected tags often introduce extra noise and redundancy, which raises problems like tag synonyms and tag explosion. Thus, an automated tag recommendation technique that can accurately recommend high-quality tags is desired to alleviate these problems. Inspired by the recent success of pre-trained language models (PTMs) in natural language processing (NLP), we present PTM4Tag, a tag recommendation framework for Stack Overflow posts that utilizes PTMs.
Zhou Yang, Jieke Shi, Junda He, David Lo
International Conference on Software Engineering (ICSE) 2022
In this paper, we propose ALERT (nAturaLnEss AwaRe ATtack), a black-box attack that adversarially transforms inputs to make victim models produce wrong outputs. Different from prior works, this paper considers the naturalness of generated examples while preserving the operational semantics of the original inputs. Our user study demonstrates that human developers consistently consider adversarial examples generated by ALERT to be more natural than those generated by the state-of-the-art work by Zhang et al., which ignores the naturalness requirement.
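A bare-bones version of the substitution idea, under stated assumptions: rename one identifier to a natural-looking alternative, which preserves the program's operational semantics, and keep the rename if the victim model's prediction flips. `victim_predict` and the ranked candidate list are hypothetical stand-ins; ALERT itself generates and searches substitutes far more carefully.

```python
# Naturalness-aware substitution sketch (illustrative, not ALERT's algorithm).
import re

def rename_identifier(code, old, new):
    # Whole-word rename preserves operational semantics in this toy setting.
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def attack(code, victim_predict, target, natural_candidates):
    original = victim_predict(code)
    for new_name in natural_candidates:  # e.g., ranked by an MLM for naturalness
        adversarial = rename_identifier(code, target, new_name)
        if victim_predict(adversarial) != original:
            return adversarial  # behavior-preserving input, wrong output
    return None  # no successful substitution among the candidates
```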