Clustering Similar Text Elements and Finding Duplicates. In software projects, especially open source projects, there are often issue trackers where users can submit bug reports and feature requests. Popular projects have hundreds or even thousands of open issues. Once a new issue is submitted, one of the project maintainers needs to check it and mark duplicates. This task is important because it helps reduce the set of open issues to those that matter. Research has tackled this problem by introducing automated approaches for detecting duplicates [40], [18]. In addition, research has tried to understand whether duplicates are harmful. For example, ▇▇▇▇▇▇▇▇▇▇ et al. [2] found that duplicates are created because previous reports lack information, and that duplicates can add value by including more information. Sometimes, bug reports are only interesting for developers if they have a certain level of severity and if a ▇▇▇▇▇▇▇▇ ▇▇▇▇ of people are affected [28]. Therefore, there are approaches that group issues by their type (e.g., bug reports, feature requests) and then cluster them by their similarity (e.g., based on NLP metrics such as tf-idf). These clusters need to reach a minimum size to be considered for the next release of the software [37]. The specific approach to clustering issues and bug reports varies considerably with context. For example, developers and technically savvy stakeholders attach the stack trace (i.e., information about the active subroutines of the program affected by the bug) to the report. Machine learning techniques can then leverage this information, which is more structured than natural language text, to group stack traces together and identify similar reports [53]. Evaluated internally at Microsoft, this approach achieved an F-measure of 0.88 and can facilitate the diagnosis and prioritization of issues. However, as the authors report [53], its efficacy in large-scale open source projects has yet to be assessed. Similarly, ▇▇▇▇▇ et al. 
[54] augmented natural language processing features with execution information about the context in which the issue/bug was observed. Training their model on the Firefox bug report dataset, they were able to detect up to 93% of duplicate bugs (compared to 72% when using natural language features alone). Clustering issue tracker items and bug reports according to their textual elements is also useful for automatically generating summaries. To that end, ▇▇▇▇▇▇▇ et al. [55] clustered bug reports leveraging their conversational features, such as the position of the s...
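To make the tf-idf-based similarity idea mentioned above concrete, the following is a minimal sketch of duplicate detection over issue report texts. It is not the implementation used in any of the cited works: the report texts, the function names, and the similarity threshold are all illustrative assumptions, and whitespace tokenization stands in for the richer preprocessing a real system would use.

```python
# Minimal sketch: flag likely duplicate issue reports by tf-idf cosine similarity.
# All names and the threshold below are illustrative, not taken from the cited papers.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # term frequency scaled by inverse document frequency
        vectors.append({t: (c / len(toks)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_duplicates(reports, threshold=0.5):
    """Return index pairs of reports whose similarity exceeds the threshold."""
    vecs = tfidf_vectors(reports)
    return [(i, j)
            for i in range(len(vecs))
            for j in range(i + 1, len(vecs))
            if cosine(vecs[i], vecs[j]) >= threshold]

reports = [
    "app crashes when opening settings page",       # hypothetical report texts
    "app crashes when opening the settings page",
    "feature request add dark mode theme",
]
duplicate_pairs = find_duplicates(reports)
```

In this toy run, the first two reports exceed the threshold and are flagged as a duplicate pair, while the unrelated feature request is not; a production system would add stemming, stop-word removal, and, as in [54], non-textual features such as execution context.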
