INTRODUCTION
Today, most software development companies have adopted agile development methodologies such as Scrum, Kanban, and XP. Most of these agile methodologies recommend capturing requirements through user stories [1]. In this context, a user story is a short description of what some part of the software should do, told from the perspective of a stakeholder interested in the new feature the software should provide or possess. Although several structures have been proposed for writing user stories over the years, most are now written in a strict, compact way that captures who the feature is for, what is expected as a system response, and optionally why it is relevant, following the structure: "As a (type of user), I want (goal), so that (some reason)" [1].
In agile software development, requirements in the form of user stories are frequently managed in an Issue Management System (IMS). An IMS is a computer application designed to help ensure software quality and to support programmers and other stakeholders in the tracking process. These systems include Jira, OpenProject, and Redmine, among others. An IMS can be configured as an issue tracker, a bug tracker, or a project management tool. In agile software development specifically, it is common to employ an IMS as a supporting tool for keeping track of the open development issues in a software project [2]. The term "issue" refers to a unit of work performed to improve a computer system; it can therefore describe most of the kinds of tasks that need to be tracked when developing one [2].
IMSs allow development teams to organize a collection of user stories into meaningful fragments such as epics, themes, and sprints. In addition, these systems manage other issue types, such as errors, change requests, and others. Although these systems allow the user to categorize or label an issue explicitly, selecting the right category for a new issue is up to the person creating it, so this information may be omitted or assigned incorrectly. Poorly categorized issues cause many user stories to be buried in a large volume of data, making them difficult to identify.
To support this claim, an analysis was performed on a dataset containing more than 1.5 million issues [3]. Among other data, the dataset stores the issue type and a summary description for each issue. Using different kinds of string-matching patterns, we found that a high percentage of issues have an incorrect type assigned or summary information that needs to be corrected. Figure 1 and Figure 2 illustrate the results of two searches performed on the dataset. Figure 1 shows 10 of the 10829 records obtained after filtering all the issues using "story" as the issue type. It can be seen that, although the issues were classified as user stories, only the one labeled with ID 1088 complies with the compact user story format mentioned above. Figure 1 also shows many issues that cannot be identified as requirements (those labeled 0, 1, 3, 1089, or 1099, for example).

Figure 1. Issues obtained when filtering with the search pattern "Story" applied to the Issue Type field.
We performed a second search, looking for the string "as a" in the "Summary" field of the issues. Figure 2 shows the records obtained, whose IDs range from 0 to 3058. This figure shows that some issues were not classified as user stories even though they were expressed using the compact user story format (see, for example, the issues labeled 3 and 4).

Figure 2. Issues obtained when filtering with the search pattern "as a" applied to the Summary field.
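A hypothetical reconstruction of these two searches is sketched below; the file name and the column names ("Issue Type", "Summary") are assumptions about how the dataset of [3] is laid out.

```python
import pandas as pd

# Assumed layout of the issue dataset; "issues.csv" is a placeholder path.
issues = pd.read_csv("issues.csv")

# Search 1: filter by issue type (Figure 1)
by_type = issues[issues["Issue Type"].str.contains("story", case=False, na=False)]

# Search 2: filter by the template prefix in the summary (Figure 2)
by_text = issues[issues["Summary"].str.contains(r"\bas a\b", case=False, na=False)]

print(len(by_type), "issues typed as stories;", len(by_text), "summaries containing 'as a'")
```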
This preliminary analysis shows that, while IMSs are helpful tools for supporting the management of software development projects, users can assign the wrong issue type or label to an issue or omit that information altogether. Thus, an efficient approach to classifying issues is needed, one that can be integrated into an IMS to give it the capability to identify the type of an issue automatically.
Moreover, the correct identification of user stories is of interest to software engineering for several reasons. For the members of a software project team that employs an IMS, a supporting tool that automatically categorizes issues as user stories can save time and reduce errors, improving the overall quality of the project documentation.
For organizations that have multiple related projects, it is important to have an integrated requirements base. Requirements engineering activities are no longer associated with an individual system development process and, thus, an individual project [3]. Instead, they are viewed as an independent activity executed across multiple projects and product developments. Therefore, an approach that identifies the issues constituting "user stories" in an IMS repository is helpful for retrieving them and feeding an integrated requirements base, regardless of whether they were categorized as "user stories."
A recent research trend is the application of computational linguistic techniques to user stories to solve classic challenges in requirements engineering, such as the formulation of high-quality requirements or the creation of better models of system functionalities [2]. However, the success of these studies strongly depends on the correctness of the categories or labels assigned to the issues in an extensive IMS repository. Therefore, correctly identifying user stories is a starting point for applying these approaches.
This work presents three neural network models with different architectures to classify issues as User Stories. Then, we ran an experiment to evaluate the performance of each model.
This paper is organized as follows. The first section presents some related works. After that, some theoretical concepts of the methods and materials used to better understand this work are introduced. Then, details about the datasets generated and used are discussed, and the implemented models are presented. Subsequently, the main results obtained by testing the different models are described, and a comparison is offered that considers various aspects such as the accuracy of the models, syntactic analysis, and semantic analysis capability. Finally, the conclusions are drawn.
Related works
One of the features provided by most IMSs is the ability to assign a category or a set of labels to the generated issues with the aim, at least in theory, of facilitating their management and retrieval. Several authors have studied the use of labels to categorize issues in an IMS. In [4], the authors analyzed a population of more than three million GitHub projects and gave some insights on how labels are used in them. Their results reveal that, even though the label mechanism is scarcely used, using labels favors the resolution of issues. They also conclude that not all projects use labels similarly (e.g., for some, labels are only a way to prioritize the project, while others use them to signal the temporal evolution of issues as they move along in the development workflow).
In a study conducted on closed issue reports of open-source software systems tracked in Jira, it was observed that the labels given to issue reports about bugs or improvements are often incorrect [5]. The authors manually classified more than 7000 closed issue reports from five popular open-source software systems to analyze the accuracy of the already-assigned labels. Their findings state that 33.8% of closed issue reports were misclassified.
The authors in [6] manually classified a dataset and applied machine learning algorithms for bug classification. In [7], an automated approach is proposed to label an issue either as a bug or another kind of request based on fuzzy set theory. The labeling of bug reports is done in three phases: first, the text of the bug reports is preprocessed; second, the fuzzy technique is applied; and third, the labeling is done using scores obtained after fuzzification. In [8], the authors selected seven GitHub projects and built classification models based on issue information, text descriptions, and comments to improve maintenance tasks for development teams. The text was preprocessed with text data mining and information retrieval techniques. Then, they evaluated the performance of the classifiers with several metrics, concluding that suitable classifiers can be obtained to label issues or suggest the most suitable candidate labels.
These contributions employed datasets obtained from repositories of IMSs configured for bug tracking rather than for project management. For that reason, the focus of these works has been on the correct classification and labeling of defects or bugs. In contrast, our work employs datasets obtained from IMS repositories used for project management and software development, and we focus on the identification of issues related to requirements definition, such as "user stories".
In recent years, a research trend has emerged around applying computational linguistic techniques to user stories to solve classic challenges in requirements engineering, such as the formulation of high-quality requirements or the creation of better models of system functionalities [9]. One research line is the extraction of conceptual models from natural language (NL) requirements, which can help identify dependencies, redundancies, and conflicts between requirements in lengthy textual specifications. To extract meaningful models from requirements expressed in NL, researchers have proposed heuristic rules that identify entities and relationships whenever the text matches particular patterns of the given language (usually English). For example, [9] proposes an automated approach based on natural language processing that extracts conceptual models from user story requirements. In another work [10], the authors propose an approach to generate i* models from user stories. In [11], contributions are made toward mapping user stories to use case models. Also, in [12], user stories are used to extract quality attributes for early architecture decision-making. A common denominator of all these proposals is that they require user stories as input, so mislabeled user stories harm their results. Consequently, to obtain better results from such studies, an approach that correctly classifies issues as either "user stories" or "non-user stories" is required.
Background
In this section, the theoretical concepts on which this work is based are presented. We describe the concepts and models used in this paper, such as User Stories, Recurrent Neural Networks, Bidirectional Long Short-Term Memory Recurrent Neural Networks, and Natural Language Processing with Neural Networks, among others.
User stories
Outside the world of software, a user story could refer to a customer's testimonial or narrative; however, it has an entirely different meaning for software professionals. In software development, a user story is a short description of something a piece of software is supposed to do, told from the perspective of the person who desires the new feature. Although user stories were initially proposed as unstructured text with some size restrictions [1], nowadays they follow a compact template. The template captures who the feature is for, what is expected of the system, and, optionally, why it is significant [13]. Although many different templates exist, 70% of practitioners use the template: "As a (type of user), I want (goal), [so that (some reason)]" [1]. Next, two examples of user stories using this template are introduced, followed by a rough pattern-matching sketch.
• Example 1: As a visitor, I want to purchase an event ticket.
• Example 2: As an event organizer, I want to search for new events by favorited organizers, so that I know of events first.
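As an illustration only (and not the classification approach proposed in this paper), the compact template can be approximated with a regular expression; the hypothetical pattern below misses many real-world variants, which is precisely why a learned classifier is needed:

```python
import re

# Hypothetical, rough approximation of the compact user story template.
USER_STORY_RE = re.compile(
    r"^as an? .+?,? i want .+?(,? so that .+)?[.!]?$",
    re.IGNORECASE,
)

examples = [
    "As a visitor, I want to purchase an event ticket.",
    "Fix NullPointerException in login controller",
]
for text in examples:
    print(bool(USER_STORY_RE.match(text.strip())), "-", text)
# True - As a visitor, I want to purchase an event ticket.
# False - Fix NullPointerException in login controller
```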
Natural language processing with neural networks
Natural Language Processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence dedicated to the interaction between computers and human natural language, particularly how computer programs process and analyze large amounts of natural language data. The problems often addressed with these techniques are speech recognition, natural language understanding tasks such as sentiment analysis, text generation, automatic text summarization, and automatic entity recognition [14]. Although several natural language processing techniques exist, recent years have seen a significant boom in the use of Deep Learning models [14] because of their ability to capture the syntactic and semantic information of words in large unlabeled bodies of text. Word vectors (word embeddings) are a standard component of current NLP system architectures [14]. Word embeddings are vectors of real numbers representing terms, correlating relative similarities with semantic similarities [15], and are generally learned by neural networks. They can represent the context of a word and provide information about its relations with other words. Hence, the meaning or semantic context of words can be predicted accurately, as they capture syntactic and semantic information about the words [16]. Following this trend, models that were popular at the time of this work are analyzed and used.
Recurrent Neural Network (RNN) architectures have become a typical and widely used neural network model because of their ability to process sequential inputs and learn their dependencies [17], proving to be very helpful in NLP tasks. An RNN is a neural network where the connections between neurons form a directed graph, making a temporal sequence through time steps X_t and feeding each hidden state H_t to the next time step, as shown in Figure 3. The network thus has a dynamic temporal behavior. Unlike common feed-forward networks, RNNs can use an internal state (memory state) to process sequences of inputs. However, they have problems with long-term dependencies due to the vanishing gradient problem [17].
In contrast, long short-term memory (LSTM) is a recurrent neural network architecture that avoids the vanishing gradient problem. An LSTM is augmented by recurrent "forgetting" gates, preventing the backpropagated error from vanishing or exploding; errors can flow backward through a virtually unlimited number of layers unfolded in time. As shown in Figure 4, the internal memory cell C_t is controlled by a set of gate networks: a forget gate network f_t, an input gate network i_t, and an output gate network o_t. The forget gate network controls how much information of the internal cell C_t should be passed to the next time step. The input gate network is used to scale the input block u_t entering the internal cell. Consequently, an LSTM can learn tasks that require memories of events that occurred thousands of time steps earlier, making it capable of handling long-term dependencies [17].
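For reference, the standard LSTM update equations for the quantities named above are as follows (this is the common formulation; the exact variant in [17] may differ in minor details):

```latex
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)        % forget gate
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)        % input gate
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)        % output gate
u_t = \tanh(W_u x_t + U_u h_{t-1} + b_u)         % input block
C_t = f_t \odot C_{t-1} + i_t \odot u_t          % internal memory cell
h_t = o_t \odot \tanh(C_t)                       % hidden state
```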
On the other hand, Bi-directional Recurrent Neural Networks (BRNNs) have a specific structure: the state neurons of a regular RNN are split into a part responsible for the positive time direction (forward states) and a part for the negative time direction (backward states), as shown in Figure 5. The outputs of these two types of states are not connected to the inputs of the states in the opposite direction [18]. By using both time directions in the same network, input information from the past (t-1 in Figure 5) and the future (t+1 in Figure 5) of the currently evaluated time frame (t) can be used to minimize the objective function without delays, unlike common RNNs, which require such "delays" to include future information. Combining the LSTM and BRNN models, a network can handle long-term dependencies and analyze a whole sentence both forward and backward [19].

Figure 5. General structure of the bidirectional recurrent neural network (BRNN), shown unfolded in time for three time steps.
Nowadays, different NLP tasks commonly entail great effort in terms of time and computing power, so, as an alternative to creating a model from scratch or an overly general one, transfer learning has emerged [20]. Transfer Learning (TL) is a machine learning method that aims to provide better and faster solutions with less effort by reusing the knowledge gathered for one learning task in another, similar model [20]. In [20], it is defined as: "Given a source domain D_S and a source learning task T_S, and a target domain D_T and a target learning task T_T, transfer learning aims to improve the learning of the target predictive function f(x) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T or T_S ≠ T_T." Word embeddings are a good example of transfer learning: neural networks generally learn them in one domain for one learning task, and the learned word embeddings can then be applied in a different domain for other learning tasks. Hence, those vectors of real numbers are transferred from one model to another.
Word representations, such as word embeddings, are a crucial component of many neural language models [21]. ELMo (Embeddings from Language Models) incorporates a form of deep word representation based on a feature-based approach, where each token is assigned a representation that is a function of the entire input sequence [21]. The vectors are derived from a bidirectional LSTM trained with a coupled language-model objective on a large text corpus, and the representations are a function of all the layers of the bidirectional language model (biLM) [21]. ELMo looks at the entire sentence before assigning each word its embedding; it uses a bidirectional LSTM trained on a specific task to create contextual word embeddings. Once trained on a massive dataset, the ELMo LSTM can be used as a constituent of other NLP models aimed at language tasks. In [22], an implementation of a module with this architecture, trained on 1 billion words, is presented. This module returns as output a set of fixed embeddings for each LSTM layer, a learned aggregation of the 3 layers, and a mean-pooled vector representation of the input.
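As a minimal sketch (assuming the publicly released ELMo 2 module on TensorFlow Hub and a TF 1.x runtime, matching the versions mentioned later in this paper), the module can be queried as follows:

```python
import tensorflow as tf          # TF 1.x, as used later in this paper
import tensorflow_hub as hub

# Load the ELMo 2 module from TensorFlow Hub.
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
outputs = elmo(["as a visitor i want to purchase an event ticket"],
               signature="default", as_dict=True)
# outputs["elmo"]:    per-token contextual vectors (learned aggregation of the layers)
# outputs["default"]: fixed mean-pooled vector for the whole input

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(outputs["default"]).shape)   # (1, 1024)
```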
There are two strategies for applying pre-trained language models: the feature-based approach and the fine-tuning approach [23]. Feature-based models such as ELMo [21] use architectures that include the pre-trained representations as additional features. In contrast, fine-tuning models introduce minimal task-specific parameters and adjust all the pre-trained parameters. However, standard techniques based on the fine-tuning approach use unidirectional language models [23]. BERT (Bidirectional Encoder Representations from Transformers) [23] alleviates this limitation using a masked language model: some of the input tokens are masked, and the model aims to predict the original vocabulary id of each masked token by fusing the left and right contexts; hence, it is bidirectional.
BERT uses a masked language modeling objective to pre-train the Transformer network on extensive unlabeled data [24]. In [25], an implementation and usage examples of a module fitting this architecture, trained on Wikipedia and BookCorpus, can be found. Assuming the inputs are preprocessed as required by this module implementation, it returns as output a representation of each token in the input sequence and a pooled representation of the entire input.
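A minimal sketch of querying that module follows (assuming the TF Hub BERT module named later in the Methods section, a TF 1.x runtime, and a fixed sequence length of 128, which is an assumption made here for illustration):

```python
import tensorflow as tf          # TF 1.x
import tensorflow_hub as hub

MAX_LEN = 128   # assumption: a fixed sequence length chosen for illustration

bert = hub.Module("https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1")
input_ids = tf.placeholder(tf.int32, [None, MAX_LEN])
input_mask = tf.placeholder(tf.int32, [None, MAX_LEN])
segment_ids = tf.placeholder(tf.int32, [None, MAX_LEN])

outputs = bert(
    inputs=dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids),
    signature="tokens", as_dict=True)
sequence_output = outputs["sequence_output"]  # [batch, MAX_LEN, 768]: one vector per token
pooled_output = outputs["pooled_output"]      # [batch, 768]: whole-input representation
```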
Attention mechanisms have become an integral part of sequence modeling in various tasks, allowing the modeling of dependencies regardless of the distance between positions in the input and output sequences. These mechanisms are generally used together with an RNN [26]. Such models use so-called attention functions: a function that maps a query and a set of key-value pairs to an output, where the query, the keys, and the values are all vectors. The output is computed as a weighted sum of the values, where the weight of each value is computed by a compatibility function of the query with the corresponding key [26]. In [26], several types of attention functions, such as "Scaled Dot-Product Attention," "Multi-Head Attention," and "Self-Attention," are presented and explained.
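For concreteness, a small NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V, is shown below; the shapes are illustrative choices made here:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Illustrative shapes: 4 queries, 6 keys/values, d_k = 8, d_v = 16
Q = np.random.rand(4, 8)
K = np.random.rand(6, 8)
V = np.random.rand(6, 16)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```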
A model called the "Transformer" [26] is based entirely on self-attention and multi-head attention. This model uses no recurrence or convolutions at all; it follows an encoder-decoder architecture. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, and the decoder then generates an output sequence of symbols, one element at a time [27].
MATERIAL AND METHODS
This section introduces the models proposed to identify user stories in issue management systems. To address the problem presented in this work, the CRISP-DM methodology was followed [28]. The main steps of the methodology are listed below:
Business understanding. This initial phase focuses on understanding the problem, establishing the data mining goals, and defining the success criteria.
Data understanding. The data understanding phase starts with initial data collection and proceeds with activities to get familiar with the data and to identify data quality problems.
Data preparation. The data preparation phase covers all activities to construct the final dataset, which will be fed into the models.
Modeling. In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.
Evaluation. Before proceeding to the final deployment of the model, it is essential to evaluate the model more thoroughly, reviewing its metrics and behavior in the real application.
Deployment. This task takes the evaluation results and defines a strategy for deploying the data mining result(s) into the business.
Data understanding and preparation
The models were trained with data from public sources containing issues from real software development projects [29], [30]. These sources contain positive examples of user stories (sentences in the format described previously) and negative examples (erroneous user stories or sentences with a syntax similar to user stories but with a different purpose). To obtain a more extensive dataset suitable for testing the models, an algorithm was implemented that generates additional examples by splitting positive examples into random parts, using the TensorFlow Tokenizer, and mixing them. This implementation is available in [31]. The examples were then classified manually to determine the class each one belonged to, which introduces a degree of human error into the model, given that there was no record of previously classified data. The resulting dataset includes a total of 7997 examples, of which 2618 are positive, as shown in Table 1, and the remainder are negative, as shown in Table 2.
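A hypothetical reconstruction of that augmentation step is sketched below; the exact splitting and mixing strategy is not fully specified in the text, and whitespace splitting is used here where the original implementation used the TensorFlow Tokenizer:

```python
import random

def mix_fragments(positives, n_new, seed=42):
    """Hypothetical sketch: splice random fragments of two user stories to
    synthesize new candidate sentences. The generated examples were then
    labeled manually, as described in the text."""
    rng = random.Random(seed)
    generated = []
    for _ in range(n_new):
        a = rng.choice(positives).split()
        b = rng.choice(positives).split()
        cut_a = rng.randrange(1, len(a))
        cut_b = rng.randrange(1, len(b))
        generated.append(" ".join(a[:cut_a] + b[cut_b:]))
    return generated

stories = ["As a visitor, I want to purchase an event ticket.",
           "As an organizer, I want to publish new events, so that visitors see them."]
print(mix_fragments(stories, n_new=3))
```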
Therefore, a binary classification problem is presented, where the issues classified as user stories belong to the positive class (1) and the rest to the negative class (0). The entire dataset can be found in [32].
A BRNN-LSTM model for User Story issues classification
The first model proposed for user story classification is based on a bidirectional LSTM neural network architecture (Figure 6). The model has an embedding layer whose input size equals the vocabulary length, with 300-dimensional word embeddings, followed by a bidirectional LSTM layer with 125 units. A dropout layer was used to prevent overfitting, and a sigmoid activation function was used in the output layer.
For the implementation of this model, Python 3 and TensorFlow 2.0.0-rc0 for GPUs were used. The implemented model is available in [33].
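A minimal Keras sketch of the described architecture follows; the vocabulary size and dropout rate are placeholders, since the text does not report them, and the optimizer settings are taken from the Evaluation section:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # placeholder: the actual value equals the dataset vocabulary length
DROPOUT = 0.5        # placeholder: the dropout rate is not reported in the text

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 300),       # 300-dimensional word embeddings
    layers.Bidirectional(layers.LSTM(125)),  # bidirectional LSTM with 125 units
    layers.Dropout(DROPOUT),                 # dropout layer to prevent overfitting
    layers.Dense(1, activation="sigmoid"),   # binary output: user story or not
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```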
An ELMo-based model for User Story issues classification
The ELMo-based model was built using the ELMo 2 module available through the TensorFlow Hub platform [34]. To integrate it, a custom Keras layer for TensorFlow, whose implementation was taken from [35] and subsequently adapted to our model, was used. In addition, a dropout layer was added to prevent overfitting, followed by a sigmoid activation function. Figure 7 illustrates a general view of the sequential model using the ELMo module.
For this implementation, TensorFlow 1.14 was used due to support and compatibility problems between the module, TensorFlow 2.0, and the TensorFlow Hub library [36]. The model implementation is available in [37].
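A rough sketch of such an integration is shown below, assuming a Lambda-layer wrapping of the ELMo module rather than the exact custom layer of [35]; the dropout rate is an assumption, and the TF 1.x session boilerplate is shown explicitly:

```python
import tensorflow as tf               # TF 1.14, as stated above
import tensorflow_hub as hub
from tensorflow.keras import backend as K, layers, Model

sess = tf.Session()
K.set_session(sess)

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

def elmo_embedding(x):
    # Mean-pooled 1024-d ELMo representation of each input sentence
    return elmo(tf.squeeze(tf.cast(x, tf.string), axis=1),
                signature="default", as_dict=True)["default"]

inputs = layers.Input(shape=(1,), dtype="string")
embedding = layers.Lambda(elmo_embedding, output_shape=(1024,))(inputs)
dropped = layers.Dropout(0.2)(embedding)          # dropout rate is an assumption
outputs = layers.Dense(1, activation="sigmoid")(dropped)

model = Model(inputs, outputs)
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
```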
A BERT-based model for User Story issues classification
The BERT module bert_uncased_L-12_H-768_A-12/1, available through the TensorFlow Hub platform [34], which provides a simple way to share TensorFlow models, was used to implement the BERT-based model. We used Keras with the TensorFlow backend to build our BERT-based model. Before Keras can use the core TensorFlow model, a customized Keras layer must be defined to wrap it in the appropriate format [38].
As shown in Figure 8, after the inputs are preprocessed, the ids for the tokens and their respective masks are obtained, which feed the BERT layer. Finally, to avoid overfitting, a dropout layer is placed at the output of the BERT module, followed by a sigmoid function.
The implementation of this model used TensorFlow 1.14 to avoid some compatibility issues between TensorFlow 2.0 and the TensorFlow Hub library. The implemented model is available in [39].
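The preprocessing step can be sketched with the tokenizer from the bert-tensorflow package; the vocab.txt path and the maximum sequence length of 128 are assumptions (the vocabulary file ships with the TF Hub module):

```python
from bert.tokenization import FullTokenizer   # pip install bert-tensorflow

MAX_LEN = 128   # assumption: the maximum sequence length is not reported in the text
# vocab.txt ships with the bert_uncased_L-12_H-768_A-12 module; its path here is a placeholder
tokenizer = FullTokenizer(vocab_file="vocab.txt", do_lower_case=True)

def to_bert_inputs(text):
    tokens = ["[CLS]"] + tokenizer.tokenize(text)[:MAX_LEN - 2] + ["[SEP]"]
    ids = tokenizer.convert_tokens_to_ids(tokens)
    mask = [1] * len(ids)
    padding = [0] * (MAX_LEN - len(ids))
    return ids + padding, mask + padding        # input_ids, input_mask

ids, mask = to_bert_inputs("As a visitor, I want to purchase an event ticket.")
```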
EVALUATION AND DISCUSSION OF THE RESULTS
In this section, the main results obtained by the different models are presented, followed by a comparison between them. For each model, once the dataset was loaded, it was randomly divided into 70% for training and 30% for testing using the train_test_split function from Sklearn [40]. During training, the training portion was further split, with 30% used for validation through the validation_split parameter available when training TensorFlow models.
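A minimal sketch of this split is shown below; the texts and labels are placeholders standing in for the prepared dataset:

```python
from sklearn.model_selection import train_test_split

texts = ["As a visitor, I want to purchase an event ticket."] * 100   # placeholder data
labels = [1] * 100                                                    # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, random_state=42)   # 70% train, 30% test
# During training, 30% of the training portion is held out for validation:
# model.fit(X_train, y_train, validation_split=0.30, epochs=..., batch_size=35)
```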
Training and validating the models
For the BRNN-LSTM model, an Adam optimizer with a learning rate of 0.01 and a batch size of 35 was used. As shown in Figure 9, after 15 epochs of 23 seconds each, accuracies of 0.9579 and 0.9624 were achieved in validation and testing, respectively.
Figure 10 shows the performance graphs for accuracy and loss. As can be observed, some overfitting exists, which could lead to misclassifications.
To validate the automatic example-generation algorithm, the same training and validation process was run for the BRNN-LSTM model using the original dataset. The resulting F1 score was 0.8796, against 0.9545 for the model trained on the enhanced dataset, as shown in Figure 11. Since the model using the enhanced dataset scored better, it was decided to continue training and validating the remaining models using the enhanced dataset only.
For the ELMo-based model, an SGD optimizer and a batch size of 35 were used, and after 34 epochs of 23 seconds each, an accuracy of 0.9607 in validation was obtained, as shown in Figure 12.
An analysis of this model's performance shows that it performs better than the previous one, without relevant overfitting (Figure 13).
On the other hand, for the BERT-based model, an SGD optimizer and a batch size of 35 were used, and after 7 epochs of 4 minutes each, an accuracy of 0.9676 in validation was obtained, as shown in Figure 14. An analysis of this model's performance shows that it performs better than the previous one, without relevant overfitting (Figure 15).
Comparison between the models
We evaluated the proposed models using a set of new issues that do not belong to the training or the validation datasets (Table 3). The results obtained by testing the different models are listed in the column titled "Probability of being a User Story." From these results, several observations are made:
The identification of short user stories improves as the complexity of the applied model increases (the BRNN-LSTM being the simplest and the BERT-based the most complex), as observed in the first example of Table 3.
Considering the User Story 5 example, the BRNN-LSTM model is not able to recognize this example as negative, while the other models return a lower probability, correctly indicating that it is not a positive example.
Despite orthographic errors or unknown words in a user story, all the models can still generalize correctly, as seen in the User Story 7 example.
After implementing the models and assessing their results, a comparison can be made. The metrics considered are the F1-score obtained during validation through the Sklearn library [40], the complexity, the training time (tr-effort), the syntactic analysis capability (parsing), and the semantic analysis capability (semantic). Table 4 shows the results of the comparison.
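The F1-score can be computed from each model's thresholded probabilities, for instance as follows (the values shown are placeholders):

```python
import numpy as np
from sklearn.metrics import f1_score

# probs stands in for a model's predicted probabilities on the test split
probs = np.array([0.97, 0.12, 0.85, 0.03])   # placeholder values
y_test = np.array([1, 0, 1, 0])              # placeholder labels
y_pred = (probs > 0.5).astype(int)           # threshold at 0.5
print(f1_score(y_test, y_pred))              # 1.0 on this placeholder data
```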
As observed in Table 4, the models obtained similar values for the validation metrics; however, the most notable difference lies in their ability to analyze user stories semantically and syntactically. The BERT-based model has a slightly superior generalization capability. Moreover, although the BERT-based model's complexity is higher (in terms of training time and number of epochs), an improvement in parsing and semantic interpretation can be observed. The parsing of issues is similar for the BERT-based and ELMo-based models.
CONCLUSIONS
In this work, three different neural network models were implemented to identify user stories in large volumes of data. From the results obtained, it was analyzed which model is better for classifying issue records, and it was concluded that the BERT-based model can analyze text syntactically and semantically with higher accuracy and performance. Future work will involve improving the dataset by increasing the number of cases and finding a better balance between the positive and negative classes, and then retraining the models to enhance the results. A limitation of the approach is that a prior process of loading and extracting issues of any type from an IMS is needed to build a dataset to feed the model; in other words, some coding knowledge is still required to use the proposed models.
This work can be the first step toward applying other techniques to analyze user stories within Issue Management Systems. The approach can be embedded in an IMS tool to automatically categorize issues as user stories, which would save time and avoid errors, thereby improving the quality of the software development project documentation. Furthermore, this proposal makes it possible to carry out any study based on user stories and to locate possible requirements or requests for new functionality in a large repository, even if the issues are not labeled as such.