1 Introduction
Knowledge discovery is used to extract useful rules or patterns from data sets. To discover knowledge from massive data, various data mining techniques are used. Frequent pattern mining is one of the more important approaches for uncovering hidden knowledge in massive data. Apriori was proposed for frequent pattern mining [1], but the algorithm has two limitations: it requires multiple database scans and it generates a large number of candidate itemsets. To solve this problem, Han et al. proposed the FP-Growth algorithm [10], which can discover all frequent patterns with only two database scans. Frequent pattern mining can be applied to transactional data as well as sequential data [2], and it has many application areas. One of them is mining knowledge from sequences of accessed mobile services. Smartphones are widely used to perform online transactions, retrieve information, exchange social messages, make video calls, chat, etc. Services which are accessed from smartphones or laptop devices are called mobile web services. These web services are lightweight applications used to perform a specific task, such as booking a ticket through a mobile app or sending a message through WhatsApp. A particular user may access a series of services at different times, at different locations or at a single location. Data mining techniques are used to extract interesting patterns of services. With sequential pattern mining [2], [17], web service sequences can be extracted. These sequences help to characterize the behavior of a specific user. The generated web service patterns are used in different fields, such as user behavior analysis, finding the most frequently accessed services, planning a new service, promoting business, etc. Figure 1 shows a simple scenario for mobile web service sequence generation.
In figure 1, different mobile web services are accessed by mobile users at different locations. Here Home, Hospital, Restaurant, Park, etc., are the locations, and WhatsApp, News, Facebook, Chat, etc., are mobile web services, which are denoted by S1, S2, S3 and so on. Id 1, 2, 3... n represent different service access sequences, such as Id 1= {S5, S3, S1, S6}, Id 2= {S2, S1, S3}, and so on [16], [17]. Traditional sequential pattern mining approaches only consider the items that are sequentially purchased. They do not include any constraint or factor such as the price, profit or preference of items. Sometimes, however, items with low frequency may be important. For example, assume that a pattern <gmail, facebook> exists in a sequence and that it is a low-frequency pattern in the sequence database; a purely frequency-based approach would discard it regardless of its importance. To handle this, Yun et al. proposed weighted sequential pattern mining [29], in which different weights are assigned to items according to the importance of each item. Weighted frequent pattern mining [31] considers the importance of items, and share frequent pattern mining [7] represents the occurrence of items in transactions as nonprofit binary values. In these frameworks, patterns with a high weight value can be extracted even if they occur infrequently. Lan et al. [13] proposed an approach for finding weighted sequential patterns and applied it to traditional transactions and items. Some items with high profit but low count may not be discovered in a sequence database by traditional sequential pattern mining approaches. To address this problem, Ahmed et al. proposed utility sequential pattern mining, which considers not only the quantities and timestamps of items in a sequence, but also the individual profits of items in a quantitative sequence database [5]. Utility plays an important role in data mining for finding the most valuable patterns.
Utility mining [23] has emerged as one of the most valuable research topics in the frequent pattern mining field. In utility mining, each item has an external utility, such as profit or price, and an internal utility, which indicates the non-binary value of the item in a transactional sequence [32].
In the literature, various approaches are available for extracting useful frequent patterns from transactional databases using utility. No current research addresses utility-based frequent pattern extraction from mobile web service sequences. Hence, the problem statement can be expressed as: how can frequent patterns be discovered from mobile web service access sequences using specific utilities?
A utility-based approach can be used to discover frequent patterns from mobile web service sequences. Mobile web service frequent pattern mining is a new application of frequent pattern mining. In this regard, the access preference of a mobile web service can be used as its utility value. A utility-based approach is proposed to reduce the large number of generated candidates. The major contributions of this work are summarized as follows:
The work uses an effectual sequence maximum utility (SMU) approach for the strong upper bound of utility support in subsequences.
The UBFPM (Utility Based Frequent Pattern Mining) approach has been proposed for finding interesting mobile web services patterns.
The proposed approach improves execution efficiency in finding utility-based patterns.
In the experiments, both real and synthetic datasets are used to compare the performance of the proposed approach with state-of-the-art algorithms. The results show that the proposed UBFPM approach performs well in terms of execution time, memory consumption and the number of generated candidates.
The roadmap of the paper is as follows: Section 2 defines the preliminaries and the problem, and briefly recalls the work related to this study. Section 3 introduces our approach. Section 4 presents the experimental results on multiple datasets. The last section presents the conclusions.
2 Definitions and Related Work
To clearly describe the frequent pattern mining approach for mobile web services, this section discusses a set of relevant terms and related studies. It includes the problem description and the relevant definitions.
2.1 Description of the Problem and Definitions
Let us assume the mobile web service access sequence database given in Table 1, in which each row consists of two features, ID and services sequence. There are six mobile web services in the dataset, denoted as S1 to S6. We also assume the utility value of each mobile web service as shown in Table 2. Utility is a measure of how useful, i.e. profitable, a service is. Here the access preference of a service is considered as its utility value. These values are randomly generated for the experiments and assumed in the example.
We adopt definitions similar to those presented in previous works [5], [6], [11], [16], [20], [22], [23], [31]. Let I = {S1, S2, … Sm} be a set of web services. An itemset X containing k items is called a k-itemset, and its length is k. A sequence is an ordered list of services, such as ID8=<S4S2S5>. A mobile web service sequence database D= {ID1, ID2, … IDn} [28] contains n sequences whose itemsets are subsets of I.
Definition 1, Sub and super sequences: Let A and B be two sequences, A=<X1 X2 X3...Xm> and B=<Y1 Y2 Y3...Yn>. If there exist integers 1≤ i1< i2<...< im≤ n such that X1⊆Yi1, X2⊆Yi2, …, Xm⊆Yim, then A is called a subsequence of B, and B is called a super sequence of A [11].
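Definition 1 can be checked mechanically. The following is a minimal Python sketch, assuming each sequence is stored as a list of frozensets; the names `is_subsequence` and `seq` are illustrative and do not come from the original work.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def is_subsequence(a: Sequence[Itemset], b: Sequence[Itemset]) -> bool:
    """Return True if a is a subsequence of b in the sense of Definition 1."""
    j = 0  # next position of b that may still be matched
    for x in a:
        # Greedily advance to the earliest itemset of b that contains x.
        while j < len(b) and not x <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # matched indices must be strictly increasing
    return True


def seq(*itemsets: str) -> List[Itemset]:
    """Hypothetical helper: seq("S1 S2", "S4 S6") builds <{S1,S2}{S4,S6}>."""
    return [frozenset(s.split()) for s in itemsets]


print(is_subsequence(seq("S2", "S6"), seq("S1 S2", "S4 S6")))  # True
print(is_subsequence(seq("S6", "S2"), seq("S1 S2", "S4 S6")))  # False
```

Matching each itemset at the earliest possible position is sufficient here, because an earlier match never rules out a later one.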
Definition 2, Utility value: The utility value of a mobile web service S ranges from 0 to 1. A utility value is associated with each service and indicates the access preference of that service [11].
Definition 3, Utility of service set: Let X= (S1 S2 … Sm). The utility of the mobile web services set X is the summation of the utility values of all mobile web services which belong to X, divided by the cardinality of X [11]:
$$U_X = \frac{\sum_{i=1}^{|X|} U_X(i)}{|X|}$$
where U_X(i) is the utility value of the mobile web service that appears in position i of X.
For example, let X= (S1 S3 S5). Based on Table 2, the utility values are 0.9, 0.2 and 0.5 respectively, and |X|=3; therefore U_X = (0.9+0.2+0.5)/3 ≈ 0.533.
Definition 4, Utility of a sequence: Let p= (X1 X2 X3 … Xn). The utility value of the service sequence p, denoted U_p, is the summation of the utility values of all mobile web services sets which belong to p, divided by the cardinality of p [11]:
$$U_p = \frac{\sum_{i=1}^{|p|} U_{X_i}}{|p|}$$
where U_{X_i} is the utility value of the mobile web services set Xi that appears in position i of p.
For example, according to Table 2, the utility values of the services in the sequence <S4S6S2> are 0.8, 0.6 and 0.4 respectively, and the number of mobile web services in the sequence is 3, so U<S4S6S2>=(0.8+0.6+0.4)/3=0.6.
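The two worked examples above can be reproduced with a few lines of Python. This is only a sketch: the dictionary `UTILITY` collects the utility values quoted in the examples of this section, and the function name `set_utility` is an assumption made for illustration.

```python
from typing import Dict, Sequence

# Utility values as quoted in the examples of this section.
UTILITY: Dict[str, float] = {
    "S1": 0.9, "S2": 0.4, "S3": 0.2, "S4": 0.8, "S5": 0.5, "S6": 0.6,
}


def set_utility(services: Sequence[str]) -> float:
    """Average utility of the services, as in Definition 3."""
    return sum(UTILITY[s] for s in services) / len(services)


print(round(set_utility(["S1", "S3", "S5"]), 3))  # 0.533
print(round(set_utility(["S4", "S6", "S2"]), 3))  # 0.6
```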
Definition 5, Utility of a subsequence: The utility value of a mobile web services subsequence r, denoted U_r, is the summation of the utility values of all mobile web services sets in r over the number of service sets in r [11]. That is,
$$U_r = \frac{\sum_{X \in r} U_X}{|r|}$$
where |r| is the number of mobile web services sets in the subsequence r, and U_X is the utility value of the mobile web services set X in r.
For example, in Tables 1 and 2, the sixth sequence <S1S2><S4S6> consists of two service sets, {S1S2} and {S4S6}, whose utilities are 0.65 and 0.7 respectively, so U<S1S2><S4S6> =(0.65+0.7)/2=0.675.
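As a hedged illustration, the two-level averaging of this example can be written as follows. The sequence is represented as a list of service sets; `UTILITY` holds the utility values quoted in this section, and the function names are hypothetical.

```python
from typing import Dict, Sequence

# Utility values as quoted in the examples of this section.
UTILITY: Dict[str, float] = {
    "S1": 0.9, "S2": 0.4, "S3": 0.2, "S4": 0.8, "S5": 0.5, "S6": 0.6,
}


def set_utility(services: Sequence[str]) -> float:
    """Average utility of the services in one service set."""
    return sum(UTILITY[s] for s in services) / len(services)


def sequence_utility(itemsets: Sequence[Sequence[str]]) -> float:
    """Average of the service-set utilities over the number of sets."""
    return sum(set_utility(x) for x in itemsets) / len(itemsets)


# Sixth sequence of the example: <S1S2><S4S6>
print(round(sequence_utility([["S1", "S2"], ["S4", "S6"]]), 3))  # 0.675
```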
Definition 6, Sequence max utility (SMU) and total sequence max utility (TSMU): The sequence max utility value of a sequence p, SMU, is the maximum utility value among all mobile web services in the sequence p [11]. The total SMU of database D, TSMU, is the summation of the SMU values of all sequences in D [11]. That is,
$$SMU_p = \max_{S \in p} U_S, \qquad TSMU = \sum_{i=1}^{|D|} SMU_{ID_i}$$
where U_S is the utility value of service S and |D| represents the cardinality of the mobile web services access database D.
For example, in Table 1, the SMU values of the ten sequences are 0.9, 0.8, 0.8, 0.4, 0.8, 0.9, 0.9, 0.8, 0.9 and 0.8 respectively. Then TSMU=0.9+0.8+0.8+0.4+0.8+0.9+0.9+0.8+0.9+0.8=8.0.
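A short Python sketch of Definition 6 follows. Table 1 is not reproduced here, so the TSMU is computed from the ten SMU values listed in the example, and `smu` is demonstrated on the sixth sequence <S1S2><S4S6>; all names are illustrative assumptions.

```python
from typing import Dict, List, Sequence

# Utility values as quoted in the examples of this section.
UTILITY: Dict[str, float] = {
    "S1": 0.9, "S2": 0.4, "S3": 0.2, "S4": 0.8, "S5": 0.5, "S6": 0.6,
}


def smu(sequence: Sequence[Sequence[str]]) -> float:
    """Sequence max utility: the largest utility of any service in the sequence."""
    return max(UTILITY[s] for itemset in sequence for s in itemset)


# SMU values of the ten sequences, as listed in the example above.
smu_values: List[float] = [0.9, 0.8, 0.8, 0.4, 0.8, 0.9, 0.9, 0.8, 0.9, 0.8]
tsmu = sum(smu_values)

print(smu([["S1", "S2"], ["S4", "S6"]]))  # 0.9
print(round(tsmu, 2))  # 8.0
```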
Definition 7, Utility support value: The utility support value (USV) of a sequence p is calculated as the addition of the utility values of p over its occurrences in D, divided by the TSMU of D [11]. It is given as
$$USV_p = \frac{\sum_{i=1}^{t} U_p}{TSMU}$$
where t is the number of times p appears as a subsequence in the sequences of the database D.
For example, to find the USV of the sequence ID 4, i.e., <S2S3>, we use t=3, because <S2S3> is a subsequence of ID 1, 4 and 7 by Definition 1.
Definition 8, Utility frequent sequential pattern: A subsequence r is called a utility frequent sequential pattern (FSP) if USV ≥ min_util, where min_util is a predefined minimum utility threshold.
For example, in Table 1 the service S2 appears 7 times, and its utility is 0.4 in Table 2. Then its USV is (0.4+0.4+0.4+0.4+0.4+0.4+0.4)/8=35%. If min_util=25%, then the sequence <S2> is an FSP.
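The S2 example can be reproduced in Python under one explicit assumption: the divisor 8 in the example is read as the TSMU of 8.0 from Definition 6. The function names `usv` and `is_fsp` are hypothetical.

```python
def usv(pattern_utility: float, t: int, tsmu: float) -> float:
    """Utility support value: t occurrences of the pattern utility over TSMU.

    Assumption: the normalising constant is the TSMU of the database.
    """
    return t * pattern_utility / tsmu


def is_fsp(usv_value: float, min_util: float) -> bool:
    """Definition 8: a pattern is an FSP if its USV reaches min_util."""
    return usv_value >= min_util


s2_usv = usv(pattern_utility=0.4, t=7, tsmu=8.0)
print(round(s2_usv, 4))        # 0.35
print(is_fsp(s2_usv, 0.25))    # True
```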
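Definition 9, Utility sequence upper bound: The utility sequence upper bound of a subsequence r is the sum of the SMU values of the sequences in the database that include r, over the TSMU of D [11]. It is denoted as SUB and is defined as
$$SUB_r = \frac{\sum_{ID_i \in D,\ r \sqsubseteq ID_i} SMU_{ID_i}}{TSMU}$$
where r ⊑ ID_i means that r is a subsequence of ID_i (Definition 1).
For example, in Table 1, the sequence <S5> is a subsequence of ID 1, 2, 3, 8, 9 and 10. Therefore, SUB<S5> = (0.9+0.8+0.8+0.8+0.9+0.8)/8.0 = 5.0/8.0 = 62.5%.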
Definition 10, Utility frequent sequence upper bound pattern: A subsequence r is called a utility frequent sequence upper bound pattern (FSUBP) if SUB ≥ min_util.
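Definitions 9 and 10 can be sketched in Python as follows. The SMU values are those listed in the example of Definition 6, indexed by sequence ID, and the IDs containing <S5> are taken from the example above; the names `sub` and `SMU_BY_ID` are assumptions for illustration.

```python
from typing import Dict, Iterable

# SMU of each sequence ID, as listed in the example of Definition 6.
SMU_BY_ID: Dict[int, float] = {
    1: 0.9, 2: 0.8, 3: 0.8, 4: 0.4, 5: 0.8,
    6: 0.9, 7: 0.9, 8: 0.8, 9: 0.9, 10: 0.8,
}


def sub(containing_ids: Iterable[int], smu_by_id: Dict[int, float]) -> float:
    """Sum of the SMU values of the sequences containing the pattern, over TSMU."""
    tsmu = sum(smu_by_id.values())
    return sum(smu_by_id[i] for i in containing_ids) / tsmu


sub_s5 = sub([1, 2, 3, 8, 9, 10], SMU_BY_ID)
print(round(sub_s5, 3))   # 0.625
print(sub_s5 >= 0.5)      # True: <S5> is an FSUBP for min_util = 50%
```

Since the average utility of a pattern cannot exceed the largest utility in a sequence that contains it, SUB can serve as an upper bound when pruning candidates.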
Problem statement: Mobile web service frequent pattern mining is a new application of frequent pattern mining as well as of mobile computing. If a particular user visits multiple locations during a full day (24 hours), we obtain a sequence dataset that records the visits as well as the services accessed by that user. Based on this information we can extract new knowledge and facts. A utility-based approach can be used to discover frequent patterns from mobile web service sequences. In this regard, the access preference of a mobile web service can be used as its utility value. Suppose a user accessed a particular service 20 times in a complete day. This can be expressed as a utility value of 0.2. We therefore propose a utility-based approach that reduces the large number of generated candidates. The problem is to find the complete set of frequent service patterns in database D.
2.2 Related Work
In this subsection, related work on frequent pattern mining, utility mining, utility-based frequent pattern mining and sequential pattern mining is briefly reviewed.
2.2.1 Frequent Pattern Mining
The process of extracting a set of items or subsequences that occur frequently in a dataset is known as frequent pattern mining. Different studies have been conducted to mine frequent patterns from transactional databases. First, Apriori was introduced to mine frequent itemsets from transactional databases [1]. To obtain better results than Apriori, the FP-Growth method was later developed [10]. FP-Growth requires only two scans of the database, which improves efficiency. Several types of databases, such as sequential, incremental and stream databases, are used for frequent pattern mining [31]. The relative importance of items in databases can be captured through weighted frequent pattern mining [32]. In this method, the weight of a pattern is calculated by dividing the sum of the weights of its items by the pattern length.
2.2.2 Utility Mining
In a transactional database, the profit, weight, importance or performance of an item can be considered as its utility value [13], [17]. Utility was first used in transactional databases by Chan et al. [8]. To prune the search space, an estimation method named Umining was used [24]. A level-wise search method uses an item discarding approach to reduce candidate generation [14]. The information required for utility mining can be maintained in a tree structure known as the Huc-Tree [6]. To discover high utility itemsets, the downward closure property must be maintained. This is done by the transaction weighted utilization model, which is based on the Apriori algorithm [15]. Apriori-based utility mining approaches use multiple database scans for candidate generation. To avoid multiple database scans, Ahmed et al. proposed Incremental High Utility Pattern (IHUP) mining, which uses the FP-Tree concept [5]. To enhance the performance of utility mining and of obtaining high utility itemsets, Tseng et al. proposed the UP-Growth algorithm [22], which includes various mining strategies. Its revised version, UP-Growth+ [23], decreases overestimated utilities. Yun et al. proposed MU-Growth [31], a tree-based high utility itemset mining algorithm that reduces the number of candidates.
2.2.3 Utility Based Frequent Pattern Extraction
High utility items may occur with low frequency yet be more important. In a transaction database, for instance, a gold item may have a low frequency but a high value. Transactional association rule mining approaches [1] use 0 or 1 to mark an item as absent or present in a sequence. Such traditional approaches are not sufficient for items with high utility and low frequency, so utility-based approaches are useful for frequent pattern extraction. To fulfill a business objective, Chan et al. proposed the idea of top-K patterns [8]. To discover valuable frequent itemsets, weighted itemset mining has been proposed [28]. Yun et al. also use an upper bound model to handle the downward closure property [28]. Later, various studies were proposed to enhance the performance of weighted itemset mining [3], [4], [5], [9], [26], [27].
2.2.4 Utility Based Sequential Pattern Extraction
Transactions and timestamps are present in a sequential dataset. Such a dataset consists of the transaction Id, consumer details and the list of purchased items, and frequent patterns can be generated from it. Agrawal et al. proposed the AprioriAll, AprioriSome and DynamicSome algorithms for sequential pattern mining [2]. The Generalized Sequential Patterns (GSP) [21] and PrefixSpan [19] approaches were later developed to enhance execution efficiency in sequential pattern mining. Yun et al. proposed various approaches to find weighted sequential patterns in sequential databases [30]. Shie et al. proposed a valuable pattern mining approach to discover high utility itemsets in different shopping websites using quantities and profits [20]. Next, Ahmed et al. proposed high utility sequential pattern mining, which considers the ordering relationship of itemsets together with quantity and profit. In traditional sequential pattern mining, the count of a pattern in a sequence is regarded as one even if the subsequence appears multiple times in that sequence. Based on this concept, the max utility may be a more suitable estimate of the utility of a subsequence in quantitative sequences [5], [11]. The extraction of consumers' purchase behavior is the main use of sequential pattern mining. The max utility concept may be more appropriate for finding high utility sequential patterns in various real-life problems, such as deriving high utility business policies or finding the most accessed services based on their preferences [11]. Lan et al. proposed a sequence utility upper bound model for generating patterns [12]. This model did not adopt any strategy to handle the high utility sequential pattern mining task, so many unpromising subsequences still need to be generated. In addition, the USpan approach spends a great deal of execution time on its LQS-Tree structure [25].
Thus, the utility upper bound reduction for subsequences in mining is quite important. The aim of this study is to develop an efficient approach to extract frequent patterns from mobile web service sequences.
3 Proposed Approach
The proposed approach handles the problem of finding utility frequent sequential patterns in mobile web service sequences. We adopt a minimum utility value to mine frequent utility patterns. The proposed approach, UBFPM, can reduce the number of candidates for utility itemsets and reveal valuable information that may be needed in various applications of user behavior analysis.
The proposed approach consists of four main steps. First, it finds the FSP-1 and FSUBP-1 patterns. Second, the sequences are modified and new SMUs are calculated based on the FSUBP-1 patterns. Next, a postfix database is generated for each frequent web service. Last, this process is applied recursively to generate the FSP-n patterns using the postfix databases. Section 3.1 describes the generation of the FSP-1 and FSUBP-1 patterns; Section 3.2 describes sequence modification and the calculation of new SMUs; Section 3.3 describes the preparation of the postfix database; Section 3.4 describes the generation of the FSP-n patterns.
3.1 FSP-1 and FSUBP-1
In this step, the sequence database is scanned, and the SMU value of each sequence is calculated. The result is shown in Table 3.
After obtaining the SMU and TSMU values, we generate the subsequence-1 SUB and USV values of all services, which are shown in Table 4.
After applying the minimum utility threshold min_util=50%, the following FSP-1 and FSUBP-1 patterns are discovered.
FSUBP-1 patterns: S2, S4, S5 and S6.
FSP-1 patterns: S4, and S4.
3.2 Modified SMU and Sequence Generation
Four services, S2, S4, S5 and S6 are above the minimum utility threshold. These services can be used to generate the next level sequential patterns. They are also used to modify sequences and SMU values. These modified values are shown in Table 5.
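The modification step can be sketched as follows, under the assumption that it removes every service outside the FSUBP-1 set and recomputes the SMU on what remains. The example uses the sequence Id 1 = {S5, S3, S1, S6} from the introduction with one service per element; the function name `modify_sequence` is hypothetical, and Table 5 itself is not reproduced.

```python
from typing import Dict, List, Set, Tuple

# Utility values as quoted in the examples of Section 2.1.
UTILITY: Dict[str, float] = {
    "S1": 0.9, "S2": 0.4, "S3": 0.2, "S4": 0.8, "S5": 0.5, "S6": 0.6,
}
FSUBP_1: Set[str] = {"S2", "S4", "S5", "S6"}


def modify_sequence(sequence: List[str], keep: Set[str]) -> Tuple[List[str], float]:
    """Drop services outside `keep` and return (pruned sequence, new SMU)."""
    pruned = [s for s in sequence if s in keep]
    # An emptied sequence gets SMU 0.0 so that it no longer contributes.
    new_smu = max((UTILITY[s] for s in pruned), default=0.0)
    return pruned, new_smu


print(modify_sequence(["S5", "S3", "S1", "S6"], FSUBP_1))  # (['S5', 'S6'], 0.6)
```

In this example the SMU drops from 0.9 to 0.6 once S1 is removed, which tightens the upper bound used at the next level.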
3.3 Postfix Sequence Generation
Postfix sequences are generated according to the FSUBP-1 patterns. In a postfix sequence, only the services which follow the FSUBP-1 pattern are considered. Table 6 shows the postfix sequences of the FSUBP-1 pattern <S4>.
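A minimal sketch of postfix generation is given below. It assumes one service per sequence element and cuts each sequence after the first occurrence of the pattern service; the small database is hypothetical and is not the content of Table 6.

```python
from typing import Dict, List, Optional


def postfix(sequence: List[str], service: str) -> Optional[List[str]]:
    """Services that follow the first occurrence of `service`, or None if absent."""
    if service not in sequence:
        return None
    return sequence[sequence.index(service) + 1:]


def postfix_database(db: Dict[int, List[str]], service: str) -> Dict[int, List[str]]:
    """Collect the non-empty postfix sequences of `service` over the database."""
    result: Dict[int, List[str]] = {}
    for seq_id, sequence in db.items():
        rest = postfix(sequence, service)
        if rest:  # skip sequences without the service or with nothing after it
            result[seq_id] = rest
    return result


# Hypothetical modified sequences (illustration only).
db = {1: ["S5", "S6"], 2: ["S4", "S6", "S2"], 3: ["S2", "S4"]}
print(postfix_database(db, "S4"))  # {2: ['S6', 'S2']}
```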
3.4 Recursively Generating FSP-n and FSUBP-n
The above steps are recursively applied to generate FSP-n and FSUBP-n patterns. Based on the FSUBP-1 pattern S4, the following subsequence-2 can be generated.
FSUBP-2 patterns: S4 S6.
FSP-2 patterns: Nil.
In a similar way, all the FSP-n and FSUBP-n patterns can be generated. The algorithm UBFPM for generating utility frequent sequential patterns is shown below.
Algorithm 1: Algorithm of generating utility frequent sequential patterns
Input: Web service sequence database D, utility values and min_util threshold
Output: FSP-n and FSUBP-n patterns
Scan database D and compute SMU, TSMU
for each Si ∈ D do
{
    Calculate SUBi and USVi
    if (SUBi ≥ min_util)
    {
        Generate FSUBP-1 patterns
    }
    if (USVi ≥ min_util)
    {
        Generate FSP-1 patterns
    }
}
for each IDi ∈ D do
{
    if (Si ∉ SUB1)
    {
        Modify SMU and sequences
    }
}
for each SUBi ∈ SUB-1 do
{
    for each Si ∈ SUB-1 do
    {
        Generate postfix sequence Si
    }
}
for each postfix sequence Si do
{
    Calculate SUBn and USVn of Si
    if (USVn subsequence of Si ≥ min_util)
    {
        Generate FSP-n patterns
    }
}
Print all FSP-n patterns
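The overall flow of Algorithm 1 can be approximated by the following simplified Python sketch. It is not the authors' implementation and rests on several assumptions: each sequence element holds a single service, the USV is normalised by the TSMU, the SMU modification step of Section 3.2 is omitted (the original SMU of each sequence is reused as the bound), and the database passed to `mine` is hypothetical.

```python
from typing import Dict, List, Tuple

# Utility values as quoted in the examples of Section 2.1.
UTILITY: Dict[str, float] = {
    "S1": 0.9, "S2": 0.4, "S3": 0.2, "S4": 0.8, "S5": 0.5, "S6": 0.6,
}
Pattern = Tuple[str, ...]


def mine(db: List[List[str]], min_util: float) -> Dict[Pattern, float]:
    """Return {pattern: USV} for every pattern whose USV reaches min_util."""
    tsmu = sum(max(UTILITY[s] for s in seq) for seq in db)
    result: Dict[Pattern, float] = {}

    def grow(prefix: Pattern, projected: List[Tuple[float, List[str]]]) -> None:
        # Candidate extensions are the services present in the postfixes.
        for s in sorted({x for _, post in projected for x in post}):
            hits = [(smu, post) for smu, post in projected if s in post]
            pattern = prefix + (s,)
            avg_utility = sum(UTILITY[x] for x in pattern) / len(pattern)
            usv = len(hits) * avg_utility / tsmu
            sub = sum(smu for smu, _ in hits) / tsmu
            if usv >= min_util:
                result[pattern] = usv  # utility frequent sequential pattern
            if sub >= min_util:
                # Only patterns passing the upper bound are extended further.
                grow(pattern, [(smu, post[post.index(s) + 1:]) for smu, post in hits])

    grow((), [(max(UTILITY[s] for s in seq), seq) for seq in db])
    return result


# Hypothetical database (illustration only).
db = [["S4", "S6", "S2"], ["S2", "S4", "S6"], ["S4", "S2"], ["S5", "S4", "S6"]]
for pattern, value in mine(db, min_util=0.4).items():
    print(pattern, round(value, 3))
```

The sketch extends a pattern only when its SUB passes the threshold, which is the role the FSUBP patterns play in the steps described above.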
4 Experiments
In this section, we evaluate the performance of our algorithm UBFPM. The experiments were performed on a Pentium Dual-Core 3.3 GHz processor with 8 GB of memory, using the Java programming language under the Windows 7 operating system. The simulation is performed on both a synthetic and a real database. The performance of the proposed UBFPM approach is compared with state-of-the-art pattern mining approaches, namely IHUP [5], UP-Growth [22], UP-Growth+ [23] and MU-Growth [31].
4.1 Experiment on Synthetic Dataset
In this experiment, we use the public IBM data generator (Site 2) to produce the mobile web services sequence data. The parameters used in the IBM data generator were the average length of transactions per sequence S, the average length of services per transaction T, the average length of the maximum potentially frequent services set I, the total number of distinct mobile web services N, and the total number of sequences D. For each generated service sequence dataset, a corresponding utility table was also produced, in which a utility value in the range from 0.0 to 1.0 was randomly assigned to each service. The simulation model used to generate the utilities of the services in the sequences was similar to that of Liu et al. [15]. Figure 2 shows the utility-value distribution of all the mobile web services generated by the simulation model in the utility table.
4.2 Performance Comparison on the Synthetic Dataset
Figure 3 shows the experimental results of the performance evaluation on a synthetic dataset. Figures 3 (a) and (b) present the results for total execution time. Figures 3 (c) and (d) present the number of FSUBP patterns for a fixed data size (200k) and for varied data set sizes, respectively. In figure 3, the proposed approach UBFPM has the best performance in terms of total execution time as well as the lowest memory consumption. The other approaches generate more FSUBP patterns, while UBFPM generates fewer. In figures 3 (a) and (b), UBFPM takes less time than the state-of-the-art approaches because those approaches use a tree-based pruning strategy. Tree-based pruning requires time to construct the tree first, and only then prunes based on the minimum utility threshold. In terms of execution time, the approach is more efficient while the minimum utility threshold is less than 0.60%. As seen in figure 3 (a), when the minimum utility threshold increases from 0.20% to 0.60%, the execution time varies across approaches. When the minimum utility is higher than 0.60%, this variation decreases, and above 1% the execution times are approximately similar. The same holds for different data sizes. As figure 3 (b) shows, for a small data size (100k) all algorithms take approximately the same execution time, but when the number of sequences is increased (about 200k or more), the previous algorithms take more time, while UBFPM performs well.
4.3 Performance Comparison on Real Datasets
We present the experimental results of the compared approaches under varied minimum utility values in figure 4. For this performance comparison, the kosarak and retail datasets are used, with different minimum utility thresholds for each dataset. In figure 4 (a), the runtime of UBFPM is the best among all approaches on the kosarak dataset. The results show that the proposed approach remains more suitable as the minimum utility threshold increases. In addition, the approach generates the smallest number of FSUBP patterns. Another comparison is shown in figure 4 (b) for the retail dataset. When the minimum utility threshold is increased from 5000 to 20000, the execution time decreases. This figure shows that all approaches achieve a good execution time beyond a minimum utility threshold of 20000. Figure 4 also shows that MU-Growth performs well on both datasets.
4.4 Memory Usage
Figure 5 shows the memory usage of the different approaches on different datasets. UBFPM always uses less memory than the other algorithms. The reason is that these algorithms have to reserve a very large amount of memory to store candidate itemsets during execution, while UBFPM does not. Figures 5 (a) and (b) show memory consumption on the kosarak and retail datasets, respectively. Figure 5 (a) indicates that when the minimum utility increases from 1000000 to 3000000, memory usage gradually decreases. In this figure the memory usage of the other approaches also decreases, but UBFPM frees more memory space during execution. Memory usage for the retail dataset is shown in figure 5 (b). When the minimum utility increases from 5000 to 25000, memory usage decreases.
4.5 Discussion
In the above experiments, the proposed approach UBFPM outperforms the compared state-of-the-art algorithms. To mine interesting patterns, almost all existing algorithms first generate candidate itemsets and subsequently compute the exact utility of each candidate to identify interesting patterns. The UBFPM approach does not generate candidate sets, as it stores only postfix sequences of services. The experimental results showed that UBFPM extracts frequent patterns faster than the state-of-the-art approaches [5], [22], [23], [31]. Figures 3, 4 and 5 show that the UBFPM approach reduces the execution time as well as memory usage. The number of frequent patterns generated by the proposed approach was clearly smaller than that of the existing algorithms. The main reason is that the maximum utility value in a services sequence is a more suitable upper bound for any subsequence of that sequence. Today, mobile devices and services are becoming more and more popular for various applications. Effective data mining techniques that meet users' requirements for both reliability and timeliness on mobile devices with limited resources remain a crucial challenge. Regarding reliability, our proposed approach is shown to produce highly frequent patterns, as illustrated by the rationale in section 4. Hence, the proposed approach not only meets timeliness requirements but also identifies interesting patterns for users.
5 Conclusion
In this paper, we propose an efficient approach named UBFPM for frequent service pattern extraction, using the access preference of a service as its utility. We also propose an algorithm which is based on the postfix sequence generation of service sequences. More accurate frequent upper bounds are computed to enhance the filtering of service sequences. The proposed approach can discover highly frequent FSUBP patterns and sequential FSP patterns of service sequences. These discovered patterns are useful for mobile web service users and business analysts. If a service provider knows the frequent patterns of a sequence beforehand, they can make decisions to enhance their business. The experimental results show that UBFPM performs better than the previously developed approaches it was compared with. With the help of this approach, mobile web service prediction and maintenance become simpler. In the next steps, we will attempt to handle the dynamic maintenance problem of utility-based sequential patterns when sequences are dynamically modified.