Data mining is the process of observing large pools of information to generate more, meaningful information. It is a meticulous task to make sense of seemingly disparate data by using rigorous algorithms to slowly shape the data into readable, useful information. Intuitively, data mining is about extrapolating patterns and new knowledge from the data we’ve already collected. One of the major cornerstones of data mining is the use of mining association rules. In computer science terms, they are a spiderweb of IF-THEN statements that bloom useful patterns in the data, which can be used for high-end business or managerial decisions.
Mining frequent itemsets and association rules is a popular and well-researched approach for discovering interesting relationships between variables in large databases and it is the foundation on which many data-mining algorithms are based on.
The most widely used ‘market basket analysis’ employs mining association rules to the greatest extent. Amongst the plethora of functionality provided by the association mining techniques, each has its perks and drawbacks.
The need to obtain meaningful patterns and knowledge for business from raw data creates healthy competition for improvement in algorithms and techniques. Most businesses strive to create tailored applications implementing complex data mining techniques to meet business requirements. In the following sections, we will discuss some technological advancements in recent years in the domain of mining association rules. It is crucial to note that each proposed technique is unique and provides specific functionality which pertains to a specific task.
Online data mining is dynamic in nature and finding relevant patterns in real-time data streams can be challenging.
Relational Associations Rules (RARs) extend the basic concept of association rules to extract different relations between attributes that identify the given data. Here, a new Incremental Relational Association Rule Mining (IRARM) approach is used. The technique evolves to changing environments and adapts to the traditional, relational association rules. The results of using IRARM on available datasets proved its efficiency and productivity. There was a significant reduction in time in using IRARM in comparison to mining all the rules from scratch. We introduce IRARM to find all RARs in a dynamic data set, when any number of new entities are added to it.
We consider the following data set of entities (objects) described by m attributes. Thus, each entity ei from the set is as an m-dimensional sequence. We denote by R, all possible binary relations that can be defined on the cartesian product of two attributes domains. Using the offline DRAR algorithms and the binary relations from the set R, one can uncover the set relational association rules of all interesting RARs which hold over E. The data set E evolves in time, being extended at a certain time with a non-empty set of entities. The extended set of entities will be denoted, while the set of newly added entities is denoted. Given the minimum support and confidence thresholds, smin and cmin respectively, we investigate the problem of incrementally discovering the set RARext of all interesting RARs which hold over Next by adjusting the set of interesting RARs mined in E (before data extension).
Through the incremental approach, we reduce the time needed to determine the set RARext of RARs from Next whose support and confidence are greater than men and cmin, respectively. We note that IRARM is independent of the way the interestingness of a RAR is defined. The main idea used for determining the set RARext will be briefly described in the following. We divided IRARM in two successive stages. The first stage consists of filtering the set RAR of rules interesting in the initial data set E to keep only the rules which are also interesting in the extended data set Next too. The second stage is extending this subset resulting from the first stage with new rules which were not interesting in E, but become interesting on the extended data set Next. Consequently, after the second phase, the set RARext of RARs which are interesting in the extended data set Next will have been discovered.
After numerous experiments, it is evident that this new IRARM algorithm has many benefits over ‘starting from scratch in the entire data mining procedure. As entities are added to the data pool, the algorithm adapts to the dynamic environment and new data is considered to extract new, interesting RARs which can be used for prediction.
The exponential increase in data from in numerous sources has caused a bottleneck in the operating capacity of organizations. Extensive data processing is the major focus in Information Technology. Association rule mining in large databases is becoming a behemoth task. Even the most commonly used Apriori algorithm is also inefficient when applied to big databases as it requires more I/O load. This load could be distributed amongst all available systems; distributed and parallel. Making use of all available resources to compute data will increase throughput in minimum time.
In the simplest terms, a traditional computer has a single processing unit. We can use multiprocessing techniques to maximize CPU utilization. Another way to do this is by using multiple processors to achieve the desired output. Here, we combine both. The proposed idea is to partition the large volume of data into manageable clusters and assign each of them to specific processing units. The PAFI (Partition Algorithm for Mining Frequent Itemsets) is used to partition the clusters followed by a matrix method for transaction reduction to reduce redundant database access.
A hybrid master-slave architecture is used for implementation. The master processor handles the task of reducing the database into clusters using PAFI and then assigning it to the slave processors. The latter run multiple scans on the clusters to generate frequent itemsets (FIS). These frequent itemsets are submitted to the master processor, which is then stitched to obtain the entire set of FIS for the entire database. The following figure illustrates the system architecture.
Input: Database, Threshold, and Number of clusters. Output: Generate clusters, matrix, and frequent itemsets. Steps:
The main takeaway from this algorithm is that it will work accurately for large databases to find FIS efficiently. The key is to decide if partitioning is necessary. It is an overhead to form the clusters and then regroup the data processed by the slave processors. There will be situations when simple multiprocessing algorithms may solve the problem in a reasonable time. All in all, the system works well with large databases to generate frequent itemsets.
Association rule mining is widely used in the retail sales industry to identify which products are consistently bought together daily and then reposition them in the inventory to increase their probability of sale. This implementation gives the business an edge over the others who don’t. Association rule involves 2 major steps.
Referring to the previous section we have seen the benefits of using a distributed system for implementation and we continue that discussion in this implementation as well. The entire procedure can be summarized in a pseudo-algorithm:
The system works in two phases. The first partitions the data and finds frequent itemsets along with the interesting measures for each zone. In the second phase, the association’s rules are segregated into consistent and inconsistent rules. Data is sent for preprocessing as illustrated in the figure below. Due to the enormity of databases, the preprocessing is carried out in a distributed environment. This majorly involves the Hadoop MapReduce Framework. The mapper is setup to tag zonal itemsets and then the reducer trims the transactions to our need. For this stage, the interestingness measures (IM) are calculated for each generated rule. A rule is said to be consistent if its interestingness measure is closer to the global IM value. Else, the rule is categorized as inconsistent.