Association Rule Mining
This section focuses on Association Rule Mining (ARM), an unsupervised machine learning technique used to discover relationships among variables in transaction data.
The following illustrations help explain the methods and considerations involved in using ARM.
Resources: raw data, code, transaction data, lights.csv
Concepts and Considerations
To conduct ARM, the data must be formatted in one of three ways. For this illustration, the basket format was chosen, so the text data pertaining to smart home devices was reformatted into the basket format, which can be seen below.
After the data is in the proper format, it must be tokenized. This dataset has 13 different labels. To speed up each individual analysis, tokenization was applied once to the entire dataset, and each label, together with its tokenized data, was then written to its own csv file.
This saves time because the user does not need to rerun all of the code every time a different label is analyzed. It is enough to read in the tokenized csv file for the label of interest and run the algorithm.
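The tokenize-once, one-file-per-label workflow described above could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the regular expression used for tokenizing, and the sample commands in the usage note are all assumptions.

```python
import csv
import re
from collections import defaultdict
from pathlib import Path


def tokenize(command):
    """Lowercase a command and return its unique word tokens, sorted."""
    return sorted(set(re.findall(r"[a-z']+", command.lower())))


def write_baskets_by_label(rows, out_dir):
    """Tokenize every (label, command) pair once, then write one basket csv per label.

    Each csv row is one transaction (basket): the tokens of a single command.
    Returns a dict mapping each label to the path of its csv file.
    """
    baskets = defaultdict(list)
    for label, command in rows:
        baskets[label].append(tokenize(command))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for label, transactions in baskets.items():
        path = out_dir / f"{label}.csv"
        with path.open("w", newline="") as handle:
            csv.writer(handle).writerows(transactions)
        paths[label] = path
    return paths


def read_baskets(path):
    """Read a basket csv back into a list of token lists, skipping empty rows."""
    with Path(path).open(newline="") as handle:
        return [row for row in csv.reader(handle) if row]
```

With this layout, analyzing a different label only requires a call such as `read_baskets("lights.csv")` rather than re-tokenizing the whole dataset (the hypothetical rows passed in would look like `("lights", "Turn on the lights")`).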
In the analysis, the labels garagedoor, information and light were analyzed. For the purpose of this demonstration, however, only the analysis of the lights label is discussed here. This project attempts to answer the question: which words are most commonly used with the label "lights"? A picture of the tokenized data can be seen below.
After tokenization, the data is passed to the apriori algorithm. The apriori algorithm prunes every itemset and rule that falls below the minimum support and confidence thresholds provided by the programmer. In this case, a minimum support of 0.13 and a minimum confidence of 0.9 were provided, and 17 rules were retrieved. The top 15 rules sorted by support, confidence and lift are compared in the following sections.
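To make the pruning step concrete, here is a small pure-Python sketch of an Apriori-style search followed by rule generation. It is not the implementation used in this project (which presumably relied on a library); the function names and the toy transactions in the usage note are assumptions chosen for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search. Returns {itemset (frozenset): support}."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    current = {}
    for item in {i for t in transactions for i in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    rules.append((antecedent, consequent, sup, confidence, lift))
    return rules
```

On the real data the thresholds from the text would be passed in, for example `association_rules(frequent_itemsets(baskets, 0.13), 0.9)`, where `baskets` is a hypothetical list of tokenized commands for one label.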
Support
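The support of a rule A -> B is the joint probability of the two itemsets, i.e. the fraction of transactions that contain every item in both A and B.
Sup(A U B) = P(A U B) = (number of transactions containing all items of A and B) / (total number of transactions).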
The rules sorted from highest to lowest support are shown in the table below.
From this table it can be concluded that the words "turn" and "light" appear together in 45% of commands pertaining to the lights. This can help with text prediction when programming a smart home device!
Another example is the term "make", which appears together with "lights" in 25% of commands pertaining to lights.
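As a small illustration of how a support value like those above is computed, the sketch below counts how many transactions contain a given itemset. The function name and the four toy commands are hypothetical and are not taken from the project's data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    matches = sum(1 for t in transactions if itemset <= set(t))
    return matches / len(transactions)


# Hypothetical tokenized commands, one set of tokens per command.
toy_transactions = [
    {"turn", "on", "light"},
    {"turn", "off", "light"},
    {"make", "light", "less"},
    {"turn", "light"},
]

turn_light_support = support({"turn", "light"}, toy_transactions)
```

Here three of the four toy commands contain both "turn" and "light", so `turn_light_support` is 3 / 4.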
Confidence
Confidence can be understood as the fraction of times itemset B is used when itemset A is used:
Conf(A -> B) = Sup(A U B) / Sup(A).
The rules sorted from highest to lowest confidence are shown in the table below.
For confidence, the highest observed value is 1, which occurs often. Every rule with a confidence of 1 can be read as follows:
Whenever the words in the antecedent (left-hand side) were used, the word in the consequent (right-hand side) was used as well 100% of the time. For example, every time the word "less" was used, the word "make" was used as well.
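The direction of a rule matters for confidence, which a short sketch can show. The function name and toy commands below are assumptions for illustration only; they mirror the "less" and "make" example but are not the project's data.

```python
def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [set(t) for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        raise ValueError("antecedent never occurs in the transactions")
    with_both = sum(1 for t in with_antecedent if consequent <= t)
    return with_both / len(with_antecedent)


# Hypothetical tokenized commands.
toy_transactions = [
    {"make", "light", "less"},
    {"make", "light", "dark"},
    {"turn", "light"},
    {"turn", "on", "light"},
]

less_to_make = confidence({"less"}, {"make"}, toy_transactions)
make_to_less = confidence({"make"}, {"less"}, toy_transactions)
```

In this toy data every command with "less" also contains "make", so the rule {less} -> {make} has confidence 1, while the reverse rule {make} -> {less} only holds for one of the two commands that contain "make".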
Lift
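Lift is defined as Sup(A U B) / (Sup(A) * Sup(B)). In plain English: lift measures the performance of an association rule compared to a randomly chosen association, i.e. the ratio of the target response to the average response. A lift of 1 means A and B are independent, a lift above 1 means they occur together more often than expected, and a lift below 1 means they occur together less often than expected.
The rules sorted from highest to lowest lift are shown in the table below.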
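For lift, it is observed that multiple sets of words have a lift of four. This means that the itemsets {less}, {can, less}, {less, light} and {can, less, light} are all positively associated with the word "make": each of them occurs together with "make" four times as often as would be expected if they were independent of it.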
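A lift of four can be reproduced with a short calculation. The sketch below assumes that the support of {make} is 0.25 (consistent with the 25% figure reported in the Support section) and uses a made-up support of 0.2 for {less}. Because the rule {less} -> {make} has confidence 1, the support of {less, make} equals the support of {less}, so the result does not depend on that made-up value.

```python
def lift(support_ab, support_a, support_b):
    """Lift = Sup(A U B) / (Sup(A) * Sup(B))."""
    return support_ab / (support_a * support_b)


def interpret_lift(value, tolerance=1e-9):
    """Describe what a lift value says about the association between A and B."""
    if abs(value - 1.0) <= tolerance:
        return "independent"
    return "positive association" if value > 1.0 else "negative association"


# Assumed values (not taken from the project's tables):
support_less = 0.2        # hypothetical support of {less}
support_make = 0.25       # assumed support of {make}
support_both = 0.2        # equals support_less because confidence is 1

less_make_lift = lift(support_both, support_less, support_make)
less_make_reading = interpret_lift(less_make_lift)
```

Equivalently, lift is confidence divided by the support of the consequent, which here is 1 / 0.25.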
Visualizations
Here we will look at three interesting visualizations pertaining to this ARM analysis.
From this interactive graph it can be observed that some words are more strongly associated than others. Words such as {can}, {dark}, {make} and {lights}, for example, are much more central than the rest.
By looking at this graph, we see that the word {less} is also significant. We were not able to see this in the interactive graph because of its complexity.
This graph shows the centrality of the word {light}, and in a sense also shows the relative importance of the word {make}, which is interesting because it can be related to the lift measurement.
Conclusion
To conclude, ARM is an extremely powerful unsupervised learning method that enables us to find patterns in transactions. It was shown that it not only enables users to find related words, but also, in a sense, to understand the relationships among them.
Additionally, it can be used to determine which variables have better predictive ability than others.
NETID : MZ569