Wednesday, August 21, 2019
K Means Clustering With Decision Tree Computer Science Essay
The K-means clustering data mining algorithm is commonly used to find clusters in a dataset because it is simple to implement and fast to execute. After applying the K-means algorithm, however, the resulting clusters are difficult to interpret and to extract required results from unless another data mining algorithm is applied to them. The Decision tree (ID3) algorithm is used for the interpretation of the clusters of the K-means algorithm because ID3 is fast to use, generates understandable rules, and is simple to explain. In this research paper we integrate the K-means clustering algorithm with the Decision tree (ID3) algorithm into one algorithm using an intelligent agent, called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of both classifying and interpreting a given dataset. For the visualization of the clusters, 2D scatter graphs are drawn.

Keywords: Classification, LIAgent, Interpretation, Visualization

1. Introduction

Data mining algorithms are applied to discover hidden, new patterns and relations in complex datasets. The use of intelligent mobile agents in data mining algorithms further extends their study. The term "intelligent mobile agent" combines two different disciplines: the agent comes from Artificial Intelligence, while code mobility is defined in distributed systems. An agent is an object which has an independent thread of control and can be initiated. The first step is the agent's initialization. The agent then starts to operate, and may stop and start again depending on the environment and the tasks it is trying to accomplish. After the agent has finished all its required tasks, it ends in its complete state. Table 1 describes the different states of an agent [1][2][3][4].

Table 1. States of an agent

  Name of Step   Description
  Initialize     Performs one-time setup activity.
  Start          Starts its job or task.
  Stop           Stops its job or task after saving intermediate results.
  Complete       Performs completion or termination activity.

There is a link between Artificial Intelligence (AI) and Intelligent Agents (IA). In Artificial Intelligence, data mining is known as Machine Learning. Machine Learning deals with the development of techniques that allow a computer to learn; it is a method of creating computer programs by analysis of datasets. Agents must be able to learn to perform classification, clustering and prediction using learning algorithms [5][6][7][8].

The remainder of this paper is organized as follows: Section 2 reviews the relevant data mining algorithms, namely K-means clustering and the Decision tree (ID3). Section 3 describes the methodology, a hybrid integration of the two data mining algorithms. Section 4 presents the results and discussion. Finally, Section 5 presents the conclusion.

2. Overview of Data Mining Algorithms

The K-means clustering data mining algorithm classifies a dataset by producing clusters of that dataset; it is a form of unsupervised machine learning. The Decision tree (ID3) data mining algorithm interprets these clusters by producing decision rules in if-then-else form; it is a form of supervised machine learning. Both algorithms are combined into one algorithm through an intelligent agent, called the Learning Intelligent Agent (LIAgent). In this section we discuss both algorithms.

2.1. K-means Clustering Algorithm

The following steps describe the K-means clustering algorithm:

Step 1: Enter the number of clusters and the number of iterations, which are the required and basic inputs of the K-means clustering algorithm.

Step 2: Compute the initial centroids using the Range Method shown in equations 1 and 2.
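Under one plausible reading of this Range Method (the equations themselves were garbled in transcription, so the exact formula below, like the function name, is our assumption: the centroids are spread evenly across each attribute's range), the Step 2 initialization can be sketched in Python as:

```python
def range_method_centroids(xs, ys, k):
    """Spread k initial centroids across the ranges of the X and Y
    attributes, one centroid per n = 1..k (our reading of the Range Method)."""
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    return [
        (min_x + n * (max_x - min_x) / k,   # ci for the X attribute
         min_y + n * (max_y - min_y) / k)   # cj for the Y attribute
        for n in range(1, k + 1)            # n varies from 1 to k
    ]

# Example: 4 centroids for attributes ranging over [0, 8] and [0, 16]
centroids = range_method_centroids([0, 2, 5, 8], [0, 4, 9, 16], 4)
```

Every centroid produced this way lies inside the observed range of the data, which gives the algorithm a reasonable starting point.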
  ci = minX + (n × (maxX − minX)) / k   (1)
  cj = minY + (n × (maxY − minY)) / k   (2)

The initial centroid is C(ci, cj), where maxX, maxY, minX and minY represent the maximum and minimum values of the X and Y attributes respectively, k represents the number of clusters, and i, j and n vary from 1 to k, where k is an integer. In this way we can calculate the initial centroids; this is the starting point of the algorithm. The value (maxX − minX) gives the range of the X attribute; similarly, (maxY − minY) gives the range of the Y attribute. The number of iterations should be small, otherwise the time and space complexity will be very high, and the value of the initial centroids may become very high and fall outside the range of the given dataset. This is a major drawback of the K-means clustering algorithm.

Step 3: Calculate the distances using the Euclidean distance formula in equation 3. On the basis of these distances, generate the partition by assigning each sample to the closest cluster.

  d(xi, xj) = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + … + (xiN − xjN)² )   (3)

where d(xi, xj) is the distance between samples xi and xj, and N is the total number of attributes of a given object.

Step 4: Compute new cluster centers as the centroids of the clusters, then recompute the distances and regenerate the partition. Repeat until the cluster membership stabilizes [9][10].

The strengths and weaknesses of the K-means clustering algorithm are summarized in table 2.

Table 2. Strengths and Weaknesses of the K-means Clustering Algorithm

Strengths:
  - Time complexity is O(nkl): linear in the size of the dataset.
  - Space complexity is O(k + n).
  - It is an order-independent algorithm: it generates the same partition of the data irrespective of the order of the samples.

Weaknesses:
  - Although easy to implement, it depends on the initial centres provided.
  - If a distance measure does not exist, especially in multidimensional spaces, the distance must first be defined, which is not always easy.
  - The results obtained from this clustering algorithm can be interpreted in different ways.
  - No clustering technique addresses all the requirements adequately and concurrently.

The K-means clustering algorithm can be applied in areas including, but not limited to:

  - Marketing: finding groups of customers with similar behaviour, given a large database of customers containing their profiles and past records.
  - Biology: classification of plants and animals given their features.
  - Libraries: book ordering.
  - Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.
  - City planning: identifying groups of houses according to house type, value and geographical location.
  - Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones.
  - WWW: document classification; clustering web log data to discover groups with similar access patterns.
  - Medical sciences: classification of medicines; patient records according to their doses, etc. [11][12].

2.2. Decision Tree (ID3) Algorithm

The Decision tree (ID3) algorithm produces decision rules as its output. The rules obtained from ID3 are in if-then-else form, which can be used for decision support systems, classification and prediction. The decision rules help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The function of the Decision tree (ID3) algorithm is shown in figure 1.

Figure 1. The Function of the Decision Tree (ID3) Algorithm

A cluster is the input data for the Decision tree (ID3) algorithm, which produces the decision rules for that cluster. The following steps describe the algorithm:

Step 1: Let S be a training set. If all instances in S are positive, create a YES node and halt. If all instances in S are negative, create a NO node and halt. Otherwise select a feature F with values v1, …, vn and create a decision node.

Step 2: Partition the training instances in S into subsets S1, S2, …, Sn according to the values of F.

Step 3: Apply the algorithm recursively to each of the sets Si [13][14].

Table 3 shows the strengths and weaknesses of the ID3 algorithm.

Table 3. Strengths and Weaknesses of the Decision Tree (ID3) Algorithm

Strengths:
  - It generates understandable rules.
  - It performs classification without requiring much computation.
  - It is suitable for handling both continuous and categorical variables.
  - It provides a clear indication for prediction or classification.

Weaknesses:
  - It is less appropriate for continuous attributes.
  - It performs poorly on problems with many classes and a small number of training examples.
  - Growing a decision tree is computationally expensive, because each node must be sorted before finding the best split.
  - It splits on a single field at a time and does not treat non-rectangular regions well.

3. Methodology

We combine two different data mining algorithms, K-means clustering and the Decision tree (ID3), into one algorithm using an intelligent agent called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of both clustering and interpreting a given dataset. The clusters can also be visualized using 2D scatter graphs. The architecture of this agent system is shown in figure 2.

Figure 2. The Architecture of the LIAgent System

The LIAgent is a combination of two data mining algorithms: the K-means clustering algorithm produces the clusters of the given dataset, which constitutes the classification of that dataset, while the Decision tree (ID3) algorithm produces the decision rules for each cluster, which are useful for interpreting those clusters. The user can access both the clusters and the decision rules through the LIAgent.
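To make the clustering half of the LIAgent concrete, the K-means steps of section 2.1 (distance-based assignment followed by centroid recomputation, repeated until membership stabilizes) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation; all function and variable names are our own:

```python
import math

def euclidean(p, q):
    # Equation 3: straight-line distance between two samples
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, centroids, iterations=50):
    """Steps 3-4: assign each sample to its closest centroid, then
    recompute each centroid as the mean of its cluster, repeating
    until membership stabilizes or the iteration budget runs out."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(col) / len(col) for col in zip(*members)) if members
            else centroids[i]  # keep an empty cluster's centroid unchanged
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # membership has stabilized
            break
        centroids = new_centroids
    return clusters, centroids

# Two well-separated groups settle into two clusters
groups, centres = kmeans([(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)],
                         [(0.0, 0.0), (9.0, 9.0)])
```

In the full LIAgent pipeline, each of the resulting clusters would then be handed to the ID3 component to generate its decision rules.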
The LIAgent is used for the classification and interpretation of the given dataset, and its clusters are further used for visualization with 2D scatter graphs. The Decision tree (ID3) is fast to use, generates understandable rules, and is simple to explain, since any decision it makes can be understood by following the path of that decision. The rules also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The decision rules are obtained in if-then-else form, which can be used for decision support systems, classification and prediction.

A medical dataset, Diabetes, is used in this research paper. This dataset/testbed contains 790 records. The data of the Diabetes dataset is pre-processed, a step called data standardization, and the interval-scaled data is properly cleansed. The attributes of the Diabetes dataset are:

  - Number of times pregnant (NTP)
  - Plasma glucose concentration at 2 hours in an oral glucose tolerance test (PGC)
  - Diastolic blood pressure (mm Hg) (DBP)
  - Triceps skin fold thickness (mm) (TSFT)
  - 2-hour serum insulin (mu U/ml) (2HIS)
  - Body mass index (weight in kg / (height in m)^2) (BMI)
  - Diabetes pedigree function (DPF)
  - Age (min. age = 21, max. age = 81)
  - Class (whether diabetes is cat1 or cat2) [15].

We create four vertical partitions of the Diabetes dataset by selecting the appropriate subsets of attributes, as illustrated in tables 4 to 7. Each partitioned table is a dataset of 790 records; only 3 example records are shown in each table.

Table 4. 1st Vertical Partition of the Diabetes Dataset

  NTP   DPF    Class
  4     0.627  -ive
  2     0.351  +ive
  2     2.288  -ive

Table 5. 2nd Vertical Partition of the Diabetes Dataset

  DBP   AGE   Class
  72    50    -ive
  66    31    +ive
  64    33    -ive

Table 6. 3rd Vertical Partition of the Diabetes Dataset

  TSFT  BMI   Class
  35    33.6  -ive
  29    28.1  +ive
  0     43.1  -ive

Table 7. 4th Vertical Partition of the Diabetes Dataset

  PGC   2HIS  Class
  148   0     -ive
  85    94    +ive
  185   168   -ive
For the LIAgent, the number of clusters k is 4 and the number of iterations n in each case is 50 (i.e. k = 4 and n = 50). The decision rules for each cluster are obtained. For the visualization of the results of these clusters, 2D scatter graphs are also drawn.

4. Results and Discussion

The results of the LIAgent are discussed in this section. The LIAgent produces two outputs, namely the clusters and the decision rules for the given dataset. In total, sixteen clusters are obtained for the four partitions, four clusters per partition. Not all of the clusters are good for classification; only the required and useful clusters are discussed further. Sixteen sets of decision rules are also generated by the LIAgent. We present the decision rules of three different clusters. The number of decision rules varies from cluster to cluster, depending on the number of records in the cluster.

The decision rules of the 4th partition of the Diabetes dataset:

  Rule 1: if PGC = 165 then Class = Cat2 else
  Rule 2: if PGC = 153 then Class = Cat2 else
  Rule 3: if PGC = 157 then Class = Cat2 else
  Rule 4: if PGC = 139 then Class = Cat2 else
  Rule 5: if 2HIS = 545 then Class = Cat2 else
  Rule 6: if 2HIS = 744 then Class = Cat2 else Class = Cat1

There are only six decision rules for the 4th partition of the dataset, so it is easy for anyone to take a decision and interpret the results of this cluster.

The decision rules of the 1st partition of the Diabetes dataset:

  Rule 1: if DPF = 1.32 then Class = Cat1 else
  Rule 2: if DPF = 2.29 then Class = Cat1 else
  Rule 3: if NTP = 2 then Class = Cat2 else
  Rule 4: if DPF = 2.42 then Class = Cat1 else
  Rule 5: if DPF = 2.14 then Class = Cat1 else
  Rule 6: if DPF = 1.39 then Class = Cat1 else
  Rule 7: if DPF = 1.29 then Class = Cat1 else
  Rule 8: if DPF = 1.26 then Class = Cat1 else Class = Cat2

There are eight decision rules for the 1st partition of the dataset.
The interpretation of the cluster is easy through the decision rules, which also help in taking decisions.

The decision rules of the 3rd partition of the Diabetes dataset:

  Rule 1: if BMI = 29.9 then Class = Cat1 else
  Rule 2: if BMI = 32.9 then Class = Cat1 else
  Rule 3: if TSFT = 23 then
      Rule 4: if BMI = 25.5 then Class = Cat1 else
      Rule 5: if BMI = 30.1 then Class = Cat1 else
      Rule 6: if BMI = 28.4 then Class = Cat1 else Class = Cat2
  else
  Rule 7: if BMI = 22.9 then Class = Cat1 else
  Rule 8: if BMI = 27.6 then Class = Cat1 else
  Rule 9: if BMI = 29.7 then Class = Cat1 else
  Rule 10: if BMI = 27.1 then Class = Cat1 else
  Rule 11: if BMI = 25.8 then Class = Cat1 else
  Rule 12: if BMI = 28.9 then Class = Cat1 else
  Rule 13: if BMI = 23.4 then Class = Cat1 else
  Rule 14: if BMI = 30.5 then
      Rule 15: if TSFT = 18 then Class = Cat2 else Class = Cat1
  else
  Rule 16: if BMI = 26.6 then
      Rule 17: if TSFT = 18 then Class = Cat2 else Class = Cat1
  else
  Rule 18: if BMI = 32 then
      Rule 19: if TSFT = 15 then Class = Cat2 else Class = Cat1
  else
  Rule 20: if BMI = 31.6 then Class = Cat2, Cat1 else Class = Cat2

There are twenty decision rules for the 3rd partition of the dataset, more than for the other two clusters discussed.

Visualization is an important tool: it provides a better understanding of the data and illustrates the relationships among its attributes. 2D scatter graphs are drawn for all the clusters; we present four of them, for four clusters from different partitions.

Figure 3. 2D Scatter Graph between the NTP and DPF Attributes of the Diabetes Dataset

The distance between the NTP and DPF attributes of the Diabetes dataset varies at the beginning of the graph, but after some interval the distance becomes constant.

Figure 4. 2D Scatter Graph between the DBP and AGE Attributes of the Diabetes Dataset

There is a variable distance between the DBP and AGE attributes of the dataset, and it remains variable throughout the graph.

Figure 5. 2D Scatter Graph between the TSFT and BMI Attributes of the Diabetes Dataset

The graph shows an almost constant distance between the TSFT and BMI attributes of the dataset, and it remains constant throughout the graph.

Figure 6. 2D Scatter Graph between the PGC and 2HIS Attributes of the Diabetes Dataset

There is a variable distance between the PGC and 2HIS attributes of the dataset, but in the middle of the graph there is a stretch where the distance is constant. The structure of this graph is similar to that of figure 5.

5. Conclusion

It is not simple for every user to interpret and extract the required results from clusters unless other data mining algorithms or tools are used. In this research paper we have addressed this issue by integrating the K-means clustering algorithm with the Decision tree (ID3) algorithm. ID3 was chosen because it outputs decision rules in if-then-else form, which are easy to understand and help in taking decisions. The result is a hybrid combination of supervised and unsupervised machine learning, realized through an intelligent agent called the LIAgent. The LIAgent is helpful in the classification and prediction of the given dataset. Furthermore, 2D scatter graphs of the clusters are drawn for visualization.