Saturday, July 29, 2017

Naive Bayes Algorithm

What is naive bayes?

Data sets that have labels can be classified using the naive Bayes method. It is a simple classification method with no iterative process, so it runs faster than algorithms that require iteration to build a model. The naive Bayes process is based on Bayes' theorem.

Naive Bayes assigns labels to data sets based on Bayes' theorem. The method assumes that each attribute stands alone, that is, it is independent of every other attribute. The process simply computes probabilities, without any iteration, so it stays simple and fast even when the dataset is large.

How does naïve bayes work?
As mentioned above, naive Bayes is based on Bayes' theorem, which calculates the posterior probability P(c|x) from P(c), P(x), and P(x|c), as shown below:

P(c|x) = P(x|c) P(c) / P(x)

where c is the class (label) and x is the attribute value.
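The posterior comparison above can be sketched in a few lines of Python (the downloads on this blog are in MATLAB, but Python is used here for brevity). The weather/label toy data below is hypothetical, purely for illustration; since P(x) is the same for every class, it can be dropped when comparing posteriors:

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset: each row is (attribute value, label).
data = [("sunny", "play"), ("sunny", "play"), ("rainy", "stay"),
        ("rainy", "stay"), ("sunny", "stay"), ("rainy", "play")]

label_counts = Counter(label for _, label in data)
# Count attribute values within each class to estimate P(x | c).
cond_counts = defaultdict(Counter)
for x, c in data:
    cond_counts[c][x] += 1

def posterior(x, c):
    """Unnormalized P(c | x) = P(x | c) * P(c); P(x) cancels across classes."""
    prior = label_counts[c] / len(data)
    likelihood = cond_counts[c][x] / label_counts[c]
    return likelihood * prior

# Pick the class with the highest posterior for the test value "sunny".
scores = {c: posterior("sunny", c) for c in label_counts}
prediction = max(scores, key=scores.get)
print(prediction)
```

Note that there is no training loop at all: the "model" is just the counted frequencies, which is why naive Bayes avoids the iterative fitting other algorithms need.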
Decision Tree Algorithm

What is decision tree?
If you are given data that has multiple attributes and a label, you can use a decision tree to classify unknown data (test data). As the name implies, the algorithm builds a decision tree to serve as a model. With this model, each test record can be assigned one of the labels determined by the decision tree process.

The decision tree model is built from calculations whose input is the training data. The calculations aim to find the important components of the tree: its root, branches, twigs, and leaves. These components represent the attributes of the dataset that affect the classification process.

The decision tree is then turned into rules used to classify new data (test data) fed into it. The rules take the form of "if else" logic, which produces a label for each record entered into the decision tree.
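As a sketch of such "if else" rules, a small trained tree might compile to nested conditions like the Python below; the attribute names and thresholds are invented for illustration, not taken from a real trained tree:

```python
def classify(record):
    """Hypothetical decision-tree rules: each if/else corresponds to a
    node test in the tree, and each return is a leaf label."""
    if record["petal_length"] < 2.5:      # root node test
        return "setosa"
    elif record["petal_width"] < 1.8:     # branch node test
        return "versicolor"
    else:                                 # remaining leaf
        return "virginica"

print(classify({"petal_length": 1.4, "petal_width": 0.2}))
```

Classifying a record is just a walk from the root to a leaf, so prediction is cheap even when building the tree was not.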

How to calculate decision trees algorithm?
The model is built by calculating the impurity value of each attribute against the label in the training dataset. The measures used are entropy, the Gini index, and classification error. An impurity value is considered good when it is very small. Impurity is calculated for each attribute paired with the label; the impurity values are then compared across attributes, and the attribute with the smallest value is used as the root.

What is the formula of impurity?
The decision tree commonly uses entropy to determine the degree of impurity of its attributes, as the formula below shows:

Entropy = -Σ p_i log2(p_i)

where p_i is the proportion of records belonging to class i at a node.
The Gini index formula can also be used:

Gini = 1 - Σ p_i^2
Or the classification error:

Classification error = 1 - max(p_i)
After the impurity value of each attribute is obtained, the values are compared; the attribute with the smallest impurity value is chosen as the root, stem, or twig.
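The three impurity measures above can be sketched directly in Python. Each takes the class proportions p_i at a node; note that a pure node (all records in one class) scores zero under every measure, while an evenly mixed node scores the maximum:

```python
import math

def entropy(probs):
    """Entropy = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)

def classification_error(probs):
    """Classification error = 1 - max(p_i)."""
    return 1.0 - max(probs)

pure = [1.0, 0.0]    # all records in one class
mixed = [0.5, 0.5]   # records split evenly between two classes
print(entropy(pure), entropy(mixed))
print(gini(pure), gini(mixed))
print(classification_error(pure), classification_error(mixed))
```

In a real tree builder these would be evaluated on the label proportions produced by splitting on each candidate attribute, and the attribute giving the lowest impurity would win.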


k nearest neighbor algorithm

What is k nearest neighbor ?

K nearest neighbor, or knn, is an algorithm for classifying labeled data based on the distances between data points. It is a simple algorithm because the calculation involves only two stages: computing the distance from the test record to every data point, then finding the minimum distances to determine the label of the data being tested (test data).

What are data in knn ?

Test data is data prepared for testing the knn algorithm. This data is not involved in training and may be referred to as unknown data: it has never been seen by the algorithm before. The amount of test data is usually smaller than the amount of training data used by the knn algorithm.

The data involved in the knn training process is called training data, and there is usually much more of it than test data. knn treats this data differently from other machine learning algorithms, especially other classification algorithms: the training data is not processed during a training phase. In other words, knn has no training process; the training data is simply stored in a vector space and used directly as knn's vector data.

Why is knn called a lazy algorithm?
The knn algorithm is well known as a lazy algorithm because it uses the training data only at the moment it determines the label of the test data. Other machine learning algorithms, by contrast, use training data in advance to build a model.
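The two stages described above (compute all distances, then vote among the k nearest points) can be sketched in Python; the 2-D points and labels below are hypothetical toy data:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: a point.
    There is no training phase: sort all points by distance to the
    query, then take a majority vote among the k nearest labels."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two labels
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
print(knn_predict(train, (2, 2)))
```

The laziness is visible in the code: all the work happens inside `knn_predict`, at prediction time, and `train` is just stored as-is.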


Friday, July 28, 2017

Kmeans algorithm

What is Kmeans?
If you are given data x1, x2, x3, ..., xn that has no labels and consists only of a set of attributes, then you can group it using kmeans. With kmeans you can group unlabeled data into a number of groups determined by the number k; k denotes the number of data groups to be formed. This process of grouping unlabeled data is called clustering.

Steps of kmeans
The kmeans calculation process is very simple and does not require complicated steps.

Kmeans has steps like below:

1. Set up the unlabeled data set x1, x2, x3, ..., xn; the dataset consists only of numeric attributes.

2. Determine the coefficient k, which denotes the number of data groups to be created.

3. Determine the data centers c1, c2, ..., ck; the number of centers equals the value of k. They can be chosen at random or by other methods.

4. Calculate the distance between each data point and every data center using the Euclidean distance. Once this is done, you have the distance from each point to each centroid.

5. Compare the distances to each centroid. The closest distance determines the cluster of the data point.

6. Collect all points that belong to the same cluster and calculate their mean value. This mean determines the new centroid.

7. Compare the new centroids with the old centroids. If they are the same, the kmeans algorithm has finished; if not, the calculation is repeated: the process returns to step 4 and continues through step 6 to determine new centroids.
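The steps above can be sketched in Python for 2-D points; this is a minimal illustration on hypothetical toy points, not an optimized implementation:

```python
import math
import random

def kmeans(points, k, max_iter=100):
    """Minimal kmeans following the steps above, for tuples of numbers."""
    random.seed(0)
    centroids = random.sample(points, k)          # step 3: random initial centers
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # steps 4-5: assign each point
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        new_centroids = [                         # step 6: mean of each cluster
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:            # step 7: stop when stable
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated toy groups; k = 2 recovers them.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```

Because the data has no labels, nothing here ever reads a class value; the only inputs are the attribute tuples and k.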


If you want to see a sample kmeans calculation, please go to the post Calculation of kmeans to cluster the data set of iris.






Sunday, July 16, 2017

Machine learning algorithm

Machine learning algorithms come in a wide variety and are developing rapidly. They are used for research in various fields such as data mining, image processing, and others.

Here you can download the source code of certain machine learning algorithms, along with ebooks explaining the source code and other supporting files. Free of charge.

Here are the machine learning algorithms you can download:

1. k nearest neighbor (knn)

knn source code in the matlab program: click here

haberman dataset in a text file (*.txt): click here

ebook "SOURCE CODE EXPLANATION K NEAREST NEIGHBOR MATLAB": click here

ebook "k nearest neighbor with python": download here please!


2. Kmeans algorithm
Kmeans in the matlab program; this source code is used to cluster the iris dataset. You can download the source code and the dataset here

In addition, to help you understand the kmeans algorithm you downloaded from my blog, I provide an ebook explaining the program code in detail. To read it, please download it here.


3. Naive Bayes
Naive Bayes in the matlab program; this source code is used to classify the iris dataset. You can download the source code and the dataset here

4. ID3

ID3 in the octave program. You can download the source code here

5. kmedoid soon

6. FCM soon

7. SVM soon

8. C45 soon