Machine Learning Algorithms Data Scientists Should Know

Jul 4, 2018

Machine learning has become increasingly important because the digital transformation of companies is producing massive amounts of data, in many forms and types, at an ever-increasing rate. Advances in computing technology and access to these huge amounts of data are dramatically widening the applicability of machine learning.

Machine learning is of great significance. To get started, here is a short introduction to machine learning, followed by the most popular machine learning algorithms used in data science.

What is Machine Learning?
Machine learning is a subfield of computer science that gives computers the ability to learn without being explicitly programmed. It is concerned with the construction of algorithms that can learn from data and make predictions. Two main types are:
1. Supervised machine learning: the task of inferring a function from 'labeled' training data.
2. Unsupervised machine learning: the task of describing hidden structure in 'unlabeled' training data.
Popular machine learning algorithms used in data science are as follows:

Decision Tree: The decision tree is one of the most popular algorithms. It uses the features of the data to choose among a set of possible decisions. In the simplest case it acts as a binary classifier that chooses between two.
Decision trees are used for classification and prediction. They use a tree-like graphical representation in which the branches signify the outcomes of tests and each leaf signifies a particular class label, often a Yes or a No. Decision trees are useful when a data scientist wants to evaluate alternative courses of action. They give businesses a structured way to make favorable decisions by assessing the likely probabilities of outcomes. They are also used for segmentation.

1. A decision tree is a graphical representation of various decisions and their possible consequences, including chance events, resource costs and utility.
2. It is a decision support tool that uses a tree-like graph or model of decisions.
3. There are three parts of a decision tree:
  • Internal Node: Represents a test on an attribute.
  • Branch: Represents the outcome of the test.
  • Leaf Node: Represents a particular class label, that is, the decision made after the attributes along the path have been tested.
4. The classification rules are represented by the paths from the root to the leaf nodes.
5. Apart from machine learning, decision trees are used in operations research, especially decision analysis.
6. It is one of the predictive modeling approaches.
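
As a rough illustration of the three parts listed above, the sketch below stores a tiny tree as nested dictionaries and follows the path from the root to a leaf. The attributes ("outlook", "humidity", "windy") and the tree itself are made up for this example. The tree is written by hand, not learned from data.

```python
# Hypothetical toy tree, written by hand for illustration only.
# Internal nodes test an attribute, branches are the test outcomes,
# and leaves (plain strings) hold the class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "No", "normal": "Yes"},
        },
        "overcast": "Yes",
        "rainy": {
            "attribute": "windy",
            "branches": {True: "No", False: "Yes"},
        },
    },
}


def classify(node, sample, path=None):
    """Follow the path from the root to a leaf; return (label, tests taken)."""
    path = [] if path is None else path
    if isinstance(node, str):  # reached a leaf node
        return node, path
    value = sample[node["attribute"]]  # run the test at this internal node
    path.append(f"{node['attribute']}={value}")
    return classify(node["branches"][value], sample, path)  # follow the branch


label, rule = classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": False})
print(label, "via", " AND ".join(rule))  # Yes via outlook=sunny AND humidity=normal
```

The list of tests collected along the way is the classification rule for that leaf.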

Linear Regression: 
Regression analysis studies the dependence of one variable on one or more other variables. The dependent variable is the variable whose value we want to predict. The independent variables are those that are expected to influence the dependent variable. Linear regression models the dependent variable as a linear function of the independent variables.
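
For a single independent variable, the idea can be sketched with ordinary least squares, one common way to fit a linear regression. The function name `fit_line` and the ad-spend figures are invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both up to a factor of n).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: ad spend is the independent variable, sales the dependent one.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(ad_spend, sales)
predicted = intercept + slope * 6.0  # predict sales for an unseen ad spend
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend; prediction at 6.0: {predicted:.2f}")
```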

Logistic Regression: 
Logistic regression is also one of the popular algorithms in machine learning. It is simple and linear in nature. It is primarily used for modeling a binomial target variable, that is, classifying data into two groups, and can be extended to more groups with multinomial logistic regression. It is popular in credit scoring and marketing campaign analytics.
  • It is also known as the logit model or logit regression.
  • In logistic regression, the dependent variable is categorical.
  • It models the probability of the default class.
  • The regression coefficients are estimated using the maximum likelihood method.
  • A binary logit model is one where the dependent variable can only take the value 0 or 1.
  • A multinomial logit model is one where the dependent variable can take three or more possible values that are not ordered.
  • An ordinal logit model is one where the dependent variable can take three or more possible values that are ordered.
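
A minimal sketch of a binary logit model with one feature is shown below. It estimates the coefficients by plain gradient ascent on the log-likelihood, which is one simple way to approximate the maximum likelihood solution. Statistical packages typically use faster solvers. The function names, learning rate, step count and the "hours studied" data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(xs, ys, lr=0.1, steps=5000):
    """Estimate (b0, b1) by gradient ascent on the average log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            error = y - sigmoid(b0 + b1 * x)  # observed minus predicted probability
            g0 += error
            g1 += error * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Made-up data: hours studied and whether the exam was passed (1) or not (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logit(hours, passed)
for h in (2, 7):
    print(h, "hours -> P(pass) =", round(sigmoid(b0 + b1 * h), 2))
```

The two classes overlap in this data on purpose: with perfectly separable classes the maximum likelihood estimate does not exist as a finite value.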

Naive Bayesian Classification:
Naive Bayesian classification uses Bayes' theorem of probability to classify data. It makes a strong independence assumption between features: whatever features you use, it assumes that, given the class, there is no dependency between them. It is popular for text mining and text classification, such as email spam filtering.
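
To make the independence assumption concrete, here is a small sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. The training messages, labels and function names are invented for this example. Each word contributes its own log-probability to the score, independently of the other words.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify_text(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)  # features treated as independent
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training messages.
docs = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("project status meeting", "ham"),
]
model = train(docs)
print(classify_text("free money", *model))
print(classify_text("status meeting", *model))
```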

K-means Clustering:
K-means clustering aims to partition 'n' observations into 'k' clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. It is a popular unsupervised machine learning algorithm for cluster analysis. It is used by search engines such as Yahoo and Google to cluster web pages by similarity. It also helps retailers segment customers on the basis of their spending patterns and understand their price sensitivity.
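
The partitioning described above is commonly computed with Lloyd's algorithm, which alternates between assigning each observation to its nearest mean and recomputing the means. The sketch below assumes two-dimensional points and, for simplicity, seeds the centroids with the first k points. Real implementations usually pick the starting centroids more carefully. The customer figures are made up.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns (centroids, cluster label of each point)."""
    # Assumption for this sketch: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster, so we are done
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Made-up customers: (monthly spend, store visits), in arbitrary units.
customers = [(1, 1), (8, 8), (1.5, 2), (9, 9.5), (1, 0.5), (8.5, 9)]
centroids, labels = kmeans(customers, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```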

Support Vector Machine:
A Support Vector Machine is an algorithm that uses a hyperplane to separate two classes, choosing the hyperplane that leaves the widest margin between them. Problems with more than two classes are handled by combining several such hyperplanes. With kernel functions, the separating boundary no longer has to be a straight line in the original feature space, so the model can account for non-linearity in the data that simple linear classifiers such as logistic regression ignore. Linear SVMs also scale well to large datasets, although kernel SVMs become expensive to train as the number of samples grows.
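
The sketch below shows only the linear, two-class case: it looks for a separating hyperplane w·x + b = 0 by subgradient descent on the regularized hinge loss. This is a simple teaching approach and not how dedicated SVM solvers work, and it does not cover kernels. The function names, hyperparameters and the centered toy points are assumptions made for this example.

```python
def decision(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def svm_predict(w, b, x):
    return 1 if decision(w, b, x) >= 0 else -1


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """Subgradient descent on the L2-regularized hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1:
                # Point is inside the margin or misclassified: hinge loss is active.
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely outside the margin: only the regularizer acts.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


# Made-up, already centered 2-D points with labels -1 and +1.
X = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X, y)
print([svm_predict(w, b, x) for x in X])
print(svm_predict(w, b, (3, 3)), svm_predict(w, b, (-3, -3)))
```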

Apriori Algorithm: 
  • It is an unsupervised machine learning algorithm.
  • It generates association rules from a given data set.
Association Rules: If an event A occurs, then event B also occurs with a certain probability.
These are some of the popular machine learning algorithms used in data science.
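
A compact sketch of the idea follows. It finds frequent itemsets level by level, pruning any candidate that has an infrequent subset, and then measures the "certain probability" of a rule A -> B as its confidence, support(A and B) / support(A). The shopping baskets, the 0.6 support threshold and the function names are invented for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset itemset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the minimum support.
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Made-up shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "milk"},
]
frequent = apriori(baskets, min_support=0.6)
print(sorted(sorted(s) for s in frequent))
print("butter -> bread:", round(confidence(frequent, {"butter"}, {"bread"}), 2))
```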


Source: HOB