
### A Complete Machine Learning Guide in Python: Part 2

- Linear Regression
- K-Nearest Neighbors Regression
- Random Forest Regression
- Gradient Boosted Regression
- Support Vector Machine Regression

```python
import pandas as pd
import numpy as np

# Read in data into dataframes
train_features = pd.read_csv('data/training_features.csv')
test_features = pd.read_csv('data/testing_features.csv')
train_labels = pd.read_csv('data/training_labels.csv')
test_labels = pd.read_csv('data/testing_labels.csv')
```

```
Training Feature Size: (6622, 64)
Testing Feature Size:  (2839, 64)
Training Labels Size:  (6622, 1)
Testing Labels Size:   (2839, 1)
```

```python
from sklearn.preprocessing import Imputer  # in newer scikit-learn: from sklearn.impute import SimpleImputer

# Create an imputer object with a median filling strategy
imputer = Imputer(strategy='median')

# Train on the training features
imputer.fit(train_features)

# Transform both training data and testing data
X = imputer.transform(train_features)
X_test = imputer.transform(test_features)
```

```
Missing values in training features: 0
Missing values in testing features:  0
```

```python
from sklearn.preprocessing import MinMaxScaler

# Create the scaler object with a range of 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
scaler.fit(X)

# Transform both the training and testing data
X = scaler.transform(X)
X_test = scaler.transform(X_test)
```
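
With the features imputed and scaled, the five models listed above can be trained and compared. The article's comparison code isn't shown here, so this is a minimal sketch: the synthetic dataset, the `fit_and_evaluate` helper, and the model settings are my own illustrative choices standing in for the article's `X`, `X_test`, and labels.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the article's imputed, scaled features
X_all, y_all = make_regression(n_samples=500, n_features=10, noise=10, random_state=42)
X, X_test, y, y_test = train_test_split(X_all, y_all, test_size=0.3, random_state=42)

def fit_and_evaluate(model):
    """Train a model and return its mean absolute error on the test set."""
    model.fit(X, y)
    return mean_absolute_error(y_test, model.predict(X_test))

results = {
    'Linear Regression': fit_and_evaluate(LinearRegression()),
    'K-Nearest Neighbors': fit_and_evaluate(KNeighborsRegressor(n_neighbors=10)),
    'Random Forest': fit_and_evaluate(RandomForestRegressor(random_state=42)),
    'Gradient Boosting': fit_and_evaluate(GradientBoostingRegressor(random_state=42)),
    'Support Vector Machine': fit_and_evaluate(SVR(C=1000, gamma=0.1)),
}

for name, mae in results.items():
    print(f'{name:25s} MAE = {mae:.4f}')
```

Using a single helper keeps the comparison fair: every model sees exactly the same training and test split.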

- Model hyperparameters are best thought of as settings for a machine learning algorithm that the data scientist sets before training. Examples include the number of trees in a random forest or the number of neighbors in the K-nearest neighbors algorithm.
- Model parameters are what the model learns during training, such as the weights in a linear regression.
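
A small sketch of the distinction (the dataset and model choices here are my own illustrations):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=5, random_state=0)

# Hyperparameter: chosen by the data scientist BEFORE training
forest = RandomForestRegressor(n_estimators=50)  # number of trees is a setting, not learned

# Parameters: learned FROM the data during training
linreg = LinearRegression().fit(X, y)
print(linreg.coef_)       # learned weights, one per feature
print(linreg.intercept_)  # learned bias term
```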

- Random Search refers to the technique we will use to select hyperparameters: we define a grid and then randomly sample combinations from it, rather than exhaustively trying every single combination as in grid search. (Surprisingly, random search performs nearly as well as grid search with a drastic reduction in runtime.)
- Cross-Validation is the technique we use to evaluate a selected combination of hyperparameters. Rather than splitting the training set into separate training and validation sets, which reduces the amount of training data we can use, we use K-Fold Cross-Validation. This involves dividing the training data into K folds, then iteratively training on K-1 of the folds and evaluating performance on the held-out Kth fold. We repeat this process K times, and at the end we take the average error across the K iterations as the final performance measure.
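
The K-fold procedure just described can be written out by hand in a few lines. This sketch uses synthetic data and a linear model of my choosing purely to make the mechanics concrete:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=1)

k = 5
fold_errors = []
for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=1).split(X):
    # Train on K-1 folds, evaluate on the held-out Kth fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

# The average error across the K folds is the performance estimate
cv_mae = np.mean(fold_errors)
print(f'{k}-fold CV MAE: {cv_mae:.4f}')
```

In practice scikit-learn's `cross_val_score` wraps this loop, but seeing it unrolled makes clear that every example is used for validation exactly once.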

- Set up a grid of hyperparameters to evaluate
- Randomly sample a combination of hyperparameters
- Create a model with the selected combination
- Evaluate the model using K-fold cross-validation
- Decide which hyperparameters worked the best

- loss: the loss function to minimize
- n_estimators: the number of weak learners (decision trees) to use
- max_depth: the maximum depth of each decision tree
- min_samples_leaf: the minimum number of examples required at a leaf node of the decision tree
- min_samples_split: the minimum number of examples required to split a node of the decision tree
- max_features: the maximum number of features to use for splitting nodes

```python
# Find the best combination of settings
random_cv.best_estimator_
```

```
GradientBoostingRegressor(loss='lad', max_depth=5,
                          max_features=None,
                          min_samples_leaf=6,
                          min_samples_split=6,
                          n_estimators=500)
```

```python
# Make predictions on the test set using the default and final models
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)
```

```
Default model performance on the test set: MAE = 10.0118.
Final model performance on the test set:   MAE = 9.0446.
```
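
MAE figures like those above are presumably computed with scikit-learn's `mean_absolute_error`; a tiny self-contained illustration with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical labels and predictions standing in for the article's arrays
y_test = np.array([10.0, 20.0, 30.0])
default_pred = np.array([12.0, 18.0, 33.0])

# Mean of the absolute errors: (2 + 2 + 3) / 3
mae = mean_absolute_error(y_test, default_pred)
print(f'MAE = {mae:.4f}')  # MAE = 2.3333
```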

- Imputation of missing values and scaling of features
- Evaluating and comparing several machine learning models
- Hyperparameter tuning using a random grid search and cross-validation
- Evaluating the best model on the test set