
### A "Data Science for Good" Machine Learning Project Walk-Through in Python: Part Two

**Getting the most from our model, figuring out what it all means, and experimenting with new techniques**

**Model Optimization**

- Manual Tuning: select hyperparameters with intuition/experience or by guessing, train the model with those values, find the validation score, and repeat until you run out of patience or are satisfied with the results.
- Grid Search: set up a hyperparameter grid and, for every single combination of values, train a model and find the validation score. The optimal set of hyperparameters is the one that scores the highest.
- Random Search: set up a hyperparameter grid and select random combinations of values, train the model, and find the validation score. Search iterations are limited based on available time/resources.
- Automated Tuning: use methods (gradient descent, Bayesian optimization, evolutionary algorithms) to run a guided search for the best hyperparameters. These are informed methods that use past results.
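As a concrete illustration of random search, here is a minimal sketch using scikit-learn's `RandomizedSearchCV`. The dataset, estimator, and grid values below are purely illustrative stand-ins, not the walk-through's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the project's training set
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Hyperparameter grid to sample combinations from (values for illustration)
param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 2, 4],
}

# Random search: evaluate a fixed budget of random combinations,
# scoring each by cross-validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=10,           # iterations limited by time/resources
    cv=3,
    scoring='f1_macro',
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

Swapping `RandomizedSearchCV` for `GridSearchCV` (and dropping `n_iter`) turns the same sketch into an exhaustive grid search.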

Any informed search method has four parts:

- Objective function: what we want to maximize (or minimize)
- Domain space: region over which to search
- Algorithm for choosing the next hyperparameters: uses the past results to suggest the next values to evaluate
- Results history: saves the past results
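Those four parts can be sketched with a toy guided search in plain Python. This is not a real Bayesian optimizer (libraries such as Hyperopt implement those); it only illustrates how a results history informs the choice of the next value. The objective and domain here are made up:

```python
import random

# 1. Objective function: what we want to minimize (a stand-in for a
#    validation loss; (lr - 0.1)**2 is purely illustrative)
def objective(lr):
    return (lr - 0.1) ** 2

# 2. Domain space: region over which to search
low, high = 0.0, 1.0

# 4. Results history: saves the past (value, score) pairs
history = []

random.seed(0)
for _ in range(30):
    # 3. Algorithm for choosing the next hyperparameters: sample near the
    #    best value seen so far (a crude informed search using past results)
    if history:
        best_lr, _ = min(history, key=lambda r: r[1])
        lr = min(high, max(low, best_lr + random.gauss(0, 0.1)))
    else:
        lr = random.uniform(low, high)
    history.append((lr, objective(lr)))

best_lr, best_score = min(history, key=lambda r: r[1])
print(best_lr, best_score)
```

A real Bayesian optimizer replaces step 3 with a probabilistic model of the objective, but the four components stay the same.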

In the real world, above a certain threshold that depends on the application, accuracy becomes secondary to explainability, and you're better off with a slightly less performant model if it is simpler.

- Recursive Feature Elimination for feature selection
- Uniform Manifold Approximation and Projection for dimension reduction and visualization

```python
from sklearn.metrics import f1_score, make_scorer
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

# Custom scorer for cross validation
scorer = make_scorer(f1_score, greater_is_better=True, average='macro')

# Create a model for feature selection
estimator = RandomForestClassifier(n_estimators=100, n_jobs=-1)

# Create the object
selector = RFECV(estimator, step=1, cv=3, scoring=scorer, n_jobs=-1)

# Fit on training data
selector.fit(train, train_labels)

# Transform data
train_selected = selector.transform(train)
test_selected = selector.transform(test)
```

- Matrix decomposition algorithms: PCA and ICA
- Embedding techniques that map data onto low-dimensional manifolds: IsoMap and t-SNE

```python
from umap import UMAP

n_components = 3

# Use default parameters
umap = UMAP(n_components=n_components)

# Fit and transform
train_reduced = umap.fit_transform(train)
test_reduced = umap.transform(test)
```
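For comparison with the manifold approach, a matrix decomposition method like PCA follows the same fit/transform pattern. This sketch uses a stock scikit-learn dataset as a stand-in for the project's feature matrix:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Illustrative data standing in for the project's features
X, _ = load_digits(return_X_y=True)

# PCA: linear decomposition onto the top 3 principal components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (n_samples, 3)
print(pca.explained_variance_ratio_.sum())    # fraction of variance kept
```

Unlike t-SNE or UMAP, PCA is linear, so the reduced features remain a direct (and interpretable) combination of the originals.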

- Understand the problem and data
- Perform data cleaning alongside exploratory data analysis
- Engineer relevant features automatically and manually
- Compare machine learning models
- Optimize the best performing model
- Interpret the model results and explore how it makes predictions
- Finally, if after all that you still haven't got your fill of data science, you can move on to exploratory techniques and learn something new!