I Wanted to Learn Machine Learning and Data Science, But Where to Start?

By Kimberly Cook | Sep 23, 2018

Advice for young professionals in non-CS fields who want to learn and contribute to data science/machine learning. Curated from personal experience.

The motivation
Bill Gates proclaimed at a recent graduation ceremony that artificial intelligence (AI), energy, and bioscience are the three most exciting and rewarding career choices open to today's young college graduates.

I couldn't agree more.

I have come to believe strongly that some of the most important questions of our generation - related to sustainability, energy generation and distribution, transportation, access to the basic amenities of life, etc. - depend on how intelligently we can mix the first two branches of knowledge Mr. Gates mentions.

In other words, the world of physical electronics (of which the semiconductor industry is a central part) must do more to fully embrace the fruits of information technology and new developments in AI and data science.
I wanted to learn, but where to start?
I am a semiconductor professional with 8+ years of post-PhD experience in a top technology company. I take pride in the fact that I work in the cross-section of physical electronics which directly contributes to the energy sector. I develop power semiconductor devices. They are built to carry the electrical power efficiently and reliably and they power everything from the tiny sensor inside your smartphone to the large industrial motor drives which process food or cloth for everyday consumption.

Therefore, naturally, I want to learn and apply the techniques of modern data science and machine learning to improve the design, reliability, and operation of such devices and systems.

But I am no computer science graduate. I could not tell a linked list from a heap. Support vector machines sounded like (a few months back) some special equipment for people with disabilities. And the only keyword of AI I remembered (from my junior year elective course) was 'first order predicate calculus', a remnant of the so-called 'old AI' or knowledge-engineering approach as opposed to the newer machine learning based approach.

I had to start somewhere to learn the basics and then study my way deeper. The obvious choice was MOOCs (Massive Open Online Courses). I am still very much in the learning phase, but I believe I have at least gathered some good experience in choosing the right MOOCs for this path. In this article, I want to share my insights on that aspect.
Know your 'Chi' and your 'Enemy'
Sorry for the bad analogy :-) It's from Netflix's latest superhero ensemble saga - The Defenders.

But it's true that you should know your strengths, weaknesses, and technical inclination very well before you start the learning-through-MOOC process.
Because let's face it, time and energy are limited, and you cannot afford to waste your precious resources on something you are highly unlikely to practice in your current work or future job. And this is assuming that you want to take the (almost) free learning path, i.e. auditing the MOOCs rather than paying for the certificates. I say 'almost' because, at the end of this article, I would like to list a few MOOCs whose certificates I think are worth paying for and showcasing. And, on my personal journey, I had to pay for the few Udemy courses I took because they are never free, but you can buy them for the cost of a good lunch sandwich when a promotion runs.

In this picture, I just want to show the possibilities and impossibilities of this process, i.e. what you can hope to learn through self-study and practice, what must be learned on the job, and what kind of mentality must be cultivated no matter what your profession is. Having said that, these circles broadly encompass the core skills that one can study to venture into the field of data science/machine learning from a non-CS background. Please note that even if you are in the information technology (IT) sector, you may have a steep learning curve ahead, because traditional IT is being disrupted by these new fields and the core skills and good practices are often different.

I, for one, view the data science field as more democratic than many other professional domains (e.g. my own area of work, semiconductor technology): the entry barrier is low and, with sufficient hard work and zeal, anybody can acquire meaningful skills. Personally, I have no burning desire to 'break into' this field; rather, I just have a passion to borrow its fruits and apply them to my own area of expertise. However, that end goal does not change the initial learning curve that one has to traverse. So, you could be aiming to be a data engineer, a business analyst, a machine learning scientist, or a visualization expert - the field and choices are wide open. And if your aim is like mine - stay in your current domain of expertise and apply the newly learned techniques - you are fine too.

You can start with real basics, no shame there :)
I started with the real basics - learning Python on Codecademy. In all likelihood, you cannot go more basic than this :-). It worked though. I had an aversion to coding, but the simple and fun interface and the right pace of Codecademy's free course excited me enough to keep going. I could have picked a Java or C++ course on Coursera, Datacamp, or Udacity, but some reading and research told me that Python is the optimal choice, balancing learning complexity and utility (especially for data science), and I decided to trust that insight.

After a while, you crave deeper knowledge (but at a gentle pace)
Codecademy's introduction was a fine base to start with. I had choices from so many online MOOC platforms and, predictably enough, I signed up for multiple courses at the same time. However, after dabbling with a Coursera class for a few days, I realized I was not ready to learn Python from a professor! I was looking for a course taught by an enthusiastic instructor who would take the time to go over the concepts in great detail, teach me other essential tools like Git and the Jupyter notebook system, and maintain the right balance between basic concepts and advanced topics in the curriculum. And I found the right man for the job: Jose Marcial Portilla. He offers multiple courses on Udemy and is one of the most popular and positively reviewed instructors on that platform. I signed up for and completed his Python Bootcamp course. It was an amazing introduction to the language, with the right pace, depth, and rigor. I recommend this course highly for new learners even though you have to fork out $10 (Udemy courses are generally not free and their regular price is $190 or $200, but you can always wait a few days for the recurring promotion cycle and sign up for $10 or $15).

It's important to keep your focus on data science
The next step proved crucial for me. I could have gone astray and tried to study anything and everything I could about Python, especially the object-oriented and class-definition part, which can easily suck you in for a long and arduous journey. Now, taking nothing away from that key sphere of the Python universe, one can safely say that you can practice deep learning and good data science without being able to define your own classes and methods in Python. One of the fundamental reasons for Python's ever-increasing popularity as the de facto language of choice for data science is the availability of a large number of high-quality, peer-reviewed, expert-written libraries, classes, and methods, just waiting to be downloaded in a nicely packaged form and unwrapped for seamless integration into your code.

Therefore, it was important for me to quickly jump into the packages and methods used most widely for data science - NumPy, Pandas, and Matplotlib.
I was introduced to those by a neat little course from edX. Although most courses on edX are from universities and are rigorous (and longish) in nature, there are a few short, more hands-on and less theoretical courses offered by technology companies like Microsoft. One such offering is the Microsoft Professional Program in Data Science. You can register for as many courses under this program as you want. However, I took only the following courses (and I intend to come back for other courses in the future):

  • Data Science Orientation: Discusses the everyday life of a typical data scientist and touches upon the core skills one is expected to have in this role along with the basic introduction to the constituting subjects.
  • Introduction to Python for Data Science: Teaches the basics of Python - data structures, loops, functions, and then introduces NumPy, Matplotlib, and Pandas.
  • Introduction to Data Analysis using Excel: Teaches basic and a few advanced data analysis functions, plotting, and tools with Excel (e.g. pivot table, power pivot, and solver plug-in).
  • Introduction to R for Data Science: Introduces R syntax, data types, vector and matrix operations, factors, functions, data frames, and graphics with ggplot2.

Although these courses present the material in a rudimentary fashion and cover only the most basic examples, they were enough to spark the plug! Boy, I was hooked!

I switched to learning R in detail - for some time
The last course made me realize a few important things: (a) statistics and linear algebra are at the core of the data science process, (b) I either did not know or had forgotten much of that material, and (c) R is naturally suited to the kind of work I want to do with my datasets - a few MB of data generated by controlled wafer fab experiments or TCAD simulations, primed for basic inferential analysis.

This prompted me to search for a solid introductory course in the R language, and who better to turn to than Jose Portilla again! I signed up for his "Data Science and Machine Learning Bootcamp with R" class. This was a 'buy one get another free' deal, as the course covered the essentials of the R language in the first half and switched to teaching basic machine learning concepts in the second (all the important concepts expected in an introductory course were covered with sufficient care). Unlike the edX Microsoft course, which used a server-based hands-on lab environment, this course covered the installation and setup of RStudio and the necessary packages, introduced me to Kaggle, and gave me the push required to graduate from being a passive learner (aka MOOC video watcher) to a person who is not afraid of playing with data. It also followed the great "An Introduction to Statistical Learning: with Applications in R" (ISLR) book by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, chapter by chapter.

If you are allowed to read only one book in your lifetime to learn machine learning and nothing else, pick this book and read all the chapters, no exception. By the way, there is no neural network or deep learning material in this book, so there's that...
Armed with the course materials, the ISLR book, and practice on random datasets downloaded from Kaggle or even my own electricity usage data from PG&E, I was no longer afraid of writing small pieces of code that can actually model something interesting or useful. I analyzed some US county-level crime data, explored why a large design-of-experiment can lead to spurious correlations, and even examined my apartment's electricity usage over the past 3 months. I also successfully used R to build predictive models based on some real-world data sets from my work. The statistical/functional nature of the language and its ready-made estimates of statistical significance (p-values, z-scores, and confidence intervals) for a variety of models (regression or classification) really help a new learner gain an easy foothold in the domain of statistical modeling.

Do as much side-study of mathematical basics as possible
This aspect of learning cannot be over-emphasized - especially for non-CS graduates and IT engineers who have not been in touch with rigorous mathematics for some years of their professional lives. I even wrote a Medium article on what mathematics knowledge is necessary for machine learning and data science.

For this, I chose a few courses from Coursera and edX. A few of them stand out for their depth and rigor. They are:

  • Statistical Thinking for Data Science and Analytics (Columbia Univ.): Foundational statistics course from Columbia University, part of their Data Science Executive certificate program on edX. Rigorous, but it drills down into the concepts very well in a structured manner.
  • Computational Probability and Inference (MIT): This is a hard one from MIT, beware! It covers advanced topics like Bayesian models and graphical models in unparalleled depth.
  • Statistics with R Specialization (Duke Univ.): This is a 5-course specialization from Duke University (the last one is a capstone project, which you can ignore) to strengthen your statistics foundation along with hands-on programming exercises. Recommended for its balanced difficulty level and rigor.
  • LAFF: Linear Algebra - Foundations to Frontiers (UT Austin): This is an amazing course on the foundations of linear algebra (along with a deep discussion of high-performance computing of linear algebra routines) that you must give a try. Offered by the University of Texas at Austin on the edX platform. Trust me when I say that, after taking this course, you will never want to invert a matrix to solve a linear system of equations, however tempting and easy to understand that may be. Instead, you will look for a QR factorization or Cholesky decomposition to reduce the computational cost.
  • Optimization Methods in Business Analytics (MIT): This is a course on optimization/operations research methods for business analytics from MIT. I signed up because this was the only highly rated course on a good platform (edX) that I could find on linear and dynamic programming techniques. I believed that learning about those techniques could be immensely helpful, as an optimization problem turns up in almost every machine learning algorithm.
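
To make the LAFF remark about avoiding matrix inversion a little more concrete, here is a minimal sketch in plain Python of solving a small linear system through a Cholesky factorization followed by two triangular solves. It is an illustration only, not material from the course: the function names `cholesky` and `solve_spd` and the 2x2 example system are my own assumptions, and the approach applies only when the matrix is symmetric positive definite.

```python
import math

def cholesky(a):
    """Return lower-triangular L such that L * L^T equals a.

    Assumes a is a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def solve_spd(a, b):
    """Solve a x = b without forming the inverse of a."""
    n = len(a)
    L = cholesky(a)
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# Hypothetical example system: 4x + 2y = 10, 2x + 3y = 8
A = [[4.0, 2.0], [2.0, 3.0]]
b = [10.0, 8.0]
x = solve_spd(A, b)
print(x)
```

In day-to-day work you would normally call a library routine such as `numpy.linalg.solve` rather than hand-rolling this; the sketch is only meant to show that a factorization plus two cheap triangular solves replaces the explicit inverse.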

Please note that I did not search and sign up for any calculus course as I was comfortable with the level of knowledge I could remember (from college days) and what I expected to be useful for any machine learning or data science study and practice. If you are rusty in that area, please search for a good one.

Machine Learning - various personalities make it a colorful affair
Somewhere among all these side-studies, I managed to complete the course that is considered one of the pioneers of all MOOCs - Andrew Ng's machine learning course on Coursera. I guess there are plenty of articles written about it already, and therefore I will not waste any more of your time describing this course. Just take it, do all the homework and programming assignments, learn to think in terms of vectorized code for all the major machine learning algorithms that you know of, and save the notes as a ready reference for your future work.

Oh, by the way, if you want to brush up on MATLAB or learn it from scratch (you will need to write MATLAB code for this course, not R or Python), then you can check out this course: Introduction to Programming with MATLAB.

Now, I want to talk about personalities.

I took multiple machine learning courses and the aspect I enjoyed most was realizing how the treatment of the same fundamental subject becomes a function of the personality and worldview of different instructors :) This was a fascinating experience.
I am listing the various machine learning MOOCs I signed up for and covered...

  • Machine Learning (Stanford Univ.): Andrew Ng's widely known course. Talked about it in the paragraph above.
  • Machine Learning Specialization (Univ. of Washington): This comes with a different flavor than Ng's. Emily Fox and Carlos Guestrin present the concepts from a statistician's and a practitioner's perspective respectively. I could not install the Python package that Carlos' company offers as a free license but this specialization is worth completing for its theory lectures alone. The proofs and discussion of some of the fundamental concepts like bias-variance trade-off, cost computation, and comparison of analytic vs. numerical approaches for cost function minimization, are more intuitively and carefully presented than even Prof. Ng's course (and that's saying something given the superb quality of Prof. Ng's teaching).
  • Machine Learning for Data Science and Analytics (Columbia Univ.): This course had a slightly unusual syllabus for a general machine learning course, devoting the full first half to lectures on conventional algorithms. It covered essential sorting, searching, graph traversal, and scheduling algorithms. There is not much one-to-one discussion of exactly how these algorithms are used in machine learning problems, but studying them gives you an idea of the traditional computer science knowledge necessary to appreciate how large-scale data science problems are tackled. Think O(n^3) whenever you are about to multiply two matrices, or think O(n log n) whenever you are sorting a list. You may not use this knowledge directly in your day-to-day job, but knowing about these nuts and bolts of the computation process certainly broadens your view of the problem at hand.
  • Data Science: Data to Insights (MIT xPro 6 weeks online course): This one is among the very few paid courses I have taken (I generally go the audit route for MOOCs). It is not available on the public edX website, although it uses the edX platform for delivering content. The 6-week course is well-structured and full of interesting content which opens up the wide world of data science and machine learning to the uninitiated. The case studies are very interesting but reasonably hard and time-consuming to code up. The lectures are very engaging and illustrate those case studies well. My particular favorite module was the one about recommendation systems. I literally started viewing the Netflix screen on my laptop in terms of an adjacency matrix after taking this class!
  • Neural Networks for Machine Learning (Univ. of Toronto): This is a somewhat underrated course on Coursera, even with the neural network pioneer Geoffrey Hinton as the instructor. I realize that Andrew Ng's new Deep Learning specialization will directly compete with this course, and I would not be surprised if Coursera removes it in the near future. However, while it is there, a deep learning enthusiast should sit through this one, even if just to gauge the pattern of the historical development of deep networks.
  • Deep Learning Specialization (deeplearning.ai): This is the newest kid on the block, but it stands on the very broad shoulders of Andrew Ng and therefore boasts very strong legs :) I have finished the 2nd course and am on to the 3rd now. The jury is still out, but you should definitely consider completing this series if you want to brush up on the latest trends in deep learning. Even if the programming assignments look hard and you want to stay out of programming a deep network by hand (you can argue there are always excellent open-source packages like TensorFlow, Keras, and Theano out there to take care of the nuts and bolts under the hood), it is imperative to have a deep understanding of essential concepts such as regularization, exploding gradients, hyperparameter tuning, batch normalization, etc. to use those high-level deep learning frameworks effectively.
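
The "think O(n^3)" advice in the Columbia course entry above is easy to check for yourself. The following is a minimal sketch, assuming the plain schoolbook algorithm (not the optimized routines that numerical libraries actually use): it multiplies two square matrices with three nested loops and counts the scalar multiplications. The helper name `matmul_count` and the matrix sizes are my own choices for illustration.

```python
def matmul_count(a, b):
    """Multiply matrices a and b with the schoolbook triple loop.

    Returns the product and the number of scalar multiplications performed.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    mults = 0
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
                mults += 1
    return c, mults

def identity(n):
    """Build an n-by-n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

# Doubling n should multiply the work by roughly 2**3 = 8.
counts = {}
for n in (10, 20, 40):
    _, counts[n] = matmul_count(identity(n), identity(n))
    print(n, counts[n])
```

For two n-by-n matrices the counter comes out to exactly n * n * n, so every doubling of the matrix size costs eight times as many multiplications.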

Two umbrella data science MOOCs with R and Python
As we draw closer to the end of this long article, I want to list two multi-course MOOCs I found interesting and useful to go along with the specific subject areas mentioned above.

  • Data Science Specialization (Johns Hopkins Univ.): This one is a well-known 10-course specialization offered on Coursera. Not every course will appeal to every learner. I personally completed only 5 of the 10. The key thing is the timing, i.e. when to start this specialization. It often comes up at the top of the Google results when one researches MOOCs for data science, and therefore it becomes the first MOOC for many new learners. Personally, I would have had a problem getting the full value from this specialization if I had done that. Taking the introductory Microsoft and Udemy courses on R and a few statistics and linear algebra courses beforehand helped me immensely to extract the full benefit from this set of courses. As the specialization is taught by professors from the biostatistics department of JHU, one gets an excellent treatment of two aspects of data science which are often under-represented in many curricula: research study design and the design of experiments.
  • Data Science MicroMasters certificate program (UC San Diego): I have just enrolled in and started the 1st of the 4 courses in this series/certificate program. I like the fact that it is similar in breadth and goals to the Johns Hopkins specialization, except that it chooses Python as the working language for the hands-on portion. The structure and content seem well thought out, covering the basics of Python, Git, and Jupyter all the way up to big data processing with the Apache Spark framework (with statistics and machine learning courses in the middle). The case studies and hands-on examples are drawn from real-world applications of data science such as wildfire modeling, cholera outbreaks, or world development indicator analysis. One of the lead instructors is Ilkay Altintas, who has created an amazing platform for wildfire dynamics prediction and is putting the fruits of data science research to work for societal good. I am sure my journey with this program will be an exciting and rewarding one. You are welcome to join the party!

Learning is pretty democratized - take advantage of it
With the advent of MOOCs, open-source programming platforms, collaboration tools, and virtually unlimited free cloud-based storage, learning is as democratized, ubiquitous, and universally accessible as it can get. If you are not a specialist in data science/machine learning but want to learn the subject, write some code for higher productivity at work, strive for career advancement, or just have some fun, now is the time to start learning. A few parting comments:

You are a data scientist: Do not let any so-called expert demoralize you by saying something like "MOOCs are for kids, you won't learn real data science like that". The very fact that you are trying to learn data science by enrolling in a MOOC means two things: (a) you already deal with data in your professional life and (b) you want to learn the science, the structured manner of extracting maximum value from your data and generating intelligent questions around that data. That means you, my friend, are already a data scientist. If you are still not convinced, read this blog by Brandon Rohrer, one of the most admired and inspirational data scientists that I know of.

You don't have to spend a large sum for this learning: I know that I listed a lot of courses and they may look expensive to you. But, fortunately, most (if not all) of them can be taken free of cost. edX courses are always free to enroll in, and they generally don't place any restrictions on course content, i.e. you can view, execute, and submit all the graded assignments (unlike Coursera, which lets you watch all the videos but hides the graded material). If you think some certificate is worth showcasing on your resume, you can always pay for it in the middle of the course, after you have completed some videos and judged its merit and utility.

Practice, code, and build things to supplement your online learning: There is a real technique called 'online learning' in machine learning. In this technique, instead of processing a full matrix of millions of data points at once, the algorithm works with the latest few data points and updates its prediction as each one arrives. You can work in this mode too. The halting problem/parking problem is always a fascinating one, and it applies to learning too. We always wonder how much to study and assimilate before building things, i.e. where to halt the learning and start implementing. Don't hesitate, don't procrastinate. Learn a concept and test it with some simple coding. Work with the latest trick or technique you watched a video about; don't wait to achieve mastery over the entire topic. You will be amazed by how a simple 20 lines of code can give you solid practice (and make you sweat enough) on the most complex concept you learned watching that video.
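
In the spirit of "20 lines of code", here is a minimal sketch of what online learning can look like: a one-variable linear model that is updated one data point at a time with a stochastic gradient step, never holding the full dataset in a matrix. Everything in it is an assumption made for illustration: the function name `online_update`, the learning rate, and the synthetic, noise-free stream generated from y = 2x + 1.

```python
def online_update(w, b, x, y, lr=0.05):
    """Take one stochastic-gradient step on squared error for y ~ w*x + b."""
    error = (w * x + b) - y
    return w - lr * error * x, b - lr * error

# Hypothetical data stream: points from the line y = 2x + 1, seen one at a time.
stream = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0] * 500]

w, b = 0.0, 0.0
for x, y in stream:
    # Only the latest point is used; earlier points are never revisited here.
    w, b = online_update(w, b, x, y)

print(round(w, 3), round(b, 3))
```

With real, noisy data the estimates would keep jittering around the best-fit values instead of settling exactly, and a smaller or decaying learning rate is the usual remedy.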

There is plenty of data out there: You will also be amazed by how many rich sources of free data are out there on the web. Don't just go to Kaggle, try something different for fun. Try data.gov or the United Nations data portal. Go to the UCI machine learning repository. Feeling more adventurous? What about downloading data about various countries from the CIA and trying all the cool visualizations that you learned in the latest Matplotlib or ggplot2 lecture? If nothing else, download your own electricity usage data from your energy provider and analyze whether you could save a few bucks by turning on the AC or dishwasher at a different time.
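
As a starting point for the electricity idea, here is a minimal sketch of the comparison, assuming a simple two-tier time-of-use tariff. The peak window, the rates, and the dishwasher's consumption are all made-up numbers for illustration; substitute the values from your own provider's rate plan and usage export.

```python
# Hypothetical tariff: replace with your provider's actual time-of-use plan.
PEAK_HOURS = range(16, 21)   # assumed peak window, 4 pm to 9 pm
PEAK_RATE = 0.40             # assumed $/kWh during peak hours
OFF_PEAK_RATE = 0.25         # assumed $/kWh at all other times

def rate_at(hour):
    """Return the assumed price per kWh for a given hour of the day (0-23)."""
    return PEAK_RATE if hour in PEAK_HOURS else OFF_PEAK_RATE

def run_cost(start_hour, duration_hours, kwh_per_hour):
    """Cost of running an appliance for whole hours starting at start_hour."""
    return sum(
        rate_at((start_hour + h) % 24) * kwh_per_hour
        for h in range(duration_hours)
    )

# Assumed dishwasher cycle: 2 hours at 1.2 kWh per hour.
evening = run_cost(18, 2, 1.2)   # started at 6 pm, inside the peak window
night = run_cost(22, 2, 1.2)     # started at 10 pm, outside the peak window
saving = evening - night
print(f"Saving per cycle: ${saving:.2f}")
```

From there, the natural next step is to load your actual hourly usage file and apply `rate_at` to every row to see how much of your bill falls inside the peak window.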


The article was originally published here

Source: HOB