Working to become a Data Scientist, Data Analyst, Data Engineer or Machine Learning Engineer


For this post I want to discuss how to work towards becoming a Data Scientist, Data Analyst, Data Engineer or Machine Learning Engineer. I will refer to these job titles as “The Big 4”. As of writing this post, the number of job postings for “The Big 4” on employment websites is amazing. All you have to do is go to LinkedIn or another job site and search for “Data Analyst” in the United States, and you will see several pages of openings, and that’s just for one of “The Big 4”.

Related to “The Big 4” are specialized roles in Natural Language Processing (NLP) and AI techniques, but I believe these are just variants of, say, a Machine Learning Engineer or Data Scientist role, so I don’t mention NLP or AI as separate roles. As of writing this post, a search on LinkedIn for NLP returns Data Science positions. The fact that there are many openings with the job title of “Data Analyst” or “Data Scientist” is great, but what if you haven’t held a job as a “Data Analyst” or “Data Scientist” or any of “The Big 4”?

Ready, Set, Go!

So, where do you start if you are interested in working as one of “The Big 4”? I took the advice of well-known professionals in the field, those who have written books about or teach “The Big 4”. I also reviewed several blogs by people who work as one of “The Big 4”, where they share their work and thoughts. I have enrolled in two different online curriculums. One is on Coursera, where I am taking a Machine Learning course. The other is Dataquest.io, which provides paths for students who are interested in becoming a Data Scientist, Data Analyst or Data Engineer. I am also considering an online Master’s degree in Data Science.

As you begin exploring “The Big 4”, you will realize that there is some overlap between the roles. Often you will see job postings that suffer from what I call the “bundle complex”, where a given posting bundles what could possibly be the work of 1 or 2 FTEs into a single position description, but this is the subject of a future post.

For now, to show how these roles overlap, consider the image below, a data plotting exercise from my Data Analyst / Data Scientist track on Dataquest.io. The example comes from a section on “Improving Plot Aesthetics” in the “Storytelling Through Data Visualization” course.

Simple Graph

To create the graph above, I used standard tools of a Data Analyst or Data Scientist, which in this case are the Python libraries pandas and matplotlib (its pyplot module), run in IPython.
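As a rough illustration of that toolset, here is a minimal sketch of a line plot built with pandas and matplotlib.pyplot. The data values and column names are made up for the example and are not the data behind the graph above. The cleanup steps (hiding the top and right borders and the tick marks) are only an assumption about the kind of changes a lesson on plot aesthetics might cover.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; not the data behind the graph above.
df = pd.DataFrame({
    "year": [2010, 2011, 2012, 2013, 2014],
    "value": [10.0, 12.5, 11.0, 14.2, 15.8],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(df["year"], df["value"], label="value")

# Aesthetic cleanup: hide the top and right borders and remove the tick marks.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
ax.tick_params(bottom=False, left=False)

ax.set_xlabel("Year")
ax.set_ylabel("Value")
ax.set_title("Simple Graph")
ax.legend(loc="upper left")

# Write the figure to a PNG file in the current directory.
fig.savefig("simple_graph.png")
```

In an IPython or Jupyter session you would normally skip the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving the file.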

The good thing is that much of the software used in “The Big 4” is open source and very accessible. For example, Anaconda is a software platform that integrates many different tools for Data Analysts or Data Scientists, such as IPython or Jupyter Notebook. Additionally, there are complete Data Science platforms, as well as Big Data platforms, that integrate “The Big 4” tools.

The Mathematics Journey

One thing that has become clear early in my journey into “The Big 4” is the need to brush up on statistics, linear algebra, vector mathematics and some calculus. If you plan to become a Machine Learning Engineer or Data Scientist, it is important to know how to solve basic equations, but you won’t necessarily need to compute partial derivatives. However, if you are a mathematical wizard and can apply what you know to the formulas used within the field, it doesn’t hurt either. As I continue my journey with “The Big 4”, I am realizing that rather than solving equations manually, it is more important to understand which model to apply and the proper way to apply it in order to produce the best results.

The Machine Learning class that I am currently taking on Coursera does include a fair number of mathematical formulas, but the formulas themselves are rarely used to produce a result by hand. Instead, they are translated into a more practical form that can be used in a programming language. As an example, here is a formula used in week 6 of my machine learning course.

Cost Function

This formula represents what is called a cost function, and it helps calculate the errors on training data and cross validation data as part of plotting a learning curve, which is helpful in debugging a machine learning algorithm. The good thing is that during the class you translate this formula numerous times into a vectorized equivalent within the Octave programming environment. Yay for Octave! As you continue your journey you will discover that many programming languages, such as Python or R, have mathematical libraries that provide implementations of machine learning formulas. While the language is different, the implementation is usually optimized.
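To make the idea of a vectorized equivalent concrete, here is a minimal sketch in Python with NumPy rather than Octave. It assumes the cost is the plain squared-error cost for linear regression, J(θ) = (1/(2m)) Σ (h(x_i) − y_i)², with no regularization term, which may differ from the exact formula used in the course. The function names and the tiny data set are made up for illustration.

```python
import numpy as np


def cost_loop(X, y, theta):
    """Squared-error cost computed one training example at a time."""
    m = len(y)
    total = 0.0
    for i in range(m):
        # Hypothesis for example i: the dot product of its features with theta.
        prediction = sum(X[i][j] * theta[j] for j in range(len(theta)))
        total += (prediction - y[i]) ** 2
    return total / (2 * m)


def cost_vectorized(X, y, theta):
    """Same cost as matrix operations: (1 / 2m) * ||X @ theta - y||^2."""
    m = y.shape[0]
    residuals = X @ theta - y
    return float(residuals @ residuals) / (2 * m)


# Hypothetical data: a column of ones for the intercept plus one feature.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

theta_zero = np.zeros(2)           # predicts 0 for every example
theta_fit = np.array([0.0, 1.0])   # predicts y exactly

print(cost_loop(X, y, theta_zero))        # (1 + 4 + 9) / 6
print(cost_vectorized(X, y, theta_zero))  # same value from the vectorized form
print(cost_vectorized(X, y, theta_fit))   # zero cost for a perfect fit
```

For a learning curve, one would typically evaluate a cost like this on a growing subset of the training data and on the cross validation data, then plot the two errors against the subset size.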

Do I need a Degree in Data Science or Data Analysis?

You might be asking yourself, should I obtain a degree in Data Science? That’s a good question, and not one that is easily answered. It really depends on where you are in your career and whether you have a STEM degree. In my case, I had not done any statistics since college, but I do hold a Master’s degree in Information Technology and have many years of experience working with data-driven applications. I also recently held the role of Analytics Consultant, which provided valuable experience in data analysis and data visualization, with exposure to some data science methods. My most recent role more than likely served as the catalyst for my journey to become one of “The Big 4”.

While doing my research, I found an article on Forbes written by an expert in the field that says you may want to rethink getting a degree in Data Science, meaning you may not have to obtain one of these degrees in order to pursue “The Big 4”. The article suggests that you review or take a basic statistics course to become familiar with some of the methods used in “The Big 4” roles, to determine whether you even like it, or to show you that you can learn the techniques without actually enrolling in a Data Science program. Indeed, this appears to be true, because there are many online resources that provide more than enough to get started. However, this doesn’t mean that I won’t pursue a Master’s degree in Data Science to expand my horizons; it just means I am carefully considering whether it is the right thing for me to do. I’m thankful that I read this article. I did end up downloading a free e-book on statistics so I could brush up from my days in college. By the way, the book is called “Introductory Statistics” and it has a very nice introduction to Linear Regression.

Create your Blog!

I found that creating a blog, or at least writing blog posts, is recommended and in some cases required by employers or education programs if you are planning to pursue “The Big 4”. For example, an employer I recently interviewed with, which provides a Big Data platform, stated that employees are expected to contribute to the company blog and can count the time spent blogging toward their utilization.

I’ve also seen programs that teach “The Big 4” require you to blog as you progress through the program. While you do not have to create a blog, for me it works, because I like writing and can use GitHub Pages to host my blog. I can use the same platform to post code samples relevant to “The Big 4”.

If you haven’t created a blog yet, be sure to check out my tutorial on how to build a blog using Jekyll and GitHub Pages. It will take some effort to get through it, but after completing it you will have a very nice, professional-looking blog that is relatively easy to maintain.

Summary

After enrolling in the online curriculums above, I realized that many of the tools used in “The Big 4” are readily available on the platform that I already use as my day-to-day workstation, which is Ubuntu Linux. While you do not have to use Linux to become one of “The Big 4”, you might be surprised to learn that most if not all of the tools used for these careers are readily available on the Linux platform. This is especially great if you are like me and enjoy working on an open source platform. Either way, hopefully you are doing something that you enjoy.

To be perfectly honest, working toward becoming one of “The Big 4” is not necessarily an easy journey. If becoming one of “The Big 4” were easy, then everyone would be doing it.

If you are considering a career change, have a great deal of experience with data systems (DBA, Data Modeling, OO Programming, SQL etc.), have done some data visualization or quantitative reporting, and want to use that skill set to become one of “The Big 4”, you can get there in a variety of ways, and as it turns out it doesn’t necessarily have to be expensive either. You can probably get there without a great deal of experience in data systems or programming, but the journey may take a bit longer. For example, if you have not done any programming, it may be pretty hard to pick up Python as used in my Data Analyst / Data Scientist track with Dataquest.io. The good news is that they have a complete introduction to Python as a part of their Data Analyst or Data Scientist tracks, and there are other online training sites such as DataCamp that provide similar training with some free tutorials for Python or R.
