As a working Data Scientist, I regularly get queries from aspiring students and recent graduates who want to start a career in Data Science. Although I try to help them whenever I can, I thought it would be a good idea to write my advice down somewhere that’s accessible to all. This is my attempt at doing that.
What makes a Data Scientist?
I have a whole article dedicated to what Data Science is, so I’m not going to repeat that here. What it essentially says is that, for someone to become a Data Scientist, she must be knowledgeable in statistics, be able to explore and model the data by writing code, reason about the results scientifically, and present them clearly.
Besides these, there are a number of “soft skills” involved that separate a truly effective data scientist from the rest. These include qualities such as being able to switch between different domains easily; having meaningful conversations with domain experts on the problem you are trying to solve; figuring out what question the users are trying to answer with the data; preparing effective plans for the analysis to figure out those answers; iterating through the various stages in a reasonable amount of time; and organizing the analysis so that the results can be recreated easily.
In the next few paragraphs you will see all the different skills you need to develop to succeed in this profession along with some of the resources I’ve handpicked for learning them.
The foundational skills
As with everything, being good at working with data requires you to acquire certain foundational skills. It’s easy to get carried away and start with popular topics like Deep Learning right away, but you will quickly get lost if you don’t know about things like probability and statistics, linear algebra, basic programming and algorithm design.
Most of this material is already part of a typical computer science undergraduate course, so revising those lectures or textbooks is a good starting point. There are also many good online videos, like those from Khan Academy and Udacity, that teach these basics. The important thing is to develop a good intuition for these subjects rather than learn them by rote. At a minimum, you should know what the different statistical distributions are, how to carry out significance testing and what the central limit theorem is, and you should be able to understand and code basic numeric algorithms like gradient descent.
1. Think Stats (Allen Downey)
2. All of Statistics (Wasserman)
3. Coding the Matrix (Klein)
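To give a feel for what “coding a basic numeric algorithm” means, here is a minimal sketch of gradient descent in plain Python. The function names, the toy objective f(x) = (x - 3)^2, the learning rate and the step count are all my own illustrative choices, not something taken from the books above.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Minimize a one-dimensional function given its derivative `grad`."""
    x = x0
    for _ in range(steps):
        # Move a small step in the direction opposite to the slope.
        x -= learning_rate * grad(x)
    return x


# Illustrative objective (an assumption for this sketch):
# f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
def grad_f(x):
    return 2.0 * (x - 3.0)


minimum = gradient_descent(grad_f, x0=0.0)
print(minimum)  # should land very close to 3, the true minimizer
```

Try changing the learning rate: a value that is too large makes the iterates overshoot and diverge, while a value that is too small makes convergence painfully slow. Building that kind of intuition matters more than memorizing the update rule.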
Think in Dataframes
After getting a sound grounding in the basics, you need software tools to do any practical data science work. Excel spreadsheets and SQL databases like MySQL may come to mind. While they are useful for many basic analysis tasks, advanced data transformations and statistics call for specialized software.
The two most popular languages used by Data Scientists are R and Python.
R: R is a programming language for statistics and graphics that is very popular among data scientists, researchers and statisticians. Its basic way of working with data is what it calls “data frames”: two-dimensional data structures with rows and columns, analogous to a table in a SQL database. This has proven to be a very powerful way to work with data, and other tools like Python (Pandas) and Apache Spark have adopted it as well. I recommend getting familiar with R, especially with Hadley Wickham’s “tidyverse” libraries. If you read scientific papers, many of their implementations will be in R.
Python: Most programmers know Python as a general-purpose programming language, but recently it has emerged as the leading language used by data scientists as well. This is due to libraries such as NumPy, SciPy, Pandas, Matplotlib and Scikit-learn, and deep learning frameworks such as TensorFlow and PyTorch. Python might be an easier way to get started if you are already familiar with it or come from a software development background. Pandas is the tool you will work with most in the Python data world, and its DataFrame is modeled on R’s data frame.
Personally, I started using R at the beginning of my career and have lately been working mostly in the Python data ecosystem. I do still find some statistical and visualization tools in R to be better than their Python counterparts. You cannot go wrong with either, though. The point is to learn to work with dataframes.
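To make “working with dataframes” a little more concrete, here is a small sketch using Pandas. The data and the column names (`group`, `value`, `value_squared`) are made up purely for illustration; the point is only to show the table-like operations that feel similar to SQL.

```python
import pandas as pd

# Made-up toy data: one row per observation, one column per variable.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b", "b"],
        "value": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# Filter rows, much like a SQL WHERE clause.
large = df[df["value"] > 2.0]

# Aggregate per group, much like a SQL GROUP BY.
group_means = df.groupby("group")["value"].mean()

# Derive a new column from an existing one.
df = df.assign(value_squared=df["value"] ** 2)

print(large)
print(group_means)
print(df)
```

The same three ideas (filter, group and aggregate, derive columns) exist in R’s data frames too, which is why skills learned in one ecosystem carry over to the other fairly easily.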
Another important tool is the Jupyter notebook. It was originally created for Python-based data analysis but now supports many other language backends, including R. Working with data is a stepwise process, and Data Scientists typically work by going through the analysis serially in a notebook, carefully looking at the result of each step, the charts and so on. You can easily export the code or model from Jupyter notebooks.
1. R for Data Science (Wickham)
2. Python for Data Analysis, 2nd Edition (McKinney)
3. Python Data Science Handbook (Vanderplas)
Machine learning is probably what most of you are interested in. If you’ve learned the basics that I’ve talked about earlier, this should be easier to understand.
Of the many resources out there for learning machine learning, I’d advise you to take Andrew Ng’s Coursera course. Although it used Octave instead of R or Python, Andrew has a way of explaining complex topics in a less intimidating way. You cannot go wrong with this one. If you want to read a book on machine learning, I’d suggest “An Introduction to Statistical Learning”. Its authors also have another, similar book, but this one is more beginner friendly and is really well written.
The reality is that the actual code for creating and fitting a machine learning model is quite easy to write with R or with Python libraries like Scikit-learn. The harder part is that there are so many machine learning models, and variations of them, that it’s often difficult to know which one to use.
What you need to learn is how to prepare your features to feed into the model (aka feature engineering) and how to choose and optimize the best model for the dataset. Things like regularization and cross-validation to prevent the model from overfitting the data, hyperparameter tuning, and measuring the model’s effectiveness for the task at hand are more important than the fitting code itself.
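To show how regularization, cross-validation and hyperparameter tuning fit together, here is a hand-rolled sketch in plain Python. It is not how Scikit-learn does it internally; the one-parameter model y = w * x, the helper names, the made-up data and the candidate penalty values are all assumptions chosen to keep the example short.

```python
def fit_ridge(xs, ys, lam):
    """Closed-form ridge fit for the one-parameter model y = w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty `lam` shrinks the weight towards zero.
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_cv_error(xs, ys, lam, k=5):
    """Average held-out MSE over k folds for a given penalty `lam`."""
    n = len(xs)
    errors = []
    for fold in range(k):
        # Every k-th point, starting at `fold`, is held out for validation.
        val_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in val_idx]
        train_y = [ys[i] for i in range(n) if i not in val_idx]
        val_x = [xs[i] for i in sorted(val_idx)]
        val_y = [ys[i] for i in sorted(val_idx)]
        w = fit_ridge(train_x, train_y, lam)
        errors.append(mse(w, val_x, val_y))
    return sum(errors) / k


# Made-up data: roughly y = 2x with a small alternating disturbance.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

# Hyperparameter tuning: pick the penalty with the lowest cross-validated error.
candidates = [0.0, 1.0, 10.0, 100.0]
cv_errors = {lam: k_fold_cv_error(xs, ys, lam) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)
print(cv_errors)
print(best_lam)
```

The key idea is that each candidate penalty is judged only on data the model did not see while fitting. In practice you would let a library handle the fold splitting and the search, but writing it out once makes it clear what those library calls are doing for you.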
If you really want to dig deep into how the models work though, I would also suggest watching the “Learning from Data” course from Caltech. The professor is awesome and goes really deep into the basics of the “learning problem”, and how all of the machine learning magic really works. There is also an accompanying book for the course.
If Deep Learning is what you are interested in, you can then explore books and courses for that. Again, Andrew Ng has an excellent course on the topic, and there is also the “Deep Learning Book”. That book is a must for anyone who wants to get started in Deep Learning.
I’d also like to mention the importance of working with real datasets. Most of the data used in course work and books is pre-processed and cleaned up so that it is ready for analysis. In the real world, I’ve never come across a dataset that didn’t require some form of cleanup and transformation. You can look for real datasets in online sources like open data portals. Entering Kaggle competitions is also a good way of working with industry datasets.
I hope some of you will benefit from following the path that I’ve laid out above. There are other skills, such as data wrangling, data engineering and visualization, that are equally important for a Data Scientist in her day-to-day work and that I’ve not gone into in detail here. But you will come across these in the books and courses I’ve mentioned above and see why they are so important.
So, remember to focus on the basics first, always keep on learning, and try to get your hands on real-world datasets. With practice and dedication anyone can be a good data scientist.
More on Data Science:
On Data Science, Saurav Dhungana
Interested in big data in Nepal? See what’s happening, Dovan Rai
Author: Saurav Dhungana
Saurav is a Data Scientist and the Founder of CraftData Labs. A passionate advocate of using data to solve real-world problems, he loves creating products and solutions that help people better understand the world around them. Saurav holds an M.Sc. in Communications Engineering from Aalto University, Finland.
The name of the book is An introduction to Statistical Learning, not Statistical Inference. And the authors, Robert Tibshirani and Trevor Hastie had done a MOOC on it as well. It has been taken down from Coursera, but you can find the whole series on YouTube. 🙂
Corrected. Thanks for mentioning it, Subigya 🙂