Breaking down the job description: What is a Data Scientist Expected to Know?
Data science, today's buzzword, is a rapidly growing field. With advancements in 5G, the Internet of Things, and cloud computing, it is set to offer some of the most sought-after jobs in the near future. So what does it take to be a Data Scientist? Who is a Data Scientist in the first place, and why does a company need one? Let's explore this profile in depth here.
Who is a Data Scientist? What do they do?
At the simplest level, a Data Scientist is someone who analyzes data to obtain useful results. However, it is not just that. A true data scientist is someone who can define a problem statement that can be solved using data and then successfully solve it. They should also be skilled in mining, cleaning, analyzing, and interpreting data. The final results should be something beneficial to the company, like increased profits or better quality of service.
The book "Doing Data Science" by Rachel Schutt and Cathy O’Neil, published by O'Reilly, describes what data science is all about and the various practices related to it. It is a recommended read for all aspiring data scientists.
Let us look at the scenario at a few major companies. A quick LinkedIn job search for 'Data Scientist' gives more than 30,000 results worldwide.
Many firms hire 'Data Scientists' for many different profiles. Essentially, the role of a data scientist is not fixed to a particular domain. Different teams within a company have their own data scientists. For example, Facebook hires data scientists in various profiles like revenue forecasting and analytics, online safety and security, growth and analytics, ads, video ecosystems, product quality, etc. The job is the same, but the data you work with and the outcomes generated are different. Data is being generated everywhere, and that is what increases the demand for data scientists. Their use cases are versatile and the scope is tremendous.
After analyzing a wide range of job descriptions for data scientists from various companies, we have compiled the requirements that employers commonly look for in Data Scientists.
The things you'll need to know (You don't need to know all of them):
The Programming Languages:
- Python: Python is a simple and effective scripting language equipped with a wide range of libraries, which makes it very useful in data science and analytics. The "Python Data Science Handbook" has the information you need to get started with Python for data analytics and the important libraries used.
- R: R is a powerful programming language often preferred by companies over Python for statistical analysis. It is similar to Python in many ways. The Analytics Vidhya blog post titled "A complete tutorial to learn data science in R from scratch" covers what you need to know about R and how to use it, along with some important technical concepts. Having a good knowledge of either Python or R is essential.
- SAS: SAS is statistical analysis software with its own programming language, and it is another major tool used for data science. If a company prefers SAS, the 203-page tutorial from Tutorials Point covers what you need to know about it.
- SPSS: This is another software package used for statistical analysis of data. Although not as prevalent as Python or R, it can be quite handy at times.
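To give a feel for the cleaning and analyzing steps mentioned earlier, here is a minimal Python sketch that uses only the standard library. The sample values and the helper name clean_numbers are made up for illustration; real projects would more likely rely on libraries such as pandas.

```python
from statistics import mean, median

# Hypothetical raw values, as they might arrive from a form or a CSV export.
raw_ages = ["34", " 29", "", "41", "n/a", "29"]


def clean_numbers(values):
    """Strip whitespace, drop blank or non-numeric entries, convert to int."""
    cleaned = []
    for value in values:
        text = value.strip()
        if text.isdigit():
            cleaned.append(int(text))
    return cleaned


ages = clean_numbers(raw_ages)

# A tiny summary of the cleaned data.
summary = {"count": len(ages), "mean": mean(ages), "median": median(ages)}
print(summary)
```

The point is the workflow rather than the numbers: messy input is cleaned first, and only then summarized.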
The Query Languages:
As a data scientist, you will be continuously working with data and databases, so knowledge of at least one query language is necessary. A query language is a language used to interact with databases and information systems by requesting and retrieving data:
- SQL: Structured Query Language is one of the most popular query languages. The tutorial "7 steps to mastering SQL for data science" covers what you need to know to quickly get a grasp of the language and its use.
- Apache Hive and Pig: If you have never heard of them before, the names may seem odd, but both are used when working with big data. Before you get to know them, you should first know about Hadoop, a distributed framework for storing and processing huge amounts of data. Hadoop is itself a large topic that deserves a separate article. The article "Hive and Pig, Introduction and Key Difference Between Them" gives a fundamental idea of what they are. I do not have much experience with these myself, so I cannot comment further. This is where you should head after mastering SQL.
- NoSQL Databases: NoSQL is usually read as "Not only SQL". MongoDB, Cassandra, and HBase fall under this category. Each one is worth its own article exploring its use cases and advantages.
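As a small illustration of what requesting and retrieving data looks like in SQL, here is a sketch that runs a query against an in-memory SQLite database through Python's built-in sqlite3 module. The orders table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")

# Hypothetical sample rows, for illustration only.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0), ("carol", 210.0)],
)

# Count orders and total spend per customer, largest total first.
query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

conn.close()
```

The same SELECT, GROUP BY, and ORDER BY ideas carry over to larger database systems, although each has its own dialect and extensions.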
Data Visualization tools and frameworks:
Data visualization is an important step in any data analysis procedure and helps us see patterns in the data.
- D3.js: It is a JavaScript library used for interactive data visualization in a web browser. Before you learn it, you must know the basics of HTML, CSS, and JavaScript, which are the fundamentals of web development and are not difficult to grasp. If you are already familiar with them, the article "Learn D3.js in 5 minutes" will help you understand its use case pretty quickly.
- ggplot2 and Matplotlib: These are other important packages used for data visualization, in R and Python respectively. They are pretty straightforward and can be used directly, as they come with a lot of built-in functions.
- Tableau: This is widely used interactive data visualization software. It is useful for quick visualization of data and can be integrated with R as well. "Data Visualization and Communication with Tableau", a 5-week online course offered by Duke University on Coursera, can provide some quick training in this software.
- MicroStrategy and Power BI: These are similar software products on the market that a data scientist should be aware of. Different companies may use different frameworks.
Together, these make up the major skills and hands-on experience expected of any data scientist. Professionals looking for either a career change or a job in the data science domain will find them listed as requirements for most data science profiles, if not all. So far we have only seen the software, frameworks, and tools that data scientists use in their day-to-day work, but what are the technical concepts one must be aware of?
Stay tuned for the next article where we briefly explore the technical concepts most frequently sought in a data scientist and resources to gain expertise in the same.
(Do comment on which other job profiles you would like to see analyzed in a similar manner.)
(This article was first published on Research Nest's blog on Medium here)