Skip to main content

What is Data Science and Data Scientist Role?

The Beginner’s Guide for a Data Scientist to getting Help to their problems.

I am excited to share with you all that I have just uploaded two new videos on my YouTube channel. These videos are a part of my ongoing effort to create valuable content for the data science community.


In the first lesson we look at how to effectively get help when you run into a problem. This is important for this course, but also for your future as a data scientist!

We first looked at strategies to use before asking for help, including reading the manual, checking the help files, and searching Google and appropriate forums. We also covered some common coding problems you may face and some preliminary steps you can take on your own, including paying special attention to error messages and examining how your code behaved compared to your goal.

Once you’ve exhausted these options, we turn to other people for help. We can ask peers for help or explain our problems to our trusty rubber ducks (be it an actual rubber duck or an unsuspecting coworker!). Our course forum is also a great resource for you all to talk with many of your peers! Go introduce yourself!

And if all else fails, we can post on forums (be it in this class or at another forum, like Stack Overflow), with very specific, reproducible questions. Before doing so, be sure to brush up on your forum etiquette - it never hurt anybody to be polite! Be a good citizen of our forums! There is an art to problem solving, and the only way to get practice is to get out there and start solving problems! Get to it!


In the second Lecture, we talked about what is Data Science?



What is data science?


So the first question you probably need answered going into this course is, “What is Data Science?” - and that is a great question. To different people, this means different things, but at its core, data science is using data to answer questions. This is a pretty broad definition, and that’s because it’s a pretty broad field!


Data science can involve:
  • Statistics, computer science, mathematics
  • Data cleaning and formatting
  • Data visualization

An Economist Special Report sums up this combination of skills well - they state that a data scientist is broadly defined as someone:

“who combines the skills of software programmer, statistician and storyteller slash artist to extract the nuggets of gold hidden under mountains of data”

And by the end of these courses, hopefully you will feel equipped to do just that!

Why do we need data science?

One of the reasons for the rise of data science in recent years is the vast amount of data currently available and being generated. Not only are massive amounts of data being collected about many aspects of the world and our lives, but we simultaneously have the rise of inexpensive computing. This has created the perfect storm in which we have rich data and the tools to analyse it: Rising computer memory capabilities, better processors, more software and now, more data scientists with the skills to put this to use and answer questions using this data!

There is a little anecdote that describes the truly exponential growth of data generation we are experiencing. In the third century BC, the Library of Alexandria was believed to house the sum of human knowledge. Today, there is enough information in the world to give every person alive 320 times as much of it as historians think was stored in Alexandria’s entire collection.

And that is still growing.

What is big data?

We’ll talk a little bit more about big data in a later lecture, but it deserves an introduction here - since it has been so integral to the rise of data science. There are a few qualities that characterize big data. The first is volume. As the name implies, big data involves large datasets - and these large datasets are becoming more and more routine. For example, say you had a question about online video - well, YouTube has approximately 300 hours of video uploaded every minute! You would definitely have a lot of data available to you to analyse, but you can see how this might be a difficult problem to wrangle all of that data!

And this brings us to the second quality of big data: velocity. Data is being generated and collected faster than ever before. In our YouTube example, new data is coming at you every minute! In a completely different example, say you have a question about shipping times or routes. Well, most transport trucks have real time GPS data available - you could in real time analyse the trucks movements… if you have the tools and skills to do so!

The third quality of big data is variety. In the examples I’ve mentioned so far, you have different types of data available to you. In the YouTube example, you could be analysing video or audio, which is a very unstructured data set, or you could have a database of video lengths, views or comments, which is a much more structured dataset to analyse.

A summary of three qualities that characterize big data

What is a data scientist?

So we’ve talked about what data science is and what sorts of data it deals with, but something else we need to discuss is what exactly a data scientist is.

The most basic of definitions would be that a data scientist is somebody who uses data to answer questions. But more importantly to you, what skills does a data scientist embody?

Drew Conway’s Venn diagram of data science

And to answer this, we have this illustrative Venn diagram, in which data science is the intersection of three sectors - Substantive expertise, hacking skills, and math and statistics.

To explain a little on what we mean by this, we know that we use data science to answer questions - so first, we need to have enough expertise in the area that we want to ask about in order to formulate our questions and to know what sorts of data are appropriate to answer that question. Once we have our question and appropriate data, we know from the sorts of data that data science works with, often times it needs to undergo significant cleaning and formatting - and this often takes computer programming slash “hacking” skills. Finally, once we have our data, we need to analyse it, and this often takes math and stats knowledge.

In this specialization, we’ll spend a bit of time focusing on each of these three sectors, but will primarily focus on math and statistics knowledge and hacking skills. For hacking skills, we’ll focus on teaching two different components: computer programming or at least computer programming with R, which will allow you to access data, play around with it, analyze it, and plot it. Additionally, we’ll focus on having you learn how to go out and get answers to your programming questions.

One reason data scientists are in such demand is that most of the answers aren’t already outlined in textbooks - a data scientist needs to be somebody who knows how to find answers to novel problems.

Why do data science?

Speaking of that demand, there is a huge need for individuals with data science skills. Not only are machine learning engineers, data scientists, and big data engineers among the top emerging jobs, the demand far exceeds the supply.

Data scientist roles have grown exponentially since 2012, but currently demand beats the supply significantly around the world, besides hundreds of companies are hiring for those roles - even those you may not expect in sectors like retail and finance - supply of candidates for these roles cannot keep up with demand.

This is a great time to be getting in to data science - not only do we have more and more data, and more and more tools for collecting, storing, and analysing it, but the demand for data scientists is becoming increasingly recognized as important in many diverse sectors, not just business and academia.

Additionally, according to towards data science, in which they ranked the top 50 best jobs in America, Data Scientist is THE top job in the US in 2022, based on job satisfaction, salary, and demand.

Examples of data scientists

The diversity of sectors in which data science is being used is exemplified by looking at examples of data scientists.

One place we might not immediately recognize the demand for data science is in sports – Daryl Morey is the general manager of a US basketball team, the Houston Rockets. Despite not having a strong background in basketball, Morey was awarded the job as GM on the basis of his bachelor’s degree in computer science and his M.B.A. from M.I.T. He was chosen for his ability to collect and analyse data, and use that to make informed hiring decisions.

Another data scientist that you may have heard of is Hilary Mason. She is a co-founder of FastForward labs, a machine learning company recently acquired by Cloudera, a data science company, and is the Data Scientist in Residence at Accel. Broadly, she uses data to answer questions about mining the web and understanding the way that humans interact with each other through social media.

And finally, Nate Silver is one of the most famous data scientists or statisticians in the world today. He is founder and editor in chief at FiveThirtyEight - A website that uses statistical analysis - hard numbers - to tell compelling stories about elections, politics, sports, science, economics and lifestyle.

He uses large amounts of totally free public data to make predictions about a variety of topics; most notably he makes predictions about who will win elections in the United States, and has a remarkable track record for accuracy doing so.

Data science in action!

One example of data science in action is the development of personalized recommendations on streaming platforms like Netflix or Spotify. These companies use data science techniques to analyze user behavior, such as the content they watch or listen to, how long they watch or listen to it, and which other users have similar preferences.


Based on this analysis, the platform can generate personalized recommendations for each user, suggesting new content that they are likely to enjoy. The recommendations are constantly updated as the platform gathers more data on each user's preferences and behavior.

This application of data science not only improves the user experience but also helps the company increase user engagement and retention, leading to higher revenue and profitability.

What will we teach you in this course?

Now that you have had this introduction into data science, all that really remains to cover here is a summary of what it is that we will be teaching you throughout this course. To start, we’ll go over the basics of R. R is the main programming language that we will be working with in this course track, so a solid understanding of what it is, how it works and getting it installed on your computer is a must. We’ll then transition into RStudio - which is a very nice graphical interface to R, that should make your life easier! We’ll then talk about version control, why it is important and how to integrate it into your work. And once you have all of these basics down, you’ll be all set to apply these tools to answering your very own data science questions!

Looking forward to learning with you! Let’s get to it!


Comments

Post a Comment

Type your comment here.
However, Comments for this blog are held for moderation before they are published to blog.
Thanks!

Popular posts from this blog

What is Data? And What is Data Science Process?

The Beginner’s Guide to Data & Data Science Process About Data: In our First Video today we talked about Data and how the Cambridge English Dictionary and Wikipedia defines Data, then we looked on few forms of Data that are: Sequencing data   Population census data ( Here  is the US census website and  some tools to help you examine it , but if you aren’t from the US, I urge you to check out your home country’s census bureau (if available) and look at some of the data there!) Electronic medical records (EMR), other large databases Geographic information system (GIS) data (mapping) Image analysis and image extrapolation (A fun example you can play with is the  DeepDream software  that was originally designed to detect faces in an image, but has since moved on to more  artistic  pursuits.) Language and translations Website traffic Personal/Ad data (e.g.: Facebook, Netflix predictions, etc.) These data forms need a lot of preprocessin...

Mastering R Programming: Best Coding Practices for Readable and Maintainable Code

The Beginner’s Guide to Coding Standards: When it comes to programming, writing code that is easy to read and maintain is just as important as writing code that works. This is especially true in R programming, where it's common to work with large datasets and complex statistical analyses. In this blog post, we'll go over some coding standards that you should follow when writing R code to ensure that your code is easy to read and maintain . Indenting One of the most important coding standards to follow is to use consistent indenting. Indenting makes your code more readable by visually indicating the structure of your code. In R programming, it's common to use two spaces for each level of indentation. For example: if (x > y) {   z <- x + y } else {   z <- x - y } Column Margins Another important coding standard is to use consistent column margins. This means that you should avoid writing code that extends beyond a certain number of characters (often 80 or 100). Th...

Introduction to Functions and Arguments in R Programming: Part 2

The Beginner’s Guide to Functions in R Programming: Functions are an essential part of programming, and they play a critical role in R programming. In R, a function is a set of instructions that perform a specific task. Functions in R can have several arguments, and their evaluation can be lazy or eager. In this blog post, we will explore functions in R, including their  "dot-dot-dot" or ellipsis  argument, lazy evaluation, and more . Ellipsis or "dot-dot-dot" Argument in R Functions The "dot-dot-dot" or ellipsis argument in R programming is a special argument that can be used in functions to represent a variable number of additional arguments that are not explicitly defined in the function. The ellipsis argument is represented by three dots ... and is typically used at the end of the function's argument list. When the function is called, any additional arguments provided by the user after the defined arguments are collected by the ellipsis argument an...