The Beginner’s Guide for a Data Scientist to getting Help to their problems.
I am excited to share with you all that I have just uploaded two new videos on my YouTube channel. These videos are a part of my ongoing effort to create valuable content for the data science community.
In the first lesson we look at how to effectively get help when you run into a problem. This is important for this course, but also for your future as a data scientist!
We first looked at strategies to use before asking for help, including reading the manual, checking the help files, and searching Google and appropriate forums. We also covered some common coding problems you may face and some preliminary steps you can take on your own, including paying special attention to error messages and examining how your code behaved compared to your goal.
Once you’ve exhausted these options, we turn to other people for help. We can ask peers for help or explain our problems to our trusty rubber ducks (be it an actual rubber duck or an unsuspecting coworker!). Our course forum is also a great resource for you all to talk with many of your peers! Go introduce yourself!
And if all else fails, we can post on forums (be it in this class or at another forum, like Stack Overflow), with very specific, reproducible questions. Before doing so, be sure to brush up on your forum etiquette - it never hurt anybody to be polite! Be a good citizen of our forums! There is an art to problem solving, and the only way to get practice is to get out there and start solving problems! Get to it!
What is data science?
So the first question you probably need answered going into this course is, “What is Data Science?” - and that is a great question. To different people, this means different things, but at its core, data science is using data to answer questions. This is a pretty broad definition, and that’s because it’s a pretty broad field!
Data science can involve:
- Statistics, computer science, mathematics
- Data cleaning and formatting
- Data visualization
An Economist Special Report sums up this combination of skills well - they state that a data scientist is broadly defined as someone:
“who combines the skills of software programmer, statistician and storyteller slash artist to extract the nuggets of gold hidden under mountains of data”
And by the end of these courses, hopefully you will feel equipped to do just that!
Why do we need data science?
One of the reasons for
the rise of data science in recent years is the vast amount of
data currently available and being generated. Not only are massive amounts of
data being collected about many aspects of the world and our lives, but we
simultaneously have the rise of inexpensive computing. This has created the
perfect storm in which we have rich data and the tools to analyse it: Rising
computer memory capabilities, better processors, more software and now, more
data scientists with the skills to put this to use and answer questions using
this data!
There is a little
anecdote that describes the truly exponential growth of data generation we are
experiencing. In the third century BC, the Library of Alexandria was believed
to house the sum of human knowledge. Today, there is enough information in the
world to give every person alive 320 times as much of it as historians think
was stored in Alexandria’s entire collection.
And that is still growing.
What is big data?
We’ll talk a little
bit more about big data in a later lecture, but it deserves an introduction
here - since it has been so integral to the rise of data science. There are a few qualities that characterize big
data. The first is volume.
As the name implies, big data involves large datasets - and these large
datasets are becoming more and more routine. For example, say you had a
question about online video - well, YouTube has approximately 300 hours of
video uploaded every minute! You would definitely have a lot of data available
to you to analyse, but you can see how this might be a difficult problem to
wrangle all of that data!
And this brings us to
the second quality of big data: velocity. Data is being generated
and collected faster than ever before. In our YouTube example, new data is
coming at you every minute! In a completely different example, say you have a
question about shipping times or routes. Well, most transport trucks have real
time GPS data available - you could in real time analyse the trucks movements…
if you have the tools and skills to do so!
The third quality of
big data is variety. In the examples I’ve mentioned so far, you
have different types of data available to you. In the YouTube example, you
could be analysing video or audio, which is a very unstructured data set, or
you could have a database of video lengths, views or comments, which is a much
more structured dataset to analyse.
A summary of three qualities that characterize big data
What is a data scientist?
So we’ve talked about
what data science is and what sorts of data it deals with, but something else
we need to discuss is what exactly a data scientist is.
The most basic of
definitions would be that a data scientist is somebody who uses data to answer
questions. But more importantly to you, what skills does a data scientist
embody?
Drew Conway’s Venn diagram of data science
And to answer this, we
have this illustrative Venn diagram, in which data science is the intersection of
three sectors - Substantive expertise, hacking skills, and math and statistics.
To explain a little on
what we mean by this, we know that we use data science to answer questions - so
first, we need to have enough expertise in the area that we want to ask about
in order to formulate our questions and to know what sorts of data are
appropriate to answer that question. Once we have our question and appropriate
data, we know from the sorts of data that data science works with, often times
it needs to undergo significant cleaning and formatting - and this often takes
computer programming slash “hacking” skills. Finally, once we have our data, we
need to analyse it, and this often takes math and stats knowledge.
In this
specialization, we’ll spend a bit of time focusing on each of these three
sectors, but will primarily focus on math and statistics knowledge and hacking
skills. For hacking skills, we’ll focus on teaching two different components:
computer programming or at least computer programming with R, which will allow
you to access data, play around with it, analyze it, and plot it. Additionally,
we’ll focus on having you learn how to go out and get answers to your
programming questions.
One reason data
scientists are in such demand is that most of the answers aren’t already
outlined in textbooks - a data scientist needs to be somebody who knows how to
find answers to novel problems.
Why do data science?
Speaking of that
demand, there is a huge need for individuals with data science skills. Not only
are machine learning engineers, data scientists, and big data engineers among
the top emerging jobs, the demand far exceeds the supply.
Data scientist roles have grown exponentially since
2012, but currently demand beats the supply significantly around the world, besides
hundreds of companies are hiring for those roles - even those you may not
expect in sectors like retail and finance - supply of candidates for these
roles cannot keep up with demand.
This is a great time
to be getting in to data science - not only do we have more and more data, and
more and more tools for collecting, storing, and analysing it, but the demand
for data scientists is becoming increasingly recognized as important in many
diverse sectors, not just business and academia.
Additionally, according to towards data science, in which they ranked the top 50 best jobs in America, Data Scientist is THE top job in the US in 2022, based on job satisfaction, salary, and demand.
Examples of data scientists
The diversity of
sectors in which data science is being used is exemplified by looking at
examples of data scientists.
One place we might not
immediately recognize the demand for data science is in sports – Daryl Morey is the general manager of a US
basketball team, the Houston Rockets. Despite not having a strong
background in basketball, Morey was awarded the job as GM on the basis of his bachelor’s
degree in computer science and his M.B.A. from M.I.T. He was chosen for his
ability to collect and analyse data, and use that to make informed hiring
decisions.
Another data scientist that you may have heard of is Hilary Mason. She is a co-founder of FastForward labs, a machine learning company recently acquired by Cloudera, a data science company, and is the Data Scientist in Residence at Accel. Broadly, she uses data to answer questions about mining the web and understanding the way that humans interact with each other through social media.
And finally, Nate Silver is one of the most famous data scientists or statisticians in the world today. He is founder and editor in chief at FiveThirtyEight - A website that uses statistical analysis - hard numbers - to tell compelling stories about elections, politics, sports, science, economics and lifestyle.
He uses large amounts
of totally free public data to make predictions about a variety of topics; most
notably he makes predictions about who will win elections in the United States,
and has a remarkable track record for accuracy doing so.
Data science in action!
One example of data science in action is the development of personalized recommendations on streaming platforms like Netflix or Spotify. These companies use data science techniques to analyze user behavior, such as the content they watch or listen to, how long they watch or listen to it, and which other users have similar preferences.
Based on this analysis, the platform can generate personalized recommendations for each user, suggesting new content that they are likely to enjoy. The recommendations are constantly updated as the platform gathers more data on each user's preferences and behavior.
This application of data science not only improves the user experience but also helps the company increase user engagement and retention, leading to higher revenue and profitability.
What will we teach you in this course?
Now that you have had
this introduction into data science, all that really remains to cover here is a
summary of what it is that we will be teaching you throughout this course. To
start, we’ll go over the basics of R. R is the main programming language that
we will be working with in this course track, so a solid understanding of what
it is, how it works and getting it installed on your computer is a must. We’ll
then transition into RStudio - which is a very nice graphical interface to R,
that should make your life easier! We’ll then talk about version control, why
it is important and how to integrate it into your work. And once you have all
of these basics down, you’ll be all set to apply these tools to answering your
very own data science questions!
Looking forward to
learning with you! Let’s get to it!
Well Done!
ReplyDelete