Reflection Blog Post

What (if anything) has changed about what you think a data scientist is and what they do?
Some things that I left out in my initial explanation of what a data scientist does:

  • I think that a data scientist should be able to glean insight from data. A great data scientist is a great storyteller, someone who can distinguish the signal from the noise in a dataset. This means finding the trends in a dataset that best explain outside phenomena.
  • A data scientist should also be able to use data to make predictions about the future. This is usually done with machine learning.

Analyzing R against other Languages

R is a language wonderfully suited to scientific programming. Its structures and functions for vector, matrix, and data frame manipulation are powerful and easy to use. Furthermore, the family of libraries that makes up the tidyverse makes working with data frames even easier. The R Markdown file type and the RStudio IDE are well suited to creating reproducible data analyses. R also boasts what is, in my view, the best collection of open-source statistical libraries, created by researchers around the world. If I wanted to use the most cutting-edge statistical techniques, I would choose R.

Now, R is not without its limitations, and I would choose Python over R in some cases. Python has the advantage of being a robust language for web development, with web frameworks such as Flask and Django. Python also has very strong libraries for web parsing, natural language processing, machine learning, and deep learning, and it is more widely available on cloud computing services. In my experience, Python is also generally more performant than R. Python’s pandas library and the visualization libraries built upon matplotlib, in my mind, rival the tidyverse and provide a great alternative.


What Defines a Data Scientist

What is a data scientist? I can take my own experience in the field as some inspiration. A data scientist, in my mind, has knowledge in the following four areas:


1) Programming:
I first entered the field of data science by learning how to code. After a year and a half of living in New York City post-college, I quit my job with a desire to enter the field of data science. I enrolled in a 3-month bootcamp dedicated to learning Python, backend web development, and data science. This first experience showed me that a big part of data science is understanding how to put ideas into action through programming. That can mean implementing a machine learning algorithm or creating an R Shiny dashboard to present data analysis. Learning how to program opened up a lot of doors to learn data analysis…


2) Data Analysis:
Data analysis is the process of surfacing data and exploring it with any number of methods to derive insights that could not be derived from simply looking at the raw data. This often involves data visualization, simple statistical methods such as regression, describing relationships between variables, and summarizing data. A data scientist must be able to gain a good understanding of the contents of a data set through this exploration. Sometimes this leads directly to a presentation, if your goal is simply to analyze the characteristics of that particular data set. However, we often want to do more interesting things beyond exploratory analysis, which is where the next two skills come into play.
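To make the exploratory step concrete, here is a minimal sketch in Python using only the standard library. The data (weekly ad spend and sales) and the helper names `summarize` and `fit_line` are made up for illustration; in practice I would reach for pandas or the tidyverse instead.

```python
from statistics import mean, median, stdev

# Hypothetical data set: weekly ad spend and sales, both in thousands of dollars.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]


def summarize(values):
    """Return a few summary statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def fit_line(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


summary = summarize(sales)
slope, intercept = fit_line(ad_spend, sales)
print(summary)
print(f"sales ~ {intercept:.2f} + {slope:.2f} * ad_spend")
```

The summary describes one variable on its own, while the fitted line describes the relationship between two variables, which are the two kinds of exploration mentioned above.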


3) Statistics:
The field of statistics allows us to bring mathematical rigor to uncertainty. Often, the first step in applying a statistical method is to start with a hypothesis, which is your problem statement. Then, we can apply certain methods to determine whether some intervention or interaction has had a significant effect on the data we observe. Examples include the effect of an ad campaign on sales, or the effect of a drug treatment on patient outcomes.
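As an illustration of the ad campaign example, here is a small sketch of a permutation test, which is one of many possible methods and not necessarily the one you would pick in practice. The sales figures and the function name `permutation_test` are hypothetical. The null hypothesis is that the campaign had no effect, so the "before" and "during" labels are interchangeable.

```python
import random
from statistics import mean

# Hypothetical daily sales (units) before and during an ad campaign.
before = [20, 22, 19, 24, 21, 23, 20, 22]
during = [25, 27, 24, 29, 26, 28, 25, 27]


def permutation_test(group_a, group_b, n_resamples=5000, seed=0):
    """One-sided permutation test for mean(group_b) > mean(group_a).

    Returns the observed difference in means and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Shuffle the labels: under the null, any split is equally likely.
        rng.shuffle(pooled)
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_resamples + 1)
    return observed, p_value


observed, p_value = permutation_test(before, during)
print(f"observed lift: {observed:.2f} units/day, p-value ~ {p_value:.4f}")
```

A small p-value means that a lift this large would rarely appear if the labels were shuffled at random, which is evidence against the "no effect" hypothesis.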


4) Machine learning:
Machine learning largely focuses on making predictions from data. It is a large field that encompasses supervised and unsupervised learning, regression and classification, linear and non-linear methods, and even deep learning. It is a subfield of the greater Artificial Intelligence field. A data scientist is often tasked with creating intelligent systems from data, such as a classifier, or more complex programs built with deep learning, such as a translator or product recommender.
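To show what a classifier looks like at its simplest, here is a sketch of a k-nearest-neighbors classifier in plain Python. The training points, the labels "low" and "high", and the function name `knn_predict` are all invented for this example; a real project would more likely use a library such as scikit-learn.

```python
import math
from collections import Counter

# Hypothetical training data: (feature vector, label).
training = [
    ((1.0, 1.2), "low"),
    ((0.8, 0.9), "low"),
    ((1.1, 0.7), "low"),
    ((3.0, 3.2), "high"),
    ((3.3, 2.9), "high"),
    ((2.8, 3.1), "high"),
]


def knn_predict(train, point, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict(training, (1.0, 1.0)))  # prints "low"
print(knn_predict(training, (3.1, 3.0)))  # prints "high"
```

This is supervised learning: the model predicts a label for a new point based on labeled examples it has already seen.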


Comparisons Between Career Fields
A statistician differs in that they likely focus more on #3. A data analyst, by comparison, will focus more on #1 and #2, while a machine learning engineer will focus on productionizing #4. A data scientist is comfortable in all of these areas, though they may not have as much knowledge (especially research-wise) in any particular domain. I began my career as a data analyst, then learned machine learning, and am rounding out my experience with this Master’s program in Statistics, with the goal of being a more well-rounded data scientist.
