Data Science

What is data science?

Data science is an emerging field at the intersection of computer science and statistics, focused on writing programs that can find insights in data. It is very math-intensive and closely tied to machine learning, which many consider one of the most important technical developments of recent years.

What should I know in advance?

Since data science is so heavily dependent on math, you should definitely go into it with a strong math foundation. I would say that you should at least be taking Algebra 2. AP Statistics is also a helpful course here. Professional data scientists are proficient in linear algebra and multivariable calculus, as these allow you to truly understand the methods that you're using (as opposed to just using other people's code).

What is there to do?

With data science, the possibilities are endless. Datasets and statistics are everywhere. Do you like politics? Find poll data and analyze it. Have a favorite sports team? There are certainly datasets out there about your favorite players. Download some datasets on topics that interest you and see what trends you can find. Can you predict the winner of the next Super Bowl? The next president?

Where can I go for inspiration?

Long-form evaluations of surveys/datasets can often be very insightful, exposing problems or confirming suspicions present in the population of interest. When done and presented well, these works can be cited by other people making arguments about that data (I have done this, often). Some of my favorite examples of these are:

How can I start?

To get started with data science, first pick a programming language. In my opinion, Python, R, and Julia are your best options. Python and R are industry standards and are used by the majority of data science professionals. Both have been around for decades and are not going anywhere anytime soon, so there is a wealth of resources available for getting started with either one. If you have some experience with one of those two and are up for a bit of a challenge, one of my personal favorites is Julia. It's a much newer language, first released in 2012 and developed in large part at MIT, but it has lots of promise, and was created with data scientists in mind.

Another important aspect of data science is being able to present your findings. Although all of the examples I listed above had a public-facing website, this is not a necessity (but it is a plus). For example, at my job at Microsoft, data scientists frequently present their findings to us in PowerPoint presentations, which take no additional programming experience to make. I also like using Jupyter notebooks for this.

Once you have a basic understanding of the programming language you want to use, you'll be ready to dive straight into your data. Python has multiple libraries (e.g. NumPy, pandas, scikit-learn) that are great for parsing and working with datasets, and much of the same functionality is either built into R and Julia or provided by open-source libraries. If you have any other questions about getting started with data science, please feel free to ask.
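
To give a feel for what working with a dataset in Python can look like, here is a minimal sketch using pandas. The poll numbers, the column names (candidate, month, support), and the helper functions load_polls, average_support, and support_change are all made up for illustration; with a real dataset you would likely point pd.read_csv at a downloaded file instead of an inline string.

```python
import io

import pandas as pd

# Hypothetical poll data, invented purely for illustration.
CSV_TEXT = """candidate,month,support
A,Jan,41
A,Feb,43
A,Mar,46
B,Jan,44
B,Feb,42
B,Mar,40
"""


def load_polls(csv_text: str) -> pd.DataFrame:
    """Parse CSV text into a DataFrame (one row per poll result)."""
    return pd.read_csv(io.StringIO(csv_text))


def average_support(polls: pd.DataFrame) -> pd.Series:
    """Return the mean support for each candidate."""
    return polls.groupby("candidate")["support"].mean()


def support_change(polls: pd.DataFrame) -> pd.Series:
    """Return last-minus-first support per candidate, in row order."""
    grouped = polls.groupby("candidate")["support"]
    return grouped.last() - grouped.first()


polls = load_polls(CSV_TEXT)
print(average_support(polls))  # average support per candidate
print(support_change(polls))   # positive means support went up
```

In this toy data, candidate A's support rises over the three months while candidate B's falls, which is the kind of simple trend you might look for first before trying anything fancier.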

How can I use RStudio?

Installing RStudio on your own computer allows you to take advantage of your own computer's processing power, but if you're on a school computer, you won't be able to do this. If you have a slow laptop, this may not be the best option either. If you would like to use RStudio through your web browser, fill out this form and I will send you an email once you have been added to our club's server.