What is data science?
Data science is a field at the intersection of computer science and statistics, focused on writing programs that can find insights in data. It is very math-intensive, and is closely tied to machine learning, which many people consider to be one of the most important technical developments of recent years.
What should I know in advance?
Since data science is so heavily dependent on math, you should definitely go into it with a strong math foundation. I would say that you should at least be taking Algebra 2. AP Statistics is also a helpful course here. Professional data scientists are proficient in linear algebra and multivariable calculus, as these allow you to truly understand the methods that you're using (as opposed to just using other people's code).
What is there to do?
With data science, the possibilities are endless. Datasets and statistics are everywhere. Do you like politics? Find poll data and analyze it. Have a favorite sports team? There are certainly datasets out there about your favorite players. Download some datasets on topics that interest you and see what trends you can find. Can you find the winner of the next Super Bowl? Predict the next president?
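To give a rough idea of what "finding a trend" can look like, here is a minimal sketch that fits a straight line to a handful of points using only the Python standard library. The numbers are invented purely for illustration (they are not real poll data), and `fit_line`, `weeks`, and `approval` are hypothetical names chosen for this example. A straight-line fit is only one of many possible approaches.

```python
# Toy example: fit a least-squares line to a small, MADE-UP dataset.
# Swap in real numbers from a dataset you care about.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: week number vs. an approval percentage.
weeks = [1, 2, 3, 4, 5]
approval = [42.0, 43.0, 45.0, 44.0, 46.0]

slope, intercept = fit_line(weeks, approval)

# Extend the line one week past the data. This is a naive projection,
# not a serious forecast.
next_week = 6
prediction = slope * next_week + intercept

print(f"Trend: {slope:+.2f} points per week")
print(f"Naive projection for week {next_week}: {prediction:.1f}%")
```

Real predictions (like picking a Super Bowl winner) need far more data and more careful models, but the basic loop is the same: load numbers, fit something to them, and check whether the result makes sense.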
Where can I go for inspiration?
Long-form evaluations of surveys/datasets can often be very insightful, exposing problems or confirming suspicions present in the population of interest. When done and presented well, these works can be cited by other people making arguments about that data (I have done this, often). Some of my favorite sources of inspiration are:
- Kaggle is undoubtedly the premier online data science community. If you create an account, you can access competitions (many of which are beginner-friendly and have detailed solution write-ups), browse and download public datasets that you're interested in working with, and participate in online discussion with other data scientists. This is a great place to get started.
- The results of Stack Overflow's annual developer survey are always very captivating, thorough, and relevant. Because of this, I see them get cited by numerous tech bloggers/journalists every single year. Their dataset is free to use if you're interested in analyzing this as well.
- FiveThirtyEight may be my favorite example of data science being used to make predictions in the real world. They apply near-real-time statistics to numerous current events, such as Donald Trump's (un)popularity, the outcomes of upcoming baseball games, and the effect that location has on mortality rates.
- The results of GitHub's open source survey were recently published, and although the survey wasn't originally intended to do so, it helped shed light on numerous disparities in the open source community. Their dataset is also public if you want to draw some insights of your own.
How can I start?
To get started with data science, you should pick a programming language. In my opinion, Python, R, and Julia will be your best options. Python and R are industry standards, and are used by the majority of data science professionals. They have both been around for decades, and will definitely not be going anywhere anytime soon. Because of this, there is a wealth of resources available for getting started with both of these languages. If you have some experience with one of those two and are up for a bit of a challenge, one of my personal favorites is Julia. It's a much newer language, first released in 2012 by a team with ties to MIT, but it has lots of promise, and was created with data scientists in mind.
Another important aspect of data science is being able to present your findings. Although all of the examples I listed above have a public-facing website, this is not a necessity (but it is a plus). For example, at my job at Microsoft, data scientists frequently present their findings to us in PowerPoint presentations, which take no additional programming experience to make. I also like using Jupyter notebooks for this.
Once you have a basic understanding of the programming language you want to use, you'll be ready to dive straight into your data. Python has multiple libraries (e.g. NumPy, pandas, scikit-learn) that are great for parsing and working with datasets, and much of the same functionality is either built into R and Julia or provided by open-source libraries. If you have any other questions about getting started with data science, please feel free to ask.
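As a small taste of what working with a dataset in Python can look like, here is a minimal sketch using pandas (assuming it is installed, e.g. with `pip install pandas`). The CSV contents, player names, and column names are all made up for illustration. With a real dataset you would typically pass a file path to `pd.read_csv` instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical CSV contents, invented for this example.
csv_text = """player,team,points
Avery,Hawks,21
Blake,Hawks,17
Casey,Owls,30
Devon,Owls,12
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Average points per team, sorted from highest to lowest.
team_avg = df.groupby("team")["points"].mean().sort_values(ascending=False)
print(team_avg)

# The row belonging to the highest-scoring player overall.
top = df.loc[df["points"].idxmax()]
print(f"Top scorer: {top['player']} with {top['points']} points")
```

Grouping and summarizing like this is often the first step before plotting or modeling, and R and Julia offer comparable tools for the same job.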
How can I use RStudio?
Installing RStudio on your own computer allows you to take advantage of your own computer's processing power, but if you're on a school computer, you won't be able to do this. If you have a slow laptop, this may not be the best option either. If you would like to use RStudio through your web browser, fill out this form and I will send you an email once you have been added to our club's server.