How To Get Started With SQL for Data Science
Learning SQL for data science can be daunting. The language looks different from what you’re used to, and most of the available tutorials focus on using SQL syntax to accomplish specific tasks.
Learning SQL from scratch can feel like learning how to solve problems in a new programming language: instead of looking up how to write loops or build functions, you’re trying to understand how you’d use this toolset to get any meaningful work done.
This article will teach you three things:
How do I connect my database software (e.g., MySQL) with my coding environment (e.g., Python)?
How do I query my data?
How do I perform simple manipulations on the data?
Our goal is to not only teach you the basics of SQL but also how to think about solving problems in SQL. The best way to do that is to walk through a simple example together.
You can follow along with this article by downloading our sample dataset here. If you want to query your own data instead, you’ll need access to MySQL, Python, and Pandas. We’ll be using Pandas in what follows since it’s one of the most common tools data scientists use when working with databases.
Let’s get started!
Connecting With MySQL
All we have at our disposal are three CSV files, so let’s get them loaded into a database and start doing some interesting things with them. For the sake of simplicity, we’re going to use MySQL. If you have a local installation of MySQL and want to follow along, go ahead and set that up now. If not, a number of cloud providers let you create a MySQL instance in a few minutes. One of the easier options is probably Amazon RDS, since it needs little setup on your part beyond creating an account.
If this is your first time using MySQL, here’s how you might get started:
1. Log into the AWS Console.
2. Make sure MySQL is selected from the list of services on the left-hand side. If not, click that option and wait for it to load.
3. Once that has loaded, click “Create Database”.
4. Give your database whatever name you want, then click “Advanced”.
5. Select your preferred engine from the list.
6. Use the default character set.
7. Click “Create”!
Your database has now been created and is ready to roll with whatever crazy data you throw at it. If you head over to the Instances section of AWS and select your newly-created database, you’ll be able to see just how much space we have available:
A little less than 16 GB is pretty generous for a database where all we’re planning on storing are three CSV files containing about 130 MB of data each.
Now that our database is up and running, let’s connect Python to it so we can get those CSV files loaded into MySQL and start doing some actual analysis.
With a MySQL server set up and ready to go, how do we connect Python to it? There are a number of options available for this, but one of the most popular is SQLAlchemy. If you don’t already have it installed, you can install it with pip: open up your terminal/command-line interface and type pip install -U sqlalchemy. This should take less than a minute. SQLAlchemy also needs a separate MySQL driver package to talk to the server; the official SQLAlchemy documentation lists the supported options.
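To make the connection step concrete, here is a minimal sketch of building a connection URL and creating an engine. It assumes the PyMySQL driver (installed with pip install pymysql), which the article does not name; the helper names build_mysql_url and make_engine, and the user, password, host, and database values, are placeholders made up for illustration.

```python
from urllib.parse import quote


def build_mysql_url(user: str, password: str, host: str, database: str, port: int = 3306) -> str:
    """Build a SQLAlchemy connection URL for MySQL (assumes the PyMySQL driver)."""
    # Percent-encode the credentials so characters such as "@" or "/" do not break the URL
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"mysql+pymysql://{safe_user}:{safe_password}@{host}:{port}/{database}"


def make_engine(url: str):
    """Create a SQLAlchemy engine; no connection is opened until the engine is first used."""
    # Imported inside the function so the URL helper works even before SQLAlchemy is installed
    from sqlalchemy import create_engine

    # pool_pre_ping checks that a pooled connection is still alive before reusing it
    return create_engine(url, pool_pre_ping=True)
```

With your own credentials filled in, something like engine = make_engine(build_mysql_url("analyst", "secret", "localhost", "sampledb")) should give you an engine object that Pandas can use. For a cloud instance, the host would be the endpoint your provider shows for the database.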
Now let’s get those CSV files loaded into our new database! First off, we’ll need to import them as Pandas data frames:

    import pandas as pd

    # Read each sample CSV file into its own data frame
    df = pd.read_csv('./sample-data/iris.csv')
    df2 = pd.read_csv('./sample-data/mtcars.csv')
    df3 = pd.read_csv('./sample-data/chicagoCrime1.csv')
Let’s check the size of this data:
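One simple way to do this check is to look at the size of each file on disk. The sketch below uses only the Python standard library; the helper names file_size_mb and report_sizes are made up for this example, and the paths you pass in would be the three sample files from the previous step.

```python
import os


def file_size_mb(path: str) -> float:
    """Return the on-disk size of a file in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000


def report_sizes(paths: list[str]) -> dict[str, float]:
    """Map each existing path to its size in MB, rounded to one decimal; missing files are skipped."""
    return {p: round(file_size_mb(p), 1) for p in paths if os.path.exists(p)}


# The in-memory footprint of a loaded data frame can differ from the file size;
# Pandas reports it in bytes with df.memory_usage(deep=True).sum().
```

For example, report_sizes(['./sample-data/iris.csv', './sample-data/mtcars.csv', './sample-data/chicagoCrime1.csv']) returns a dictionary of sizes for whichever of those files exist locally.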
Great, we’ve got around 130 MB of data to work with in each file. Since we only need three CSV files for this example, we can easily fit all of our data in memory. If you’re still following along, awesome; if not, go ahead and download those CSV files now so you’re ready to load them into your MySQL instance.
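For readers who want to try the loading step right away, here is a minimal sketch of writing the data frames to MySQL with the Pandas to_sql method. The helper name load_frames, the table names in the usage note, and the chunk size are assumptions made for illustration. The con argument is expected to be a SQLAlchemy engine connected to your database.

```python
from typing import TYPE_CHECKING, Any, List, Mapping

if TYPE_CHECKING:  # used for type hints only, so pandas is not imported at runtime here
    import pandas as pd


def load_frames(frames: Mapping[str, "pd.DataFrame"], con: Any, chunksize: int = 10_000) -> List[str]:
    """Write each data frame to a table named after its key and return the table names."""
    written = []
    for table_name, frame in frames.items():
        frame.to_sql(
            table_name,
            con=con,
            if_exists="replace",  # drop and recreate the table if it already exists
            index=False,          # do not store the data frame index as a column
            chunksize=chunksize,  # insert rows in batches rather than all at once
        )
        written.append(table_name)
    return written
```

A call such as load_frames({"iris": df, "mtcars": df2, "chicago_crime": df3}, engine) would then create one table per file. Note that if_exists="replace" overwrites an existing table of the same name, so switch to "append" or "fail" if that is not what you want.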
Conclusion:
We’re now all set up and ready to go! Just like that, we’ve got three data sets (Iris, mtcars, and Chicago crime) loaded into memory and ready for action. Now let’s get those babies in MySQL so we can do some fun stuff with them.