Breaking Down Joins
V Helpful in SQL and Data Cleaning in All Data Tools
(Left table to Right Table)
Alrighty onto the newest and most fun topic in data… Joining data. (Who has said it's the most fun you ask, no one, absolutely no one)
The concept of joins is used in many-a-data tools, but most often refers to SQL. Can also be referred to as combining data… appending… merging… I’m sure there’s more...
Joining data is combining columns and their relative information from one or more tables. How you will join them is dependent on 1) how the data is set up 2) what data answers you need to output or analyze.
Another way to think about this is where your query is an invite and the rows from the left and right table are invitees. And unlike in real life everyone who is invited shows up 🎉.
Four main types of joins/invites:
Left -- Everyone who shows up in the left group plus the columns and additional information from the right table
Right -- Everyone who shows up in the right group plus the columns and additional information from the left table
Inner -- Everyone who shows up in both tables and the columns from both tables
Outer -- Everyone is invited!
Now cue the cute Venn Diagram pictures we are used to seeing:
What to consider when selecting your join type:
What is the primary and foreign key in your tables?
Are there duplicates within the columns you're matching?
Does it need to match with multiple columns?
Do you need ALL the rows from each table?
Let’s first use the example datasets of Austin Animal Center that they continuously update through the data.austintexas.gov site.
There are the animals that they intake in one file and then a separate file for the outcomes of the animals. We want to analyze the full cycle of their stay at the shelter to better understand how to help the greatest creatures on Earth.
Imagine both of these .CSVs have been imported into SQL as tables. We will be joining them based on their UNIQUEID, the animal may have entered and exited the shelter more than once so it would be MANY-to-MANY, and then we want all the animals from intake even if there isn’t a match yet in the outcomes dataset.
This would mean we want to use a LEFT JOIN.
Let’s say we wanted to only select the animals that were in both the intake and outcome tables then we would choose INNER JOIN.
Let’s say we wanted to select every single row in both tables then we would choose OUTER JOIN.
Let’s say we wanted to select all the animals that had an outcome whether or not there was a match with intake then you got yourself a RIGHT JOIN.
So this is how I typically think through what query is necessary to get the final table you NEED so all the right information is included. (AKA all the right people are invited!)
Next up, writing the SQL queries that help you properly write your invite (blushy face emoji).
Watch our SQL Series Phase 2 tutorial to get all the deets.