How-to: Analyze Coffee using Amazon Athena, S3, & Glue
I heard really cool things about Amazon Athena.
So I thought I’d give it a shot while enjoying a shot of delicious ☕ espresso.
So, what is Amazon Athena? 🧐
Athena is a serverless interactive query service offered by Amazon Web Services (AWS). It is special by how it makes it easy to analyze data using SQL. Athena is able to query both structured and unstructured types of data, and can query data that is stored in Amazon S3.
And what are we gonna be doing? 🤔
For this demo, I will be walking us through setting up and working with the following AWS Services:
We will have a coffee dataset stored in an S3 Bucket. Then we will use a Glue Crawler to create an external table schema that will be used by Athena to query our dataset in S3 to run analytical queries.
📋Note: You will need to have an existing AWS Account in order to build along with this demo, as well as a small bit of AWS knowledge.
Where is our data? 📊
I will be using this coffee quality dataset I found on Kaggle. The dataset is a .CSV file that holds upwards of 1,330 ratings for Arabica and Robusta coffees, as well as details about the national origin of the coffees, flavor profiles, region, acidity, balance, and aroma.
Let’s get started! 🏎️
1) Upload coffee dataset to S3 🪣
Before we setup anything on the Athena side, we first need to upload our coffee dataset into S3.
We will need to create an S3 Bucket first:
Once we have our S3 Bucket, next we will create two directories inside of it.
- One for storing the coffee dataset
- One for storing the Athena query results
Upload your coffee dataset (the .CSV file from Kaggle) into the appropriate S3 directory.
2) Getting started with Athena!
We will use Athena to execute the following queries against our coffee dataset:
- Retrieve all of the coffee brands in the dataset
- Sort the coffees by sweetness & rating
- Discover the relationship between the coffee beans moisture and body
Let’s get this coffee! ☕
In the AWS Management Console, we will search for Athena:
Click on ➡️ Athena.
Once you are on the main page for Amazon Athena, click on ➡️ Explore the query editor.
📋Note: I am using the older console experience since that one is a bit quicker for navigating to the data sources. You can use the new one if that works better for you, doesn’t matter! 🙂
After you click Explore the query editor, you will be presented with a screen like this:
Now, we need to connect to our coffee dataset that we have stored in our S3 Bucket.
Click on the link ➡️ Connect data source:
Our coffee dataset is stored in S3, so we will go for the Query data in Amazon S3 as our Data source.
As for our Metadata catalog, we will use AWS Glue Data Catalog.
Click ➡️ Next.
3) Let’s setup Glue
AWS Glue is a fully-managed ETL (extract, transform, load) service that enables consumers to prepare and load their datasets for analytical purposes.
In Athena, the tables are built from the metadata definitions of the data’s schema, which are described by Glue. Since S3 holds just the raw data, we need a Glue Data Catalog for storing our metadata (which tells Athena information about what data resides within our S3 Bucket).
In Glue, we will store data related to the S3 Location (such as location, data structure, column names, data types).
In turn, this metadata will then help Athena to query our coffee dataset.
You might be wondering at this point, how are we gonna get this metadata stored into our Glue Data Catalog? 🤔
We will be using a Glue Crawler! 🙂
Glue Crawlers will automatically be able to extract metadata, and create table defintions based off of that metadata for our coffee dataset.
In the last step we chose our data source (S3) and connection method (Glue).
The screen below is where we want to select Glue to set up our Glue Crawler.
Click on ➡️ Connect to AWS Glue.
And that will bring us here:
Click on ➡️ Get Started, and then click ➡️ Add table using a Crawler:
That will bring us here:
For our coffee-crawler, it will also need the Crawler source type defined as Data stores.
Now, we need to point the Glue Crawler to our S3 Bucket. We can leave the Connection field blank, since usually it is not required for S3 data sources/targets.
4) Let’s create our IAM Role 🔒
An IAM (Identity & Access Management) role entity defines a set of permissions and accesses for various AWS services/resources that need to interact with one another.
In this example, the interaction is between our Glue Crawler having access to our coffee dataset in S3.
I named mine AWSGlueServiceRole-CoffeeAnalyticsS3
💡Pro-Tip: In an Enterprise setting, when you are working with data that is frequently changing, you are likely going to have the new data being appended to S3 on an hourly, or daily cadence. For this, you should setup a schedule for your crawler to ensure that it is always working with your most recent data. You can create Glue Crawler schedules by expressing them in the cron format.
For this demo, we will pick Run on demand, since we only have that one coffee dataset we need to crawl.
Click ➡️ Next.
We have almost finished setting up our crawler. We now need to add a database for our data to be stored in:
Click ➡️ Create.
Once our Glue Catalog database has been created, we should see a screen like below:
Click ➡️ Finish.
Now that we have got our crawler setup, we can run it by clicking ➡️ Run crawler
After the crawler is done running through our S3 bucket (or crawling 😂), we will see a message like the one below:
This means that the crawler has finished going through our coffee dataset in S3, and has inspected portions of it to figure out the schema. Once we click to view it, we will see that it picked out each of the column names, as well as the data types for each column.
Now, if we look back in Athena, we will also notice our database and table have been populated with our coffee data:
One final thing — we need to set up our query results location, before we use Athena to query our data. To set this up, we can click on the link saying set up a query result location in Amazon S3.
The Athena service has to know where our results from each query execution will be stored. We can direct our results to the athena-queries-results folder we created in S3:
Click ➡️ Save.
So now that our setup is all finished, we can begin querying our coffee data in Athena.
Standard SQL commands can be used to query the data:
ORDER BY, and so on.
We will run the queries in the Athena Query Panel:
The first query I ran was this one, just to ensure that everything has been setup correctly and our data is coming back as expected.
SELECT * FROM coffee_analysis_data
Let’s try some more queries!
-- To ensure we don't get back duplicatesSELECT DISTINCT producer
Now, I want to sort the coffees by their sweetness rating to see which ones have been rated the highest:
AVG(sweetness) AS average_sweetness
company IS NOT NULL AND sweetness IS NOT NULL
GROUP BY company
ORDER BY average_sweetness DESC
Finally, I want to see if there is a relationship between the moisture present in preparation of the early coffee beans with the body that is present once the coffee is brewed:
AVG(body) AS average_body
moisture IS NOT NULL AND body IS NOT NULL
GROUP BY moisture
ORDER BY average_body DESC
Aaaannnddd, there seems to be a positive correlation between the two (moisture and body), according to our dataset. Interesting… (takes a sip of espresso☕)
In Athena, we can also view our past queries by clicking on the History tab:
And of course, our S3 Bucket’s athena-query-results folder has all of our past query results saved:
Picture above is one of the query results file I opened.
Athena and Glue are really useful if you have some raw (structured or unstructured) data that you want to quickly examine to see any potential relationships, and draw theories.
Upsolver, which is a data lake engineering company has some cool use cases of when different organizations have chosen to use Athena (link to use cases).
Thank you for following along! 🤓
Now, I will go finish up the rest of my ☕ espresso.
Want to learn more about Data Engineering? AWS Cloud?
Check out my series listed below! 🙂