AWS Data Integration across Multiple Data Stores
--
Use Case / Requirements
You are tasked with joining / integrating data across two separate data stores (i.e. Redshift and S3) stored in a Single AWS account for deriving custom analytics.
High-Level Solution Design
Background
Redshift Spectrum, Glue, and S3 are three AWS services that work together to provide powerful analytics solutions.
In this segment, I will go over how these three services work together, with a few code examples, and also outline the pros and cons of this implementation. Of course, everything has give and take. No such thing as a perfect solution! 🙃
Related reading



Overview of Redshift Spectrum, Glue, and S3
As mentioned in my previous blogs, Redshift is a powerful data warehousing service that allows users to store and analyze large amounts (petabytes) of data.
S3 is a highly scalable and durable object storage service that allows users to store and retrieve data.
Glue is a fully managed ETL service that allows users to extract, transform, and load data. In this example, Glue will be used to create and manage metadata for data stored in S3. Via Glue, users can define the schema and structure of their data, making it easy to query using Redshift Spectrum.
Redshift Spectrum extends the power of Redshift by enabling users to analyze and query data that is stored in S3. By using Spectrum, users can query data in S3 using standard SQL syntax, without having to copy/unload the data into Redshift.
This saves time, infrastructure resources, data pipelines, costs, and reduces dependencies.
Prerequisites
To implement this Redshift Spectrum solution, ensure the following steps have already been completed:
- Create an S3 bucket and upload data.
- Create a Glue database and crawler.
- Run the crawler.
- Create an IAM Role that allows Redshift READ permissions to your S3 bucket and to the Glue database.
Once the setup above is complete, we can start with the Redshift Spectrum solution.
Building out the Spectrum solution
Create an external schema in Redshift
To create an external schema in Redshift, connect to your Redshift cluster using a SQL client such as SQL Workbench/DBeaver.
Execute the following SQL command to create an external schema:
CREATE EXTERNAL SCHEMA my_external_schema
FROM DATA CATALOG DATABASE 'my_glue_catalog_database'
IAM_ROLE 'my_redshift_iam_role'
Query the external schema using Redshift Spectrum
Once you have created the external schema, you can query it using Redshift Spectrum.
To query the external table, simply execute the following SQL command:
SELECT *
FROM my_external_schema.my_table
WHERE column1 = 1;
Pros and Cons of this solution
Pros:
- Cost-Effective:
- Storing data in S3 is cost-effective as compared to storing data in Redshift. With Redshift Spectrum, you can query data in S3 without having to load it into Redshift, which can significantly reduce storage costs.
2. Scalability:
- Redshift Spectrum is highly scalable and can handle queries on petabyte-scale data sets. This makes it an ideal solution for organizations with large amounts of data.
3. Flexibility:
- With Redshift Spectrum, you can query data stored in a variety of formats, including CSV, JSON, and Parquet. This makes it easy to integrate with a wide range of data sources.
4. Easy to Use:
- With Glue, you can easily define the schema and structure of your data, making it easy to query using Redshift Spectrum. Additionally, Glue provides a user-friendly interface for managing metadata, making it easy to discover and manage data.
Cons:
- Performance:
- While Redshift Spectrum is highly scalable, it can sometimes be slower than querying data stored directly in Redshift. This is because data stored in S3 must be accessed over the network, which can introduce latency.
2. Complexity:
- Implementing Redshift Spectrum, Glue, and S3 can be complex, particularly for teams that are new to AWS. This complexity can make it more difficult to get started with the solution.
Closing thoughts 👏
So, pretty much, Redshift Spectrum, Glue, and S3 are a good solution for joining / integrating data stored within the same AWS account across multiple data stores.
Although there are some drawbacks to this implementation, the benefits outweigh the costs for quite a few organizations.
Thanks for following along. Feel free to comment below with questions / inputs.