AWS Data Integration across Multiple Data Stores

Abdul R. Wahab
4 min readMay 7

--

Use Case / Requirements

You are tasked with joining / integrating data across two separate data stores (i.e. Redshift and S3) stored in a Single AWS account for deriving custom analytics.

High-Level Solution Design

Source: Abdul R. Wahab

Background

Redshift Spectrum, Glue, and S3 are three AWS services that work together to provide powerful analytics solutions.

In this segment, I will go over how these three services work together, with a few code examples, and also outline the pros and cons of this implementation. Of course, everything has give and take. No such thing as a perfect solution! 🙃

Related reading

Data + Cloud (Redshift, Data Mesh)

16 stories

Overview of Redshift Spectrum, Glue, and S3

As mentioned in my previous blogs, Redshift is a powerful data warehousing service that allows users to store and analyze large amounts (petabytes) of data.

S3 is a highly scalable and durable object storage service that allows users to store and retrieve data.

Glue is a fully managed ETL service that allows users to extract, transform, and load data. In this example, Glue will be used to create and manage metadata for data stored in S3. Via Glue, users can define the schema and structure of their data, making it easy to query using Redshift Spectrum.

Redshift Spectrum extends the power of Redshift by enabling users to analyze and query data that is stored in S3. By using Spectrum, users can query data in S3 using standard SQL syntax, without having to copy/unload the data into Redshift.

This saves time, infrastructure resources, data pipelines, costs, and reduces dependencies.

Prerequisites

To implement this Redshift Spectrum solution, ensure the following steps have already been completed:

  1. Create an S3 bucket and upload data.
  2. Create a Glue database and crawler.
  3. Run the crawler.
  4. Create an IAM Role that allows Redshift READ permissions to your S3 bucket and to the Glue database.

Once the setup above is complete, we can start with the Redshift Spectrum solution.

Building out the Spectrum solution

Create an external schema in Redshift

To create an external schema in Redshift, connect to your Redshift cluster using a SQL client such as SQL Workbench/DBeaver.

Execute the following SQL command to create an external schema:

CREATE EXTERNAL SCHEMA my_external_schema
FROM DATA CATALOG DATABASE 'my_glue_catalog_database'
IAM_ROLE 'my_redshift_iam_role'

Query the external schema using Redshift Spectrum

Once you have created the external schema, you can query it using Redshift Spectrum.

To query the external table, simply execute the following SQL command:

SELECT *
FROM my_external_schema.my_table
WHERE column1 = 1;

Pros and Cons of this solution

Pros:

  1. Cost-Effective:
  • Storing data in S3 is cost-effective as compared to storing data in Redshift. With Redshift Spectrum, you can query data in S3 without having to load it into Redshift, which can significantly reduce storage costs.

2. Scalability:

  • Redshift Spectrum is highly scalable and can handle queries on petabyte-scale data sets. This makes it an ideal solution for organizations with large amounts of data.

3. Flexibility:

  • With Redshift Spectrum, you can query data stored in a variety of formats, including CSV, JSON, and Parquet. This makes it easy to integrate with a wide range of data sources.

4. Easy to Use:

  • With Glue, you can easily define the schema and structure of your data, making it easy to query using Redshift Spectrum. Additionally, Glue provides a user-friendly interface for managing metadata, making it easy to discover and manage data.

Cons:

  1. Performance:
  • While Redshift Spectrum is highly scalable, it can sometimes be slower than querying data stored directly in Redshift. This is because data stored in S3 must be accessed over the network, which can introduce latency.

2. Complexity:

  • Implementing Redshift Spectrum, Glue, and S3 can be complex, particularly for teams that are new to AWS. This complexity can make it more difficult to get started with the solution.

Closing thoughts 👏

So, pretty much, Redshift Spectrum, Glue, and S3 are a good solution for joining / integrating data stored within the same AWS account across multiple data stores.

Although there are some drawbacks to this implementation, the benefits outweigh the costs for quite a few organizations.

Thanks for following along. Feel free to comment below with questions / inputs.

--

--

A Deep Dive into AWS Data Services

7 min read

May 29

AWS Redshift - Cross-Account Unload & Copy with S3

3 min read

May 26

Amazon Redshift Data Sharing - Underlying Technology

2 min read

May 8

Authenticating to AWS Redshift using Ephemeral Credentials

3 min read

Apr 29

Amazon Redshift  —  Overview

7 min read

Oct 17, 2021

Amazon Redshift: Data Consumption Patterns - Part I

3 min read

Apr 16

AWS Redshift Data sharing: Cluster-to-Cluster / Sharing to a Unified Cluster

4 min read

Apr 23

Amazon Redshift - Views: Simple, Materialized, Late-Binding

4 min read

Apr 27

Lesser Known aspects of Amazon Redshift

4 min read

Apr 28

Abdul R. Wahab

Multi-domain Technical Lead specialized in building products users love. Today, I manage & secure big data in the AWS cloud. All views shared are my own.