Amazon Redshift - A Quick Intro
For the past few months, I have been developing with Amazon Redshift. Redshift is a data warehousing service that uses SQL to analyze structured and semi-structured data across data warehouses, operational and analytical data sources, and data lakes.
Unlike traditional data warehouse solutions, Redshift offers a wide range of features to manage and store large amounts of data in the AWS Cloud. It is highly-performant, by offering concurrency scaling to maximize throughout and table optimization through its use of AWS-designed hardware and machine learning. It also offers isolation of critical workloads through the use of separate replica clusters.
How is that?
When you have large and growing amounts of data gathered from multiple sources continuing to grow everyday, as well as a large number of applications and consumers relying on that data for various business needs, you are likely to run into situations like these:
- Difficulty managing/keeping up with storage of the data volume
- Long-running SQL Queries
- Volatile and incorrect data (changes very frequently, unreliable)
Redshift comes to the rescue as the performant solution for data warehousing and storage that enables your data to be better organized, along with making it easily accessible for reporting, analytics, machine learning, and data science use cases.
Great, tell me more about Redshift!
Amazon Redshift stores data at the petabyte scale. It is fully-managed by AWS, and highly-efficient at storing analytical and historical data.
From what I have experienced so far working with Redshift, the biggest values of the service lie in how you can seamlessly connect and integrate Redshift with many types of business intelligence, machine learning applications, and other AWS services to access and query data for your use case.
Let’s say you have data stored in both Redshift, and an S3 Bucket. But, you need to access data in both sources, as well as utilize them both at the same time (i.e. perform JOIN queries on the data). You can easily create an external schema in your Redshift database, and have your schema connect to a Glue Catalog Database. Your Glue Catalog Database can connect to your S3 Bucket. Assuming you have the proper IAM Roles and Database Users setup, you can use Redshift to query your S3 Data Lake, while deciding where to migrate/consolidate your S3 Data for the long-term. And, avoid multiple JDBC connections to two different data sources.
Lastly, the functionality to separate SPI and Non-SPI Data columns, by creating separate views. From a data security perspective, this is a huge plus point, especially as you develop Enterprise-Wide applications. You can manage permissions by revoking or assigning grants at the schema level, without having to worry about inadvertently giving any teams access to SPI Data columns.
💡Pro Tip: If you are responsible for managing security access to Redshift Data, use Stored Procedures to manage the creation of database user logins and permissions. Avoid using the Superuser login as much as possible, since you run the risk of inadvertently deleting something permanently. Plus, Stored Procedures can be moved around schema-to-schema, updated as needed, and executed manually, or via automated workflows.
It’s also worth mentioning, if you require quick access to data sitting in Redshift for analysis, you do need to have your data structured (or semi-structured), and schemas already defined in advance. This can usually be accomplished through other business applications and/or transactional APIs.
Key Advantages of using Amazon Redshift
High Performing 🎭
When you have data that is large and repetitive, you need a mechanism to be able to retrieve that data efficiently, without using too many Input/Output disk operations. Too many Input/Output operations lead to decrease in performance.
Hence, Redshift utilizes columnar storage in its databases, where you can store frequently-accessed data. Unlike your traditional data warehouse that would simply store data row-by-row exactly like it was entered, columnar storage allows the databases to do a more precise search through searching values of specific columns, as opposed to checking and ignoring the un-targeted data row-by-row. This enables large savings in time.
Now, since Redshift stores the data in columns, it also has the ability to store the data by type on the disk as opposed to row-by-row. This allows an added feature of data compression, where each column is compressed by the individual data type, further enabling rapid query results and faster hits.
Scalable Horizontally ⚖️
When you are dealing with a service capable of handling huge amounts of data, that is some serious business. You must have some type of mechanism to ensure fault tolerance and scalability, as your data volume continues to grow, along with the organizational consumers relying on accessing the data.
To ensure scalability, you can spin up additional Redshift nodes through the AWS Console or Cluster API, anytime there is a need for larger storage or increased speed. And those nodes can support workloads of up to eight petabytes of compressed data.
Keep in mind, any application that is connected to the Redshift Cluster at the time will only be able to do read operations. Overall, the transition flow between new nodes and existing nodes is pretty smooth, since the data is transferred in a parallel fashion between the nodes. Concurrency is also an advantage, as Redshift is claimed to be able to support an unlimited number of queries and users by giving transient capacity within seconds as the need for concurrent operations increase.
Redshift offers several security options. For Network Isolation, Cluster security groups are created within each Redshift Cluster. Data Encryption comes built-in with the service. You can also enable Cluster Encryption while launching a new Redshift cluster, encrypting all of the data stored within.
Client-Side and Server-Side encryption may also be used when loading data from an external source like an S3 Bucket, or a Glue Database Catalog.
Reasonable Pricing 💲
Like many other cloud-based services, Redshift is much cheaper than various on-premise solutions. Customers have the flexibility to determine the expense as operational or capital expenses.
The Redshift Query Engine (like the interface for PostgreSQL DB) is based on ParAccel. So if you have some experience with Postgres, Redshift is pretty easy to interact with. Plus, it is readily compatible with PostgreSQL ODBC/JDBC drivers.
Vast Storage Range 🗄️
Redshift deals with storing data beginning at the petabyte-range. Customers have the option to opt-in for Dense Storage compute nodes that give large storage space. And of course, you can spin-up additional nodes to a Redshift Cluster in order to gain a larger range beyond the default one petabyte.
Overall, Amazon Redshift is a great option for companies looking to migrate their large volumes of structured analytic data to the cloud. Usage and cost are relatively easy to understand, but special attention should (always) be given to security. Things like, who can create objects, who can/cannot access SPI/Non-SPI data, etc. are very important to determine as you migrate.
Similar Products by other Cloud Providers
Want to learn more about Data Engineering? AWS Cloud?
Check out my series listed below! 🙂