AWS Redshift - INSERT & COPY Commands
--
Background
In Redshift, there are 2 primary methods for loading data into a Redshift table: INSERT
and COPY
commands.
One might ask, what’s the difference? When is it best to use which one?
INSERT
Command
Per the official AWS docs, INSERT
is a SQL command used to, well, insert data row-by-row into a Redshift table.
When using INSERT
, every individual row is inserted as a separate transaction, which can result in slower performance compared to COPY
for large data sets.
When to use INSERT?
INSERT
is typically used for small-scale data updates / inserts, or for inserting data incrementally.
COPY Command
COPY
is a command used to load data in bulk from one or more data sources into a Redshift table.
COPY
supports loading data from several sources, including S3, Amazon EMR, DynamoDB, and more.
- It can handle various file formats such as CSV, JSON, Avro, and more.
COPY
can perform parallel data loading, leveraging the Redshift cluster's resources for efficient data ingestion.- It automatically manages data distribution and parallelism to optimize the loading process.
When to use COPY?
It’s optimized for high-speed data ingestion, and for larger datasets is generally much faster than INSERT
statements.
Key Factors when deciding between INSERT & COPY
When deciding between INSERT
and COPY
, consider the following factors:
Data size
For large data sets, COPY
is generally the preferred method due to its faster performance. INSERT
is better suited for smaller data volumes.
Data source
If the data is already in a format that Redshift can directly ingest from, such as CSV or JSON files in S3, COPY
can streamline the process.
Transactional integrity
If transactional integrity is critical and data needs to be inserted in a row-by-row manner, INSERT
provides more control to ensure data integrity.
Data transformation
If data requires transformation or preprocessing before loading it into Redshift, INSERT
allows for more flexibility in applying custom logic.
Closing thoughts 👏
In summary, COPY
is the recommended method for high-performance bulk data loading into Redshift, especially for larger data sets.
For smaller-scale or incremental data inserts, INSERT
provides more control and flexibility.
Rule of 👍 is to consider the specific requirements and characteristics of your data loading process when choosing between INSERT
and COPY
.