In Redshift, there are 2 primary methods for loading data into a Redshift table:
One might ask, what’s the difference? When is it best to use which one?
Per the official AWS docs,
INSERT is a SQL command used to, well, insert data row-by-row into a Redshift table.
INSERT, every individual row is inserted as a separate transaction, which can result in slower performance compared to
COPY for large data sets.
When to use INSERT?
INSERT is typically used for small-scale data updates / inserts, or for inserting data incrementally.
COPY is a command used to load data in bulk from one or more data sources into a Redshift table.
COPY supports loading data from several sources, including S3, Amazon EMR, DynamoDB, and more.
- It can handle various file formats such as CSV, JSON, Avro, and more.
COPYcan perform parallel data loading, leveraging the Redshift cluster's resources for efficient data ingestion.
- It automatically manages data distribution and parallelism to optimize the loading process.
When to use COPY?
It’s optimized for high-speed data ingestion, and for larger datasets is generally much faster than
Key Factors when deciding between INSERT & COPY
When deciding between
COPY, consider the following factors:
For large data sets,
COPY is generally the preferred method due to its faster performance.
INSERT is better suited for smaller data volumes.
If the data is already in a format that Redshift can directly ingest from, such as CSV or JSON files in S3,
COPY can streamline the process.
If transactional integrity is critical and data needs to be inserted in a row-by-row manner,
INSERT provides more control to ensure data integrity.
If data requires transformation or preprocessing before loading it into Redshift,
INSERT allows for more flexibility in applying custom logic.
Closing thoughts 👏
COPY is the recommended method for high-performance bulk data loading into Redshift, especially for larger data sets.
For smaller-scale or incremental data inserts,
INSERT provides more control and flexibility.
Rule of 👍 is to consider the specific requirements and characteristics of your data loading process when choosing between