Upload data the right way

Best practices for loading data into Amazon Redshift from S3

Amazon Redshift gives us the option to load data from multiple sources such as S3, DynamoDB, and EMR, and in different formats such as CSV, Avro, Parquet, and so on.

In my experience, an enterprise setup generally works best when all the data lands in S3 first, is staged there in logical buckets and folders, and is then loaded into Redshift.

Before we write the scripts to load the data from S3 into Redshift, there is a short checklist to complete:

  1. Create an IAM role with read access to S3 and associate it with the Redshift cluster. (Keep the ARN of this role handy for the scripts later.) This association ensures Redshift can read from the S3 buckets.
  2. Do a cursory check of all the source data files, their delimiters, and the data size for first-time uploads.
  3. In the COPY command, the AWS resource ARN is the ARN of the role created for Redshift access, as discussed in Step 1. In my first attempt I was vainly looking for the cluster ARN!
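To make Step 1 and Step 3 concrete, here is a minimal COPY sketch. The table, bucket, account ID, and role name are all hypothetical placeholders; substitute your own:

```sql
-- Load CSV files from a staging prefix in S3 into an existing table.
-- The IAM_ROLE value is the ARN of the role associated with the cluster,
-- not the cluster's own ARN.
COPY sales_staging
FROM 's3://my-staging-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV;
```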

Assuming the necessary tables already exist in the Redshift cluster, we can proceed to loading the data from S3. Along the way I have picked up some practices that can make the entire data load go more smoothly. Here are a few:

  1. For a full load with data sizes running into GBs, it is advisable to point Athena at the source files in S3 and do a round of data validation first, especially for consistency, NULL values, date/time formats, and floating-point numbers. This one-time activity ensures the data quality is good enough to load.
  2. If a source file is in GBs, it is advisable to split it into multiple files, logically ordered by the primary key. This lets the load fully use the parallel COPY operation across all the slices in the configured nodes.
  3. The number of splits should equal, or be a multiple of, the number of slices in the chosen cluster. The number of slices can be determined with select count(*) from stv_slices.
  4. Each file should generally be between 1 MB and 1 GB after gzip compression.
  5. Once the files are split, we can use the manifest option (https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html) to specify exactly which files in S3 should be loaded into the Redshift cluster.
  6. If the source files have headers, include IGNOREHEADER 1 in the COPY command to skip them while loading.
  7. If the source files have blank values that we have agreed to treat as NULL, include the BLANKSASNULL option in the COPY command.
  8. The Query Editor in the Redshift console has the strange behavior of reporting a load as successful even when the underlying COPY command failed. This puts us in a fix: the query completes successfully, but selecting from the table returns no rows. In this scenario, check the STL_LOAD_ERRORS table for the errors (https://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html).
  9. If we want to aggregate the data in S3 before loading it into Redshift, or apply column changes or calculations on the way in, we can create an external table with Redshift Spectrum and then run an INSERT INTO the Redshift table with a SELECT from the external table, performing the calculations on the fly.
  10. The tables in the Redshift cluster may use a different datetime format than the source files, and a mismatch leads to load errors. This can be overcome by specifying the datetime format in the COPY command (DATEFORMAT/TIMEFORMAT) or setting it to 'auto' for implicit conversion.
  11. A general tip in an enterprise setup is to use SQL client tools like SQL Workbench/J or Aginity Pro to interact with the Redshift cluster. Both are available for free and are an easy transition for developers adept at SQL Server Management Studio, TOAD, or similar client tools.
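Putting several of the points above together (slice count, manifest, compression, headers, blanks, and date formats), a COPY invocation might look like the sketch below. The table, bucket, manifest path, and role ARN are all hypothetical:

```sql
-- Count the slices in the cluster to decide how many file splits to create
-- (the split count should be this number or a multiple of it).
SELECT COUNT(*) FROM stv_slices;

-- Load the gzipped splits listed in a manifest file, skipping header rows,
-- treating blanks as NULL, and auto-detecting the date format.
COPY sales_staging
FROM 's3://my-staging-bucket/manifests/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
MANIFEST
GZIP
CSV
IGNOREHEADER 1
BLANKSASNULL
DATEFORMAT 'auto';
```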
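When a load fails silently as in point 8, a quick query against the system table usually reveals the cause, including the offending file, line, and column:

```sql
-- Most recent load errors first; err_reason explains why each row was rejected.
SELECT query, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
```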
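And for point 9, the Spectrum-based aggregate-on-the-fly pattern looks roughly like this; the external schema, tables, and columns are hypothetical names for illustration:

```sql
-- spectrum_schema.sales_raw is an external table over the files in S3;
-- sales_fact is the target table inside the Redshift cluster.
-- The cast and aggregation happen during the load itself.
INSERT INTO sales_fact (sale_date, region, total_amount)
SELECT CAST(sale_date AS DATE), region, SUM(amount)
FROM spectrum_schema.sales_raw
GROUP BY sale_date, region;
```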