Amazon Redshift SQL Cheatsheet

1. Connection and Basic Setup

psql Connection

Bash

1
2
psql -h <cluster-endpoint> -p 5439 -U <username> -d <database>

JDBC Connection String

jdbc:redshift://endpoint:5439/database

AWS CLI Cluster Management

Bash

1
2
3
4
5
6
7
8
9
10
# List provisioned clusters
aws redshift describe-clusters

# Describe a specific cluster
aws redshift describe-clusters --cluster-identifier <cluster-name>

# List Redshift Serverless workgroups
aws redshift-serverless list-workgroups
aws redshift-serverless get-workgroup --workgroup-name <workgroup>

Redshift Data API (no persistent connection required)

Bash

1
2
3
4
5
6
aws redshift-data execute-statement \
    --cluster-identifier <cluster-name> \
    --database <database> \
    --db-user <username> \
    --sql "SELECT COUNT(*) FROM sales;"

2. Database, Schema, and Table Operations

Create Database

SQL

1
2
3
4
5
CREATE DATABASE warehouse
WITH 
    OWNER = admin 
    ENCODING = 'UTF8';

Create Schema

SQL

1
2
3
CREATE SCHEMA analytics 
AUTHORIZATION admin;

Create Table (with Distribution and Sort Keys)

SQL

1
2
3
4
5
6
7
8
9
10
11
CREATE TABLE sales (
    sale_id INTEGER ENCODE AZ64,
    product_id INTEGER DISTKEY,
    sale_date DATE,
    amount DECIMAL(12,2) ENCODE AZ64,
    region VARCHAR(50) ENCODE ZSTD
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (sale_date);

Automatic Table Optimization (recommended default)

SQL

1
2
3
4
5
6
7
8
9
10
11
-- Let Redshift choose and evolve DISTKEY/SORTKEY based on workload
CREATE TABLE sales (
    sale_id INTEGER,
    product_id INTEGER,
    sale_date DATE,
    amount DECIMAL(12,2),
    region VARCHAR(50)
)
DISTSTYLE AUTO
SORTKEY AUTO;

3. Redshift-Specific Data Types and Encodings

Data Types

INTEGER, BIGINT, SMALLINT
DECIMAL(precision, scale)
VARCHAR(n), CHAR(n)
DATE, TIMESTAMP, TIMESTAMPTZ, TIME, TIMETZ
BOOLEAN
SUPER (semi-structured JSON/nested data, queryable with PartiQL)
GEOMETRY, GEOGRAPHY (spatial data)
VARBYTE (variable-length binary)
HLLSKETCH (HyperLogLog approximate cardinality)

Compression Encodings

AZ64: Amazon proprietary, default for most numeric, date, and timestamp types
ZSTD: Best general-purpose compression, ideal for VARCHAR/CHAR
LZO: Legacy general-purpose (prefer ZSTD or AZ64)
DELTA, DELTA32K: Optimal for sequential or incremental numeric data
BYTEDICT: Small dictionaries of unique values (≤256)
RUNLENGTH: Repeated consecutive values
MOSTLY8, MOSTLY16, MOSTLY32: Mostly small numeric values
RAW: No compression (use only for sort key leading column when appropriate)

Note: Sort key columns no longer require RAW encoding — AZ64/ZSTD on sort keys is supported and often beneficial.

4. Standard SQL Operations

SELECT with Advanced Features

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Window functions
SELECT 
    product_id, 
    sale_date, 
    amount,
    SUM(amount) OVER (PARTITION BY product_id ORDER BY sale_date) AS cumulative_sales
FROM sales;

-- Top N with window function
SELECT * FROM (
    SELECT 
        product_id, 
        amount,
        RANK() OVER (ORDER BY amount DESC) AS sales_rank
    FROM sales
) ranked
WHERE sales_rank <= 10;

MERGE (UPSERT)

SQL

1
2
3
4
5
6
MERGE INTO sales AS target
USING staging_sales AS source
ON target.sale_id = source.sale_id
WHEN MATCHED THEN UPDATE SET amount = source.amount, region = source.region
WHEN NOT MATCHED THEN INSERT VALUES (source.sale_id, source.product_id, source.sale_date, source.amount, source.region);

COPY Command (Data Loading)

SQL

1
2
3
4
5
6
7
8
9
10
11
COPY sales 
FROM 's3://mybucket/sales_data/' 
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;

-- JSON load
COPY sales 
FROM 's3://mybucket/sales_data/' 
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto';

Auto-Copy from S3 (continuous ingestion)

SQL

1
2
3
4
5
6
COPY JOB CREATE sales_autocopy
FROM 's3://mybucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET
AUTO ON;

UNLOAD Command (Data Export)

SQL

1
2
3
4
5
6
7
UNLOAD ('SELECT * FROM sales WHERE sale_date > GETDATE() - INTERVAL ''30 days''')
TO 's3://mybucket/sales_export/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARTITION BY (region)
PARALLEL ON;

5. Distribution Styles

Distribution Styles

AUTO: Redshift chooses and evolves the style as the table grows (recommended default)
EVEN: Distributes rows across slices uniformly
KEY: Distributes based on a specific column
ALL: Replicates entire table to all compute nodes

Example Distribution Selection

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- Good for fact tables with foreign key joins
CREATE TABLE fact_sales (
    sale_id INT DISTKEY,
    product_id INT,
    customer_id INT
)
DISTSTYLE KEY;

-- Good for small dimension tables
CREATE TABLE dim_products (
    product_id INT PRIMARY KEY
)
DISTSTYLE ALL;

6. Compression and Encoding Strategies

Best Practices

Prefer AZ64 for numeric, date, and timestamp; ZSTD for string/binary
Use ANALYZE COMPRESSION to validate encoding choices
Choose encoding based on data characteristics and cardinality
Minimize column width
Leverage CREATE TABLE without ENCODE clauses — Redshift assigns optimal encodings automatically

SQL

1
2
3
-- Analyze compression recommendations
ANALYZE COMPRESSION sales;

7. Analytics and Window Functions

Window Function Examples

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
-- Running total
SELECT 
    sale_date, 
    amount,
    SUM(amount) OVER (ORDER BY sale_date) AS running_total
FROM sales;

-- Percentile calculation
SELECT 
    product_id,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_sale
FROM sales
GROUP BY product_id;

-- Approximate distinct count via HyperLogLog
SELECT APPROXIMATE COUNT(DISTINCT customer_id) FROM sales;

Semi-Structured Data (SUPER + PartiQL)

SQL

1
2
3
4
5
6
7
8
9
CREATE TABLE events (
    event_id BIGINT,
    payload SUPER
);

SELECT event_id, payload.user.id, payload.items[0].sku
FROM events
WHERE payload.user.country = 'US';

8. Performance Optimization

Query Tuning Tips

Use EXPLAIN to analyze query execution plan
Minimize data movement (broadcasts and redistributions)
Use sort and dist keys effectively, or rely on AUTO
Avoid unnecessary joins, subqueries, and SELECT *
Use materialized views with auto-refresh and auto-rewrite for repeat aggregations
Enable Result Caching (on by default) and Concurrency Scaling for spiky workloads

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- Explain query plan
EXPLAIN 
SELECT product_id, SUM(amount) 
FROM sales 
GROUP BY product_id;

-- Materialized view with automatic refresh and rewrite
CREATE MATERIALIZED VIEW mv_sales_by_product
AUTO REFRESH YES
AS
SELECT product_id, SUM(amount) AS total
FROM sales
GROUP BY product_id;

Workload Management (WLM)

Prefer Automatic WLM with Short Query Acceleration (SQA) — manual WLM is legacy
Use query priorities (HIGHEST, HIGH, NORMAL, LOW, LOWEST) and query monitoring rules (QMR) to control runaway queries
Concurrency Scaling adds transient clusters to handle bursts

9. User and Security Management

Create User

SQL

1
2
3
4
CREATE USER data_analyst 
WITH PASSWORD 'SecurePass123'
IN GROUP reporting_group;

Grant Permissions

SQL

1
2
3
4
GRANT SELECT 
ON TABLE sales 
TO data_analyst;

Role-Based Access Control (RBAC)

SQL

1
2
3
4
CREATE ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO ROLE analyst_role;
GRANT ROLE analyst_role TO data_analyst;

Row-Level and Column-Level Security

SQL

1
2
3
4
5
6
7
8
9
10
11
-- Column-level privilege
GRANT SELECT (sale_id, sale_date, amount) ON sales TO data_analyst;

-- Row-level security policy
CREATE RLS POLICY region_policy
WITH (region VARCHAR(50))
USING (region = current_setting('app.region'));

ATTACH RLS POLICY region_policy ON sales TO ROLE analyst_role;
ALTER TABLE sales ROW LEVEL SECURITY ON;

Dynamic Data Masking

SQL

1
2
3
4
5
CREATE MASKING POLICY mask_email
WITH (email VARCHAR) USING (REGEXP_REPLACE(email, '.+@', '****@'));

ATTACH MASKING POLICY mask_email ON customers (email) TO ROLE analyst_role;

10. Maintenance Operations

VACUUM and ANALYZE

SQL

1
2
3
4
5
6
7
8
9
10
-- Reclaim space and resort table
VACUUM sales;

-- Update table statistics
ANALYZE sales;

-- Comprehensive maintenance
VACUUM FULL;
ANALYZE;

Most clusters can rely on automatic VACUUM DELETE, automatic VACUUM SORT, and automatic ANALYZE — manual runs are mainly needed after large bulk loads or DML.

11. Best Practices

Columnar Storage Optimization

Use narrow, compressed columns
Leverage sort and distribution keys (or AUTO)
Minimize data type sizes
Use appropriate encoding (AZ64/ZSTD by default)

Performance Checklist

Use COPY (Parquet preferred) for bulk loading; Auto-Copy for continuous ingestion
Implement appropriate distribution strategy or use DISTSTYLE AUTO
Use compression (AZ64/ZSTD)
Rely on automatic VACUUM/ANALYZE; supplement manually after large loads
Monitor query performance with system views
Leverage materialized views with auto-refresh for complex aggregations
Use Concurrency Scaling and Result Caching for read-heavy workloads
Use Zero-ETL integrations (Aurora, RDS, DynamoDB) to avoid custom pipelines
Use Data Sharing for cross-cluster/cross-account read access without copying data

12. Monitoring and Troubleshooting

Key System Views

Modern SYS_* monitoring views are preferred over legacy STL_*/STV_* tables and work for both provisioned and Serverless:

SYS_QUERY_HISTORY: Query execution history (replaces STL_QUERY)
SYS_QUERY_DETAIL: Step-level query details
SYS_LOAD_HISTORY: COPY load history
SYS_UNLOAD_HISTORY: UNLOAD history
SYS_SERVERLESS_USAGE: RPU usage for Serverless
STL_ALERT_EVENT_LOG: Performance alerts (provisioned only)

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Check long-running queries (modern view)
SELECT 
    query_id, 
    user_id,
    start_time, 
    end_time, 
    elapsed_time,
    status,
    query_text
FROM SYS_QUERY_HISTORY
WHERE elapsed_time > 60000000  -- microseconds; > 1 minute
ORDER BY elapsed_time DESC;

13. Redshift Serverless

Redshift Serverless removes cluster sizing — you pay per second for RPUs (Redshift Processing Units) consumed by queries.

Bash

1
2
3
4
5
6
7
# Create a namespace and workgroup
aws redshift-serverless create-namespace --namespace-name analytics-ns
aws redshift-serverless create-workgroup \
    --workgroup-name analytics-wg \
    --namespace-name analytics-ns \
    --base-capacity 32

Set base capacity (RPUs) and optional max capacity to cap cost
Use usage limits to alert/halt on spend
Same SQL surface as provisioned Redshift; supports Data Sharing, Zero-ETL, and Spectrum

Redshift Spectrum (query S3 directly)

SQL

1
2
3
4
5
6
7
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

SELECT * FROM spectrum_schema.s3_sales WHERE sale_date > CURRENT_DATE - 7;

Federated Query (PostgreSQL/MySQL/Aurora)

SQL

1
2
3
4
5
6
7
CREATE EXTERNAL SCHEMA pg_ext
FROM POSTGRES
DATABASE 'prod' SCHEMA 'public'
URI 'mydb.cluster-xyz.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:pg-creds';

SQL

1
2
3
4
5
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA analytics;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA analytics;
GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '210987654321';

Amazon Redshift is a powerful columnar MPP database for analytics and business intelligence, available in both provisioned and Serverless deployments. Modern Redshift emphasizes automation — AUTO distribution/sort keys, automatic VACUUM/ANALYZE, Automatic WLM, materialized view auto-refresh, and Zero-ETL integrations — letting teams focus on data and queries instead of tuning.

Key Takeaways:

Lean on automation (DISTSTYLE AUTO, SORTKEY AUTO, Auto WLM, auto VACUUM/ANALYZE)
Use AZ64/ZSTD encodings and Parquet for COPY/UNLOAD
Prefer modern SYS_* monitoring views over legacy STL_*
Adopt Data Sharing and Zero-ETL instead of copying data between systems
Use Serverless for variable workloads; Concurrency Scaling for bursty reads
Exploit SUPER + PartiQL for semi-structured data and Spectrum for S3 lakehouse queries

Redshift

Continue Learning