Redshift
Printed from:
Amazon Redshift SQL Cheatsheet
1. Connection and Basic Setup
psql Connection
12psql -h <cluster-endpoint> -p 5439 -U <username> -d <database>
JDBC Connection String
jdbc:redshift://endpoint:5439/database
AWS CLI Cluster Management
12345678910# List provisioned clusters
aws redshift describe-clusters
# Describe a specific cluster
aws redshift describe-clusters --cluster-identifier <cluster-name>
# List Redshift Serverless workgroups
aws redshift-serverless list-workgroups
aws redshift-serverless get-workgroup --workgroup-name <workgroup>
Redshift Data API (no persistent connection required)
123456aws redshift-data execute-statement \
--cluster-identifier <cluster-name> \
--database <database> \
--db-user <username> \
--sql "SELECT COUNT(*) FROM sales;"
2. Database, Schema, and Table Operations
Create Database
12345CREATE DATABASE warehouse
WITH
OWNER = admin
ENCODING = 'UTF8';
Create Schema
123CREATE SCHEMA analytics
AUTHORIZATION admin;
Create Table (with Distribution and Sort Keys)
1234567891011CREATE TABLE sales (
sale_id INTEGER ENCODE AZ64,
product_id INTEGER DISTKEY,
sale_date DATE,
amount DECIMAL(12,2) ENCODE AZ64,
region VARCHAR(50) ENCODE ZSTD
)
DISTSTYLE KEY
DISTKEY (product_id)
SORTKEY (sale_date);
Automatic Table Optimization (recommended default)
1234567891011-- Let Redshift choose and evolve DISTKEY/SORTKEY based on workload
CREATE TABLE sales (
sale_id INTEGER,
product_id INTEGER,
sale_date DATE,
amount DECIMAL(12,2),
region VARCHAR(50)
)
DISTSTYLE AUTO
SORTKEY AUTO;
3. Redshift-Specific Data Types and Encodings
Data Types
INTEGER,BIGINT,SMALLINTDECIMAL(precision, scale)VARCHAR(n),CHAR(n)DATE,TIMESTAMP,TIMESTAMPTZ,TIME,TIMETZBOOLEANSUPER(semi-structured JSON/nested data, queryable with PartiQL)GEOMETRY,GEOGRAPHY(spatial data)VARBYTE(variable-length binary)HLLSKETCH(HyperLogLog approximate cardinality)
Compression Encodings
AZ64: Amazon proprietary, default for most numeric, date, and timestamp typesZSTD: Best general-purpose compression, ideal for VARCHAR/CHARLZO: Legacy general-purpose (prefer ZSTD or AZ64)DELTA,DELTA32K: Optimal for sequential or incremental numeric dataBYTEDICT: Small dictionaries of unique values (≤256)RUNLENGTH: Repeated consecutive valuesMOSTLY8,MOSTLY16,MOSTLY32: Mostly small numeric valuesRAW: No compression (use only for sort key leading column when appropriate)
Note: Sort key columns no longer require
RAWencoding — AZ64/ZSTD on sort keys is supported and often beneficial.
4. Standard SQL Operations
SELECT with Advanced Features
123456789101112131415161718-- Window functions
SELECT
product_id,
sale_date,
amount,
SUM(amount) OVER (PARTITION BY product_id ORDER BY sale_date) AS cumulative_sales
FROM sales;
-- Top N with window function
SELECT * FROM (
SELECT
product_id,
amount,
RANK() OVER (ORDER BY amount DESC) AS sales_rank
FROM sales
) ranked
WHERE sales_rank <= 10;
MERGE (UPSERT)
123456MERGE INTO sales AS target
USING staging_sales AS source
ON target.sale_id = source.sale_id
WHEN MATCHED THEN UPDATE SET amount = source.amount, region = source.region
WHEN NOT MATCHED THEN INSERT VALUES (source.sale_id, source.product_id, source.sale_date, source.amount, source.region);
COPY Command (Data Loading)
1234567891011COPY sales
FROM 's3://mybucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
-- JSON load
COPY sales
FROM 's3://mybucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS JSON 'auto';
Auto-Copy from S3 (continuous ingestion)
123456COPY JOB CREATE sales_autocopy
FROM 's3://mybucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET
AUTO ON;
UNLOAD Command (Data Export)
1234567UNLOAD ('SELECT * FROM sales WHERE sale_date > GETDATE() - INTERVAL ''30 days''')
TO 's3://mybucket/sales_export/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET
PARTITION BY (region)
PARALLEL ON;
5. Distribution Styles
Distribution Styles
AUTO: Redshift chooses and evolves the style as the table grows (recommended default)EVEN: Distributes rows across slices uniformlyKEY: Distributes based on a specific columnALL: Replicates entire table to all compute nodes
Example Distribution Selection
1234567891011121314-- Good for fact tables with foreign key joins
CREATE TABLE fact_sales (
sale_id INT DISTKEY,
product_id INT,
customer_id INT
)
DISTSTYLE KEY;
-- Good for small dimension tables
CREATE TABLE dim_products (
product_id INT PRIMARY KEY
)
DISTSTYLE ALL;
6. Compression and Encoding Strategies
Best Practices
- Prefer
AZ64for numeric, date, and timestamp;ZSTDfor string/binary - Use
ANALYZE COMPRESSIONto validate encoding choices - Choose encoding based on data characteristics and cardinality
- Minimize column width
- Leverage
CREATE TABLEwithoutENCODEclauses — Redshift assigns optimal encodings automatically
123-- Analyze compression recommendations
ANALYZE COMPRESSION sales;
7. Analytics and Window Functions
Window Function Examples
1234567891011121314151617-- Running total
SELECT
sale_date,
amount,
SUM(amount) OVER (ORDER BY sale_date) AS running_total
FROM sales;
-- Percentile calculation
SELECT
product_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_sale
FROM sales
GROUP BY product_id;
-- Approximate distinct count via HyperLogLog
SELECT APPROXIMATE COUNT(DISTINCT customer_id) FROM sales;
Semi-Structured Data (SUPER + PartiQL)
123456789CREATE TABLE events (
event_id BIGINT,
payload SUPER
);
SELECT event_id, payload.user.id, payload.items[0].sku
FROM events
WHERE payload.user.country = 'US';
8. Performance Optimization
Query Tuning Tips
- Use
EXPLAINto analyze query execution plan - Minimize data movement (broadcasts and redistributions)
- Use sort and dist keys effectively, or rely on
AUTO - Avoid unnecessary joins, subqueries, and
SELECT * - Use materialized views with auto-refresh and auto-rewrite for repeat aggregations
- Enable Result Caching (on by default) and Concurrency Scaling for spiky workloads
1234567891011121314-- Explain query plan
EXPLAIN
SELECT product_id, SUM(amount)
FROM sales
GROUP BY product_id;
-- Materialized view with automatic refresh and rewrite
CREATE MATERIALIZED VIEW mv_sales_by_product
AUTO REFRESH YES
AS
SELECT product_id, SUM(amount) AS total
FROM sales
GROUP BY product_id;
Workload Management (WLM)
- Prefer Automatic WLM with Short Query Acceleration (SQA) — manual WLM is legacy
- Use query priorities (HIGHEST, HIGH, NORMAL, LOW, LOWEST) and query monitoring rules (QMR) to control runaway queries
- Concurrency Scaling adds transient clusters to handle bursts
9. User and Security Management
Create User
1234CREATE USER data_analyst
WITH PASSWORD 'SecurePass123'
IN GROUP reporting_group;
Grant Permissions
1234GRANT SELECT
ON TABLE sales
TO data_analyst;
Role-Based Access Control (RBAC)
1234CREATE ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO ROLE analyst_role;
GRANT ROLE analyst_role TO data_analyst;
Row-Level and Column-Level Security
1234567891011-- Column-level privilege
GRANT SELECT (sale_id, sale_date, amount) ON sales TO data_analyst;
-- Row-level security policy
CREATE RLS POLICY region_policy
WITH (region VARCHAR(50))
USING (region = current_setting('app.region'));
ATTACH RLS POLICY region_policy ON sales TO ROLE analyst_role;
ALTER TABLE sales ROW LEVEL SECURITY ON;
Dynamic Data Masking
12345CREATE MASKING POLICY mask_email
WITH (email VARCHAR) USING (REGEXP_REPLACE(email, '.+@', '****@'));
ATTACH MASKING POLICY mask_email ON customers (email) TO ROLE analyst_role;
10. Maintenance Operations
VACUUM and ANALYZE
12345678910-- Reclaim space and resort table
VACUUM sales;
-- Update table statistics
ANALYZE sales;
-- Comprehensive maintenance
VACUUM FULL;
ANALYZE;
Most clusters can rely on automatic VACUUM DELETE, automatic VACUUM SORT, and automatic ANALYZE — manual runs are mainly needed after large bulk loads or DML.
11. Best Practices
Columnar Storage Optimization
- Use narrow, compressed columns
- Leverage sort and distribution keys (or
AUTO) - Minimize data type sizes
- Use appropriate encoding (AZ64/ZSTD by default)
Performance Checklist
- Use COPY (Parquet preferred) for bulk loading; Auto-Copy for continuous ingestion
- Implement appropriate distribution strategy or use
DISTSTYLE AUTO - Use compression (AZ64/ZSTD)
- Rely on automatic VACUUM/ANALYZE; supplement manually after large loads
- Monitor query performance with system views
- Leverage materialized views with auto-refresh for complex aggregations
- Use Concurrency Scaling and Result Caching for read-heavy workloads
- Use Zero-ETL integrations (Aurora, RDS, DynamoDB) to avoid custom pipelines
- Use Data Sharing for cross-cluster/cross-account read access without copying data
12. Monitoring and Troubleshooting
Key System Views
Modern SYS_* monitoring views are preferred over legacy STL_*/STV_* tables and work for both provisioned and Serverless:
SYS_QUERY_HISTORY: Query execution history (replacesSTL_QUERY)SYS_QUERY_DETAIL: Step-level query detailsSYS_LOAD_HISTORY: COPY load historySYS_UNLOAD_HISTORY: UNLOAD historySYS_SERVERLESS_USAGE: RPU usage for ServerlessSTL_ALERT_EVENT_LOG: Performance alerts (provisioned only)
12345678910111213-- Check long-running queries (modern view)
SELECT
query_id,
user_id,
start_time,
end_time,
elapsed_time,
status,
query_text
FROM SYS_QUERY_HISTORY
WHERE elapsed_time > 60000000 -- microseconds; > 1 minute
ORDER BY elapsed_time DESC;
13. Redshift Serverless
Redshift Serverless removes cluster sizing — you pay per second for RPUs (Redshift Processing Units) consumed by queries.
1234567# Create a namespace and workgroup
aws redshift-serverless create-namespace --namespace-name analytics-ns
aws redshift-serverless create-workgroup \
--workgroup-name analytics-wg \
--namespace-name analytics-ns \
--base-capacity 32
- Set base capacity (RPUs) and optional max capacity to cap cost
- Use usage limits to alert/halt on spend
- Same SQL surface as provisioned Redshift; supports Data Sharing, Zero-ETL, and Spectrum
14. Federated Queries, Spectrum, and Data Sharing
Redshift Spectrum (query S3 directly)
1234567CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrumdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
SELECT * FROM spectrum_schema.s3_sales WHERE sale_date > CURRENT_DATE - 7;
Federated Query (PostgreSQL/MySQL/Aurora)
1234567CREATE EXTERNAL SCHEMA pg_ext
FROM POSTGRES
DATABASE 'prod' SCHEMA 'public'
URI 'mydb.cluster-xyz.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:pg-creds';
Data Sharing (live, no-copy cross-cluster access)
12345CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA analytics;
ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA analytics;
GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '210987654321';
Conclusion
Amazon Redshift is a powerful columnar MPP database for analytics and business intelligence, available in both provisioned and Serverless deployments. Modern Redshift emphasizes automation — AUTO distribution/sort keys, automatic VACUUM/ANALYZE, Automatic WLM, materialized view auto-refresh, and Zero-ETL integrations — letting teams focus on data and queries instead of tuning.
Key Takeaways:
- Lean on automation (
DISTSTYLE AUTO,SORTKEY AUTO, Auto WLM, auto VACUUM/ANALYZE) - Use AZ64/ZSTD encodings and Parquet for COPY/UNLOAD
- Prefer modern
SYS_*monitoring views over legacySTL_* - Adopt Data Sharing and Zero-ETL instead of copying data between systems
- Use Serverless for variable workloads; Concurrency Scaling for bursty reads
- Exploit SUPER + PartiQL for semi-structured data and Spectrum for S3 lakehouse queries
Continue Learning
Discover more cheatsheets to boost your productivity