BigQuery SQL Cheatsheet: Petabyte-Scale Analytics

1. Connection and Basic Setup

gcloud CLI Authentication

Bash

1
2
3
4
5
6
7
8
9
10
# Install gcloud CLI
curl https://sdk.cloud.google.com | bash

# Initialize and authenticate
gcloud init
gcloud auth login

# Set active project
gcloud config set project YOUR_PROJECT_ID

Web UI Access

Navigate to https://console.cloud.google.com/bigquery
Select project from dropdown
Use web interface for query writing, table management

2. Project, Dataset, and Table Operations

Create Dataset

SQL

1
2
3
4
5
6
7
-- Standard SQL
CREATE SCHEMA `project.dataset`
OPTIONS (
  description = 'My analytics dataset',
  labels = [('environment', 'production')]
);

Create Table

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
-- Create table with schema
CREATE TABLE `project.dataset.table` (
  id INT64,
  name STRING,
  created_at TIMESTAMP
);

-- Create table from query
CREATE TABLE `project.dataset.derived_table` AS
SELECT column1, column2
FROM `source_table`
WHERE condition;

3. BigQuery-Specific Data Types

STRUCT

SQL

1
2
3
4
5
6
7
8
9
10
-- Creating a STRUCT
SELECT STRUCT(
  'John' AS first_name, 
  'Doe' AS last_name
) AS person;

-- Accessing STRUCT fields
SELECT person.first_name
FROM table;

ARRAY

SQL

1
2
3
4
5
6
7
-- Creating an ARRAY
SELECT ['apple', 'banana', 'cherry'] AS fruits;

-- Unnesting ARRAY
SELECT fruit
FROM table, UNNEST(fruits) AS fruit;

GEOGRAPHY

SQL

1
2
3
4
5
6
7
-- Creating a geographic point
SELECT GEOGRAPHY('POINT(-122.4194 37.7749)') AS san_francisco;

-- Calculating distance
SELECT ST_DISTANCE(point1, point2) AS distance
FROM geo_table;

4. Standard SQL Operations

SELECT with Advanced Filtering

SQL

1
2
3
4
5
6
7
8
9
10
SELECT 
  column1, 
  column2
FROM `project.dataset.table`
WHERE 
  column1 IS NOT NULL
  AND column2 BETWEEN 10 AND 100
ORDER BY column1 DESC
LIMIT 1000;

INSERT (Streaming)

SQL

1
2
3
4
5
INSERT INTO `project.dataset.table` 
  (id, name, created_at)
VALUES 
  (1, 'Test', CURRENT_TIMESTAMP());

UPDATE

SQL

1
2
3
4
5
6
UPDATE `project.dataset.table`
SET 
  status = 'active',
  updated_at = CURRENT_TIMESTAMP()
WHERE id = 123;

DELETE

SQL

1
2
3
4
DELETE FROM `project.dataset.table`
WHERE status = 'inactive'
  AND created_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);

5. Analytics Functions

Window Functions

SQL

1
2
3
4
5
6
7
8
-- Ranking within partitions
SELECT 
  department,
  employee_name,
  salary,
  RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
FROM employees;

Statistical Functions

SQL

1
2
3
4
5
6
7
-- Advanced statistical analysis
SELECT 
  AVG(value) AS mean,
  STDDEV(value) AS standard_deviation,
  APPROX_QUANTILES(value, 4) AS quartiles
FROM metrics;

6. Array and Struct Operations

Array Aggregation

SQL

1
2
3
4
5
6
7
-- Transform rows to array
SELECT 
  user_id, 
  ARRAY_AGG(DISTINCT product) AS purchased_products
FROM purchases
GROUP BY user_id;

Struct Manipulation

SQL

1
2
3
4
5
6
7
8
9
10
-- Complex struct operations
SELECT 
  ARRAY(
    SELECT AS STRUCT 
      name, 
      age, 
      ARRAY(SELECT score FROM UNNEST(scores)) AS test_scores
    FROM users
  ) AS user_details;

7. Date and Time Functions

Date Manipulation

SQL

1
2
3
4
5
6
7
-- Date arithmetic and parsing
SELECT 
  DATE_ADD(current_date, INTERVAL 30 DAY) AS future_date,
  EXTRACT(YEAR FROM timestamp) AS year,
  FORMAT_DATE('%Y-%m', current_date) AS month_year
FROM table;

8. Geographic Functions

Spatial Queries

SQL

1
2
3
4
5
6
7
8
-- Find points within a region
SELECT 
  location, 
  ST_WITHIN(location, region) AS is_inside
FROM geo_data
WHERE ST_WITHIN(location, 
  ST_GEOGFROMTEXT('POLYGON((...))'));

9. Machine Learning Integration (BigQuery ML)

Create and Train Model

SQL

1
2
3
4
5
6
7
CREATE OR REPLACE MODEL `project.dataset.model`
OPTIONS(
  model_type='linear_reg',
  input_label_cols=['target']
) AS
SELECT * FROM training_data;

Prediction

SQL

1
2
3
4
5
6
SELECT * 
FROM ML.PREDICT(
  MODEL `project.dataset.model`, 
  (SELECT * FROM prediction_data)
);

10. Performance Optimization

Partitioning

SQL

1
2
3
4
5
-- Create partitioned table
CREATE TABLE `project.dataset.partitioned_table`
PARTITION BY DATE(timestamp)
AS SELECT * FROM source_table;

Clustering

SQL

1
2
3
4
5
-- Create clustered table
CREATE TABLE `project.dataset.clustered_table`
CLUSTER BY user_id, country
AS SELECT * FROM source_table;

11. Cost Optimization Strategies

Use LIMIT to restrict query size
Leverage partitioning and clustering
Use dry_run=True to estimate query cost
Utilize materialized views
Set up cost controls and budgets

12. External Data Sources

External Table

SQL

1
2
3
4
5
6
CREATE EXTERNAL TABLE `project.dataset.external_table`
OPTIONS (
  format = 'CSV',
  uris = ['gs://bucket/path/*.csv']
);

13. Best Practices

Always use standard SQL mode
Leverage LIMIT for testing queries
Use approximate aggregation functions
Partition large tables
Use clustering for frequent filter columns
Avoid SELECT *
Use STRUCT and ARRAY for complex data
Monitor and optimize query performance

BigQuery

BigQuery

BigQuery SQL Cheatsheet: Petabyte-Scale Analytics

1. Connection and Basic Setup

gcloud CLI Authentication

Web UI Access

2. Project, Dataset, and Table Operations

Create Dataset

Create Table

3. BigQuery-Specific Data Types

STRUCT

ARRAY

GEOGRAPHY

4. Standard SQL Operations

SELECT with Advanced Filtering

INSERT (Streaming)

UPDATE

DELETE

5. Analytics Functions

Window Functions

Statistical Functions

6. Array and Struct Operations

Array Aggregation

Struct Manipulation

7. Date and Time Functions

Date Manipulation

8. Geographic Functions

Spatial Queries

9. Machine Learning Integration (BigQuery ML)

Create and Train Model

Prediction

10. Performance Optimization

Partitioning

Clustering

11. Cost Optimization Strategies

12. External Data Sources

External Table

13. Best Practices

Additional Resources

Continue Learning