HiveQL Cheatsheet: Mastering Big Data Processing with Hadoop

1. Connection and Basic Setup

Hive CLI (Deprecated)

Bash

1
2
3
4
5
6
7
8
9
# Launch Hive CLI (legacy - use Beeline instead)
hive

# Run Hive script
hive -f script.hql

# Execute inline query
hive -e "SELECT * FROM table LIMIT 10;"

Note: The Hive CLI is deprecated. Beeline is the recommended client for HiveServer2.

Beeline (JDBC Client - Recommended)

Bash

1
2
3
4
5
6
7
8
9
# Connect to Hive server
beeline -u jdbc:hive2://localhost:10000 -n username

# Run script via Beeline
beeline -u jdbc:hive2://localhost:10000 -f script.hql

# Connect with Kerberos
beeline -u "jdbc:hive2://localhost:10000/default;principal=hive/_HOST@REALM"

2. Hive Architecture Basics

Core Components

Metastore: Centralized metadata repository
- Stores table schemas, partitions, storage information
- Supports multiple storage backends (MySQL, PostgreSQL, MariaDB, Oracle)
- Can run as a standalone HiveMetastore (HMS) service shared with Spark, Presto/Trino, Flink
HiveServer2 (HS2): Thrift service for client connections (JDBC/ODBC)
Storage Layers:
- HDFS, S3, Azure Blob/ADLS, GCS
- Supports multiple file formats (ORC, Parquet, Avro)
Execution Engines:
- Tez (default and recommended)
- MapReduce (deprecated, removed as a viable engine in Hive 4)
- Spark integration via Hive-on-Spark (community maintenance only)

3. Database and Table Operations

Database Management

SQL

1
2
3
4
5
6
7
8
9
10
11
12
-- Create database (DATABASE and SCHEMA are synonyms)
CREATE DATABASE IF NOT EXISTS my_database;

-- Switch database
USE my_database;

-- List databases
SHOW DATABASES;

-- Drop database
DROP DATABASE IF EXISTS my_database CASCADE;

Table Creation

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Create managed (ACID) table - default in Hive 3+ is transactional ORC
CREATE TABLE employees (
    id INT,
    name STRING,
    salary DOUBLE,
    hire_date DATE
) STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Create external table
CREATE EXTERNAL TABLE log_data (
    ts STRING,
    level STRING,
    message STRING
)
STORED AS TEXTFILE
LOCATION '/path/to/log/files';

Hive 3+ Note: Managed tables are ACID/transactional by default and must be stored as ORC. Use EXTERNAL for non-transactional or non-ORC data.

4. Hive Data Types and File Formats

Primitive Data Types

TINYINT: 1-byte signed integer
SMALLINT: 2-byte signed integer
INT: 4-byte signed integer
BIGINT: 8-byte signed integer
FLOAT: 4-byte floating point
DOUBLE: 8-byte floating point
DECIMAL(p,s): Precise decimal type
STRING: Variable-length character string
VARCHAR(n): Variable-length string with max length
CHAR(n): Fixed-length string
BOOLEAN: TRUE/FALSE
TIMESTAMP: Date and time
TIMESTAMP WITH LOCAL TIME ZONE: Timezone-aware timestamp
DATE: Calendar date
BINARY: Byte array

Complex Types

ARRAY<T>, MAP<K,V>, STRUCT<...>, UNIONTYPE<...>

File Formats

SQL

1
2
3
4
5
6
7
8
9
10
11
-- ORC (Optimized Row Columnar) - preferred for managed/ACID tables
CREATE TABLE orc_table (...) STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZSTD");

-- Parquet (Columnar storage)
CREATE TABLE parquet_table (...) STORED AS PARQUET;

-- Avro (Schema-based)
CREATE TABLE avro_table (...)
STORED AS AVRO;

5. Data Loading Techniques

LOAD DATA

SQL

1
2
3
4
5
6
7
8
-- Load local file
LOAD DATA LOCAL INPATH '/local/path/data.csv'
INTO TABLE my_table;

-- Load from HDFS
LOAD DATA INPATH '/hdfs/path/data.csv'
INTO TABLE my_table;

INSERT Operations

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- Insert single row
INSERT INTO TABLE employees
VALUES (1, 'John Doe', 50000.00, '2023-01-01');

-- Insert from another table
INSERT INTO TABLE target_table
SELECT * FROM source_table
WHERE condition;

-- Multi-table insert
FROM source_table s
INSERT INTO TABLE high_value SELECT * WHERE s.amount > 1000
INSERT INTO TABLE low_value  SELECT * WHERE s.amount <= 1000;

ACID DML (Hive 3+)

SQL

1
2
3
4
5
6
7
8
-- Requires transactional table
UPDATE employees SET salary = salary * 1.05 WHERE department = 'ENG';
DELETE FROM employees WHERE hire_date < '2010-01-01';
MERGE INTO target t
USING updates u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.amount = u.amount
WHEN NOT MATCHED THEN INSERT VALUES (u.id, u.amount);

6. Basic SQL Operations

SELECT and Filtering

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- Basic select
SELECT id, name, salary
FROM employees
WHERE salary > 50000;

-- Aggregations
SELECT
    department,
    AVG(salary) AS avg_salary,
    COUNT(*)    AS employee_count
FROM employees
GROUP BY department;

-- Ordering
SELECT * FROM employees
ORDER BY salary DESC
LIMIT 10;

-- Common Table Expression (CTE)
WITH top_earners AS (
    SELECT * FROM employees WHERE salary > 100000
)
SELECT department, COUNT(*) FROM top_earners GROUP BY department;

7. Joins and Subqueries

Join Types

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Inner Join
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id;

-- Left Outer Join
SELECT e.name, d.department_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.id;

-- Subquery (correlated and non-correlated supported)
SELECT name, salary
FROM employees
WHERE salary > (
    SELECT AVG(salary) FROM employees
);

-- Map-side / broadcast join hint
SELECT /*+ MAPJOIN(d) */ e.name, d.department_name
FROM employees e JOIN departments d ON e.dept_id = d.id;

8. Partitioning and Bucketing

Partitioning

SQL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
-- Create partitioned table
CREATE TABLE sales (
    id INT,
    amount DOUBLE
)
PARTITIONED BY (year INT, month STRING);

-- Add partition
ALTER TABLE sales
ADD PARTITION (year=2023, month='January');

-- Dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE sales PARTITION (year, month)
SELECT id, amount, year, month FROM staging_sales;

Bucketing

SQL

1
2
3
4
5
6
7
-- Create bucketed table
CREATE TABLE users (
    id INT,
    name STRING
)
CLUSTERED BY (id) INTO 10 BUCKETS;

Hive 3+ Note: hive.enforce.bucketing and hive.enforce.sorting are removed; bucketing is enforced automatically.

9. User-Defined Functions (UDFs)

SQL

1
2
3
4
5
6
7
8
9
10
11
12
-- Permanent UDF
CREATE FUNCTION my_uppercase
AS 'com.example.MyUppercaseUDF'
USING JAR 'hdfs:///path/to/udf.jar';

-- Temporary UDF (session-scoped)
CREATE TEMPORARY FUNCTION my_uppercase
AS 'com.example.MyUppercaseUDF';

-- Use UDF
SELECT my_uppercase(name) FROM employees;

10. Window Functions

SQL

1
2
3
4
5
6
7
8
9
10
-- Rank employees by salary in each department
SELECT
    name,
    salary,
    department,
    RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank,
    DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank,
    LAG(salary)  OVER (PARTITION BY department ORDER BY hire_date)   AS prev_salary
FROM employees;

11. Performance Optimization

Vectorization

SQL

1
2
3
4
-- Vectorization is enabled by default in Hive 3+
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;

Cost-Based Optimizer (CBO)

SQL

1
2
3
4
5
6
7
8
-- CBO is enabled by default in Hive 3+; keep stats fresh
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;

-- Refresh statistics
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;

LLAP (Live Long and Process)

SQL

1
2
3
4
-- Sub-second interactive queries via persistent daemons
SET hive.llap.execution.mode=all;
SET hive.execution.engine=tez;

Materialized Views (Hive 3+)

SQL

1
2
3
4
5
6
7
8
CREATE MATERIALIZED VIEW mv_dept_totals AS
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department;

-- CBO can automatically rewrite queries to use MVs
SET hive.materializedview.rewriting=true;

12. Hadoop Ecosystem Integration

Apache Spark: Reads/writes Hive tables via the shared Hive Metastore
Apache Iceberg / Hudi: Modern open table formats integrated with Hive 4+
Trino / Presto: Federated SQL over the Hive Metastore
Apache Impala: Real-time SQL queries over HMS-managed data
Apache Zeppelin / Hue / Superset: Interactive data exploration

Deprecated: Apache Pig is retired by the ASF and is no longer recommended for new development.

13. Common Big Data Patterns

ETL/ELT Processing: Extract, Transform, Load large datasets
Log Analysis: Processing system and application logs
Machine Learning Data Preparation
Data Warehousing / Lakehouse
Slowly Changing Dimensions via ACID MERGE
Real-time Analytics Preprocessing

14. Best Practices

Use columnar file formats (ORC for managed/ACID, Parquet for interchange)
Partition large tables on low-cardinality columns
Use bucketing for frequently joined or sampled tables
Keep vectorization and CBO enabled; refresh statistics regularly
Prefer Beeline + HiveServer2 over the legacy CLI
Use Tez (or LLAP for interactive workloads); avoid MapReduce
Use ZSTD or Snappy compression for ORC/Parquet
Implement Ranger/Sentry-based access controls and column masking
Consider Iceberg tables for schema evolution and time travel
Regularly compact ACID tables (ALTER TABLE ... COMPACT 'major')

Pro Tips:

Always use EXPLAIN (or EXPLAIN CBO, EXPLAIN VECTORIZATION) to understand query execution plans
Monitor resource usage with YARN, Tez UI, or the HS2 web UI
Keep table and column statistics updated for the CBO
Use materialized views to accelerate recurring aggregation queries
Test and benchmark different configurations

Performance Checklist:

Use columnar file formats (ORC/Parquet)
Implement proper partitioning and bucketing
Keep vectorization enabled
Use the Cost-Based Optimizer with fresh statistics
Use Tez (or LLAP for interactive queries)
Compact ACID tables periodically
Consider materialized views for hot aggregations

HiveQL

Continue Learning