HiveQL
Updated: May 22, 2026Categories: Query
Printed from:
HiveQL Cheatsheet: Mastering Big Data Processing with Hadoop
1. Connection and Basic Setup
Hive CLI (Deprecated)
Bash
123456789# Launch Hive CLI (legacy - use Beeline instead)
hive
# Run Hive script
hive -f script.hql
# Execute inline query
hive -e "SELECT * FROM table LIMIT 10;"
Note: The Hive CLI is deprecated. Beeline is the recommended client for HiveServer2.
Beeline (JDBC Client - Recommended)
Bash
123456789# Connect to Hive server
beeline -u jdbc:hive2://localhost:10000 -n username
# Run script via Beeline
beeline -u jdbc:hive2://localhost:10000 -f script.hql
# Connect with Kerberos
beeline -u "jdbc:hive2://localhost:10000/default;principal=hive/_HOST@REALM"
2. Hive Architecture Basics
Core Components
-
Metastore: Centralized metadata repository
- Stores table schemas, partitions, storage information
- Supports multiple storage backends (MySQL, PostgreSQL, MariaDB, Oracle)
- Can run as a standalone HiveMetastore (HMS) service shared with Spark, Presto/Trino, Flink
-
HiveServer2 (HS2): Thrift service for client connections (JDBC/ODBC)
-
Storage Layers:
- HDFS, S3, Azure Blob/ADLS, GCS
- Supports multiple file formats (ORC, Parquet, Avro)
-
Execution Engines:
- Tez (default and recommended)
- MapReduce (deprecated, removed as a viable engine in Hive 4)
- Spark integration via Hive-on-Spark (community maintenance only)
3. Database and Table Operations
Database Management
SQL
123456789101112-- Create database (DATABASE and SCHEMA are synonyms)
CREATE DATABASE IF NOT EXISTS my_database;
-- Switch database
USE my_database;
-- List databases
SHOW DATABASES;
-- Drop database
DROP DATABASE IF EXISTS my_database CASCADE;
Table Creation
SQL
123456789101112131415161718-- Create managed (ACID) table - default in Hive 3+ is transactional ORC
CREATE TABLE employees (
id INT,
name STRING,
salary DOUBLE,
hire_date DATE
) STORED AS ORC
TBLPROPERTIES ('transactional'='true');
-- Create external table
CREATE EXTERNAL TABLE log_data (
ts STRING,
level STRING,
message STRING
)
STORED AS TEXTFILE
LOCATION '/path/to/log/files';
Hive 3+ Note: Managed tables are ACID/transactional by default and must be stored as ORC. Use
EXTERNALfor non-transactional or non-ORC data.
4. Hive Data Types and File Formats
Primitive Data Types
TINYINT: 1-byte signed integerSMALLINT: 2-byte signed integerINT: 4-byte signed integerBIGINT: 8-byte signed integerFLOAT: 4-byte floating pointDOUBLE: 8-byte floating pointDECIMAL(p,s): Precise decimal typeSTRING: Variable-length character stringVARCHAR(n): Variable-length string with max lengthCHAR(n): Fixed-length stringBOOLEAN: TRUE/FALSETIMESTAMP: Date and timeTIMESTAMP WITH LOCAL TIME ZONE: Timezone-aware timestampDATE: Calendar dateBINARY: Byte array
Complex Types
ARRAY<T>,MAP<K,V>,STRUCT<...>,UNIONTYPE<...>
File Formats
SQL
1234567891011-- ORC (Optimized Row Columnar) - preferred for managed/ACID tables
CREATE TABLE orc_table (...) STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZSTD");
-- Parquet (Columnar storage)
CREATE TABLE parquet_table (...) STORED AS PARQUET;
-- Avro (Schema-based)
CREATE TABLE avro_table (...)
STORED AS AVRO;
5. Data Loading Techniques
LOAD DATA
SQL
12345678-- Load local file
LOAD DATA LOCAL INPATH '/local/path/data.csv'
INTO TABLE my_table;
-- Load from HDFS
LOAD DATA INPATH '/hdfs/path/data.csv'
INTO TABLE my_table;
INSERT Operations
SQL
1234567891011121314-- Insert single row
INSERT INTO TABLE employees
VALUES (1, 'John Doe', 50000.00, '2023-01-01');
-- Insert from another table
INSERT INTO TABLE target_table
SELECT * FROM source_table
WHERE condition;
-- Multi-table insert
FROM source_table s
INSERT INTO TABLE high_value SELECT * WHERE s.amount > 1000
INSERT INTO TABLE low_value SELECT * WHERE s.amount <= 1000;
ACID DML (Hive 3+)
SQL
12345678-- Requires transactional table
UPDATE employees SET salary = salary * 1.05 WHERE department = 'ENG';
DELETE FROM employees WHERE hire_date < '2010-01-01';
MERGE INTO target t
USING updates u ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.amount = u.amount
WHEN NOT MATCHED THEN INSERT VALUES (u.id, u.amount);
6. Basic SQL Operations
SELECT and Filtering
SQL
123456789101112131415161718192021222324-- Basic select
SELECT id, name, salary
FROM employees
WHERE salary > 50000;
-- Aggregations
SELECT
department,
AVG(salary) AS avg_salary,
COUNT(*) AS employee_count
FROM employees
GROUP BY department;
-- Ordering
SELECT * FROM employees
ORDER BY salary DESC
LIMIT 10;
-- Common Table Expression (CTE)
WITH top_earners AS (
SELECT * FROM employees WHERE salary > 100000
)
SELECT department, COUNT(*) FROM top_earners GROUP BY department;
7. Joins and Subqueries
Join Types
SQL
123456789101112131415161718192021-- Inner Join
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id;
-- Left Outer Join
SELECT e.name, d.department_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.id;
-- Subquery (correlated and non-correlated supported)
SELECT name, salary
FROM employees
WHERE salary > (
SELECT AVG(salary) FROM employees
);
-- Map-side / broadcast join hint
SELECT /*+ MAPJOIN(d) */ e.name, d.department_name
FROM employees e JOIN departments d ON e.dept_id = d.id;
8. Partitioning and Bucketing
Partitioning
SQL
123456789101112131415161718-- Create partitioned table
CREATE TABLE sales (
id INT,
amount DOUBLE
)
PARTITIONED BY (year INT, month STRING);
-- Add partition
ALTER TABLE sales
ADD PARTITION (year=2023, month='January');
-- Dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE sales PARTITION (year, month)
SELECT id, amount, year, month FROM staging_sales;
Bucketing
SQL
1234567-- Create bucketed table
CREATE TABLE users (
id INT,
name STRING
)
CLUSTERED BY (id) INTO 10 BUCKETS;
Hive 3+ Note:
hive.enforce.bucketingandhive.enforce.sortingare removed; bucketing is enforced automatically.
9. User-Defined Functions (UDFs)
SQL
123456789101112-- Permanent UDF
CREATE FUNCTION my_uppercase
AS 'com.example.MyUppercaseUDF'
USING JAR 'hdfs:///path/to/udf.jar';
-- Temporary UDF (session-scoped)
CREATE TEMPORARY FUNCTION my_uppercase
AS 'com.example.MyUppercaseUDF';
-- Use UDF
SELECT my_uppercase(name) FROM employees;
10. Window Functions
SQL
12345678910-- Rank employees by salary in each department
SELECT
name,
salary,
department,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rank,
LAG(salary) OVER (PARTITION BY department ORDER BY hire_date) AS prev_salary
FROM employees;
11. Performance Optimization
Vectorization
SQL
1234-- Vectorization is enabled by default in Hive 3+
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
Cost-Based Optimizer (CBO)
SQL
12345678-- CBO is enabled by default in Hive 3+; keep stats fresh
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
-- Refresh statistics
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;
LLAP (Live Long and Process)
SQL
1234-- Sub-second interactive queries via persistent daemons
SET hive.llap.execution.mode=all;
SET hive.execution.engine=tez;
Materialized Views (Hive 3+)
SQL
12345678CREATE MATERIALIZED VIEW mv_dept_totals AS
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department;
-- CBO can automatically rewrite queries to use MVs
SET hive.materializedview.rewriting=true;
12. Hadoop Ecosystem Integration
- Apache Spark: Reads/writes Hive tables via the shared Hive Metastore
- Apache Iceberg / Hudi: Modern open table formats integrated with Hive 4+
- Trino / Presto: Federated SQL over the Hive Metastore
- Apache Impala: Real-time SQL queries over HMS-managed data
- Apache Zeppelin / Hue / Superset: Interactive data exploration
Deprecated: Apache Pig is retired by the ASF and is no longer recommended for new development.
13. Common Big Data Patterns
- ETL/ELT Processing: Extract, Transform, Load large datasets
- Log Analysis: Processing system and application logs
- Machine Learning Data Preparation
- Data Warehousing / Lakehouse
- Slowly Changing Dimensions via ACID MERGE
- Real-time Analytics Preprocessing
14. Best Practices
- Use columnar file formats (ORC for managed/ACID, Parquet for interchange)
- Partition large tables on low-cardinality columns
- Use bucketing for frequently joined or sampled tables
- Keep vectorization and CBO enabled; refresh statistics regularly
- Prefer Beeline + HiveServer2 over the legacy CLI
- Use Tez (or LLAP for interactive workloads); avoid MapReduce
- Use ZSTD or Snappy compression for ORC/Parquet
- Implement Ranger/Sentry-based access controls and column masking
- Consider Iceberg tables for schema evolution and time travel
- Regularly compact ACID tables (
ALTER TABLE ... COMPACT 'major')
Pro Tips:
- Always use
EXPLAIN(orEXPLAIN CBO,EXPLAIN VECTORIZATION) to understand query execution plans - Monitor resource usage with YARN, Tez UI, or the HS2 web UI
- Keep table and column statistics updated for the CBO
- Use materialized views to accelerate recurring aggregation queries
- Test and benchmark different configurations
Performance Checklist:
- Use columnar file formats (ORC/Parquet)
- Implement proper partitioning and bucketing
- Keep vectorization enabled
- Use the Cost-Based Optimizer with fresh statistics
- Use Tez (or LLAP for interactive queries)
- Compact ACID tables periodically
- Consider materialized views for hot aggregations
Continue Learning
Discover more cheatsheets to boost your productivity