HiveQL
Updated: September 10, 2025
Categories: Query
HiveQL Cheatsheet: Mastering Big Data Processing with Hadoop
1. Connection and Basic Setup
Hive CLI
Bash
# Launch Hive CLI
hive
# Run Hive script
hive -f script.hql
# Execute inline query
hive -e "SELECT * FROM table LIMIT 10;"
Beeline (JDBC Client)
Bash
# Connect to Hive server
beeline -u jdbc:hive2://localhost:10000 -n username
# Run script via Beeline
beeline -u jdbc:hive2://localhost:10000 -f script.hql
2. Hive Architecture Basics
Core Components
- Metastore: Centralized metadata repository
  - Stores table schemas, partitions, and storage information
  - Supports multiple storage backends (MySQL, PostgreSQL)
- Storage Layers:
  - HDFS: Distributed file system
  - Supports multiple file formats (ORC, Parquet, Avro)
- Execution Engines:
  - MapReduce (legacy)
  - Tez (recommended for better performance)
  - Spark (for more complex processing)
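The engine can be switched per session with standard Hive configuration properties:

```sql
-- Select the execution engine for the current session
SET hive.execution.engine=tez;   -- alternatives: mr (legacy), spark

-- Print the current setting
SET hive.execution.engine;
```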
3. Database and Table Operations
Database Management
SQL
-- Create database
CREATE DATABASE IF NOT EXISTS my_database;
-- Switch database
USE my_database;
-- List databases
SHOW DATABASES;
-- Drop database
DROP DATABASE IF EXISTS my_database CASCADE;
Table Creation
SQL
-- Create managed table
CREATE TABLE employees (
id INT,
name STRING,
salary DOUBLE,
hire_date DATE
) STORED AS ORC;
-- Create external table
CREATE EXTERNAL TABLE log_data (
`timestamp` STRING,  -- backticks: TIMESTAMP is a reserved keyword
level STRING,
message STRING
)
STORED AS TEXTFILE
LOCATION '/path/to/log/files';
4. Hive Data Types and File Formats
Primitive Data Types
- TINYINT: 1-byte signed integer
- SMALLINT: 2-byte signed integer
- INT: 4-byte signed integer
- BIGINT: 8-byte signed integer
- FLOAT: 4-byte floating point
- DOUBLE: 8-byte floating point
- DECIMAL: Precise decimal type
- STRING: Variable-length character string
- VARCHAR: Variable-length string with a maximum length
- CHAR: Fixed-length string
- BOOLEAN: TRUE/FALSE
- TIMESTAMP: Date and time
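A table definition exercising several of these types (column names are illustrative):

```sql
CREATE TABLE type_demo (
  tiny_col   TINYINT,
  big_col    BIGINT,
  price      DECIMAL(10,2),   -- precision 10, scale 2
  label      VARCHAR(50),     -- max length 50
  code       CHAR(2),         -- fixed length 2
  active     BOOLEAN,
  created_at TIMESTAMP
) STORED AS ORC;
```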
File Formats
SQL
-- ORC (Optimized Row Columnar)
CREATE TABLE orc_table (...) STORED AS ORC;
-- Parquet (Columnar storage)
CREATE TABLE parquet_table (...) STORED AS PARQUET;
-- Avro (Schema-based)
CREATE TABLE avro_table (...)
STORED AS AVRO;
5. Data Loading Techniques
LOAD DATA
SQL
-- Load local file
LOAD DATA LOCAL INPATH '/local/path/data.csv'
INTO TABLE my_table;
-- Load from HDFS
LOAD DATA INPATH '/hdfs/path/data.csv'
INTO TABLE my_table;
INSERT Operations
SQL
-- Insert single row
INSERT INTO TABLE employees
VALUES (1, 'John Doe', 50000.00, '2023-01-01');
-- Insert from another table
INSERT INTO TABLE target_table
SELECT * FROM source_table
WHERE condition;
6. Basic SQL Operations
SELECT and Filtering
SQL
-- Basic select
SELECT id, name, salary
FROM employees
WHERE salary > 50000;
-- Aggregations
SELECT
department,
AVG(salary) as avg_salary,
COUNT(*) as employee_count
FROM employees
GROUP BY department;
-- Ordering
SELECT * FROM employees
ORDER BY salary DESC
LIMIT 10;
7. Joins and Subqueries
Join Types
SQL
-- Inner Join
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id;
-- Left Outer Join
SELECT e.name, d.department_name
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.id;
-- Subquery
SELECT name, salary
FROM employees
WHERE salary > (
SELECT AVG(salary) FROM employees
);
8. Partitioning and Bucketing
Partitioning
SQL
-- Create partitioned table
CREATE TABLE sales (
id INT,
amount DOUBLE
)
PARTITIONED BY (year INT, month STRING);
-- Add partition
ALTER TABLE sales
ADD PARTITION (year=2023, month='January');
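Partitions can also be populated dynamically from query results; a sketch, assuming a staging table `raw_sales` with matching columns:

```sql
-- Allow dynamic partitioning in this session
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition values come from the trailing SELECT columns
INSERT OVERWRITE TABLE sales PARTITION (year, month)
SELECT id, amount, year, month
FROM raw_sales;
```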
Bucketing
SQL
-- Create bucketed table
CREATE TABLE users (
id INT,
name STRING
)
CLUSTERED BY (id) INTO 10 BUCKETS;
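Bucketing also enables efficient sampling, since Hive can read whole buckets instead of scanning the table:

```sql
-- Read roughly one tenth of the data using the bucket layout
SELECT * FROM users TABLESAMPLE (BUCKET 1 OUT OF 10 ON id);
```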
9. User-Defined Functions (UDFs)
SQL
-- Create UDF
CREATE FUNCTION my_uppercase
AS 'com.example.MyUppercaseUDF'
USING JAR '/path/to/udf.jar';
-- Use UDF
SELECT my_uppercase(name) FROM employees;
10. Window Functions
SQL
-- Rank employees by salary in each department
SELECT
name,
salary,
department,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as salary_rank
FROM employees;
11. Performance Optimization
Vectorization
SQL
-- Enable vectorization
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled=true;
Cost-Based Optimizer (CBO)
SQL
-- Enable CBO
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
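The CBO relies on up-to-date statistics, which are gathered with ANALYZE TABLE:

```sql
-- Gather table- and column-level statistics for the optimizer
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;
```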
12. Hadoop Ecosystem Integration
- Apache Spark: Advanced analytics and machine learning
- Apache Pig: High-level data flow scripting
- Apache Impala: Low-latency, interactive SQL queries
- Apache Zeppelin: Interactive data exploration
13. Common Big Data Patterns
- ETL Processing: Extract, Transform, Load large datasets
- Log Analysis: Processing system and application logs
- Machine Learning Data Preparation
- Data Warehousing
- Real-time Analytics Preprocessing
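A typical ETL step combines several of these ideas, e.g. loading filtered log data into a partitioned columnar table. A sketch, assuming a target table `clean_logs` partitioned by `log_date` and timestamps in `yyyy-MM-dd ...` format:

```sql
-- Transform raw logs and write them into date partitions
INSERT OVERWRITE TABLE clean_logs PARTITION (log_date)
SELECT
  level,
  TRIM(message)          AS message,
  TO_DATE(`timestamp`)   AS log_date
FROM log_data
WHERE level IN ('WARN', 'ERROR');
```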
14. Best Practices
- Use appropriate file formats (ORC/Parquet)
- Partition large tables
- Use bucketing for frequently joined tables
- Leverage vectorization and CBO
- Monitor and optimize query performance
- Use appropriate compression (Snappy, ZLIB)
- Implement proper access controls
- Update metadata and statistics regularly
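Compression is typically set per table format or per job; for ORC, for example:

```sql
-- ORC with Snappy compression at the table level
CREATE TABLE compressed_table (id INT, payload STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Also compress intermediate map output
SET hive.exec.compress.intermediate=true;
```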
Pro Tips:
- Always use EXPLAIN to understand the query execution plan
- Monitor resource usage with YARN
- Keep statistics updated for query optimization
- Test and benchmark different configurations
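For example, prefix any query with EXPLAIN to inspect its plan before running it:

```sql
-- Show the execution plan without running the query
EXPLAIN
SELECT department, AVG(salary)
FROM employees
GROUP BY department;
```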
Performance Checklist:
- Use columnar file formats
- Implement proper partitioning
- Enable vectorization
- Use Cost-Based Optimizer
- Choose the right execution engine (Tez/Spark)