ETL Engine SQL Syntax
SQL is a module for working with structured data. The SQL Syntax section describes the SQL syntax in detail and provides usage examples when applicable.
This document provides a list of Data Definition, Manipulation, Retrieval, and Auxiliary Statements.
Data Definition Statements are used to create or modify the structure of database objects in a database.
Spark SQL supports the following Data Definition Statements:
CREATE table
THE CREATE TABLE statement defines a table in an existing database:
CREATE TABLE student (id INT, name STRING, age INT); CREATE TABLE student (id INT, name STRING) PARTITIONED BY (age INT); CREATE TABLE Student_Dupli like Student;
DROP table
DROP TABLE deletes the table and removes the directory associated with it from the file system if the table is not an EXTERNAL table. If the table is not present, it throws an exception.
In case of an external table, only the associated metadata information is removed from the metastore database.
If the table is cached, the command uncaches it and all its dependents.
DROP TABLE employeetable; DROP TABLE userdb.employeetable; DROP TABLE employeetable; DROP TABLE IF EXISTS employeetable; DROP TABLE employeetable PURGE; TRUNCATE TABLE Student partition(age=10);
ALTER table
THE ALTER TABLE statement changes the schema or properties of a table.
ALTER TABLE Student RENAME TO StudentInfo; ALTER TABLE StudentInfo ADD columns (LastName string, DOB timestamp); ALTER TABLE StudentInfo DROP columns (LastName, DOB); ALTER TABLE StudentInfo DROP columns (LastName, DOB); ALTER TABLE StudentInfo DROP column (LastName); ALTER TABLE StudentInfo RENAME COLUMN name TO FirstName; ALTER TABLE Teacher CLUSTER BY (gender, country);
DESCRIBE table
DESCRIBE TABLE statement returns the basic metadata information of a table. The metadata information includes column name, column type and column comment. Optionally a partition spec or column name may be specified to return the metadata pertaining to a partition or column respectively.
DESCRIBE TABLE EXTENDED tempdb1.v2; DESC TABLE EXTENDED tempdb1.v2; DESC TABLE EXTENDED tempdb1.v2; DESC TABLE EXTENDED tempdb1.v2; DESC TABLE EXTENDED tempdb1.v2; DESC TABLE EXTENDED open_orders; CREATE OR REPLACE VIEW open_orders WITH SCHEMA EVOLUTION AS SELECT * FROM orders WHERE status = 'open'; CREATE OR REPLACE VIEW open_orders AS SELECT * FROM orders WHERE status = 'open'; DROP VIEW employeeView; DROP VIEW userdb.employeeView; DROP VIEW employeeView; DROP VIEW IF EXISTS employeeView; ALTER VIEW tempdb1.v1 RENAME TO tempdb1.v2; ALTER VIEW tempdb1.v2 AS SELECT * FROM tempdb1.v1;
Data Manipulation Statements are used to add, change, or delete data.
SQL supports the following Data Manipulation Statements:
INSERT table
The INSERT statement inserts new rows into a table or overwrites the existing data in the table. The inserted rows can be specified by value expressions or the result of a query.
INSERT INTO students VALUES('Amy Smith', '123 Park Ave, San Jose', 111111); INSERT INTO students VALUES('Bob Brown', '456 Taylor St, Cupertino', 222222),('Cathy Johnson', '789 Race Ave, Palo Alto', 333333); INSERT INTO students PARTITION (student_id = 444444)SELECT name, address FROM persons WHERE name = 'Dora Williams'; INSERT INTO students TABLE visiting_students; INSERT INTO students PARTITION (birthday = date'2019') VALUES('Amy Smith', '123 Park Ave, San Jose'); INSERT INTO students (address, name, student_id) VALUES('Hangzhou, China', 'Kent Yao', 11215016); INSERT INTO students PARTITION (student_id = 11215017) (address, name) VALUES('Hangzhou, China', 'Kent Yao Jr.'); INSERT OVERWRITE students VALUES ('Ashua Hill', '456 Erica Ct, Cupertino', 111111), ('Brian Reed', '723 Kern Ave, Palo Alto', 222222); INSERT OVERWRITE students PARTITION (student_id = 222222)BY NAME SELECT 'Unknown' as address, name FROM persons WHERE name = 'Dora Williams'; INSERT INTO students PARTITION (student_id = 222222) BY NAME SELECT address, name FROM persons WHERE name = 'Dora Williams'; INSERT OVERWRITE students PARTITION (student_id = 222222) BY NAME SELECT 'Unknown' as address, name FROM persons WHERE name = 'Dora Williams'; INSERT INTO persons REPLACE WHERE ssn = 123456789 SELECT * FROM persons2; INSERT OVERWRITE students TABLE visiting_students; CREATE TABLE students (name STRING, address STRING) PARTITIONED BY (birthday DATE); INSERT INTO students PARTITION (birthday = date'2019')VALUES ('Amy Smith', '123 Park Ave, San Jose'); INSERT OVERWRITE students PARTITION (birthday = date'2019') VALUES('Jason Wang', '908 Bird St, Saratoga'); INSERT OVERWRITE students (address, name, student_id) VALUES('Hangzhou, China', 'Kent Yao', 11215016); INSERT OVERWRITE students PARTITION (student_id = 11215016) (address, name) VALUES ('Hangzhou, China', 'Kent Yao Jr.');
UPDATE table
Updates the column values for the rows that match a predicate. When no predicate is provided, update the column values for all rows.
> UPDATE events SET eventType = 'click' WHERE eventType = 'clk'; > UPDATE all_events SET session_time = 0, ignored = true WHERE session_time < (SELECT min(session_time) FROM good_events); > UPDATE orders AS t1 SET order_status = 'returned' WHERE EXISTS (SELECT oid FROM returned_orders WHERE t1.oid = oid); > UPDATE events SET category = 'undefined' WHERE category NOT IN (SELECT category FROM events2 WHERE date > '2001-01-01'); > UPDATE events SET ignored = DEFAULTWHERE eventType = 'unknown';
DELETE table
Deletes the rows that match a predicate. When no predicate is provided, it deletes all rows.
DELETE FROM events WHERE date < '2017-01-01'; DELETE FROM all_events WHERE session_time < (SELECT min(session_time) FROM good_events); DELETE FROM orders AS t1 WHERE EXISTS (SELECT oid FROM returned_orders WHERE t1.oid = oid); DELETE FROM events WHERE category NOT IN (SELECT category FROM events2 WHERE date > '2001-01-01');
MERGE INTO
Merges a set of updates, insertions, and deletions based on a source table into a target Delta table.
MERGE INTO target USING source ON target.key = source.key WHEN MATCHED THEN DELETE; MERGE INTO target USING source ON target.key = source.key WHEN MATCHED AND target.updated_at < source.updated_at THEN UPDATE SET *; MERGE INTO target USING source ON target.key = source.key WHEN MATCHED AND target.marked_for_deletion THEN DELETE WHEN MATCHED THEN UPDATE SET target.updated_at = source.updated_at, target.value = DEFAULT; MERGE INTO target USING source ON target.key = source.key WHEN NOT MATCHED THEN INSERT *; MERGE INTO target USING source ON target.key = source.key WHEN NOT MATCHED BY TARGET AND source.created_at > now() - INTERVAL “1” DAY THEN INSERT (created_at, value) VALUES (source.created_at, DEFAULT); MERGE INTO target USING source ON target.key = source.key WHEN NOT MATCHED BY SOURCE THEN DELETE; MERGE INTO target USING source ON target.key = source.key WHEN NOT MATCHED BY SOURCE AND target.marked_for_deletion THEN DELETE WHEN NOT MATCHED BY SOURCE THEN UPDATE SET target.value = DEFAULT;
Spark supports SELECT
statement that is used to retrieve rows from one or more tables according to the specified clauses. The full syntax and brief description of supported clauses are explained in SELECT section. The SQL statements related to SELECT are also included in this section. Sql also provides the ability to generate logical and physical plan for a given query using EXPLAIN statement.
An SQL operator is a symbol specifying an action performed on one or more expressions. Operators are represented by special characters or by keywords.
Operator precdence
When a complex expression has multiple operators, operator precedence determines the sequence of operations in the expression. For example, in expression 1 + 2 * 3,
*
has higher precedence than +
, so the expression is evaluated as 1 + (2 * 3) = 7
. The order of execution can significantly affect the resulting value.
Operators have the precedence levels shown in the following table. An operator on higher precedence is evaluated before an operator on a lower level. In the following table, the operators in descending order of precedence, a.k.a. 1 is the highest level. Operators listed on the same table cell have the same precedence and are evaluated from left to right or right to left based on the associativity.
Precedence | Operator | Operation | Associativity |
---|---|---|---|
1 | . [] :: | member access element access cast | Left to right |
2 | + - ~ | unary plus unary minus bitwise NOT | Right to left |
3 | * / % DIV | multiplication division, modulo integral division | Left to right |
4 | + - || | addition subtraction concatenation | Left to right |
5 | << >> | bitwise shift left bitwise shift right | Left to right |
6 | & | bitwise AND | Left to right |
7 | ^ | bitwise XOR(exclusive or) | Left to right |
8 | | | bitwise OR(inclusive or) | Left to right |
9 | =, == <>, != <, < = >, >= | comparison operators | Left to right |
10 | NOT, ! EXISTS | logical NOT existence | Right to left |
11 | BETWEEN IN RLIKE, REGEXP ILIKE LIKE IS [NULL, TRUE, FALSE] IS DISTINCT FROM | other predicates | Left to right |
12 | AND | conjunction | Left to right |
13 | OR | disjunction | Left to right |
The following SQL functions can be used:
Function | Description | Example |
---|---|---|
| abs(expr) - Returns the absolute value of the numeric or interval value. | > SELECT abs(-1); Result: 1 > SELECT abs(INTERVAL -'1-1' YEAR TO MONTH); Result: 1-1 |
| add_months(start_date, num_months) - Returns the date that isnum_monthsafterstart_date. | > SELECT add_months('2016-08-31', 1); Result: 2016-09-30 |
| aggregate(expr, start, merge, finish) - Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function. | > SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x); Result: 6 > SELECT aggregate(array(1, 2, 3), 0, (acc, x) -> acc + x, acc -> acc * 10); Result: 60 |
| expr1 and expr2 - Logical AND. | > SELECT true and true; Result: true > SELECT true and false; Result: false > SELECT true and NULL; Result: NULL > SELECT false and NULL; Result: False |
| any(expr) - Returns true if at least one value of expr is true. | > SELECT any(col) FROM VALUES (true), (false), (false) AS tab(col); Result: true > SELECT any(col) FROM VALUES (NULL), (true), (false) AS tab(col); Result: true > SELECT any(col) FROM VALUES (false), (false), (NULL) AS tab(col); Result: false |
| any_value(expr[, isIgnoreNull]) - Returns some value ofexprfor a group of rows. IfisIgnoreNullis true, returns only non-null values. | > SELECT any_value(col) FROM VALUES (10), (5), (20) AS tab(col); Result: 10 > SELECT any_value(col) FROM VALUES (NULL), (5), (20) AS tab(col); Result: NULL > SELECT any_value(col, true) FROM VALUES (NULL), (5), (20) AS tab(col); Result: 5 |
| array(expr, ...) - Returns an array with the given elements. | > SELECT array(1, 2, 3); Result: [1,2,3] |
| avg(expr) - Returns the mean calculated from values of a group. | > SELECT avg(col) FROM VALUES (1), (2), (3) AS tab(col); Result: 2.0 > SELECT avg(col) FROM VALUES (1), (2), (NULL) AS tab(col); Result: 1.5 |
| base64(bin) - Converts the argument from a binary b into a base 64 string. | > SELECT base64('Spark SQL'); Result: U3BhcmsgU1FM > SELECT base64(x'537061726b2053514c'); Result: U3BhcmsgU1FM |
| input [NOT] between lower AND upper - evaluate if input is [not] in between lower and upper | > SELECT 0.5 between 0.1 AND 1.0; Result: true |
| bool_and(expr) - Returns true if all values of expr are true. | > SELECT bool_and(col) FROM VALUES (true), (true), (true) AS tab(col); Result: true > SELECT bool_and(col) FROM VALUES (NULL), (true), (true) AS tab(col); Result: true > SELECT bool_and(col) FROM VALUES (true), (false), (true) AS tab(col); Result: false |
| bool_or(expr) - Returns true if at least one value of expr is true. | > SELECT bool_or(col) FROM VALUES (true), (false), (false) AS tab(col); Result: true > SELECT bool_or(col) FROM VALUES (NULL), (true), (false) AS tab(col); Result: true > SELECT bool_or(col) FROM VALUES (false), (false), (NULL) AS tab(col); Result: false |
| CASE expr1 WHEN expr2 THEN expr3 [WHEN expr4 THEN expr5]* [ELSE expr6] END - Whenexpr1=expr2, returnsexpr3; whenexpr1=expr4, returnexpr5; else returnexpr6. | > SELECT CASE col1 WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE '?' END FROM VALUES 1, 2, 3; Result: one two ? > SELECT CASE col1 WHEN 1 THEN 'one' WHEN 2 THEN 'two' END FROM VALUES 1, 2, 3; Result: one two NULL |
| cast(expr AS type) - Casts the valueexprto the target data typetype.expr::typealternative casting syntax is also supported. | > SELECT cast('10' as int); Result: 10 > SELECT '10' :: int; Result: 10 |
| char(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256) | > SELECT char(65); Result: A |
| char_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros. | > SELECT char_length('Spark SQL '); Result: 10 > SELECT char_length(x'537061726b2053514c'); Result: 9 > SELECT CHAR_LENGTH('Spark SQL '); Result: 10 > SELECT CHARACTER_LENGTH('Spark SQL '); Result: 10 |
| character_length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros. | > SELECT character_length('Spark SQL '); Result: 10 > SELECT character_length(x'537061726b2053514c'); Result: 9 > SELECT CHAR_LENGTH('Spark SQL '); Result: 10 > SELECT CHARACTER_LENGTH('Spark SQL '); Result: 10 |
| chr(expr) - Returns the ASCII character having the binary equivalent to expr. If n is larger than 256 the result is equivalent to chr(n % 256) | > SELECT chr(65); Result: A |
| coalesce(expr1, expr2, ...) - Returns the first non-null argument if exists. Otherwise, null. | > SELECT coalesce(NULL, 1, NULL); Result: 1 |
| concat(col1, col2, ..., colN) - Returns the concatenation of col1, col2, ..., colN. | > SELECT concat('Spark', 'SQL'); Result: SparkSQL > SELECT concat(array(1, 2, 3), array(4, 5), array(6)); Result: [1,2,3,4,5,6] |
| concat_ws(sep[, str | array(str)]+) - Returns the concatenation of the strings separated bysep, skipping null values. | > SELECT concat_ws(' ', 'Spark', 'SQL'); Result: Spark SQL > SELECT concat_ws('s'); Result: > SELECT concat_ws('/', 'foo', null, 'bar'); Result: foo/bar > SELECT concat_ws(null, 'Spark', 'SQL'); Result: NULL |
| contains(left, right) - Returns a boolean. The value is True if right is found inside left.Returns NULL if either input expression is NULL. Otherwise, returns False.Both left or right must be of STRING or BINARY type. | > SELECT contains('Spark SQL', 'Spark'); Result: true > SELECT contains('Spark SQL', 'SPARK'); Result: false > SELECT contains('Spark SQL', null); Result: NULL > SELECT contains(x'537061726b2053514c', x'537061726b'); Result: true |
| count(*) - Returns the total number of retrieved rows, including rows containing null. | > SELECT count(*) FROM VALUES (NULL), (5), (5), (20) AS tab(col); Result: 4 > SELECT count(col) FROM VALUES (NULL), (5), (5), (20) AS tab(col); Result: 3 > SELECT count(DISTINCT col) FROM VALUES (NULL), (5), (5), (10) AS tab(col); Result: 2 |
| current_date() - Returns the current date at the start of query evaluation. All calls of current_date within the same query return the same value. | > SELECT current_date(); Result: 2020-04-25 > SELECT current_date; Result: 2020-04-25 |
| current_schema() - Returns the current database. | > SELECT current_schema(); Result: default |
| current_timestamp() - Returns the current timestamp at the start of query evaluation. All calls of current_timestamp within the same query return the same value. | > SELECT current_timestamp(); Result: 2020-04-25 15:49:11.914 > SELECT current_timestamp; Result: 2020-04-25 15:49:11.914 |
| date(expr) - Casts the value expr to the target data typedate. | > SELECT date_add('2016-07-30', 1); Result: 2016-07-31 |
| date_add(start_date, num_days) - Returns the date that isn um_days after start_date. | > SELECT date_add('2016-07-30', 1); Result: 2016-07-31 |
| date_diff(endDate, startDate) - Returns the number of days from startDate to endDate. | > SELECT date_diff('2009-07-31', '2009-07-30'); Result: 1 > SELECT date_diff('2009-07-30', '2009-07-31'); Result: -1 |
| date_trunc(fmt, ts) - Returns timestamptstruncated to the unit specified by the format modelfmt. | > SELECT date_trunc('YEAR', '2015-03-05T09:32:05.359'); Result: 2015-01-01 00:00:00 > SELECT date_trunc('MM', '2015-03-05T09:32:05.359'); Result: 2015-03-01 00:00:00 > SELECT date_trunc('DD', '2015-03-05T09:32:05.359'); Result: 2015-03-05 00:00:00 > SELECT date_trunc('HOUR', '2015-03-05T09:32:05.359'); Result: 2015-03-05 09:00:00 > SELECT date_trunc('MILLISECOND', '2015-03-05T09:32:05.123456'); Result: 2015-03-05 09:32:05.123 |
| dateadd(start_date, num_days) - Returns the date that is num_days after start_date. | > SELECT dateadd('2016-07-30', 1); Result: 2016-07-31 |
| datediff(endDate, startDate) - Returns the number of days from startDate to endDate. | > SELECT datediff('2009-07-31', '2009-07-30'); Result: 1 > SELECT datediff('2009-07-30', '2009-07-31'); Result: -1 |
| day(date) - Returns the day of month of the date/timestamp. | > SELECT day('2009-07-30'); Result: 30 |
| dayofmonth(date) - Returns the day of month of the date/timestamp. | > SELECT dayofmonth('2009-07-30'); Result: 30 |
| dayofweek(date) - Returns the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday). | > SELECT dayofweek('2009-07-30'); Result: 5 |
| dayofyear(date) - Returns the day of year of the date/timestamp. | > SELECT dayofyear('2016-04-09'); Result: 100 |
| decode(bin, charset) - Decodes the first argument using the second argument character set. If either argument is null, the result will also be null. | > SELECT decode(encode('abc', 'utf-8'), 'utf-8'); Result: abc > SELECT decode(2, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic'); Result: San Francisco > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle', 'Non domestic'); Result: Non domestic > SELECT decode(6, 1, 'Southlake', 2, 'San Francisco', 3, 'New Jersey', 4, 'Seattle'); Result: NULL > SELECT decode(null, 6, 'Spark', NULL, 'SQL', 4, 'rocks'); Result: SQL |
| expr1 div expr2 - Divide expr1 by expr2. It returns NULL if an operand is NULL or expr2 is 0. The result is casted to long. | > SELECT 3 div 2; Result: 1 > SELECT INTERVAL '1-1' YEAR TO MONTH div INTERVAL '-1' MONTH; Result: -13 |
| encode(str, charset) - Encodes the first argument using the second argument character set. If either argument is null, the result will also be null. | > SELECT encode('abc', 'utf-8'); Result: abc |
| exists(expr, pred) - Tests whether a predicate holds for one or more elements in the array. | > SELECT exists(array(1, 2, 3), x -> x % 2 == 0); Result: true > SELECT exists(array(1, 2, 3), x -> x % 2 == 10); Result: false > SELECT exists(array(1, null, 3), x -> x % 2 == 0); Result: NULL > SELECT exists(array(0, null, 2, 3, null), x -> x IS NULL); Result: true > SELECT exists(array(1, 2, 3), x -> x IS NULL); Result: false |
| extract(field FROM source) - Extracts a part of the date/timestamp or interval source. | > SELECT extract(YEAR FROM TIMESTAMP '2019-08-12 01:00:00.123456'); Result: 2019 > SELECT extract(week FROM timestamp'2019-08-12 01:00:00.123456'); Result: 33 > SELECT extract(doy FROM DATE'2019-08-12'); Result: 224 > SELECT extract(SECONDS FROM timestamp'2019-10-01 00:00:01.000001'); Result: 1.000001 > SELECT extract(days FROM interval 5 days 3 hours 7 minutes); Result: 5 > SELECT extract(seconds FROM interval 5 hours 30 seconds 1 milliseconds 1 microseconds); Result: 30.001001 > SELECT extract(MONTH FROM INTERVAL '2021-11' YEAR TO MONTH); Result: 11 > SELECT extract(MINUTE FROM INTERVAL '123 23:55:59.002001' DAY TO SECOND); Result: 55 |
| first(expr[, isIgnoreNull]) - Returns the first value of expr for a group of rows. If is IgnoreNull is true, returns only non-null values. | > SELECT first(col) FROM VALUES (10), (5), (20) AS tab(col); Result: 10 > SELECT first(col) FROM VALUES (NULL), (5), (20) AS tab(col); Result: NULL > SELECT first(col, true) FROM VALUES (NULL), (5), (20) AS tab(col); Result: 5 |
| hex(expr) - Converts expr to hexadecimal. | > SELECT hex(17); Result: 11 > SELECT hex('Spark SQL'); Result: 537061726B2053514C |
| hour(timestamp) - Returns the hour component of the string/timestamp. | > SELECT hour('2009-07-30 12:58:59'); Result: 12 |
| if(expr1, expr2, expr3) - If expr1 evaluates to true, then returns expr2; otherwise returns expr3. | > SELECT if(1 < 2, 'a', 'b'); Result: a |
| ifnull(expr1, expr2) - Returns expr2 if expr1is null, or expr1 otherwise. | > SELECT ifnull(NULL, array('2')); Result: ["2"] |
| str ilike pattern[ ESCAPE escape] - Returns true if str matches pattern with escapecase-insensitively, null if any arguments are null, false otherwise. | > SELECT ilike('Spark', '_Park'); Result: true > SELECT '\\\\abc' AS S, S ilike r'\\\\abc', S ilike '\\\\\\\\abc'; Result: \abc true true > SET spark.sql.parser.escapedStringLiterals=true; Result: spark.sql.parser.escapedStringLiterals true > SELECT '%SystemDrive%\\Users\\John' ilike '\\%SystemDrive\\%\\\\users%'; Result: true > SET spark.sql.parser.escapedStringLiterals=false; Result: spark.sql.parser.escapedStringLiterals false > SELECT '%SystemDrive%\\\\USERS\\\\John' ilike r'%SystemDrive%\\\\Users%'; Result: true > SELECT '%SystemDrive%/Users/John' ilike '/%SYSTEMDrive/%//Users%' ESCAPE '/'; Result: true |
| expr1 in(expr2, expr3, ...) - Returns true if expr equals to any valN. | > SELECT 1 in(1, 2, 3); Result: true > SELECT 1 in(2, 3, 4); Result: false > SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 1), named_struct('a', 1, 'b', 3)); Result: false > SELECT named_struct('a', 1, 'b', 2) in(named_struct('a', 1, 'b', 2), named_struct('a', 1, 'b', 3)); Result: true |
| instr(str, substr) - Returns the (1-based) index of the first occurrence of substrin str. | > SELECT instr('SparkSQL', 'SQL'); Result: 6 |
| lag(input[, offset[, default]]) - Returns the value of input at the offsetth row before the current row in the window. The default value of offset is 1 and the default value of default is null. If the value of input at the offsetth row is null, null is returned. If there is no such offset row (e.g., when the offset is 1, the first row of the window does not have any previous row), default is returned. | > SELECT a, b, lag(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b); Result: A1 1 NULL A1 1 1 A1 2 1 A2 3 NULL |
| last(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If is Ignore Null is true, returns only non-null values. | > SELECT last(col) FROM VALUES (10), (5), (20) AS tab(col); Result: 20 > SELECT last(col) FROM VALUES (10), (5), (NULL) AS tab(col); Result: NULL > SELECT last(col, true) FROM VALUES (10), (5), (NULL) AS tab(col); Result: 5 |
| last_day(date) - Returns the last day of the month which the date belongs to. | > SELECT last_day('2009-01-12'); Result: 2009-01-31 |
| last_value(expr[, isIgnoreNull]) - Returns the last value of expr for a group of rows. If is IgnoreNull is true, returns only non-null values. | > SELECT last_value(col) FROM VALUES (10), (5), (20) AS tab(col); Result: 20 > SELECT last_value(col) FROM VALUES (10), (5), (NULL) AS tab(col); Result: NULL > SELECT last_value(col, true) FROM VALUES (10), (5), (NULL) AS tab(col); Result: 5 |
| left(str, len) - Returns the left most len (lencan be string type) characters from the stringstr,if len is less or equal than 0 the result is an empty string. | > SELECT left('Spark SQL', 3); Result: Spa > SELECT left(encode('Spark SQL', 'utf-8'), 3); Result: Spa |
| len(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros. | > SELECT len('Spark SQL '); Result: 10 > SELECT len(x'537061726b2053514c'); Result: 9 > SELECT CHAR_LENGTH('Spark SQL '); Result: 10 > SELECT CHARACTER_LENGTH('Spark SQL '); Result: 10 |
| length(expr) - Returns the character length of string data or number of bytes of binary data. The length of string data includes the trailing spaces. The length of binary data includes binary zeros. | > SELECT length('Spark SQL '); Result: 10 > SELECT length(x'537061726b2053514c'); Result: 10 > SELECT CHAR_LENGTH('Spark SQL '); Result: 9 > SELECT CHARACTER_LENGTH('Spark SQL '); Result: 10 |
| str like pattern[ ESCAPE escape] - Returns true if str matches pattern with escape, null if any arguments are null, false otherwise. | > SELECT like('Spark', '_park'); Result: true > SELECT '\\\\abc' AS S, S like r'\\\\abc', S like '\\\\\\\\abc'; Result: \abc true true > SET spark.sql.parser.escapedStringLiterals=true; Result: spark.sql.parser.escapedStringLiterals true > SELECT '%SystemDrive%\\Users\\John' like '\\%SystemDrive\\%\\\\Users%'; Result: true > SET spark.sql.parser.escapedStringLiterals=false; Result: spark.sql.parser.escapedStringLiterals false > SELECT '%SystemDrive%\\\\Users\\\\John' like r'%SystemDrive%\\\\Users%'; Result: true > SELECT '%SystemDrive%/Users/John' like '/%SystemDrive/%//Users%' ESCAPE '/'; Result: true |
| listagg(expr[, delimiter])[ WITHIN GROUP (ORDER BY key [ASC | DESC] [,...])] - Returnsthe concatenation of non-NULL input values, separated by the delimiter ordered by key. If all values are NULL, NULL is returned. | > SELECT listagg(col) FROM VALUES ('a'), ('b'), ('c') AS tab(col); Result: abc > SELECT listagg(col) WITHIN GROUP (ORDER BY col DESC) FROM VALUES ('a'), ('b'), ('c') AS tab(col); Result: cba > SELECT listagg(col) FROM VALUES ('a'), (NULL), ('b') AS tab(col); Result: ab > SELECT listagg(col) FROM VALUES ('a'), ('a') AS tab(col); Result: aa > SELECT listagg(DISTINCT col) FROM VALUES ('a'), ('a'), ('b') AS tab(col); Result: ab > SELECT listagg(col, ', ') FROM VALUES ('a'), ('b'), ('c') AS tab(col); Result: a, b, c > SELECT listagg(col) FROM VALUES (NULL), (NULL) AS tab(col); Result: NULL |
| log(base, expr) - Returns the logarithm of expr with base. | > SELECT log(10, 100); Result: 2.0 |
| log2(expr) - Returns the logarithm of expr with base 2. | > SELECT log2(2); Result: 1.0 |
| lower(str) - Returns str with all characters changed to lowercase. | > SELECT lower('SparkSql'); Result: sparksql |
| lpad(str, len[, pad]) - Returnsstr, left-padded with pad to a length o flen.If str is longer than len, the return value is shortened to len characters or bytes. If pad is not specified, str will be padded to the left with space characters if it is a character string, and with zeros if it is a byte sequence. | > SELECT lpad('hi', 5, '??'); Result: ???hi > SELECT lpad('hi', 1, '??'); Result: h > SELECT lpad('hi', 5); Result: hi > SELECT hex(lpad(unhex('aabb'), 5)); Result: 000000AABB > SELECT hex(lpad(unhex('aabb'), 5, unhex('1122'))); Result: 112211AABB |
| ltrim(str) - Removes the leading space characters from str. | > SELECT ltrim(' SparkSQL '); Result: SparkSQL |
| median(col) - Returns the median of numeric or ANSI interval columncol. | > SELECT median(col) FROM VALUES (0), (10) AS tab(col); Result: 5.0 > SELECT median(col) FROM VALUES (INTERVAL '0' MONTH), (INTERVAL '10' MONTH) AS tab(col); Result: 0-5 |
| min(expr) - Returns the minimum value of expr. | > SELECT min(col) FROM VALUES (10), (-1), (20) AS tab(col); Result: -1 |
| minute(timestamp) - Returns the minute component of the string/timestamp. | > SELECT minute('2009-07-30 12:58:59'); Result: 58 |
| expr1 % expr2, or mod(expr1, expr2) - Returns the remainder after expr1/expr2. | > SELECT 2 % 1.8; Result: 0.2 > SELECT MOD(2, 1.8); Result: 0.2 |
| month(date) - Returns the month component of the date/timestamp. | > SELECT month('2016-07-30'); Result: 7 |
| months_between(timestamp1, timestamp2[, roundOff]) - If timestamp 1is later than timestamp 2, then the result is positive. If timestamp1 and timestamp2are on the same day of month, or both are the last day of month, time of day will be ignored. Otherwise, the difference is calculated based on 31 days per month, and rounded to 8 digits unless roundOff=false. | > SELECT months_between('1997-02-28 10:30:00', '1996-10-30'); Result: 3.94959677 > SELECT months_between('1997-02-28 10:30:00', '1996-10-30', false); Result: 3.9495967741935485 |
| not expr - Logical not. | > SELECT not true; Result: false > SELECT not false; Result: true > SELECT not NULL; Result: NULL |
| now() - Returns the current timestamp at the start of query evaluation. | > SELECT now(); Result: 2020-04-25 15:49:11.914 |
| nullif(expr1, expr2) - Returns null if expr1 equals to expr2, or expr1 otherwise. | > SELECT nullif(2, 2); Result: NULL |
| expr1 or expr2 - Logical OR. | > SELECT true or false; Result: true > SELECT false or false; Result: false > SELECT true or NULL; Result: true > SELECT false or NULL; Result: NULL |
| position(substr, str[, pos]) - Returns the position of the first occurrence of substrin str after positionpos. The given pos and return value are 1-based. | > SELECT position('bar', 'foobarbar'); Result: 4 > SELECT position('bar', 'foobarbar', 5); Result: 7 > SELECT POSITION('bar' IN 'foobarbar'); Result: 4 |
| power(expr1, expr2) - Raises expr1 to the power of expr2. | > SELECT power(2, 3); Result: 8.0 |
| rand([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1). | > SELECT rand(); Result: 0.9629742951434543 > SELECT rand(0); Result: 0.7604953758285915 > SELECT rand(null); Result: 0.7604953758285915 |
| randn([seed]) - Returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution. | > SELECT randn(); Result: -0.3254147983080288 > SELECT randn(0); Result: 1.6034991609278433 > SELECT randn(null); Result: 1.6034991609278433 |
| random([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1). | > SELECT random(); Result: 0.9629742951434543 > SELECT random(0); Result: 0.7604953758285915 > SELECT random(null); Result: 0.7604953758285915 |
| rank() - Computes the rank of a value in a group of values. The result is one plus the numberof rows preceding or equal to the current row in the ordering of the partition. The values will produce gaps in the sequence. | > SELECT a, b, rank(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b); Result: A1 1 1 A1 1 1 A1 2 3 A2 3 1 |
| regexp_extract(str, regexp[, idx]) - Extract the first string in the str that match the reg exp expression and corresponding to the regex group index. | > SELECT regexp_extract('100-200', '(\\\\d+)-(\\\\d+)', 1); Result: 100 > SELECT regexp_extract('100-200', r'(\\d+)-(\\d+)', 1); Result: 100 |
| regexp_extract_all(str, regexp[, idx]) - Extract all strings in the str that match the reg exp expression and corresponding to the regex group index. | > SELECT regexp_extract_all('100-200, 300-400', '(\\\\d+)-(\\\\d+)', 1); Result: ["100","300"] > SELECT regexp_extract_all('100-200, 300-400', r'(\\d+)-(\\d+)', 1); Result: ["100","300"] |
| regexp_like(str, regexp) - Returns true if str matches reg exp, or false otherwise. | > SET spark.sql.parser.escapedStringLiterals=true; Result: spark.sql.parser.escapedStringLiterals true > SELECT regexp_like('%SystemDrive%\\Users\\John', '%SystemDrive%\\\\Users.*'); Result: true > SET spark.sql.parser.escapedStringLiterals=false; Result: spark.sql.parser.escapedStringLiterals false Result: spark.sql.parser.escapedStringLiterals false > SELECT regexp_like('%SystemDrive%\\\\Users\\\\John', '%SystemDrive%\\\\\\\\Users.*'); Result: true > SELECT regexp_like('%SystemDrive%\\\\Users\\\\John', r'%SystemDrive%\\\\Users.*'); Result: true |
| replace(str, search[, replace]) - Replaces all occurrences of search with replace. | > SELECT replace('ABCabc', 'abc', 'DEF'); Result: ABCDEF |
| right(str, len) - Returns the rightmostlen(lencan be string type) characters from the string str, if len is less or equal than 0 the result is an empty string. | > SELECT right('Spark SQL', 3); Result: SQL |
| rlike(str, regexp) - Returns true if str matches reg exp, or false otherwise. | > SET spark.sql.parser.escapedStringLiterals=true; Result: spark.sql.parser.escapedStringLiterals true > SELECT rlike('%SystemDrive%\\Users\\John', '%SystemDrive%\\\\Users.*'); Result: true > SET spark.sql.parser.escapedStringLiterals=false; Result: spark.sql.parser.escapedStringLiterals false > SELECT rlike('%SystemDrive%\\\\Users\\\\John', '%SystemDrive%\\\\\\\\Users.*'); Result: true > SELECT rlike('%SystemDrive%\\\\Users\\\\John', r'%SystemDrive%\\\\Users.*'); Result: true |
| round(expr, d) - Returns expr rounded to decimal places using HALF_UP rounding mode. | > SELECT round(2.5, 0); Result: 3 |
| row_number() - Assigns a unique, sequential number to each row, starting with one,according to the ordering of rows within the window partition. | > SELECT a, b, row_number() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b); Result: A1 1 1 A1 1 2 A1 2 3 A2 3 1 |
| second(timestamp) - Returns the second component of the string/timestamp. | > SELECT second('2009-07-30 12:58:59'); Result: 59 |
| sequence(start, stop, step) - Generates an array of elements from start to stop (inclusive), incrementing by step. The type of the returned elements is the same as the type of argument expressions. | > SELECT sequence(1, 5); Result: [1,2,3,4,5] > SELECT sequence(5, 1); Result: [5,4,3,2,1] > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 month); Result: [2018-01-01,2018-02-01,2018-03-01] > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval '0-1' year to month); Result: [2018-01-01,2018-02-01,2018-03-01] |
| sha(expr) - Returns a sha1 hash value as a hex string of the expr. | > SELECT sha('Spark'); Result: 85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c |
| base shiftleft exp - Bitwise left shift. | > SELECT shiftleft(2, 1); Result: 4 > SELECT 2 << 1; Result: 4 |
| base shiftright expr - Bitwise (signed) right shift. | > SELECT shiftright(4, 1); Result: 2 > SELECT 4 >> 1; Result: 2 |
| split(str, regex, limit) - Splits str around occurrences that match reg ex and returns an array with a length of at most limit | > SELECT split('oneAtwoBthreeC', '[ABC]'); Result: ["one","two","three",""] > SELECT split('oneAtwoBthreeC', '[ABC]', -1); Result: ["one","two","three",""] > SELECT split('oneAtwoBthreeC', '[ABC]', 2); Result: ["one","twoBthreeC"] |
| split_part(str, delimiter, partNum) - Splits str by delimiter and return requested part of the split (1-based). If any input is null, returns null. If part Num is out of range of split parts, returns empty string. If part Num is 0 ,throws an error. If part Num is negative, the parts are counted backward from the end of the string. If the delimiter is an empty string, the str is not split. | > SELECT split_part('11.12.13', '.', 3); Result: 13 |
| substring(str, pos[, len]) - Returns the substring of str that starts at pos and is of length len, or the slice of byte array that starts at pos and is of length len. | > SELECT substring('Spark SQL', 5); Result: k SQL > SELECT substring('Spark SQL', -3); Result: SQL > SELECT substring('Spark SQL', 5, 1); Result: k > SELECT substring('Spark SQL' FROM 5); Result: k SQL > SELECT substring('Spark SQL' FROM -3); Result: SQL > SELECT substring('Spark SQL' FROM 5 FOR 1); Result: k > SELECT substring(encode('Spark SQL', 'utf-8'), 5); Result: k SQL |
| sum(expr) - Returns the sum calculated from values of a group. | > SELECT sum(col) FROM VALUES (5), (10), (15) AS tab(col); Result: 30 > SELECT sum(col) FROM VALUES (NULL), (10), (15) AS tab(col); Result: 25 > SELECT sum(col) FROM VALUES (NULL), (NULL) AS tab(col); Result: NULL |
| to_char(expr, format) - Convertexprto a string based on theformat. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input value, generating a result string of the same length as the corresponding sequence in the format string. The result string is left-padded with zeros if the 0/9 sequence comprises more digits than the matching part of the decimal value, starts with 0, and is before the decimal point. Otherwise, it is padded with spaces. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' prints '+' for positive values but 'MI' prints a space. 'PR': Only allowed at the end of the format string; specifies that the result string will be wrapped by angle brackets if the input value is negative. ('<1>'). Ifexpris a datetime,formatshall be a valid datetime pattern, seeDatetime Patterns. Ifexpris a binary, it is converted to a string in one of the formats: 'base64': a base 64 string. 'hex': a string in the hexadecimal format. 'utf-8': the input binary is decoded to UTF-8 string. | > SELECT to_char(454, '999'); Result: 454 > SELECT to_char(454.00, '000D00'); Result: 454.00 > SELECT to_char(12454, '99G999'); Result: 12,454 > SELECT to_char(78.12, '$99.99'); Result: $78.12 > SELECT to_char(-12454.8, '99G999D9S'); Result: 12,454.8- > SELECT to_char(date'2016-04-08', 'y'); Result: 2016 > SELECT to_char(x'537061726b2053514c', 'base64'); Result: U3BhcmsgU1FM > SELECT to_char(x'537061726b2053514c', 'hex'); Result: 537061726B2053514C > SELECT to_char(encode('abc', 'utf-8'), 'utf-8'); Result: abc |
| to_date(date_str[, fmt]) - Parses the date_str expression with the fmt expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the fmt is omitted. | > SELECT to_date('2009-07-30 04:17:52'); Result: 2009-07-30 > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); Result: 2016-12-31 |
| to_number(expr, fmt) - Convert string 'expr' to a number based on the string format 'fmt'. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input string. If the 0/9 sequence starts with 0 and is before the decimal point, it can only match a digit sequence of the same size. Otherwise, if the sequence starts with 9 or is after the decimal point, it can match a digit sequence that has the same or smaller size. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. 'expr' must match the grouping separator relevant for the size of the number. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' allows '-' but 'MI' does not. 'PR': Only allowed at the end of the format string; specifies that 'expr' indicates a negative number with wrapping angled brackets. ('<1>'). | > SELECT to_number('454', '999'); Result: 454 > SELECT to_number('454.00', '000.00'); Result: 454.00 > SELECT to_number('12,454', '99,999'); Result: 12454 > SELECT to_number('$78.12', '$99.99'); Result: 78.12 > SELECT to_number('12,454.8-', '99,999.9S'); Result: -12454.8 |
| to_timestamp(timestamp_str[, fmt]) - Parses thetimestamp_strexpression with thefmtexpression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if thefmtis omitted. The result data type is consistent with the value of configurationspark.sql.timestampType. | > SELECT to_timestamp('2016-12-31 00:12:00'); Result: 2016-12-31 00:12:00 > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); Result: 2016-12-31 00:00:00 |
| to_varchar(expr, format) - Convertexprto a string based on theformat. Throws an exception if the conversion fails. The format can consist of the following characters, case insensitive: '0' or '9': Specifies an expected digit between 0 and 9. A sequence of 0 or 9 in the format string matches a sequence of digits in the input value, generating a result string of the same length as the corresponding sequence in the format string. The result string is left-padded with zeros if the 0/9 sequence comprises more digits than the matching part of the decimal value, starts with 0, and is before the decimal point. Otherwise, it is padded with spaces. '.' or 'D': Specifies the position of the decimal point (optional, only allowed once). ',' or 'G': Specifies the position of the grouping (thousands) separator (,). There must be a 0 or 9 to the left and right of each grouping separator. '$': Specifies the location of the $ currency sign. This character may only be specified once. 'S' or 'MI': Specifies the position of a '-' or '+' sign (optional, only allowed once at the beginning or end of the format string). Note that 'S' prints '+' for positive values but 'MI' prints a space. 'PR': Only allowed at the end of the format string; specifies that the result string will be wrapped by angle brackets if the input value is negative. ('<1>'). Ifexpris a datetime,formatshall be a valid datetime pattern, seeDatetime Patterns. Ifexpris a binary, it is converted to a string in one of the formats: 'base64': a base 64 string. 'hex': a string in the hexadecimal format. 'utf-8': the input binary is decoded to UTF-8 string. | > SELECT to_varchar(454, '999'); Result: 454 > SELECT to_varchar(454.00, '000D00'); Result: 454.00 > SELECT to_varchar(12454, '99G999'); Result: 12,454 > SELECT to_varchar(78.12, '$99.99'); Result: $78.12 > SELECT to_varchar(-12454.8, '99G999D9S'); Result: 12,454.8- > SELECT to_varchar(date'2016-04-08', 'y'); Result: 2016 > SELECT to_varchar(x'537061726b2053514c', 'base64'); Result: U3BhcmsgU1FM > SELECT to_varchar(x'537061726b2053514c', 'hex'); Result: 537061726B2053514C > SELECT to_varchar(encode('abc', 'utf-8'), 'utf-8'); Result: abc |
| trim(str) - Removes the leading and trailing space characters fromstr. | > SELECT trim(' SparkSQL '); Result: SparkSQL > SELECT trim(BOTH FROM ' SparkSQL '); Result: SparkSQL > SELECT trim(LEADING FROM ' SparkSQL '); Result: SparkSQL > SELECT trim(TRAILING FROM ' SparkSQL '); Result: SparkSQL > SELECT trim('SL' FROM 'SSparkSQLS'); Result: parkSQ > SELECT trim(BOTH 'SL' FROM 'SSparkSQLS'); Result: parkSQ > SELECT trim(LEADING 'SL' FROM 'SSparkSQLS'); Result: parkSQLS > SELECT trim(TRAILING 'SL' FROM 'SSparkSQLS'); Result: SSparkSQ |
| trunc(date, fmt) - Returns date with the time portion of the day truncated to the unit specified by the format model fmt. | > SELECT trunc('2019-08-04', 'week'); Result: 2019-07-29 > SELECT trunc('2019-08-04', 'quarter'); Result: 2019-07-01 > SELECT trunc('2009-02-12', 'MM'); Result: 2009-02-01 > SELECT trunc('2015-10-27', 'YEAR'); Result: 2015-01-01 |
| unix_date(date) - Returns the number of days since 1970-01-01. | > SELECT unix_date(DATE("1970-01-02")); Result: 1 |
| unix_seconds(timestamp) - Returns the number of seconds since 1970-01-01 00:00:00 UTC. Truncates higher levels of precision. | > SELECT unix_seconds(TIMESTAMP('1970-01-01 00:00:01Z')); Result: 1 |
| unix_timestamp([timeExp[, fmt]]) - Returns the UNIX timestamp of current or specified time. | > SELECT unix_timestamp(); Result: 1476884637 > SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd'); Result: 1460041200 |
| upper(str) - Returns str with all characters changed to uppercase. | > SELECT upper('SparkSql'); Result: SPARKSQL |
| uuid() - Returns an universally unique identifier (UUID) string. The value is returned as a canonical UUID 36-character string. | > SELECT uuid(); Result: 46707d92-02f4-4817-8116-a4c3b23e6266 |
| weekday(date) - Returns the day of the week for date/timestamp (0 = Monday, 1 = Tuesday, ..., 6 = Sunday). | > SELECT weekday('2009-07-30'); Result: 3 |
| weekofyear(date) - Returns the week of the year of the given date. A week is considered to start on a Monday and week 1 is the first week with >3 days. | > SELECT weekofyear('2008-02-20'); Result: 8 |
| CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END - Whenexpr1= true, returns expr2; else when expr3= true, returns expr4; else returns expr5. | > SELECT CASE WHEN 1 > 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END; Result: 1.0 > SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 > 0 THEN 2.0 ELSE 1.2 END; Result: 2.0 > SELECT CASE WHEN 1 < 0 THEN 1 WHEN 2 < 0 THEN 2.0 END; Result: NULL |
| year(date) - Returns the year component of the date/timestamp. | > SELECT year('2016-07-30'); Result: 2016 |