Known transpilation issues
Celonis SQLGlot is an adaptation of the SQLGlot Python library that transpiles Vertica SQL into Spark SQL. Every SQL dialect has unique features that separate it from other dialects, and this is also true for Vertica and Spark. Some of these features are relied on, either implicitly or explicitly, in code. Transpiling that code may return different results or may not be possible at all. This document lists these non-transpilable Vertica features together with possible workarounds or advisories.
TIMESERIES is used to fill gaps between a given start and end time (e.g. given a value for 2020-01-01 and another for 2021-01-01, TIMESERIES returns rows for the days between them). There is no direct equivalent in Spark, because TIMESERIES takes the start and end date implicitly from the data it is applied to. However, there is a way to create a table of the entries between given dates; this expects an explicit start and end date.
We assume that in most cases TIMESERIES is used to create a calendar table for use in further queries, which can be achieved with the workaround below. Other, more complicated queries may need to be rewritten so as not to use TIMESERIES.
| Instead of TIMESERIES (Vertica) | Use (Spark) |
|---|---|
| SELECT (...) FROM ( SELECT '01/01/2020'::TIMESTAMP as tm UNION SELECT CURRENT_DATE() as tm ) as t TIMESERIES ts as '1 Day' OVER (ORDER BY tm) ORDER BY DAY; | SELECT explode( sequence( DATE'2020-01-01', NOW()::DATE, INTERVAL 1 DAY) ) as calendar_date |
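If the intent is a calendar table, the generated dates can then be joined against your data. The sketch below assumes a hypothetical activities table with event_time and event_id columns:

-- Build a calendar and left join activity data onto it, so that days
-- without any events still appear in the result.
-- (activities, event_time and event_id are placeholder names.)
WITH calendar AS (
  SELECT explode( sequence( DATE'2020-01-01', NOW()::DATE, INTERVAL 1 DAY) ) AS calendar_date
)
SELECT c.calendar_date,
       COUNT(a.event_id) AS events_per_day
FROM calendar c
LEFT JOIN activities a
  ON CAST(a.event_time AS DATE) = c.calendar_date
GROUP BY c.calendar_date
ORDER BY c.calendar_date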
DATE_TRUNC truncates a timestamp or date to the specified date part. In Vertica, subtracting a number from NOW() is interpreted as subtracting that many days; in Spark, you must explicitly use the INTERVAL keyword to subtract days:

| Vertica | Spark |
|---|---|
| DATE_TRUNC('day', NOW()-10) | DATE_TRUNC('day', NOW() - INTERVAL 10 DAY) |
The SELECT SLEEP(x) query is commonly used in Vertica; however, this functionality is not supported in Spark. If your query is failing due to the use of a sleep function, please remove it to ensure successful execution.
Previously, there were no restrictions on table names. With the ETL Engine, we are restricting table names:
- Object names cannot exceed 255 characters.
- The following special characters are not allowed:
  - Period (.)
  - Space ( )
  - Forward slash (/)
  - All ASCII control characters (00-1F hex)
  - The DELETE character (7F hex)
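For example, a table name that relies on the now disallowed characters would need to be renamed (the names below are illustrative):

-- Rejected by the ETL Engine: the name contains a period and a space
-- CREATE TABLE "my.staging table" AS (SELECT 1 AS col);

-- Accepted: disallowed characters replaced, e.g. with underscores
CREATE TABLE my_staging_table AS (SELECT 1 AS col);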
Vertica has a TIME datatype; Spark SQL does not. If the TIME datatype is used, it is transpiled into a construct that parses the data into an interval type:

CAST(DATE_FORMAT({input_date}, 'HH:mm:ss') AS INTERVAL HOUR TO SECOND)

However, in some cases the TIME datatype is used as an intermediate step to create a TIMESTAMP datatype via VARCHAR. For example:

SELECT CAST('2025-11-10' AS DATE)||' '|| CAST('10:11:12' AS TIME)

In these scenarios, the INTERVAL datatype does not work. If we detect that the TIME datatype is part of a VARCHAR operation, the transpilation only returns the DATE_FORMAT:

DATE_FORMAT({input_date}, 'HH:mm:ss')

Vertica generally allows implicit casting and tries to automatically detect date formats. In many cases this results in queries that would fail on other DB engines but run on Vertica. However, it can also hide cases where Vertica detects the format incorrectly.
Spark, on the other hand, is restrictive with formats: by default it accepts one format (YYYY-MM-DD HH:mm:ss), but it accepts many others if they are specified explicitly in the query.
In the transpilation, we try to detect different date formats and transpile them into the main format above. However, this only applies to the dates we see in the query; it does not apply to your data. This means that you may see errors where the transpiled query is not able to replicate the Vertica results, requiring you to replace CAST(... AS DATE) or CAST(... AS TIMESTAMP) with TO_DATE or TO_TIMESTAMP, which allow you to also specify the input format.
| Vertica | Spark |
|---|---|
| Covered date format detection in query: SELECT CAST('01.02.2020' AS DATE) | Date format auto-converted: SELECT CAST('2020-02-01' AS DATE) |
| NOT covered date format detection: SELECT CAST(BSEG.CPUDT AS DATE) | Transpiled query (may lead to an error depending on the input data format): SELECT CAST(BSEG.CPUDT AS DATE) |
Also, if you are using dates or timestamps specified in your SQL code, verify that they have been translated correctly and use TO_DATE or TO_TIMESTAMP if not.
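For example, instead of relying on implicit detection for a source column, you could state the expected input format explicitly. The yyyyMMdd format below is only an assumption about how the column stores its dates:

-- Implicit detection (may fail or detect the wrong format):
-- SELECT CAST(BSEG.CPUDT AS DATE) FROM BSEG

-- Explicit input format with TO_DATE:
SELECT TO_DATE(BSEG.CPUDT, 'yyyyMMdd') AS cpudt_date FROM BSEG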
If the date is ambiguous, the logic likely prefers American formats unless that format would be invalid or the year is in the first position:

01.02.2020 -> 2020-01-02 (MM.DD.YYYY detected)
13.02.2020 -> 2020-02-13 (DD.MM.YYYY detected, because MM.DD.YYYY would be invalid)
2023/02/01 -> 2023-02-01 (YYYY/MM/DD detected)
CONDITIONAL_CHANGE_EVENT compares a given expression's result against the same expression's result in the previous row. If they differ, the counter is incremented; if not, it stays the same.
The function is automatically transpiled to:
SUM(IF(COALESCE(CAST(LAG(expression) OVER(window) AS STRING), 'NULL') <> COALESCE(CAST((expression) AS STRING), 'NULL'), 1, 0)) OVER(window)
Here, expression is the given expression in Vertica SQL and window is the provided window specification. However, the Vertica function always starts at 0 even if the first row's expression is not NULL, whereas the transpiled Spark expression might not. Additionally, Vertica sorts NULL values differently depending on the datatype of the column the NULL value is in.
If you are comparing expressions that can become NULL, or if you are sorting on columns that can contain NULL, you should check whether the transpiled output still behaves as intended.
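As an illustration, substituting a concrete expression and window into the pattern above could look like this (status, case_id and event_time are placeholder names):

-- Vertica:
-- CONDITIONAL_CHANGE_EVENT(status) OVER (PARTITION BY case_id ORDER BY event_time)

-- Transpiled Spark expression, following the pattern above:
SUM(
  IF(
    COALESCE(CAST(LAG(status) OVER (PARTITION BY case_id ORDER BY event_time) AS STRING), 'NULL')
      <> COALESCE(CAST(status AS STRING), 'NULL'),
    1, 0
  )
) OVER (PARTITION BY case_id ORDER BY event_time)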
Vertica allows you to CAST to CHAR which is a data type that is not supported by Spark. The behaviour of casting to CHAR is similar to doing a LEFT string operation for single-byte characters. However, there might be cases where a character requires more than 1 byte for its representation. This can cause unexpected truncation:
SELECT CAST('€$%€' AS CHAR(7));
-- it requires 8 bytes to store the string '€$%€'
-- output: '€$%'

Using the LEFT function in Spark:

SELECT LEFT('€$%€', 7);
-- output: '€$%€'

It's probable that users employing CAST to CHAR intend character-based trimming and are unknowingly truncating by bytes, particularly when special multi-byte characters are involved.
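If you want to find values where byte-based and character-based truncation could differ, one possible check in Spark compares byte length and character length (table and column names are placeholders):

-- Values containing multi-byte characters: byte length and character length
-- disagree, so CAST(... AS CHAR(n)) in Vertica and LEFT(..., n) in Spark
-- could truncate them differently.
SELECT some_text_column
FROM some_table
WHERE octet_length(some_text_column) <> length(some_text_column);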
We cannot use a table alias with the name "VALUES":
SELECT * from XYZ VALUES; -- Unexpected Token Error
Even though the query has correct syntax and would execute on both Vertica and Spark, SQLGlot interprets the name "VALUES" as a keyword. This keyword is not expected at this position, so the transpilation fails.
To avoid this, simply use a different alias:
SELECT * from XYZ as XYZ_VALUES; -- works
Celonis parameters are preserved if they are used to represent a value, but we cannot resolve them during transpilation. If they contain commands, the remaining query may not be transpilable.
Supported:
| Input | Transpiled |
|---|---|
| ... WHERE <%=Param1%> = 1 AND Table.Field2 = <%=Param2%> | ... WHERE <%=Param1%> = 1 AND `Table`.Field2 = <%=Param2%> |
| <%=Param1%> := Table.Field1   <%=Param2%> := 'Testvalue' | <%=Param1%> := Table.Field1   <%=Param2%> := 'Testvalue' |
Not supported:
| Input | Transpiled |
|---|---|
| ... WHERE <%=Param1%> AND Table.Field2 = <%=Param2%> | ... WHERE _PARAM_1 -- Invalid SQL! AND `Table`.Field2 = <%=Param2%> |
| <%=Param1%> := Table.Field1 = 1   <%=Param2%> := 'Testvalue' | <%=Param1%> := Table.Field1 = 1   <%=Param2%> := 'Testvalue' |
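One possible way to make such a parameter transpilable, sketched from the examples above, is to keep only the value inside the parameter and move the comparison into the query itself:

-- Not transpilable: the parameter holds a full condition
-- <%=Param1%> := Table.Field1 = 1
-- SELECT * FROM Table WHERE <%=Param1%> AND Table.Field2 = <%=Param2%>

-- Transpilable: the parameter only represents a value,
-- the comparison stays inside the query
-- <%=Param1%> := Table.Field1
SELECT * FROM Table WHERE <%=Param1%> = 1 AND Table.Field2 = <%=Param2%>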
When creating a new table, you can use another statement to define and fill the table in one go:
CREATE TABLE T AS (SELECT Col FROM SourceTable);
This infers the datatype from the used tables and columns. However, Vertica allows you to leave the type ambiguous, while Spark does not. So that your queries still run, we transpile ambiguous datatypes to the string datatype as a fallback:
Vertica:
CREATE TABLE T AS (SELECT NULL as null_column)
Spark:
CREATE TABLE T AS (SELECT cast(NULL as string) as null_column)
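If the string fallback is not the type you want, one possible alternative is to declare the intended type yourself in the query, for example:

-- Explicitly declare the intended type instead of relying on the string fallback
CREATE TABLE T AS (SELECT CAST(NULL AS INT) AS null_column)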
LEAD is not supported. Use LAG with negative numbers instead.
LEAD(Column) --> LAG(Column, -1)
LEAD(Column, 10) --> LAG(Column, -10)
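In a full window expression, this could look as follows (amount, case_id, event_time and payments are placeholder names):

-- Vertica: LEAD(amount) OVER (PARTITION BY case_id ORDER BY event_time)
-- Rewritten with LAG and a negative offset:
SELECT LAG(amount, -1) OVER (PARTITION BY case_id ORDER BY event_time) AS next_amount
FROM payments;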
Vertica uses Perl regular expression syntax, while Spark uses Java regular expressions. You may need to check whether your regex still works (in many cases it does). One special aspect is the modifiers: some are supported (c: case sensitive, i: case insensitive, m: multiline, n: dot matches newline, x: ignore comments in the regexp pattern), while one is not supported (b: treat strings as binary octets).
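For example, a case-insensitive match that Vertica expresses with a modifier argument can often be expressed in Spark with Java's inline flags; the table, column and pattern below are only illustrative:

-- Vertica: case-insensitive match via the 'i' modifier
-- SELECT * FROM invoices WHERE REGEXP_LIKE(doc_type, '^credit', 'i');

-- Spark: the same intent using Java's inline (?i) flag with RLIKE
SELECT * FROM invoices WHERE doc_type RLIKE '(?i)^credit';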
Vertica allows special syntax and parameters for LISTAGG:
- separator
- max_length
- on_overflow

'separator' is supported, but max_length's behavior (either throwing an error or truncating) cannot be exactly matched in Spark. The main reasons are:

- Vertica counts in bytes, while Spark only counts the number of characters. This makes a difference if special UTF-8 characters are used.
- Spark SQL cannot throw errors.
- Vertica does not truncate the resulting string at the max length, but truncates while concatenating. If a very long value would breach the limit, that value is left out entirely, resulting in a far shorter string than the limit would allow.
If you do not need exact replication but still require the string to be within a certain length, you could rewrite the query as SELECT LEFT(LISTAGG(YourColumn, YourSeparator), YourMaxLengthOfChars).
For example, SELECT LISTAGG(col USING PARAMETERS separator='||', max_length=1000, on_overflow='TRUNCATE') could become SELECT LEFT(LISTAGG(col, '||'), 1000). Note, however, that this could produce different results: the last entry before exceeding the character limit would be cut off instead of removed, and some special characters would count differently towards the limit.