A user-defined perform (UDF) is a method for a person to increase the native capabilities of Apache Spark™ SQL. SQL on Databricks has supported exterior user-defined capabilities written in Scala, Java, Python and R programming languages since 1.3.0. Whereas exterior UDFs are very highly effective, additionally they include a number of caveats:
- Safety. A UDF written in an exterior language can execute harmful and even malicious code. This requires tight management over who can create UDF.
- Efficiency. UDFs are black containers to the Catalyst Optimizer. Given Catalyst shouldn’t be conscious of the inside workings of a UDF, it can not do any work to enhance the efficiency of the UDF inside the context of a SQL question.
- SQL Usability. For a SQL person it may be cumbersome to put in writing UDFs in a number language after which register them in Spark. Additionally, there’s a set of extensions many customers might need to make to SQL that are reasonably easy the place growing an exterior UDF is overkill.
To deal with the above limitations, we’re thrilled to introduce a brand new type of UDF: SQL UDFs. Obtainable in DBR 9.1 LTS, the SQL UDF is totally outlined with the expressive energy of SQL and likewise utterly clear to the SQL Compiler.
Advantages of utilizing SQL UDFs
SQL UDFs are easy but highly effective extensions to SQL on Databricks. As capabilities, they supply a layer of abstraction to simplify question building – making SQL queries extra readable and modularized. In contrast to UDFs which can be written in a non-SQL language, SQL UDFs are extra light-weight for SQL customers to create. SQL perform our bodies are clear to the question optimizer thus making them extra performant than exterior UDFs. SQL UDFs may be created as both non permanent or everlasting capabilities, be reused throughout a number of queries, classes and customers, and be access-controlled by way of Entry Management Language (ACL). On this weblog, we are going to stroll you thru some key use circumstances of SQL UDFs with examples.
SQL UDFs as constants
Let’s begin with essentially the most simplistic perform possible: a continuing. Everyone knows we’re not supposed to make use of literals in our code as a result of it harms readability and, who is aware of, possibly the fixed doesn’t stay fixed in spite of everything. So we wish to have the ability to change it in a single place solely:
CREATE FUNCTION blue() RETURNS STRING COMMENT 'Blue colour code' LANGUAGE SQL RETURN '0000FF'
If you’re accustomed to exterior UDFs, you may see there are some variations that stand out:
- A SQL UDF should outline its parameter checklist, even when it’s empty. A relentless takes no parameters.
- The perform additionally declares the info kind it would return. On this case that’s a STRING.
- The implementation of the perform is a part of the perform definition.
- You specify LANGUAGE SQL to say that it’s a SQL UDF. However actually, that’s not wanted. The RETURN clause is sufficient of a give away that we determined to make this non-obligatory.
Past these variations there are a lot of different issues which can be the identical as exterior UDF:
- You possibly can substitute a perform. Extra on that later.
- You possibly can add a remark that describes the perform – as proven above.
- You possibly can even create a brief perform that you should utilize inside the present session, solely.
Let’s use the perform:
SELECT blue(); 0000FF
Unsurprisingly this works. However what is going on underneath the hood?EXPLAIN SELECT blue(); == Bodily Plan == *(1) Challenge [0000FF AS default.blue()#9] +- *(1) Scan OneRowRelation
That is neat! The SQL compiler changed the perform invocation with the fixed itself.
Which means at the very least this SQL UDF comes at zero value to efficiency.
Now, let’s take a look at one other frequent utilization sample.
SQL UDF encapsulating expressions
Think about you don’t just like the naming of some built-in capabilities. Possibly you’re migrating numerous queries from one other product, which has completely different perform names and behaviors. Or maybe you simply can’t stand copy-pasting some prolonged expressions time and again in your SQL queries. So, you need to repair that.
With SQL UDF, we are able to merely create a brand new perform with the title we like:CREATE FUNCTION to_hex(x INT COMMENT 'Any quantity between 0 - 255') RETURNS STRING COMMENT 'Converts a decimal to a hexadecimal' CONTAINS SQL DETERMINISTIC RETURN lpad(hex(least(biggest(0, x), 255)), 2, 0)
Let’s take a look at what new syntax was used right here:
- This perform takes an argument, and the parameter is outlined by a reputation, a sort and an non-obligatory remark.
- The CONTAINS SQL clause is non-obligatory, however tells us the perform doesn’t learn or modify any knowledge in a desk. It’s the default setting, so that you usually wouldn’t specify it.
- DETERMINISTIC can also be non-obligatory and tells us that the perform will all the time return the identical consequence set given the identical arguments. The clause is for documentation solely at this level. However sooner or later sooner or later it might be used to dam non deterministic capabilities in sure contexts.
- Within the RETURN clause the parameter has been referred to by title. In additional complicated eventualities beneath you will notice that the parameter can get disambiguated with the perform title. Naturally you should utilize arbitrarily complicated expressions because the perform physique.
Not solely does it work …
SELECT to_hex(id) FROM vary(2); 00 01
… however it works effectively:EXPLAIN SELECT to_hex(id) FROM vary(2); == Bodily Plan == *(1) Challenge [lpad(hex(cast(least(greatest(0, cast(id#0 as int)), 255) as bigint)), 2, 0) AS default.to_hex(id)#1] +- *(1) Vary (0, 2, step=1, splits=4)
We will see that the bodily plan reveals a straight software of the capabilities lpad, hex, least and biggest. This is similar plan you get invoking the sequence of capabilities straight.
It’s also possible to compose SQL capabilities out of SQL capabilities:CREATE FUNCTION rgb_to_hex(r INT, g INT, b INT) RETURNS STRING COMMENT 'Converts an RGB colour to a hex colour code' RETURN CONCAT(to_hex(r), to_hex(g), to_hex(b)) SELECT rgb_to_hex(0, 0, 255); 0000FF
SQL UDF studying from tables
One other frequent utilization of SQL UDF is to codify lookups. A easy lookup could also be to decode RGB colour codes into English colour names:CREATE FUNCTION from_rgb(rgb STRING COMMENT 'an RGB hex colour code') RETURNS STRING COMMENT 'Interprets an RGB colour code right into a colour title' RETURN DECODE(rgb, 'FF00FF', 'magenta', 'FF0080', 'rose'); SELECT from_rgb('FF0080'); rose
OK, however there are much more than two colours on this world. And we wish this translation each methods, so these ought to actually be in a lookup desk:CREATE TABLE colours(rgb STRING NOT NULL, title STRING NOT NULL); INSERT INTO colours VALUES ('FF00FF', 'magenta'), ('FF0080', 'rose'), ('BFFF00', 'lime'), ('7DF9FF', 'electrical blue'); CREATE OR REPLACE FUNCTION from_rgb(rgb STRING COMMENT 'an RGB hex colour code') RETURNS STRING READS SQL DATA SQL SECURITY DEFINER COMMENT 'Interprets an RGB colour code right into a colour title' RETURN SELECT FIRST(title) FROM colours WHERE rgb = from_rgb.rgb; SELECT from_rgb(rgb) FROM VALUES('7DF9FF'), ('BFFF00') AS codes(rgb); electrical blue lime
There are a number of new ideas utilized right here:
- You possibly can REPLACE a SQL UDF. To be allowed to try this, the brand new perform should match the outdated perform’s signature. The signature of a perform is outlined because the variety of its parameters and their sorts.
- This perform appears to be like up info in a desk, so you may optionally doc that utilizing READS SQL DATA. If you happen to state nothing the SQL Compiler will derive the right worth, however you could not lie and state CONTAINS SQL.
- SQL SECURITY DEFINER is one other non-obligatory clause, which states that the question accessing the colours desk will use the authorization of the perform proprietor. So the perform may very well be executed by the general public with out compromising the safety of the desk.
- Simply because the perform operates underneath the authorization of its proprietor it would all the time be parsed utilizing the present database at time of creation.
- `rgb` is the title of the column in numbers. By qualifying the parameter as `from_rgb`.`rgb` you make clear that you simply imply the parameter reference, and never the column.
How does the bodily plan appear like now? It’s simple to see that utilizing an exterior UDF, which itself performs a question that may end in a nested loop be a part of, is an terrible method to burn treasured sources.
EXPLAIN SELECT from_rgb(rgb) FROM VALUES ('7DF9FF'), ('BFFF00') AS codes(rgb); == Bodily Plan == AdaptiveSparkPlan isFinalPlan=false +- Challenge [first(name)#1322268 AS default.from_rgb(rgb)#1322259] +- BroadcastHashJoin [rgb#1322261], [rgb#1322266], LeftOuter, BuildRight, false :- LocalTableScan [rgb#1322261] +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(enter[1, string, false]),false), [id=#1437557] +- SortAggregate(key=[rgb#1322266], capabilities=[finalmerge_first(merge first#1322271, valueSet#1322272) AS first(name#1322267)()#1322260]) +- Type [rgb#1322266 ASC NULLS FIRST], false, 0 +- Trade hashpartitioning(rgb#1322266, 200), ENSURE_REQUIREMENTS, [id=#1437553] +- SortAggregate(key=[rgb#1322266], capabilities=[partial_first(name#1322267, false) AS (first#1322271, valueSet#1322272)]) +- Type [rgb#1322266 ASC NULLS FIRST], false, 0 +- FileScan parquet default.colours[rgb#1322266,name#1322267]
On this case, Catalyst has chosen a broadcast hash be a part of as an alternative of a nested loop be a part of. It might probably do that as a result of it understands the content material of the SQL UDF.
To this point, all examples mentioned used scalar-valued capabilities – ones that return a single worth. That consequence could also be of any kind, even complicated combos of structs, arrays, and maps.There may be additionally one other kind of UDF to debate – the table-valued UDF.
SQL Desk UDF
Think about if views took arguments! You can encapsulate complicated predicates even when they depend on user-provided values. A SQL Desk UDF is simply that: a view by some other title, besides with parameters.
Let’s assume that the colour mapping above shouldn’t be distinctive. On the very least, we are able to assert the colour names differ throughout languages.
Subsequently the `from_rgb` perform must be modified to return both an array of names or a relation.INSERT INTO colours VALUES ('BFFF00', 'citron vert'); CREATE OR REPLACE FUNCTION from_rgb(rgb STRING COMMENT 'an RGB hex colour code') RETURNS TABLE(title STRING COMMENT 'colour title') READS SQL DATA SQL SECURITY DEFINER COMMENT 'Interprets an RGB colour code right into a colour title' RETURN SELECT title FROM colours WHERE rgb = from_rgb.rgb;
As you may see, the one distinction in comparison with a scalar perform is a extra complicated RETURNS clause. In contrast to views, SQL UDFs mandate a declaration of the returned relation’s signature:
- TABLE specifies that the perform returns a relation.
- The TABLE clause should embody a reputation for every return column and the column’s knowledge kind.
- Chances are you’ll optionally specify a remark for any return column.
Person-defined desk capabilities are new to DBR. Let’s take a look at how you can invoke them.
SELECT * FROM from_rgb('7DF9FF'); electrical blue
In its easiest kind, a desk perform is invoked in the identical manner and the identical locations a view is referenced. The one distinction are the necessary braces, which embody the perform’s arguments. This perform is invoked with literal arguments, however the arguments may be any expression, even scalar subqueries.
Strongest, nonetheless, is the utilization of SQL desk UDF in a be a part of, sometimes a correlated cross be a part of:SELECT rgb, from_rgb.title FROM VALUES('7DF9FF'), ('BFFF00') AS codes(rgb), LATERAL from_rgb(codes.rgb); 7DF9FF electrical blue BFFF00 lime BFFF00 citron vert
Right here the arguments refer (correlate) to a previous (lateral) relation within the FROM clause. The brand new LATERAL key phrase provides Catalyst permission to resolve these columns. Additionally notice that you may discuss with the results of the desk perform by naming the columns as outlined within the consequence signature and optionally certified by the perform title.
Naturally, SQL UDFs are totally supported by the present GRANT, REVOKE, SHOW, DESCRIBE and DROP statements.
The assertion value stating in additional element is DESCRIBE.DESCRIBE FUNCTION from_rgb; Operate: default.from_rgb Kind: TABLE Enter: rgb STRING Returns: title STRING
The essential describe returns what you would possibly count on, however the prolonged DESCRIBE provides considerably extra element:DESCRIBE FUNCTION EXTENDED from_rgb; Operate: default.from_rgb Kind: TABLE Enter: rgb STRING 'an RGB hex colour code' Returns: title STRING 'colour title' Remark: Interprets an RGB colour code right into a colour title Information Entry: READS SQL DATA Configs: spark.sql.datetime.java8API.enabled=true spark.sql.hive.model=2.3.7 Proprietor: serge.rielau Create Time: Wed Sep 08 08:59:53 PDT 2021 Physique: SELECT title FROM colours WHERE rgb = from_rgb.rgb
What we now have described represents the preliminary performance for SQL UDF. Future extensions we’re pondering embody assist for:
- SQL PATH, so you may create a library of capabilities in a database and subscribe to them from one other, simply as you’ll do in your file system.
- Overloading of UDFs.
- UDFs with default values for parameters.
SQL UDFs are an enormous step ahead in SQL usability and can be utilized in many various methods as outlined on this weblog. We encourage you to consider much more inventive methods to leverage SQL UDFs be it in Databricks SQL or utilizing Photon for Information Engineering jobs. Attempt the pocket book right here and see the documentation for extra info.