How one can Pace Up Knowledge Movement Between Databricks and SAS

March 16, 2022

283

[ad_1]

This can be a collaborative publish between Databricks and T1A. We thank Oleg Mikhov, Options Architect at T1A, for his contributions.

That is the primary publish in a collection of blogs on one of the best practices of bringing collectively Databricks Lakehouse Platform and SAS. A earlier Databricks weblog publish launched Databricks and PySpark to SAS builders. On this publish, we focus on methods for exchanging knowledge between SAS and Databricks Lakehouse Platform and methods to hurry up the info stream. In future posts, we are going to discover constructing environment friendly knowledge and analytics pipelines involving each applied sciences.

Knowledge-driven organizations are quickly adopting the Lakehouse platform to maintain up with the continually rising enterprise calls for. Lakehouse platform has change into a brand new norm for organizations wanting to construct knowledge platforms and structure. The modernization entails shifting knowledge, functions, or different enterprise components to the cloud. Nevertheless, the transition to the cloud is a gradual course of and it’s business-critical to proceed leveraging legacy investments for so long as potential. With that in thoughts, many firms are inclined to have a number of knowledge and analytics platforms, the place the platforms coexist and complement one another.

One of many mixtures we see is the usage of SAS with the Databricks Lakehouse. There are a lot of advantages of enabling the 2 platforms to effectively work collectively, akin to:

Higher and scalable knowledge storage capabilities of cloud platforms
Higher computing capability utilizing applied sciences, akin to Apache Spark™, natively constructed with parallel processing capabilities
Obtain higher compliance with knowledge governance and administration utilizing Delta Lake
Decrease the price of knowledge analytics infrastructure with simplified architectures

Some frequent knowledge science and knowledge evaluation use instances and causes noticed are:

SAS practitioners leverage SAS for its core statistical packages to develop superior analytics output that meets regulatory necessities whereas they use Databricks Lakehouse for knowledge administration, ELT kinds of processing, and knowledge governance
Machine studying fashions developed in SAS are scored on large quantities of knowledge utilizing parallel processing structure of Apache Spark engine within the Lakehouse platform
SAS knowledge analysts achieve quicker entry to giant quantities of knowledge within the Lakehouse Platform for ad-hoc evaluation and reporting utilizing Databricks SQL endpoints and excessive bandwidth connectors
Ease cloud modernization and migration journey by establishing a hybrid workstream involving each cloud structure and on-prem SAS platform

Nevertheless, a key problem of this coexistence is how the info is performantly shared between the 2 platforms. On this weblog, we share finest practices applied by T1A for his or her prospects and benchmark outcomes evaluating totally different strategies of shifting knowledge between Databricks and SAS.

Eventualities

The most well-liked use case is a SAS developer making an attempt to entry knowledge within the lakehouse. The analytics pipelines involving each applied sciences require knowledge stream in each instructions: knowledge moved from Databricks to SAS and knowledge moved from SAS to Databricks.

Entry Delta Lake from SAS: A SAS consumer needs to entry large knowledge in Delta Lake utilizing the SAS programming language.
Entry SAS datasets from Databricks: A Databricks consumer needs to entry SAS datasets, usually the sas7bdat datasets as a DataFrame to course of in Databricks pipelines or retailer in Delta Lake for enterprise-wide entry.

In our benchmark assessments, we used the next atmosphere setup:

Microsoft Azure because the cloud platform
SAS 9.4M7 on Azure (single node Customary D8s v3 VM)
Databricks runtime 9.0, Apache Spark 3.1.2 (2 nodes Customary DS4v2 cluster)

Determine 1 exhibits the conceptual structure diagram with the parts mentioned. Databricks Lakehouse sits on Azure Knowledge Lake storage with Delta Lake medallion structure. SAS 9.4 put in on Azure VM connects to Databricks Lakehouse to learn/write knowledge utilizing connection choices mentioned within the following sections.

SAS and Databricks conceptual architecture diagram on Azure — Determine 1 SAS and Databricks conceptual structure diagram on Azure

The diagram above exhibits a conceptual structure of Databricks deployed on Azure. The structure can be related on different cloud platforms. On this weblog, we solely focus on the combination with the SAS 9.4 platform. In a later weblog publish, we are going to lengthen this dialogue to entry lakehouse knowledge from SAS Viya.

Entry Delta Lake from SAS

Think about that we now have a Delta Lake desk that must be processed in a SAS program. We wish one of the best efficiency when accessing this desk, whereas additionally avoiding any potential points with knowledge integrity or knowledge sorts compatibility. There are alternative ways to attain knowledge integrity and compatibility. Beneath we focus on a number of strategies and examine them on ease of use and efficiency.

In our testing, we used the eCommerce conduct dataset (5.67GB, 9 columns, ~ 42 mill data) from Kaggle.
Knowledge Supply Credit score: eCommerce conduct knowledge from multi class retailer and REES46 Advertising and marketing Platform.

Examined strategies

1. Utilizing SAS/ACCESS Interface connectors
Historically, SAS customers leverage SAS/ACCESS software program to hook up with exterior knowledge sources. You’ll be able to both use a SAS LIBNAME assertion pointing to the Databricks cluster or use the SQL pass-through facility. At current for SAS 9.4, there are three connection choices accessible.

SAS/ACCESS Interface to Spark has been lately loaded with capabilities with unique assist to Databricks clusters. See this video for a brief demonstration. The video mentions SAS Viya however the identical is relevant to SAS 9.4.

Code samples on tips on how to use these connectors may be discovered on this git repository: T1A Git – SAS Libraries Examples.

2. Utilizing saspy bundle
The open-source library, saspy, from SAS Institute permits Databricks Pocket book customers to run SAS statements from a Python cell within the pocket book to execute code within the SAS server, in addition to to import and export knowledge from SAS datasets to Pandas DataFrame.

For the reason that focus of this part is accessing lakehouse knowledge by a SAS programmer utilizing SAS programming, this technique was wrapped in a SAS macro program much like the purpose-built integration technique mentioned subsequent.

To attain higher efficiency with this bundle, we examined the configuration with an outlined char_length choice (particulars accessible right here). With this feature, we will outline lengths for character fields within the dataset. In our assessments utilizing this feature introduced a further 15% enhance in efficiency. For the transport layer between environments, we used the saspy configuration with an SSH connection to the SAS server.

3. Utilizing a purpose-built integration
Though the 2 strategies talked about above have their upsides, the efficiency may be improved additional by addressing some shortcomings, mentioned within the subsequent part (Take a look at Outcomes), of the earlier strategies. With that in thoughts, we developed a SAS macro-based integration utility with a primary concentrate on efficiency and usefulness for SAS customers. The SAS macro may be simply built-in into present SAS code with none information about Databricks platform, Apache Spark or Python.

The macro orchestrates a multistep course of utilizing Databricks API:

Instruct the Databricks cluster to question and extract knowledge per the supplied SQL question and cache the leads to DBFS, counting on its Spark SQL distributed processing capabilities.
Compress and securely switch the dataset to the SAS server (CSV in GZIP) over SSH
Unpack and import knowledge into SAS to make it accessible to the consumer within the SAS library. At this step, leverage column metadata from Databricks knowledge catalog (column sorts, lengths, and codecs) for constant, appropriate and environment friendly knowledge presentation in SAS

Word that for variable-length knowledge sorts, the combination helps totally different configuration choices, relying on what most closely fits the consumer necessities akin to,

want for utilizing a configurable default worth
profiling to 10,000 rows (+ add headroom) to determine the biggest worth
profiling the complete column within the dataset to determine the biggest worth

A simplified model of the code is obtainable right here T1A Git – SAS DBR Customized Integration.

The tip-user utilization of this SAS macro seems as proven under, and takes three inputs:

SQL question, based mostly on which knowledge can be extracted from Databricks
SAS libref the place the info ought to land
Title to be given to the SAS dataset

Take a look at outcomes

Databricks to SAS data access methods performance — Determine 2 Databricks to SAS knowledge entry strategies efficiency

As proven within the plot above, for the check dataset, the outcomes present that SAS/ACCESS Interface to JDBC and SAS/ACCESS Interface to Apache Spark confirmed related efficiency and carried out decrease in comparison with different strategies. The primary motive for that’s the JDBC strategies don’t profile character columns in datasets to be able to set correct column size within the SAS dataset. As a substitute, they outline the default size for all character column sorts (String and Varchar) as 765 symbols. That causes not solely efficiency points throughout preliminary knowledge retrieval however for all additional processing. Plus it consumes important further storage. In our assessments, for the supply dataset of 5.6 GB, we ended with a 216 GB file within the WORK library. Nevertheless, with the SAS/ACCESS Interface to ODBC, the default size was 255 symbols, which resulted in a big efficiency enhance.

Utilizing SAS/ACCESS Interface strategies is essentially the most handy choice for present SAS customers. There are some essential concerns whenever you use these strategies

Each options assist implicit question pass-through however with some limitations:

SAS/ACCESS Interface to JDBC/ODBC assist solely pass-through for PROC SQL statements
Along with PROC SQL pass-through SAS/ACCESS Interface to Apache Spark helps pass-through for many of the SQL capabilities. This technique additionally permits pushing frequent SAS procedures to Databricks clusters.

The problem with setting the size for the character columns described earlier than. As a workaround, we recommend utilizing the DBSASTYPE choice to explicitly set column size for SAS tables. This can assist with additional processing of the dataset however gained’t have an effect on the preliminary retrieval of the info from Databricks

SAS/ACCESS Interface to Apache Spark/JDBC/ODBC doesn’t permit combining tables from totally different Databricks databases (schemas) assigned as totally different libnames in the identical question (becoming a member of them) with the pass-through facility. As a substitute, it should trigger exporting complete tables in SAS and processing in SAS. As a workaround, we recommend making a devoted schema in Databricks that can comprise views based mostly on tables from totally different databases (schemas).

Utilizing the saspy technique confirmed barely higher efficiency in comparison with SAS/ACCESS Interface to JDBC/Spark strategies, nevertheless, the principle downside is that saspy library solely works with pandas DataFrames and it places a big load on the Apache Spark driver program and requires the complete DataFrame to be pulled into reminiscence.

The purpose-built integration technique confirmed one of the best efficiency in comparison with different examined strategies. Determine 3 exhibits a stream chart with high-level steering in selecting from the strategies mentioned.

Databricks to SAS data access - method selection — Determine 3 Databricks to SAS knowledge entry – technique choice

Entry SAS datasets from Databricks

This part addresses the necessity by Databricks builders to ingest a SAS dataset into Delta Lake and make it accessible in Databricks for enterprise intelligence, visible analytics, and different superior analytics use instances whereas among the beforehand described strategies are relevant right here, some further strategies are mentioned.

Within the check, we begin with a SAS dataset (in sas7bdat format) on the SAS server, and ultimately, we now have this dataset accessible as Spark DataFrame (if the lazy invocation is relevant we drive to load knowledge in a DataFrame and measure the general time) in Databricks.

We used the identical atmosphere and the identical dataset for this situation that was used within the earlier situation. The assessments don’t contemplate the use case the place a SAS consumer writes a dataset into Delta Lake utilizing SAS programming. This entails making an allowance for cloud supplier instruments and capabilities which can be mentioned in a later weblog publish.

Examined strategies

1. Utilizing the saspy bundle from SAS
The sd2df technique within the saspy library converts a SAS dataset to a pandas DataFrame, utilizing SSH for knowledge switch. It provides a number of choices for staging storage (Reminiscence, CSV, DISK) throughout the switch. In our check, the CSV choice, which makes use of PROC EXPORT csv file and pandas read_csv() strategies, which is the really useful choice for giant knowledge units, confirmed one of the best efficiency.

2. Utilizing pandas technique
Since early releases pandas allowed customers to learn sas7bdat recordsdata utilizing pandas.read_sas API. The SAS file must be accessible to the python program. Generally used strategies are FTP, HTTP, or shifting to cloud object storage akin to S3. We fairly used a less complicated strategy to maneuver a SAS file from the distant SAS server to the Databricks cluster utilizing SCP.

3. Utilizing spark-sas7bdat
Spark-sas7bdat is an open-source bundle developed particularly for Apache Spark. Much like the pandas.read_sas() technique, the SAS file have to be accessible on the filesystem. We downloaded the sas7bdat file from a distant SAS Server utilizing SCP.

4. Utilizing a purpose-built integration
One other technique that was explored is utilizing standard methods with a concentrate on balancing comfort and efficiency. This technique abstracts away core integrations and is made accessible to the consumer as a Python library which is executed from the Databricks Pocket book.

Use saspy bundle to execute a SAS macro code (on a SAS server) which does the next

Export sas7bdat to CSV file utilizing SAS code
Compress the CSV file to GZIP

Transfer the compressed file to the Databricks cluster driver node utilizing SCP

Decompresses the CSV file

Reads CSV file to Apache Spark DataFrame

Take a look at outcomes

SAS to Databricks data access methods performance — Determine 4 SAS to Databricks knowledge entry strategies efficiency

The spark-sas7bdat confirmed one of the best efficiency amongst all of the strategies. This bundle takes full benefit of parallel processing in Apache Spark. It distributes blocks of sas7bdat recordsdata on employee nodes. The main downside of this technique is that sas7bdat is a proprietary binary format, and the library was constructed based mostly on reverse engineering of this binary format, so it doesn’t assist all kinds of sas7bdat recordsdata, in addition to it isn’t formally (commercially) vendor-supported.

The saspy and pandas strategies are related in the best way that they’re each constructed for a single node atmosphere and each learn knowledge to pandas DataFrame requiring a further step earlier than having the info accessible as a Spark DataFrame.

The purpose-built integration macro confirmed higher efficiency in comparison with saspy and pandas as a result of it reads knowledge from CSV by way of Apache Spark APIs. Nevertheless, it doesn’t beat the efficiency of the spark-sas7bdat bundle. The aim-built technique may be handy in some instances because it permits including intermediate knowledge transformations on the SAS server.

Conclusion

Increasingly enterprises are gravitating in direction of constructing a Databricks Lakehouse and there are a number of methods of accessing knowledge from the Lakehouse by way of different applied sciences. This weblog discusses how SAS builders, knowledge scientists and different enterprise customers can leverage the info within the Lakehouse and write the outcomes to the cloud. In our experiment, we examined a number of totally different strategies of studying and writing knowledge between Databricks and SAS. The strategies range not solely by efficiency however by comfort and extra capabilities that they supply.

For this check, we used the SAS 9.4M7 platform. SAS Viya helps many of the mentioned approaches but in addition supplies further choices. For those who’d wish to study extra in regards to the strategies or different specialised integration approaches not coated right here, be happy to achieve out to us at Databricks or databricks@t1a.com.

Within the upcoming posts on this weblog collection, we are going to focus on finest practices in implementing built-in knowledge pipelines, end-to-end workflows, utilizing SAS and Databricks and tips on how to leverage SAS In-Database applied sciences for scoring SAS fashions in Databricks clusters.

SAS® and all different SAS Institute Inc. services or products names are registered emblems or emblems of SAS Institute Inc. within the USA and different nations. ® signifies USA registration.

Get began

Attempt the course, Databricks for SAS Customers, on Databricks Academy to get a fundamental hands-on expertise with PySpark programming for SAS programming language constructs and contact us to study extra about how we will help your SAS staff to onboard their ETL workloads to Databricks and allow finest practices.

[ad_2]

How one can Pace Up Knowledge Movement Between Databricks and SAS

Eventualities

Entry Delta Lake from SAS

Examined strategies

Take a look at outcomes

Entry SAS datasets from Databricks

Examined strategies

Take a look at outcomes

Conclusion

Get began

New DataGrail analysis finds firms might spend upwards of $400K/12 months complying with knowledge privateness legal guidelines, doubling the 2020 value

Automate notifications on Slack for Amazon Redshift question monitoring rule violations

From the Floor Up: The Reality About Information Innovation

LEAVE A REPLY Cancel reply

Most Popular

Engaged on a Scrum Group Coaching: Public Course Now Obtainable:

Introducing the Insider Incident Knowledge Trade Normal (IIDES)

Chris Patterson on MassTransit and Occasion-Pushed Methods – Software program Engineering Radio

LangChain and Agentic AI Engineering with Erick Friis

Free Video Coaching – Scrum Staff Reset – Video #1 Out there Now

Cyber-Knowledgeable Machine Studying

Charles Humble on Skilled Expertise for Software program Engineers – Software program Engineering Radio

The Subsea Cable Community with Josh Dzieza

Digital Forensics with Emre Tinaztepe

Fallout: London with Daniel Morrison Neil and Jordan Albon

Recent Comments

ABOUT US

POPULAR POSTS

Engaged on a Scrum Group Coaching: Public Course Now Obtainable:

Introducing the Insider Incident Knowledge Trade Normal (IIDES)

Chris Patterson on MassTransit and Occasion-Pushed Methods – Software program Engineering Radio

POPULAR CATEGORY