Develop and test AWS Glue version 3.0 jobs locally using a Docker container

April 14, 2022


AWS Glue is a fully managed serverless service that allows you to process data coming from different data sources at scale. You can use AWS Glue jobs for various use cases such as data ingestion, preprocessing, enrichment, and data integration from different data sources. AWS Glue version 3.0, the latest version of AWS Glue Spark jobs, provides a performance-optimized Apache Spark 3.1 runtime experience for batch and stream processing.

You can author AWS Glue jobs in different ways. If you prefer coding, AWS Glue allows you to write Python/Scala source code with the AWS Glue ETL library. If you prefer interactive scripting, AWS Glue interactive sessions and AWS Glue Studio notebooks help you write scripts in notebooks by inspecting and visualizing the data. If you prefer a graphical interface rather than coding, AWS Glue Studio helps you author data integration jobs visually without writing code.

For a production-ready data platform, a development process and CI/CD pipeline for AWS Glue jobs is key. We understand the huge demand for developing and testing AWS Glue jobs wherever you prefer: on a local laptop, in a Docker container on Amazon Elastic Compute Cloud (Amazon EC2), and so on. You can achieve that by using AWS Glue Docker images hosted on Docker Hub or the Amazon Elastic Container Registry (Amazon ECR) Public Gallery. The Docker images help you set up your development environment with additional utilities. You can use your preferred IDE, notebook, or REPL with the AWS Glue ETL library.

This post is a continuation of the blog post “Developing AWS Glue ETL jobs locally using a container”. While the earlier post introduced the pattern of development for AWS Glue ETL jobs on a Docker container using a Docker image, this post focuses on how to develop and test AWS Glue version 3.0 jobs using the same approach.

Solution overview

The following Docker images are available for AWS Glue on Docker Hub:

AWS Glue version 3.0 – amazon/aws-glue-libs:glue_libs_3.0.0_image_01
AWS Glue version 2.0 – amazon/aws-glue-libs:glue_libs_2.0.0_image_01

You can also obtain the images from the Amazon ECR Public Gallery:

AWS Glue version 3.0 – public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01
AWS Glue version 2.0 – public.ecr.aws/glue/aws-glue-libs:glue_libs_2.0.0_image_01

Note: AWS Glue Docker images are x86_64 compatible, and arm64 hosts are currently not supported.
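
Before pulling an image, it can help to confirm the host architecture. The following is a minimal sketch, not part of the AWS Glue tooling: the helper name is_supported_host and the set of machine names treated as x86_64 are assumptions made for illustration.

```python
import platform

# Machine names commonly reported for 64-bit x86 hosts.
# This set is an assumption for illustration and is not exhaustive.
X86_64_NAMES = {"x86_64", "amd64"}


def is_supported_host(machine: str) -> bool:
    """Return True if the reported machine type looks like x86_64."""
    return machine.strip().lower() in X86_64_NAMES


if __name__ == "__main__":
    machine = platform.machine()
    if is_supported_host(machine):
        print(f"{machine}: x86_64 host, matches the image architecture")
    else:
        print(f"{machine}: not x86_64, the images are not supported on this host")
```

On arm64 machines, platform.machine() typically reports a value such as arm64 or aarch64, which this sketch treats as unsupported.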

In this post, we use amazon/aws-glue-libs:glue_libs_3.0.0_image_01 and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue version 3.0 Spark jobs. The image contains the following:

Amazon Linux
AWS Glue ETL Library (aws-glue-libs)
Apache Spark 3.1.1
Spark history server
JupyterLab
Livy
Other library dependencies (the same as the ones of the AWS Glue job system)

To set up your container, you pull the image from Docker Hub and then run the container. We demonstrate how to run your container with the following methods, depending on your requirements:

spark-submit
REPL shell (pyspark)
pytest
JupyterLab
Visual Studio Code

Prerequisites

Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac, Windows, or Linux. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.
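
These prerequisites can be partly checked from a script. The sketch below is an illustration under stated assumptions: it reads "7 GB" as decimal gigabytes, it measures free space on the volume that holds the home directory (Docker may store images elsewhere), and the helper names are hypothetical. Finding the docker CLI on the PATH does not prove that the daemon is running.

```python
import shutil
from pathlib import Path

# 7 GB, read here as decimal gigabytes (an assumption about the unit).
REQUIRED_BYTES = 7 * 1000**3


def has_enough_space(free_bytes: int, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the free space meets the required size."""
    return free_bytes >= required_bytes


def docker_cli_found() -> bool:
    """Return True if a docker executable is on the PATH."""
    return shutil.which("docker") is not None


if __name__ == "__main__":
    # Assumption: Docker stores images on the same volume as the home directory.
    free = shutil.disk_usage(Path.home()).free
    print("docker CLI on PATH:", docker_cli_found())
    print("at least 7 GB free:", has_enough_space(free))
```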

For more information about restrictions when developing AWS Glue code locally, see Local Development Restrictions.

Configure AWS credentials

To enable AWS API calls from the container, set up your AWS credentials with the following steps:

Create an AWS named profile.
Open cmd on Windows or a terminal on Mac/Linux, and run the following command:
```
PROFILE_NAME="profile_name"
```

In the following sections, we use this AWS named profile.
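
Because the container reads credentials from the mounted ~/.aws directory, a misspelled profile name only surfaces later as a failed API call. The following sketch checks for the profile up front. It is an illustration only: the helper profile_in_credentials is hypothetical, and it inspects the credentials file alone, although a named profile can also be defined in ~/.aws/config under a [profile name] section.

```python
import configparser
from pathlib import Path


def profile_in_credentials(text: str, profile_name: str) -> bool:
    """Return True if the credentials-file text has a [profile_name] section."""
    # Interpolation is disabled so that '%' characters in values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser.has_section(profile_name)


if __name__ == "__main__":
    path = Path.home() / ".aws" / "credentials"
    name = "profile_name"  # same placeholder value as PROFILE_NAME above
    if not path.exists():
        print(f"{path} not found")
    else:
        try:
            print(f"profile '{name}' found:", profile_in_credentials(path.read_text(), name))
        except configparser.Error as err:
            print(f"could not parse {path}: {err}")
```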

Pull the image from Docker Hub

If you’re running Docker on Windows, choose the Docker icon (right-click) and choose Switch to Linux containers… before pulling the image.

Run the following command to pull the image from Docker Hub:

docker pull amazon/aws-glue-libs:glue_libs_3.0.0_image_01

Run the container

Now you can run a container using this image. You can choose any of the following methods based on your requirements.

spark-submit

You can run an AWS Glue job script by running the spark-submit command on the container.

Write your ETL script (sample.py in the example below) and save it under the /local_path_to_workspace/src/ directory using the following commands:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/src
$ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}

These variables are used in the docker run command below. The sample code (sample.py) used in the spark-submit command below is included in the appendix at the end of this post.

Run the following command to run the spark-submit command on the container to submit a new Spark application:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_spark_submit amazon/aws-glue-libs:glue_libs_3.0.0_image_01 spark-submit /home/glue_user/workspace/src/$SCRIPT_FILE_NAME
...22/01/26 09:08:55 INFO DAGScheduler: Job 0 finished: fromRDD at DynamicFrame.scala:305, took 3.639886 s
root
|-- family_name: string
|-- name: string
|-- links: array
| |-- element: struct
| | |-- note: string
| | |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
| |-- element: struct
| | |-- scheme: string
| | |-- identifier: string
|-- other_names: array
| |-- element: struct
| | |-- lang: string
| | |-- note: string
| | |-- name: string
|-- sort_name: string
|-- images: array
| |-- element: struct
| | |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
| |-- element: struct
| | |-- type: string
| | |-- value: string
|-- death_date: string

...
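
When this command is scripted, for example in a CI job, it can be easier to assemble it as an argument list than as one long shell string. The sketch below mirrors the flags of the spark-submit invocation in this section. The function name build_spark_submit_command and its parameters are hypothetical, and the sketch only prints the command instead of running it.

```python
import shlex
from pathlib import Path

IMAGE = "amazon/aws-glue-libs:glue_libs_3.0.0_image_01"
CONTAINER_HOME = "/home/glue_user"


def build_spark_submit_command(workspace, script_name, profile_name, aws_dir=None):
    """Build the docker run argument list that submits src/<script_name>."""
    # Default to the current user's ~/.aws directory on the host.
    aws_dir = str(aws_dir) if aws_dir is not None else str(Path.home() / ".aws")
    return [
        "docker", "run", "-it",
        "-v", f"{aws_dir}:{CONTAINER_HOME}/.aws",          # AWS credentials
        "-v", f"{workspace}:{CONTAINER_HOME}/workspace/",  # job scripts
        "-e", f"AWS_PROFILE={profile_name}",
        "-e", "DISABLE_SSL=true",
        "--rm",
        "-p", "4040:4040",    # Spark UI
        "-p", "18080:18080",  # Spark history server
        "--name", "glue_spark_submit",
        IMAGE,
        "spark-submit", f"{CONTAINER_HOME}/workspace/src/{script_name}",
    ]


if __name__ == "__main__":
    cmd = build_spark_submit_command("/local_path_to_workspace", "sample.py", "profile_name")
    # Print a copy-pasteable shell command; nothing is executed here.
    print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run would start the container. When no terminal is attached, as in many CI systems, the -it flags may need to be dropped.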

REPL shell (pyspark)

You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the pyspark command on the container to start the REPL shell:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark
...
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1-amzn-0
      /_/

Using Python version 3.7.10 (default, Jun 3 2021 00:02:01)
Spark context Web UI available at http://56e99d000c99:4040
Spark context available as 'sc' (master = local[*], app id = local-1643011860812).
SparkSession available as 'spark'.
>>>

pytest

For unit testing, you can use pytest for AWS Glue Spark job scripts.

Run the following commands for preparation:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}

Run the following command to run pytest on the test suite:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pytest amazon/aws-glue-libs:glue_libs_3.0.0_image_01 -c "python3 -m pytest"
starting org.apache.spark.deploy.history.HistoryServer, logging to /home/glue_user/spark/logs/spark-glue_user-org.apache.spark.deploy.history.HistoryServer-1-5168f209bd78.out
============================================================= test session starts =============================================================
platform linux -- Python 3.7.10, pytest-6.2.3, py-1.11.0, pluggy-0.13.1
rootdir: /home/glue_user/workspace
plugins: anyio-3.4.0
collected 1 item

tests/test_sample.py . [100%]

============================================================== warnings summary ===============================================================
tests/test_sample.py::test_counts
 /home/glue_user/spark/python/pyspark/sql/context.py:79: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
 DeprecationWarning)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================================== 1 passed, 1 warning in 21.07s ========================================================

JupyterLab

You can start Jupyter for interactive development and ad hoc queries on notebooks. Complete the following steps:

Run the following command to start JupyterLab:

$ JUPYTER_WORKSPACE_LOCATION=/local_path_to_workspace/jupyter_workspace/
$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $JUPYTER_WORKSPACE_LOCATION:/home/glue_user/workspace/jupyter_workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 -p 8998:8998 -p 8888:8888 --name glue_jupyter_lab amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
...
[I 2022-01-24 08:19:21.368 ServerApp] Serving notebooks from local directory: /home/glue_user/workspace/jupyter_workspace
[I 2022-01-24 08:19:21.368 ServerApp] Jupyter Server 1.13.1 is running at:
[I 2022-01-24 08:19:21.368 ServerApp] http://faa541f8f99f:8888/lab
[I 2022-01-24 08:19:21.368 ServerApp] or http://127.0.0.1:8888/lab
[I 2022-01-24 08:19:21.368 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Open http://127.0.0.1:8888/lab in your web browser on your local machine to access the JupyterLab UI.
Choose Glue Spark Local (PySpark) under Notebook.

Now you can start developing code in the interactive Jupyter notebook UI.

Visual Studio Code

To set up the container with Visual Studio Code, complete the following steps:

Install Visual Studio Code.
Install Python.
Install Visual Studio Code Remote – Containers.
Open the workspace folder in Visual Studio Code.
Choose Settings.
Choose Workspace.
Choose Open Settings (JSON).

Enter the following JSON and save it:

{
    "python.defaultInterpreterPath": "/usr/bin/python3",
    "python.analysis.extraPaths": [
        "/home/glue_user/aws-glue-libs/PyGlue.zip",
        "/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip",
        "/home/glue_user/spark/python/"
    ]
}
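
If you set up several workspaces, the same settings can be generated from a script instead of being typed by hand. The sketch below builds the settings as a Python dictionary and prints them as JSON. The function name build_workspace_settings is hypothetical, and listing each library path as its own array entry is an assumption about how the Python extension reads python.analysis.extraPaths.

```python
import json

GLUE_HOME = "/home/glue_user"


def build_workspace_settings() -> dict:
    """Return workspace settings that point the Python extension at the Glue libraries."""
    return {
        "python.defaultInterpreterPath": "/usr/bin/python3",
        # Assumption: one array entry per library path.
        "python.analysis.extraPaths": [
            f"{GLUE_HOME}/aws-glue-libs/PyGlue.zip",
            f"{GLUE_HOME}/spark/python/lib/py4j-0.10.9-src.zip",
            f"{GLUE_HOME}/spark/python/",
        ],
    }


if __name__ == "__main__":
    # Print the JSON; redirect it to .vscode/settings.json in the workspace folder.
    print(json.dumps(build_workspace_settings(), indent=4))
```

Workspace settings are stored in .vscode/settings.json under the workspace folder, so the printed output can be saved there.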

Now you’re ready to set up the container.

Run the Docker container:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark

Start Visual Studio Code.
Choose Remote Explorer in the navigation pane, and choose the container amazon/aws-glue-libs:glue_libs_3.0.0_image_01.
Right-click and choose Attach to Container.
If the following dialog appears, choose Got it.
Open /home/glue_user/workspace/.
Create an AWS Glue PySpark script and choose Run.

You should see the successful run of the AWS Glue PySpark script.

Conclusion

In this post, we learned how to get started with AWS Glue Docker images. AWS Glue Docker images help you develop and test your AWS Glue job scripts anywhere you prefer. They are available on Docker Hub and the Amazon ECR Public Gallery. Check them out; we look forward to getting your feedback.

Appendix: AWS Glue job sample codes for testing

This appendix introduces two scripts as AWS Glue job sample codes for testing purposes. You can use either of them in the tutorial.

The following sample.py code uses the AWS Glue ETL library with an Amazon Simple Storage Service (Amazon S3) API call. The code requires Amazon S3 permissions in AWS Identity and Access Management (IAM). You need to grant the IAM-managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows you to make ListBucket and GetObject API calls for the S3 path.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class GluePythonSampleTest:
    def __init__(self):
        # Resolve JOB_NAME only when it was passed on the command line.
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        # Fall back to a default job name for local runs.
        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
        else:
            jobname = "test"
        self.job.init(jobname, args)

    def run(self):
        dyf = read_json(self.context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
        dyf.printSchema()

        self.job.commit()


def read_json(glue_context, path):
    # Read JSON files under the given S3 path into a DynamicFrame.
    dynamicframe = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format="json"
    )
    return dynamicframe


if __name__ == '__main__':
    GluePythonSampleTest().run()
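
The sample falls back to a default job name because the spark-submit command in this post passes no --JOB_NAME argument. The sketch below illustrates that fallback outside the container. It is a stand-in built on the standard library argparse module, not the awsglue getResolvedOptions function, and the helper name resolve_job_name is hypothetical.

```python
import argparse


def resolve_job_name(argv, default="test"):
    """Return the value passed as --JOB_NAME, or the default when it is absent."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--JOB_NAME", default=default)
    # Ignore arguments this sketch does not know about.
    known, _unknown = parser.parse_known_args(argv[1:])
    return known.JOB_NAME


if __name__ == "__main__":
    print(resolve_job_name(["sample.py"]))
    print(resolve_job_name(["sample.py", "--JOB_NAME", "test_count"]))
```

The first call corresponds to the local spark-submit run, and the second to a run where the job name is supplied, as the unit test fixture does by appending to sys.argv.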

The following test_sample.py code is a sample for a unit test of sample.py:

import pytest
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import sys
from src import sample


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield(context)

    job.commit()


def test_counts(glue_context):
    dyf = sample.read_json(glue_context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
    assert dyf.toDF().count() == 1961

About the Authors

Subramanya Vajiraya is a Cloud Engineer (ETL) at AWS Sydney specialized in AWS Glue. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and taking long walks with his dog Ollie, a 1-year-old Corgi.

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys learning different use cases from customers and sharing knowledge about big data technologies with the wider community.
