Catalog and analyze Application Load Balancer logs more efficiently with AWS Glue custom classifiers and Amazon Athena

November 17, 2021

You can query Application Load Balancer (ALB) access logs for various purposes, such as analyzing traffic distribution and patterns. You can also easily use Amazon Athena to create a table and query the ALB access logs on Amazon Simple Storage Service (Amazon S3). (For more information, see How do I analyze my Application Load Balancer access logs using Amazon Athena? and Querying Application Load Balancer Logs.) Every query runs against the whole table because the table doesn't define any partitions. If you have several years of ALB logs, you may want to use a partitioned table instead for better query performance and cost control. In fact, partitioning data is one of the Top 10 performance tuning tips for Athena.

However, because ALB log files aren't stored under a Hive-style prefix (such as /year=2021/), creating thousands of partitions using ALTER TABLE ADD PARTITION in Athena is cumbersome. This post shows a way to create and schedule an AWS Glue crawler with a Grok custom classifier that infers the schema of all ALB log files under the specified Amazon S3 prefix and populates the partition metadata (year, month, and day) automatically in the AWS Glue Data Catalog.
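To get a feel for why the manual route is cumbersome, the following minimal Python sketch generates one ALTER TABLE ADD PARTITION statement per day. The table name, bucket, account ID, and Region are placeholders chosen for illustration, and the sketch assumes the yyyy/mm/dd folder layout that ALB uses when it delivers logs under the AWSLogs prefix.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_location, start, end):
    """Yield one Athena DDL statement per day in [start, end]."""
    day = start
    while day <= end:
        yield (
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (year='{day:%Y}', month='{day:%m}', day='{day:%d}') "
            f"LOCATION '{base_location}{day:%Y/%m/%d}/';"
        )
        day += timedelta(days=1)


# Placeholder values for illustration only.
BASE = "s3://alb-logs-directory/AWSLogs/123456789012/elasticloadbalancing/us-east-2/"
statements = list(
    add_partition_statements("alb_logs", BASE, date(2019, 1, 1), date(2021, 12, 31))
)
print(len(statements))  # 1096 statements for three years of daily folders
print(statements[0])
```

Three years of logs already require more than a thousand statements, and the list has to be extended every day as new folders arrive.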

Prerequisites

To follow along with this post, complete the following prerequisites:

Enable access logging on the ALBs, and have the log files already delivered to the specified S3 bucket.
Set up the Athena query result location. For more information, see Working with Query Results, Output Files, and Query History.

Solution overview

The following diagram illustrates the solution architecture.

To implement this solution, we complete the following steps:

Prepare the Grok pattern for our ALB logs, and cross-check it with a Grok debugger.
Create an AWS Glue crawler with a Grok custom classifier.
Run the crawler to prepare a table with partitions in the Data Catalog.
Analyze the partitioned data using Athena and compare query speed vs. a non-partitioned table.

Prepare the Grok pattern for our ALB logs

As a preliminary step, locate the access log files on the Amazon S3 console, and manually inspect the files to observe the format and syntax. To allow an AWS Glue crawler to recognize the format, we need a Grok pattern that matches each log line against an expression and maps specific parts of it to the corresponding fields. Approximately 100 sample Grok patterns are available in the Logstash Plugins GitHub, and we can write our own custom pattern if the one we need isn't listed.

The following is the basic syntax format for a Grok pattern: %{PATTERN:FieldName}

The following is an example of an ALB access log:

http 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.000 0.001 0.000 200 200 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337262-36d228ad5d99923122bbe354" "-" "-" 0 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.0.1:80" "200" "-" "-"
https 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 10.0.0.1:80 0.086 0.048 0.037 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337281-1d84f3d73c47ec4e58577259" "www.example.com" "arn:aws:acm:us-east-2:123456789012:certificate/12345678-1234-1234-1234-123456789012" 1 2018-07-02T22:22:48.364000Z "authenticate,forward" "-" "-" "10.0.0.1:80" "200" "-" "-"
h2 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.1.252:48160 10.0.0.66:9000 0.000 0.002 0.000 200 200 5 257 "GET https://10.0.2.105:773/ HTTP/2.0" "curl/7.46.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337327-72bd00b0343d75b906739c42" "-" "-" 1 2018-07-02T22:22:48.364000Z "redirect" "https://example.com:80/" "-" "10.0.0.66:9000" "200" "-" "-"
ws 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.0.140:40914 10.0.1.192:8010 0.001 0.003 0.000 101 101 218 587 "GET http://10.0.0.30:80/ HTTP/1.1" "-" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 1 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.1.192:8010" "101" "-" "-"
wss 2018-07-02T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 10.0.0.140:44244 10.0.0.171:8010 0.000 0.001 0.000 101 101 218 786 "GET https://10.0.0.30:443/ HTTP/1.1" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 1 2018-07-02T22:22:48.364000Z "forward" "-" "-" "10.0.0.171:8010" "101" "-" "-"
http 2018-11-30T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 - 0.000 0.001 0.000 200 200 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 0 2018-11-30T22:22:48.364000Z "forward" "-" "-" "-" "-" "-" "-"
http 2018-11-30T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 192.168.131.39:2817 - 0.000 0.001 0.000 502 - 34 366 "GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 0 2018-11-30T22:22:48.364000Z "forward" "-" "LambdaInvalidResponse" "-" "-" "-" "-"

To map the first field, the Grok pattern might look like the following code:

%{DATA:type}\s

The pattern includes the following components:

DATA maps to .*?
type is the column name
\s is the whitespace character

To map the second field, the Grok pattern might look like the following:

%{TIMESTAMP_ISO8601:time}\s

This pattern has the following elements:

TIMESTAMP_ISO8601 maps to %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
time is the column name
\s is the whitespace character

When writing Grok patterns, we should also consider corner cases. For example, the following code covers the normal case, in which the field always holds a number:

%{BASE10NUM:target_processing_time}\s

But when the field may hold a null value, we should replace the pattern with the following:

%{DATA:target_processing_time}\s

When our Grok pattern is ready, we can test it with sample input using a third-party Grok debugger. The following pattern is a good start, but always remember to test it with your actual ALB logs.

%{DATA:type}\s+%{TIMESTAMP_ISO8601:time}\s+%{DATA:elb}\s+%{DATA:client}\s+%{DATA:target}\s+%{BASE10NUM:request_processing_time}\s+%{DATA:target_processing_time}\s+%{BASE10NUM:response_processing_time}\s+%{BASE10NUM:elb_status_code}\s+%{DATA:target_status_code}\s+%{BASE10NUM:received_bytes}\s+%{BASE10NUM:sent_bytes}\s+"%{DATA:request}"\s+"%{DATA:user_agent}"\s+%{DATA:ssl_cipher}\s+%{DATA:ssl_protocol}\s+%{DATA:target_group_arn}\s+"%{DATA:trace_id}"\s+"%{DATA:domain_name}"\s+"%{DATA:chosen_cert_arn}"\s+%{DATA:matched_rule_priority}\s+%{TIMESTAMP_ISO8601:request_creation_time}\s+"%{DATA:actions_executed}"\s+"%{DATA:redirect_url}"\s+"%{DATA:error_reason}"\s+"%{DATA:target_list}"\s+"%{DATA:target_status_code_list}"\s+"%{DATA:classification}"\s+"%{DATA:classification_reason}"

Keep in mind that when you copy the Grok pattern from your browser, in some cases there are extra spaces at the end of the lines. Make sure to remove these extra spaces.
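If no Grok debugger is at hand, a rough local check is possible with a plain regular expression. The sketch below is a Python approximation of the full pattern, not the Grok engine itself: the DATA, NUM, and TS constants are simplified stand-ins for the Logstash definitions (an assumption), and the helper names are made up for this example. It parses the 502 sample line, in which the target and target status code are both "-".

```python
import re

# Simplified stand-ins for the Grok primitives (assumptions, not the exact
# Logstash definitions).
DATA = r".*?"
NUM = r"[+-]?\d+(?:\.\d+)?"
TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z"

# (column name, primitive, quoted?) in the same order as the Grok pattern.
FIELDS = [
    ("type", DATA, False), ("time", TS, False), ("elb", DATA, False),
    ("client", DATA, False), ("target", DATA, False),
    ("request_processing_time", NUM, False),
    ("target_processing_time", DATA, False),
    ("response_processing_time", NUM, False),
    ("elb_status_code", NUM, False), ("target_status_code", DATA, False),
    ("received_bytes", NUM, False), ("sent_bytes", NUM, False),
    ("request", DATA, True), ("user_agent", DATA, True),
    ("ssl_cipher", DATA, False), ("ssl_protocol", DATA, False),
    ("target_group_arn", DATA, False), ("trace_id", DATA, True),
    ("domain_name", DATA, True), ("chosen_cert_arn", DATA, True),
    ("matched_rule_priority", DATA, False),
    ("request_creation_time", TS, False),
    ("actions_executed", DATA, True), ("redirect_url", DATA, True),
    ("error_reason", DATA, True), ("target_list", DATA, True),
    ("target_status_code_list", DATA, True), ("classification", DATA, True),
    ("classification_reason", DATA, True),
]


def _piece(name, primitive, quoted):
    """Build one named group, wrapped in double quotes for quoted fields."""
    group = f"(?P<{name}>{primitive})"
    return f'"{group}"' if quoted else group


# Fields are separated by one or more whitespace characters, as in the Grok pattern.
ALB_LINE = re.compile(r"\s+".join(_piece(*f) for f in FIELDS) + r"\s*$")


def parse_alb_line(line):
    """Return a dict of column name -> raw string, or None if the line does not match."""
    match = ALB_LINE.match(line)
    return match.groupdict() if match else None


SAMPLE = (
    'http 2018-11-30T22:23:00.186641Z app/my-loadbalancer/50dc6c495c0c9188 '
    '192.168.131.39:2817 - 0.000 0.001 0.000 502 - 34 366 '
    '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.46.0" - - '
    'arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 '
    '"Root=1-58337364-23a8c76965a2ef7629b185e3" "-" "-" 0 '
    '2018-11-30T22:22:48.364000Z "forward" "-" "LambdaInvalidResponse" "-" "-" "-" "-"'
)

row = parse_alb_line(SAMPLE)
print(row["type"], row["elb_status_code"], row["target_status_code"], row["error_reason"])
```

Had target_status_code been mapped with the numeric primitive instead of DATA, this line would not match at all, which is the corner case described earlier.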

Create an AWS Glue crawler with a Grok custom classifier

Before you create your crawler, you first create a custom classifier. Complete the following steps:

On the AWS Glue console, under Crawler, choose Classifiers.
Choose Add classifier.
For Classifier name, enter alb-logs-classifier.
For Classifier type, select Grok.
For Classification, enter alb-logs.
For Grok pattern, enter the pattern from the previous section.
Choose Create.

Now you can create your crawler.

Choose Crawlers in the navigation pane.
Choose Add crawler.
For Crawler name, enter alb-access-log-crawler.
For Selected classifiers, enter alb-logs-classifier.
Choose Next.
For Crawler source type, select Data stores.
For Repeat crawls of S3 data stores, select Crawl new folders only.
Choose Next.
For Choose a data store, choose S3.
For Crawl data in, select Specified path in my account.
For Include path, enter the path to your ALB logs (for example, s3://alb-logs-directory/AWSLogs/<ACCOUNT-ID>/elasticloadbalancing/<REGION>/).
Choose Next.
When prompted to add another data store, select No and choose Next.
Select Create an IAM role, and give it a name such as AWSGlueServiceRole-alb-logs-crawler.
For Frequency, choose Daily.
Indicate your start hour and minute.
Choose Next.
For Database, enter elb-access-log-db.
For Prefix added to tables, enter alb_logs_.
Expand Configuration options.
Select Update all new and existing partitions with metadata from the table.
Keep the other options at their default.
Choose Next.
Review your settings and choose Finish.
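For readers who prefer scripting over the console, the same settings can be collected into request dictionaries for the AWS Glue API. This is a minimal sketch, not part of the original walkthrough: the two builder functions are hypothetical helpers, the IAM role is assumed to exist already, and the mapping of console options to RecrawlPolicy, SchemaChangePolicy, and the Configuration JSON reflects my understanding of the API, so verify it against the Glue documentation before relying on it.

```python
import json


def build_classifier_request(grok_pattern):
    """Arguments for Glue's CreateClassifier call (Grok classifier)."""
    return {
        "GrokClassifier": {
            "Name": "alb-logs-classifier",
            "Classification": "alb-logs",
            "GrokPattern": grok_pattern,
        }
    }


def build_crawler_request(role_arn, include_path, hour=1, minute=0):
    """Arguments for Glue's CreateCrawler call, mirroring the console steps."""
    return {
        "Name": "alb-access-log-crawler",
        "Role": role_arn,
        "DatabaseName": "elb-access-log-db",
        "TablePrefix": "alb_logs_",
        "Classifiers": ["alb-logs-classifier"],
        "Targets": {"S3Targets": [{"Path": include_path}]},
        # Daily schedule at the chosen start hour and minute (UTC).
        "Schedule": f"cron({minute} {hour} * * ? *)",
        # "Crawl new folders only" in the console.
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # Assumption: incremental crawls expect LOG for both behaviors.
        "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
        # "Update all new and existing partitions with metadata from the table".
        "Configuration": json.dumps({
            "Version": 1.0,
            "CrawlerOutput": {
                "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
            },
        }),
    }


# Placeholder account ID, bucket, and Region for illustration only.
crawler_request = build_crawler_request(
    "arn:aws:iam::123456789012:role/AWSGlueServiceRole-alb-logs-crawler",
    "s3://alb-logs-directory/AWSLogs/123456789012/elasticloadbalancing/us-east-2/",
    hour=1,
    minute=30,
)
print(crawler_request["Schedule"])
```

With boto3 installed and credentials configured, these dictionaries could be passed to a Glue client as create_classifier(**build_classifier_request(pattern)) and create_crawler(**crawler_request).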

Run your AWS Glue crawler

Next, we run our crawler to prepare a table with partitions in the Data Catalog.

On the AWS Glue console, choose Crawlers.
Select the crawler we just created.
Choose Run crawler.

When the crawler is complete, you receive a notification indicating that a table has been created.

Next, we review and edit the schema.

Under Databases, choose Tables.
Choose the table alb_logs_<region>.
Cross-check the column name and corresponding data type.

The table has three partition columns: partition_0, partition_1, and partition_2.

Choose Edit schema.
Rename the columns year, month, and day.
Choose Save.
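The generic partition_0, partition_1, and partition_2 names appear because the folders under the include path carry no key=value names, so the crawler can only number them by position. The following sketch illustrates that positional mapping after the rename; the function name and the object file name are hypothetical, and the values are kept as raw strings.

```python
def partition_values(object_key, include_prefix):
    """Map the first three folders under the include path to year, month, day."""
    if not object_key.startswith(include_prefix):
        raise ValueError("object is outside the crawler's include path")
    relative = object_key[len(include_prefix):]
    folders = relative.split("/")[:-1]  # drop the file name
    if len(folders) < 3:
        raise ValueError("expected yyyy/mm/dd folders under the include path")
    year, month, day = folders[:3]
    return {"year": year, "month": month, "day": day}


# Placeholder prefix and file name for illustration only.
PREFIX = "AWSLogs/123456789012/elasticloadbalancing/us-east-2/"
values = partition_values(PREFIX + "2020/12/29/example.log.gz", PREFIX)
print(values)
```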

Analyze the data using Athena

Next, we analyze our data by querying the access logs. We compare the query speed between the following tables:

Non-partitioned table – All data is treated as a single table
Partitioned table – Data is partitioned by year, month, and day

Query the non-partitioned table

With the non-partitioned table, if we want to query access logs on a specific date, we have to write the WHERE clause using the LIKE operator because the time column was interpreted as a string. See the following code:

SELECT COUNT(1) FROM "elb-access-log-db"."alb_logs" WHERE type='h2' AND time LIKE '2020-12-29%';

The query takes 5.25 seconds to complete, with 3.15 MB data scanned.

Query the partitioned table

With the year, month, and day columns as partitions, we can use the following statement to query access logs on the same day:

SELECT COUNT(1) FROM "elb-access-log-db"."alb_logs" WHERE type='h2' AND year=2020 AND month=12 AND day=29;

This time the query takes only 1.89 seconds to complete, with 25.72 KB data scanned.

This query is faster and costs less (because less data is scanned) due to partition pruning.
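To put the two measurements side by side, the short calculation below derives the ratios from the figures reported above. It assumes 1 MB = 1024 KB; the numbers come from a single run of each query, so treat the result as indicative rather than a benchmark.

```python
# Figures reported above for the two queries.
full_scan_kb = 3.15 * 1024   # 3.15 MB, assuming 1 MB = 1024 KB
pruned_scan_kb = 25.72
full_seconds, pruned_seconds = 5.25, 1.89

scan_reduction = full_scan_kb / pruned_scan_kb
speedup = full_seconds / pruned_seconds
print(f"data scanned: {scan_reduction:.0f}x less, runtime: {speedup:.1f}x faster")
# prints "data scanned: 125x less, runtime: 2.8x faster"
```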

Clean up

To avoid incurring future charges, delete the resources created in the Data Catalog, and delete the AWS Glue crawler.

Summary

In this post, we illustrated how to create an AWS Glue crawler that populates ALB log metadata in the AWS Glue Data Catalog automatically with partitions by year, month, and day. With partition pruning, we can improve query performance and reduce the associated costs in Athena.

If you have questions or suggestions, please leave a comment.

About the Authors

Ray Wang is a Solutions Architect at AWS. With 8 years of experience in the IT industry, Ray is dedicated to building modern solutions on the cloud, especially in big data and machine learning. As a hungry go-getter, he passed all 12 AWS certificates to make his technical field not only deep but broad. He loves to read and watch sci-fi movies in his spare time.

Corvus Lee is a Data Lab Solutions Architect at AWS. He enjoys all kinds of data-related discussions with customers, from high-level topics like whiteboarding a data lake architecture to the details of data modeling, writing Python/Spark code for data processing, and more.
