Monday, October 3, 2022

Automate notifications on Slack for Amazon Redshift query monitoring rule violations


In this post, we walk you through how to set up automated notifications of query monitoring rule (QMR) violations in Amazon Redshift to a Slack channel, so that Amazon Redshift users can take timely action.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze your data to derive holistic insights about your business and your customers. One of the challenges is to protect the data warehouse workload from poorly written queries that can consume significant resources. Amazon Redshift query monitoring rules are a feature of workload management (WLM) that enable automated handling of poorly written queries. Rules that are applied to a WLM queue allow queries to be logged, canceled, hopped (only available with manual WLM), or to change priority (only available with automatic WLM). The reason to use QMRs is to protect against wasteful use of the cluster. You can also use these rules to log resource-intensive queries, which provides the opportunity to establish governance for ad hoc workloads.

The Amazon Redshift cluster automatically collects query monitoring rules metrics. This convenient mechanism lets you view attributes like the following:

  • Query runtime, in seconds
  • Query return row count
  • The CPU time for a SQL statement

It also makes Amazon Redshift Spectrum metrics available, such as the number of Redshift Spectrum rows and MBs scanned by a query.

When a query violates a QMR, Amazon Redshift logs the violation in the STL_WLM_RULE_ACTION system view. If the action is abort for the queries that violate a QMR, end-users see an error that indicates query failure due to violation of QMRs. We recommend that administrative team members periodically examine violations listed in the STL_WLM_RULE_ACTION table and coach the involved end-users on how to avoid future rule violations.

Alternatively, a centralized team, using a Slack channel for collaboration and monitoring, can configure Amazon Redshift events and alarms to be sent to their channel, so that they can take timely action. In the following sections, we walk you through how to set up automated notifications of QMR violations to a Slack channel through the use of Slack events and alarms. This allows Amazon Redshift users to be notified and take timely action without the need to query the system view.

Solution overview

To demonstrate how you can receive automated notification to a Slack channel for QMR violations, we have designed the following architecture. As shown in the following diagram, we have mixed workloads of extract, transform, and load (ETL), business intelligence (BI) dashboards, and analytics applications that are powered by an Amazon Redshift cluster. The solution relies on AWS Lambda and Amazon Simple Notification Service (Amazon SNS) to send notifications of Amazon Redshift QMR violations to Slack.

To implement this solution, you create an Amazon Redshift cluster and attach a custom defined parameter group.

Amazon Redshift provides one default parameter group for each parameter group family. The default parameter group has preset values for each of its parameters, and it can't be modified. If you want to use different parameter values than the default parameter group, you must create a custom parameter group and then associate your cluster with it.

In the parameter group, you can use automatic WLM and define a few workload queues, such as a queue for processing ETL workloads and a reporting queue for user queries. You can name the default queue adhoc. With automatic WLM, Amazon Redshift determines the optimal concurrency and memory allocation for each query that's running in each queue.

For each workload queue, you can define multiple QMRs. For example, you can create a rule to abort a user query if it runs for more than 300 seconds or returns more than 1 billion rows. Similarly, you can create a rule to log a Redshift Spectrum query that scans more than 100 MB.

The Amazon Redshift WLM evaluates metrics every 10 seconds. It records details about actions that result from QMR violations associated with user-defined queues in the STL_WLM_RULE_ACTION system table. In this solution, a Lambda function is scheduled to monitor the STL_WLM_RULE_ACTION system table every few minutes. When the function is invoked, if it finds a new entry, it publishes a detailed message to an SNS topic. A second Lambda function, created as the target subscriber to the SNS topic, is invoked whenever any message is published to the SNS topic. This second function invokes a pre-created Slack webhook, which sends the message that was received through the SNS topic to the Slack channel of your choice. (For more information on publishing messages by using Slack webhooks, see Sending messages using incoming webhooks.)
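The polling window can be sketched in a few lines. This is a minimal illustration (lookback_cutoff is a hypothetical helper, not part of the solution's Lambda code) of why the function's SQL filters on records newer than 5 minutes, matching the EventBridge schedule so consecutive runs don't re-report the same violations:

```python
from datetime import datetime, timedelta, timezone

def lookback_cutoff(now_utc, schedule_minutes=5):
    """Return the UTC timestamp before which STL_WLM_RULE_ACTION entries
    are considered already reported by a previous scheduled run."""
    return now_utc - timedelta(minutes=schedule_minutes)

# Example: a run at 12:05 UTC only reports violations recorded since 12:00 UTC.
now = datetime(2022, 10, 3, 12, 5, 0, tzinfo=timezone.utc)
print(lookback_cutoff(now).isoformat())  # 2022-10-03T12:00:00+00:00
```

If the schedule and the look-back interval drift apart, violations can be reported twice or missed, so the two values should be changed together.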

To summarize, the solution involves the following steps:

  1. Create an Amazon Redshift custom parameter group and add workload queues.
  2. Configure query monitoring rules.
  3. Attach the custom parameter group to the cluster.
  4. Create an SNS topic.
  5. Create a Lambda function and schedule it to run every 5 minutes by using an Amazon EventBridge rule.
  6. Create the Slack resources.
  7. Add an incoming webhook and authorize the Slack app to post messages to a Slack channel.
  8. Create the second Lambda function and subscribe it to the SNS topic.
  9. Test the solution.

Create an Amazon Redshift custom parameter group and add workload queues

In this step, you create an Amazon Redshift custom parameter group with automatic WLM enabled. You also create the following queues to separate the workloads in the parameter group:

  • reporting – The reporting queue runs BI reporting queries that are performed by any user who belongs to the Amazon Redshift database group named reporting_group
  • adhoc – The default queue, renamed adhoc, performs any query that's not sent to any other queue

Complete the following steps to create your parameter group and add workload queues:

  1. Create a parameter group, named csblog, with automatic WLM enabled.
  2. On the Amazon Redshift console, select the custom parameter group you created.
  3. Choose Edit workload queues.
  4. On the Modify workload queues page, choose Add queue.
  5. Fill in the Concurrency scaling mode and Query priority fields as needed to create the reporting queue.
  6. Repeat these steps to add the adhoc queue.

For more information about WLM queues, refer to Configuring workload management.

Configure query monitoring rules

In this step, you add QMRs to each workload queue. For instructions, refer to Creating or modifying a query monitoring rule using the console.

For the reporting queue, add the following QMRs:

  • nested_loop – Logs any query involved in a nested loop join that results in a row count greater than 10,000,000 rows.
  • long_running – Stops queries that run for more than 300 seconds (5 minutes).

For the adhoc queue, add the following QMRs:

  • returned_rows – Stops any query that returns more than 1,000,000 rows back to the calling client application (this isn't practical and can degrade the end-to-end performance of the application).
  • spectrum_scan – Stops any query that scans more than 1000 MB of data from an Amazon Simple Storage Service (Amazon S3) data lake by using Redshift Spectrum.
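Expressed as data, the four rules above might look like the following sketch. This is illustrative only: the key names loosely follow the wlm_json_configuration format and the QMR metric names from the Amazon Redshift documentation, but you should verify the exact schema against your console's JSON view before relying on it.

```python
import json

# Illustrative sketch of the QMR portion of a wlm_json_configuration.
# Thresholds and rule names match the rules described in this post;
# the surrounding structure is an assumption, not console output.
wlm_config = [
    {
        "name": "reporting",
        "rules": [
            {"rule_name": "nested_loop", "action": "log",
             "predicate": [{"metric_name": "nested_loop_join_row_count",
                            "operator": ">", "value": 10000000}]},
            {"rule_name": "long_running", "action": "abort",
             "predicate": [{"metric_name": "query_execution_time",
                            "operator": ">", "value": 300}]},
        ],
    },
    {
        "name": "adhoc",
        "rules": [
            {"rule_name": "returned_rows", "action": "abort",
             "predicate": [{"metric_name": "return_row_count",
                            "operator": ">", "value": 1000000}]},
            {"rule_name": "spectrum_scan", "action": "abort",
             "predicate": [{"metric_name": "spectrum_scan_size_mb",
                            "operator": ">", "value": 1000}]},
        ],
    },
]
print(json.dumps(wlm_config, indent=2))
```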

Attach the custom parameter group to the cluster

To attach the custom parameter group to your provisioned Redshift cluster, follow the instructions in Associating a parameter group with a cluster. If you don't already have a provisioned Redshift cluster, refer to Create a cluster.

For this post, we attached our custom parameter group csblog to an already created provisioned Amazon Redshift cluster.

Create an SNS topic

In this step, you create an SNS topic that receives a detailed message of QMR violations from the Lambda function that checks the Amazon Redshift system table for QMR violation entries. For instructions, refer to Creating an Amazon SNS topic.

For this post, we created an SNS topic named redshiftqmrrulenotification.

Create a Lambda function to monitor the system table

In this step, you create a Lambda function that monitors the STL_WLM_RULE_ACTION system table. Whenever any record has been added to the table since the last time the function ran, the function publishes a detailed message to the SNS topic that you created earlier. You also create an EventBridge rule to invoke the function every 5 minutes.

For this post, we create a Lambda function named redshiftqmrrule that is scheduled to run every 5 minutes through an EventBridge rule named Redshift-qmr-rule-Lambda-schedule. For instructions, refer to Building Lambda functions with Python.

The following screenshot shows the function that checks the pg_catalog.stl_wlm_rule_action table.

To create an EventBridge rule and associate it with the Lambda function, refer to Create a Rule.

The following screenshot shows the EventBridge rule Redshift-qmr-rule-Lambda-schedule, which calls the function every 5 minutes.

We use the following Python 3.9 code for this Lambda function. The function uses an Amazon Redshift Data API call that uses GetClusterCredentials for temporary credentials.

import time
import sys
import os
import boto3
from datetime import datetime

# initiate redshift-data client in boto3
client = boto3.client("redshift-data")

query = "select userid,query,service_class,trim(rule) as rule,trim(action) as action,recordtime from stl_wlm_rule_action WHERE userid > 1 AND recordtime >= current_timestamp AT TIME ZONE 'UTC' - INTERVAL '5 minute' order by recordtime desc;"
sns = boto3.resource('sns')
sns_arn = os.environ['sns_arn']
platform_endpoint = sns.PlatformEndpoint('{sns_arn}'.format(sns_arn = sns_arn))

def status_check(client, query_id):
    desc = client.describe_statement(Id=query_id)
    status = desc["Status"]
    if status == "FAILED":
        raise Exception('SQL query failed:' + query_id + ": " + desc["Error"])
    return status.strip('"')

def execute_sql(sql_text, redshift_database, redshift_user, redshift_cluster_id):
    print("Executing: {}".format(sql_text))
    res = client.execute_statement(Database=redshift_database, DbUser=redshift_user, Sql=sql_text,
                                   ClusterIdentifier=redshift_cluster_id)

    query_id = res["Id"]
    print("query id")
    print(query_id)
    # poll until the Data API statement reaches a terminal state
    done = False
    while not done:
        time.sleep(1)
        status = status_check(client, query_id)
        if status in ("FAILED", "FINISHED"):
            print("status is: {}".format(status))
            break
    return query_id

def publish_to_sns(message):
    try:
        # Publish a message.
        response = platform_endpoint.publish(
                  Subject="Redshift Query Monitoring Rule Notifications",
                  Message=message,
                  MessageStructure="string"
            )
        return response

    except:
        print('Failed to publish messages to SNS topic: exception %s' % sys.exc_info()[1])
        return 'Failed'

def lambda_handler(event, context):

    rsdb = os.environ['rsdb']
    rsuser = os.environ['rsuser']
    rscluster = os.environ['rscluster']
    res = execute_sql(query, rsdb, rsuser, rscluster)
    print("res")
    print(res)
    response = client.get_statement_result(
        Id = res
    )
    # datetime object containing current date and time
    now = datetime.now()
    dt_string = now.strftime("%d-%b-%Y %H:%M:%S")
    print(response)
    if response['TotalNumRows'] > 0:
        messageText = "################## Reporting Start" + ' [' + str(dt_string) + ' UTC] ##################\n\n'
        messageText = messageText + 'Total number of queries affected by QMR rule violation for Redshift cluster "' + rscluster + '" is ' + str(len(response['Records'])) + '.' + '\n' + '\n'
        for i in range(len(response['Records'])):
            messageText = messageText + 'It was reported at ' + str(response['Records'][i][5]['stringValue'])[11:19] + ' UTC on ' + str(response['Records'][i][5]['stringValue'])[0:10] + ' that a query with Query ID - ' + str(response['Records'][i][1]['longValue']) + ' had to ' + str(response['Records'][i][4]['stringValue']) + ' due to violation of QMR rule "' + str(response['Records'][i][3]['stringValue']) + '".\n'
        messageText = messageText + '\n########################### Reporting End ############################\n\n'
        query_result_json = messageText
        response = publish_to_sns(query_result_json)
    else:
        print('No rows to publish to SNS')

We use four environment variables for this Lambda function:

  • rscluster – The Amazon Redshift provisioned cluster identifier
  • rsdb – The Amazon Redshift database where you're running these tests
  • rsuser – The Amazon Redshift user who has the privilege to run queries on pg_catalog.stl_wlm_rule_action
  • sns_arn – The Amazon Resource Name (ARN) of the SNS topic that we created earlier
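To see how the function turns a result row into notification text, here is a small, locally runnable sketch. The sample record is fabricated for illustration; its column order mirrors the SELECT in the function's query (userid, query, service_class, rule, action, recordtime), and format_violation is a hypothetical helper extracted from the message-building loop above:

```python
# Fabricated get_statement_result-style record for illustration.
sample_record = [
    {"longValue": 100},                      # userid
    {"longValue": 12345},                    # query
    {"longValue": 101},                      # service_class
    {"stringValue": "long_running"},         # rule
    {"stringValue": "abort"},                # action
    {"stringValue": "2022-10-03 12:04:55"},  # recordtime
]

def format_violation(rec):
    # Same string slicing as the Lambda loop: [11:19] is the time portion
    # of the recordtime string, [0:10] is the date portion.
    return ('It was reported at ' + rec[5]['stringValue'][11:19] + ' UTC on '
            + rec[5]['stringValue'][0:10] + ' that a query with Query ID - '
            + str(rec[1]['longValue']) + ' had to ' + rec[4]['stringValue']
            + ' due to violation of QMR rule "' + rec[3]['stringValue'] + '".')

print(format_violation(sample_record))
```

Running this against the fabricated record produces one line of the report body that the function publishes to SNS.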

Create Slack resources

In this step, you create a new Slack workspace (if you don't have one already), a new private Slack channel (only if you don't have one or don't want to use an existing one), and a new Slack app in the Slack workspace. For instructions, refer to Create a Slack workspace, Create a channel, and Creating an app.

For this post, we created the following resources in the Slack website and Slack desktop app:

  • A Slack workspace named RedshiftQMR*****
  • A private channel, named redshift-qmr-notification-*****-*******, in the newly created Slack workspace
  • A new Slack app in the Slack workspace, named RedshiftQMRRuleNotification (using the From Scratch option)

Add an incoming webhook and authorize the Slack app

In this step, you enable and add an incoming webhook to the Slack workspace that you created. For full instructions, refer to Enable Incoming Webhooks and Create an Incoming Webhook. You also authorize your Slack app so that it can post messages to the private Slack channel.

  1. In the Slack app, under Settings in the navigation pane, choose Basic Information.
  2. Choose Incoming Webhooks.
  3. Turn on Activate Incoming Webhooks.
  4. Choose Add New Webhook to Workspace.
  5. Authorize the Slack app RedshiftQMRRuleNotification so that it can post messages to the private Slack channel redshift-qmr-notification-*****-*******.

The following screenshot shows the details of the newly added incoming webhook.

Create a second Lambda function and subscribe to the SNS topic

In this step, you create a second Lambda function that is subscribed to the SNS topic that you created earlier. For full instructions, refer to Building Lambda functions with Python and Subscribing a function to a topic.

For this post, we create a second function named redshiftqmrrulesnsinvoke, which is subscribed to the SNS topic redshiftqmrrulenotification. The second function sends a detailed QMR violation message (received from the SNS topic) to the designated Slack channel named redshift-qmr-notification-*. This function uses the incoming Slack webhook that we created earlier.

We also create an SNS subscription of the second Lambda function to the SNS topic that we created previously.

The following is the Python 3.9 code used for the second Lambda function:

import urllib3
import json
import os

http = urllib3.PoolManager()
def lambda_handler(event, context):

    url = os.environ['webhook']
    channel = os.environ['channel']
    msg = {
        "channel": channel,
        "username": "WEBHOOK_USERNAME",
        "text": event['Records'][0]['Sns']['Message'],
        "icon_emoji": ""
    }

    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST', url, body=encoded_msg)
    print({
        "message": event['Records'][0]['Sns']['Message'],
        "status_code": resp.status,
        "response": resp.data
    })

We use two environment variables for the second Lambda function:

  • channel – The Slack channel that we created
  • webhook – The Slack webhook that we created
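The webhook payload can be built and inspected independently of Lambda. In this sketch, build_slack_payload is a hypothetical helper that mirrors the msg dict in the function above; WEBHOOK_USERNAME and the channel name are placeholders:

```python
import json

def build_slack_payload(channel, message):
    # Mirrors the msg dict from the second Lambda function; Slack's incoming
    # webhook accepts this JSON object as the body of an HTTP POST.
    return {
        "channel": channel,
        "username": "WEBHOOK_USERNAME",
        "text": message,
        "icon_emoji": "",
    }

payload = build_slack_payload("#redshift-qmr-notification", "QMR violation reported")
encoded = json.dumps(payload).encode("utf-8")  # body for http.request('POST', url, body=encoded)
print(encoded)
```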

Test the solution

To show the effect of the QMRs, we ran queries that violate the QMRs we set up.

Test 1: Returned rows

Test 1 looks for violations of the returned_rows QMR, in which the return row count is over 1,000,000 for a query that ran in the adhoc queue.

We created and loaded a table named lineitem in a schema named aquademo, which has more than 18 billion records. You can refer to the GitHub repo to create and load the table.

We ran the following query, which violated the returned_rows QMR, and the query was stopped as specified in the action set in the QMR.

select * from aquademo.lineitem limit 1000001;

The following screenshot shows the view from the Amazon Redshift client after running the query.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Test 2: Long-running queries

Test 2 looks for violations of the long_running QMR, in which query runtime is over 300 seconds for a user who belongs to reporting_group.

In the following code, we created a new Amazon Redshift group named reporting_group and added a new user, named reporting_user, to the group. reporting_group is assigned USAGE and SELECT privileges on all tables in the retail and aquademo schemas.

create group reporting_group;
create user reporting_user in group reporting_group password 'Test12345';
grant usage on schema retail,aquademo to group reporting_group;
grant select on all tables in schema retail,aquademo to group reporting_group;

We set the session authorization to reporting_user so the query runs in the reporting queue. We ran the following query, which violated the long_running QMR, and the query was stopped as specified in the action set in the QMR:

set session authorization reporting_user;
set enable_result_cache_for_session to off;
select * from aquademo.lineitem;

The following screenshot shows the view from the Amazon Redshift client.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Test 3: Nested loops

Test 3 looks for violations of the nested_loop QMR, in which the nested loop join row count is over 10,000,000 for a user who belongs to reporting_group.

We set the session authorization to reporting_user so the query runs in the reporting queue. We ran the following query, which violated the nested_loop QMR, and the query logged the violation as specified in the action set in the QMR:

set session authorization reporting_user;
set enable_result_cache_for_session to off;
select ss.*,cd.*
from retail.store_sales ss
, retail.customer_demographics cd;

Before we ran the original query, we also checked the explain plan and noted that this nested loop would return more than 10,000,000 rows. The following screenshot shows the query explain plan.

The following screenshot shows the notification we received in our Slack channel.

Test 4: Redshift Spectrum scans

Test 4 looks for violations of the spectrum_scan QMR, in which Redshift Spectrum scans exceed 1000 MB for a query that ran in the adhoc queue.

For this example, we used store_sales data (unloaded from an Amazon Redshift table that was created by using the TPC-DS benchmark data) loaded in an Amazon S3 location. Data in Amazon S3 is non-partitioned under one prefix and has a volume of around 3.9 GB. We created an external schema (qmr_spectrum_rule_test) and external table (qmr_rule_store_sales) in Redshift Spectrum.
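A quick back-of-the-envelope check shows why this setup trips the rule: with the data unpartitioned under a single prefix, a filter on ss_sold_date_sk can't prune anything, so the scan covers roughly the full dataset (sizes here are approximate):

```python
# Approximate size of the unloaded store_sales data under one S3 prefix.
data_size_gb = 3.9
scan_size_mb = data_size_gb * 1024
qmr_threshold_mb = 1000  # spectrum_scan rule threshold from the adhoc queue

# Without partition pruning, Redshift Spectrum scans the whole prefix,
# which is well over the 1000 MB limit, so the query is stopped.
print(f"{scan_size_mb:.1f} MB scanned; exceeds threshold: {scan_size_mb > qmr_threshold_mb}")
```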

We used the following steps to run this test with the sample data:

  1. Run an unload SQL command:
    unload ('select * from store_sales')
    to 's3://<<Your Amazon S3 Location>>/store_sales/'
    iam_role default;

  2. Create an external schema from Redshift Spectrum:
    CREATE EXTERNAL SCHEMA if not exists qmr_spectrum_rule_test
    FROM DATA CATALOG DATABASE 'qmr_spectrum_rule_test' region 'us-east-1'
    IAM_ROLE default
    CREATE EXTERNAL DATABASE IF NOT exists;

  3. Create an external table in Redshift Spectrum:
    create external table qmr_spectrum_rule_test.qmr_rule_store_sales
    (
    ss_sold_date_sk int4 ,            
      ss_sold_time_sk int4 ,     
      ss_item_sk int4  ,      
      ss_customer_sk int4 ,           
      ss_cdemo_sk int4 ,              
      ss_hdemo_sk int4 ,         
      ss_addr_sk int4 ,               
      ss_store_sk int4 ,           
      ss_promo_sk int4 ,           
      ss_ticket_number int8 ,        
      ss_quantity int4 ,           
      ss_wholesale_cost numeric(7,2) ,          
      ss_list_price numeric(7,2) ,              
      ss_sales_price numeric(7,2) ,
      ss_ext_discount_amt numeric(7,2) ,             
      ss_ext_sales_price numeric(7,2) ,              
      ss_ext_wholesale_cost numeric(7,2) ,           
      ss_ext_list_price numeric(7,2) ,               
      ss_ext_tax numeric(7,2) ,                 
      ss_coupon_amt numeric(7,2) , 
      ss_net_paid numeric(7,2) ,   
      ss_net_paid_inc_tax numeric(7,2) ,             
      ss_net_profit numeric(7,2)                     
    ) ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '|' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://<<Your Amazon S3 Location>>/store_sales/'
    TABLE PROPERTIES (
      'averageRecordSize'='130', 
      'classification'='csv', 
      'columnsOrdered'='true', 
      'compressionType'='none', 
      'delimiter'='|', 
      'recordCount'='11083990573', 
      'sizeKey'='1650877678933', 
      'typeOfData'='file');

  4. Run the following query:
    select *
    FROM qmr_spectrum_rule_test.qmr_rule_store_sales
    where ss_sold_date_sk = 2451074;

The query violated the spectrum_scan QMR, and the query was stopped as specified in the action set in the QMR.

The following screenshot shows the view from the Amazon Redshift client.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Clean up

When you're finished with this solution, we recommend deleting the resources you created to avoid incurring any further charges.

Conclusion

Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. In this post, we discussed how you can automate notification of misbehaving queries on Slack by using query monitoring rules. QMRs can help you maximize cluster performance and throughput when supporting mixed workloads. Use these instructions to set up your Slack channel to receive automated notifications from your Amazon Redshift cluster for any violation of QMRs.


About the Authors

Dipankar Kushari is a Senior Specialist Solutions Architect in the Analytics team at AWS.

Harshida Patel is a Specialist Senior Solutions Architect in the Analytics team at AWS.
