
Add your own libraries and application dependencies to Spark and Hive on Amazon EMR Serverless with custom images


Amazon EMR Serverless allows you to run open-source big data frameworks such as Apache Spark and Apache Hive without managing clusters and servers. Many customers who run Spark and Hive applications want to add their own libraries and dependencies to the application runtime. For example, you may want to add popular open-source extensions to Spark, or add a customized encryption-decryption module that is used by your application.

We’re excited to announce a new capability that allows you to customize the runtime image used in EMR Serverless by adding custom libraries that your applications need to use. This feature enables you to do the following:

  • Maintain a set of version-controlled libraries that are reused and available for use in all your EMR Serverless jobs as part of the EMR Serverless runtime
  • Add popular extensions to open-source Spark and Hive frameworks such as pandas, NumPy, matplotlib, and more that you want your EMR Serverless application to use
  • Use established CI/CD processes to build, test, and deploy your customized extension libraries to the EMR Serverless runtime
  • Apply established security processes, such as image scanning, to meet the compliance and governance requirements within your organization
  • Use a different version of a runtime component (for example, the JDK runtime or the Python SDK runtime) than the version that is available by default with EMR Serverless

In this post, we demonstrate how to use this new feature.

Solution overview

To use this capability, customize the EMR Serverless base image using Amazon Elastic Container Registry (Amazon ECR), which is a fully managed container registry that makes it easy for your developers to share and deploy container images. Amazon ECR eliminates the need to operate your own container repositories or worry about scaling the underlying infrastructure. After the custom image is pushed to the container registry, specify the custom image while creating your EMR Serverless applications.

The following diagram illustrates the steps involved in using custom images for your EMR Serverless applications.

In the following sections, we demonstrate using custom images with Amazon EMR Serverless to address three common use cases:

  • Add popular open-source Python libraries into the EMR Serverless runtime image
  • Use a different or newer version of the Java runtime for the EMR Serverless application
  • Install a Prometheus agent and customize the Spark runtime to push Spark JMX metrics to Amazon Managed Service for Prometheus, and visualize the metrics in a Grafana dashboard

General prerequisites

The following are the prerequisites for using custom images with EMR Serverless. Complete these steps before proceeding with the use cases:

  1. Create an AWS Identity and Access Management (IAM) role with IAM permissions for Amazon EMR Serverless applications, Amazon ECR permissions, and Amazon Simple Storage Service (Amazon S3) permissions for the S3 bucket aws-bigdata-blog and any S3 bucket in your account where you will store the application artifacts.
  2. Install or upgrade to the latest AWS Command Line Interface (AWS CLI) version and install the Docker service on an Amazon Linux 2 based Amazon Elastic Compute Cloud (Amazon EC2) instance. Attach the IAM role from the previous step to this EC2 instance.
  3. Select a base EMR Serverless image from the public Amazon ECR repository. Run the following commands on the EC2 instance with Docker installed to verify that you are able to pull the base image from the public repository:
    # If Docker is not started already, start the process
    $ sudo service docker start 
    
    # Check that you are able to pull the latest EMR 6.9.0 runtime base image 
    $ sudo docker pull public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

  4. Log in to Amazon ECR with the following commands and create a repository called emr-serverless-ci-examples, providing your AWS account ID and Region:
    $ sudo aws ecr get-login-password --region <region> | sudo docker login --username AWS --password-stdin <your AWS account ID>.dkr.ecr.<region>.amazonaws.com
    
    $ aws ecr create-repository --repository-name emr-serverless-ci-examples --region <region>

  5. Provide IAM permissions to the EMR Serverless service principal for the Amazon ECR repository:
    1. On the Amazon ECR console, choose Permissions under Repositories in the navigation pane.
    2. Choose Edit policy JSON.
    3. Enter the following JSON and save:
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "Emr Serverless Custom Image Support",
            "Effect": "Allow",
            "Principal": {
              "Service": "emr-serverless.amazonaws.com"
            },
            "Action": [
              "ecr:BatchGetImage",
              "ecr:DescribeImages",
              "ecr:GetDownloadUrlForLayer"
            ]
          }
        ]
      }

Make sure that the policy is updated on the Amazon ECR console.

For production workloads, we recommend adding a condition in the Amazon ECR policy to ensure only allowed EMR Serverless applications can get, describe, and download images from this repository. For more information, refer to Allow EMR Serverless to access the custom image repository.
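
For example, one way to scope the policy is a condition on the calling application's ARN. The following is an illustrative sketch only (the condition key and ARN to use should be verified against the linked documentation):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Emr Serverless Custom Image Support",
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:DescribeImages",
        "ecr:GetDownloadUrlForLayer"
      ],
      "Condition": {
        "ArnEquals": {
          "aws:SourceArn": "<your EMR Serverless application ARN>"
        }
      }
    }
  ]
}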

In the next steps, we create and use custom images in our EMR Serverless applications for the three different use cases.

Use case 1: Run data science applications

One of the common applications of Spark on Amazon EMR is the ability to run data science and machine learning (ML) applications at scale. For large datasets, Spark includes SparkML, which offers common ML algorithms that can be used to train models in a distributed fashion. However, you often need to run many iterations of simple classifiers to fit for hyperparameter tuning, ensembles, and multi-class solutions over small-to-medium-sized data (100,000 to 1 million records). Spark is a great engine to run multiple iterations of such classifiers in parallel. In this example, we demonstrate this use case, where we use Spark to run multiple iterations of an XGBoost model to select the best parameters. The ability to include Python dependencies in the EMR Serverless image makes it easy to make the various dependencies (xgboost, sk-dist, pandas, numpy, and so on) available for the application.
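
To illustrate the pattern, the following is a minimal PySpark sketch (it is not the emrserverless-xgboost-spark-example.py script used later in this post; the dataset and parameter grid are placeholders) that uses sk-dist's DistGridSearchCV to fan the cross-validated XGBoost fits out across the Spark executors:

from pyspark.sql import SparkSession
from sklearn.datasets import load_breast_cancer
from skdist.distribute.search import DistGridSearchCV
from xgboost import XGBClassifier

spark = SparkSession.builder.appName("xgboost-param-search").getOrCreate()
sc = spark.sparkContext

# Small-to-medium-sized data fits on the driver; only the model fits are distributed
X, y = load_breast_cancer(return_X_y=True)

# Placeholder grid -- each parameter combination is fit as a separate Spark task
param_grid = {
    "max_depth": [3, 5, 7],
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.3],
}

# Pass the SparkContext so the cross-validated fits run on the executors
search = DistGridSearchCV(
    XGBClassifier(objective="binary:logistic"),
    param_grid,
    sc,
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)

# The best estimates appear in the driver's stdout logs
print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)

spark.stop()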

Prerequisites

The EMR Serverless job runtime IAM role should be given permissions to your S3 bucket where you will be storing your PySpark file and application logs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessToS3Buckets",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR-BUCKET>",
                "arn:aws:s3:::<YOUR-BUCKET>/*"
            ]
        }
    ]
}

Create an image to install ML dependencies

We create a custom image from the base EMR Serverless image to install the dependencies required by the SparkML application. Create the following Dockerfile on your EC2 instance that runs the Docker process, inside a new directory named datascience:

FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# python packages
RUN pip3 install boto3 pandas numpy
RUN pip3 install -U scikit-learn==0.23.2 scipy 
RUN pip3 install sk-dist
RUN pip3 install xgboost
RUN sed -i 's|import Parallel, delayed|import Parallel, delayed, logger|g' /usr/local/lib/python3.7/site-packages/skdist/distribute/search.py

# EMRS will run the image as hadoop
USER hadoop:hadoop

Build and push the image to the Amazon ECR repository emr-serverless-ci-examples, providing your AWS account ID and Region:

# Build the image locally. This command will take a minute or so to complete
sudo docker build -t local/emr-serverless-ci-ml /home/ec2-user/datascience/ --no-cache --pull
# Create a tag for the local image
sudo docker tag local/emr-serverless-ci-ml:latest <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-ml
# Push the image to Amazon ECR. This command will take a few seconds to complete
sudo docker push <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-ml

Submit your Spark application

Create an EMR Serverless application with the custom image created in the previous step:

aws --region <region> emr-serverless create-application \
    --release-label emr-6.9.0 \
    --type "SPARK" \
    --name data-science-with-ci \
    --image-configuration '{ "imageUri": "<your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-ml" }'

Make a note of the value of applicationId returned by the command.
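
Optionally, you can confirm that the application was created with the custom image by describing it:

aws emr-serverless get-application \
    --region <region> \
    --application-id <applicationId>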

After the application is created, we're ready to submit our job. Copy the application file to your S3 bucket:

aws s3 cp s3://aws-bigdata-blog/artifacts/BDB-2771/code/emrserverless-xgboost-spark-example.py s3://<YOUR BUCKET>/<PREFIX>/emrserverless-xgboost-spark-example.py

Submit the Spark data science job. In the following command, provide the name of the S3 bucket and prefix where you stored your application file. Additionally, provide the applicationId value obtained from the create-application command and your EMR Serverless job runtime IAM role ARN.

aws emr-serverless start-job-run \
        --region <region> \
        --application-id <applicationId> \
        --execution-role-arn <jobRuntimeRole> \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://<YOUR BUCKET>/<PREFIX>/emrserverless-xgboost-spark-example.py"
            }
        }' \
        --configuration-overrides '{
              "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                  "logUri": "s3://<YOUR BUCKET>/emrserverless/logs"
                }
              }
            }'
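
The command returns a jobRunId. Optionally, you can poll the state of the job run until it reaches SUCCESS:

aws emr-serverless get-job-run \
    --region <region> \
    --application-id <applicationId> \
    --job-run-id <jobRunId>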

After the Spark job succeeds, you can view the best model estimates from our application by viewing the Spark driver's stdout logs. Navigate to Spark History Server, Executors, Driver, Logs, stdout.
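
Because the job writes logs to the logUri configured above, you can also pull the driver stdout directly from S3. The path below follows the standard EMR Serverless S3 log layout; adjust it if your logs land under a different prefix:

aws s3 cp s3://<YOUR BUCKET>/emrserverless/logs/applications/<applicationId>/jobs/<jobRunId>/SPARK_DRIVER/stdout.gz - | gunzip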

Use case 2: Use a custom Java runtime environment

Another use case for custom images is the ability to use a custom Java version for your EMR Serverless applications. For example, if you're using Java 11 to compile and package your Java or Scala applications, and you try to run them directly on EMR Serverless, it may lead to runtime errors because EMR Serverless uses the Java 8 JRE by default. To make the runtime environments of your EMR Serverless applications compatible with your compile environment, you can use the custom images feature to install the Java version you are using to package your applications.
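
For example, running a JAR compiled for Java 11 on the default Java 8 runtime typically surfaces an error along these lines (class file version 55.0 corresponds to Java 11, and 52.0 to Java 8):

Exception in thread "main" java.lang.UnsupportedClassVersionError: emrserverless/customjre/SyntheticAnalysis
has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version
of the Java Runtime only recognizes class file versions up to 52.0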

Prerequisites

The EMR Serverless job runtime IAM role should be given permissions to your S3 bucket where you will be storing your application JAR and logs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessToS3Buckets",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<YOUR-BUCKET>",
                "arn:aws:s3:::<YOUR-BUCKET>/*"
            ]
        }
    ]
}

Create an image to install a custom Java version

We first create an image that installs a Java 11 runtime environment. Create the following Dockerfile on your EC2 instance inside a new directory named customjre:

FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Install JDK 11
RUN amazon-linux-extras install java-openjdk11

# EMRS will run the image as hadoop
USER hadoop:hadoop

Build and push the image to the Amazon ECR repository emr-serverless-ci-examples, providing your AWS account ID and Region:

sudo docker build -t local/emr-serverless-ci-java11 /home/ec2-user/customjre/ --no-cache --pull
sudo docker tag local/emr-serverless-ci-java11:latest <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-java11
sudo docker push <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-java11

Submit your Spark application

Create an EMR Serverless application with the custom image created in the previous step:

aws --region <region> emr-serverless create-application \
    --release-label emr-6.9.0 \
    --type "SPARK" \
    --name custom-jre-with-ci \
    --image-configuration '{ "imageUri": "<your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-java11" }'

Copy the application JAR to your S3 bucket:

aws s3 cp s3://aws-bigdata-blog/artifacts/BDB-2771/code/emrserverless-custom-images_2.12-1.0.jar s3://<YOUR BUCKET>/<PREFIX>/emrserverless-custom-images_2.12-1.0.jar

Submit a Spark Scala job that was compiled with the Java 11 JRE. This job also uses Java APIs that may produce different results for different versions of Java (for example, java.time.ZoneId). In the following command, provide the name of the S3 bucket and prefix where you stored your application JAR. Additionally, provide the applicationId value obtained from the create-application command and your EMR Serverless job runtime role ARN with the IAM permissions mentioned in the prerequisites. Note that in the sparkSubmitParameters, we pass a custom Java version for our Spark driver and executor environments to instruct our job to use the Java 11 runtime.

aws emr-serverless start-job-run \
        --region <region> \
        --application-id <applicationId> \
        --execution-role-arn <jobRuntimeRole> \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://<YOUR BUCKET>/<PREFIX>/emrserverless-custom-images_2.12-1.0.jar",
                "entryPointArguments": ["40000000"],
                "sparkSubmitParameters": "--conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.amzn2.0.1.x86_64 --conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/java-11-openjdk-11.0.16.0.8-1.amzn2.0.1.x86_64 --class emrserverless.customjre.SyntheticAnalysis"
            }
        }' \
        --configuration-overrides '{
              "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                  "logUri": "s3://<YOUR BUCKET>/emrserverless/logs"
                }
              }
            }'

You can also extend this use case to install and use a custom Python version for your PySpark applications.
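
As a sketch of that variation (the Python version, download URL, and install prefix below are illustrative; verify them against the Customizing the EMR Serverless image documentation), you can build the interpreter into the image and point PySpark at it with the PYSPARK_PYTHON environment variable:

FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Build and install an alternative Python version (example: 3.9.9)
RUN yum install -y gcc openssl-devel bzip2-devel libffi-devel tar gzip wget make && \
    wget https://www.python.org/ftp/python/3.9.9/Python-3.9.9.tgz && \
    tar xzf Python-3.9.9.tgz && cd Python-3.9.9 && \
    ./configure --enable-optimizations && \
    make altinstall

# Point PySpark at the custom interpreter
ENV PYSPARK_PYTHON=/usr/local/bin/python3.9

# EMRS will run the image as hadoop
USER hadoop:hadoop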

Use case 3: Monitor Spark metrics in a single Grafana dashboard

Spark JMX telemetry provides a lot of fine-grained details about every stage of the Spark application, even at the JVM level. These insights can be used to tune and optimize Spark applications to reduce job runtime and cost. Prometheus is a popular tool used for collecting, querying, and visualizing application and host metrics of several different processes. After the metrics are collected in Prometheus, we can query these metrics or use Grafana to build dashboards and visualize them. In this use case, we use Amazon Managed Service for Prometheus to gather Spark driver and executor metrics from our EMR Serverless Spark application, and we use Grafana to visualize the collected metrics. The following screenshot is an example Grafana dashboard for an EMR Serverless Spark application.

Prerequisites

Complete the following prerequisite steps:

  1. Create a VPC, private subnet, and security group. The private subnet should have a NAT gateway or VPC S3 endpoint attached. The security group should allow outbound access to HTTPS port 443 and should have a self-referencing inbound rule for all traffic.


    Both the private subnet and security group should be associated with the two Amazon Managed Service for Prometheus VPC endpoint interfaces.
  2. On the Amazon Virtual Private Cloud (Amazon VPC) console, create two endpoints for Amazon Managed Service for Prometheus and the Amazon Managed Service for Prometheus workspace. Associate the VPC, private subnet, and security group with both endpoints. Optionally, provide a name tag for your endpoints and leave everything else as default.

  3. Create a new workspace on the Amazon Managed Service for Prometheus console.
  4. Note the ARN and the values for Endpoint – remote write URL and Endpoint – query URL.
  5. Attach the following policy to your Amazon EMR Serverless job runtime IAM role to provide remote write access to your Prometheus workspace. Replace the ARN copied from the previous step in the Resource section of "Sid": "AccessToPrometheus". This role should also have permissions to your S3 bucket where you will be storing your application JAR and logs.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AccessToPrometheus",
                "Effect": "Allow",
                "Action": [
                    "aps:RemoteWrite"
                ],
                "Resource": "arn:aws:aps:<region>:<your AWS account>:workspace/<Workspace_ID>"
            }, {
                "Sid": "AccessToS3Buckets",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject",
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::<YOUR-BUCKET>",
                    "arn:aws:s3:::<YOUR-BUCKET>/*"
                ]
            }
        ]
    }

  6. Create an IAM user or role with permissions to create and query the Amazon Managed Service for Prometheus workspace.

We use the same IAM user or role to authenticate in Grafana or query the Prometheus workspace.
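
If you prefer the AWS CLI to the console for creating the workspace (the alias below is just an example), you can create it and then look up its ARN and Prometheus endpoint; the remote write and query URLs are that endpoint with api/v1/remote_write and api/v1/query appended:

# Create the Amazon Managed Service for Prometheus workspace
aws amp create-workspace --alias emr-serverless-metrics --region <region>

# Retrieve the workspace ARN and Prometheus endpoint
aws amp describe-workspace --workspace-id <Workspace_ID> --region <region>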

Create an image to install the Prometheus agent

We create a custom image from the base EMR Serverless image to do the following:

  • Update the Spark metrics configuration to use PrometheusServlet to publish driver and executor JMX metrics in Prometheus format
  • Download and install the Prometheus agent
  • Add the configuration YAML file to instruct the Prometheus agent to send the metrics to the Amazon Managed Service for Prometheus workspace

Create the Prometheus config YAML file to scrape the driver, executor, and application metrics. You can run the following example commands on the EC2 instance.

  1. Copy the prometheus.yaml file from our S3 path:
    aws s3 cp s3://aws-bigdata-blog/artifacts/BDB-2771/prometheus-config/prometheus.yaml .

  2. Modify prometheus.yaml to replace the Region and the value of the remote_write URL with the remote write URL obtained from the prerequisites:
    ## Replace your AMP workspace remote write URL 
    endpoint_url="https://aps-workspaces.<region>.amazonaws.com/workspaces/<ws-xxxxxxx-xxxx-xxxx-xxxx-xxxxxx>/api/v1/remote_write"
    
    ## Replace the remote write URL and Region. The following is an example for the us-west-2 Region. Modify the commands for your Region. 
    sed -i "s|region:.*|region: us-west-2|g" prometheus.yaml
    sed -i "s|url:.*|url: ${endpoint_url}|g" prometheus.yaml

  3. Upload the file to your own S3 bucket:
    aws s3 cp prometheus.yaml s3://<YOUR BUCKET>/<PREFIX>/

  4. Create the following Dockerfile inside a new directory named prometheus on the same EC2 instance that runs the Docker service. Provide the S3 path where you uploaded the prometheus.yaml file.
    # Pull base image
    FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest
    
    USER root
    
    # Install Prometheus agent
    RUN yum install -y wget && \
        wget https://github.com/prometheus/prometheus/releases/download/v2.26.0/prometheus-2.26.0.linux-amd64.tar.gz && \
        tar -xvf prometheus-2.26.0.linux-amd64.tar.gz && \
        rm -rf prometheus-2.26.0.linux-amd64.tar.gz && \
        cp prometheus-2.26.0.linux-amd64/prometheus /usr/local/bin/
    
    # Change Spark metrics configuration file to use PrometheusServlet
    RUN cp /etc/spark/conf.dist/metrics.properties.template /etc/spark/conf/metrics.properties && \
        echo -e '\
     *.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet\n\
     *.sink.prometheusServlet.path=/metrics/prometheus\n\
     master.sink.prometheusServlet.path=/metrics/master/prometheus\n\
     applications.sink.prometheusServlet.path=/metrics/applications/prometheus\n\
     ' >> /etc/spark/conf/metrics.properties
    
     # Copy the prometheus.yaml file locally. Change the value of bucket and prefix to where you stored your prometheus.yaml file
    RUN aws s3 cp s3://<YOUR BUCKET>/<PREFIX>/prometheus.yaml .
    
     # Create a script to start the Prometheus agent in the background
    RUN echo -e '#!/bin/bash\n\
     nohup /usr/local/bin/prometheus --config.file=/home/hadoop/prometheus.yaml </dev/null >/dev/null 2>&1 &\n\
     echo "Started Prometheus agent"\n\
     ' >> /home/hadoop/start-prometheus-agent.sh && \
        chmod +x /home/hadoop/start-prometheus-agent.sh
    
     # EMRS will run the image as hadoop
    USER hadoop:hadoop

  5. Build the Dockerfile and push the image to Amazon ECR, providing your AWS account ID and Region:
    sudo docker build -t local/emr-serverless-ci-prometheus /home/ec2-user/prometheus/ --no-cache --pull
    sudo docker tag local/emr-serverless-ci-prometheus <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-prometheus
    sudo docker push <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-prometheus
    

Submit the Spark application

After the Docker image has been pushed successfully, you can create the serverless Spark application with the custom image you created. We use the AWS CLI to submit Spark jobs with the custom image on EMR Serverless. Your AWS CLI should be upgraded to the latest version to run the following commands.

  1. In the following AWS CLI command, provide your AWS account ID and Region. Additionally, provide the subnet and security group from the prerequisites in the network configuration. In order to successfully push metrics from EMR Serverless to Amazon Managed Service for Prometheus, make sure that you are using the same VPC, subnet, and security group you created based on the prerequisites.
    aws emr-serverless create-application \
    --name monitor-spark-with-ci \
    --region <region> \
    --release-label emr-6.9.0 \
    --type SPARK \
    --network-configuration subnetIds=<subnet-xxxxxxx>,securityGroupIds=<sg-xxxxxxx> \
    --image-configuration '{ "imageUri": "<your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-prometheus" }'

  2. Copy the application JAR to your S3 bucket:
    aws s3 cp s3://aws-bigdata-blog/artifacts/BDB-2771/code/emrserverless-custom-images_2.12-1.0.jar s3://<YOUR BUCKET>/<PREFIX>/emrserverless-custom-images_2.12-1.0.jar

  3. In the following command, provide the name of the S3 bucket and prefix where you stored your application JAR. Additionally, provide the applicationId value obtained from the create-application command and your EMR Serverless job runtime IAM role ARN from the prerequisites, with permissions to write to the Amazon Managed Service for Prometheus workspace.
    aws emr-serverless start-job-run \
        --region <region> \
        --application-id <applicationId> \
        --execution-role-arn <jobRuntimeRole> \
        --job-driver '{
            "sparkSubmit": {
                "entryPoint": "s3://<YOUR BUCKET>/<PREFIX>/emrserverless-custom-images_2.12-1.0.jar",
                "entryPointArguments": ["40000000"],
                "sparkSubmitParameters": "--conf spark.ui.prometheus.enabled=true --conf spark.executor.processTreeMetrics.enabled=true --class emrserverless.prometheus.SyntheticAnalysis"
            }
        }' \
        --configuration-overrides '{
              "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                  "logUri": "s3://<YOUR BUCKET>/emrserverless/logs"
                }
              }
            }'
    

Within this Spark application, we run the bash script in the image to start the Prometheus process. If you're planning to use this image to monitor your own Spark application, you need to add the following lines to your Spark code after initiating the Spark session:

import scala.sys.process._
Seq("/home/hadoop/start-prometheus-agent.sh").!!

For PySpark applications, you can use the following code:

import os
os.system("/home/hadoop/start-prometheus-agent.sh")

Query Prometheus metrics and visualize them in Grafana

A minute or so after the job changes to Running status, you can query Prometheus metrics using awscurl.
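
awscurl is a third-party command line tool that signs requests with your AWS credentials (SigV4); if it isn't already installed, you can get it from PyPI:

pip3 install awscurl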

  1. Replace the value of AMP_QUERY_ENDPOINT with the query URL you noted earlier, and provide the job run ID obtained after submitting the Spark job. Make sure that you're using the credentials of an IAM user or role that has permissions to query the Prometheus workspace before running the commands.
    $ export AMP_QUERY_ENDPOINT="https://aps-workspaces.<region>.amazonaws.com/workspaces/<Workspace_ID>/api/v1/query"
    $ awscurl -X POST --region <region> \
                              --service aps "$AMP_QUERY_ENDPOINT?query=metrics_<jobRunId>_driver_ExecutorMetrics_TotalGCTime_Value{}"
    

    The following is example output from the query:

    {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{
                "metric": {
                    "__name__": "metrics_00f6bueadgb0lp09_driver_ExecutorMetrics_TotalGCTime_Value",
                    "instance": "localhost:4040",
                    "instance_type": "driver",
                    "job": "spark-driver",
                    "spark_cluster": "emrserverless",
                    "type": "gauges"
                },
                "value": [1671166922, "271"]
            }]
        }
    }

  2. Install Grafana on your local desktop and configure our AMP workspace as a data source. Grafana is a commonly used platform for visualizing Prometheus metrics.
  3. Before we start the Grafana server, enable AWS SigV4 authentication in order to sign queries to AMP with IAM permissions.
    ## Enable SigV4 auth 
    export AWS_SDK_LOAD_CONFIG=true 
    export GF_AUTH_SIGV4_AUTH_ENABLED=true

  4. In the same session, start the Grafana server. Note that the Grafana installation path may vary based on your OS configuration. Modify the command to start the Grafana server if your installation path is different from /usr/local/. Also, make sure that you're using the credentials of an IAM user or role that has permissions to query the Prometheus workspace before running the following commands.
    ## Start the Grafana server
    grafana-server --config=/usr/local/etc/grafana/grafana.ini \
      --homepath /usr/local/share/grafana \
      cfg:default.paths.logs=/usr/local/var/log/grafana \
      cfg:default.paths.data=/usr/local/var/lib/grafana \
      cfg:default.paths.plugins=/usr/local/var/lib/grafana/plugins

  5. Log in to Grafana and go to the data sources configuration page /datasources to add your AMP workspace as a data source. The URL should be without the /api/v1/query at the end. Enable SigV4 auth, then choose the appropriate Region and save.

When you explore the saved data source, you can see the metrics from the application we just submitted.

Now you can visualize these metrics and create elaborate dashboards in Grafana.

Clean up

When you're done running the examples, clean up the resources. You can use the following script to delete the resources created in EMR Serverless, Amazon Managed Service for Prometheus, and Amazon ECR. Pass the Region and, optionally, the Amazon Managed Service for Prometheus workspace ID as arguments to the script. Note that this script will not remove EMR Serverless applications in Running status.

aws s3 cp s3://aws-bigdata-blog/artifacts/BDB-2771/cleanup/cleanup_resources.sh .
chmod +x cleanup_resources.sh
sh cleanup_resources.sh <region> <AMP Workspace ID> 

Conclusion

In this post, you learned how to use custom images with Amazon EMR Serverless to address some common use cases. For more information on how to build custom images or view sample Dockerfiles, see Customizing the EMR Serverless image and Custom Image Samples.


About the Author

Veena Vasudevan is a Senior Partner Solutions Architect and an Amazon EMR specialist at AWS, focusing on Big Data and Analytics. She helps customers and partners build highly optimized, scalable, and secure solutions; modernize their architectures; and migrate their Big Data workloads to AWS.
