Amazon EMR Serverless lets you run open-source big data frameworks such as Apache Spark and Apache Hive without managing clusters and servers. Many customers who run Spark and Hive applications want to add their own libraries and dependencies to the application runtime. For example, you may want to add popular open-source extensions to Spark, or add a customized encryption-decryption module that is used by your application.
We're excited to announce a new capability that allows you to customize the runtime image used in EMR Serverless by adding custom libraries that your applications need. This feature enables you to do the following:
- Maintain a set of version-controlled libraries that are reused and available for use in all your EMR Serverless jobs as part of the EMR Serverless runtime
- Add popular extensions to open-source Spark and Hive frameworks such as pandas, NumPy, matplotlib, and more that you want your EMR Serverless application to use
- Use established CI/CD processes to build, test, and deploy your customized extension libraries to the EMR Serverless runtime
- Apply established security processes, such as image scanning, to meet the compliance and governance requirements within your organization
- Use a different version of a runtime component (for example, the JDK runtime or the Python SDK runtime) than the version that is available by default with EMR Serverless
In this post, we demonstrate how to use this new feature.
Solution overview
To use this capability, customize the EMR Serverless base image using Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry that makes it easy for your developers to share and deploy container images. Amazon ECR eliminates the need to operate your own container repositories or worry about scaling the underlying infrastructure. After the custom image is pushed to the container registry, specify the custom image while creating your EMR Serverless applications.
The following diagram illustrates the steps involved in using custom images for your EMR Serverless applications.
In the following sections, we demonstrate using custom images with Amazon EMR Serverless to address three common use cases:
- Add popular open-source Python libraries into the EMR Serverless runtime image
- Use a different or newer version of the Java runtime for the EMR Serverless application
- Install a Prometheus agent and customize the Spark runtime to push Spark JMX metrics to Amazon Managed Service for Prometheus, and visualize the metrics in a Grafana dashboard
General prerequisites
The following are the prerequisites for using custom images with EMR Serverless. Complete the following steps before proceeding:
- Create an AWS Identity and Access Management (IAM) role with IAM permissions for Amazon EMR Serverless applications, Amazon ECR permissions, and Amazon S3 permissions for the Amazon Simple Storage Service (Amazon S3) bucket `aws-bigdata-blog` and any S3 bucket in your account where you will store the application artifacts.
- Install or upgrade to the latest AWS Command Line Interface (AWS CLI) version and install the Docker service on an Amazon Linux 2 based Amazon Elastic Compute Cloud (Amazon EC2) instance. Attach the IAM role from the previous step to this EC2 instance.
- Select a base EMR Serverless image from the following public Amazon ECR repository. Run the following commands on the EC2 instance with Docker installed to verify that you are able to pull the base image from the public repository:
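For example, assuming you choose the EMR 6.9.0 Spark base image (any supported release label works), the pull looks like this:

```shell
# Pull the base EMR Serverless Spark image from the public Amazon ECR gallery.
# emr-6.9.0 is an example release label; substitute the release your jobs target.
docker pull public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest
```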
- Log in to Amazon ECR with the following commands and create a repository called `emr-serverless-ci-examples`, providing your AWS account ID and Region:
- Provide IAM permissions to the EMR Serverless service principal for the Amazon ECR repository:
- On the Amazon ECR console, choose Permissions under Repositories in the navigation pane.
- Choose Edit policy JSON.
- Enter the following JSON and save:
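A minimal example of such a repository policy, granting the EMR Serverless service principal pull access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EmrServerlessCustomImageSupport",
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:DescribeImages",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}
```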
Make sure that the policy is updated on the Amazon ECR console.
For production workloads, we recommend adding a condition in the Amazon ECR policy to ensure only allowed EMR Serverless applications can get, describe, and download images from this repository. For more information, refer to Allow EMR Serverless to access the custom image repository.
In the next steps, we create and use custom images in our EMR Serverless applications for the three different use cases.
Use case 1: Run data science applications
One of the common applications of Spark on Amazon EMR is the ability to run data science and machine learning (ML) applications at scale. For large datasets, Spark includes SparkML, which offers common ML algorithms that can be used to train models in a distributed fashion. However, you often need to run many iterations of simple classifiers to fit for hyperparameter tuning, ensembles, and multi-class solutions over small-to-medium-sized data (100,000 to 1 million records). Spark is a great engine to run multiple iterations of such classifiers in parallel. In this example, we demonstrate this use case, where we use Spark to run multiple iterations of an XGBoost model to select the best parameters. The ability to include Python dependencies in the EMR Serverless image should make it easy to make the various dependencies (`xgboost`, `sk-dist`, `pandas`, `numpy`, and so on) available for the application.
Prerequisites
The EMR Serverless job runtime IAM role should be given permissions to your S3 bucket where you will be storing your PySpark file and application logs:
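A minimal example of such a policy; the bucket name is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AccessToS3Bucket",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR-BUCKET>",
        "arn:aws:s3:::<YOUR-BUCKET>/*"
      ]
    }
  ]
}
```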
Create an image to install ML dependencies
We create a custom image from the base EMR Serverless image to install the dependencies required by the SparkML application. Create the following Dockerfile on your EC2 instance that runs the Docker process, inside a new directory named `datascience`:
Build and push the image to the Amazon ECR repository `emr-serverless-ci-examples`, providing your AWS account ID and Region:
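The build-and-push steps might look like the following; the account ID, Region, and image tag (`emr-serverless-ds`) are placeholders:

```shell
# Placeholders: substitute your AWS account ID and Region.
AWS_ACCOUNT_ID=111122223333
AWS_REGION=us-east-1
ECR_URI="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Authenticate Docker to your private ECR registry.
aws ecr get-login-password --region "${AWS_REGION}" \
  | docker login --username AWS --password-stdin "${ECR_URI}"

# Build the image from the datascience directory and push it.
docker build -t "${ECR_URI}/emr-serverless-ci-examples:emr-serverless-ds" ./datascience
docker push "${ECR_URI}/emr-serverless-ci-examples:emr-serverless-ds"
```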
Submit your Spark application
Create an EMR Serverless application with the custom image created in the previous step:
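For example (the application name, Region, and image URI are placeholders; `--image-configuration` is what points the application at your custom image):

```shell
aws emr-serverless create-application \
  --name data-science-app \
  --release-label emr-6.9.0 \
  --type SPARK \
  --image-configuration '{
    "imageUri": "111122223333.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ds"
  }' \
  --region us-east-1
```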
Make a note of the value of `applicationId` returned by the command.
After the application is created, we're ready to submit our job. Copy the application file to your S3 bucket:
Submit the Spark data science job. In the following command, provide the name of the S3 bucket and prefix where you stored your application file. Additionally, provide the `applicationId` value obtained from the `create-application` command and your EMR Serverless job runtime IAM role ARN.
After the Spark job succeeds, you can view the best model estimates from our application by viewing the Spark driver's `stdout` logs. Navigate to Spark History Server, Executors, Driver, Logs, stdout.
Use case 2: Use a custom Java runtime environment
Another use case for custom images is the ability to use a custom Java version for your EMR Serverless applications. For example, if you're using Java 11 to compile and package your Java or Scala applications and try to run them directly on EMR Serverless, it may lead to runtime errors because EMR Serverless uses the Java 8 JRE by default. To make the runtime environments of your EMR Serverless applications compatible with your compile environment, you can use the custom images feature to install the Java version you are using to package your applications.
Prerequisites
An EMR Serverless job runtime IAM role should be given permissions to your S3 bucket where you will be storing your application JAR and logs:
Create an image to install a custom Java version
We first create an image that installs a Java 11 runtime environment. Create the following Dockerfile on your EC2 instance inside a new directory named `customjre`:
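A sketch of such a Dockerfile; installing OpenJDK 11 via `amazon-linux-extras` is one option on the Amazon Linux 2 based image, and the base release label is an assumption:

```dockerfile
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Install a Java 11 runtime alongside the default Java 8 JRE.
RUN amazon-linux-extras install -y java-openjdk11

USER hadoop:hadoop
```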
Build and push the image to the Amazon ECR repository `emr-serverless-ci-examples`, providing your AWS account ID and Region:
Submit your Spark application
Create an EMR Serverless application with the custom image created in the previous step:
Copy the application JAR to your S3 bucket:
Submit a Spark Scala job that was compiled with the Java 11 JRE. This job also uses Java APIs that may produce different results for different versions of Java (for example, `java.time.ZoneId`). In the following command, provide the name of the S3 bucket and prefix where you stored your application JAR. Additionally, provide the `applicationId` value obtained from the `create-application` command and your EMR Serverless job runtime role ARN with the IAM permissions mentioned in the prerequisites. Note that in the `sparkSubmitParameters`, we pass a custom Java version for our Spark driver and executor environments to instruct our job to use the Java 11 runtime.
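A sketch of the job submission; the bucket, class name, application ID, role ARN, and the exact Java 11 directory are placeholders — check the actual JVM path inside your image (for example, with `docker run --rm <image> ls /usr/lib/jvm`):

```shell
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn arn:aws:iam::111122223333:role/emr-serverless-job-role \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://<bucket>/<prefix>/java11-demo.jar",
      "sparkSubmitParameters": "--class com.example.Main --conf spark.emr-serverless.driverEnv.JAVA_HOME=/usr/lib/jvm/<java-11-directory> --conf spark.executorEnv.JAVA_HOME=/usr/lib/jvm/<java-11-directory>"
    }
  }'
```

The `spark.emr-serverless.driverEnv.*` and `spark.executorEnv.*` properties set environment variables for the driver and executors, which is how the job is pointed at the Java 11 runtime instead of the default JRE.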
You can also extend this use case to install and use a custom Python version for your PySpark applications.
Use case 3: Monitor Spark metrics in a single Grafana dashboard
Spark JMX telemetry provides a lot of fine-grained details about every stage of the Spark application, even at the JVM level. These insights can be used to tune and optimize Spark applications to reduce job runtime and cost. Prometheus is a popular tool for collecting, querying, and visualizing application and host metrics from several different processes. After the metrics are collected in Prometheus, we can query these metrics or use Grafana to build dashboards and visualize them. In this use case, we use Amazon Managed Service for Prometheus to collect Spark driver and executor metrics from our EMR Serverless Spark application, and we use Grafana to visualize the collected metrics. The following screenshot is an example Grafana dashboard for an EMR Serverless Spark application.
Prerequisites
Complete the following prerequisite steps:
- Create a VPC, private subnet, and security group. The private subnet should have a NAT gateway or VPC S3 endpoint attached. The security group should allow outbound access on HTTPS port 443 and should have a self-referencing inbound rule for all traffic. Both the private subnet and security group should be associated with the two Amazon Managed Service for Prometheus VPC endpoint interfaces.
- On the Amazon Virtual Private Cloud (Amazon VPC) console, create two endpoints for Amazon Managed Service for Prometheus and the Amazon Managed Service for Prometheus workspace. Associate the VPC, private subnet, and security group with both endpoints. Optionally, provide a name tag for your endpoints and leave everything else as default.
- Create a new workspace on the Amazon Managed Service for Prometheus console.
- Note the ARN and the values for Endpoint – remote write URL and Endpoint – query URL.
- Attach the following policy to your Amazon EMR Serverless job runtime IAM role to provide remote write access to your Prometheus workspace. Replace the ARN copied from the previous step in the `Resource` section of `"Sid": "AccessToPrometheus"`. This role should also have permissions to your S3 bucket where you will be storing your application JAR and logs.
- Create an IAM user or role with permissions to create and query the Amazon Managed Service for Prometheus workspace.
We use the same IAM user or role to authenticate in Grafana and query the Prometheus workspace.
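The remote write policy attached to the job runtime role might look like the following; the workspace ARN is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AccessToPrometheus",
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite"
      ],
      "Resource": "arn:aws:aps:<region>:<account-id>:workspace/<workspace-id>"
    }
  ]
}
```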
Create an image to install the Prometheus agent
We create a custom image from the base EMR Serverless image to do the following:
- Update the Spark metrics configuration to use `PrometheusServlet` to publish driver and executor JMX metrics in Prometheus format
- Download and install the Prometheus agent
- Add the configuration YAML file to instruct the Prometheus agent to send the metrics to the Amazon Managed Service for Prometheus workspace
Create the Prometheus config YAML file to scrape the driver, executor, and application metrics. You can run the following example commands on the EC2 instance.
- Copy the `prometheus.yaml` file from our S3 path:
- Modify `prometheus.yaml` to replace the Region and the value of the `remote_write` URL with the remote write URL obtained from the prerequisites:
- Upload the file to your own S3 bucket:
- Create the following Dockerfile inside a new directory named `prometheus` on the same EC2 instance that runs the Docker service. Provide the S3 path where you uploaded the `prometheus.yaml` file.
- Build the Dockerfile and push to Amazon ECR, providing your AWS account ID and Region:
Submit the Spark application
After the Docker image has been pushed successfully, you can create the serverless Spark application with the custom image you created. We use the AWS CLI to submit Spark jobs with the custom image on EMR Serverless. Your AWS CLI should be upgraded to the latest version to run the following commands.
- In the following AWS CLI command, provide your AWS account ID and Region. Additionally, provide the subnet and security group from the prerequisites in the network configuration. In order to successfully push metrics from EMR Serverless to Amazon Managed Service for Prometheus, make sure that you are using the same VPC, subnet, and security group you created based on the prerequisites.
- Copy the application JAR to your S3 bucket:
- In the following command, provide the name of the S3 bucket and prefix where you stored your application JAR. Additionally, provide the `applicationId` value obtained from the `create-application` command and your EMR Serverless job runtime IAM role ARN from the prerequisites, with permissions to write to the Amazon Managed Service for Prometheus workspace.
Within this Spark application, we run the bash script in the image to start the Prometheus process. If you're planning to use this image to monitor your own Spark application, you will need to add the following lines to your Spark code after initiating the Spark session:
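One way to start the agent from Scala is shown below; the binary and config locations are hypothetical and must match wherever your Dockerfile installed them:

```scala
// Hypothetical paths: must match the locations used in your custom image.
import sys.process._

// Start the Prometheus agent in the background so it can scrape the
// PrometheusServlet endpoints and remote-write to the AMP workspace.
Seq("/bin/bash", "-c",
  "/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yaml > /tmp/prometheus.log 2>&1 &").run()
```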
For PySpark applications, you can use the following code:
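A minimal sketch for PySpark; the binary and config paths are hypothetical and must match your image layout:

```python
import os
import subprocess

# Hypothetical paths: they must match where the agent and its config were
# installed in the custom image.
PROMETHEUS_BINARY = "/opt/prometheus/prometheus"
PROMETHEUS_CONFIG = "/opt/prometheus/prometheus.yaml"


def start_prometheus_agent(binary=PROMETHEUS_BINARY, config=PROMETHEUS_CONFIG):
    """Start the Prometheus agent as a background process.

    Returns the Popen handle, or None if the binary is not present
    (for example, when running outside the custom image).
    """
    if not os.path.exists(binary):
        return None
    return subprocess.Popen([binary, f"--config.file={config}"])


# Call this right after creating the SparkSession, for example:
# spark = SparkSession.builder.appName("metrics-demo").getOrCreate()
# start_prometheus_agent()
```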
Query Prometheus metrics and visualize in Grafana
About a minute after the job changes to `Running` status, you can query Prometheus metrics using awscurl.
- Replace the value of `AMP_QUERY_ENDPOINT` with the query URL you noted earlier, and provide the job run ID obtained after submitting the Spark job. Make sure that you're using the credentials of an IAM user or role that has permissions to query the Prometheus workspace before running the commands. The following is example output from the query:
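With awscurl installed (`pip install awscurl`), a query might look like the following; the endpoint and Region are placeholders, and the PromQL expression is a generic example since metric names depend on your metrics configuration:

```shell
# AMP_QUERY_ENDPOINT is the "Endpoint - query URL" from the workspace details.
AMP_QUERY_ENDPOINT="https://aps-workspaces.us-east-1.amazonaws.com/workspaces/<workspace-id>/api/v1/query"

# awscurl signs the request with SigV4 using your current AWS credentials.
awscurl --service aps --region us-east-1 "${AMP_QUERY_ENDPOINT}?query=up"
```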
- Install Grafana on your local desktop and configure our AMP workspace as a data source. Grafana is a commonly used platform for visualizing Prometheus metrics.
- Before we start the Grafana server, enable AWS SigV4 authentication in order to sign queries to AMP with IAM permissions.
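One way to do this (assuming Grafana 7.3.5 or later) is through environment variables in the shell session that will start the server:

```shell
# Load Region/credentials from your AWS config and enable SigV4 signing
# in Grafana's data source proxy.
export AWS_SDK_LOAD_CONFIG=true
export GF_AUTH_SIGV4_AUTH_ENABLED=true
```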
- In the same session, start the Grafana server. Note that the Grafana installation path may vary based on your OS configuration. Modify the command to start the Grafana server if your installation path is different from `/usr/local/`. Also, make sure that you're using the credentials of an IAM user or role that has permissions to query the Prometheus workspace before running the following commands.
/api/v1/question
on the finish. AllowSigV4 auth
, then select the suitable Area and save.
When you explore the saved data source, you can see the metrics from the application we just submitted.
Now you can visualize these metrics and create elaborate dashboards in Grafana.
Clean up
When you're done running the examples, clean up the resources. You can use the following script to delete resources created in EMR Serverless, Amazon Managed Service for Prometheus, and Amazon ECR. Pass the Region and, optionally, the Amazon Managed Service for Prometheus workspace ID as arguments to the script. Note that this script will not remove EMR Serverless applications in `Running` status.
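A sketch of such a cleanup script; it assumes the resource names used in this post and deletes every non-running EMR Serverless application in your account, so review it before use:

```shell
#!/bin/bash
# Usage: ./cleanup.sh <region> [amp-workspace-id]
set -euo pipefail
REGION="$1"
WORKSPACE_ID="${2:-}"

# Delete EMR Serverless applications that are not running
# (applications must be stopped before they can be deleted).
for APP_ID in $(aws emr-serverless list-applications --region "$REGION" \
    --query "applications[?state=='CREATED' || state=='STOPPED'].id" --output text); do
  aws emr-serverless delete-application --application-id "$APP_ID" --region "$REGION"
done

# Delete the ECR repository along with all pushed images.
aws ecr delete-repository --repository-name emr-serverless-ci-examples \
  --force --region "$REGION"

# Optionally delete the Amazon Managed Service for Prometheus workspace.
if [ -n "$WORKSPACE_ID" ]; then
  aws amp delete-workspace --workspace-id "$WORKSPACE_ID" --region "$REGION"
fi
```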
Conclusion
In this post, you learned how to use custom images with Amazon EMR Serverless to address some common use cases. For more information on how to build custom images or to view sample Dockerfiles, see Customizing the EMR Serverless image and Custom Image Samples.
About the Author
Veena Vasudevan is a Senior Partner Solutions Architect and an Amazon EMR specialist at AWS, focusing on Big Data and Analytics. She helps customers and partners build highly optimized, scalable, and secure solutions; modernize their architectures; and migrate their Big Data workloads to AWS.