AWS Batch

User Guide

What is AWS Batch?
AWS Batch helps you to run batch computing workloads on the AWS Cloud. Batch computing
is a common way for developers, scientists, and engineers to access large amounts of compute
resources. AWS Batch removes the undifferentiated heavy lifting of configuring and managing the
required infrastructure, similar to traditional batch computing software. The service efficiently
provisions resources in response to submitted jobs, eliminating capacity constraints, reducing
compute costs, and delivering results quickly.
As a fully managed service, AWS Batch helps you to run batch computing workloads of any scale.
AWS Batch automatically provisions compute resources and optimizes the workload distribution
based on the quantity and scale of the workloads. With AWS Batch, there's no need to install or
manage batch computing software, so you can focus your time on analyzing results and solving
problems.


AWS Batch provides all of the necessary functionality to run high-scale, compute-intensive
workloads on top of AWS managed container orchestration services, Amazon ECS and Amazon EKS.
AWS Batch is able to scale compute capacity on Amazon EC2 instances and Fargate resources.
AWS Batch provides a fully managed service for batch workloads, and delivers the operational
capabilities to optimize these types of workloads for throughput, speed, resource efficiency, and
cost.
AWS Batch also enables SageMaker Training job queuing, allowing data scientists and ML engineers
to submit Training jobs with priorities to configurable queues. Queued ML workloads run
automatically as soon as resources become available, which eliminates manual coordination and
improves resource utilization. You can configure queues with specific policies to optimize cost,
performance, and resource allocation for your ML Training workloads.

This provides a shared responsibility model where administrators set up the infrastructure and
permissions, while data scientists can focus on submitting and monitoring their ML training
workloads. Jobs are automatically queued and executed based on configured priorities and
resource availability.


Are you a first-time AWS Batch user?
If you are a first-time user of AWS Batch, we recommend that you begin by reading the following
sections:
• Components of AWS Batch
• Create IAM account and administrative user
• Setting up AWS Batch
• Getting started with AWS Batch tutorials
• Getting started with AWS Batch on SageMaker AI

Related services
AWS Batch is a fully managed batch computing service that plans, schedules, and runs your
containerized batch ML, simulation, and analytics workloads across the full range of AWS compute
offerings, such as Amazon ECS, Amazon EKS, AWS Fargate, and Spot or On-Demand Instances. For
more information about each managed compute service, see:
• Amazon EC2 User Guide
• AWS Fargate Developer Guide
• Amazon EKS User Guide
• Amazon SageMaker AI Developer Guide

Accessing AWS Batch
You can access AWS Batch using the following:
AWS Batch console
The web interface where you create and manage resources.
AWS Command Line Interface
Interact with AWS services using commands in your command line shell. The AWS Command
Line Interface is supported on Windows, macOS, and Linux. For more information about the
AWS CLI, see AWS Command Line Interface User Guide. You can find the AWS Batch commands
in the AWS CLI Command Reference.
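For example, once your credentials and default Region are configured, the following commands list
your job queues and the running jobs in one of them (the queue name MyJQ is a placeholder):

$ aws batch describe-job-queues
$ aws batch list-jobs --job-queue MyJQ --job-status RUNNING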

AWS SDKs
If you prefer to build applications using language-specific APIs instead of submitting a request
over HTTP or HTTPS, use the libraries, sample code, tutorials, and other resources provided by
AWS. These libraries provide basic functions that automate tasks, such as cryptographically
signing your requests, retrying requests, and handling error responses. These functions make it
more efficient for you to get started. For more information, see Tools to Build on AWS.

Components of AWS Batch
AWS Batch simplifies running batch jobs across multiple Availability Zones within a Region. You
can create AWS Batch compute environments within a new or existing VPC. After a compute
environment is up and associated with a job queue, you can define job definitions that specify
which Docker container images to use to run your jobs. Container images are stored in and pulled
from container registries, which may exist within or outside of your AWS infrastructure.

Compute environment
A compute environment is a set of managed or unmanaged compute resources that are used to
run jobs. With managed compute environments, you can specify the desired compute type (Fargate
or EC2) at several levels of detail. You can set up compute environments that use a particular type
of EC2 instance or a particular model, such as c5.2xlarge or m4.10xlarge. Or, you can choose
only to specify that you want to use the newest instance types. You can also specify the minimum,
desired, and maximum number of vCPUs for the environment, along with the amount that you're
willing to pay for a Spot Instance as a percentage of the On-Demand Instance price, and a target
set of VPC subnets. AWS Batch efficiently launches, manages, and terminates compute resources as
needed. You can also manage your own compute environments. In that case, you're responsible for
setting up and scaling the instances in an Amazon ECS cluster that AWS Batch creates for you. For
more information, see Compute environments for AWS Batch.
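As a minimal CLI sketch of a managed EC2 compute environment (the name, subnet ID, security
group ID, and instance profile are placeholders, and production environments typically need more
settings):

$ aws batch create-compute-environment \
    --compute-environment-name MyCE \
    --type MANAGED \
    --compute-resources '{
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-aaaa1111"],
        "securityGroupIds": ["sg-bbbb2222"],
        "instanceRole": "ecsInstanceRole"
    }'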

Job queues
When you submit an AWS Batch job, you submit it to a particular job queue, where the
job resides until it's scheduled onto a compute environment. You associate one or more
compute environments with a job queue. You can also assign priority values for these compute
environments and even across job queues themselves. For example, you can have a high priority
queue that you submit time-sensitive jobs to, and a low priority queue for jobs that can run
anytime when compute resources are cheaper. For more information, see Job queues.
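For example, the following sketch creates a high priority queue attached to the compute
environment created above (names are placeholders; queues with larger priority values are
scheduled first):

$ aws batch create-job-queue \
    --job-queue-name HighPriorityJQ \
    --priority 10 \
    --compute-environment-order order=1,computeEnvironment=MyCE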

Job definitions
A job definition specifies how jobs are to be run. You can think of a job definition as a blueprint for
the resources in your job. You can supply your job with an IAM role to provide access to other AWS
resources. You also specify both memory and CPU requirements. The job definition can also control
container properties, environment variables, and mount points for persistent storage. Many of
the specifications in a job definition can be overridden by specifying new values when you submit
individual jobs. For more information, see Job definitions.
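A minimal sketch of registering a container job definition from the AWS CLI (the image, name, and
resource values are illustrative):

$ aws batch register-job-definition \
    --job-definition-name MyJD \
    --type container \
    --container-properties '{
        "image": "public.ecr.aws/amazonlinux/amazonlinux:latest",
        "command": ["echo", "hello world"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "2048"}
        ]
    }'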

Jobs
A unit of work (such as a shell script, a Linux executable, or a Docker container image) that you
submit to AWS Batch. It has a name, and runs as a containerized application on AWS Fargate or
Amazon EC2 resources in your compute environment, using parameters that you specify in a job
definition. Jobs can reference other jobs by name or by ID, and can be dependent on the successful
completion of other jobs or the availability of resources you specify. For more information, see
Jobs.
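As a sketch of a job dependency, the following submits two jobs, where the second starts only
after the first succeeds (names are placeholders, and the job ID comes from the first command's
output):

$ aws batch submit-job --job-name prepare-data \
    --job-queue MyJQ --job-definition MyJD
$ aws batch submit-job --job-name process-data \
    --job-queue MyJQ --job-definition MyJD \
    --depends-on jobId=<jobId-from-first-command>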

Scheduling policy
You can use scheduling policies to configure how compute resources in a job queue are allocated
between users or workloads. Using fair-share scheduling policies, you can assign different share
identifiers to workloads or users. The AWS Batch job scheduler defaults to a first-in, first-out (FIFO)
strategy. For more information, see Fair-share scheduling policies.
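As a sketch, the following creates a fair-share policy with two share identifiers (names and
weights are illustrative); you attach the policy to a queue with the --scheduling-policy-arn
option of create-job-queue:

$ aws batch create-scheduling-policy \
    --name MySharePolicy \
    --fairshare-policy '{
        "shareDecaySeconds": 3600,
        "computeReservation": 10,
        "shareDistribution": [
            {"shareIdentifier": "teamA", "weightFactor": 1.0},
            {"shareIdentifier": "teamB", "weightFactor": 2.0}
        ]
    }'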

Consumable resources
A consumable resource is a resource that is needed to run your jobs, such as a third-party license
token, database access bandwidth, or quota for calls to a rate-limited third-party API. You
specify the consumable resources that a job needs to run, and AWS Batch takes these resource
dependencies into account when it schedules the job. You can reduce the under-utilization of
compute resources by running only the jobs that have all of their required resources available.
For more information, see Resource-aware scheduling.
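A hedged sketch of creating a consumable resource from the AWS CLI; the name and quantity are
placeholders, and the exact parameter names are worth verifying against the current AWS CLI
Command Reference:

$ aws batch create-consumable-resource \
    --consumable-resource-name MyLicenseTokens \
    --total-quantity 50 \
    --resource-type REPLENISHABLE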

Service environment
A service environment defines how AWS Batch integrates with SageMaker AI for job execution.
Service environments enable AWS Batch to submit and manage jobs on SageMaker AI while providing
the queuing, scheduling, and priority management capabilities of AWS Batch. Service environments
define capacity limits for specific service types, such as SageMaker Training jobs. The capacity
limits control the maximum resources that service jobs in the environment can use. For more
information, see Service environments for AWS Batch.
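A hedged sketch of creating a service environment for SageMaker Training jobs; the capacity shape
shown here is an assumption to verify against the AWS CLI Command Reference:

$ aws batch create-service-environment \
    --service-environment-name MySageMakerSE \
    --service-environment-type SAGEMAKER_TRAINING \
    --capacity-limits maxCapacity=20,capacityUnit=NUM_INSTANCES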

Service job
A service job is a unit of work that you submit to AWS Batch to run on a service environment.
Service jobs leverage AWS Batch's queuing and scheduling capabilities while delegating actual
execution to the external service. For example, SageMaker Training jobs submitted as service
jobs are queued and prioritized by AWS Batch, but the SageMaker Training job execution occurs
within SageMaker AI infrastructure. This integration enables data scientists and ML engineers
to benefit from AWS Batch's automated workload management and priority queuing for their
SageMaker AI Training workloads. Service jobs can reference other jobs by name or ID and support
job dependencies. For more information, see Service jobs in AWS Batch.
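A hedged sketch of submitting a SageMaker Training job as a service job; treat the option names
as assumptions to verify against the AWS CLI Command Reference, and training-request.json would
hold the SageMaker CreateTrainingJob request:

$ aws batch submit-service-job \
    --job-name my-training-job \
    --job-queue MyServiceJQ \
    --service-job-type SAGEMAKER_TRAINING \
    --service-request-payload file://training-request.json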


Setting up AWS Batch
If you've already signed up for Amazon Web Services (AWS) and are using Amazon Elastic Compute
Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS), you're almost ready to use
AWS Batch. The setup process for these services is similar because AWS Batch uses Amazon ECS
container instances in its compute environments. To use the AWS CLI with AWS Batch, you must
use a version of the AWS CLI that supports the latest AWS Batch features. If you don't see support
for an AWS Batch feature in the AWS CLI, upgrade to the latest version. For more information, see
http://aws.amazon.com/cli/.
Note
Because AWS Batch uses components of Amazon EC2, you use the Amazon EC2 console for
many of these steps.

Complete the following tasks to get set up for AWS Batch.
Topics
• Create IAM account and administrative user
• Create IAM roles for your compute environments and container instances
• Create a key pair for your instances
• Create a VPC
• Create a security group
• Install the AWS CLI

Create IAM account and administrative user
To get started, you need to create an AWS account and a single user that is typically granted
administrative rights. To accomplish this, complete the following tutorials:

Sign up for an AWS account
If you do not have an AWS account, complete the following steps to create one.

Getting started with AWS Batch tutorials
You can use the AWS Batch first-run wizard to get started quickly with AWS Batch. After you
complete the Prerequisites, the first-run wizard walks you through creating a compute environment,
a job definition, and a job queue.
You can also submit a sample "Hello World" job from the wizard to test your configuration. If you
already have a Docker image that you want to launch in AWS Batch, you can use that image to create
the job definition instead.

Getting started with Amazon EC2 orchestration using the
Wizard
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS
Cloud. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop
and deploy applications faster.
You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure
security and networking, and manage storage. Amazon EC2 enables you to scale up or down to
handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.

Overview
This tutorial demonstrates how to set up AWS Batch with the Wizard to configure Amazon EC2 and
run Hello World.
Intended Audience
This tutorial is designed for system administrators and developers responsible for setting up,
testing, and deploying AWS Batch.
Features Used
This tutorial shows you how to use the AWS Batch console wizard to:
• Create and configure an Amazon EC2 compute environment
• Create a job queue
• Create a job definition
• Create and submit a job to run
• View the output of the job in CloudWatch
Time Required
It should take about 10–15 minutes to complete this tutorial.
Regional Restrictions
There are no country or regional restrictions associated with using this solution.
Resource Usage Costs
There's no charge for creating an AWS account. However, by implementing this solution, you
might incur some or all of the costs that are listed in the following table.
Description          Cost (US dollars)
Amazon EC2 instance  You pay for each Amazon EC2 instance that is created. For more information
                     about pricing, see Amazon EC2 Pricing.

Prerequisites
Before you begin:
• Create an AWS account if you don't have one.
• Create the ecsInstanceRole instance role.

Step 1: Create a compute environment
Important
To get started as simply and quickly as possible, this tutorial includes steps with default
settings. Before you create resources for production use, we recommend that you familiarize
yourself with all of the settings and deploy with the settings that meet your requirements.

To create a compute environment for an Amazon EC2 orchestration, do the following:

Best practices for AWS Batch
You can use AWS Batch to run a variety of demanding computational workloads at scale without
managing a complex architecture. AWS Batch jobs can be used in a wide range of use cases in areas
such as epidemiology, gaming, and machine learning.
This topic covers best practices to consider while using AWS Batch, along with guidance on how to
run and optimize your workloads.
Topics
• When to use AWS Batch
• Checklist to run at scale
• Optimize containers and AMIs
• Choose the right compute environment resource
• Amazon EC2 On-Demand or Amazon EC2 Spot
• Use Amazon EC2 Spot best practices for AWS Batch
• Common errors and troubleshooting

When to use AWS Batch
AWS Batch runs jobs at scale and at low cost, and provides queuing services and cost-optimized
scaling. However, not every workload is suitable to be run using AWS Batch.
• Short jobs – If a job runs for only a few seconds, the overhead of scheduling the batch job
might exceed the runtime of the job itself. As a workaround, binpack your tasks together before
you submit them to AWS Batch. Then, configure your AWS Batch jobs to iterate over the tasks. For
example, stage the individual task arguments into an Amazon DynamoDB table or as a file in an
Amazon S3 bucket. Consider grouping tasks so that each job runs for 3–5 minutes. After you
binpack the jobs, loop through your task groups within your AWS Batch job (see the sketch after
this list).
• Jobs that must be run immediately – AWS Batch can process jobs quickly. However, AWS Batch
is a scheduler and optimizes for cost performance, job priority, and throughput. AWS Batch
might require time to process your requests. If you need a response in under a few seconds, then
a service-based approach using Amazon ECS or Amazon EKS is more suitable.
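As a minimal sketch of the binpacking workaround in the first item above (the bucket name, key
layout, and run-task script are placeholders), a container entrypoint might fetch one group of
staged task arguments and loop over it:

#!/bin/bash
set -euo pipefail

# Each job (or array child) fetches its own task group, staged earlier
# as one file of task arguments per group in Amazon S3.
aws s3 cp "s3://my-bucket/task-groups/${AWS_BATCH_JOB_ARRAY_INDEX:-0}.txt" /tmp/tasks.txt

# Run the tasks in sequence so that one Batch job amortizes scheduling
# overhead across many short tasks.
while read -r task_args; do
    /usr/local/bin/run-task $task_args
done < /tmp/tasks.txt

AWS_BATCH_JOB_ARRAY_INDEX is set automatically when the job is submitted as an array job; the
fallback of 0 covers single-job testing.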

Checklist to run at scale
Before you run a large workload on 50 thousand or more vCPUs, consider the following checklist.

Note
If you plan to run a large workload on a million or more vCPUs or need guidance running at
large scale, contact your AWS team.

• Check your Amazon EC2 quotas – Check your Amazon EC2 quotas (also known as limits) in the
Service Quotas panel of the AWS Management Console. If necessary, request a quota increase for
your peak number of Amazon EC2 instances. Remember that Amazon EC2 Spot and Amazon EC2 On-Demand
instances have separate quotas (a CLI example follows this list). For more information, see
Getting started with Service Quotas.
• Verify your Amazon Elastic Block Store quota for each Region – Each instance uses a gp2 or
gp3 volume for the operating system. By default, the quota for each AWS Region is 300 TiB.
Each instance's volume counts toward this quota, so make sure to factor this in when you verify
your Amazon Elastic Block Store quota for each Region. If your quota is reached, you can't create
more instances. For more information, see Amazon Elastic Block Store endpoints and quotas.
• Use Amazon S3 for storage – Amazon S3 provides high throughput and helps to eliminate the
guesswork on how much storage to provision based on the number of jobs and instances in each
Availability Zone. For more information, see Best practices design patterns: optimizing Amazon
S3 performance.
• Scale gradually to identify bottlenecks early – For a job that runs on a million or more vCPUs,
start lower and gradually increase so that you can identify bottlenecks early. For example, start
by running on 50 thousand vCPUs. Then, increase the count to 200 thousand vCPUs, and then
500 thousand vCPUs, and so on. In other words, continue to gradually increase the vCPU count
until you reach the desired number of vCPUs.
• Monitor to identify potential issues early – To avoid potential breaks and issues when running
at scale, make sure to monitor both your application and architecture. Breaks might occur
even when scaling from 1 thousand to 5 thousand vCPUs. You can use Amazon CloudWatch
Logs to review log data or publish CloudWatch embedded metrics using a client library. For more
information, see CloudWatch Logs agent reference and aws-embedded-metrics.
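For the first item in this checklist, you can also read a quota programmatically. A minimal
sketch, assuming the quota codes below (running On-Demand Standard instances and All Standard
Spot Instance Requests) are still current; verify them in the Service Quotas console:

$ aws service-quotas get-service-quota \
    --service-code ec2 --quota-code L-1216C47A
$ aws service-quotas get-service-quota \
    --service-code ec2 --quota-code L-34B43A08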

Optimize containers and AMIs
Container size and structure are important for the first set of jobs that you run. This is especially
true if the container is larger than 4 GB. Container images are built in layers. The layers are
retrieved in parallel by Docker using three concurrent threads. You can increase the number of
concurrent threads using the max-concurrent-downloads parameter. For more information, see
the dockerd documentation.
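For example, on instances that run dockerd you might raise the download thread count in
/etc/docker/daemon.json, typically through a custom AMI or launch template user data (the value
8 is illustrative):

{
    "max-concurrent-downloads": 8
}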
Although you can use larger containers, we recommend that you optimize container structure and
size for faster startup times.
• Smaller containers are fetched faster – Smaller containers can lead to faster application start
times. To decrease container size, offload libraries or files that are updated infrequently to the
Amazon Machine Image (AMI). You can also use bind mounts to give your containers access to those
files. For more information, see Bind mounts.
• Create layers that are even in size and break up large layers – Each layer is retrieved by one
thread. So, a large layer might significantly impact your job startup time. We recommend a
maximum layer size of 2 GB as a good tradeoff between larger container size and faster startup
times. You can run the docker history your_image_id command to check your container
image structure and layer size. For more information, see the Docker documentation.
• Use Amazon Elastic Container Registry as your container repository – When you run thousands
of jobs in parallel, a self-managed repository can fail or throttle throughput. Amazon ECR works
at scale and can handle workloads that use a million or more vCPUs.


Choose the right compute environment resource
AWS Fargate requires less initial setup and configuration than Amazon EC2 and is likely easier
to use, particularly if it's your first time. With Fargate, you don't need to manage servers, handle
capacity planning, or isolate container workloads for security.
If you have the following requirements, we recommend that you use Fargate:
• Your jobs must start quickly, specifically in less than 30 seconds.
• Your jobs require 16 vCPUs or fewer, no GPUs, and 120 GiB of memory or less.
For more information, see When to use Fargate.
If you have the following requirements, we recommend that you use Amazon EC2 instances:
• You require increased control over instance selection or require specific instance types.
• Your jobs require resources that AWS Fargate can’t provide, such as GPUs, more memory, a
custom AMI, or the Amazon Elastic Fabric Adapter.
• You require a high level of throughput or concurrency.
• You need to customize your AMI or Amazon EC2 launch template, or you need access to special
Linux parameters.
With Amazon EC2, you can more finely tune your workload to your specific requirements and run at
scale if needed.
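As a sketch, a Fargate compute environment needs noticeably less configuration than the EC2
example earlier in this guide (the subnet and security group IDs are placeholders):

$ aws batch create-compute-environment \
    --compute-environment-name MyFargateCE \
    --type MANAGED \
    --compute-resources '{
        "type": "FARGATE",
        "maxvCpus": 64,
        "subnets": ["subnet-aaaa1111"],
        "securityGroupIds": ["sg-bbbb2222"]
    }'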

Amazon EC2 On-Demand or Amazon EC2 Spot
Most AWS Batch customers use Amazon EC2 Spot instances because of the savings over On-Demand
instances. However, if your workload runs for multiple hours and can't be interrupted,
On-Demand instances might be more suitable for you. You can always try Spot instances first and
switch to On-Demand if necessary.
If you have the following requirements and expectations, use Amazon EC2 On-Demand instances:
• The runtime of your jobs is more than an hour, and you can't tolerate interruptions to your
workload.
• You have a strict SLO (service-level objective) for your overall workload and can’t increase
computational time.
• The instances that you require are more likely to see interruptions.
If you have the following requirements and expectations, use Amazon EC2 Spot instances:
• The runtime for your jobs is typically 30 minutes or less.
• You can tolerate potential interruptions and job rescheduling as a part of your workload. For
more information, see Spot Instance advisor.
• Long-running jobs can be restarted from a checkpoint if they're interrupted.
You can mix both purchasing models by submitting to Spot instances first and then using
On-Demand instances as a fallback option. For example, submit your jobs to a queue that's
connected to compute environments that run on Amazon EC2 Spot instances. If a job
gets interrupted, catch the event from Amazon EventBridge and correlate it to a Spot instance
reclamation. Then, resubmit the job to an On-Demand queue using an AWS Lambda function or
AWS Step Functions. For more information, see Tutorial: Sending Amazon Simple Notification
Service alerts for failed job events, Best practices for handling Amazon EC2 Spot Instance
interruptions, and Manage AWS Batch with Step Functions.
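As a sketch of the catch-and-resubmit pattern above, an EventBridge rule might match Batch job
failures whose status reason points to a host-level cause, such as a Spot reclamation, and route
them to a Lambda target that resubmits to the On-Demand queue. The prefix value mirrors the
"Host EC2*" pattern used in retry strategies and is an assumption to verify against your own
events:

{
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {
        "status": ["FAILED"],
        "statusReason": [{"prefix": "Host EC2"}]
    }
}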

Important
Use different instance types, sizes, and Availability Zones for your On-Demand compute
environment to maintain Amazon EC2 Spot instance pool availability and decrease the
interruption rate.

Use Amazon EC2 Spot best practices for AWS Batch
When you choose Amazon Elastic Compute Cloud (EC2) Spot instances, you can likely optimize
your workflow to save costs, sometimes significantly. For more information, see Best practices for
Amazon EC2 Spot.
To optimize your workflow to save costs, consider the following Amazon EC2 Spot best practices
for AWS Batch:

• Choose the SPOT_CAPACITY_OPTIMIZED allocation strategy – AWS Batch chooses Amazon
EC2 instances from the deepest Amazon EC2 Spot capacity pools. If you’re concerned about
interruptions, this is a suitable choice. For more information, see Instance type allocation
strategies for AWS Batch.
• Diversify instance types – To diversify your instance types, consider compatible sizes and
families, then let AWS Batch choose based on price or availability. For example, consider
c5.24xlarge as an alternative to c5.12xlarge, or the c5a, c5n, c5d, m5, and m5d families as
alternatives to c5. For more information, see Be flexible about instance types and Availability
Zones.
• Reduce job runtime or checkpoint – We advise against running jobs that take an hour or more
when using Amazon EC2 Spot instances, to avoid interruptions. If you divide or checkpoint
your jobs into smaller parts that run for 30 minutes or less, you can significantly reduce the
possibility of interruption.
• Use automated retries – To avoid disruptions to AWS Batch jobs, set automated retries for jobs.
Batch jobs can be disrupted for any of the following reasons: a non-zero exit code is returned, a
service error occurs, or an instance reclamation occurs. You can set up to 10 automated retries.
To start, we recommend that you set at least 1–3 automated retries. For information about
tracking Amazon EC2 Spot interruptions, see Spot Interruption Dashboard.
For AWS Batch, if you set the retry parameter, the job is placed at the front of the job queue.
That is, the job is given priority. When you create the job definition or you submit the job in the
AWS CLI, you can configure a retry strategy. For more information, see submit-job.
$ aws batch submit-job --job-name MyJob \
    --job-queue MyJQ \
    --job-definition MyJD \
    --retry-strategy attempts=2

• Use custom retries – You can configure a job retry strategy for a specific application exit
code or an instance reclamation event. In the following example, if the host causes the failure,
the job can
be retried up to five times. However, if the job fails for a different reason, the job exits and the
status is set to FAILED.
"retryStrategy": {
"attempts": 5,
"evaluateOnExit":
[{
"onStatusReason" :"Host EC2*",
"action": "RETRY"
Use Amazon EC2 Spot best practices for AWS Batch

492

AWS Batch

User Guide

},{
"onReason" : "*",
"action": "EXIT"
}]
}

• Use the Spot Interruption Dashboard – You can use the Spot Interruption Dashboard to track
Spot interruptions. The application provides metrics on Amazon EC2 Spot instances that are
reclaimed and the Availability Zones that those instances are in. For more information, see Spot
Interruption Dashboard.

Common errors and troubleshooting
Errors in AWS Batch often occur at the application level or are caused by instance configurations
that don’t meet your specific job requirements. Other issues include jobs getting stuck in
the RUNNABLE status or compute environments getting stuck in an INVALID state. For more
information about troubleshooting jobs getting stuck in RUNNABLE status, see Jobs stuck in a
RUNNABLE status. For information about troubleshooting compute environments in an INVALID
state, see INVALID compute environment.
• Check Amazon EC2 Spot vCPU quotas – Verify that your current service quotas meet the job
requirements. For example, suppose that your current service quota is 256 vCPUs and the job
requires 10,000 vCPUs. Then, the service quota doesn't meet the job requirement. For more
information and troubleshooting instructions, see Amazon EC2 service quotas and How do I
increase the service quota of my Amazon EC2 resources?
• Jobs fail before the application runs – Some jobs might fail because of a
DockerTimeoutError or a CannotPullContainerError. For troubleshooting information, see
How do I resolve the "DockerTimeoutError" error in AWS Batch?
• Insufficient IP addresses – The number of IP addresses in your VPC and subnets can limit the
number of instances that you can create. Use Classless Inter-Domain Routing (CIDR) blocks that
provide more IP addresses than your workloads require. If necessary, you can also build a
dedicated VPC with a large address space. For example, you can create a VPC with multiple
CIDRs in 10.x.0.0/16 and a subnet in every Availability Zone with a CIDR of 10.x.y.0/17.
In this example, x is between 1 and 4, and y is either 0 or 128. This configuration provides
roughly 32,000 usable IP addresses in every subnet (a /17 block contains 32,768 addresses, minus
a few that AWS reserves).
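When any of these issues occurs, a job's statusReason field is often the fastest clue. You can
read it with describe-jobs (the job ID is a placeholder):

$ aws batch describe-jobs --jobs <job-id>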
