AWS Batch

User Guide

What is AWS Batch?
AWS Batch helps you to run batch computing workloads on the AWS Cloud. Batch computing
is a common way for developers, scientists, and engineers to access large amounts of compute
resources. AWS Batch removes the undifferentiated heavy lifting of configuring and managing the
required infrastructure, similar to traditional batch computing software. The service efficiently
provisions resources in response to submitted jobs, eliminating capacity constraints, reducing
compute costs, and delivering results quickly.
As a fully managed service, AWS Batch helps you to run batch computing workloads of any scale.
AWS Batch automatically provisions compute resources and optimizes the workload distribution
based on the quantity and scale of the workloads. With AWS Batch, there's no need to install or
manage batch computing software, so you can focus your time on analyzing results and solving
problems.


AWS Batch provides all of the necessary functionality to run high-scale, compute-intensive
workloads on top of AWS managed container orchestration services, Amazon ECS and Amazon EKS.
AWS Batch is able to scale compute capacity on Amazon EC2 instances and Fargate resources.
AWS Batch provides a fully managed service for batch workloads, and delivers the operational
capabilities to optimize these types of workloads for throughput, speed, resource efficiency, and
cost.
AWS Batch also enables SageMaker Training job queuing, allowing data scientists and ML engineers
to submit Training jobs with priorities to configurable queues. Queued ML workloads run
automatically as soon as resources become available, which eliminates manual coordination and
improves resource utilization. You can configure queues with specific policies to optimize cost,
performance, and resource allocation for your ML Training workloads.

This provides a shared responsibility model where administrators set up the infrastructure and
permissions, while data scientists can focus on submitting and monitoring their ML training
workloads. Jobs are automatically queued and executed based on configured priorities and
resource availability.


Are you a first-time AWS Batch user?
If you are a first-time user of AWS Batch, we recommend that you begin by reading the following
sections:
• Components of AWS Batch
• Create IAM account and administrative user
• Setting up AWS Batch
• Getting started with AWS Batch tutorials
• Getting started with AWS Batch on SageMaker AI

Related services
AWS Batch is a fully managed batch computing service that plans, schedules, and runs your
containerized batch ML, simulation, and analytics workloads across the full range of AWS compute
offerings, such as Amazon ECS, Amazon EKS, AWS Fargate, and Spot or On-Demand Instances. For
more information about each managed compute service, see:
• Amazon EC2 User Guide
• AWS Fargate Developer Guide
• Amazon EKS User Guide
• Amazon SageMaker AI Developer Guide

Accessing AWS Batch
You can access AWS Batch using the following:
AWS Batch console
The web interface where you create and manage resources.
AWS Command Line Interface
Interact with AWS services using commands in your command line shell. The AWS Command
Line Interface is supported on Windows, macOS, and Linux. For more information about the
AWS CLI, see AWS Command Line Interface User Guide. You can find the AWS Batch commands
in the AWS CLI Command Reference.
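For example, once your credentials and default Region are configured, the following commands list
your job queues and the running jobs in one of them (the queue name MyJQ is a placeholder):

$ aws batch describe-job-queues
$ aws batch list-jobs --job-queue MyJQ --job-status RUNNING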

AWS SDKs
If you prefer to build applications using language-specific APIs instead of submitting a request
over HTTP or HTTPS, use the libraries, sample code, tutorials, and other resources provided by
AWS. These libraries provide basic functions that automate tasks, such as cryptographically
signing your requests, retrying requests, and handling error responses. These functions make it
more efficient for you to get started. For more information, see Tools to Build on AWS.

Components of AWS Batch
AWS Batch simplifies running batch jobs across multiple Availability Zones within a Region. You
can create AWS Batch compute environments within a new or existing VPC. After a compute
environment is up and associated with a job queue, you can define job definitions that specify
which Docker container images to use to run your jobs. Container images are stored in and pulled
from container registries, which may exist within or outside of your AWS infrastructure.

Compute environment
A compute environment is a set of managed or unmanaged compute resources that are used to
run jobs. With managed compute environments, you can specify the desired compute type (Fargate
or EC2) at several levels of detail. You can set up compute environments that use a particular type
of EC2 instance or a particular model, such as c5.2xlarge or m4.10xlarge. Or, you can choose
only to specify that you want to use the newest instance types. You can also specify the minimum,
desired, and maximum number of vCPUs for the environment, along with the amount that you're
willing to pay for a Spot Instance as a percentage of the On-Demand Instance price, and a target
set of VPC subnets. AWS Batch efficiently launches, manages, and terminates compute resources as
needed. You can also manage your own compute environments. In that case, you're responsible for
setting up and scaling the instances in an Amazon ECS cluster that AWS Batch creates for you. For
more information, see Compute environments for AWS Batch.
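As a minimal CLI sketch of a managed EC2 compute environment (the name, subnet ID, security
group ID, and instance profile are placeholders, and production environments typically need more
settings):

$ aws batch create-compute-environment \
    --compute-environment-name MyCE \
    --type MANAGED \
    --compute-resources '{
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 256,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-aaaa1111"],
        "securityGroupIds": ["sg-bbbb2222"],
        "instanceRole": "ecsInstanceRole"
    }'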

Job queues
When you submit an AWS Batch job, you submit it to a particular job queue, where the
job resides until it's scheduled onto a compute environment. You associate one or more
compute environments with a job queue. You can also assign priority values for these compute
environments and even across job queues themselves. For example, you can have a high priority
queue that you submit time-sensitive jobs to, and a low priority queue for jobs that can run
anytime when compute resources are cheaper. For more information, see Job queues.
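For example, the following sketch creates a high priority queue attached to the compute
environment created above (names are placeholders; queues with larger priority values are
scheduled first):

$ aws batch create-job-queue \
    --job-queue-name HighPriorityJQ \
    --priority 10 \
    --compute-environment-order order=1,computeEnvironment=MyCE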

Job definitions
A job definition specifies how jobs are to be run. You can think of a job definition as a blueprint for
the resources in your job. You can supply your job with an IAM role to provide access to other AWS
resources. You also specify both memory and CPU requirements. The job definition can also control
container properties, environment variables, and mount points for persistent storage. Many of
the specifications in a job definition can be overridden by specifying new values when you submit
individual jobs. For more information, see Job definitions.
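A minimal sketch of registering a container job definition from the AWS CLI (the image, name, and
resource values are illustrative):

$ aws batch register-job-definition \
    --job-definition-name MyJD \
    --type container \
    --container-properties '{
        "image": "public.ecr.aws/amazonlinux/amazonlinux:latest",
        "command": ["echo", "hello world"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "2048"}
        ]
    }'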

Jobs
A unit of work (such as a shell script, a Linux executable, or a Docker container image) that you
submit to AWS Batch. It has a name, and runs as a containerized application on AWS Fargate or
Amazon EC2 resources in your compute environment, using parameters that you specify in a job
definition. Jobs can reference other jobs by name or by ID, and can be dependent on the successful
completion of other jobs or the availability of resources you specify. For more information, see
Jobs.
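As a sketch of a job dependency, the following submits two jobs, where the second starts only
after the first succeeds (names are placeholders, and the job ID comes from the first command's
output):

$ aws batch submit-job --job-name prepare-data \
    --job-queue MyJQ --job-definition MyJD
$ aws batch submit-job --job-name process-data \
    --job-queue MyJQ --job-definition MyJD \
    --depends-on jobId=<jobId-from-first-command>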

Scheduling policy
You can use scheduling policies to configure how compute resources in a job queue are allocated
between users or workloads. Using fair-share scheduling policies, you can assign different share
identifiers to workloads or users. The AWS Batch job scheduler defaults to a first-in, first-out (FIFO)
strategy. For more information, see Fair-share scheduling policies.
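As a sketch, the following creates a fair-share policy with two share identifiers (names and
weights are illustrative); you attach the policy to a queue with the --scheduling-policy-arn
option of create-job-queue:

$ aws batch create-scheduling-policy \
    --name MySharePolicy \
    --fairshare-policy '{
        "shareDecaySeconds": 3600,
        "computeReservation": 10,
        "shareDistribution": [
            {"shareIdentifier": "teamA", "weightFactor": 1.0},
            {"shareIdentifier": "teamB", "weightFactor": 2.0}
        ]
    }'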

Consumable resources
A consumable resource is a resource that is needed to run your jobs, such as a third-party license
token, database access bandwidth, or quota for calls to a rate-limited third-party API. You
specify the consumable resources that a job needs to run, and AWS Batch takes these resource
dependencies into account when it schedules the job. You can reduce the under-utilization of
compute resources by running only the jobs that have all of their required resources available.
For more information, see Resource-aware scheduling.
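A hedged sketch of creating a consumable resource from the AWS CLI; the name and quantity are
placeholders, and the exact parameter names are worth verifying against the current AWS CLI
Command Reference:

$ aws batch create-consumable-resource \
    --consumable-resource-name MyLicenseTokens \
    --total-quantity 50 \
    --resource-type REPLENISHABLE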

Service environment
A service environment defines how AWS Batch integrates with SageMaker AI for job execution.
Service environments enable AWS Batch to submit and manage jobs on SageMaker AI while providing
the queuing, scheduling, and priority management capabilities of AWS Batch. Service environments
define capacity limits for specific service types, such as SageMaker Training jobs. The capacity
limits control the maximum resources that service jobs in the environment can use. For more
information, see Service environments for AWS Batch.
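A hedged sketch of creating a service environment for SageMaker Training jobs; the capacity shape
shown here is an assumption to verify against the AWS CLI Command Reference:

$ aws batch create-service-environment \
    --service-environment-name MySageMakerSE \
    --service-environment-type SAGEMAKER_TRAINING \
    --capacity-limits maxCapacity=20,capacityUnit=NUM_INSTANCES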

Service job
A service job is a unit of work that you submit to AWS Batch to run on a service environment.
Service jobs leverage AWS Batch's queuing and scheduling capabilities while delegating actual
execution to the external service. For example, SageMaker Training jobs submitted as service
jobs are queued and prioritized by AWS Batch, but the SageMaker Training job execution occurs
within SageMaker AI infrastructure. This integration enables data scientists and ML engineers
to benefit from AWS Batch's automated workload management and priority queuing for their
SageMaker AI Training workloads. Service jobs can reference other jobs by name or ID and support
job dependencies. For more information, see Service jobs in AWS Batch.
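A hedged sketch of submitting a SageMaker Training job as a service job; treat the option names
as assumptions to verify against the AWS CLI Command Reference, and training-request.json would
hold the SageMaker CreateTrainingJob request:

$ aws batch submit-service-job \
    --job-name my-training-job \
    --job-queue MyServiceJQ \
    --service-job-type SAGEMAKER_TRAINING \
    --service-request-payload file://training-request.json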


Setting up AWS Batch
If you've already signed up for Amazon Web Services (AWS) and are using Amazon Elastic Compute
Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS), you're almost ready to use
AWS Batch. The setup process for these services is similar because AWS Batch uses Amazon ECS
container instances in its compute environments. To use the AWS CLI with AWS Batch, you must
use a version of the AWS CLI that supports the latest AWS Batch features. If you don't see support
for an AWS Batch feature in the AWS CLI, upgrade to the latest version. For more information, see
http://aws.amazon.com/cli/.
Note
Because AWS Batch uses components of Amazon EC2, you use the Amazon EC2 console for
many of these steps.

Complete the following tasks to get set up for AWS Batch.
Topics
• Create IAM account and administrative user
• Create IAM roles for your compute environments and container instances
• Create a key pair for your instances
• Create a VPC
• Create a security group
• Install the AWS CLI

Create IAM account and administrative user
To get started, you need to create an AWS account and a single user that is typically granted
administrative rights. To accomplish this, complete the following tutorials:

Sign up for an AWS account
If you do not have an AWS account, complete the following steps to create one.

Getting started with AWS Batch tutorials
You can use the AWS Batch first-run wizard to get started quickly with AWS Batch. After you
complete the Prerequisites, the first-run wizard walks you through creating a compute environment,
a job definition, and a job queue.
You can also submit a sample "Hello World" job from the wizard to test your configuration. If you
already have a Docker image that you want to launch in AWS Batch, you can use that image to create
the job definition instead.

Getting started with Amazon EC2 orchestration using the
Wizard
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS
Cloud. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop
and deploy applications faster.
You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure
security and networking, and manage storage. Amazon EC2 enables you to scale up or down to
handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.

Overview
This tutorial demonstrates how to set up AWS Batch with the Wizard to configure Amazon EC2 and
run Hello World.
Intended Audience
This tutorial is designed for system administrators and developers responsible for setting up,
testing, and deploying AWS Batch.
Features Used
This tutorial shows you how to use the AWS Batch console wizard to:
• Create and configure an Amazon EC2 compute environment
• Create a job queue
• Create a job definition
• Create and submit a job to run
• View the output of the job in CloudWatch
Time Required
It should take about 10–15 minutes to complete this tutorial.
Regional Restrictions
There are no country or regional restrictions associated with using this solution.
Resource Usage Costs
There's no charge for creating an AWS account. However, by implementing this solution, you
might incur some or all of the costs that are listed in the following table.
Description          Cost (US dollars)
Amazon EC2 instance  You pay for each Amazon EC2 instance that is created. For more information
                     about pricing, see Amazon EC2 Pricing.

Prerequisites
Before you begin:
• Create an AWS account if you don't have one.
• Create the ecsInstanceRole instance role.

Step 1: Create a compute environment
Important
To get started as simply and quickly as possible, this tutorial includes steps with default
settings. Before you create resources for production use, we recommend that you familiarize
yourself with all of the settings and deploy with the settings that meet your requirements.

To create a compute environment for an Amazon EC2 orchestration, do the following:

Best practices for AWS Batch
You can use AWS Batch to run a variety of demanding computational workloads at scale without
managing a complex architecture. AWS Batch jobs can be used in a wide range of use cases in areas
such as epidemiology, gaming, and machine learning.
This topic covers best practices to consider while using AWS Batch, along with guidance on how to
run and optimize your workloads.
Topics
• When to use AWS Batch
• Checklist to run at scale
• Optimize containers and AMIs
• Choose the right compute environment resource
• Amazon EC2 On-Demand or Amazon EC2 Spot
• Use Amazon EC2 Spot best practices for AWS Batch
• Common errors and troubleshooting

When to use AWS Batch
AWS Batch runs jobs at scale and at low cost, and provides queuing services and cost-optimized
scaling. However, not every workload is suitable to be run using AWS Batch.
• Short jobs – If a job runs for only a few seconds, the overhead of scheduling the batch job
might exceed the runtime of the job itself. As a workaround, binpack your tasks together before
you submit them to AWS Batch. Then, configure your AWS Batch jobs to iterate over the tasks. For
example, stage the individual task arguments into an Amazon DynamoDB table or as a file in an
Amazon S3 bucket. Consider grouping tasks so that each job runs for 3–5 minutes. After you
binpack the jobs, loop through your task groups within your AWS Batch job (see the sketch after
this list).
• Jobs that must be run immediately – AWS Batch can process jobs quickly. However, AWS Batch
is a scheduler and optimizes for cost performance, job priority, and throughput. AWS Batch
might require time to process your requests. If you need a response in under a few seconds, then
a service-based approach using Amazon ECS or Amazon EKS is more suitable.
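As a minimal sketch of the binpacking workaround in the first item above (the bucket name, key
layout, and run-task script are placeholders), a container entrypoint might fetch one group of
staged task arguments and loop over it:

#!/bin/bash
set -euo pipefail

# Each job (or array child) fetches its own task group, staged earlier
# as one file of task arguments per group in Amazon S3.
aws s3 cp "s3://my-bucket/task-groups/${AWS_BATCH_JOB_ARRAY_INDEX:-0}.txt" /tmp/tasks.txt

# Run the tasks in sequence so that one Batch job amortizes scheduling
# overhead across many short tasks.
while read -r task_args; do
    /usr/local/bin/run-task $task_args
done < /tmp/tasks.txt

AWS_BATCH_JOB_ARRAY_INDEX is set automatically when the job is submitted as an array job; the
fallback of 0 covers single-job testing.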

Checklist to run at scale
Before you run a large workload on 50 thousand or more vCPUs, consider the following checklist.

Note
If you plan to run a large workload on a million or more vCPUs or need guidance running at
large scale, contact your AWS team.

• Check your Amazon EC2 quotas – Check your Amazon EC2 quotas (also known as limits) in the
Service Quotas panel of the AWS Management Console. If necessary, request a quota increase for
your peak number of Amazon EC2 instances. Remember that Amazon EC2 Spot and Amazon EC2 On-Demand
instances have separate quotas (a CLI example follows this list). For more information, see
Getting started with Service Quotas.
• Verify your Amazon Elastic Block Store quota for each Region – Each instance uses a gp2 or
gp3 volume for the operating system. By default, the quota for each AWS Region is 300 TiB.
Each instance's volume counts toward this quota, so make sure to factor this in when you verify
your Amazon Elastic Block Store quota for each Region. If your quota is reached, you can't create
more instances. For more information, see Amazon Elastic Block Store endpoints and quotas.
• Use Amazon S3 for storage – Amazon S3 provides high throughput and helps to eliminate the
guesswork on how much storage to provision based on the number of jobs and instances in each
Availability Zone. For more information, see Best practices design patterns: optimizing Amazon
S3 performance.
• Scale gradually to identify bottlenecks early – For a job that runs on a million or more vCPUs,
start lower and gradually increase so that you can identify bottlenecks early. For example, start
by running on 50 thousand vCPUs. Then, increase the count to 200 thousand vCPUs, and then
500 thousand vCPUs, and so on. In other words, continue to gradually increase the vCPU count
until you reach the desired number of vCPUs.
• Monitor to identify potential issues early – To avoid potential breaks and issues when running
at scale, make sure to monitor both your application and architecture. Breaks might occur
even when scaling from 1 thousand to 5 thousand vCPUs. You can use Amazon CloudWatch
Logs to review log data or publish CloudWatch embedded metrics using a client library. For more
information, see CloudWatch Logs agent reference and aws-embedded-metrics.
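For the first item in this checklist, you can also read a quota programmatically. A minimal
sketch, assuming the quota codes below (running On-Demand Standard instances and All Standard
Spot Instance Requests) are still current; verify them in the Service Quotas console:

$ aws service-quotas get-service-quota \
    --service-code ec2 --quota-code L-1216C47A
$ aws service-quotas get-service-quota \
    --service-code ec2 --quota-code L-34B43A08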

Optimize containers and AMIs
Container size and structure are important for the first set of jobs that you run. This is especially
true if the container is larger than 4 GB. Container images are built in layers. The layers are
retrieved in parallel by Docker using three concurrent threads. You can increase the number of
concurrent threads using the max-concurrent-downloads parameter. For more information, see
the dockerd documentation.
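For example, on instances that run dockerd you might raise the download thread count in
/etc/docker/daemon.json, typically through a custom AMI or launch template user data (the value
8 is illustrative):

{
    "max-concurrent-downloads": 8
}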
Although you can use larger containers, we recommend that you optimize container structure and
size for faster startup times.
• Smaller containers are fetched faster – Smaller containers can lead to faster application start
times. To decrease container size, offload libraries or files that are updated infrequently to the
Amazon Machine Image (AMI). You can also use bind mounts to give your containers access to those
files. For more information, see Bind mounts.
• Create layers that are even in size and break up large layers – Each layer is retrieved by one
thread. So, a large layer might significantly impact your job startup time. We recommend a
maximum layer size of 2 GB as a good tradeoff between larger container size and faster startup
times. You can run the docker history your_image_id command to check your container
image structure and layer size. For more information, see the Docker documentation.
• Use Amazon Elastic Container Registry as your container repository – When you run thousands
of jobs in parallel, a self-managed repository can fail or throttle throughput. Amazon ECR works
at scale and can handle workloads that use a million or more vCPUs.


Choose the right compute environment resource
AWS Fargate requires less initial setup and configuration than Amazon EC2 and is likely easier
to use, particularly if it's your first time. With Fargate, you don't need to manage servers, handle
capacity planning, or isolate container workloads for security.
If you have the following requirements, we recommend that you use Fargate:
• Your jobs must start quickly, specifically in less than 30 seconds.
• Your jobs require 16 vCPUs or fewer, no GPUs, and 120 GiB of memory or less.
For more information, see When to use Fargate.
If you have the following requirements, we recommend that you use Amazon EC2 instances:
• You require increased control over instance selection or require specific instance types.
• Your jobs require resources that AWS Fargate can’t provide, such as GPUs, more memory, a
custom AMI, or the Amazon Elastic Fabric Adapter.
• You require a high level of throughput or concurrency.
• You need to customize your AMI or Amazon EC2 launch template, or you need access to special
Linux parameters.
With Amazon EC2, you can more finely tune your workload to your specific requirements and run at
scale if needed.
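As a sketch, a Fargate compute environment needs noticeably less configuration than the EC2
example earlier in this guide (the subnet and security group IDs are placeholders):

$ aws batch create-compute-environment \
    --compute-environment-name MyFargateCE \
    --type MANAGED \
    --compute-resources '{
        "type": "FARGATE",
        "maxvCpus": 64,
        "subnets": ["subnet-aaaa1111"],
        "securityGroupIds": ["sg-bbbb2222"]
    }'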

Amazon EC2 On-Demand or Amazon EC2 Spot
Most AWS Batch customers use Amazon EC2 Spot instances because of the savings over On-Demand
instances. However, if your workload runs for multiple hours and can't be interrupted,
On-Demand instances might be more suitable for you. You can always try Spot instances first and
switch to On-Demand if necessary.
If you have the following requirements and expectations, use Amazon EC2 On-Demand instances:
• The runtime of your jobs is more than an hour, and you can't tolerate interruptions to your
workload.
• You have a strict SLO (service-level objective) for your overall workload and can’t increase
computational time.
• The instances that you require are more likely to see interruptions.
If you have the following requirements and expectations, use Amazon EC2 Spot instances:
• The runtime for your jobs is typically 30 minutes or less.
• You can tolerate potential interruptions and job rescheduling as a part of your workload. For
more information, see Spot Instance advisor.
• Long-running jobs can be restarted from a checkpoint if they're interrupted.
You can mix both purchasing models by submitting to Spot instances first and then using
On-Demand instances as a fallback option. For example, submit your jobs to a queue that's
connected to compute environments that run on Amazon EC2 Spot instances. If a job
gets interrupted, catch the event from Amazon EventBridge and correlate it to a Spot instance
reclamation. Then, resubmit the job to an On-Demand queue using an AWS Lambda function or
AWS Step Functions. For more information, see Tutorial: Sending Amazon Simple Notification
Service alerts for failed job events, Best practices for handling Amazon EC2 Spot Instance
interruptions, and Manage AWS Batch with Step Functions.
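As a sketch of the catch-and-resubmit pattern above, an EventBridge rule might match Batch job
failures whose status reason points to a host-level cause, such as a Spot reclamation, and route
them to a Lambda target that resubmits to the On-Demand queue. The prefix value mirrors the
"Host EC2*" pattern used in retry strategies and is an assumption to verify against your own
events:

{
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {
        "status": ["FAILED"],
        "statusReason": [{"prefix": "Host EC2"}]
    }
}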

Important
Use different instance types, sizes, and Availability Zones for your On-Demand compute
environment to maintain Amazon EC2 Spot instance pool availability and decrease the
interruption rate.

Use Amazon EC2 Spot best practices for AWS Batch
When you choose Amazon Elastic Compute Cloud (EC2) Spot instances, you can likely optimize
your workflow to save costs, sometimes significantly. For more information, see Best practices for
Amazon EC2 Spot.
To optimize your workflow to save costs, consider the following Amazon EC2 Spot best practices
for AWS Batch:

• Choose the SPOT_CAPACITY_OPTIMIZED allocation strategy – AWS Batch chooses Amazon
EC2 instances from the deepest Amazon EC2 Spot capacity pools. If you’re concerned about
interruptions, this is a suitable choice. For more information, see Instance type allocation
strategies for AWS Batch.
• Diversify instance types – To diversify your instance types, consider compatible sizes and
families, then let AWS Batch choose based on price or availability. For example, consider
c5.24xlarge as an alternative to c5.12xlarge, or the c5a, c5n, c5d, m5, and m5d families as
alternatives to c5. For more information, see Be flexible about instance types and Availability
Zones.
• Reduce job runtime or checkpoint – We advise against running jobs that take an hour or more
when using Amazon EC2 Spot instances, to avoid interruptions. If you divide or checkpoint
your jobs into smaller parts that run for 30 minutes or less, you can significantly reduce the
possibility of interruption.
• Use automated retries – To avoid disruptions to AWS Batch jobs, set automated retries for jobs.
Batch jobs can be disrupted for any of the following reasons: a non-zero exit code is returned, a
service error occurs, or an instance reclamation occurs. You can set up to 10 automated retries.
To start, we recommend that you set at least 1–3 automated retries. For information about
tracking Amazon EC2 Spot interruptions, see Spot Interruption Dashboard.
For AWS Batch, if you set the retry parameter, the job is placed at the front of the job queue.
That is, the job is given priority. When you create the job definition or you submit the job in the
AWS CLI, you can configure a retry strategy. For more information, see submit-job.
$ aws batch submit-job --job-name MyJob \
    --job-queue MyJQ \
    --job-definition MyJD \
    --retry-strategy attempts=2

• Use custom retries – You can configure a job retry strategy for a specific application exit
code or an instance reclamation event. In the following example, if the host causes the failure,
the job can
be retried up to five times. However, if the job fails for a different reason, the job exits and the
status is set to FAILED.
"retryStrategy": {
"attempts": 5,
"evaluateOnExit":
[{
"onStatusReason" :"Host EC2*",
"action": "RETRY"
Use Amazon EC2 Spot best practices for AWS Batch

492

AWS Batch

User Guide

},{
"onReason" : "*",
"action": "EXIT"
}]
}

• Use the Spot Interruption Dashboard – You can use the Spot Interruption Dashboard to track
Spot interruptions. The application provides metrics on Amazon EC2 Spot instances that are
reclaimed and the Availability Zones that those instances are in. For more information, see Spot
Interruption Dashboard.

Common errors and troubleshooting
Errors in AWS Batch often occur at the application level or are caused by instance configurations
that don’t meet your specific job requirements. Other issues include jobs getting stuck in
the RUNNABLE status or compute environments getting stuck in an INVALID state. For more
information about troubleshooting jobs getting stuck in RUNNABLE status, see Jobs stuck in a
RUNNABLE status. For information about troubleshooting compute environments in an INVALID
state, see INVALID compute environment.
• Check Amazon EC2 Spot vCPU quotas – Verify that your current service quotas meet the job
requirements. For example, suppose that your current service quota is 256 vCPUs and the job
requires 10,000 vCPUs. Then, the service quota doesn't meet the job requirement. For more
information and troubleshooting instructions, see Amazon EC2 service quotas and How do I
increase the service quota of my Amazon EC2 resources?
• Jobs fail before the application runs – Some jobs might fail because of a
DockerTimeoutError or a CannotPullContainerError. For troubleshooting information, see
How do I resolve the "DockerTimeoutError" error in AWS Batch?
• Insufficient IP addresses – The number of IP addresses in your VPC and subnets can limit the
number of instances that you can create. Use Classless Inter-Domain Routing (CIDR) blocks that
provide more IP addresses than your workloads require. If necessary, you can also build a
dedicated VPC with a large address space. For example, you can create a VPC with multiple
CIDRs in 10.x.0.0/16 and a subnet in every Availability Zone with a CIDR of 10.x.y.0/17.
In this example, x is between 1 and 4, and y is either 0 or 128. This configuration provides
roughly 32,000 usable IP addresses in every subnet (a /17 block contains 32,768 addresses, minus
a few that AWS reserves).
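When any of these issues occurs, a job's statusReason field is often the fastest clue. You can
read it with describe-jobs (the job ID is a placeholder):

$ aws batch describe-jobs --jobs <job-id>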
