AWS Batch User Guide

What is AWS Batch?

AWS Batch helps you run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources. AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software. The service can efficiently provision resources in response to submitted jobs, eliminating capacity constraints, reducing compute costs, and delivering results quickly.

As a fully managed service, AWS Batch helps you run batch computing workloads of any scale. AWS Batch automatically provisions compute resources and optimizes workload distribution based on the quantity and scale of the workloads. With AWS Batch, there's no need to install or manage batch computing software, so you can focus your time on analyzing results and solving problems.

AWS Batch provides all of the functionality necessary to run high-scale, compute-intensive workloads on top of the AWS managed container orchestration services, Amazon ECS and Amazon EKS. AWS Batch can scale compute capacity across Amazon EC2 instances and Fargate resources. As a fully managed service for batch workloads, it delivers the operational capabilities to optimize these workloads for throughput, speed, resource efficiency, and cost.

AWS Batch also enables SageMaker Training job queuing, allowing data scientists and ML engineers to submit Training jobs with priorities to configurable queues. This capability ensures that ML workloads run automatically as soon as resources become available, eliminating the need for manual coordination and improving resource utilization.

For machine learning workloads, AWS Batch provides queuing capabilities for SageMaker Training jobs. You can configure queues with specific policies to optimize cost, performance, and resource allocation for your ML Training workloads. This provides a shared responsibility model: administrators set up the infrastructure and permissions, while data scientists focus on submitting and monitoring their ML training workloads. Jobs are automatically queued and executed based on configured priorities and resource availability.

Are you a first-time AWS Batch user?

If you are a first-time user of AWS Batch, we recommend that you begin by reading the following sections:

• Components of AWS Batch
• Create IAM account and administrative user
• Setting up AWS Batch
• Getting started with AWS Batch tutorials
• Getting started with AWS Batch on SageMaker AI

Related services

AWS Batch is a fully managed batch computing service that plans, schedules, and runs your containerized batch ML, simulation, and analytics workloads across the full range of AWS compute offerings, such as Amazon ECS, Amazon EKS, AWS Fargate, and Spot or On-Demand Instances. For more information about each managed compute service, see:

• Amazon EC2 User Guide
• AWS Fargate Developer Guide
• Amazon EKS User Guide
• Amazon SageMaker AI Developer Guide

Accessing AWS Batch

You can access AWS Batch using the following:

AWS Batch console
The web interface where you create and manage resources.

AWS Command Line Interface
Interact with AWS services using commands in your command line shell. The AWS Command Line Interface is supported on Windows, macOS, and Linux. For more information about the AWS CLI, see the AWS Command Line Interface User Guide. You can find the AWS Batch commands in the AWS CLI Command Reference.
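As a quick orientation, the following AWS CLI calls list the resources discussed throughout this guide. This is a minimal sketch; it assumes the AWS CLI is installed and configured with credentials and a default Region, and MyJQ is a placeholder queue name:

$ # List compute environments, job queues, and active job definitions.
$ aws batch describe-compute-environments
$ aws batch describe-job-queues
$ aws batch describe-job-definitions --status ACTIVE
$ # List the jobs currently running in a given queue.
$ aws batch list-jobs --job-queue MyJQ --job-status RUNNING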
AWS SDKs

If you prefer to build applications using language-specific APIs instead of submitting a request over HTTP or HTTPS, use the libraries, sample code, tutorials, and other resources provided by AWS. These libraries provide basic functions that automate tasks, such as cryptographically signing your requests, retrying requests, and handling error responses. These functions make it more efficient for you to get started. For more information, see Tools to Build on AWS.

Components of AWS Batch

AWS Batch simplifies running batch jobs across multiple Availability Zones within a Region. You can create AWS Batch compute environments within a new or existing VPC. After a compute environment is up and associated with a job queue, you can define job definitions that specify which Docker container images run your jobs. Container images are stored in and pulled from container registries, which can exist within or outside of your AWS infrastructure.

Compute environment

A compute environment is a set of managed or unmanaged compute resources that are used to run jobs. With managed compute environments, you can specify the desired compute type (Fargate or EC2) at several levels of detail. You can set up compute environments that use a particular type of EC2 instance, such as c5.2xlarge or m5.10xlarge. Or, you can specify only that you want to use the newest instance types. You can also specify the minimum, desired, and maximum number of vCPUs for the environment, along with the amount that you're willing to pay for a Spot Instance as a percentage of the On-Demand Instance price, and a target set of VPC subnets. AWS Batch efficiently launches, manages, and terminates compute resources as needed. You can also manage your own compute environments. In that case, you're responsible for setting up and scaling the instances in an Amazon ECS cluster that AWS Batch creates for you. For more information, see Compute environments for AWS Batch.

Job queues

When you submit an AWS Batch job, you submit it to a particular job queue, where the job resides until it's scheduled onto a compute environment. You associate one or more compute environments with a job queue. You can also assign priority values for these compute environments, and even across job queues themselves. For example, you can have a high-priority queue that you submit time-sensitive jobs to, and a low-priority queue for jobs that can run anytime when compute resources are cheaper. For more information, see Job queues.

Job definitions

A job definition specifies how jobs are to be run. You can think of a job definition as a blueprint for the resources in your job. You can supply your job with an IAM role to provide access to other AWS resources. You also specify both memory and CPU requirements. The job definition can also control container properties, environment variables, and mount points for persistent storage. Many of the specifications in a job definition can be overridden by specifying new values when submitting individual jobs. For more information, see Job definitions.
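To make the blueprint idea concrete, the following is a minimal sketch of registering a container job definition with the AWS CLI. The definition name, image, and resource values are placeholders, not values from this guide:

$ aws batch register-job-definition \
    --job-definition-name my-hello-world \
    --type container \
    --container-properties '{
        "image": "public.ecr.aws/amazonlinux/amazonlinux:latest",
        "command": ["echo", "Hello World"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "2048"}
        ]
    }'

Values such as the vCPU count and memory can later be overridden per job at submission time.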
Jobs

A job is a unit of work (such as a shell script, a Linux executable, or a Docker container image) that you submit to AWS Batch. It has a name, and it runs as a containerized application on AWS Fargate or Amazon EC2 resources in your compute environment, using parameters that you specify in a job definition. Jobs can reference other jobs by name or by ID, and they can be dependent on the successful completion of other jobs or on the availability of resources that you specify. For more information, see Jobs.

Scheduling policy

You can use scheduling policies to configure how compute resources in a job queue are allocated between users or workloads. Using fair-share scheduling policies, you can assign different share identifiers to workloads or users. The AWS Batch job scheduler defaults to a first-in, first-out (FIFO) strategy. For more information, see Fair-share scheduling policies.

Consumable resources

A consumable resource is a resource that is needed to run your jobs, such as a third-party license token, database access bandwidth, or the need to throttle calls to a third-party API. You specify the consumable resources that are needed for a job to run, and AWS Batch takes these resource dependencies into account when it schedules the job. You can reduce the under-utilization of compute resources by allocating only the jobs that have all the required resources available. For more information, see Resource-aware scheduling.

Service environment

A service environment defines how AWS Batch integrates with SageMaker for job execution. Service environments enable AWS Batch to submit and manage jobs on SageMaker while providing the queuing, scheduling, and priority management capabilities of AWS Batch. Service environments define capacity limits for specific service types, such as SageMaker Training jobs. The capacity limits control the maximum resources that can be used by service jobs in the environment. For more information, see Service environments for AWS Batch.

Service job

A service job is a unit of work that you submit to AWS Batch to run on a service environment. Service jobs leverage AWS Batch's queuing and scheduling capabilities while delegating actual execution to the external service. For example, SageMaker Training jobs submitted as service jobs are queued and prioritized by AWS Batch, but the Training job execution occurs within SageMaker AI infrastructure. This integration enables data scientists and ML engineers to benefit from AWS Batch's automated workload management and priority queuing for their SageMaker AI Training workloads. Service jobs can reference other jobs by name or ID and support job dependencies. For more information, see Service jobs in AWS Batch.
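Because jobs can depend on other jobs, a common pattern is to chain submissions by job ID. The following is a hedged AWS CLI sketch; the job, queue, and job definition names are placeholders:

$ # Submit the first job and capture its job ID.
$ PREP_ID=$(aws batch submit-job --job-name prep-data \
    --job-queue MyJQ --job-definition MyJD \
    --query jobId --output text)
$ # Submit a second job that starts only after the first one succeeds.
$ aws batch submit-job --job-name train-model \
    --job-queue MyJQ --job-definition MyJD \
    --depends-on jobId=$PREP_ID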
Setting up AWS Batch

If you've already signed up for Amazon Web Services (AWS) and are using Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS), you can soon use AWS Batch. The setup process for these services is similar because AWS Batch uses Amazon ECS container instances in its compute environments. To use the AWS CLI with AWS Batch, you must use a version of the AWS CLI that supports the latest AWS Batch features. If you don't see support for an AWS Batch feature in the AWS CLI, upgrade to the latest version. For more information, see http://aws.amazon.com/cli/.

Note: Because AWS Batch uses components of Amazon EC2, you use the Amazon EC2 console for many of these steps.

Complete the following tasks to get set up for AWS Batch.

Topics
• Create IAM account and administrative user
• Create IAM roles for your compute environments and container instances
• Create a key pair for your instances
• Create a VPC
• Create a security group
• Install the AWS CLI

Create IAM account and administrative user

To get started, you need to create an AWS account and a single user that is typically granted administrative rights. To accomplish this, complete the following tutorials:

Sign up for an AWS account

If you do not have an AWS account, complete the following steps to create one.

Getting started with AWS Batch tutorials

You can use the AWS Batch first-run wizard to get started quickly with AWS Batch. After you complete the Prerequisites, you can use the first-run wizard to create a compute environment, a job definition, and a job queue. You can also submit a sample "Hello World" job using the AWS Batch first-run wizard to test your configuration. If you already have a Docker image that you want to launch in AWS Batch, you can use that image to create a job definition. Afterward, you can use the AWS Batch first-run wizard to create a compute environment and a job queue, and to submit a sample Hello World job.

Getting started with Amazon EC2 orchestration using the Wizard

Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS Cloud. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster. You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking, and manage storage. Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.

Overview

This tutorial demonstrates how to set up AWS Batch with the Wizard to configure Amazon EC2 and run Hello World.

Intended Audience

This tutorial is designed for system administrators and developers responsible for setting up, testing, and deploying AWS Batch.

Features Used

This tutorial shows you how to use the AWS Batch console wizard to:
• Create and configure an Amazon EC2 compute environment
• Create a job queue
• Create a job definition
• Create and submit a job to run
• View the output of the job in CloudWatch

Time Required

It should take about 10–15 minutes to complete this tutorial.

Regional Restrictions

There are no country or regional restrictions associated with using this solution.

Resource Usage Costs

There's no charge for creating an AWS account. However, by implementing this solution, you might incur some or all of the costs that are listed in the following table.

Description: Amazon EC2 instance
Cost (US dollars): You pay for each Amazon EC2 instance that is created. For more information about pricing, see Amazon EC2 Pricing.

Prerequisites

Before you begin:
• Create an AWS account if you don't have one.
• Create the ecsInstanceRole instance role.

Step 1: Create a compute environment

Important: To get started as simply and quickly as possible, this tutorial uses default settings. Before creating resources for production use, we recommend that you familiarize yourself with all settings and deploy with the settings that meet your requirements.

To create a compute environment for an Amazon EC2 orchestration, do the following:
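If you prefer to script this step rather than use the console wizard, the following is a minimal AWS CLI sketch of an equivalent managed Amazon EC2 compute environment. The environment name, subnet IDs, security group ID, and instance role are placeholder values, not settings from this tutorial:

$ aws batch create-compute-environment \
    --compute-environment-name my-ec2-ce \
    --type MANAGED \
    --state ENABLED \
    --compute-resources '{
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 256,
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "securityGroupIds": ["sg-cccc3333"],
        "instanceRole": "ecsInstanceRole"
    }'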
Best practices for AWS Batch

You can use AWS Batch to run a variety of demanding computational workloads at scale without managing a complex architecture. AWS Batch jobs can be used in a wide range of use cases in areas such as epidemiology, gaming, and machine learning. This topic covers the best practices to consider while using AWS Batch, and guidance on how to run and optimize your workloads.

Topics
• When to use AWS Batch
• Checklist to run at scale
• Optimize containers and AMIs
• Choose the right compute environment resource
• Amazon EC2 On-Demand or Amazon EC2 Spot
• Use Amazon EC2 Spot best practices for AWS Batch
• Common errors and troubleshooting

When to use AWS Batch

AWS Batch runs jobs at scale and at low cost, and provides queuing services and cost-optimized scaling. However, not every workload is suitable to be run using AWS Batch.

• Short jobs – If a job runs for only a few seconds, the overhead to schedule the batch job might take longer than the runtime of the job itself. As a workaround, binpack your tasks together before you submit them to AWS Batch. Then, configure your AWS Batch jobs to iterate over the tasks. For example, stage the individual task arguments in an Amazon DynamoDB table or as a file in an Amazon S3 bucket. Consider grouping tasks so that each job runs for 3-5 minutes. After you binpack the tasks, loop through your task groups within your AWS Batch job, as in the sketch after this list.
• Jobs that must be run immediately – AWS Batch can process jobs quickly. However, AWS Batch is a scheduler and optimizes for cost performance, job priority, and throughput. AWS Batch might require time to process your requests. If you need a response in under a few seconds, a service-based approach using Amazon ECS or Amazon EKS is more suitable.
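The following is a minimal sketch of that binpacking pattern, assuming task arguments have been staged as manifest files in an S3 bucket. The bucket name, manifest layout, and process-task binary are hypothetical; AWS_BATCH_JOB_ARRAY_INDEX is set by AWS Batch for array job children:

$ cat run-task-group.sh
#!/bin/bash
set -euo pipefail
# Each Batch job (or array job child) downloads one manifest of tasks.
MANIFEST="s3://my-task-bucket/manifests/${AWS_BATCH_JOB_ARRAY_INDEX:-0}.txt"
aws s3 cp "$MANIFEST" /tmp/tasks.txt
# Loop over the binpacked tasks so one job amortizes the scheduling overhead.
while read -r task_args; do
    /usr/local/bin/process-task $task_args
done < /tmp/tasks.txt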
Checklist to run at scale

Before you run a large workload on 50 thousand or more vCPUs, consider the following checklist.

Note: If you plan to run a large workload on a million or more vCPUs, or if you need guidance on running at large scale, contact your AWS team.

• Check your Amazon EC2 quotas – Check your Amazon EC2 quotas (also known as limits) in the Service Quotas panel of the AWS Management Console. If necessary, request a quota increase for your peak number of Amazon EC2 instances. Remember that Amazon EC2 Spot and Amazon EC2 On-Demand instances have separate quotas. For more information, see Getting started with Service Quotas.
• Verify your Amazon Elastic Block Store quota for each Region – Each instance uses a gp2 or gp3 volume for the operating system. By default, the quota for each AWS Region is 300 TiB, and each instance's volume counts against this quota. Make sure to factor this in when you verify your Amazon Elastic Block Store quota for each Region. If your quota is reached, you can't create more instances. For more information, see Amazon Elastic Block Store endpoints and quotas.
• Use Amazon S3 for storage – Amazon S3 provides high throughput and helps to eliminate the guesswork on how much storage to provision based on the number of jobs and instances in each Availability Zone. For more information, see Best practices design patterns: optimizing Amazon S3 performance.
• Scale gradually to identify bottlenecks early – For a job that runs on a million or more vCPUs, start lower and gradually increase so that you can identify bottlenecks early. For example, start by running on 50 thousand vCPUs. Then, increase the count to 200 thousand vCPUs, then 500 thousand vCPUs, and so on, until you reach the desired number of vCPUs.
• Monitor to identify potential issues early – To avoid potential failures and issues when running at scale, make sure to monitor both your application and architecture. Failures might occur even when scaling from 1 thousand to 5 thousand vCPUs. You can use Amazon CloudWatch Logs to review log data, or use CloudWatch Embedded Metrics through a client library. For more information, see CloudWatch Logs agent reference and aws-embedded-metrics.

Optimize containers and AMIs

Container size and structure are important for the first set of jobs that you run. This is especially true if the container is larger than 4 GB. Container images are built in layers, and the layers are retrieved in parallel by Docker using three concurrent threads. You can increase the number of concurrent threads using the max-concurrent-downloads parameter. For more information, see the Dockerd documentation. Although you can use larger containers, we recommend that you optimize container structure and size for faster startup times.

• Smaller containers are fetched faster – Smaller containers can lead to faster application start times. To decrease container size, offload libraries or files that are updated infrequently to the Amazon Machine Image (AMI). You can also use bind mounts to give your containers access to them. For more information, see Bind mounts.
• Create layers that are even in size and break up large layers – Each layer is retrieved by one thread, so a large layer might significantly impact your job startup time. We recommend a maximum layer size of 2 GB as a good tradeoff between container size and startup time. You can run the docker history your_image_id command to check your container image structure and layer sizes. For more information, see the Docker documentation.
• Use Amazon Elastic Container Registry as your container repository – When you run thousands of jobs in parallel, a self-managed repository can fail or throttle throughput. Amazon ECR works at scale and can handle workloads with a million or more vCPUs.

Choose the right compute environment resource

AWS Fargate requires less initial setup and configuration than Amazon EC2 and is likely easier to use, particularly if it's your first time. With Fargate, you don't need to manage servers, handle capacity planning, or isolate container workloads for security.

If you have the following requirements, we recommend that you use Fargate:
• Your jobs must start quickly, specifically in less than 30 seconds.
• Your jobs require 16 vCPUs or less, no GPUs, and 120 GiB of memory or less.
For more information, see When to use Fargate.

If you have the following requirements, we recommend that you use Amazon EC2 instances:
• You require increased control over instance selection, or you require specific instance types.
• Your jobs require resources that AWS Fargate can't provide, such as GPUs, more memory, a custom AMI, or the Amazon Elastic Fabric Adapter.
• You require a high level of throughput or concurrency.
• You need to customize your AMI or Amazon EC2 launch template, or you need access to special Linux parameters.
With Amazon EC2, you can more finely tune your workload to your specific requirements and run at scale if needed.
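For comparison with the Amazon EC2 example earlier, the following hedged sketch creates a Fargate compute environment with the AWS CLI. The environment name, subnet, and security group are placeholders; note that Fargate environments take fewer compute-resource settings than EC2 environments:

$ aws batch create-compute-environment \
    --compute-environment-name my-fargate-ce \
    --type MANAGED \
    --compute-resources '{
        "type": "FARGATE",
        "maxvCpus": 64,
        "subnets": ["subnet-aaaa1111"],
        "securityGroupIds": ["sg-cccc3333"]
    }'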
Amazon EC2 On-Demand or Amazon EC2 Spot

Most AWS Batch customers use Amazon EC2 Spot Instances because of the savings over On-Demand Instances. However, if your workload runs for multiple hours and can't be interrupted, On-Demand Instances might be more suitable for you. You can always try Spot Instances first and switch to On-Demand Instances if necessary.

If you have the following requirements and expectations, use Amazon EC2 On-Demand Instances:
• The runtime of your jobs is more than an hour, and you can't tolerate interruptions to your workload.
• You have a strict service-level objective (SLO) for your overall workload and can't increase computational time.
• The instances that you require are more likely to see interruptions.

If you have the following requirements and expectations, use Amazon EC2 Spot Instances:
• The runtime of your jobs is typically 30 minutes or less.
• You can tolerate potential interruptions and job rescheduling as a part of your workload. For more information, see Spot Instance advisor.
• Long-running jobs can be restarted from a checkpoint if interrupted.

You can mix both purchasing models by submitting to Spot Instances first and then using On-Demand Instances as a fallback option. For example, submit your jobs to a queue that's connected to compute environments that run on Amazon EC2 Spot Instances. If a job gets interrupted, catch the event from Amazon EventBridge and correlate it to a Spot Instance reclamation. Then, resubmit the job to an On-Demand queue using an AWS Lambda function or AWS Step Functions, as sketched below. For more information, see Tutorial: Sending Amazon Simple Notification Service alerts for failed job events, Best practices for handling Amazon EC2 Spot Instance interruptions, and Manage AWS Batch with Step Functions.

Important: Use different instance types, sizes, and Availability Zones for your On-Demand compute environment to maintain Amazon EC2 Spot Instance pool availability and decrease the interruption rate.
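The following is a hedged sketch of the EventBridge side of that fallback pattern: a rule that matches Spot interruption warnings and invokes a Lambda function that resubmits affected jobs. The rule name and function ARN are placeholders, the resubmission logic lives in the function, and the function also needs a resource-based permission allowing events.amazonaws.com to invoke it:

$ aws events put-rule \
    --name spot-interruption-to-batch-fallback \
    --event-pattern '{
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"]
    }'
$ aws events put-targets \
    --rule spot-interruption-to-batch-fallback \
    --targets 'Id=resubmit-fn,Arn=arn:aws:lambda:us-east-1:123456789012:function:ResubmitToOnDemand'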
Use Amazon EC2 Spot best practices for AWS Batch

When you choose Amazon EC2 Spot Instances, you can likely optimize your workflow to save costs, sometimes significantly. For more information, see Best practices for Amazon EC2 Spot. To optimize your workflow and save costs, consider the following Amazon EC2 Spot best practices for AWS Batch:

• Choose the SPOT_CAPACITY_OPTIMIZED allocation strategy – With this strategy, AWS Batch chooses Amazon EC2 instances from the deepest Amazon EC2 Spot capacity pools. If you're concerned about interruptions, this is a suitable choice. For more information, see Instance type allocation strategies for AWS Batch.
• Diversify instance types – To diversify your instance types, consider compatible sizes and families, then let AWS Batch choose based on price or availability. For example, consider c5.24xlarge as an alternative to c5.12xlarge, or the c5a, c5n, c5d, m5, and m5d families. For more information, see Be flexible about instance types and Availability Zones.
• Reduce job runtime or checkpoint – We advise against running jobs that take an hour or more on Amazon EC2 Spot Instances, to avoid interruptions. If you divide or checkpoint your jobs into smaller parts that run for 30 minutes or less, you can significantly reduce the possibility of interruptions.
• Use automated retries – To avoid disruptions to AWS Batch jobs, set automated retries for jobs. Batch jobs can be disrupted for any of the following reasons: a non-zero exit code is returned, a service error occurs, or an instance reclamation occurs. You can set up to 10 automated retries. For a start, we recommend that you set at least 1-3 automated retries. For information about tracking Amazon EC2 Spot interruptions, see Spot Interruption Dashboard. For AWS Batch, if you set the retry parameter, the job is placed at the front of the job queue. That is, the job is given priority. You can configure a retry strategy when you create the job definition or when you submit the job in the AWS CLI. For more information, see submit-job.

$ aws batch submit-job --job-name MyJob \
    --job-queue MyJQ \
    --job-definition MyJD \
    --retry-strategy attempts=2

• Use custom retries – You can configure a job retry strategy for a specific application exit code or an instance reclamation. In the following example, if the host causes the failure, the job can be retried up to five times. However, if the job fails for a different reason, the job exits and the status is set to FAILED.

"retryStrategy": {
    "attempts": 5,
    "evaluateOnExit": [{
        "onStatusReason": "Host EC2*",
        "action": "RETRY"
    }, {
        "onReason": "*",
        "action": "EXIT"
    }]
}

• Use the Spot Interruption Dashboard – You can use the Spot Interruption Dashboard to track Spot interruptions. The application provides metrics on reclaimed Amazon EC2 Spot Instances and on which Availability Zones those Spot Instances are in. For more information, see Spot Interruption Dashboard.

Common errors and troubleshooting

Errors in AWS Batch often occur at the application level or are caused by instance configurations that don't meet your specific job requirements. Other issues include jobs getting stuck in the RUNNABLE status or compute environments getting stuck in an INVALID state. For more information about troubleshooting jobs stuck in the RUNNABLE status, see Jobs stuck in a RUNNABLE status. For information about troubleshooting compute environments in an INVALID state, see INVALID compute environment.

• Check Amazon EC2 Spot vCPU quotas – Verify that your current service quotas meet the job requirements. For example, suppose that your current service quota is 256 vCPUs and the job requires 10,000 vCPUs. Then, the service quota doesn't meet the job requirement. For more information and troubleshooting instructions, see Amazon EC2 service quotas and How do I increase the service quota of my Amazon EC2 resources?.
• Jobs fail before the application runs – Some jobs might fail because of a DockerTimeoutError error or a CannotPullContainerError error. For troubleshooting information, see How do I resolve the "DockerTimeoutError" error in AWS Batch?.
• Insufficient IP addresses – The number of IP addresses in your VPC and subnets can limit the number of instances that you can create. Use Classless Inter-Domain Routing (CIDR) blocks that provide more IP addresses than are required to run your workloads. If necessary, you can also build a dedicated VPC with a large address space. For example, you can create a VPC with multiple CIDR blocks in 10.x.0.0/16 and a subnet in every Availability Zone with a CIDR of 10.x.y.0/17.
In this example, x is between 1 and 4, and y is either 0 or 128. This configuration provides roughly 32,000 IP addresses in every subnet.
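To check whether free IP addresses are becoming a bottleneck, you can query per-subnet capacity with the AWS CLI. This is a minimal sketch; the VPC ID is a placeholder:

$ aws ec2 describe-subnets \
    --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
    --query 'Subnets[].{Subnet:SubnetId,AZ:AvailabilityZone,FreeIPs:AvailableIpAddressCount}' \
    --output table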