This guide details the configuration options available for Runpod Serverless endpoints. These settings control how your endpoint scales, how it utilizes hardware, and how it manages request lifecycles.
Some settings can only be updated after deploying your endpoint. For instructions on modifying an existing endpoint, see Edit an endpoint.

General configuration

Endpoint name

The name assigned to your endpoint helps you identify it within the Runpod console. This is a local display name and does not impact the endpoint ID used for API requests.

Endpoint type

Select the architecture that best fits your application’s traffic pattern:

Queue-based endpoints use a built-in queueing system to manage requests. They are ideal for asynchronous tasks, batch processing, and long-running jobs where immediate synchronous responses are not required. These endpoints provide guaranteed execution and automatic retries for failed requests, and they are implemented using handler functions (see the sketch below).

Load balancing endpoints route traffic directly to available workers, bypassing the internal queue. They are designed for high-throughput, low-latency applications that require synchronous request/response cycles, such as real-time inference or custom REST APIs. For implementation details, see Load balancing endpoints.
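For reference, a queue-based handler is a Python function registered with the runpod SDK. The following is a minimal sketch; the prompt field and the handler body are illustrative assumptions, not part of any particular endpoint.

```python
import runpod

def handler(job):
    """Process a single job pulled from the endpoint's queue."""
    job_input = job["input"]               # payload from the request's "input" field
    prompt = job_input.get("prompt", "")   # "prompt" is an illustrative field name
    # ... run your model or business logic here ...
    return {"output": f"received a prompt of length {len(prompt)}"}

# Start the worker loop; the SDK invokes `handler` for each queued job.
runpod.serverless.start({"handler": handler})
```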

GPU configuration

This setting determines the hardware tier your workers will utilize. You can select multiple GPU categories to create a prioritized list. Runpod attempts to allocate the first category in your list. If that hardware is unavailable, it automatically falls back to the subsequent options. Selecting multiple GPU types significantly improves endpoint availability during periods of high demand.
| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
| --- | --- | --- | --- | --- |
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |

Worker scaling

Active workers

This setting defines the minimum number of workers that remain warm and ready to process requests at all times. Setting this to 1 or higher eliminates cold starts for the initial wave of requests. Active workers incur charges even when idle, but they receive a 20-30% discount compared to on-demand workers.

Max workers

This setting controls the maximum number of concurrent instances your endpoint can scale to. This acts as a safety limit for costs and a cap on concurrency. We recommend setting your max worker count approximately 20% higher than your expected maximum concurrency. This buffer allows for smoother scaling during traffic spikes.

GPUs per worker

This defines how many GPUs are assigned to a single worker instance. The default is 1. When choosing between many lower-tier GPUs and fewer high-end GPUs, you should generally prefer fewer high-end GPUs per worker when possible.

Auto-scaling type

This setting determines the logic used to scale workers up and down.

Queue delay scaling adds workers based on wait times. If requests sit in the queue for longer than a defined threshold (default 4 seconds), the system provisions new workers. This is best for workloads where slight delays are acceptable in exchange for higher utilization.

Request count scaling is more aggressive. It adjusts worker numbers based on the total volume of pending and active work, using the formula Math.ceil((requestsInQueue + requestsInProgress) / scalerValue). Use a scaler value of 1 for maximum responsiveness, or increase it to scale more conservatively. This strategy is recommended for LLM workloads or applications with frequent, short requests.
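As a back-of-the-envelope illustration of request count scaling (not Runpod’s internal implementation), the target worker count can be computed like this:

```python
import math

def desired_workers(requests_in_queue: int, requests_in_progress: int,
                    scaler_value: int, max_workers: int) -> int:
    """Math.ceil((requestsInQueue + requestsInProgress) / scalerValue), capped at max workers."""
    target = math.ceil((requests_in_queue + requests_in_progress) / scaler_value)
    return min(target, max_workers)

# Example: 12 queued and 4 in-progress requests with a scaler value of 4 targets 4 workers.
print(desired_workers(12, 4, scaler_value=4, max_workers=10))  # 4
```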

Lifecycle and timeouts

Idle timeout

The idle timeout determines how long a worker remains active after completing a request before shutting down. While a worker is idle, you are billed for the time, but the worker remains “warm,” allowing it to process subsequent requests immediately. The default is 5 seconds.

Execution timeout

The execution timeout specifies the maximum duration a single job is allowed to run while actively being processed by a worker. When exceeded, the job is marked as failed and the worker is stopped. We strongly recommend keeping this enabled to prevent runaway jobs from consuming resources indefinitely. The default is 600 seconds (10 minutes). The minimum is 5 seconds and the maximum is 7 days. You can configure the execution timeout in the Advanced section of your endpoint settings. You can also override this setting on a per-request basis using the executionTimeout field in the job policy.
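For example, a per-request override might look like the sketch below. It assumes the standard /run route, placeholder endpoint ID and API key, and that executionTimeout is expressed in milliseconds; confirm the exact payload shape against the API reference.

```python
import requests

url = "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/run"   # placeholder endpoint ID
headers = {"Authorization": "Bearer YOUR_API_KEY"}       # placeholder API key

payload = {
    "input": {"prompt": "example"},            # illustrative input
    "policy": {"executionTimeout": 120_000},   # cap this job at 2 minutes (assumed ms)
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())  # includes the job ID for later /status lookups
```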

Job TTL (time-to-live)

The TTL controls how long a job’s data is retained in the system, covering both queue time and execution time. The default is 24 hours. The minimum is 10 seconds and maximum is 7 days. The TTL timer starts when the job is submitted, not when execution begins. This means if a job sits in the queue waiting for an available worker, that time counts against the TTL. For example, if you set a TTL of 1 hour and the job waits in queue for 45 minutes, only 15 minutes remain for actual execution.
If the TTL is too short and your endpoint experiences high queue times, jobs may expire before a worker ever picks them up—regardless of the execution timeout setting. For long-running jobs or endpoints with variable queue times, set TTL to comfortably cover both expected queue time and execution time.
You can override this on a per-request basis using the ttl field in the job policy.
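The ttl field lives in the same job policy object as the executionTimeout override shown above; the value is again assumed to be in milliseconds.

```python
payload = {
    "input": {"prompt": "example"},
    "policy": {
        "executionTimeout": 600_000,   # 10 minutes of active execution (assumed ms)
        "ttl": 7_200_000,              # 2 hours covering queue time plus execution (assumed ms)
    },
}
```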

Result retention

After a job completes, the system retains the results for a limited time. This retention period is separate from the Job TTL and cannot be extended:
| Request type | Result retention | Notes |
| --- | --- | --- |
| Asynchronous (/run) | 30 minutes | Retrieve results via /status/{job_id} |
| Synchronous (/runsync) | 1 minute | Results returned in the response; also available via /status/{job_id} |
Once the retention period expires, the job data is permanently deleted.
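For asynchronous requests, results can be fetched within the 30-minute window by polling the status route. This sketch reuses the placeholder endpoint ID and API key from the earlier example; the terminal status names are assumptions based on common Runpod job states.

```python
import time
import requests

endpoint_id = "YOUR_ENDPOINT_ID"   # placeholder
job_id = "YOUR_JOB_ID"             # ID returned by the /run call
headers = {"Authorization": "Bearer YOUR_API_KEY"}
status_url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"

# Poll until the job reaches a terminal state; completed results remain
# retrievable for roughly 30 minutes after the job finishes.
while True:
    status = requests.get(status_url, headers=headers).json()
    if status.get("status") in {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}:
        print(status)
        break
    time.sleep(2)
```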

Performance features

FlashBoot

FlashBoot reduces cold start times by retaining the state of worker resources for a short period after they spin down. This allows the system to “revive” a worker much faster than a standard fresh boot. FlashBoot is most effective on endpoints with consistent traffic, where workers frequently cycle between active and idle states.

Model

The Model field allows you to select from a list of cached models. When selected, Runpod schedules your workers on host machines that already have these large model files pre-loaded. This significantly reduces the time required to load models during worker initialization.

Advanced settings

Data centers

You can restrict your endpoint to specific geographical regions. For maximum reliability and availability, we recommend allowing all data centers. Restricting this list decreases the pool of available GPUs your endpoint can draw from.

Network volumes

Network volumes provide persistent storage that survives worker restarts. While they enable data sharing between workers, they introduce network latency and restrict your endpoint to the specific data center where the volume resides. Use network volumes only if your workload specifically requires shared persistence or datasets larger than the container limit.
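If you do attach a network volume, a handler can detect it at runtime. The sketch below assumes the volume is mounted at /runpod-volume, which is the typical mount point on serverless workers; verify the path for your endpoint.

```python
import os

VOLUME_PATH = "/runpod-volume"   # assumed mount point for the attached network volume

def model_cache_dir() -> str:
    """Prefer the shared network volume when present, otherwise fall back to local disk."""
    if os.path.isdir(VOLUME_PATH):
        return os.path.join(VOLUME_PATH, "models")
    return "/tmp/models"

print(model_cache_dir())
```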

CUDA version selection

This filter ensures your workers are scheduled on host machines with compatible drivers. While you should select the version your code requires, we recommend also selecting all newer versions. CUDA is generally backward compatible, and selecting a wider range of versions increases the pool of available hardware.

Expose HTTP/TCP ports

Enabling this option exposes the public IP and port of the worker, allowing for direct external communication. This is required for applications that need persistent connections, such as WebSockets.