dstack is an open-source tool that automates Pod orchestration for AI and ML workloads. It lets you define your application and resource requirements in YAML files, then handles provisioning and managing cloud resources on Runpod so you can focus on your application instead of infrastructure.
This guide shows you how to set up dstack with Runpod and deploy vLLM to serve the meta-llama/Llama-3.1-8B-Instruct model from Hugging Face.
Requirements
You'll need:
- A Runpod account and API key.
- A Hugging Face account and access token with access to the gated meta-llama/Llama-3.1-8B-Instruct model.
- Python 3 installed locally.
These instructions work on macOS, Linux, and Windows.
Set up dstack
Prepare your workspace
Open a terminal and create a new directory:
mkdir runpod-dstack-tutorial
cd runpod-dstack-tutorial
Set up a Python virtual environment
macOS and Linux:
python3 -m venv .venv
source .venv/bin/activate
Windows (Command Prompt):
python -m venv .venv
.venv\Scripts\activate
Windows (PowerShell):
python -m venv .venv
.venv\Scripts\Activate.ps1
Install dstack
Install dstack using pip.
macOS:
pip3 install -U "dstack[all]"
Linux and Windows:
pip install -U "dstack[all]"
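To confirm the package landed in your virtual environment, you can inspect it with pip. This is only a sanity check (it prints the installed version and location); any recent dstack release works for this guide:
pip show dstack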
Create the global configuration file
Create a config.yml file in the dstack configuration directory. This file stores your Runpod credentials for all dstack deployments.
Create the configuration directory:
macOS and Linux:
mkdir -p ~/.dstack/server
Windows:
mkdir %USERPROFILE%\.dstack\server
Navigate to the configuration directory:
macOS and Linux:
cd ~/.dstack/server
Windows:
cd %USERPROFILE%\.dstack\server
Create a file named config.yml with the following content:
projects:
- name: main
  backends:
  - type: runpod
    creds:
      type: api_key
      api_key: YOUR_RUNPOD_API_KEY
Replace YOUR_RUNPOD_API_KEY with your actual Runpod API key.
Start the dstack server
Start the dstack server:
dstack server
You'll see output like this:
[INFO] Applying ~/.dstack/server/config.yml...
[INFO] The admin token is ADMIN-TOKEN
[INFO] The dstack server is running at http://127.0.0.1:3000
Save the ADMIN-TOKEN to access the dstack web UI.
Access the dstack web UI
Open your browser and go to http://127.0.0.1:3000. Enter the ADMIN-TOKEN from the server output to access the web UI where you can monitor and manage deployments.
Deploy vLLM
Prepare for deployment
Open a new terminal and navigate to your tutorial directory:
cd runpod-dstack-tutorial
Activate the Python virtual environment:
macOS and Linux:
source .venv/bin/activate
Windows (Command Prompt):
.venv\Scripts\activate
Windows (PowerShell):
.venv\Scripts\Activate.ps1
Create a directory for the task
Create a new directory for the deployment:
mkdir task-vllm-llama
cd task-vllm-llama
Create the dstack configuration file
Create a file named .dstack.yml with the following content:
type: task
name: vllm-llama-3.1-8b-instruct
python: "3.10"
env:
  - HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN
  - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=8192
commands:
  - pip install vllm
  - vllm serve $MODEL_NAME --port 8000 --max-model-len $MAX_MODEL_LEN
ports:
  - 8000
spot_policy: on-demand
resources:
  gpu:
    name: "RTX4090"
    memory: "24GB"
  cpu: 16..
Replace YOUR_HUGGING_FACE_HUB_TOKEN with your Hugging Face access token. The model is gated and requires authentication to download.
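If you'd rather keep the token out of the file, dstack can read it from the shell where you run it instead. A minimal sketch, assuming your dstack version supports value-less env entries (declare only the variable name in the task, then export it before deploying):
env:
  - HUGGING_FACE_HUB_TOKEN   # value is taken from the local shell environment at deploy time
  - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=8192
Then, in the terminal where you deploy:
export HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN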
Initialize and deploy
Initialize dstack
In the directory with your .dstack.yml file, run:
dstack init
Apply the configuration
Deploy the task:
dstack apply -f .dstack.yml
You'll see the deployment configuration and available instances. When prompted:
Submit the run vllm-llama-3.1-8b-instruct? [y/n]:
Type y and press Enter.
The ports configuration forwards the deployed Pod's port to localhost:8000 on your machine.
Monitor the deployment
dstack will provision the Pod, download the Docker image, install packages, download the model, and start the vLLM server. You'll see progress logs in the terminal.
To view logs at any time, run:
dstack logs vllm-llama-3.1-8b-instruct
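You can also check on the run from another terminal (with the same virtual environment activated) using dstack's run listing command, which shows each run and its current status:
dstack ps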
Wait until you see logs indicating the server is ready:
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
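As a quick readiness check from your own machine, you can list the models the server is serving before sending a full chat request. This assumes your vLLM version exposes the standard OpenAI-compatible GET /v1/models route:
curl http://localhost:8000/v1/models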
Test the deployment
The vLLM server is now accessible at http://localhost:8000.
Test it with curl.
macOS and Linux:
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are Poddy, a helpful assistant."},
      {"role": "user", "content": "What is your name?"}
    ],
    "temperature": 0,
    "max_tokens": 150
  }'
Windows (Command Prompt):
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d "{ \"model\": \"meta-llama/Llama-3.1-8B-Instruct\", \"messages\": [ {\"role\": \"system\", \"content\": \"You are Poddy, a helpful assistant.\"}, {\"role\": \"user\", \"content\": \"What is your name?\"} ], \"temperature\": 0, \"max_tokens\": 150 }"
Windows (PowerShell):
Invoke-RestMethod -Method Post -Uri http://localhost:8000/v1/chat/completions `
  -ContentType "application/json" `
  -Body '{ "model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ {"role": "system", "content": "You are Poddy, a helpful assistant."}, {"role": "user", "content": "What is your name?"} ], "temperature": 0, "max_tokens": 150 }'
You’ll receive a JSON response:
{
  "id": "chat-f0566a5143244d34a0c64c968f03f80c",
  "object": "chat.completion",
  "created": 1727902323,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "My name is Poddy, and I'm here to assist you with any questions or information you may need.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 49,
    "total_tokens": 199,
    "completion_tokens": 150
  },
  "prompt_logprobs": null
}
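Because vLLM exposes an OpenAI-compatible API, you can also query the forwarded endpoint programmatically. A minimal sketch using the openai Python package (an extra dependency, installed with pip install openai); the api_key value is a placeholder because this server was started without authentication:
# Query the forwarded vLLM endpoint with the OpenAI Python client.
from openai import OpenAI

# Point the client at the locally forwarded vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are Poddy, a helpful assistant."},
        {"role": "user", "content": "What is your name?"},
    ],
    temperature=0,
    max_tokens=150,
)

# Print only the assistant's reply text.
print(response.choices[0].message.content)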
Clean up
Stop the task when you’re done to avoid charges.
Press Ctrl + C in the terminal where you ran dstack apply. When prompted:
Stop the run vllm-llama-3.1-8b-instruct before detaching? [y/n]:
Type y and press Enter.
The instance will terminate automatically. To ensure immediate termination, run:
dstack stop vllm-llama-3.1-8b-instruct
Verify termination in your Runpod dashboard or the dstack web UI.
Use volumes for persistent storage
Volumes let you store data between runs and cache models to reduce startup times.
Create a volume
Create a file named volume.dstack.yml:
type: volume
name: llama31-volume
backend: runpod
region: EUR-IS-1
# Required size
size: 100GB
The region setting pins the volume to a specific region, and any Pod that mounts it must run in that same region.
Apply the volume configuration:
dstack apply -f volume.dstack.yml
Use the volume in your task
Modify your .dstack.yml file to include the volume:
volumes:
  - name: llama31-volume
    path: /data
This mounts the volume to the /data directory inside your container, letting you store models and data persistently. This is useful for large models that take time to download.
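As an illustration, the task from earlier could mount the volume and point the Hugging Face cache at it so the model weights download only once. This is a sketch rather than part of the original tutorial: the HF_HOME variable is the standard Hugging Face cache override, and it assumes vLLM fetches the model through huggingface_hub:
type: task
name: vllm-llama-3.1-8b-instruct
python: "3.10"
env:
  - HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN
  - MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
  - MAX_MODEL_LEN=8192
  - HF_HOME=/data/huggingface   # assumption: cache model downloads on the mounted volume
commands:
  - pip install vllm
  - vllm serve $MODEL_NAME --port 8000 --max-model-len $MAX_MODEL_LEN
ports:
  - 8000
spot_policy: on-demand
resources:
  gpu:
    name: "RTX4090"
    memory: "24GB"
  cpu: 16..
volumes:
  - name: llama31-volume
    path: /data
On the first run the model is downloaded into /data/huggingface; later runs that attach the same volume can reuse that cache instead of downloading again.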
For more information, see the dstack blog on volumes.