Model deployment is a very common GPU use case. With Shadeform, it's easy to deploy models to the most affordable GPUs on the market with just a few commands. In this guide, we will deploy Mistral-7B-v0.1 with vLLM onto an A6000.
This guide builds on our guides for finding the best GPU and for deploying GPU containers.
We have a Python notebook ready to go for deploying this model, which you can find here.
The requirements are simple: a Python environment with requests (and optionally openai) installed.
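To start, we set up the imports and request headers. The variables best_instance, region, and shade_instance_type used below carry over from the GPU-finding guide; the header name and endpoint URLs in this sketch are assumptions on our part, so check the API reference for the exact values.

import json
import requests

# NOTE: the header name and endpoint paths below are assumptions; confirm
# them against the Shadeform API reference before running.
API_KEY = "<your-shadeform-api-key>"
headers = {
    "X-API-KEY": API_KEY,
    "Content-Type": "application/json",
}
base_url = "https://api.shadeform.ai/v1/instances"            # list instances
create_url = "https://api.shadeform.ai/v1/instances/create"   # create an instance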
Once we've found the best available instance, we request it and deploy a model-serving container with this payload:
model_id = "mistralai/Mistral-7B-v0.1"payload = { "cloud": best_instance["cloud"], "region": region, "shade_instance_type": shade_instance_type, "shade_cloud": True, "name": "cool_gpu_server", "launch_configuration": { "type": "docker", #This selects the image to launch, and sets environment variables "tasks" and "num_fewshot" "docker_configuration": { "image": "vllm/vllm-openai:latest", "args": "--model " + model_id, "envs": [], "port_mappings": [ { "container_port": 8000, "host_port": 8000 } ] } }}#request the best instance that is availableresponse = requests.request("POST", create_url, json=payload, headers=headers)#easy way to visually see if this request workedprint(response.text)
Once we make the request, Shadeform will provision the machine and deploy a Docker container based on the image, arguments, and environment variables we selected.
This might take 5-10 minutes depending on the machine and the size of the model weights.
For more information on the API fields, check out the Create Instance API Reference. We can see that this will deploy an OpenAI-compatible server with vLLM serving Mistral-7B-v0.1.
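If the request succeeded, the response body should identify the new instance. As a quick sanity check, something like the following can pull out its id; the field name here is an assumption, so verify it against the Create Instance API Reference (the listing code below works regardless).

create_body = response.json()
# "id" is an assumed field name for the new instance's identifier;
# check the Create Instance API Reference for the exact response schema.
instance_id = create_body.get("id")
print(f"Created instance: {instance_id}")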
There are three main steps we need to wait for: VM provisioning, pulling the container image, and spinning up vLLM. We can check the instances endpoint to see when the VM is active:
instance_response = requests.request("GET", base_url, headers=headers)
ip_addr = ""
print(instance_response.text)

instance = json.loads(instance_response.text)["instances"][0]
instance_status = instance['status']

if instance_status == 'active':
    print(f"Instance is active with IP: {instance['ip']}")
    ip_addr = instance['ip']
else:
    print(f"Instance isn't yet active: {instance}")
This cell will print the IP address once the instance has been provisioned. However, the image still needs to download, and vLLM needs to download the model and spin up, which should take a few more minutes.
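If you'd rather not guess at the timing, you can poll the server until it responds. This is a small convenience sketch that isn't part of the notebook; it relies on the server's /v1/models endpoint, which only answers once vLLM has started, and the timeout values are arbitrary.

import time

def wait_for_vllm(ip_addr, port=8000, timeout_s=1200, interval_s=15):
    # Poll /v1/models until the vLLM server responds or we hit the timeout.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"http://{ip_addr}:{port}/v1/models", timeout=5)
            if r.ok:
                return True
        except requests.exceptions.RequestException:
            pass  # server not reachable yet; keep waiting
        time.sleep(interval_s)
    return False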
Once the model is ready, this code will output the model list and a response to our query. We can use either requests or the OpenAI Python client.
# Wait until the previous cell has an IP address associated with it, and then
# add a few minutes for the vLLM server to stand up.
# It is usually best to look at the logs on the dashboard to tell when the model is loaded.
model_list_response = requests.get(f'http://{ip_addr}:8000/v1/models')
print(model_list_response.text)

vllm_headers = {
    'Content-Type': 'application/json',
}

json_data = {
    'model': model_id,
    'prompt': 'San Francisco is a',
    'max_tokens': 7,
    'temperature': 0,
}

completion_response = requests.post(f'http://{ip_addr}:8000/v1/completions', headers=vllm_headers, json=json_data)
print(completion_response.text)
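The same query through the OpenAI Python client looks like this. This is a sketch assuming the v1.x openai package; the vLLM server doesn't check the API key by default, so any placeholder string works.

from openai import OpenAI

# Point the client at the vLLM server; the api_key is a placeholder since
# the server doesn't validate it by default.
client = OpenAI(base_url=f"http://{ip_addr}:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model=model_id,
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)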
Alternatively, once we've made the request, we can watch the logs under Running Instances in the dashboard. Once vLLM is ready to serve, it should look something like this:
Happy Serving!