Yulong Liu
February 15, 2024 • 3 mins read
Empower is a serverless LLM hosting platform for fine-tuned models that delivers performance on par with dedicated instances at a fraction of the cost.
Today, we're excited to announce the public beta launch of Empower, a fast and cost-effective platform for serving fine-tuned LLMs. Empower hosts fine-tuned models serverlessly, delivering performance on par with dedicated instances at a fraction of the cost. Get started and deploy your model for free today.
Fine-tuned open-source LLMs can achieve superior performance, in some cases outperforming much larger models like GPT-4 on specific tasks. As a result, a growing number of companies have started fine-tuning their own models.
However, deploying and hosting fine-tuned LLMs is challenging, primarily because of the cost. Most LLM hosting platforms require merging the fine-tuned LoRA weights into the base model and renting dedicated GPUs to run the merged model. Under this model, a single fine-tuned LLM can cost a few thousand dollars a month to run, making fine-tuned models economically impractical for most businesses.
Empower is a serverless hosting platform for fine-tuned LLMs. Unlike other platforms that require reserving GPUs to run fine-tuned models, Empower lets users deploy fine-tuned LoRAs serverlessly and pay for token usage rather than GPU time. With Empower, you can cut your model serving cost by more than 80% with no compromise on performance.
Key Benefits of Using Empower, compared to other alternatives (e.g., AWS, GCP, Together.ai, Modal, RunPod, etc.):
- Cost Effective: Empower uses a pay-by-token-usage model ($0.4 per 1 million tokens for 7B models), significantly reducing costs compared to traditional GPU-time billing, which can run to thousands of dollars per month per fine-tuned model (see the back-of-envelope comparison after this list).
- Minimal Cold Start Time: Empower’s cold start time is more than 80% lower than that of other alternatives, ensuring immediate responses for an enhanced user experience.
- Flexibility: Empower supports any PEFT-based LoRA model. Users are free to train their models on any platform and deploy them with Empower, serving through an OpenAI-compatible API with no vendor lock-in.
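To make the pricing difference concrete, here is a back-of-envelope comparison in Python. The monthly token volume and the dedicated-GPU hourly rate are illustrative assumptions; only the $0.4 per 1M tokens figure comes from the pricing above.

```python
# Back-of-envelope cost comparison. All numbers are illustrative assumptions
# except Empower's published $0.4 per 1M tokens for 7B models.
EMPOWER_PRICE_PER_1M_TOKENS = 0.40   # USD, from the pricing above
MONTHLY_TOKENS = 50_000_000          # assumed workload: 50M tokens/month

ASSUMED_GPU_HOURLY_RATE = 2.00       # USD/hr, a rough on-demand A100 price
HOURS_PER_MONTH = 24 * 30

empower_cost = MONTHLY_TOKENS / 1_000_000 * EMPOWER_PRICE_PER_1M_TOKENS
dedicated_cost = ASSUMED_GPU_HOURLY_RATE * HOURS_PER_MONTH  # GPU is billed even when idle

print(f"Empower (pay per token):   ${empower_cost:,.2f}/month")    # $20.00/month
print(f"Dedicated GPU (always on): ${dedicated_cost:,.2f}/month")  # $1,440.00/month
```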
Deploying a LoRA on Empower is straightforward. We support importing models from Hugging Face: to deploy, simply go to the LoRAs page of Empower and enter the Hugging Face model ID. The OpenAI-compatible LoRA inference endpoint is ready to use right after deployment. The screen recording below shows the flow of deploying a LoRA to production.
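Once deployed, the endpoint can be called with any OpenAI-compatible client. Below is a minimal sketch using the official openai Python SDK; the base URL, API key, and model name are placeholders (assumptions), so substitute the values shown on your Empower dashboard.

```python
# Minimal sketch of calling a deployed LoRA through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.empower.dev/v1",  # assumed placeholder endpoint
    api_key="YOUR_EMPOWER_API_KEY",
)

response = client.chat.completions.create(
    model="your-username/your-finetuned-lora",  # the Hugging Face model ID you deployed
    messages=[{"role": "user", "content": "Hello from my fine-tuned model!"}],
)
print(response.choices[0].message.content)
```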
At the core of Empower’s platform is a high-performance LoRA inference engine we built in Rust atop existing open-source projects. The engine runs multiple LoRAs on the same GPU with low latency and high throughput, outperforming other leading open-source inference engines such as vLLM (developed at UC Berkeley) and LoRAX (built on top of Hugging Face’s text-generation-inference).
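Empower’s Rust engine is not open source, but the core multi-LoRA idea it exploits can be sketched with Hugging Face PEFT: keep one copy of the base model in GPU memory and hot-swap lightweight adapters on top of it. The adapter IDs below are placeholders, and this sketch illustrates the concept only, not Empower’s implementation.

```python
# Minimal sketch of the multi-LoRA idea using Hugging Face PEFT: one base
# model in memory, multiple LoRA adapters swapped on top of it.
# (Empower's Rust engine additionally batches requests across adapters
# for low latency and high throughput.)
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Attach a first adapter, then register a second one on the same base weights.
model = PeftModel.from_pretrained(base, "user/chinese-chat-lora", adapter_name="chinese")
model.load_adapter("user/sql-lora", adapter_name="sql")  # placeholder adapter IDs

# Route each request to the adapter it was fine-tuned with.
model.set_adapter("chinese")
inputs = tokenizer("你好！", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))

model.set_adapter("sql")  # swap adapters without reloading the 7B base model
```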
We conducted two benchmarks to test Empower’s performance:
- Cold start latency benchmark against Anyscale and Modal
- Latency/throughput benchmark against vLLM and LoRAX
with the following setup:
- Model: llama2-7b-chat
- LoRA: Llama-7b-chinese-chat-lora
- Hardware: one A100-40G GPU
Cold Start Latency Benchmark:
We sent 5 cold requests (the first request after service start, or after a long idle period) to each platform. The average cold-request response times are shown in the chart below.
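For reference, a cold-start measurement like this can be taken by timing from request send to first streamed token. The sketch below uses placeholder endpoint and model values (assumptions):

```python
# Time from sending a cold request to receiving the first streamed token.
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.empower.dev/v1", api_key="YOUR_KEY")  # placeholder

start = time.perf_counter()
stream = client.chat.completions.create(
    model="your-username/your-finetuned-lora",  # placeholder model name
    messages=[{"role": "user", "content": "ping"}],
    stream=True,  # stream so we can stop the clock at the first token
)
next(iter(stream))  # block until the first chunk arrives
print(f"time to first token: {time.perf_counter() - start:.2f}s")
```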
Latency / Throughput Benchmark:
Below is the setup of the latency / throughput benchmark (a sketch of a matching load generator follows the list):
- Total of 1000 requests, varying QPS to simulate different load scenarios
- Varied input length, with an average of 233 input tokens per request
- Varied output length, with an average of 245 output tokens per request
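A load generator matching this setup can be sketched as follows; the endpoint and model name are placeholders (assumptions), and the pacing loop is a simplified open-loop QPS controller:

```python
# QPS-controlled load generator: N total requests launched at a fixed rate,
# with per-request latency recorded.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.empower.dev/v1", api_key="YOUR_KEY")  # placeholder

async def one_request(prompt: str) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="your-username/your-finetuned-lora",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return time.perf_counter() - start

async def run(total: int = 1000, qps: float = 10.0) -> None:
    tasks = []
    for i in range(total):
        tasks.append(asyncio.create_task(one_request(f"request {i}")))
        await asyncio.sleep(1.0 / qps)  # pace request launches at the target QPS
    latencies = await asyncio.gather(*tasks)
    print(f"mean latency: {sum(latencies) / len(latencies):.2f}s")

asyncio.run(run())
```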
This is just the initial version of the Empower engine; we will continue to optimize it to perform even better!
Deploy and serve your first fine-tuned LLM in 1 minute for free!