
Cloudflare Challenges AWS By Bringing Serverless AI To The Edge

Cloudflare, the leading connectivity cloud company, recently announced the general availability of its Workers AI platform, as well as several new capabilities aimed at simplifying how developers build and deploy AI applications. This announcement represents a significant step forward in Cloudflare’s efforts to democratize AI and make it more accessible to developers worldwide.

After months of being in open beta, Cloudflare’s Workers AI platform has now achieved general availability status. This means that the service has undergone rigorous testing and improvements to ensure greater reliability and performance.

Cloudflare’s Workers AI is an inference platform that enables developers to run machine learning models on Cloudflare’s global network with just a few lines of code. It provides a serverless and scalable solution for GPU-accelerated AI inference, allowing developers to leverage pre-trained models for tasks such as text generation, image recognition and speech recognition without the need to manage infrastructure or GPUs.

With Workers AI, developers can now run machine learning models on Cloudflare’s global network, leveraging the company’s distributed infrastructure to deliver low-latency inference capabilities.
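As an illustration of how little code is involved, here is a minimal sketch of a text-generation call against the Workers AI REST endpoint from Python. The account ID and API token are placeholders, and the `@cf/meta/llama-2-7b-chat-int8` slug is one of the models in the Workers AI catalog; verify the exact model names against Cloudflare's documentation.

```python
# Minimal sketch: text generation against the Workers AI REST endpoint.
# Assumes the /ai/run/{model} route and a Llama 2 chat model slug;
# substitute your own account ID and API token.
import os
import requests

ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
API_TOKEN = os.environ["CLOUDFLARE_API_TOKEN"]
MODEL = "@cf/meta/llama-2-7b-chat-int8"

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "Explain serverless inference in one sentence."},
        ]
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # the generated text is returned inside the JSON result
```

The same endpoint pattern applies to the other model types in the catalog, such as embeddings or image classification, by swapping the model slug and payload.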

Cloudflare currently has GPUs running in more than 150 of its data center locations, with plans to expand to nearly all of its 300+ locations worldwide by the end of 2024.

Expanding its partnership with Hugging Face, Cloudflare now provides a curated list of popular open-source models that are well suited to serverless GPU inference across its extensive global network. Developers can deploy models from Hugging Face with a single click. This partnership makes Cloudflare one of the few providers to offer serverless GPU inference for Hugging Face models.

Currently, there are 14 curated Hugging Face models optimized for Cloudflare’s serverless inference platform, supporting tasks such as text generation, embeddings and sentence similarity. Developers can simply choose a model from Hugging Face, click “Deploy to Cloudflare Workers AI,” and instantly distribute it across the more than 150 cities in Cloudflare’s global network where GPUs are deployed.

Developers can interact with LLMs like Mistral, Llama 2 and others via a simple REST API. They can also use advanced techniques like retrieval-augmented generation to create domain-specific chatbots that can access contextual data.
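To sketch the retrieval-augmented generation pattern mentioned above, the example below embeds a handful of documents with a Workers AI embedding model (assuming the `@cf/baai/bge-base-en-v1.5` slug and an embedding response carrying a `data` list of vectors), picks the document closest to the question by cosine similarity, and feeds it to an LLM as context. A production setup would typically replace the in-memory list with a vector database such as Cloudflare's Vectorize.

```python
# Sketch of retrieval-augmented generation over the Workers AI REST API.
# The model slugs and the embedding response shape ("data": list of vectors)
# are assumptions to verify against the Workers AI catalog and docs.
import os
import requests

ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
API_TOKEN = os.environ["CLOUDFLARE_API_TOKEN"]
BASE = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run"
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}


def run(model: str, payload: dict) -> dict:
    resp = requests.post(f"{BASE}/{model}", headers=HEADERS, json=payload, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm


docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm UTC.",
]
question = "How long do customers have to return an item?"

# Embed the documents and the question, then retrieve the closest document.
doc_vectors = run("@cf/baai/bge-base-en-v1.5", {"text": docs})["data"]
query_vector = run("@cf/baai/bge-base-en-v1.5", {"text": [question]})["data"][0]
scores = [cosine(query_vector, v) for v in doc_vectors]
context = docs[scores.index(max(scores))]

# Generate an answer grounded in the retrieved context.
answer = run("@cf/meta/llama-2-7b-chat-int8", {
    "messages": [
        {"role": "system", "content": f"Answer using only this context: {context}"},
        {"role": "user", "content": question},
    ],
})
print(answer)
```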

One of the key advantages of Workers AI is its serverless nature, which allows developers to pay only for the resources they consume without the need to manage or scale GPUs or infrastructure. This pay-as-you-go model makes AI inference more affordable and accessible, especially for smaller organizations and startups.

As part of the GA release, Cloudflare has introduced several performance and reliability enhancements to Workers AI. The load balancing capabilities have been upgraded, enabling requests to be routed to more GPUs across Cloudflare’s global network. If a request would otherwise have to wait in a queue at a particular location, it can now be seamlessly routed to another city, reducing latency and improving overall performance.

Additionally, Cloudflare has increased the rate limits for most large language models to 300 requests per minute, up from 50 requests per minute during the beta phase. Smaller models now have rate limits ranging from 1,500 to 3,000 requests per minute, further enhancing the platform’s scalability and responsiveness.
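Even at these higher limits, a busy application can still hit the ceiling, so it is worth wrapping calls in a simple retry loop. The sketch below assumes the API responds with HTTP 429 when a request is throttled, which is the conventional behavior but should be confirmed against Cloudflare's documentation.

```python
# Sketch: exponential backoff around a Workers AI request when rate limits bite.
# Assumes HTTP 429 signals throttling; confirm the exact behavior in the docs.
import time
import requests


def post_with_backoff(url: str, headers: dict, payload: dict, max_retries: int = 5) -> dict:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface any non-rate-limit error
            return resp.json()
        time.sleep(delay)  # wait before retrying the throttled request
        delay *= 2         # double the delay after each consecutive 429
    raise RuntimeError("Still rate limited after retries")
```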

One of the most requested features for Workers AI has been the ability to perform fine-tuned inference. Cloudflare has taken a step in this direction by enabling Bring Your Own Low-Rank Adaptation (LoRA). This BYO LoRA technique allows developers to adapt a subset of a model’s parameters to a specific task, rather than updating all of the parameters as a fully fine-tuned model would.

Cloudflare’s support for custom LoRA weights and adapters enables efficient multi-tenancy in model hosting, allowing customers to deploy and access fine-tuned models based on their custom datasets.
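As a rough, hypothetical sketch of what BYO LoRA inference could look like, the snippet below references an uploaded adapter by name through a `lora` field on a request to a LoRA-capable base model. The field name, the adapter name and the `@cf/mistral/mistral-7b-instruct-v0.2-lora` slug are assumptions for illustration; check the Workers AI documentation for the exact interface and upload workflow.

```python
# Hypothetical sketch of fine-tuned inference with a bring-your-own LoRA adapter.
# The "lora" request field, adapter name and model slug below are illustrative
# assumptions; consult the Workers AI docs for the exact interface.
import os
import requests

ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
API_TOKEN = os.environ["CLOUDFLARE_API_TOKEN"]
MODEL = "@cf/mistral/mistral-7b-instruct-v0.2-lora"  # assumed LoRA-capable base model

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/{MODEL}",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "prompt": "Summarize this support ticket in one sentence: ...",
        "lora": "my-support-tickets-adapter",  # hypothetical adapter name
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```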

While there are currently some limitations, such as the lack of support for quantized LoRA models and restrictions on adapter size and rank, Cloudflare plans to expand its fine-tuning capabilities further, eventually supporting fine-tuning jobs and fully fine-tuned models directly on the Workers AI platform.

Cloudflare is also offering an AI Gateway, which is a powerful platform that acts as a control plane for managing and governing the usage of AI models and services across an organization.

It sits between applications and AI providers like OpenAI, Hugging Face and Replicate, enabling developers to connect their applications to these providers with just a single line of code change.

In effect, the gateway gives enterprises a single conduit between their applications and the model providers they rely on, combining governance over AI usage with a low-friction integration path for developers.

The gateway offers centralized control, providing a single interface to multiple AI services that simplifies integration and gives organizations one place to manage how AI capabilities are consumed. It delivers observability through detailed analytics and monitoring, offering visibility into application performance and usage, and it addresses security and governance by enabling policy enforcement and access control.
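In practice, the “single line of code change” typically means pointing an existing SDK at the gateway’s URL instead of the provider’s. The sketch below routes OpenAI traffic through an AI Gateway by overriding the client’s base URL, assuming an endpoint of the form `https://gateway.ai.cloudflare.com/v1/{account_id}/{gateway_name}/openai` and a gateway named `my-gateway` created beforehand in the Cloudflare dashboard.

```python
# Sketch: routing OpenAI API traffic through Cloudflare AI Gateway by changing
# the SDK's base URL. The gateway URL format and gateway name are assumptions;
# substitute your own account ID and gateway slug.
import os
from openai import OpenAI

ACCOUNT_ID = os.environ["CLOUDFLARE_ACCOUNT_ID"]
GATEWAY = "my-gateway"  # hypothetical gateway name

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    # The only change from a direct OpenAI integration is this base URL.
    base_url=f"https://gateway.ai.cloudflare.com/v1/{ACCOUNT_ID}/{GATEWAY}/openai",
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello through the gateway"}],
)
print(completion.choices[0].message.content)
```

Requests sent this way show up in the gateway’s analytics and are subject to whatever policies, such as caching or rate limiting, the organization has configured.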

Finally, Cloudflare has added Python support to Workers, its serverless platform for deploying web functions and applications. Since its inception, Workers has only supported JavaScript as a language for writing edge-running functions. With the addition of Python, Cloudflare now caters to the large community of Python developers, allowing them to use the power of Cloudflare’s global network in their applications.
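Below is a minimal sketch of what a Python Worker looks like with the new support, based on the `on_fetch` entry point and the `js` interop module that Cloudflare’s Python Workers runtime exposes; treat the exact imports and signatures as illustrative of the beta rather than a fixed interface.

```python
# Minimal Python Worker sketch: respond to HTTP requests at the edge.
# Uses the on_fetch entry point and the js.Response interop object that
# Cloudflare's Python Workers runtime exposes (illustrative of the beta).
from js import Response


async def on_fetch(request, env):
    # Echo the request URL back to the caller from Cloudflare's edge.
    return Response.new(f"Hello from a Python Worker! You requested {request.url}")
```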

Cloudflare is challenging AWS by steadily improving the capabilities of its edge network. Amazon’s serverless platform, AWS Lambda, has yet to support GPU-based model inference, and its load balancers and API gateway lack features tailored to AI inference endpoints. Interestingly, Cloudflare’s AI Gateway includes built-in support for Amazon Bedrock API endpoints, providing developers with a consistent interface.

With Cloudflare expanding the availability of GPU nodes across multiple points of presence, developers can now access state-of-the-art AI models with low latency and the best price/performance ratio. Its AI Gateway brings proven API management and governance practices to the AI endpoints offered by various providers.
