Gen AI Edge Computing: Bridging the Gap to Real Enterprise Value
In the weeks leading up to Apple’s June WWDC announcement, buzz swirled around its rumored OpenAI integration. However, while this partnership is exciting, Siri’s latest update, dubbed Apple Intelligence, aspires to much more.
For many, Apple Intelligence is their first mainstream exposure to Gen AI Edge Computing, which embeds AI models directly into users’ devices and nearby servers. For a product line that values real-time performance, cost/power efficiency, and consumer privacy, leveraging Edge Computing empowers Apple to provide real, reliable value, rather than merely cutting-edge tech demos.
Apple Intelligence certainly proves Gen AI Edge Computing’s viability within consumer devices, though the technology’s utility is arguably more pronounced in B2B contexts. In fact, Edge Computing’s privacy/security, performance, and cost/power benefits are critical to bridging the gap to real enterprise value in a variety of applications.
This article first provides an overview of Gen AI Edge Computing and how it powers Apple Intelligence, before diving into its enterprise applications, including the industries and products for which the technology is essential. Namely, we’re excited about industries in which security and privacy are vital, such as the military, as well as products that require real-time (and sometimes offline) interactions, such as various IoT devices. Additionally, Edge Computing has significant implications for Gen AI’s cost structure, which will be explored in depth in an upcoming article (coming soon).
Generative AI has quickly emerged as one of the most exciting developments of our current era, but in order for this excitement to translate into sustainable business value, Edge Computing will be a necessary piece of the puzzle.
Gen AI Edge Computing & Apple Intelligence
By storing Generative AI models directly on consumer hardware, Apple enables a host of conversational capabilities without the latency, compute costs, connectivity requirements, or security/privacy risks of cloud-hosted experiences such as ChatGPT or Claude. This section walks through Gen AI Edge Computing and how Apple Intelligence utilizes this technology alongside its Cloud-Hosted Gen AI models, diving into their pros and cons.
Gen AI Edge Computing Overview
Generative AI Edge Computing has two variants: “Locally-Hosted” and “At the Edge of the Network”. The former refers to storing AI models directly within the hardware that sources its Inference input data; the latter refers to AI processing carried out on local servers or hardware close to IoT devices, but not on the devices themselves. This article will primarily detail the former, which is the variant that Apple Intelligence leverages.
There are two ways in which data is processed to power a Gen AI experience: Training and Inference. Training is the computationally expensive process of ingesting data to create (and fine-tune) an AI model. In the vast majority of cases, training an AI model will take place on cloud servers. However, Inference, which involves the actual execution of user inputs by said AI model, can benefit greatly from Edge Computing.
Within the user flow for a cloud-hosted Gen AI model’s Inference, a user’s device first collects input data. This data may take the form of an LLM prompt, the contents of a piece of information (book, webpage, spreadsheet, etc.), a photo, or any other combination of data that may help define the output of a specific Gen AI use case. Next, that input data is transported to the servers that host the given Gen AI model, which then processes said input data and returns an output response. Finally, that output response is transported back to the user's device to be displayed and, in agentic use cases, executed.
The characteristic of Gen AI Edge Computing that defines its benefits is the lessened distance collected data must travel to be processed. In contrast to the above user flow, within Edge Computing, the Gen AI model is stored directly on (or near) the user’s device. Thus, once the input data is collected, it is already effectively where it needs to be to be processed by the AI model. As a result, latency is greatly reduced (since the data’s required travel time is negligible), offline capabilities are enabled, marginal Inference costs are all but eliminated (since computations are executed on the local device itself), and the data stays private and secure (since it practically cannot be intercepted).
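To make the contrast concrete, below is a minimal sketch of the two Inference flows. Everything here is illustrative: the function names, the simulated 200 ms round trip, and the model responses are stand-ins, not measurements of any real system.

```python
import time

def cloud_inference(prompt: str) -> str:
    """Hypothetical cloud flow: the prompt travels over the network,
    is processed on a remote server, and the response travels back."""
    simulated_network_round_trip_s = 0.200  # illustrative figure, not a benchmark
    time.sleep(simulated_network_round_trip_s)
    return f"[cloud model response to: {prompt!r}]"

def edge_inference(prompt: str) -> str:
    """Hypothetical edge flow: the model already lives on the device,
    so the input is processed right where it was collected."""
    return f"[on-device model response to: {prompt!r}]"

for run in (cloud_inference, edge_inference):
    start = time.perf_counter()
    run("Summarize this note")
    print(f"{run.__name__}: {time.perf_counter() - start:.3f}s")
```

Note that the edge path also keeps working with no connection at all, which is the basis of the offline capabilities described above.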
Edge Computing & Apple Intelligence
Apple’s aim for its Generative AI integration is to serve as a powerful and versatile intelligence system. In contrast to more pointed applications of Gen AI that optimize around executing a small subset of use cases very well, Apple devices are designed as a starting point for a wide variety of user interactions.
As the conversational entry point for such multifaceted products, Apple Intelligence performs many tasks, including (but not limited to) completing basic context-specific actions, synthesizing context-specific information, and responding as accurately as possible to any user query. Importantly, Apple Intelligence integrates with Apple devices’ OSs, giving its Gen AI the ability to tap into the information and capabilities of both 1P and 3P applications.
In order to appropriately assign tasks to AI models, Apple Intelligence must consider the tradeoff between the model complexity enabled by Cloud-Hosted Gen AI and the reduced latency enabled by Edge-Hosted Gen AI. Given the broad range of tasks Apple Intelligence can handle, the model size and complexity these tasks require are similarly diverse. While simply leveraging the largest and most computationally expensive Gen AI models would maximize the chances that user queries are accurately and acceptably answered, this methodology would come with significant drawbacks. Because cutting-edge Gen AI models cannot fit onto local devices and require more compute, their optimized response quality comes at the expense of increased latency, security risks, barriers to accessibility, and marginal costs. These costs are either incurred by the product team or passed on to the customer, as will be detailed further in an upcoming article (coming soon).
For many Gen AI output responses, optimizing primarily for response quality is unnecessary. In fact, for the majority of Gen AI scenarios, responses that successfully fulfill a user’s use case can be achieved by non-cutting-edge models. Once an answer reaches this “Acceptability Threshold”, there are diminishing returns from any marginal improvements in response quality, making it best to instead optimize for the aforementioned additional variables (performance, marginal costs, offline accessibility, and privacy/security).
Apple’s Gen AI integration follows this schema through Model Routing (via a layer of Orchestration). In this context, Orchestration refers to a Gen AI model call that identifies the user’s intent and then uses that intent to route the prompt to a specific model (i.e. Model Routing). The Orchestration layer aims to determine the smallest Gen AI model that can reliably output an acceptable response, thus successfully completing a user’s use case while optimizing for performance, marginal costs, offline accessibility, and privacy/security. For more information about Model Routing and Orchestration, reference this article.
To illustrate Apple Intelligence’s Model Routing methodology, as well as the overall benefits and drawbacks of Gen AI Edge Computing, see the diagram below:
Note that the concept of an “Acceptability Threshold” used in this article is a simplification. In reality, different Gen AI queries have varied needs that cannot be encapsulated in a single metric of quality or acceptability. For example, a smaller AI model (which generally clears a lower Acceptability Threshold) may be optimized around text summarization, and thus able to handle complex summarization tasks that a larger, more versatile model could not.
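To make the routing logic above concrete, here is a minimal sketch of an Orchestration layer. It assumes a hypothetical three-tier model lineup and a single scalar quality score per intent (the same simplification the Acceptability Threshold note warns about); none of the names or numbers reflect Apple’s actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    capability: float  # proxy for the quality ceiling the model can reach
    on_device: bool

# Hypothetical tiers, ordered smallest (preferred) to largest.
MODELS = [
    Model("on-device", capability=0.6, on_device=True),
    Model("private-cloud", capability=0.8, on_device=False),
    Model("third-party", capability=0.95, on_device=False),
]

# Hypothetical mapping from detected intent to the quality a query demands;
# in practice the intent itself would come from a Gen AI model call.
REQUIRED_QUALITY = {"summarize": 0.5, "draft_email": 0.7, "open_ended_qa": 0.9}

def route(intent: str) -> Model:
    """Return the smallest model expected to clear the query's
    Acceptability Threshold, preferring on-device execution."""
    required = REQUIRED_QUALITY.get(intent, 0.9)  # unknown intents route big
    for model in MODELS:
        if model.capability >= required:
            return model
    return MODELS[-1]  # final fallback

print(route("summarize").name)       # -> on-device
print(route("open_ended_qa").name)   # -> third-party
```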
On-Device Models
Referencing the above diagram of Apple’s Gen AI implementation, when a user’s prompt is inputted, Apple’s Orchestration layer identifies the prompt’s intent and aims to route it to the most suitable AI model for Inference. Ideally, a query can be handled by the less capable/versatile on-device models, given their lower latency, offline capabilities, and stronger privacy and security.
Due to these benefits, Apple Intelligence utilizes various strategies to compress its on-device LLM as much as possible, maximizing the capabilities (and consequently the Acceptability Threshold) it can reach within the device’s hardware constraints. Key strategies include (but are not limited to) LoRA, Quantization, Pruning (Network Sparsity), and Palettization (Weight Clustering).
Model Parameters
To understand model compression methodologies, it is important to first understand model parameters and their significance. In machine learning, model parameters are internal variables learned from training data that are crucial to the model’s ability to make predictions. These parameters help the model understand and replicate the patterns and structures found in training data, enabling it to generate new, similar content.
For example, in the context of Generative AI (and thus neural networks), “weights” and “biases” are fundamental model parameters. To illustrate these concepts, consider their parallels to the simple linear equation y = mx + b. Weights determine the strength and significance of connections between neurons in different layers of a neural network, dictating how much influence input data has on a model’s output. Thus, weights are analogous to m in a linear equation: the higher a neuron’s weight, the more outsized an impact it has on the output response y. Biases, on the other hand, act as parameters that shift the entire function independently of any input data, thus acting similarly to b in a linear function.
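The sketch below extends the y = mx + b analogy to a single neural-network layer, where a weight matrix W plays the role of m and a bias vector b plays the role of b. The dimensions are arbitrary and purely illustrative.

```python
import numpy as np

# A single layer generalizes y = mx + b to many inputs and outputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))     # weights: influence of each of 3 inputs on 4 outputs
b = rng.normal(size=4)          # biases: shift each output independently of the input
x = np.array([1.0, -0.5, 2.0])  # input data

y = W @ x + b                   # the layer's output (before any non-linearity)
print(y)
```

Large language models stack billions of such parameters, which is exactly why the compression techniques below matter on-device.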
LoRA
One of the most crucial techniques Apple Intelligence employs, Low-Rank Adaptation (LoRA), is used to optimize models for specific tasks without retraining or replicating them. Consequently, instead of deploying multiple function-specific fine-tuned models on a single device, LoRA enables Apple to fine-tune a single base model for multiple tasks without significantly increasing the size of said model (see below diagram):
At a high level, LoRA works by quickly loading a single AI model with various small task-specific “adaptors” at the time of Inference. When paired with an adaptor, the base model becomes fine-tuned for a specific task (via small low-rank updates to the relevant subset of the model’s parameters), thus reaching a higher Acceptability Threshold for particular functions.
For example, assume that a user wants to use Apple Intelligence to summarize a complex text. First, Apple’s on-device Orchestrator identifies the inputted query’s intent as “Summarization”. Next, Apple Intelligence’s on-device model is loaded with a “Summarization” adaptor. This adaptor updates the base model’s parameters, adjusting the weights and biases most relevant to the model’s text-summarization training.
While the on-device LLM may have previously been incapable of producing an acceptable response, it now becomes more optimized for the task (potentially at the expense of the base LLM’s other capabilities). Thus, without significantly increasing the size of this base model, the Acceptability Threshold can now be reached.
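Below is a minimal numerical sketch of the low-rank update pattern that LoRA is built on. The dimensions are arbitrary, and this shows the general technique rather than Apple’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512   # hypothetical hidden dimension
r = 8     # adapter rank; r << d is what keeps adapters tiny

W_base = rng.normal(size=(d, d))   # frozen base-model weight matrix

# A task-specific adaptor stores only two low-rank factors, so it holds
# 2*d*r = 8,192 values instead of d*d = 262,144 (~3% of the base matrix).
A = rng.normal(size=(r, d)) * 0.01
B = rng.normal(size=(d, r)) * 0.01

def adapted_forward(x: np.ndarray) -> np.ndarray:
    """Apply the base weight plus the low-rank update: (W + B @ A) @ x.
    Swapping in a different (A, B) pair at Inference time re-specializes
    the same base model for a different task."""
    return W_base @ x + B @ (A @ x)

x = rng.normal(size=d)
print(adapted_forward(x).shape)   # (512,)
```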
Quantization, Pruning, and Palettization
Quantization, Pruning, and Palettization all involve manipulating a model’s parameters to compress its size (at the risk of response-quality degradation). In contrast to LoRA, which updates model parameters at the time of Inference, these three techniques alter the fundamental base model itself. For a more detailed overview of these techniques, reference this paper.
Quantization works by reducing the precision of model parameters (e.g. from 32-bit to 8-bit), thus enabling the parameters to occupy less memory than before. The impact of Quantization can be compared to reducing the diameter of a basketball hoop: while the LLM was trained to “play” on a normal hoop, once quantized post-training, the model executes Inference with a slightly smaller hoop. Most “shots” will go in (i.e. produce acceptable output responses), but some will miss, causing minor response-quality degradation.
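Here is a minimal sketch of post-training quantization, using a symmetric 8-bit scheme on a dummy weight tensor. This is one of several possible schemes, and not necessarily the one Apple uses.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=1000).astype(np.float32)   # 32-bit parameters

# Symmetric quantization: map floats onto the integer grid [-127, 127],
# storing a single float scale factor for the whole tensor.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# At Inference, the int8 values are mapped back (the "smaller hoop").
dequantized = q.astype(np.float32) * scale
print(f"memory: {weights.nbytes} -> {q.nbytes} bytes (4x smaller)")
print(f"mean absolute error: {np.abs(weights - dequantized).mean():.5f}")
```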
Instead of reducing the precision of model parameters, Pruning outright removes unnecessary weights, thus minimizing the base model’s size. Pruning’s response-quality degradation risks are intuitive, since the removed weights may be valuable for processing an inputted prompt. However, if done carefully, Pruning seeks only to remove parameters irrelevant to the prompts a model will receive (assuming the Orchestration layer successfully routes only prompts that said pruned model can handle).
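The simplest form of this idea is magnitude pruning, sketched below on a dummy weight matrix. Production systems use more sophisticated removal criteria, but the principle is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))

# Magnitude pruning: zero out the weights with the smallest absolute
# values, on the assumption that they contribute least to the output.
sparsity = 0.5                                  # drop half of the weights
threshold = np.quantile(np.abs(W), sparsity)
W_pruned = np.where(np.abs(W) > threshold, W, 0.0)

print(f"zeroed: {(W_pruned == 0).mean():.0%} of weights")
# Stored in a sparse format, only surviving weights (plus indices) are kept.
```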
Palettization, inspired by an artist’s palette, involves mapping many similar weights onto a single shared value, thus reducing the number of distinct weights a model must store without significantly affecting output responses. Although the grouped weights are often nearly indistinguishable in practice, this technique can nevertheless lead to a loss in precision and accuracy.
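A minimal sketch of the idea, clustering a dummy weight tensor into a 16-entry palette with a simple 1-D k-means. Apple’s tooling is more sophisticated, but the compression principle is the same: each weight is stored as a small index into a shared lookup table.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=4096)
k = 16   # palette size: 16 entries means each weight needs only a 4-bit index

# Simple 1-D k-means: each weight is replaced by its nearest palette entry.
palette = np.quantile(weights, np.linspace(0.0, 1.0, k))   # initial centroids
for _ in range(10):
    indices = np.abs(weights[:, None] - palette[None, :]).argmin(axis=1)
    for j in range(k):
        members = weights[indices == j]
        if members.size:
            palette[j] = members.mean()

palettized = palette[indices]
print(f"distinct values: {np.unique(weights).size} -> {np.unique(palettized).size}")
print(f"mean absolute error: {np.abs(weights - palettized).mean():.5f}")
```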
Server Models
If a user’s inputted query cannot be acceptably processed by Apple’s on-device models, Apple Intelligence’s Orchestrator can alternatively route requests to its Server models, which Apple calls “Private Cloud Compute”. Note that these models are hosted on the cloud (and thus do not constitute Edge Computing), but provide similar benefits due to Apple’s emphasis on privacy and security.
By running proprietary AI models on its own servers, Apple retains more control than when requests are routed to third-party servers and models. As a result, it is likely that Apple Intelligence’s Server models offer improved performance/latency, privacy/security, model size, and compute costs (versus third-party infrastructure). Apple has publicly emphasized the importance of privacy and security within its Gen AI implementation, going as far as to pay up to $1 million to anyone who can hack into their servers.
Since this is an Apple-specific implementation, this overview is partially speculative. Latency may be reduced for two reasons: the use of Apple Silicon and lower server traffic. Apple’s proprietary chips, called Apple Silicon, power its Server models, and this tight ecosystem integration likely optimizes performance. Additionally, while Apple devices are widely used, ideally most requests can be handled by on-device models. Thus, the overall volume of queries routed to the company’s servers may be smaller than the total amount of Gen AI traffic going to third-party infrastructure.
Regarding Apple’s Server models’ sizes and compute costs: third-party models are expected to reach an Acceptability Threshold for as many inputted prompts as possible, whereas Apple’s Server models are only routed to if on-device models are not suitable. As a result, these models can be compressed accordingly. Notably, Apple’s Server models also leverage LoRA, further compressing them (and thus minimizing compute costs). In addition to optimizing performance, Apple’s tightly-coupled infrastructure likely introduces model-size and compute-cost improvements as well. For a full overview of Apple’s Server models, reference this article.
However, Apple’s Private Cloud Compute comes with limitations. Its models, while capable, do not produce cutting-edge response quality (due both to competitors’ excellence in research and infrastructure, and to Apple’s privacy emphasis likely coming at the expense of response-quality optimizations). Thus, if its Server models cannot reach an inputted prompt’s Acceptability Threshold, Apple enables prompts to be routed to various third-party models as a final fallback.
World Models
When necessary to reach an Acceptability Threshold (and when enabled by users), Apple Intelligence’s Orchestration layer can route queries to various third-party models hosted on third-party servers, such as OpenAI’s GPT. While this integration is currently free, Apple may eventually package certain World Models features into one of its “Services” SKUs, charging users a monthly fee to generate revenue and offset its marginal compute costs.
These models are generally optimized for response quality across all possible user prompts. As a result, their Acceptability Thresholds are above those of Apple’s on-device and Server models. Still, as outlined above, World Models have significant drawbacks around performance/latency, privacy/security, and model-size/compute-costs. Therefore, Apple Intelligence only leverages these models as a fallback option to its proprietary Gen AI solutions.
Enterprise Applications of Gen AI Edge Computing
The aforementioned benefits of Edge Computing are pronounced in B2B contexts, enabling enterprises to realize decreased marginal compute costs and amplified real business value. For a deep dive into the implications of Gen AI Edge Computing for B2B pricing, reference our upcoming article (coming soon).
However, in contrast to Apple Intelligence, which can in theory function with entirely Cloud-Hosted Gen AI, many B2B contexts necessitate Edge Computing to achieve real business value. Simply put, many B2B Gen AI products are only viable if paired with real-time latency, iron-clad privacy/security, and/or offline capabilities.
The remainder of this article walks through various use cases within the Military industry, in which Edge Computing is particularly crucial to Gen AI products’ business viability, as an emblematic and generalizable case study. Specifically, in Military contexts and beyond, Edge Computing turns Gen AI products from promising tech demos into real-world value.
Military Industry
While many industries’ Gen AI products may necessitate Edge Computing due to their latency, offline accessibility, and privacy/security requirements, the Military industry is especially sensitive to these variables. Not only do military scenarios often involve IoT hardware, real-time decision making, sensitive data, and poor internet connections, but they also necessitate the types of complex communication and strategic thinking for which Gen AI is particularly useful. For a thorough overview of AI applications in the Military, reference this article.
As a result, Military Gen AI applications serve as emblematic case studies that can be applied selectively to other industries, such as finance, smart city planning, and agriculture (among many others), which share a subset of the military’s characteristics.
IoT Hardware
Military technology advancements have spurred the use of IoT (Internet of Things) devices: pieces of hardware embedded with sensors, software, and network connectivity. These devices, including hardware like vehicles, drone swarms, cameras, and sensors, are enhanced with the functionality of computers, giving them the potential to leverage Gen AI capabilities.
For example, consider the use case of Drone Swarms, which leverage AI to collect data/commands, disseminate them to all other drones in the swarm, and act on said information accordingly. A key Silicon Valley player in this space is Anduril, whose air dominance autonomy software, Lattice for Mission Autonomy, controls decision-making, coordination, and action for multiple aircraft simultaneously, headed by a human “quarterback” who provides intent that the software interprets into discrete tasks (Source).
These “discrete tasks” are key to Gen AI’s value. Without Gen AI, humans would have to command Drone Swarms in a uniform, explicit manner, which is not only unnatural to commanders used to delegating to human pilots, but also wastes valuable time. With Gen AI, Drone Swarms can receive, process, and communicate information much as human pilots do, via voice and text. In addition to avoiding human casualties, Lattice can execute repeatedly without getting bored, tired, or emotional in often distressing situations.
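Anduril has not published Lattice’s internals, so the sketch below is purely illustrative: it shows the general intent-to-discrete-tasks pattern, with a hard-coded stub standing in for the on-device Gen AI call, and hypothetical task names throughout.

```python
from dataclasses import dataclass

@dataclass
class Task:
    drone_id: int
    action: str
    target: str

def interpret_intent(commander_intent: str, drone_ids: list[int]) -> list[Task]:
    """Hypothetical sketch: an edge-hosted model would parse free-form
    commander intent into discrete, per-drone tasks. A canned decomposition
    stands in for the model call here."""
    actions = ["scout", "relay_imagery", "overwatch"]  # illustrative decomposition
    return [
        Task(drone_id=d, action=actions[i % len(actions)], target="ridge")
        for i, d in enumerate(drone_ids)
    ]

for task in interpret_intent("Screen the ridge and relay imagery", [1, 2, 3]):
    print(task)
```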
Since perfect GPS and internet connectivity are not realities in live military environments, Edge Computing is crucial to enabling Gen AI in IoT devices such as Drone Swarms. Additionally, these drones need to act upon real-time information and commands, making any latency potentially detrimental.
Another example of Military IoT Hardware being enhanced by Gen AI concerns the monitoring of medical devices that provide Casualty Care. Without Gen AI, when an injured soldier is connected to a host of medical devices, military medical personnel must manually evaluate, diagnose, and treat them. However, this is often a highly emotional and high-stress environment, in which split-second decisions may be clouded as a result.
Gen AI can aid significantly when paired with Military medical IoT devices. Rather than force emotionally distressed medical personnel to weave together multiple sources of information to make medical analyses of soldiers, Gen AI can provide diagnostic information, analyses, and recommendations in natural language, without emotional considerations as a hindrance. While humans do need to remain in the loop for medical decisions, as their emotional faculties are valuable for making appropriate judgments, Gen AI’s recommendations deliver vital background information.
However, if cloud-hosted AI models were used for these two examples, significant complications and limitations would arise, precluding real enterprise value. Medical patient information (particularly in Military contexts) can be highly volatile; if IoT-collected information must travel to cloud-hosted servers, the resultant analyses may come too late. This is only heightened by the more remote locations of military operations, which may have subpar internet connectivity (or even be fully offline, meaning that cloud-hosted Gen AI cannot take place at all).
While Edge Computing is particularly necessary within Military IoT Gen AI contexts, these examples can also be applied to other industries. For example, IoT cameras and sensors are used in both the Smart Cities and Agriculture industries. Gen AI could allow these IoT devices to quickly recognize and respond to collected information, such as a camera recognizing crop diseases or a city energy outage. If acted upon quickly by Gen AI (either by providing fast AI-produced analyses to humans or by performing actions autonomously), the economic cost of those situations may be minimized.
As in the prior Military IoT examples, these Smart City and Agriculture scenarios necessitate Edge Computing to enable real-time latency, offline access (especially in the Agriculture case), and privacy/security (especially in the Smart City case).
Overall, IoT hardware involves the collection of (often dynamic and volatile) data that is essential to the resultant analyses Gen AI can perform and communicate. As a result, when said IoT-collected data changes in real time, is gathered in poor-internet (or offline) areas, and/or involves sensitive information, Edge Computing becomes highly important to realizing real enterprise value.
Applications Beyond Hardware – Data Processing & Research
Edge Computing can also be necessary for Gen AI use cases that do not involve any IoT hardware. While there may not be hardware-collected data to inform these use cases’ Inferences, non-IoT use cases may nonetheless require offline (or poor-internet) accessibility and real-time latency, and/or involve sensitive information.
For instance, the DoD collects a wealth of data, and Gen AI can unlock the full extent of the strategic recommendations and multimodal analyses this data enables. Military operations today include various sensors that enable commanders to access text, images, video, radio signals, and other data assets. “However, each data type is often analyzed separately, leaving human analysts to draw – and potentially miss – connections” (Source).
This is where Gen AI can prove enormously helpful. While, for instance, Navy systems can identify a nearby ship from multiple standpoints (imaging, radio signals, etc.), Gen AI can provide multimodal insights by analyzing these different data streams together for complementary information. For example, the ship may stop emitting radio signals, suggesting that enemy soldiers may be preparing for battle. A lull in storm conditions (shown through collected weather reports) could confirm this hypothesis and result in strategic recommendations. “This enhanced situational awareness is only possible if real-time analysis happens at the edge instead of sending data to a central location for processing” (Source).
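Here is a minimal sketch of what edge-side fusion could look like: recent readings from separate sensor streams are packed into a single prompt for a locally hosted model. The sensor names, readings, and `local_model` stub are all hypothetical and reflect no actual DoD system.

```python
def local_model(prompt: str) -> str:
    """Stand-in for an edge-hosted Gen AI Inference call."""
    return f"[edge-model assessment of {len(prompt)} chars of fused sensor data]"

# Hypothetical snapshot of per-sensor readings that would otherwise be
# analyzed in isolation.
sensor_snapshot = {
    "imaging": "vessel holding position, 3 km NE",
    "radio": "emissions ceased at 04:12Z",
    "weather": "storm lull expected for the next 40 minutes",
}

prompt = "Assess the combined readings and flag anomalies:\n" + "\n".join(
    f"- {sensor}: {reading}" for sensor, reading in sensor_snapshot.items()
)
print(local_model(prompt))
```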
Gen AI’s detail-oriented, computational, and emotionless data analysis can be extremely useful. However, the data inputs needed to produce these strategic recommendations may contain sensitive information that, if intercepted while traveling to data centers, would pose a security threat. The case for Gen AI Edge Computing here is bolstered by the often poor-internet environments of military operations, which make cloud-hosted LLMs risky and potentially ineffective even when real-time analysis isn’t strictly required.
Gen AI Edge Computing use cases in other B2B industries also often do not involve IoT devices, yet Edge Computing remains essential to them. One of Carya’s portfolio companies, Sanas, is an example of this. Sanas improves the satisfaction of both customer support workers and their end users through AI-powered accent translation and background-noise elimination. Because these conversations require real-time latency (customer support calls with noticeable lag deliver unacceptable customer experiences), Edge Computing is required to produce a functioning product. For a deeper view into Voice AI and its relation to Edge Computing, reference Carya’s upcoming series on this topic (coming soon).
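This is not Sanas’s implementation, but the sketch below shows the general real-time pattern such products depend on: audio is processed locally, one short frame at a time, so no network round trip is ever added to a live call. The frame size and `enhance_frame` stub are illustrative assumptions.

```python
import numpy as np

SAMPLE_RATE = 16_000
FRAME_MS = 20
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000   # 320 samples per 20 ms frame

def enhance_frame(frame: np.ndarray) -> np.ndarray:
    """Stand-in for an on-device enhancement model (e.g. noise suppression
    or accent translation); a real system would run a compact neural net."""
    return frame * 0.9

# Frame-by-frame loop: because the model runs at the edge, each frame is
# processed as it arrives, keeping added latency within a single frame.
stream = np.zeros(SAMPLE_RATE, dtype=np.float32)   # 1 second of dummy audio
for start in range(0, len(stream), FRAME_SAMPLES):
    _ = enhance_frame(stream[start:start + FRAME_SAMPLES])

print(f"processed {len(stream) // FRAME_SAMPLES} frames of {FRAME_MS} ms each")
```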
Both within and outside of IoT Hardware use cases, Gen AI’s content generation proves both valuable and truly disruptive to many industries. However, for enterprise applications that process sensitive information and cannot tolerate lag, Edge Computing is most likely a necessary addition.
While Apple Intelligence was many users’ first exposure to Gen AI Edge Computing, the technology will nonetheless be a powerful part of Gen AI’s story in B2B contexts. Especially as Gen AI advancements bring previously cutting-edge capabilities directly onto IoT Hardware (as well as traditional smart devices), we’re excited about the real enterprise value enabled by hosting AI models on the edge. For an investigation into Gen AI Edge Computing’s implications for B2B Gen AI pricing, reference our upcoming article (coming soon).