Speech Transformation: A Subset of Voice AI through the Lens of Sanas
When people think about Voice AI, what often first comes to mind are AI-generated phone calls and personal assistants. The technology that makes this possible, Speech Generation, powers experiences that seemingly pass the Turing test, fueling the sentiment that AI is coming for our jobs and leading to advertisements about how AI won’t complain about work-life balance.
However, while Speech Generation is rising in prominence, there’s another often-overlooked subsection of Voice AI, Speech Transformation, that seeks to augment human workers’ communications rather than replace them. Despite the hype surrounding Speech Generation, Carya remains excited about Speech Transformation’s long-term B2B utility as a field of Voice AI, one that also presents unique technical factors worth investing in.
This article explores Speech Transformation through the lens of the startup Sanas. Sanas, founded by Carya's GPs, Andrés and Sharath, initially leveraged Speech Transformation to enable real-time accent translation within call centers, before broadening to all aspects of speech understanding.
First, we’ll overview Sanas’ inception and what led it to become a Speech Transformation company, before outlining Sanas’ future direction as a horizontal company enhancing human-to-human communication through Speech Transformation infrastructure.
Speech Transformation Overview
Speech Transformation, a subsection of Voice AI, focuses on modifying existing speech in real time, as opposed to Speech Generation, which generates speech from scratch. Cartesia, for instance, which provides APIs for cloning voices by ingesting real voice audio, is a Speech Generation infrastructure company.
Enterprises can use Cartesia’s APIs to turn text into authentic-sounding AI-generated speech for a variety of use cases, including automating customer support experiences, powering video game NPCs, screening candidates with AI-powered voice interviews, and generally turning any LLM-generated text into dialogue.
Speech Transformation, conversely, does not seek to mimic human dialogue, but to enhance it. Microsoft Teams meetings, for instance, provide an option to turn on “noise suppression”. This feature aims to reduce common background noises found in work meetings like barking dogs or shuffling papers. Unlike Speech Generation, no new voices are outputted. Instead, Speech Transformation outputs a modified version of the voices being inputted.
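Mechanically, the simplest version of this idea is an energy gate: frames whose energy falls below a threshold are attenuated, while speech frames pass through unchanged. The sketch below is purely illustrative (real suppressors like Teams’ use learned models; the threshold and attenuation values here are arbitrary assumptions):

```python
# Minimal noise-gate sketch (illustrative only, not how Teams' suppression works).
# Low-energy frames (background rustle) are attenuated; speech frames pass through.

def noise_gate(frames, threshold=0.01, attenuation=0.1):
    """frames: list of lists of float samples. Returns gated frames."""
    out = []
    for frame in frames:
        energy = sum(x * x for x in frame) / len(frame)
        gain = 1.0 if energy >= threshold else attenuation
        out.append([x * gain for x in frame])
    return out

speech = [0.5, -0.4, 0.6]        # high-energy "speech" frame
rustle = [0.01, -0.02, 0.015]    # low-energy background frame
gated = noise_gate([speech, rustle])
```

Here the speech frame emerges untouched while the background frame is scaled down by 10x, a crude stand-in for the learned suppression that production systems apply per frequency band.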
Business Value Considerations
While the distinction between Speech Generation and Speech Transformation may not seem very significant, the business values of these infrastructures are considerably different. By seeking to augment, not replace, human workers’ conversations, Speech Transformation has unique value propositions for Enterprises.
Speech Generation Business Value
Speech Generation, at its core, aims to generate a cheaper (at scale) version of acceptable audio conversations. What constitutes an “acceptable” conversation is a subjective threshold worth its own debate. Generally, however, if the cost savings from automating conversations exceed the dissatisfaction costs incurred (i.e. customer churn, reduced revenue, etc.), then that automation can be deemed “acceptable”.
For example, businesses have been using Robotic Process Automation (RPA) to automate basic customer service calls for decades. RPA-powered agents in this context use a limited number of deterministic actions to deliver high-frequency information and perform high-frequency actions to customers.
Take, for instance, a customer calling their bank to ask about the balance of their debit card. The RPA-powered call will, in a robotic voice, present this customer with several options, ask for authentication via keypad, then output this balance, all without human employee involvement.
With many customer support queries being acceptably addressable via RPA, the bank can hire (either via BPO – Business Process Outsourcing – or internally) fewer employees. If human workers handled every basic customer service call, the resultant labor force would dramatically increase the bank’s costs.
Yet, automated conversations, such as RPA-powered customer support queries, are known for significant customer dissatisfaction. As of 2023, studies show that 88% of people prefer to talk with a human customer service agent and 77% feel that customer service chatbots are frustrating. Thus, automating conversations instead of having humans conduct them is financially fruitful, yet risky.
However, Gen AI’s advancements in Speech Generation have significantly increased the complexity of acceptably-automated conversations. Speech Generation leverages Gen AI to produce the content of AI-generated conversations and uses Speech Generation infrastructure to communicate this content in a human-sounding tone. There is still risk in utilizing modern Voice-AI-powered Speech Generation, though this technology has greatly expanded the number of enterprise use cases where automated conversations (and thus their resultant cost savings) can be acceptably leveraged.
Speech Transformation Business Value
Rather than focusing on automation for these human-to-human conversations, Speech Transformation homes in on mitigating humans’ natural speaking flaws to augment the effectiveness, and ultimately the productivity, of professional conversations. This is a fundamental mindset shift: instead of automating professional conversations to an acceptable threshold as cost-effectively as possible, Speech Transformation aims to optimize high-value human-performed conversations as much as possible.
While simple inquiries like checking a card balance may be worth the churn and dissatisfaction risk of automation, as voice agents move to higher-value use cases, the bar becomes very high for these agents to perform reliably.
Bessemer Venture Partners’ Voice AI Roadmap, in order to illustrate the challenges of industry adoption in regards to high-value use cases, walks through an example of a small roofing company. While this company, “...might happily employ an agent to field inbound customer calls after hours when they have no alternative…, they may be slow to move to a voice agent as the primary answering service…where each customer call could represent a $30k project…, as customers may have very little tolerance for an AI agent that fumbles a call and costs them a valuable lead.”
However, BVP’s Voice AI Roadmap does not mention Speech Transformation, which is optimized specifically for these high-value and high-complexity human conversations. For the foreseeable future, there will always be a need for human-to-human conversations in professional settings, whether they’re (customer/patient/client)-to-employee or employee-to-employee interactions.
Technical Considerations
Latency & Concurrency
While Latency, the time delay between an action and its response, is a consideration for both Speech Generation and Speech Transformation, it is arguably much more important for the latter. Within Speech Generation, Latency manifests as a slight delay within a human-to-AI conversation. Take, for example, ChatGPT’s Advanced Voice Mode, which is still useful despite the pauses between turns and the AI often interrupting users.
In Speech Transformation, on the other hand, Latency is significantly more debilitating. Since Speech Transformation modifies a user’s voice in real time, the Latency-caused lag between this voice and the “modified” (i.e. transformed) voice is very noticeable. Studies support this, showing that even a marginal 250ms increase in Latency can be detrimental to conversations.
This is largely because Speech Generation, while it produces outputs from human speech inputs, does so after the fact. Speech Transformation, by contrast, must modify inputted speech in real time, producing its outputs concurrently with the incoming speech in order to preserve the conversation’s flow.
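To make the budget concrete, the end-to-end delay of a streaming transformer can be decomposed into algorithmic latency (frame size plus model lookahead), compute time, and network time. The figures below are illustrative assumptions, not Sanas’ actual numbers; only the roughly 250ms threshold comes from the studies cited above.

```python
# Hypothetical latency-budget check for a streaming speech transformer.
# Frame size, lookahead, compute, and network figures are illustrative assumptions.

FRAME_MS = 20          # audio processed in 20 ms frames
LOOKAHEAD_FRAMES = 2   # model waits for 2 future frames of context
COMPUTE_MS = 30        # per-frame inference time
NETWORK_MS = 60        # round-trip transport

def algorithmic_latency_ms(frame_ms, lookahead_frames):
    # A frame cannot be emitted until the lookahead context has arrived.
    return frame_ms * (1 + lookahead_frames)

def end_to_end_latency_ms(frame_ms, lookahead_frames, compute_ms, network_ms):
    return algorithmic_latency_ms(frame_ms, lookahead_frames) + compute_ms + network_ms

total = end_to_end_latency_ms(FRAME_MS, LOOKAHEAD_FRAMES, COMPUTE_MS, NETWORK_MS)
print(f"end-to-end latency: {total} ms")  # 20*3 + 30 + 60 = 150 ms
assert total < 250, "over the ~250 ms conversational threshold"
```

The arithmetic shows why lookahead is the scarcest resource: each extra frame of future context adds a full frame duration of unavoidable delay before compute and network costs are even counted.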
Speech Maintenance
Speech Maintenance, which concerns preserving human-to-human voice input characteristics (such as intonation, identity, emotion, and reciprocity), is another category of technical considerations that impact Speech Transformation.
While Speech Generation does strive to produce AI voices that emulate these characteristics (either by creating a consistent voice or by cloning someone’s voice from pre-ingested audio), it creates them in a bespoke way. Speech Transformation, conversely, must preserve a human’s characteristics as they unfold in a conversation. For example, if a user switches their conversational style mid-conversation, or another voice is added to the input, Speech Transformation must carry these conversational curveballs through to its output.
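One way to make “preservation” concrete is to compare a prosodic contour of the input voice against the transformed output. Frame energy is used below as a crude stand-in for intonation; this is an illustrative sketch, not Sanas’ method.

```python
# Illustrative proxy for Speech Maintenance: a transform that preserves
# intonation should keep the shape of the input's prosodic contour.
# Frame energy stands in for intonation here; real systems would track pitch.

def frame_energies(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's energy."""
    return [sum(x * x for x in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def contour_correlation(a, b):
    """Pearson correlation between two equal-length contours (1.0 = same shape)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# A transform that rescales loudness but keeps the intonation shape intact
# should still score near 1.0.
inp = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5]
out = [x * 0.8 for x in inp]  # stand-in "transformed" signal
score = contour_correlation(frame_energies(inp, 2), frame_energies(out, 2))
```

A correlation near 1.0 indicates the contour’s shape survived the transform even though absolute levels changed, which is exactly the property a curveball like a mid-conversation style switch would test.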
Speed Modification
Speed Modification, which (for the purposes of this article) refers to changing the speed of speech in real-time so that voices can be better understood, is unique to Speech Transformation. Note that Sanas does not currently have Speed Modification infrastructure, though this technical consideration does present significant Speech Transformation utility.
For example, if AI recognizes that important information such as a phone number is mentioned during a human-to-human conversation, Speed Modification infrastructure could slow down this portion of the speech to ensure the recipient understands it. Additionally, if AI recognizes that a user has a stutter or other speech impediment that slows down a conversation, this infrastructure could speed up the user’s dialogue to enhance understanding.
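A minimal way to sketch the idea above is a rate planner: transcript spans flagged as important get a slower playback rate, with the actual stretching delegated to a pitch-preserving algorithm such as WSOLA. All function names and rate values below are hypothetical, since Sanas does not currently ship this capability:

```python
# Hypothetical playback-rate planner for Speed Modification (illustrative only).
# Spans flagged as important (e.g. a phone number) are slowed; the rest play normally.

def plan_rates(segments, slow_rate=0.75, normal_rate=1.0):
    """segments: list of (start_s, end_s, is_important) tuples.
    Returns (start_s, end_s, rate) triples; a real system would feed these
    to a pitch-preserving time-stretcher such as WSOLA."""
    return [(s, e, slow_rate if important else normal_rate)
            for s, e, important in segments]

def stretched_duration(segments_with_rates):
    # Playing a span at rate r takes (e - s) / r seconds.
    return sum((e - s) / r for s, e, r in segments_with_rates)

# 5 s of speech where seconds 2-3 contain a phone number.
plan = plan_rates([(0.0, 2.0, False), (2.0, 3.0, True), (3.0, 5.0, False)])
print(stretched_duration(plan))  # ~5.33 s: only the important second is stretched
```

The design point is that slowing only the flagged span costs a third of a second here, rather than the full 25% penalty of slowing the whole call.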
Sanas’ Inception
Customer Experience’s (CX’s) Gen AI Opportunity
To understand Sanas and its inception, it’s first important to understand the unique Gen AI opportunity within Customer Experience (CX). While CX is a broad concept that encapsulates all the ways a customer interacts with a company and how they perceive those interactions, for the purposes of this article, CX refers to the customer service interactions within a company’s UX (User Experience).
While some companies’ customer services are handled in-house, many are outsourced to Business Process Outsourcing services (BPOs). Sanas was born due in part to the recognition that CX BPOs are especially willing to adopt Gen AI products. Namely, CX shifting to a value/revenue center has made customer service BPOs a Gen AI first mover from a UX perspective. Additionally, BPOs’ thin margins (due primarily to human labor costs) bolster Gen AI’s adoption from a cost perspective.
CX BPOs – Gen AI First Movers
BPOs have significantly higher AI adoption rates compared to other industries, with 70% of BPOs using AI in their contact centers (i.e. customer service function) compared to only 36% of all organizations. This adoption is poised to increase dramatically, with Gartner anticipating that AI use in BPO services will increase more than six-fold over the next 2 years.
CX – from a Cost Center to a Value Center
As of late, CX is largely transforming from a cost center to a value center. In other words, instead of viewing CX as a necessary, yet expensive commodity, businesses are increasingly viewing CX as a revenue-generating differentiator worth investing in.
Recent data backs these claims up. Qualtrics’ 2025 Consumer Trends Report outlines a decrease in brand loyalty, with 53% (a considerable increase from 2024) of bad experiences resulting in customers cutting their spend. Previously, brand loyalty, achieved in part by marketing, was a more significant factor that could allow products to thrive despite subpar CX. Now, however, CX is growing in importance. Note that while this report defines CX holistically, customer service is nonetheless a significant factor in customers’ perceptions of overall CX.
Additionally, a 2022 Accenture report found that “...companies which view customer service as a value center, rather than as a cost center, achieve 3.5x more revenue growth…”. The same report also details that companies “...are seeing 10x+ higher revenue growth the more involved their service organization is in the development of new products.”
Overall, customer service’s increasing power both to sway purchasing behavior and to inform product development via insights has powered significantly higher revenue growth for companies that treat it as a revenue opportunity rather than a commodity. As a result, Gen AI’s potential to drive revenue within CX has been very well received, making CX BPOs a first mover for this product category.
Human Labor Costs & Margins
However, while solely focusing on CX from a cost perspective isn’t ideal, especially for CX BPOs, costs are nonetheless a considerable factor when adopting Gen AI solutions. Generally, BPOs’ largest expenses are labor costs. With over 5,000 hourly employees working at any given time, which is typical for larger BPOs, it’s easy to see how significant this marginal cost can be (even a $1 increase in hourly salary translates to over $5,000/hour in expenses). As a result, BPOs often have tight margins in which efficiency is crucial to profitability.
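Using the article’s figures, the sensitivity is easy to quantify; the 24/7 annualization below is our own illustrative assumption, not a claim about any particular BPO:

```python
# Back-of-the-envelope BPO labor-cost sensitivity, using the article's figure
# of 5,000 concurrent hourly agents. Round-the-clock coverage is an assumption
# made purely for illustration.

CONCURRENT_AGENTS = 5_000
HOURS_PER_YEAR = 24 * 365  # assumed 24/7 operation

def extra_cost(raise_per_hour, agents=CONCURRENT_AGENTS):
    """Return (extra cost per operating hour, extra cost per year)."""
    hourly = agents * raise_per_hour
    return hourly, hourly * HOURS_PER_YEAR

hourly, annual = extra_cost(1.00)
print(f"${hourly:,.0f}/hour, ${annual:,.0f}/year")  # $5,000/hour, $43,800,000/year
```

Even a one-dollar wage change compounds to tens of millions of dollars annually under these assumptions, which is why small per-call efficiency gains matter so much to BPO margins.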
Consequently, BPOs see Gen AI solutions as potentially powerful mechanisms to improve margins, thus adopting them as first-movers in the space. While cost-cutting Voice AI solutions seemingly only concern Speech Generation (via replacing certain conversations with AI), Speech Transformation can also improve a BPO’s cost efficiency by augmenting the effectiveness of a human-to-human conversation. If a human customer service agent can resolve a customer’s issue in less time with fewer redirections, this too reduces a BPO’s costs.
The Need for a Human CX/Customer Service Workforce
Complex Queries
While this article has touched on the need for CX BPOs and human-led customer service centers, it’s important to highlight this need as a driving motivation for Sanas’ inception. Sanas’ co-founder (and Carya’s GP) Sharath Keshava highlights this need in his Medium post.
In this article, he asserts that, especially in more sensitive sectors like finance or healthcare (where empathy, trust, and complex decision-making matter most), Speech Generation and AI will handle routine inquiries and automate low-level support, freeing human agents to focus on more complex interactions.
While AI is poised to take over Tier 1 and Tier 2 support queries, there will still be a need for human agents to tackle more complex scenarios. Particularly in high average-order-value (AOV) transactions (like the previously-referenced roofing company scenario), customers spending significant amounts of money will largely prefer a human touch.
See the below diagram for definitions of the “Tiers” of customer support queries (Note that they are sometimes referred to as “Levels” from “L1” to “L5”, instead of Tier 0 to Tier 4):
More Contact Points
Additionally, while one could argue that automating low-level support queries is a negative sign for BPOs (as it reduces the overall TAM), Keshava actually argues the opposite in the same post, claiming that, “An important consideration is that increased automation will likely lead to more customer touchpoints…Today, 70% of customer interactions are via [human-agent, complex] voice, and while this percentage might decrease to, let’s say, 30% in the future, the sheer volume of that 30% could be as significant as the current total — if not double.”
As a result, the increased number of customer touchpoints a product has will create more opportunity for complex, human-led customer service opportunities than ever before, despite simple queries being automated away.
Regulations
Finally, current and predicted regulation efforts may prevent or reduce automated conversations in certain jurisdictions and verticals. Potential examples include the California Consumer Privacy Act, which gives consumers opt-out rights with respect to businesses’ use of “automated decision-making technology”, as well as Gartner’s prediction that by 2028, the EU will mandate the “Right to Talk to a Human” in customer service.
Overall, despite advancements in Voice AI technology, human-to-human customer support is here to stay, freeing human agents to tackle complex queries and leading to a better holistic CX.
Speech Transformation & Sanas Enhancing Speech Understanding
Customer Service’s Miscommunication Problem
Up until this point in the article, improving human-to-human conversations via Speech Transformation has been described without detail. To understand how these high-value conversations can be augmented with Voice AI, it’s important to first understand the problem being solved: Miscommunication.
In human-to-human business conversations (especially those involving cross-region BPOs), dialectal differences, cultural barriers, and plain human error may occur. If something is mispronounced, not correctly heard, or misinterpreted, this could be costly to a business. Even if there are no outright ‘errors’ in a conversation, an unpleasant CX could cause customer dissatisfaction, churn, or even lost contracts.
This is where Speech Transformation is particularly helpful. By modifying an employee’s voice in real time to mitigate miscommunication, businesses can deliver a more satisfying CX, driving successful business outcomes.
Accent Translation
Sanas’ first Speech Transformation product for CX BPOs was “Accent Translation”. Within outsourced customer support centers, the accents of human agents often cause miscommunications and general dissatisfaction due to accent biases. This miscommunication and bias also negatively impact customer service agents, who may consequently face hostility from customers and more limited career opportunities.
By leveraging Sanas’ Accent Translation Speech Transformation infrastructure, BPOs are able to modify human agents’ speech in real time to be better understood by their customers, all while maintaining the agents’ intonation, identity, emotion, and reciprocity. Sanas’ adoption has driven great results for both customers and agents, resulting in improved metrics regarding call times, customer satisfaction, and sales transfers:
Sanas’ Accent Translation product is only possible through Speech Transformation, due to the necessity of real-time speech modification and the maintenance of a human agent’s in-the-moment speech characteristics (intonation, identity, emotion, reciprocity, etc.).
If Speech Generation were employed here, there would not only be noticeable lag in conversations, but there would also be preset speech characteristics that may not match the conversations’ dynamics. In other words, regardless of lag, Speech Generation would result in a generic, Americanized voice that does not preserve an agent’s humanity, thus preventing them from giving the best possible care.
Background Noise Cancellation
Following Sanas’ Accent Translation product, they released a free, background voice and noise cancellation capability. In addition to providing advancements on existing competitor technology, this Speech Transformation product pairs with Sanas’ Accent Translator to give customer support centers a complete speech modification solution at no extra cost.
Sanas’ Future – Horizontal Speech Transformation
With both its Accent Translation and Background Noise Cancellation offerings, Sanas achieved its initial success by reducing miscommunication within customer service BPOs. However, Sanas’ Speech Transformation infrastructure has horizontal utility that extends to domains beyond customer service and BPOs.
At its core, Speech Transformation infrastructure augments human-to-human conversations so that their participants can better understand each other. As a result, Sanas’ technological developments may have utility wherever natural flaws in human speech can impact customer satisfaction and successful business outcomes.
Potential Verticals with Speech Transformation Benefits
While Speech Transformation can be useful to many verticals in a variety of ways, this article highlights a subset of these industries (Video Games, Telemedicine, and Sales & Media) benefitting from a subset of Speech Transformation’s functions (Accent Translation and Tonal Consistency – including combatting vocal fatigue). Note that eliminating background noise is helpful in all human-to-human online conversations, though it is not included below.
Video Games
In the video game industry, online multiplayer experiences are among the most prolific and profitable. However, especially for family-friendly online video games, where children are often playing alongside adults without parental involvement, voice chat within online multiplayer risks toxic and damaging experiences.
Not only do many voice chat participants engage in inappropriate, hateful, and harassing language, but children have also been solicited for their personal information, creating opportunities for scams and predatory behavior. As a result, voice chat is often disabled for some video games, preventing younger players from accessing a core part of the experience.
There are existing solutions to mitigate this, such as ToxMod, which leverages AI to detect harmful or predatory language before intervening. However, software like ToxMod can only detect potentially harmful language; there is no instant way to intervene beyond banning suspicious users. Additionally, this technology may produce false positives, banning users without good reason.
As a result, developers have to strike a balance between cracking down on toxic behavior and still allowing the more spirited conversations between players that are a hallmark of video game online communities. Consequently, younger players may be exposed to harmful or predatory language before software is certain of toxic behavior, trading a safety risk against a user-experience risk.
If Speech Transformation technology were implemented into these types of video games, software could apply a much lower threshold for mitigating suspicious behavior by modifying potentially harmful or predatory speech in real time without banning players. For example, if detection software flags a user as suspicious, instead of banning them outright (which risks a false positive), Speech Transformation could be applied to that user, modifying any harmful tones or wording in real time while still preserving players’ vocal identities. Additionally, if personal information (like an address or password) is detected, this speech could be censored in real time, preventing the risk of solicitation.
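As a sketch of the censoring idea, a detector could flag transcript spans that look like personal information so the matching audio can be muted or bleeped in real time. The phone-number pattern and function names below are hypothetical simplifications; a production system would operate on time-aligned audio, not just text:

```python
# Hypothetical PII-censoring sketch: flag spans of a live transcript that
# match a personal-information pattern (a US-style phone number here), so the
# corresponding audio can be muted before it reaches other players.

import re

PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def pii_spans(transcript):
    """Return (start, end) character spans to censor."""
    return [m.span() for m in PHONE.finditer(transcript)]

def censor(transcript):
    """Replace flagged spans with asterisks (audio analogue: a mute/bleep)."""
    out = transcript
    for s, e in reversed(pii_spans(transcript)):
        out = out[:s] + "*" * (e - s) + out[e:]
    return out

print(censor("call me at 555-123-4567 tonight"))  # call me at ************ tonight
```

Because spans are processed in reverse order, earlier character offsets stay valid as replacements are applied, a small but necessary detail in any span-rewriting pipeline.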
Telemedicine
Increasingly, medical and mental health professionals are leveraging Telemedicine to offer more accessible care. However, especially without an in-person setting, the risk of unclear speech is significant, as medical misunderstandings can have serious consequences. A 2024 study found that racial/ethnic minorities are 20-30% more likely to be misdiagnosed, with language barriers and accent-related biases being key factors.
While these biases are sometimes xenophobic, such as patients with accents receiving less thorough explanations, they can also be well-intentioned but similarly consequential, like the reduced accuracy in recalling accented patients’ critical health details. In mental health Telemedicine settings (including online therapy which is exploding in popularity) accent biases may prevent therapists from recognizing signs of depression or mania from patients’ tonalities.
Additionally, this bias cuts both ways, with patients perceiving accented doctors and nurses as less competent. As a result, introducing Speech Transformation technology, such as Accent Translation, into Telemedicine could benefit the quality of care given significantly.
Moreover, especially in the realm of mental health and in emergency-responder situations related to mental health, both perceiving and communicating tonal consistency is often a critical factor. When frontline medical professionals talk on the phone with distressed individuals, their tone may be the difference between action and inaction, and the situation is stressful for these professionals as well, leaving them prone to error. As a result, applying a layer of tonal-consistency-ensuring Speech Transformation to these conversations may be impactful.
Sales & Media
In addition to the prior two industries, which can leverage Speech Transformation to maintain tonal consistency and mitigate accent bias, the technology can also be used to combat vocal fatigue. In verticals like sales or media, remaining high-energy and avoiding vocal fatigue is paramount. If a salesperson’s or presenter’s speech reflects fatigue, the result may be a wearier sales pitch or media performance. Speech Transformation can ensure that users’ voices maintain a certain level of energy while preserving vocal identity, preventing such cases.
Overall, Voice AI’s ability to create AI-generated speech is a groundbreaking innovation that will continue to transform how businesses’ conversations are conducted. Still, a less hyped segment of Voice AI, Speech Transformation, will transform human-to-human online conversations in enterprise contexts. Sanas is poised to leverage its Speech Transformation infrastructure to not only expand within CX BPOs, but to augment online conversations in many different verticals.