Discerning Disruptive Gen AI Products from Hype Part 2: Ensuring Meaningfully Valuable Generated Content
Gen AI advancements are fueling new initiatives and disruption that, according to AI leaders, are nowhere near slowing down. Thus, to navigate this exciting yet increasingly saturated space, it’s crucial to develop a framework that discerns truly disruptive Gen AI products from those built on hype.
Focusing on B2B Gen AI products, which leverage AI to generate content that an organization’s employees use in their workflows, a core pulse check that can distinguish viable products from novelties is whether they meaningfully reduce an employee’s ‘Time to Value’.
If a Gen AI product lets an employee reduce their Time to Value, or TTV, through a combination of a UX optimized for their workflow and meaningfully valuable LLM-generated content, it’s reasonable to expect that the product will find its way into long-term usage habits and become something customers are willing to keep paying for.
In Part 1 of this series, we discussed optimizing Gen AI products’ UXs as a lever to meaningfully reduce TTV. In this article, we focus on how to ensure that a Gen AI product generates meaningfully valuable (and thus TTV-reducing) content.
Time to Value and Meaningfully Valuable Generated Content
TTV, put simply, is a metric used to determine the amount of time it takes for a customer to see the value of a product or service. While some narrow TTV’s scope to new customers, it’s defined here as the amount of time it takes for an employee to get value from their workflow.
For examples of TTV in the context of Gen AI products, reference the “What is Time to Value?” section of Part 1 of this series, though Gen AI products follow a generalizable TTV formula:
When asking whether a Gen AI implementation lowers TTV, an existential consideration is whether the content a Gen AI product generates is valuable enough to its users’ workflows. In other words, is the reduction in TTV of a user’s workflow with Gen AI…
…large enough to justify a separate product’s cost and time to integrate into an organization? Despite all of the hype and investor-receptiveness Gen AI products receive, it’s apparent that in many cases the answer to this existential question is “no”.
At Microsoft, teams across the organization were pushed to experiment with Gen AI integration in just about every offering they could before taking a step back and evaluating whether they should follow through with that initiative. In cases where Gen AI integration was not pursued, there were consistent patterns that indicated Gen AI workflows were not the best path forward.
While B2B Gen AI products fail to reach this foundational threshold for a variety of reasons, there are 2 common pitfalls that, if not consciously avoided, can prevent these experiences from being disruptive: the “Minimal Problem Space” and the “Very Rough Draft”.
The “Minimal Problem Space”
Arguably the most common reason why Gen AI isn’t integrated into B2B workflows is that the time it takes to create or locate content isn’t long enough to warrant a need for it to be generated by AI. In other words, a task’s Time(Content Creation) - Time(Content Generation) isn’t significant, thus indicating that there’s no real problem space Gen AI is solving.
For example, a prevalent Gen AI application is to insert a chatbot into a website’s FAQ or support docs. This chatbot ingests the support doc’s or FAQ’s specific context (either via RAG or model fine-tuning) to provide generated answers to user queries (see below example).
[AWS’s FAQs are accompanied by the company’s Gen AI chatbot, “Q”]
While this is an effective tool for expansive and complex support documentation (like AWS), for less-technical FAQs Gen AI integration may not be appropriate. In the latter cases, simply using “CTRL F” to locate specific keywords related to a user’s inquiry could effectively find a user’s desired content.
In this example, the Time(Content Creation) of locating an FAQ answer (assuming a user’s “CTRL F” action is successful) isn’t a meaningful problem space. In fact, Gen AI integration has the potential to be a much worse workflow. Depending on the chatbot’s UX, Time(Content Generation) could be significantly higher than “CTRL F” because of the task’s latency and the friction of opening the chatbot and typing an appropriate prompt. Additionally, unlike the deterministic “CTRL F”, Gen AI can fail to locate a specific sentence within a longer context window, potentially leading to a high Time(Content Refinement) as well.
While this example is simplistic, it’s effective at conveying the idea that Gen AI doesn’t always result in improved experiences. We’re currently in a startup landscape that often inherently rewards Gen AI integration, rather than asking the fundamental question of why its inclusion is beneficial to a workflow in the first place. Thus, the ability to identify use cases where Gen AI is not improving an experience (or in some cases actually worsening it) is crucial.
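As a rough illustration of the comparison above, the sketch below totals the time each path takes. Splitting the chatbot path into prompt-writing, latency, and refinement time is an assumption made for illustration, and every number is hypothetical rather than measured.

```python
from dataclasses import dataclass


@dataclass
class GenAIPath:
    """Hypothetical time components (seconds) for answering one question via a chatbot."""
    prompt_seconds: float      # opening the chatbot and typing a prompt
    latency_seconds: float     # waiting for the generated answer
    refinement_seconds: float  # re-prompting or verifying when the answer misses

    def total(self) -> float:
        # Assumed model: Time(Content Generation) + Time(Content Refinement)
        return self.prompt_seconds + self.latency_seconds + self.refinement_seconds


def time_saved(creation_seconds: float, gen_ai: GenAIPath) -> float:
    """Time(Content Creation) minus the full Gen AI path; negative means Gen AI is slower."""
    return creation_seconds - gen_ai.total()


# Illustrative numbers only: a short FAQ searched with CTRL F vs. a chatbot.
ctrl_f_seconds = 10.0
chatbot = GenAIPath(prompt_seconds=15.0, latency_seconds=5.0, refinement_seconds=10.0)

saved = time_saved(ctrl_f_seconds, chatbot)
print(f"Time saved by the chatbot: {saved:+.0f} seconds")
```

With these made-up inputs the result is negative, which is the “minimal problem space” signal: the chatbot path costs more time than the task it replaces.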
The “Very Rough Draft”
On the other end of the spectrum, there are workflows where the content Gen AI can produce would address a meaningful problem space, but isn’t close enough to an acceptable end product to be useful.
Despite all the hype associated with Gen AI, there’s of course a lot that it can’t do. A product that could generate believable CGI with nothing more than a prompt engineer and compute costs would be beloved by Hollywood studios and amateur filmmakers alike, but it doesn’t yet exist because the required capabilities do not yet exist.
That’s not to say that startups shouldn’t think big with their applied AI aspirations. After all, if industry leaders are to be believed, founders should expect what’s possible to improve drastically, or else model advancements will steamroll them. However, when startups do face the reality that Gen AI’s capabilities are not advanced enough to solve their problem space, there are 2 approaches they often employ that, if taken too far, can hinder their products significantly.
The 1st approach is the simplest: If a product’s generated content isn’t immediately usable, users can refine the content manually until it is. Many Gen AI products give users the ability to refine content in 3 forms: Re-loading prompts, inserting additional parameters retroactively, and using traditional deterministic tools to edit any unsuitable results.
However, if AI-generated content isn’t close enough in usability to traditionally created content, content refinement can get effortful and time-consuming, leading to a scenario where although…
The reasons for this go back to the aforementioned 3 forms of Gen AI content refinement. Refinement via re-loading prompts or inserting additional parameters retroactively is a net new addition to an employee’s workflow: it requires learning a new interaction paradigm, which often involves a frustrating “guess-and-check” methodology. Additionally, if an employee must use their traditional content creation tools to refine Gen AI content anyway, it may have been quicker to just create the content deterministically from the beginning.
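One way to picture this trade-off is a break-even refinement budget: the most time a user can spend refining before the Gen AI path becomes slower than creating the content traditionally. The sketch assumes that TTV with Gen AI is simply generation time plus refinement time; the function names and numbers are hypothetical.

```python
def refinement_budget(creation_minutes: float, generation_minutes: float) -> float:
    """Most refinement time that still leaves the Gen AI path faster (assumed additive model)."""
    return max(0.0, creation_minutes - generation_minutes)


def gen_ai_is_faster(creation_minutes: float,
                     generation_minutes: float,
                     refinement_minutes: float) -> bool:
    """True when generation plus refinement beats traditional creation."""
    return generation_minutes + refinement_minutes < creation_minutes


# Hypothetical task: 60 minutes to create by hand, 2 minutes to generate.
budget = refinement_budget(60.0, 2.0)
light_refinement = gen_ai_is_faster(60.0, 2.0, refinement_minutes=45.0)
heavy_refinement = gen_ai_is_faster(60.0, 2.0, refinement_minutes=75.0)

print(f"Refinement budget: {budget} minutes")
print(f"Faster with 45 minutes of refinement: {light_refinement}")
print(f"Faster with 75 minutes of refinement: {heavy_refinement}")
```

Generation is far faster than creation in both cases; only the refinement time decides whether the workflow as a whole improves.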
While the 1st approach relies on content refinement, the 2nd approach involves scoping back the ambitions of a Gen AI product’s content generation. By narrowing the portion of an employee’s workflow that benefits from content generation, Gen AI products can provide incremental value with currently available technology.
This is often a great approach and has led to massively successful products such as GitHub Copilot, which generates boilerplate code that saves developers time rather than an entire project from scratch. Had GitHub Copilot aspired to generate an entire end-to-end project from its inception, it likely would have increased overall TTV through error-prone content generation and extensive content refinement (especially since AI-generated bugs may be hard to find).
Knowing this, GitHub started small, generating basic blocks of code that, while a huge time saver, only helped with a small portion of a software engineer’s tasks. This isn’t to say that GitHub Copilot’s aspirations were small. In fact, as the product has evolved, new initiatives such as GitHub Copilot Workspace and GitHub Copilot Chat aim to generate content that helps with more of the software development process. Additionally, since GitHub Copilot has already integrated itself into many software developers’ workflows via its original incremental product (and in the process made GitHub intimately familiar with this problem space), its new offerings will be excellently positioned versus competitors’.
However, this approach can easily be overdone. If a Gen AI product is scaled back enough, it may not provide enough value to warrant its existence, thus falling into “minimal problem space” territory. At Microsoft, there’s a concept of a Gen AI experience being “minimally trusted”. For an experience to be minimally trusted, simply put, its generated content must meet the majority of users’ expectations with low failure rates.
While this sounds great, optimizing for this metric can prove counterproductive to creating an experience with differentiating value (one that can meaningfully lower an employee’s TTV). Oftentimes, because they generate little content and require a lengthy content refinement process, products optimized for a trusted experience end up functioning similarly to the deterministic “Wizards” we’ve had for many years, rather than as disruptive AI offerings.
Consider a hypothetical Gen AI product that enables employees to fill out a specific type of form faster. This form includes both basic, factual information (names, numbers, locations, etc.) and more complex qualitative paragraphs. At first, leveraging Gen AI to autocomplete these forms seems like it will provide differentiating value, saving employees time in retrieving facts from across the organization and creating paragraphs of written information.
However, let’s now assume that this form has to do with heavily-regulated medical information, and the company’s priorities center on ensuring that generated content is as accurate as possible (since there is zero user-tolerance for error). Because the more complex qualitative paragraphs contain sensitive patient information, the Gen AI product does not attempt to generate them, instead optimizing for an experience that is minimally trusted where:
The product generates basic form information with 100% accuracy, then leaves users to fill out the paragraph-format areas of the form themselves.
While this product is reliable, it does not meaningfully reduce TTV. In this scenario, a user’s workflow is primarily spent locating various sources of qualitative information and creating complex paragraphs from them (whereas finding this form’s required basic information was never a primary pain point).
Conversely, if this same product were optimized for providing differentiating value (while considering trustworthiness as a guardrail metric), you could imagine an alternative experience:
The product generates basic information with lower accuracy (due to a lower prioritization of this accuracy). However, it links the sources it uses for easy fact checking and encourages content refinement via its UX.
The product then generates basic, high-level summaries of the different sources of qualitative information relevant to the paragraph sections. The summaries cite their original sources for easy fact checking, and the product’s UX encourages content refinement.
Although the latter product example is less trustworthy than the former (in an industry where content accuracy is essential), it is nonetheless a more meaningful TTV reducer. The latter example still guardrails against errors through enabling (and encouraging) users to easily fact check generated information, and reduces TTV in an area of the user’s workflow with the greatest Time(Content Creation).
To illustrate this, assume the Time(Content Creation) of filling in basic information is 20% of a user’s TTV(Workflow), while writing complex paragraphs makes up the other 80%. Let’s also assume that in the first trustworthiness-optimized product, they nail content generation, requiring zero content refinement and reducing that section of the workflow’s TTV by 99.5%.
In the second differentiated-value-optimized product, content refinement is very necessary. In fact, its content generation and refinement together only reduce TTV by 20% (as compared to creating this content traditionally).
Even with this 79.5-percentage-point delta in product trustworthiness, the second example is still more effective because it reduces TTV across a much larger portion of an employee’s workflow (despite being much worse at accurately generating content than the first example).
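The arithmetic behind this comparison can be sketched as follows. The sketch assumes that the second product’s 20% reduction applies across the whole workflow (basic information and paragraphs alike), which is one reading of the figures above; the function name is hypothetical.

```python
def overall_reduction(share_of_workflow: float, reduction_within_share: float) -> float:
    """Fraction of total TTV removed when only part of the workflow is sped up."""
    return share_of_workflow * reduction_within_share


# Product 1 (trust-optimized): touches only basic information (20% of the workflow).
product_1 = overall_reduction(share_of_workflow=0.20, reduction_within_share=0.995)

# Product 2 (value-optimized): ASSUMED to touch the whole workflow
# (basic information plus paragraphs) with a 20% reduction throughout.
product_2 = overall_reduction(share_of_workflow=1.00, reduction_within_share=0.20)

print(f"Product 1 overall TTV reduction: {product_1:.1%}")
print(f"Product 2 overall TTV reduction: {product_2:.1%}")
```

Under this reading the second product edges out the first, 20.0% versus 19.9%, even though the first is nearly perfect within its narrow scope. If the 20% reduction applied only to the paragraph sections, the overall figure would be 0.80 × 0.20 = 16%, so the conclusion depends on that assumption.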
The deterministic software that’s fueled technology’s growth has prided itself on its accuracy, since this is where computers thrive. However, Gen AI has flipped this notion on its head, as the most cutting-edge developments are probabilistic. Naturally, many teams respond by making their products as accurate as possible to mitigate these shortcomings. Instead, the Gen AI experiences that will be truly disruptive embrace AI’s strengths by, above all else, generating content that’s meaningfully valuable to its users.
When Microsoft announced their partnership with OpenAI, the effects were felt viscerally, as the organization’s upcoming Hackathon shifted towards a singular mission: “Imagine if Generative AI was integrated into every corner of the company. How would it change the way we work?” Both in big tech and startups alike, we’ve seen products that are genuinely transformative and hype-laden campaigns that resulted in novelties.
As we move past seeing Gen AI as “magic toys” and instead see it as a driver of value and productivity, a simple but powerful guide for discerning which products will actually change the way we work is whether they reduce Time to Value, and thus change habits.
What I'm curious about is why method 1 and method 2 couldn't be combined. Couldn't we fill in the basic information with method 1 and the complex paragraphs with method 2?