Gen AI: Here, yet still arriving, and always evolving
When used in specialized domains, gen AI tools need the careful guidance of human experts.


By Bridget Haby, Werner Rehm, and Stefan Roos

Large language models (LLMs) can do incredible things, but their impact on value is still hotly debated as the generative AI (gen AI) landscape evolves. For example, Goldman Sachs reported both optimism and pessimism about a measurable impact on US GDP by 2027, citing research by MIT and Goldman Sachs itself. Rather than add to that debate, we decided to test whether gen AI can be helpful to us as valuation practitioners. We all know that things can go wrong with gen AI when models are not trained correctly or there is no human in the loop. At best, the resulting errors are merely amusing, but sometimes they have more serious consequences (confidential health information being shared with the wrong users, a car dealer chatbot attempting to sell a car for $1, and an airline customer service bot refunding money outside of its standard rules are just a few of the examples out there). The common theme is that a model’s responses need to be fully validated before the model can be trusted with decision making.

Crucially, this is an ongoing process, as the technology, use cases, and users’ capabilities evolve. What doesn’t work today might work tomorrow, so we have to keep experimenting and stay on top of the advances as they arrive. To illustrate this point, we can look to our own interactions with gen AI in a corporate finance context and see how quickly the tools are evolving.

Earlier this year, we asked a publicly available AI assistant a question of central importance to valuation practitioners: define the cost of debt. We gave the assistant no further context, and lacking that, it gave us an accurate but entirely inappropriate definition of a different kind of debt. Rather than describing the cost of debt in its corporate finance sense, the AI assistant defined technical debt in the context of software development:


Q: What’s the definition of cost of debt?

A: The cost of debt refers to the lost productivity caused by tech debt. In other words, it is the quantification of the “interest” on tech debt, which is the cost of lost productivity.

Tech debt is the accumulation of technical issues that arise when software development teams take shortcuts or make compromises in order to meet deadlines or cut costs. These shortcuts or compromises can lead to a buildup of technical debt that can slow down development and increase the risk of errors or bugs.



This is one of the dangers of tossing freely worded questions at an AI assistant, and a major reason why it’s not that simple to adopt gen AI. Prompts matter a lot.

So we tweaked our initial prompt and provided more context, this time asking, “Define the cost of debt in calculating WACC for company valuation.” This time the answer we got was:


“To estimate the cost of debt for an estimate of cost of capital, you can follow these steps:

Determine the current interest rate on your company's debt: The interest rate on your company's debt is the cost of borrowing money. You can find this information in your company's financial statements or by contacting your lender…”


Additional steps followed, including really good ideas beyond numbers, like “check credit rating” and “compare to benchmarks to make sure the answer is reasonable.”

This second answer is simple, clear, and easy to read, yet it is completely wrong for the question asked. (If you’re a valuation practitioner, you probably already saw that this uses cash cost, not opportunity cost.)
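To make that distinction concrete, here is a minimal sketch in Python with entirely hypothetical numbers, showing why the coupon a company currently pays (the cash cost the steps above describe) can differ from the market-based opportunity cost of debt that belongs in a WACC:

```python
# Hypothetical illustration: cash (coupon) cost vs. market-based (opportunity) cost of debt.
# All figures are made up for the example; they are not from any real filing.

face_value = 1_000.0          # bond face value
coupon_rate = 0.04            # the company borrowed years ago at 4%
price_today = 920.0           # the same bond now trades below par
years_to_maturity = 5

# "Cash cost": the historical coupon the company actually pays.
cash_cost = coupon_rate

# Opportunity cost: approximate yield to maturity at today's price
# (simple approximation; a real YTM solves for the discount rate numerically).
annual_coupon = coupon_rate * face_value
approx_ytm = (annual_coupon + (face_value - price_today) / years_to_maturity) / (
    (face_value + price_today) / 2
)

print(f"Cash (coupon) cost of debt: {cash_cost:.2%}")
print(f"Approximate market YTM:     {approx_ytm:.2%}")
# For valuation, the market-based rate -- not the historical coupon --
# is the relevant pre-tax cost of debt in WACC.
```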

More recently, we asked the same question of a newer model of the publicly available AI assistant, and received a better, albeit still incorrect answer. It included this sentence:


“Market Rate: For valuation purposes, the cost of debt should reflect the current market rate at which the company can borrow. This is often estimated using the yield to maturity (YTM) on existing debt or the interest rates on new debt issuances.”


This is definitely better than the previous, cash-focused answer. But it also didn’t include a disclaimer that the YTM can be significantly higher than the cost of debt for non-investment-grade ratings. Furthermore, gen AI overlooked essential methods (using company bond ratings to add an estimated default premium to the risk-free rate) needed to estimate the cost of debt for companies with lower maturity debt.
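As a rough illustration of the ratings-based approach the answer missed, here is a minimal sketch; the risk-free rate, the spread table, and the tax rate are hypothetical placeholders, not market data:

```python
# Hypothetical sketch: cost of debt estimated as risk-free rate + default spread
# implied by the company's credit rating. The spreads below are illustrative only.

risk_free_rate = 0.042  # e.g., a long-term government bond yield (placeholder)

default_spread_by_rating = {
    "AAA": 0.006, "AA": 0.008, "A": 0.011,
    "BBB": 0.017, "BB": 0.032, "B": 0.050,
}

def estimate_cost_of_debt(rating: str) -> float:
    """Pre-tax cost of debt = risk-free rate + rating-implied default spread."""
    return risk_free_rate + default_spread_by_rating[rating]

pre_tax_kd = estimate_cost_of_debt("BBB")
after_tax_kd = pre_tax_kd * (1 - 0.25)   # marginal tax rate assumed at 25% for the example
print(f"Pre-tax cost of debt:   {pre_tax_kd:.2%}")
print(f"After-tax cost of debt: {after_tax_kd:.2%}")
```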

In another example, an earlier version of a gen AI bot provided what looked like a correct high-level formula for estimating return on invested capital. However, when we probed further into the details of how to calculate invested capital, it incorrectly included short-term debt as an operating liability.


Q: Provide the formula and calculation of ROIC

A: Return on Invested Capital (ROIC) is a financial metric that measures the profitability of a company based on its net invested capital. The formula for calculating ROIC is as follows:

ROIC = Net Operating Profit After Tax (NOPAT) / Net Invested Capital

Q: Provide the formula and calculation of net invested capital based on operating assets and operating liabilities.

A: Net Invested Capital = Operating Assets – Operating Liabilities

1. Operating Assets: These are the total assets directly related to the company’s core operating activities. Examples include cash, accounts receivable, inventory, property, plant, and equipment.

2. Operating Liabilities: These are the total liabilities that arise from a company’s core operating activities. Examples include accounts payable, short-term debt, and other operating-related obligations.


Interestingly, this is an error often made by students and inexperienced analysts. It is not even a particularly subtle error: debt is clearly a financing item, not part of operating working capital. It seems that when forecasting the words of its answer, the bot treated “short-term debt” as a statistically likely term to appear alongside “liabilities” and “working capital” in the context of operating capital. Larger models, given more context to analyze in the source data, might have “discovered” the correct treatment. Again, this is a great example of how the models, despite their huge parameter space, are still evolving and do not yet fully understand the context of an expert question.
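To make the correct treatment concrete, here is a minimal sketch with made-up balance sheet figures; it is illustrative only. The key point is that short-term debt sits with financing, so it is excluded from operating liabilities when computing invested capital:

```python
# Hypothetical balance sheet figures (illustrative only).
operating_assets = {
    "operating_cash": 50.0,
    "accounts_receivable": 400.0,
    "inventory": 300.0,
    "net_ppe": 900.0,
}
operating_liabilities = {
    "accounts_payable": 250.0,
    "accrued_expenses": 100.0,
    # Note: short-term debt is deliberately NOT included here -- it is a financing
    # item, not an operating liability, which is exactly the mistake the bot made.
}
short_term_debt = 150.0  # belongs with the company's financing, not its operations

invested_capital = sum(operating_assets.values()) - sum(operating_liabilities.values())

nopat = 180.0  # net operating profit after taxes (illustrative)
roic = nopat / invested_capital
print(f"Invested capital: {invested_capital:,.0f}")
print(f"ROIC: {roic:.1%}")
```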

In another example of what can go wrong, when tasked with extracting segment revenue of commercial banking from bank financial statements, a gen AI system came up with fictional numbers that did not reconcile to the source’s 10-K filings. We have learned that many models will have trouble accurately extracting information from complex tables without the proper pre-processing of documents used. But this doesn’t change the fact that the model still “confidently” answered with segment revenue numbers, seemingly from a company’s original filing but completely wrong. The question here becomes, how do I know whether the numbers are correct? Again, this is where a proprietary gen AI tool can solve the problem, by pre-processing documents and providing source links for all chatbot outputs that lead to the documents and/or exhibits from which the output statement was sourced.
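One simple guardrail, sketched below with hypothetical figures and function names of our own invention, is to require every extracted number to carry a source reference and to check that segment revenues reconcile to the reported total before anyone relies on them:

```python
# Hypothetical sketch of a reconciliation guardrail for extracted figures.
# The numbers and the structure of `extracted` are illustrative, not from a real 10-K.

extracted = [
    {"segment": "Commercial Banking", "revenue": 12_400, "source": "10-K, segment note"},
    {"segment": "Consumer Banking",   "revenue": 18_900, "source": "10-K, segment note"},
    {"segment": "Wealth Management",  "revenue": 6_100,  "source": "10-K, segment note"},
]
reported_total_revenue = 37_400  # as stated in the consolidated income statement

def reconciles(rows, total, tolerance=0.01):
    """Flag outputs whose segment sum drifts from the reported total."""
    extracted_sum = sum(r["revenue"] for r in rows)
    return abs(extracted_sum - total) <= tolerance * total

if not all(r.get("source") for r in extracted):
    raise ValueError("Every extracted figure must cite its source document or exhibit.")
if not reconciles(extracted, reported_total_revenue):
    raise ValueError("Segment revenues do not reconcile to the reported total; "
                     "route to a human reviewer before use.")
print("Extraction passes reconciliation; safe to surface with citations.")
```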

At McKinsey, we have a proprietary gen AI system that uses expert documents as additional input, and so it gives us the correct answer the first time, because human judgment has already been applied in the source data to establish “what is right.” This is an interesting and critical point: human judgment is still needed in these expert cases to distinguish “correct” from “wrong,” either in the input to the system or in its output. These models have no inherent judgment of the “truth,” and so they still need expert guidance.

How do I know?

The point here is not that “AI is wrong,” or that these tools are useless, or even that only proprietary gen AI tools work. Rather, the point is that when used in specialized domains, gen AI tools need the careful guidance of human experts. This makes sense, in that the tools have no way on their own of determining what is correct. They merely answer by predicting the best sentence based on the prompt and source data provided. As such, off-the-shelf gen AI products alone do not possess the logic required to understand right and wrong answers. Rather, they require custom coding, pre- and post-processing, and business logic applied in conjunction with gen AI to get to more expert-like responses.

We see three avenues for reducing model errors:

1.     Larger models that “understand” context better.

More recently released models have much larger context windows and an enhanced understanding of specialized domains. However, there is still no guarantee of 100 percent accuracy.

2.     Expert check of the output with iterative prompts.

Providing the right context through judicious prompting and spotting inaccuracies will continue to be required. The consistency and accuracy of outputs can be improved through expert-driven prompt creation and a curated prompt library, but an expert check on outputs is still needed to ensure accuracy for domain-specific questions, and prompt engineering can still lead to undesired results.

3.     Expert-curated input.

We continue to experiment with and refine our approach to curated content and intent in our proprietary gen AI tool. For example, our valuation book (Valuation: Measuring and Managing the Value of Companies, Wiley, 2020) is without a doubt a better source of “truth” on how to estimate ROIC than a collection of websites. Quality weightings within source data and more complex logic governing how gen AI searches for the right information, among other adjustments, can all help improve the quality of responses. However, this requires much more extensive data processing and engineering than any of the online models can provide today.
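As a minimal, hypothetical sketch of what quality weightings can mean in practice (the weights, scoring blend, and function below are our own illustrative assumptions, not a description of our production system), retrieved passages can be scored on relevance blended with a curated source-quality weight, so an expert text outranks a random web page:

```python
# Hypothetical sketch of quality-weighted retrieval scoring (illustrative only).

# Curated weights assigned by domain experts to each source in the corpus.
source_quality = {
    "valuation_handbook": 1.0,     # expert-reviewed reference text
    "internal_training_deck": 0.8,
    "generic_web_article": 0.3,
}

def rank_passages(passages):
    """Order candidate passages by relevance blended with source quality."""
    def score(p):
        return 0.7 * p["relevance"] + 0.3 * source_quality.get(p["source"], 0.1)
    return sorted(passages, key=score, reverse=True)

candidates = [
    {"source": "generic_web_article", "relevance": 0.92, "text": "..."},
    {"source": "valuation_handbook",  "relevance": 0.85, "text": "..."},
]
best = rank_passages(candidates)[0]
print(best["source"])  # the expert source wins despite slightly lower relevance
```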

What do I do?

To consciously accelerate experimentation, executives can think about implementing their gen AI efforts in “safe areas” such as the back office to rapidly realize efficiency benefits, while remaining more careful when deploying such new solutions in customer-facing situations where the core of the business could be affected.

1.     In the back office: One of the key strengths of gen AI is summarizing reports and relevant data, automating competitive analysis, creating an initial draft of memos, et cetera. These applications can be experimented with quickly, at low risk, and at relatively low cost in parallel to current processes. These use cases rely on a defined “prompt library” of questions that are known to create the desired output and can be reliably repeated given the source data (a minimal sketch of such a library follows this list). This will feel like a traditional efficiency effort but should have a much faster impact than trends like offshoring back-office tasks, which often took years to implement.

2.     Customer facing: On the other hand, executives should be careful in implementing current gen AI systems for customer-facing tasks. They should also be wary of allowing users to freely ask questions, which often yields poor results, rather than using engineered prompts that are far more likely to produce usable results. Additionally, it’s important to remember that systems can “hallucinate” even when configured to answer questions from specific materials. The strength of gen AI can also be its weakness: the ability to generate answers at the level of human conversation is game changing. However, what if you unintentionally enable gen AI to do something that hinders the quality of your core business?
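As promised above, here is a minimal, hypothetical sketch of what a back-office prompt library can look like; the task names and prompt texts are illustrative placeholders, not prompts we actually use:

```python
# Hypothetical sketch of a back-office "prompt library": vetted prompt templates
# that are known to produce reliable output for a given document type.

PROMPT_LIBRARY = {
    "summarize_earnings_call": (
        "Summarize the attached earnings call transcript in five bullet points, "
        "covering revenue drivers, margin commentary, and guidance changes. "
        "Quote the transcript for every figure you cite."
    ),
    "draft_competitor_memo": (
        "Using only the attached annual reports, draft a one-page memo comparing "
        "revenue growth and operating margin for the listed competitors. "
        "Flag any figure you cannot trace to a source page."
    ),
}

def build_prompt(task: str, document_text: str) -> str:
    """Combine a vetted prompt template with the source document."""
    template = PROMPT_LIBRARY[task]  # only pre-approved prompts are allowed
    return f"{template}\n\n--- SOURCE DOCUMENT ---\n{document_text}"

# Usage (illustrative): build_prompt("summarize_earnings_call", transcript_text)
```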

What's next?

Gen AI is here to stay. It will drive massive efficiencies and could do so rapidly. But it will be a while before the need for expert guidance on these tools wanes. In the meantime, as AI gradually takes over more tasks, companies can focus on three critical questions as they try to capture a slice of the multi-trillion-dollar AI opportunity:

1.     How are you rapidly identifying high-impact AI applications for your core business? Quick wins in back offices can reveal top use cases worth doubling down on.

2.     What steps are you taking to prevent AI hallucinations and errors from reaching customers? Prompt engineering, custom models, and expert review processes are key to ensuring reliability.

3.     Is your workforce ready to adopt AI at scale? The right infrastructure, talent, and training will determine how quickly AI becomes business-as-usual across your organization.


Let’s discuss—what are you doing in your organization?

What examples have you seen of a customer-facing AI or “expert AI” going wrong?

