RDEL #72: how can engineering teams resolve challenges with integrating LLMs?
The top challenges, related disruptions, and solutions for engineering teams working with LLMs
Welcome back to Research-Driven Engineering Leadership. Each week, we pose an interesting topic in engineering leadership and apply the latest research in the field to drive to an answer.
Large Language Models (LLMs) are transforming software products across industries, offering new capabilities but also introducing significant engineering challenges. As these models disrupt traditional workflows, engineers must navigate uncharted territory, from changing quality standards to evolving testing practices. This week we ask: How can engineering leaders resolve the common challenges with integrating LLMs into products?
The context
The integration of LLMs into software products has become a cornerstone for innovation in various fields, including healthcare, legal services, and enterprise software. Unlike traditional software, LLM-based features demand new engineering paradigms. Their non-deterministic nature, reliance on subjective quality metrics, and operational costs challenge established engineering norms.
Teams are now faced with reconciling these unique demands with existing software engineering workflows, which often lack the tools or processes to address these complexities. This gap has spurred the need for innovative solutions to ensure quality, fairness, and reliability in LLM-enabled systems.
The research
A recent mixed-method study from Carnegie Mellon University and Microsoft Research investigated these challenges by interviewing 26 practitioners and surveying 332 developers involved in integrating LLMs. The research identified the top challenges, the disruptions they cause to quality assurance in LLM-based products, and the solutions teams use to address them.
First, researchers organized the most common challenges into the following nine categories:
Lack of specification: what counts as a bug is ambiguously defined.
Subjectivity: there are no clear expectations of what a correct answer is.
Metrics dilemma: developing the right metrics is complicated.
LLM properties: because LLMs are non-deterministic, teams struggle to test outputs that aren't consistently reproducible.
Evaluation: robust evaluation methods and pipelines are lacking.
Manual efforts: manual testing is common, but labor-intensive and inefficient.
Infrastructure constraints: teams over-depend on existing evaluation infrastructure that lacks flexibility for all use cases.
Model migration issues: minor model tweaks can cause substantial changes in output.
Compliance: processes are lengthy, manual, and often bureaucratic.
Next, researchers mapped those challenges to the top disruptions they cause, along with the top reported solutions to resolve them. (We report only the top few solutions for each disruption; for more inspiration, check out the full paper.)
Evaluation metrics for LLMs change frequently.
Caused by challenge: #1, #2, #3, #4, #6
Top solutions
Define custom metrics through iterative collaboration and expert consultations (57% adopted, 55% satisfied, 2% did not find useful).
Combine qualitative and quantitative metrics to evaluate multifaceted outputs effectively (54.4% adopted, 53.7% satisfied, 0.7% did not find useful).
Evaluate subjective metrics using LLM validators (50.6% adopted, 47.3% satisfied, 3.3% did not find useful).
Establish clear rubrics and scoring mechanisms (33.8% adopted, 33.1% satisfied, 0.7% did not find useful).
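The last two solutions above can be combined: a grading model scores each output against a fixed rubric. Here is a minimal sketch of that pattern; `call_grader` is a hypothetical stand-in for a real LLM API call, and the rubric criteria are illustrative.

```python
# Sketch of rubric-based scoring with an LLM validator.
# `call_grader` is a hypothetical placeholder for a real LLM call.

RUBRIC = {
    "relevance": "Does the answer address the user's question?",
    "grounding": "Is every claim supported by the provided context?",
    "tone": "Is the answer professional and concise?",
}

def call_grader(criterion: str, question: str, answer: str) -> int:
    """Placeholder grader: a real system would prompt an LLM to
    return a 1-5 score for this criterion. A fixed score keeps
    the sketch runnable and deterministic."""
    return 4

def score_output(question: str, answer: str) -> dict:
    """Score one answer against every rubric criterion (1-5 scale)."""
    scores = {name: call_grader(prompt, question, answer)
              for name, prompt in RUBRIC.items()}
    scores["overall"] = sum(scores.values()) / len(RUBRIC)
    return scores

result = score_output("What is our refund policy?", "Refunds within 30 days.")
```

Keeping the rubric explicit and versioned is what makes the scores comparable across runs, even as prompts and models change underneath.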
Common assumptions about test processes and environments break.
Caused by challenge: #5, #6, #9
Top solutions
Establish internal team standards for evaluation processes and pipelines (48.7% adopted, 48% satisfied, 0.7% did not find useful).
Automate offline evaluation to run periodically on a schedule (29.6% adopted, 28.9% satisfied, 0.7% did not find useful).
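Scheduled offline evaluation typically means replaying a fixed "golden" dataset against the model and gating on the pass rate. A minimal sketch, where `model` is a trivial stand-in for a real inference call and the golden cases are made up:

```python
# Sketch of an offline evaluation pass over a fixed "golden" dataset.
# In practice this runs on a schedule (cron, CI job); `model` is a
# hypothetical stub standing in for a real LLM call.

GOLDEN_SET = [
    {"prompt": "2+2?", "expected_keywords": ["4"]},
    {"prompt": "Capital of France?", "expected_keywords": ["Paris"]},
]

def model(prompt: str) -> str:
    # Stub responses; replace with a real inference call.
    return {"2+2?": "The answer is 4.",
            "Capital of France?": "Paris is the capital."}[prompt]

def run_offline_eval() -> float:
    """Return the fraction of golden cases whose output contains
    all expected keywords (a simple, reproducible quality gate)."""
    passed = 0
    for case in GOLDEN_SET:
        output = case_output = model(case["prompt"])
        if all(k in case_output for k in case["expected_keywords"]):
            passed += 1
    return passed / len(GOLDEN_SET)

score = run_offline_eval()
```

Because the dataset is frozen, a drop in `score` between scheduled runs points at a model or prompt change rather than test flakiness.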
Engineers require new skills to handle LLM evaluations.
Caused by challenge: #4
Top solutions:
Involve data scientists in authoring tests, as they understand the limitations of the model better (33.5% adopted, 32.1% satisfied, 1.4% did not find useful).
Even with extensive evaluations, LLM solutions remain unreliable and developers have difficulty establishing trust.
Caused by challenge: #4, #5
Top solutions:
Employ canary release strategy to enhance confidence in LLM outputs (standardized practice, no adoption stats available)
Use A/B testing to track changes across versions (i.e., prompt updates) (36.7% adopted, 36.7% satisfied, 0% did not find useful).
Establish extensive guardrails (44.2% adopted, 41.3% satisfied, 2.9% did not find useful).
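Guardrails are usually deterministic checks applied to model output before it reaches a user, which is exactly what makes them trustworthy when the model itself is not. A minimal sketch, with illustrative rules (the pattern list and length limit are assumptions, not recommendations):

```python
# Sketch of output guardrails: deterministic checks applied before an
# LLM response is shown to a user. Rules here are illustrative only.
import re

BLOCKED_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # e.g. SSN-like strings
MAX_LENGTH = 2000

def apply_guardrails(text: str) -> tuple[bool, str]:
    """Return (allowed, reason); block over-long or pattern-matching output."""
    if len(text) > MAX_LENGTH:
        return False, "output too long"
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, text):
            return False, "blocked pattern detected"
    return True, "ok"
```

Because these checks are reproducible, they can also double as regression tests when a model or prompt version changes.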
Existing approaches to telemetry and monitoring need to be revised.
Caused by challenge: #3, #9
Top solutions:
Develop new and multiple telemetry metrics that may better suit LLM solutions (58.1% adopted, 57.4% satisfied, 0.7% did not find useful).
Use the LLM-as-a-validator strategy to gain granular insight into production behavior without revealing private data (24.8% adopted, 24.1% satisfied, 0.7% did not find useful).
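New telemetry metrics for LLM features tend to be counters over output properties rather than traditional error rates. A minimal sketch of per-request counters; the metric names and thresholds are made up, and a real system would emit these to a monitoring backend instead of an in-memory dict:

```python
# Sketch of LLM-specific telemetry counters recorded per request.
# Metric names and thresholds are illustrative assumptions.
from collections import Counter

metrics = Counter()

def record_request(output: str, latency_ms: float) -> None:
    """Update simple LLM-oriented production metrics for one request."""
    metrics["requests"] += 1
    # Refusal rate: crude prefix check standing in for a real classifier.
    metrics["refusals"] += output.strip().lower().startswith("i can't")
    metrics["slow_responses"] += latency_ms > 2000
    metrics["empty_outputs"] += len(output.strip()) == 0

record_request("I can't help with that.", 1500)
record_request("Here is the answer.", 2500)
```

Tracking refusal and empty-output rates alongside latency gives an early signal when a prompt or model update shifts behavior in production.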
Lack of focus on system-wide evaluation of LLM-based products.
Caused by challenge: #5
Top solutions:
Set up an end-to-end test automation infrastructure (38.2% adopted, 37.6% satisfied, 0.6% did not find useful).
Conduct comprehensive tests beyond unit tests, including reliability and availability (61% adopted, 60.3% satisfied, 0.7% did not find useful).
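An end-to-end check exercises the whole pipeline rather than one component, and can tolerate non-determinism by retrying. A minimal sketch under stated assumptions: `pipeline` is a hypothetical stand-in for a real retrieval-plus-generation chain, and the retry count is arbitrary.

```python
# Sketch of an end-to-end check that exercises the full pipeline
# (retrieval -> model -> guardrails) rather than a single unit.
# `pipeline` is a hypothetical stub; retries absorb non-determinism.

def pipeline(query: str) -> str:
    # Stand-in for the real retrieval + generation + filtering chain.
    return f"Answer about {query}"

def e2e_check(query: str, must_contain: str, attempts: int = 3) -> bool:
    """Pass if any of `attempts` runs produces output containing the
    required substring, tolerating occasional non-deterministic misses."""
    return any(must_contain in pipeline(query) for _ in range(attempts))

ok = e2e_check("refunds", "refunds")
```

The retry-and-any pattern trades strictness for stability, which is often the right call when a flaky output is acceptable but a consistently wrong one is not.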
Responsible AI is a relatively new concept, has a steep learning curve, and might be bureaucratic.
Caused by challenge: #9
Top solutions:
Standardize Responsible AI (RAI) evaluations and practices. The most common were bias and fairness evaluation (39.7% adopted), safety evaluation (55.9% adopted), privacy compliance (55.8% adopted), and security and vulnerability evaluation (59.8% adopted).
Apply RAI red-teaming strategies (48% adopted).
Follow a robust RAI audit process (mandatory for adoption at Microsoft)
The application
The findings of this paper show just how nascent this technology is, and how teams are adapting their software development processes to manage these new integrations. Even so, there is no clear best practice for developing products with LLMs, which is why building a development pipeline can feel so experimental.
For engineering teams integrating LLMs into their core offering, consider the top challenges you are currently encountering, and use this guide to discover a new solution for adjusting your existing processes. If you are just getting started, consider choosing the most popular strategy as a baseline, then adapt it to the development practices, security standards, and culture of your own engineering team.
—
That's it for this week's RDEL. We hope you enjoyed this dose of research!
Lizzie