
Key takeaways
SOC 2 auditors do not accept probabilistic outputs as evidence. HIPAA does not allow ePHI to be processed without an auditable record of who accessed what and when. The NIST AI Risk Management Framework requires organizations to establish monitoring, auditing, and review processes for AI systems as a core governance function. All three frameworks share the same foundational requirement: AI systems must produce evidence that their decisions were made correctly, on authorized data, following a defensible process, and that evidence must be available on demand.
LLM applications do not produce this evidence automatically. Unlike a traditional system where every database query and function call can be logged deterministically, an LLM operates probabilistically. The same prompt may produce different outputs. The model’s reasoning is not directly observable from its output. Without explicit audit logging infrastructure, the LLM application is a black box that satisfies none of the evidentiary requirements of any compliance framework.
Gartner predicts that by 2028, LLM observability investments will reach 50% of all GenAI deployments, up from 15% today. Gartner states that without both explainability and observability, GenAI cannot mature beyond controlled lab environments. Teams not building this infrastructure now are accumulating compliance debt.
| Building an LLM application in a regulated industry and need to scope the compliance architecture? WebOsmotic designs and builds LLM audit logging, observability, and compliance documentation for fintech, healthcare, and SaaS clients. We treat the audit trail as a first-class deliverable in every regulated industry engagement. |
IBM’s guidance on building trustworthy AI agents for compliance documents five categories of audit evidence for compliance-critical AI. These categories apply to any LLM application making decisions that affect regulated data or outcomes, not only agentic systems.
IBM frames the standard precisely: building trustworthy AI for compliance is not about achieving perfect explainability. It is about producing enough evidence to demonstrate that decisions were made through a defensible process, sufficient to reconstruct the reasoning six months after the decision was made.
NIST’s AI Risk Management Framework, released January 2023, organizes AI governance into four functions: Govern, Map, Measure, and Manage. For LLM applications, the Measure and Manage functions carry the most direct audit implications.
SOC 2 Type II applies the AICPA Common Criteria to LLM applications just as it applies them to any system handling customer data. The controls most directly relevant to LLM applications are in the CC7 monitoring series and CC6 logical access series.
Microsoft’s Azure architecture documentation states this directly: the tool produces the compliance record. Letting an LLM evaluate whether a rule was followed collapses the audit trail and introduces probabilistic outputs on questions that have regulatory answers.
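To make the distinction concrete, a rule check can be written as an ordinary function instead of a prompt. The following is a minimal sketch, not a reference implementation: the `ALLOWED` policy table, the role names, and the `check_access` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy table (an assumption for this sketch):
# which roles may read which data classifications.
ALLOWED = {
    "clinician": {"phi", "public"},
    "support": {"public"},
}


@dataclass(frozen=True)
class AccessDecision:
    allowed: bool
    rule: str  # identifies the inputs that produced the decision


def check_access(role: str, data_class: str) -> AccessDecision:
    """Pure function: the same inputs always yield the same decision."""
    permitted = data_class in ALLOWED.get(role, set())
    return AccessDecision(allowed=permitted, rule=f"role={role};data_class={data_class}")


if __name__ == "__main__":
    print(check_access("clinician", "phi"))
    print(check_access("support", "phi"))
```

Because the check has no model in the loop, rerunning it months later on the same inputs gives the same answer, which is what lets the decision be reconstructed during an audit.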
Microsoft’s Agent Governance Toolkit integrates with LangSmith, OpenTelemetry, Arize, MLflow, and others, confirming that observability and compliance are the same engineering investment. The toolkit maps to NIST AI RMF and OWASP Agentic Top 10 for automated governance grading.
| Tool | What it captures | Primary compliance use case |
| --- | --- | --- |
| LangSmith (LangChain) | Traces of every LLM call: prompt, response, latency, token usage. Step-by-step chain execution for LangChain and LangGraph applications | SOC 2 monitoring evidence. Production agent debugging. Regression testing after model updates |
| OpenTelemetry | Vendor-agnostic distributed tracing across the full stack including LLM calls, tool calls, and downstream system interactions | Integration with existing SIEM for unified audit logging. Framework-agnostic compatibility across any LLM stack |
| Microsoft Purview | eDiscovery and audit trail records for AI agent interactions. HIPAA, GDPR, and EU AI Act compliance templates | HIPAA audit trail for Azure AI workloads. Legal hold and forensic investigation support for AI-generated decisions |
| Custom SIEM integration | Application-layer logging of every inference call with structured metadata routed to organization-controlled audit infrastructure | Regulated industry deployments where all audit evidence must reside within organization-controlled infrastructure |
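Microsoft specifically identifies observability as the architectural element that separates teams shipping production agents from teams perpetually in pilot. The audit trail and the production reliability system are not separate engineering investments; they are the same infrastructure.
The minimum viable audit record for an LLM inference call in a regulated environment covers six fields: user identity and authorization; data sources accessed, with BAA status for HIPAA workloads; complete prompt context in encrypted format; model response, including tool call details; timestamp and session identifier; and outcome and escalation status.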
WebOsmotic’s compliance architecture practice for clients in fintech and healthcare designs the logging layer before the application layer is built. The six-field audit record is implemented from the first production call, ensuring that compliance evidence is available from day one, not retrofitted when an auditor asks for it.
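One way to picture the six-field record (user identity and authorization, data sources with BAA status, encrypted prompt context, model response with tool calls, timestamp and session identifier, outcome and escalation status) is as a small immutable data structure. The sketch below is illustrative only: the class name `AuditRecord`, every attribute name, and the choice of JSON as the serialization format are assumptions, and the prompt is assumed to have been encrypted before it reaches this code.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class AuditRecord:
    # 1. User identity and authorization
    user_id: str
    authorized: bool
    # 2. Data sources accessed, e.g. {"name": "ehr", "baa_on_file": True}
    data_sources: List[Dict[str, Any]]
    # 3. Complete prompt context, assumed to be encrypted upstream
    prompt_ciphertext: str
    # 4. Model response, including tool call details
    model_response: str
    tool_calls: List[Dict[str, Any]]
    # 6. Outcome and escalation status
    outcome: str
    escalated: bool
    # 5. Timestamp (UTC) and session identifier, generated if not supplied
    session_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize with sorted keys so equal records give identical text."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    record = AuditRecord(
        user_id="u-1",
        authorized=True,
        data_sources=[{"name": "ehr", "baa_on_file": True}],
        prompt_ciphertext="<ciphertext>",
        model_response="ok",
        tool_calls=[],
        outcome="answered",
        escalated=False,
    )
    print(record.to_json())
```

Making the record frozen means it cannot be mutated after creation inside the application, which keeps the logged values identical to the values that were in effect at inference time.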
| Ready to build an LLM application with compliance-grade audit trails from day one? WebOsmotic builds LLM audit logging, observability, and compliance documentation for fintech, healthcare, and SaaS clients. SOC 2, HIPAA, and NIST AI RMF requirements are scoped and addressed in the architecture phase, not retrofitted after deployment. |
What does SOC 2 require for LLM applications?
SOC 2 Type II applies its Common Criteria to LLM applications. The most relevant controls are CC7.2, which requires monitoring system components for anomalies and addressing alerts, and CC6.1, which requires that only authorized users and systems access data. For LLM applications, auditors sample inference logs showing user identity and data source per call, model output logs, access control enforcement records, and change management records for model updates. AI-generated summaries cannot serve as audit evidence: compliance records must be produced by deterministic logging infrastructure, not by the model being audited.
What does HIPAA require for LLM logging?
HIPAA’s Security Rule requires audit controls recording activity on systems that access ePHI. For LLM applications receiving prompts containing PHI, every inference call must log user identity, data accessed, timestamp, and clinical or operational purpose. HHS’s cloud computing guidance establishes that any service processing ePHI on behalf of a covered entity is a business associate requiring a BAA. Logs must be stored in encrypted, tamper-evident format and retained for the HIPAA-required minimum of six years. The BAA must be verified for every API endpoint in the call chain, including tool calls the agent makes during inference.
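Tamper evidence can be achieved in several ways. One common technique, shown here purely as an illustration and not as something HIPAA prescribes, is a hash chain in which each log entry stores the SHA-256 hash of the previous entry. The function names `append_entry` and `verify_chain` and the entry layout are hypothetical; encryption, retention, and storage are out of scope for this sketch.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    # Sorted keys give a stable serialization for hashing.
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: List[Dict[str, Any]], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Append a log entry whose hash covers the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, payload),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log: List[Dict[str, Any]] = []
    append_entry(log, {"user": "u-1", "action": "inference"})
    append_entry(log, {"user": "u-2", "action": "inference"})
    print(verify_chain(log))
```

A chain like this only detects modification if the most recent hash is also recorded somewhere the application cannot rewrite, such as a separate audit store.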
What is the NIST AI RMF and why does it matter for LLM compliance?
The NIST AI Risk Management Framework, released January 2023, provides voluntary governance guidance for AI systems organized into Govern, Map, Measure, and Manage functions. The Measure function requires monitoring, auditing, and review processes connected to existing organizational risk controls. The Manage function requires incident response plans and change management for model updates. NIST AI 600-1 adds content provenance as a primary requirement for generative AI: the ability to demonstrate that AI-generated content was produced by an authorized system using authorized data. Organizations aligning with NIST AI RMF should build these four functions into their governance documentation alongside the technical logging infrastructure.
What is LangSmith and is it sufficient for compliance observability?
LangSmith is LangChain’s observability platform. It captures a trace of every LLM call (prompt, response, latency, and token usage), with step-by-step execution traces for LangChain and LangGraph applications. It is well-suited to debugging production LLM behavior, regression testing after model updates, and SOC 2 monitoring evidence. For regulated industries with strict data residency requirements, verify that LangSmith’s data residency satisfies HIPAA or financial services requirements before using it as the primary audit log store for inference calls that involve PHI. Microsoft’s Agent Governance Toolkit also integrates LangSmith into a broader compliance framework including NIST AI RMF mapping.
What is the key architectural principle for LLM audit trails?
Microsoft’s Azure architecture documentation states it directly: the tool produces the compliance record, not the LLM. Letting an LLM evaluate whether a rule was followed collapses the audit trail and introduces probabilistic outputs on questions that have regulatory answers. Deterministic logging infrastructure must capture audit evidence before or after the LLM call. Compliance verification checks (was this access authorized? was this output within policy?) must be deterministic functions, not LLM calls. LLM-generated explanations can be provided as user-facing context but cannot serve as legally defensible audit evidence.
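Capturing evidence around the model call, instead of asking the model to report on itself, can be sketched as a thin wrapper. This is an illustration under stated assumptions: `audited_call` is a hypothetical helper, `model_fn` stands in for whatever client function actually calls the model, and the in-memory list `sink` stands in for real audit storage.

```python
import time
from typing import Callable, Dict, List


def audited_call(
    model_fn: Callable[[str], str],
    prompt: str,
    user_id: str,
    sink: List[Dict],
) -> str:
    """Record the request before the call and the result (or error) after it."""
    sink.append(
        {"event": "request", "user_id": user_id, "prompt": prompt, "ts": time.time()}
    )
    try:
        response = model_fn(prompt)
    except Exception as exc:
        # A failed call is still evidence; log it, then re-raise.
        sink.append(
            {"event": "error", "user_id": user_id, "error": repr(exc), "ts": time.time()}
        )
        raise
    sink.append(
        {"event": "response", "user_id": user_id, "response": response, "ts": time.time()}
    )
    return response


if __name__ == "__main__":
    events: List[Dict] = []
    # Stand-in for a real model client (assumption for this sketch).
    print(audited_call(lambda p: p.upper(), "hello", "u-1", events))
    print([e["event"] for e in events])
```

The request entry is written before the model runs, so a record exists even if the call hangs or the process crashes midway.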
How does WebOsmotic build LLM audit logging for regulated industries?
WebOsmotic designs the audit logging layer before the LLM application layer. The logging infrastructure captures six fields for every inference call: user identity and authorization, data sources accessed with BAA status for HIPAA workloads, complete prompt context in encrypted format, model response including tool call details, timestamp and session identifier, and outcome and escalation status. Before any development begins, this six-field record is mapped to the specific compliance requirements of the engagement: SOC 2 Common Criteria, HIPAA Security Rule, or NIST AI RMF Measure requirements. We work with fintech, healthcare, and SaaS clients in the US and India.