[Meta] AI Logging/Observability

Created on 30 June 2025

Overview

This is a meta issue to address logging and observability requirements for Drupal AI.

Problem / Motivation

AI agentic systems, with their autonomous decision-making, tool use, and multi-step planning, present unique monitoring challenges that go beyond traditional logging. Effective observability for these systems is crucial not just for detecting errors, but for truly understanding why and how an agent behaves. This deep insight is vital for efficient troubleshooting, optimizing performance, managing costs, and ensuring responsible AI deployment.

Report / Analysis

To address this challenge, research was carried out and a report was created on how to move forward.

The report is available here; comment access can be requested.

This report explores the conceptual approaches to monitoring AI agents, details the essential data typically collected, provides examples of how this data can be visualized, and surveys the monitoring strategies of leading AI providers. A key industry trend is a move from fragmented logging to standardized, end-to-end observability, often powered by OpenTelemetry. The focus is on capturing granular details of workflow execution, operational performance, and critical quality and safety evaluations. This comprehensive approach is essential for continuous improvement and building trust in AI systems.

Results / How to move forward

Leveraging Drupal's existing OpenTelemetry integration, a robust observability strategy for Drupal AI can be built. This involves systematically using traces, spans, attributes, and events to capture detailed AI functionality, from simple LLM calls to complex multi-agent systems with guardrails. This approach enables comprehensive monitoring and analysis, while allowing data collection and visualization to be handled by third-party services like Grafana, especially beneficial for development environments.
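
As a rough illustration only, a single instrumented LLM call using the OpenTelemetry PHP API could look like the sketch below; the tracer, span, attribute, and event names are placeholder assumptions, not an agreed convention.

  <?php

  use OpenTelemetry\API\Globals;

  // Sketch of wrapping one LLM call in an OpenTelemetry span.
  // Attribute keys and the event name here are illustrative assumptions.
  $tracer = Globals::tracerProvider()->getTracer('drupal.ai');

  $span = $tracer->spanBuilder('ai.chat')->startSpan();
  $scope = $span->activate();
  try {
    $span->setAttribute('ai.provider', 'openai');
    $span->setAttribute('ai.operation_type', 'chat');

    // ... perform the actual provider call here ...

    $span->addEvent('ai.response_received', ['ai.tokens.total' => 123]);
  }
  finally {
    $scope->detach();
    $span->end();
  }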

Sub-Tasks for implementation

  • Co-maintain or re-create OpenTelemetry module
  • Update DDEV with something like Grafana
  • Update AI Logging sub-module to support OpenTelemetry
    • Support OT (OpenTelemetry)
    • Alternative "Stupid" implementation
  • Update AI module (and sub-modules) to correctly use traces, spans, attributes, and events. Alternatively, have AI Logging just listen to AI Core and AI Agents events and implement traces, spans, attributes, and events there, so we avoid more code complexity
  • Document and make available to AI module developers
Type: πŸ“Œ Task
Status: Active
Version: 1.1
Component: AI Logging
Created by: πŸ‡©πŸ‡ͺGermany breidert


Comments & Activities

  • Issue created by @breidert
  • πŸ‡¦πŸ‡²Armenia murz Yerevan, Armenia

    Joining this issue and will try to do my best to help! So about the subtasks:

    Co-maintain or re-create OpenTelemetry module

    Update AI Logging sub-module, to support OpenTelemetry

    I'm the creator and maintainer of the Drupal OpenTelemetry module β†’ and will try to invest time into improving and adapting it.

    Update DDEV with something like Grafana

    To deploy the full OpenTelemetry stack locally, we can use this addon: https://github.com/MurzNN/ddev-grafana

    Implement AI Logging to listen to AI Core and AI Agents events and correctly implement OpenTelemetry's traces, spans, attributes, and events

    From the Drupal side, we should pass the trace_id generated in Drupal to all services that it calls, using the traceparent HTTP header; here are more details about this: https://www.w3.org/TR/trace-context/#traceparent-header
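
    A minimal sketch of that propagation when calling a downstream service via Drupal's HTTP client; the IDs are freshly generated here only to show the format (in a real setup they should come from the active OpenTelemetry span), and the URL is just a placeholder:

      <?php

      // Build a W3C traceparent header: {version}-{trace-id}-{parent-id}-{flags}.
      // Random values are used only to show the expected format
      // (32 and 16 lowercase hex characters).
      $traceId = bin2hex(random_bytes(16));
      $parentId = bin2hex(random_bytes(8));
      $traceparent = sprintf('00-%s-%s-01', $traceId, $parentId);

      \Drupal::httpClient()->request('POST', 'https://ai-provider.example.com/v1/chat', [
        'headers' => ['traceparent' => $traceparent],
        'json' => ['prompt' => 'Hello'],
      ]);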

    Upgrade existing "Simple" logging implementation

    The Drupal logging is pretty limited in storing additional metadata together with the log in a structured format. To resolve this, I created the Extended Logger β†’ module, which doesn't replace the Drupal logging system but extends it, allowing free-form structured metadata to be stored directly in the log records, and reports and charts to be built directly from the logs.

    It should suit the "Simple" logging task well, but the best tool for logging decoupled systems is still OpenTelemetry :)
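
    For illustration, a log call carrying structured metadata in the PSR-3 context could look roughly like the sketch below; the context keys are made up for the example and are not an agreed schema:

      <?php

      // A regular Drupal/PSR-3 log call with free-form structured metadata in
      // the context array, which Extended Logger can persist with the record.
      // The context keys below are illustrative only.
      \Drupal::logger('ai')->info('Chat request completed with @provider', [
        '@provider' => 'openai',
        'ai_operation_type' => 'chat',
        'ai_model' => 'gpt-4o',
        'ai_tokens_input' => 52,
        'ai_tokens_output' => 318,
      ]);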

  • πŸ‡¨πŸ‡­Switzerland dan2k3k4 Zurich

    Do we want to do this in the ai_logging submodule or create an ai_opentelemetry wrapper module to link ai_logging with opentelemetry?

  • πŸ‡¦πŸ‡²Armenia murz Yerevan, Armenia

    murz β†’ changed the visibility of the branch 3533109-ai-logging-to-logger-context to hidden.

  • πŸ‡¦πŸ‡²Armenia murz Yerevan, Armenia

    I analyzed the current ai_logging module, and it seems that we can make the logging part much simpler, so I created a separate subtask with more details about the idea: ✨ Rework AI Logging to use the standard PSR Logger interface with passing metadata in context Active

  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    So these are the questions I have about this; murz, as someone who understands OT, could you help me?

    - My understanding is that OT is primarily useful when we want to aggregate logging metadata across an organisation and multiple different pieces of software and use cases. If all you wanted was to know how much money you've spent with Claude for your single Drupal site, OT integration would be overkill, right?

    - There is an issue that the OpenAI Agents SDK solves, which uses an architecture of traces, group ids (for chat history) and spans to show you a trace of the agent workflow across LLM calls and tool/function calls, which helps you understand WHY the agent architecture ended up that way.
    - Similarly you might want to log a trace of an ECA workflow where each span is the output of a specific piece of logic inside a node on the graph.

    My understanding is that OT is not an out-of-the-box solution to either of these problems? It's not primarily for logging and debugging a workflow of functions inside code, but for monitoring API calls and information transferring between systems. My understanding is you COULD use OT for this if you wanted to, creating your own approach to traces and spans for agent workflows and making them fully compatible with OT, but OT doesn't necessarily give you anything towards that goal?

    Is this correct or have I misunderstood OT?

  • πŸ‡¦πŸ‡²Armenia murz Yerevan, Armenia

    - My understanding is that OT is primarily useful when we want to aggregate logging metadata across an organisation and multiple different pieces of software and use cases. If all you wanted was to know how much money you've spent with Claude for your single Drupal site, OT integration would be overkill, right?

    Yes, just to count the money and wasted tokens, OTEL Traces and Metrics are overkill; logs are quite enough, and with Extended Logger we can even aggregate them and count the sum per period. I will come back with examples later.

    So, OpenTelemetry is mostly for tracking complex operations that include several steps and subcalls. As I see it, most calls to AI are just simple HTTP request-response exchanges that do not need any OTEL Spans to track the details.

    But OpenTelemetry can also act just as a log ingestor, collecting regular Drupal logs like any other logger (syslog, file, stdout, etc.). So we can switch from Extended Logger to OpenTelemetry and back at any moment.

    The problem with OpenTelemetry is that we can only send data to it, not receive and render it, so if we switch to OT, we will lose the ability to render logs, traces, metrics, etc. in the Drupal admin panel. That is the main problem with our OT case, I suppose.

    But we can provide a simple pre-configured setup of the Grafana OTEL Stack locally with ready-to-use reports.

    Otherwise, we will need to invent the storage of logs, traces, and metrics on the Drupal side, which is a huge task.

    - There is an issue that the OpenAI Agents SDK solves, which uses an architecture of traces, group ids (for chat history) and spans to show you a trace of the agent workflow across LLM calls and tool/function calls, which helps you understand WHY the agent architecture ended up that way.
    - Similarly you might want to log a trace of an ECA workflow where each span is the output of a specific piece of logic inside a node on the graph.

    Yes, this case is exactly for OTEL infrastructure, but again - in conjunction with Drupal logs ;) The log is the source of the event that happened, and Traces are just additional metadata of what and how it was executed.

    So, my plan is:

    1. Improve the current logging to contain all necessary info in the log record. As I see it, the Drupal default logger should be enough for simple logging, and using the Extended Logger module will allow us to store additional metadata together with the log entry; this will not require writing any specific code on the AI modules' side. This is tracked in ✨ Rework AI Logging to use the standard PSR Logger interface with passing metadata in context Active

    2. Add OpenTelemetry spans to the operations, tracking the span ID and trace ID in the log records to be able to find the connection between them (see the sketch after this list).

    3. Create Grafana configurations to visualize the logs and spans.

    4. Add reporting metrics to OTEL.
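
    As a sketch of step 2, assuming the OpenTelemetry PHP API and purely illustrative span and context key names: wrap the operation in a span and put its trace and span IDs into the log context so the two can be joined later.

      <?php

      use OpenTelemetry\API\Globals;

      // Sketch: correlate a log record with the span that covered the operation.
      $tracer = Globals::tracerProvider()->getTracer('drupal.ai');
      $span = $tracer->spanBuilder('ai.agent.run')->startSpan();
      try {
        // ... run the agent / AI operation here ...

        \Drupal::logger('ai')->info('Agent run finished', [
          'trace_id' => $span->getContext()->getTraceId(),
          'span_id' => $span->getContext()->getSpanId(),
        ]);
      }
      finally {
        $span->end();
      }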

  • πŸ‡¬πŸ‡§United Kingdom yautja_cetanu

    Decisions from a call on 10th July:

    • Bucket 1: We aim to completely replace the existing AI logging module.
      • The AI logging module will require https://www.drupal.org/project/extended_logger β†’
      • We will move to the format of logs being stored in the extended logger entity where most of the data is in the custom Data field
      • We need an upgrade path + migration of current entities to the new entities (or introduce a new module, AI Logging Deprecated)
    • Bucket 2: We make some simple views in the AI logging module to do things like monitoring costs, etc
    • Bucket 3: We make a method of exporting logs (to a file system or database system)
    • Bucket 4: We would create another module for bringing the AI logging into OT, to do more advanced observability. This would maybe include work with Grafana
      • Receive work and convert it into traces and spans
      • What to send and how to send it? (They think this won't be the same as in the simple version?) (Alexy thinks that the events will be more developed events and just logs.) (Jamie thinks that this will be the same, as most advanced stuff needs to go into the DB version.)

    Things we will need to think about:

    • Abstraction layer: if different APIs report token cost back differently, we need to handle bringing it all into one place (see the sketch after this list)
    • We will likely want AI to summarise the logs (similar to the OpenAI logging module we had)
    • We might want to look at the OpenAI SDK to see if we can create custom attributes to allow us to do similar things to what OpenAI Agents SDK does.
    • If we store all the information in the AI logging, we may need to provide the ability for provider modules to tell us what syntax they use for similar metadata (like Request cost)
    • We need to think about how we bring Search AI results and tools alongside this
    • Show Alexy the UI we used for LLM logging right now
    • We probably want the AI to summarise each span to increase readability
    • Do we want to make use of Michal's UI for LLM logging of agents
    • Do we have some way of categorising AI logs depending on purpose and doing different things with metadata (we log agent calls differently to AI translate or the simple API explorer)?
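
    A hypothetical sketch of the abstraction layer mentioned in the first bullet above; the interface name, method, and array shape do not exist anywhere yet and are only meant to illustrate the idea:

      <?php

      // Hypothetical: normalize provider-specific usage reporting into one
      // common shape before it is logged or exported.
      interface AiUsageNormalizerInterface {

        /**
         * Converts a provider-specific raw usage payload into a common structure.
         *
         * @return array{input_tokens: int, output_tokens: int, cost_usd: float|null}
         */
        public function normalize(array $rawUsage): array;

      }

      // Possible implementation for OpenAI-style usage payloads.
      final class OpenAiUsageNormalizer implements AiUsageNormalizerInterface {

        public function normalize(array $rawUsage): array {
          return [
            'input_tokens' => (int) ($rawUsage['prompt_tokens'] ?? 0),
            'output_tokens' => (int) ($rawUsage['completion_tokens'] ?? 0),
            // Cost would be calculated from a separate pricing table.
            'cost_usd' => NULL,
          ];
        }

      }
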
  • πŸ‡©πŸ‡ͺGermany breidert

    Effort Estimation based on current information

    Re-Build Logging (Buckets 1-3 above): 10-15 days

    The majority of the work goes into the re-creation of the AI logging sub-module, and updating the Extended Logger β†’ module.

    Creating sensible management views (like number of tokens spent, or number of AI calls) does not take so much time. Most of the work for this would be to define what views make sense.

    Creating a log export function does not take much time.

    Create OTEL integration: 5-10 days

    The creation of a new sub-module called AI Observability is not very difficult. The majority of the work goes into mapping AI event data to OTEL's traces, spans, attributes, and events in a meaningful way. The module Opentelemetry β†’ also needs to be upgraded.

    Delivery

    It is planned that EPAM delivers the above functionality with their own team. Communication with the core maintainers Marcus and Artem will be via a dedicated channel in Slack (#ai-obersability).

    First the new logging will be developed, then the OTEL integration.

  • πŸ‡ΊπŸ‡ΈUnited States tonytosta

    I will be involved with this effort.

  • πŸ‡©πŸ‡ͺGermany marcus_johansson

    Upped to 1.2.x

    Some technical background that might be good as background when going into this:

    • The current AI Logging basically hasn't had any real updates since it was created, and the actual AI core solution has evolved a lot since, so it's not always the best place to look for a good solution.
    • However, the possibility to log specific tags and operation types, and/or to decide whether the responses should be logged, is necessary to keep. Embeddings create a lot of superfluous information that easily fills up and kicks out other logs if you have limits on how many logs to keep. If it's a new module, we should opt in to the chat operation type by default.
    • There are three events that matter - PreGenerateResponseEvent, PostGenerateResponseEvent and PostStreamingResponseEvent (a rough subscriber sketch follows after this list). For bucket 4, more will probably be needed. The events are documented here: https://project.pages.drupalcode.org/ai/1.1.x/developers/events/
    • In the non-streaming case, the result can be fetched from PostGenerateResponseEvent; in the streaming case it is instead fetched from PostStreamingResponseEvent.
    • Each event has a request thread id, so you can easily connect a PreGenerateResponseEvent with a PostGenerateResponseEvent, and it is necessary with a PostStreamingResponseEvent since that only holds output information.
    • Each normalized/abstracted operation type has an input object and one or many output objects. Examples: Chat Input and non-streamed Chat Output.
    • You will notice that they have toArray and fromArray functions; we added these because they need to be serializable for agents. Not all operation types have this yet, but it will be there at the launch of 1.2.0. This means that for your data object you can probably just choose to log this. It includes the raw response if it exists, but you will also see that there is a normalized way of getting tokens, for instance.
    • The streaming response event does not have this, but we are working on adding it for tokens, so I will add a follow-up issue on this and try to work on that asap. But just assume that you will be able to use toArray on the full stream.
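
    For orientation, a rough sketch of how a logging listener for these events could be wired up; the event namespaces, the event-name wiring, and the getter names in this sketch are unverified assumptions, and the linked events documentation has the real API:

      <?php

      namespace Drupal\ai_logging\EventSubscriber;

      use Symfony\Component\EventDispatcher\EventSubscriberInterface;

      // Sketch only: class paths and accessor names of the AI events are assumed
      // here; check the AI module's events documentation for the real API.
      final class AiResponseLogSubscriber implements EventSubscriberInterface {

        public static function getSubscribedEvents(): array {
          return [
            'Drupal\ai\Event\PostGenerateResponseEvent' => 'onPostGenerate',
            'Drupal\ai\Event\PostStreamingResponseEvent' => 'onPostStreaming',
          ];
        }

        public function onPostGenerate(object $event): void {
          // getRequestThreadId() and getOutput() are assumed accessor names.
          \Drupal::logger('ai')->info('AI response generated', [
            'request_thread_id' => $event->getRequestThreadId(),
            'output' => $event->getOutput()->toArray(),
          ]);
        }

        public function onPostStreaming(object $event): void {
          \Drupal::logger('ai')->info('AI streamed response completed', [
            'request_thread_id' => $event->getRequestThreadId(),
          ]);
        }

      }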