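Observability and Evaluation

One-sentence summary
See what happens inside an agent (tracing, logs, token usage and cost) and measure how well it performs (eval). Because agent behavior is non-deterministic, observability and evaluation are the engineering capabilities required to take an agent from demo to production.

What problem it solves

An agent runs multiple steps, calls tools, and behaves non-deterministically, which makes it hard to debug and hard to trust. Tracing lets you replay each step; eval lets you quantify quality and catch regressions.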
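
To make "catch regressions" concrete, here is a minimal Python sketch of a dataset regression check. It assumes exact-match scoring and a stubbed agent; `DATASET`, `pass_rate`, `check_regression`, and `stub_agent` are hypothetical names for illustration and do not come from any framework discussed here.

```python
from typing import Callable

# Hypothetical golden dataset: (input, expected answer) pairs.
DATASET = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]


def pass_rate(agent: Callable[[str], str], dataset: list[tuple[str, str]]) -> float:
    """Fraction of cases where the agent's answer exactly matches the expected one."""
    passed = sum(1 for question, expected in dataset if agent(question).strip() == expected)
    return passed / len(dataset)


def check_regression(current: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Return True when the current pass rate is not below baseline - tolerance."""
    return current >= baseline - tolerance


def stub_agent(question: str) -> str:
    # Stand-in for a real agent call; it deliberately gets "3*3" wrong.
    answers = {"2+2": "4", "capital of France": "Paris", "3*3": "6"}
    return answers.get(question, "")


rate = pass_rate(stub_agent, DATASET)
ok = check_regression(rate, baseline=1.0)
```

In practice the baseline would be the pass rate recorded for the last released version, and a failing check would block the change.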

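Design dimensions / implementation spectrum

  • Tracing: built-in ↔ integrated with OpenTelemetry or a third-party tool (VoltAgent ships built-in observability)
  • Metrics: tokens, cost, latency, step count, success rate
  • Logs / replay: structured events, time-travel debugging
  • Evaluation: human ↔ rule-based ↔ LLM-as-judge ↔ dataset regression
  • Closed loop: whether evaluation results feed back into improvement (ACE learns from feedback)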
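
The tracing and metrics dimensions can be illustrated with a small in-process tracer. This is a sketch under stated assumptions, not the API of OpenTelemetry or of any framework listed here: `Span`, `Tracer`, and the `tokens` attribute are hypothetical names.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    name: str
    start: float
    end: float = 0.0
    attrs: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class Tracer:
    """Builds a tree of spans; nested `with` blocks become child spans."""

    def __init__(self) -> None:
        self.roots: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attrs) -> Iterator[Span]:
        s = Span(name=name, start=time.perf_counter(), attrs=dict(attrs))
        # Attach to the currently open span, or to the root list if none is open.
        (self._stack[-1].children if self._stack else self.roots).append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end = time.perf_counter()
            self._stack.pop()

    def total(self, key: str) -> float:
        """Sum a numeric attribute (for example tokens) over the whole span tree."""
        def walk(spans: list[Span]) -> float:
            return sum(sp.attrs.get(key, 0) + walk(sp.children) for sp in spans)
        return walk(self.roots)


tracer = Tracer()
with tracer.span("agent_run"):
    with tracer.span("llm_call") as s:
        s.attrs["tokens"] = 120
    with tracer.span("tool_call", tool="search"):
        pass
    with tracer.span("llm_call") as s:
        s.attrs["tokens"] = 80
```

Walking `tracer.roots` afterwards gives the replayable step list, and `tracer.total("tokens")` gives one of the metrics named above; a production setup would more likely export spans to an OpenTelemetry backend.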

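Key points

  • Observability is often under-prioritized, yet it decides whether a production agent succeeds or fails.
  • LLM-as-judge is the mainstream eval technique, but the judge must be calibrated.
  • An evaluation loop (from eval to improvement) is the path toward self-improving agents.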
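
One common way to calibrate an LLM judge is to measure how well its verdicts agree with human labels on the same samples. The sketch below uses Cohen's kappa as the agreement measure; this choice, the function name `cohens_kappa`, and the sample labels are assumptions for illustration, not something the frameworks below prescribe.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on six samples, for illustration only.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(judge_labels, human_labels)
```

A low kappa suggests the judge prompt or rubric needs revision before its scores are trusted for regression gating.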

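Framework implementation comparison

The table below summarizes 49 frameworks that implement observability / evaluation (conclusions from source-level reading). On the website each entry is expandable and comes with source excerpts.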

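Framework | Implementation
Aeon | After each successful run, Haiku automatically assigns a score from 1 to 5 (failed/empty = 1, excellent = 5) and writes it to memory/skill-health/{skill}.json (rolling window of 30 runs plus avg); token usage is recorded in token-usage.csv; cron-state.json stores the success rate and consecutive-failure count; skill-evals provides assertion tests; scripts/skill-runs audits Actions runs.
AG2 | runtime_logging is a global switch; the BaseLogger abstraction with SqliteLogger/FileLogger backends records chats, LLM calls, cost, and tool events; gather_usage_summary aggregates tokens/cost; built-in OpenTelemetry instrumentation (agent/llm/pattern spans); contrib/agent_eval handles evaluation.
Agency Swarm | Reuses the SDK's built-in tracing (OpenAI Traces, automatic) and connects to Langfuse / AgentOps via with trace(...) (examples/observability.py); automatically accumulates tokens/cost (sub-agent raw_responses are back-filled per model into the parent result, execution.py:252); agency.visualize() outputs a structure diagram.
Agent-LLM (AGiXT) | Writes all activity into the conversation log ([ACTIVITY]/[SUBACTIVITY] markers, including command execution success/failure); webhook events command.execution.started/failed (Extensions.py:1078); UsageTrackingMiddleware records token usage; evaluation-style chains (Smart Instruct) perform self-reflection. No standalone eval harness (to be confirmed).
AgentDock | Built-in Evaluation Framework: the runEvaluation runner plus multiple evaluators (RuleBased/LLMJudge/NLPAccuracy/ToolUsage/LexicalSimilarity/KeywordCoverage/Sentiment/Toxicity), with results stored in JsonFileStorage; structured, categorized logging via logger (LogCategory); token usage is accumulated through onFinish into the orchestration state (cumulativeTokenUsage).
AgentField | Automatic workflow DAG visualization (GET /api/v1/workflows/{id}/dag); Prometheus /metrics (discovery and other handlers are instrumented with promauto); structured JSON logs; execution timeline; /health + /ready (K8s); app.note() writes audit logs. Formal eval: N/A (relies on VC auditing rather than an eval framework).
Agentic Context Engine (ACE) | A focus of this framework. EvaluateStep + TaskEnvironment produce feedback / correctness signals; ships benchmarks such as tau2-bench (benchmarks/); observability: ObservabilityStep, Logfire auto-instrumentation of PydanticAI (logfire extra), the kayba-tracing SDK (configure/trace/start_span), and per-skill helpful/harmful/used counters that serve as a utility metric.
AgentScope | First-class OpenTelemetry: TracingMiddleware (middleware/_tracing/) opens spans for the agent, llm, and tool layers and depends on opentelemetry-sdk + the OTLP exporter (hard dependencies in pyproject); the event stream itself provides fine-grained observability; the app/ service side carries OTel. The README mentions "built-in evaluation", but no standalone eval package was found under src/agentscope in this repo (evaluation lives at the docs/examples level).
Agentset | The retrieval flow streams status back in real time (data-status: generating-queries/searching/generating-answer, agentic/index.ts:61) along with logs (logs); usage is recorded in Postgres (chat/route.ts:33); server-side event analytics (logServerEvent); Tinybird stores webhook events; the README lists evaluation/benchmarks as platform features.
AgentVerse | ① Singleton Logger (Auto-GPT style: colored output + logs/activity.log/error.log + typewriter effect, logging.py:32); ② each agent tracks dollar spend via get_spend(), and the environment aggregates it with report_metrics() (environments/base.py:50); ③ the task-solving Evaluator scores the plan by rule (accepted once score ≥ 8, tasksolving_env/basic.py:95), and agentverse-benchmark runs batch evaluation on datasets.
Astron Agent | End-to-end OpenTelemetry: common/otlp, with span.start(...) + add_info_events at every step; structured NodeLog/NodeTraceLog/Usage (token counts) are written to the trace node by node; DeepWiki badge integrated. No built-in automated eval framework (evaluation criteria to be confirmed).
AutoGen | The runtime has built-in OpenTelemetry tracing (TraceHelper, injectable via tracer_provider, disabled with AUTOGEN_DISABLE_RUNTIME_TRACING); structured event stream (ToolCallRequestEvent/ThoughtEvent and others at each step) + EVENT_LOGGER_NAME/TRACE_LOGGER_NAME logs; evaluation tool AGBench (python/packages/agbench).
Botpress | The non-blocking onTrace hook receives every trace (llm_call_started, tool calls, errors, output); packages/llmz/src/types.ts defines the Trace type; Cognitive has request/response interceptors for instrumentation; tests use Vitest + LLM retries + a snapshot serializer.
ConnectOnion | Each step writes to current_session['trace']; the Logger has three outputs (Rich terminal + plain-text .co/logs/{name}.log + .co/evals/.yaml sessions), including tokens/cost; the eval plugin handles evaluation; @xray + auto_debug() provide interactive breakpoint debugging.
Cordum | A focus. ① Tamper-evident audit: an HMAC-SHA256-signed per-tenant hash chain (Redis Stream + CAS Lua), core/audit/chain.go:265, with chain verification in chain_verify.go; ② SIEM export (webhook/syslog/Datadog/CloudWatch/SOC2), core/audit/exporter.go:283; ③ DecisionLog records every policy decision, scheduler/decision_log_adapter.go; ④ OTel metrics/trace, core/infra/otel/; ⑤ Policy Simulator replays rules against historical data (kernel.go:623 Simulate) + shadow eval, safetykernel/shadow_eval.go.
Cortex Memory | Structured logging with tracing (logging.rs); REST /health + /health/ready health checks; stats plus UpdateStats/CacheStats (skip_rate/cache_hit_rate); Svelte dashboard (insights) for visualization; LoCoMo10 benchmark script in examples/locomo-evaluation.
CrewAI | Built-in event bus crewai_event_bus (full-lifecycle events for LLM/Tool/Agent/Memory) + anonymous OpenTelemetry telemetry (can be turned off with OTEL_SDK_DISABLED); Task guardrail / task_evaluator evaluate outputs.
Dust | Multi-layered: Langfuse LLM traces (@langfuse/tracing + front/lib/api/llm/traces/), OpenTelemetry (Temporal workflow interceptors + core/src/open_telemetry.rs), product-level observability metrics (usage and latency for tools/skills/datasources, including Elasticsearch analytics), and user feedback.
E2B | Sandbox-level telemetry rather than agent evaluation: getMetrics() returns CPU/memory/disk; control-plane /sandboxes/{id}/logs and /metrics endpoints; RPC can attach createRpcLogger to record traffic.
Haystack | Tracing: Tracer/Span abstractions with automatic hookup to OpenTelemetry/Datadog via auto_enable_tracing() (called at startup in __init__.py), plus LoggingTracer; content-level tracing is toggled by an env variable; Eval: components/evaluators/ (faithfulness/context_relevance/SAS/MRR/NDCG/recall/LLMEvaluator…) + EvaluationRunResult for reports.
hcom | The hcom TUI (ratatui) dashboard shows all agents; hcom list lists active agents; hcom term [name] views or injects into an agent's live PTY screen (via the TCP inject port + vt100 parsing, commands/term.rs:1, :35); hcom transcript reads the other agent's structured transcript; hcom events --wait blocks until a match (scriptable); hcom status runs diagnostics.
Hermes Agent | The session_search tool does cross-session recall over a SQLite FTS5 full-text index (three modes: discovery/scroll/browse; zero LLM cost); hermes logs --session <id> filters by session (set_session_context); /usage and /insights show tokens/cost; batch_runner.py + trajectory_compressor.py produce training trajectories.
Hive | DecisionTracker records every decision (what was tried / what was chosen / the outcome), the raw material for evolution; runtime_logger/runtime_log_store provide structured logs; the EventBus event stream feeds the dashboard; judge evaluates node output against success_criteria; HoneyComb is an external observation console.
Lagent | The MessageLogger hook prints each AgentMessage to the log, colored by sender (optional file handler); get_steps() unrolls the tool loop into a thought/tool/environment trajectory. No built-in token/cost accounting or evaluation framework.
LangChain | core has a built-in callbacks + tracers system (core/…/tracers/); every middleware hook is wrapped as a LangSmith span with @traceable (factory.py:910,1019) and inputs are scrubbed with _scrub_inputs (factory.py:140); evaluation/monitoring is handled by the external LangSmith platform (README).
Llama Agentic System (llama-stack-apps) | Observability = AgentEventLogger/EventLogger stream-print every step (shield_call/inference/tool_execution), and turn.steps can be iterated by step_type; evaluation = the llama-stack-client eval run_scoring CLI + agent_store/eval/bulk_generate.py, which runs a dataset in bulk to generate answers and then scores them.
LlamaIndex | Standalone llama-index-instrumentation package: Dispatcher emits spans/events, with the @dispatcher.span decorator and add_event_handler/add_span_handler hooks (connecting to Arize/Langfuse and others); at each step the agent uses write_event_to_stream to expose events such as AgentStream/ToolCall; core/evaluation/ provides RAG evaluators such as faithfulness/relevancy.
llm-agents | Relies on print() only: prints the rendered prompt at the start and generated + Observation each round (agent.py:66,77); no structured trace, no token/cost accounting, no eval framework. The tests/ directory contains only setup checks and empty unit/integration packages.
LoongFlow | ① Structured logging via get_logger throughout + Rich-formatted message printing (message_logger.py), with a trace_id on every step; ② per-cycle accounting of prompt/completion tokens and cost (pes_agent.py:294); ③ Evaluator is a first-class citizen: it writes candidate code to a file and runs the user's evaluate() in a separate subprocess with a timeout to obtain score/metrics/summary; ④ math_agent ships a visualizer for the evolution tree / island distribution.
Maestro | Uses rich Console/Panel for colored printing of each step; prints input/output tokens on every call along with the dollar cost estimated by calculate_subagent_cost(); the full exchange log is written to a timestamped .md file. No evaluation framework.
Mastra | AI tracing: the SpanType enum (AGENT_RUN/WORKFLOW_RUN/MODEL_GENERATION/TOOL_CALL/MEMORY_OPERATION/RAG_ and others) forms a structured span tree, exported through the Observability entry point (@mastra/observability, including storage/platform/OTel exporters); evals/scorers: @mastra/evals + evals/scoreTraces score traces; logger/ provides leveled logging.
MetaGPT | CostManager accumulates tokens/cost after every LLM call (_update_costs); Team.invest sets a budget, and overspending raises NoMoneyException; global loguru logging (metagpt/logs.py); exp_pool (experience pool) uses the @exp_cache decorator to cache and score (SimpleScorer/LLM judge) past experiences for reuse.
Modus | The console package provides structured logging (debug/info/warn/error, reported via a host function); agents emit events via PublishEvent → GoAkt topic actor → GraphQL Subscription, pushed over SSE (text/event-stream); Sentry spans are integrated for distributed tracing. No built-in eval framework.
nanobot | Structured loguru logging throughout (including the turn state-machine trace StateTraceEntry, tool events, and token usage); the runtime event bus RuntimeEventBus pushes to the WebUI (model/status/latency); optional Langfuse tracing (setting LANGFUSE_SECRET_KEY automatically wraps the OpenAI client) and LangSmith; no built-in evaluation framework (pytest test suite).
Open Multi-Agent | onProgress structured events (task_start/complete/retry/skipped/budget_exceeded…) + onTrace spans (llm_call/tool_call/task/agent/plan_ready/agent_stream) + renderTeamRunDashboard(), which generates a pure-HTML task DAG dashboard after a run; keys/tokens are automatically redacted by redaction.ts. No built-in eval framework.
OpenClaw | The agent loop emits a structured event stream (agent_start/turn_start/message_/tool_execution_/turn_end/agent_end) for UI/log consumers; every message carries usage (tokens + cost); /usage, /trace on, and /verbose chat commands; cron run-log (JSONL) records every scheduled run; the trajectory/transcripts subsystem retains trajectories; qa/ contains e2e tests and a QA lab extension.
Pilot Protocol | Structured JSON logs via slog; pilotctl info/--json exposes snapshots of address/peers/connections/uptime; the public Polo dashboard shows network-wide node/request statistics; 1048 tests (including many congestion-control/SACK/replay regression cases, zz__bug_test.go).
Pipecat | BaseObserver listens to the frame stream out of band (on_process_frame/on_push_frame) without modifying the pipeline; built-in turn/latency/startup observers; PipelineParams(enable_metrics=, enable_usage_metrics=) collects tokens/latency; OpenTelemetry tracing via TurnTraceObserver + utils/tracing/ (tracing extra); Sentry integration.
PraisonAI | MinimalTelemetry (anonymous PostHog usage data, privacy-first) + OpenTelemetry integration (traces/spans/metrics, as stated in the README) + Langfuse tracing (praisonai langfuse); token/cost collection (telemetry/token_collector.py); eval/ covers accuracy/performance/reliability/criteria evaluation.
Semantic Kernel | Built-in OpenTelemetry: KernelFunction carries an ActivitySource("Microsoft.SemanticKernel") + Meter (invocation/streaming duration histograms); agent invocations go through ModelDiagnostics.StartAgentInvocationActivity; filters + structured logging (LoggerMessage). No built-in evaluation framework; relies on external tools.
smolagents | Monitor accumulates tokens/step duration via the ActionStep callback; AgentLogger (Rich) provides leveled logging; memory.replay() replays a run; return_full_result returns a RunResult (token_usage/steps/timing/state); the telemetry extra connects to OpenTelemetry/Arize Phoenix.
Strands Agents | First-class OpenTelemetry: Tracer starts spans for agent/cycle/model/tool (telemetry/tracer.py:77); EventLoopMetrics records tokens/latency/cycles; StrandsTelemetry sets everything up in one call; callback_handler provides streaming callbacks (PrintingCallbackHandler by default); evaluation goes through OTEL export.
Swarm | debug_print only.
SwarmClaw | OpenTelemetry OTLP traces (@opentelemetry/sdk-node, endpoint/headers configured via env); in-house logger/execution-log/activity-log/run-ledger; usage/cost metering; eval/ does baseline + environment-plan evaluation; the autonomy supervisor reflects on every autonomous run.
Swarms | loguru logging (utils/loguru_logger.py); telemetry reports agent data to swarms.world by default (SWARMS_TELEMETRY_ON switch, telemetry/main.py:150); evaluation-style topologies council_as_judge/debate_with_judge/majority_voting act as LLM-as-judge.
Transformers Agents | Step logs and verbose output; no built-in eval.
Upsonic | eval/ subpackage: three kinds of evaluators, AccuracyEvaluator, performance, and reliability (.run()); observability via integrations/ connecting to Langfuse / OpenTelemetry (otel extra) / PromptLayer; core dependencies include sentry-sdk[opentelemetry]; the pipeline emits an event at every step.
vectara-agentic | Built-in Arize Phoenix (OpenInference instruments LlamaIndex, _observability.py:16 setup_observer); eval_fcs() writes the Vectara FCS score back as a span evaluation (_observability.py:101). The AgentCallbackHandler/agent_progress_callback callbacks report TOOL_CALL/TOOL_OUTPUT in real time (agent.py:623). VHC (hallucination correction), compute_vhc/analyze_hallucinations, is its distinctive evaluation capability.
VoltAgent | Core selling point: full-stack OpenTelemetry with 3 custom SpanProcessors: WebSocket (real-time push to the VoltOps Console), LocalStorage (local trace storage + query), and LazyRemoteExport (OTLP → VoltOps / any backend); enabled by default with zero configuration. Evaluation: eval (create-scorer/LLM-judge) + standalone @voltagent/scorers / @voltagent/evals + a langfuse exporter.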

Framework implementation comparison · source level

49 frameworks implement this component · 47 include source excerpts

Aeon · yaml

After each successful run, Haiku automatically assigns a score from 1 to 5 (failed/empty = 1, excellent = 5) and writes it to memory/skill-health/{skill}.json (rolling window of 30 runs plus avg); token usage is recorded in token-usage.csv; cron-state.json stores the success rate and consecutive-failure count; skill-evals provides assertion tests; scripts/skill-runs audits Actions runs.

github/workflows/aeon.yml:604 · github/workflows/aeon.yml:687
.github/workflows/aeon.yml:604 yaml
          fi
          echo "::notice::Skill output captured to .outputs/${SKILL}.md ($(wc -c < ".outputs/${SKILL}.md") bytes)"

      - name: Analyze skill output
        id: analyze
        if: steps.work.outputs.mode != '' && steps.run.outcome == 'success'
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          CLAUDE_CODE_OAUTH_TOKEN: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          ANTHROPIC_BASE_URL: ${{ vars.ANTHROPIC_BASE_URL }}
          BANKR_LLM_KEY: ${{ secrets.BANKR_LLM_KEY }}
        run: |
          SKILL="${{ steps.skill.outputs.name }}"
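
The rolling 30-run score window with an average can be sketched in Python as follows. This is not Aeon's code: the function name `record_score` and the JSON shape with `scores` and `avg` keys are assumptions; only the 1 to 5 score range and the window of 30 come from the description of Aeon.

```python
import json
from collections import deque

WINDOW = 30  # rolling window size taken from the description of Aeon


def record_score(history: list[int], score: int, window: int = WINDOW) -> dict:
    """Append a 1-5 score, keep only the most recent `window` entries, recompute avg."""
    if not 1 <= score <= 5:
        raise ValueError("score must be between 1 and 5")
    recent = deque(history, maxlen=window)  # drops the oldest entries automatically
    recent.append(score)
    scores = list(recent)
    return {"scores": scores, "avg": round(sum(scores) / len(scores), 2)}


# Hypothetical health record for one skill, updated after each run.
health = {"scores": [], "avg": 0.0}
for s in [5, 4, 1] + [3] * 30:
    health = record_score(health["scores"], s)

serialized = json.dumps(health)  # what would be written to the per-skill JSON file
```

Keeping only a bounded window means the average reflects recent behavior, so a skill that degrades shows up quickly instead of being masked by old good runs.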

AG2 · python

runtime_logging is a global switch; the BaseLogger abstraction with SqliteLogger/FileLogger backends records chats, LLM calls, cost, and tool events; gather_usage_summary aggregates tokens/cost; built-in OpenTelemetry instrumentation (agent/llm/pattern spans); contrib/agent_eval handles evaluation.

logger/base_logger.py:26 · logger/sqlite_logger.py:66
autogen/logger/base_logger.py:26 python
LLMConfig = dict[str, None | float | int | ConfigItem | list[ConfigItem]]


class BaseLogger(ABC):
    @abstractmethod
    def start(self) -> str:
        """Open a connection to the logging database, and start recording.

        Returns:
            session_id (str):     a unique id for the logging session
        """
        ...
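
The pattern of an abstract logger with a SQLite backend and a usage summary can be sketched as below. This is a simplified illustration and not AG2's API: `EventLogger`, `SqliteEventLogger`, `log_llm_call`, `usage_summary`, the table schema, and the model name are all hypothetical.

```python
import sqlite3
import uuid
from abc import ABC, abstractmethod


class EventLogger(ABC):
    """Hypothetical logger interface; not AG2's actual BaseLogger."""

    @abstractmethod
    def start(self) -> str: ...

    @abstractmethod
    def log_llm_call(self, model: str, prompt_tokens: int, completion_tokens: int, cost: float) -> None: ...

    @abstractmethod
    def usage_summary(self) -> dict: ...


class SqliteEventLogger(EventLogger):
    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._session_id = ""

    def start(self) -> str:
        # Create the table on first use and open a new logging session.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS llm_calls ("
            "session_id TEXT, model TEXT, prompt_tokens INTEGER, "
            "completion_tokens INTEGER, cost REAL)"
        )
        self._session_id = str(uuid.uuid4())
        return self._session_id

    def log_llm_call(self, model: str, prompt_tokens: int, completion_tokens: int, cost: float) -> None:
        self._conn.execute(
            "INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?)",
            (self._session_id, model, prompt_tokens, completion_tokens, cost),
        )
        self._conn.commit()

    def usage_summary(self) -> dict:
        # Aggregate tokens and cost for the current session only.
        row = self._conn.execute(
            "SELECT COALESCE(SUM(prompt_tokens + completion_tokens), 0), "
            "COALESCE(SUM(cost), 0.0) FROM llm_calls WHERE session_id = ?",
            (self._session_id,),
        ).fetchone()
        return {"total_tokens": row[0], "total_cost": row[1]}


event_logger = SqliteEventLogger()
session_id = event_logger.start()
event_logger.log_llm_call("example-model", 100, 50, 0.003)
event_logger.log_llm_call("example-model", 200, 25, 0.005)
summary = event_logger.usage_summary()
```

Separating the interface from the backend is what lets a file-based or remote logger be swapped in without touching the agent code.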

Agency Swarm · python

Reuses the SDK's built-in tracing (OpenAI Traces, automatic) and connects to Langfuse / AgentOps via with trace(...) (examples/observability.py); automatically accumulates tokens/cost (sub-agent raw_responses are back-filled per model into the parent result, execution.py:252); agency.visualize() outputs a structure diagram.

examples/observability.py:92 · agent/execution.py:252
examples/observability.py:92 python
# ────────────────────────────────
# Example tracing wrappers
# ────────────────────────────────
async def openai_tracing(input_message: str) -> str:
    agency_instance = create_agency()
    with trace("OpenAI tracing"):
        response = await agency_instance.get_response(message=input_message)
    return response.final_output


async def langfuse_tracing(input_message: str) -> str:
    if os.getenv("LANGFUSE_SECRET_KEY") is None or os.getenv("LANGFUSE_PUBLIC_KEY") is None:
        raise ValueError("LANGFUSE api keys are not set")

Agent-LLM (AGiXT) · python

Writes all activity into the conversation log ([ACTIVITY]/[SUBACTIVITY] markers, including command execution success/failure); webhook events command.execution.started/failed (Extensions.py:1078); UsageTrackingMiddleware records token usage; evaluation-style chains (Smart Instruct) perform self-reflection. No standalone eval harness (to be confirmed).

Interactions.py:7542 · Extensions.py:1078
agixt/Interactions.py:7542 python

                        c.log_interaction(
                            role=self.agent_name,
                            message=f"[SUBACTIVITY][{thinking_id}][EXECUTION] `{command_name}` was executed successfully.\n{command_output}",
                        )

                        # Emit webhook event for successful command execution
                        await webhook_emitter.emit_event(
                            event_type="command.execution.completed",
                            data={
                                "conversation_id": c.get_conversation_id(),
                                "conversation_name": conversation_name,
                                "agent_name": self.agent_name,

AgentDock · typescript

Built-in Evaluation Framework: the runEvaluation runner plus multiple evaluators (RuleBased/LLMJudge/NLPAccuracy/ToolUsage/LexicalSimilarity/KeywordCoverage/Sentiment/Toxicity), with results stored in JsonFileStorage; structured, categorized logging via logger (LogCategory); token usage is accumulated through onFinish into the orchestration state (cumulativeTokenUsage).

llm/llm-orchestration-service.ts:421
agentdock-core/src/llm/llm-orchestration-service.ts:421 typescript
  /**
   * Performs the actual token usage update operation.
   */
  private async performTokenUsageUpdate(usage: TokenUsage): Promise<void> {
    try {
      // Get current state
      const currentState = await this.orchestrationManager.getState(
        this.sessionId
      );
      const currentUsage = currentState?.cumulativeTokenUsage || {
        promptTokens: 0,
        completionTokens: 0,
        totalTokens: 0
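
The idea of accumulating per-call usage into session state from a finish callback can be sketched in Python (the excerpt above is TypeScript; Python is used here only for consistency with the other sketches). `TokenUsage`, `SessionUsageStore`, and `on_finish` are hypothetical names and do not mirror AgentDock's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    def __add__(self, other: "TokenUsage") -> "TokenUsage":
        return TokenUsage(
            self.prompt_tokens + other.prompt_tokens,
            self.completion_tokens + other.completion_tokens,
        )


class SessionUsageStore:
    """Keeps cumulative usage per session; a missing session starts from zero."""

    def __init__(self) -> None:
        self._cumulative: dict[str, TokenUsage] = {}

    def on_finish(self, session_id: str, usage: TokenUsage) -> TokenUsage:
        # Called once per completed LLM call with that call's usage.
        updated = self._cumulative.get(session_id, TokenUsage()) + usage
        self._cumulative[session_id] = updated
        return updated


store = SessionUsageStore()
store.on_finish("session-1", TokenUsage(prompt_tokens=100, completion_tokens=20))
running = store.on_finish("session-1", TokenUsage(prompt_tokens=50, completion_tokens=30))
other = store.on_finish("session-2", TokenUsage(prompt_tokens=7, completion_tokens=3))
```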

AgentField · go

Automatic workflow DAG visualization (GET /api/v1/workflows/{id}/dag); Prometheus /metrics (discovery and other handlers are instrumented with promauto); structured JSON logs; execution timeline; /health + /ready (K8s); app.note() writes audit logs. Formal eval: N/A (relies on VC auditing rather than an eval framework).

control-plane/internal/handlers/discovery.go:18 · agent.py:4190
control-plane/internal/handlers/discovery.go:18 go
	"github.com/Agent-Field/agentfield/control-plane/internal/logger"
	"github.com/Agent-Field/agentfield/control-plane/pkg/types"
	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// AgentLister is the minimal dependency required for discovery.
type AgentLister interface {
	ListAgents(ctx context.Context, filters types.AgentFilters) ([]*types.AgentNode, error)
}

// DiscoveryFilters captures query parameters for capability discovery.

Agentic Context Engine (ACE) · python

A focus of this framework. EvaluateStep + TaskEnvironment produce feedback / correctness signals; ships benchmarks such as tau2-bench (benchmarks/); observability: ObservabilityStep, Logfire auto-instrumentation of PydanticAI (logfire extra), the kayba-tracing SDK (configure/trace/start_span), and per-skill helpful/harmful/used counters that serve as a utility metric.

ace/tracing/__init__.py:17
ace/tracing/__init__.py:17 python
    pip install ace-framework[tracing]
"""

from kayba_tracing import (
    configure,
    disable,
    enable,
    get_folder,
    get_trace,
    search_traces,
    set_folder,
    start_span,
    trace,

AgentScope · python

First-class OpenTelemetry: TracingMiddleware (middleware/_tracing/) opens spans for the agent, llm, and tool layers and depends on opentelemetry-sdk + the OTLP exporter (hard dependencies in pyproject); the event stream itself provides fine-grained observability; the app/ service side carries OTel. The README mentions "built-in evaluation", but no standalone eval package was found under src/agentscope in this repo (evaluation lives at the docs/examples level).

middleware/_tracing/_trace.py:116 · event/_event.py:14
src/agentscope/middleware/_tracing/_trace.py:116 python
# ---------------------------------------------------------------------------


class TracingMiddleware(MiddlewareBase):
    """Agent middleware that adds OpenTelemetry tracing to the reply,
    model-call and tool-execution lifecycles.

    When tracing has not been configured (``setup_tracing`` was not called),
    every hook short-circuits to ``next_handler`` with near-zero overhead.

    Example::

        from agentscope.middleware import TracingMiddleware

Agentset · typescript

The retrieval flow streams status back in real time (data-status: generating-queries/searching/generating-answer, agentic/index.ts:61) along with logs (logs); usage is recorded in Postgres (chat/route.ts:33); server-side event analytics (logServerEvent); Tinybird stores webhook events; the README lists evaluation/benchmarks as platform features.

agentic/index.ts:61 · apps/web/src/types/ai.ts:14 · chat/route.ts:33
apps/web/src/lib/agentic/index.ts:61 typescript
        type: "start-step",
      });

      writer.write({
        type: "data-status",
        data: { value: "generating-queries" },
      });

      // step 1. generate queries
      const { chunks, queryToResult, totalQueries } = await agenticSearch({
        model,
        messages,
        queryOptions,

AgentVerse · python

① Singleton Logger (Auto-GPT style: colored output + logs/activity.log/error.log + typewriter effect, logging.py:32); ② each agent tracks dollar spend via get_spend(), and the environment aggregates it with report_metrics() (environments/base.py:50); ③ the task-solving Evaluator scores the plan by rule (accepted once score ≥ 8, tasksolving_env/basic.py:95), and agentverse-benchmark runs batch evaluation on datasets.

logging.py:32 · environments/base.py:50 · environments/tasksolving_env/basic.py:95
agentverse/logging.py:32 python
        return record.msg


class Logger(metaclass=Singleton):
    """
    Logger that handle titles in different colors.
    Outputs logs in console, activity.log, and errors.log
    For console handler: simulates typing
    """

    def __init__(self):
        # create log directory if it doesn't exist
        this_files_dir_path = os.path.dirname(__file__)

Astron Agent · python

End-to-end OpenTelemetry: common/otlp, with span.start(...) + add_info_events at every step; structured NodeLog/NodeTraceLog/Usage (token counts) are written to the trace node by node; DeepWiki badge integrated. No built-in automated eval framework (evaluation criteria to be confirmed).

cot_runner.py:225 · engine/nodes/base.py:36
core/agent/engine/nodes/cot/cot_runner.py:225 python

            node_end_time = int(round(time.time() * 1000))
            data_llm_output = answers
            node_trace_log.trace.append(
                NodeLog(
                    id=node_id,
                    sid=node_sid,
                    node_id=node_node_id,
                    node_name=node_name,
                    node_type=node_type,
                    start_time=node_start_time,
                    end_time=node_end_time,
                    duration=node_end_time - node_start_time,

AutoGen · python

The runtime has built-in OpenTelemetry tracing (TraceHelper, injectable via tracer_provider, disabled with AUTOGEN_DISABLE_RUNTIME_TRACING); structured event stream (ToolCallRequestEvent/ThoughtEvent and others at each step) + EVENT_LOGGER_NAME/TRACE_LOGGER_NAME logs; evaluation tool AGBench (python/packages/agbench).

autogen-core/src/autogen_core/_single_threaded_agent_runtime.py:256
python/packages/autogen-core/src/autogen_core/_single_threaded_agent_runtime.py:256 python
        tracer_provider: TracerProvider | None = None,
        ignore_unhandled_exceptions: bool = True,
    ) -> None:
        self._tracer_helper = TraceHelper(tracer_provider, MessageRuntimeTracingConfig("SingleThreadedAgentRuntime"))
        self._message_queue: Queue[PublishMessageEnvelope | SendMessageEnvelope | ResponseMessageEnvelope] = Queue()
        # (namespace, type) -> List[AgentId]
        self._agent_factories: Dict[
            str, Callable[[], Agent | Awaitable[Agent]] | Callable[[AgentRuntime, AgentId], Agent | Awaitable[Agent]]
        ] = {}
        self._instantiated_agents: Dict[AgentId, Agent] = {}
        self._intervention_handlers = intervention_handlers
        self._background_tasks: Set[Task[Any]] = set()
        self._subscription_manager = SubscriptionManager()

Botpress · typescript

The non-blocking onTrace hook receives every trace (llm_call_started, tool calls, errors, output); packages/llmz/src/types.ts defines the Trace type; Cognitive has request/response interceptors for instrumentation; tests use Vitest + LLM retries + a snapshot serializer.

packages/llmz/src/llmz.ts:335
packages/llmz/src/llmz.ts:335 typescript
      }

      cleanups.push(
        iteration.traces.onPush((traces) => {
          for (const trace of traces) {
            onTrace?.({ trace, iteration: ctx.iterations.length })
          }
        })
      )

      try {
        await executeIteration({
          iteration,

ConnectOnion · python

Each step writes to current_session['trace']; the Logger has three outputs (Rich terminal + plain-text .co/logs/{name}.log + .co/evals/.yaml sessions), including tokens/cost; the eval plugin handles evaluation; @xray + auto_debug() provide interactive breakpoint debugging.

core/agent.py:167
connectonion/core/agent.py:167 python
        import uuid
        return str(uuid.uuid4())

    def _record_trace(self, entry: dict) -> None:
        """Record trace entry and stream to io if connected.

        This is the single place where trace entries are recorded.
        Ensures both local trace and remote streaming stay in sync.
        Also includes current session state so client can persist it
        (client-side is source of truth for session state).
        """
        if 'id' not in entry:
            entry['id'] = self._next_trace_id()

Cordum · go

A focus. ① Tamper-evident audit: an HMAC-SHA256-signed per-tenant hash chain (Redis Stream + CAS Lua), core/audit/chain.go:265, with chain verification in chain_verify.go; ② SIEM export (webhook/syslog/Datadog/CloudWatch/SOC2), core/audit/exporter.go:283; ③ DecisionLog records every policy decision, scheduler/decision_log_adapter.go; ④ OTel metrics/trace, core/infra/otel/; ⑤ Policy Simulator replays rules against historical data (kernel.go:623 Simulate) + shadow eval, safetykernel/shadow_eval.go.

core/audit/chain.go:265
core/audit/chain.go:265 go
// Seq and EventHash cleared. PrevHash is part of the hashed bytes so any
// tampering with a predecessor (direct mutation or reordering) invalidates
// every descendant hash — this is what gives the chain its tamper-evidence.
func (c *Chainer) Append(ctx context.Context, event *SIEMEvent) error {
	if event == nil {
		return ErrNilEvent
	}
	if event.TenantID == "" {
		return ErrTenantRequired
	}
	unlockTenant := c.lockTenant(event.TenantID)
	defer unlockTenant()
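
The tamper-evidence property described in the comment above (each hash covers the previous hash, so altering any entry invalidates every later one) can be demonstrated with a minimal in-memory Python sketch. It is not Cordum's implementation: there is no Redis, locking, or per-tenant storage here, and `append`, `verify`, the entry layout, and the key are assumptions for illustration.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


def _digest(key: bytes, prev_hash: str, seq: int, payload: dict) -> str:
    # The previous hash is part of the signed bytes, which links the entries.
    body = json.dumps({"prev": prev_hash, "seq": seq, "payload": payload}, sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def append(chain: list[dict], key: bytes, payload: dict) -> dict:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    seq = len(chain)
    entry = {"seq": seq, "prev": prev_hash, "payload": payload,
             "hash": _digest(key, prev_hash, seq, payload)}
    chain.append(entry)
    return entry


def verify(chain: list[dict], key: bytes) -> bool:
    """Recompute every hash in order; any mutation or reordering fails the check."""
    prev_hash = GENESIS
    for seq, entry in enumerate(chain):
        expected = _digest(key, prev_hash, seq, entry["payload"])
        if (entry["seq"] != seq or entry["prev"] != prev_hash
                or not hmac.compare_digest(entry["hash"], expected)):
            return False
        prev_hash = entry["hash"]
    return True


key = b"tenant-a-demo-key"  # hypothetical per-tenant secret
chain: list[dict] = []
for action in ("login", "policy_check", "tool_call"):
    append(chain, key, {"tenant": "tenant-a", "action": action})

intact = verify(chain, key)
chain[1]["payload"]["action"] = "deleted"  # simulate tampering with a stored event
tampered_detected = not verify(chain, key)
```

Using an HMAC rather than a plain hash means an attacker who can rewrite the log still cannot recompute valid hashes without the key.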

Cortex Memory · rust

Structured logging with tracing (logging.rs); REST /health + /health/ready health checks; stats plus UpdateStats/CacheStats (skip_rate/cache_hit_rate); Svelte dashboard (insights) for visualization; LoCoMo10 benchmark script in examples/locomo-evaluation.

cascade_layer_updater.rs:44 · cortex-mem-service/src/main.rs:135
cortex-mem-core/src/cascade_layer_updater.rs:44 rust
        self.updated_count + self.skipped_count
    }
    
    pub fn skip_rate(&self) -> f64 {
        if self.total_operations() == 0 {
            0.0
        } else {
            self.skipped_count as f64 / self.total_operations() as f64
        }
    }
    
    pub fn cache_hit_rate(&self) -> f64 {
        let total = self.cache_hits + self.cache_misses;

CrewAI · python

Built-in event bus crewai_event_bus (full-lifecycle events for LLM/Tool/Agent/Memory) + anonymous OpenTelemetry telemetry (can be turned off with OTEL_SDK_DISABLED); Task guardrail / task_evaluator evaluate outputs.

crewai/telemetry/telemetry.py:1 · crewai/tasks/llm_guardrail.py:49
lib/crewai/src/crewai/telemetry/telemetry.py:1 python
"""Telemetry module for CrewAI.

This module provides anonymous telemetry collection for development purposes.
No prompts, task descriptions, agent backstories/goals, responses, or sensitive
data is collected. Users can opt-in to share more complete data using the
`share_crew` attribute.
"""

from __future__ import annotations

Dust · typescript

Multi-layered: Langfuse LLM traces (@langfuse/tracing + front/lib/api/llm/traces/), OpenTelemetry (Temporal workflow interceptors + core/src/open_telemetry.rs), product-level observability metrics (usage and latency for tools/skills/datasources, including Elasticsearch analytics), and user feedback.

front/temporal/agent_loop/workflows.ts:61
front/temporal/agent_loop/workflows.ts:61 typescript
} from "@temporalio/interceptors-opentelemetry/lib/workflow";

// Export an interceptors variable to add OpenTelemetry interceptors to the workflow.
export const interceptors: WorkflowInterceptorsFactory = () => ({
  inbound: [new OpenTelemetryInboundInterceptor()],
  outbound: [new OpenTelemetryOutboundInterceptor()],
  internals: [new OpenTelemetryInternalsInterceptor()],
});

const { runModelAndCreateActionsActivity } = proxyActivities<
  typeof runModelAndCreateWrapperActivities
>({
  startToCloseTimeout: "10 minutes",

E2B · typescript

Sandbox-level telemetry rather than agent evaluation: getMetrics() returns CPU/memory/disk; control-plane /sandboxes/{id}/logs and /metrics endpoints; RPC can attach createRpcLogger to record traffic.

js-sdk/src/sandbox/index.ts:736
packages/js-sdk/src/sandbox/index.ts:736 typescript
   *
   * @returns  List of sandbox metrics containing CPU, memory and disk usage information.
   */
  async getMetrics(opts?: SandboxMetricsOpts) {
    if (this.envdApi.version) {
      if (compareVersions(this.envdApi.version, '0.1.5') < 0) {
        throw new SandboxError(
          'You need to update the template to use the new SDK. ' +
            'You can do this by running `e2b template build` in the directory with the template.'
        )
      }

      if (compareVersions(this.envdApi.version, '0.2.4') < 0) {

Haystack · python

Tracing: Tracer/Span abstractions with automatic hookup to OpenTelemetry/Datadog via auto_enable_tracing() (called at startup in __init__.py), plus LoggingTracer; content-level tracing is toggled by an env variable; Eval: components/evaluators/ (faithfulness/context_relevance/SAS/MRR/NDCG/recall/LLMEvaluator…) + EvaluationRunResult for reports.

tracing/tracer.py:82 · tracing/logging_tracer.py:34 · evaluation/eval_run_result.py:18
haystack/tracing/tracer.py:82 python
        return {}


class Tracer(abc.ABC):
    """Interface for instrumenting code by creating and submitting spans."""

    @abc.abstractmethod
    @contextlib.contextmanager
    def trace(
        self, operation_name: str, tags: dict[str, Any] | None = None, parent_span: Span | None = None
    ) -> Iterator[Span]:
        """
        Trace the execution of a block of code.

hcom · rust

The hcom TUI (ratatui) dashboard shows all agents; hcom list lists active agents; hcom term [name] views or injects into an agent's live PTY screen (via the TCP inject port + vt100 parsing, commands/term.rs:1, :35); hcom transcript reads the other agent's structured transcript; hcom events --wait blocks until a match (scriptable); hcom status runs diagnostics.

commands/term.rs:35
src/commands/term.rs:35 rust

/// Look up inject port for an instance.
///
/// The inject port is a bidirectional RPC server (input bytes / `\x00SCREEN\n`
/// query) — it shares the `notify_endpoints` table with wake endpoints but
/// uses a different protocol. See `crate::notify::WakeKind` for the wake kinds.
fn get_inject_port(db: &HcomDb, instance_name: &str) -> Option<i32> {
    db.conn()
        .query_row(
            "SELECT port FROM notify_endpoints WHERE instance = ?1 AND kind = 'inject'",
            rusqlite::params![instance_name],
            |row| row.get(0),
        )
Hermes Agent (Python): The session_search tool does cross-session recall over a SQLite FTS5 full-text index (three modes: discovery, scroll, browse; zero LLM cost); hermes logs --session <id> filters logs by session (set_session_context); /usage and /insights show tokens and cost; batch_runner.py plus trajectory_compressor.py produce training trajectories.

tools/session_search_tool.py:1, hermes_state.py:321
tools/session_search_tool.py:1 python
#!/usr/bin/env python3
"""
Session Search Tool - Long-Term Conversation Recall

Single-shape tool with three calling modes (inferred from args, no explicit
mode parameter):

  1. DISCOVERY — pass ``query``. Runs FTS5, dedupes hits by session lineage,
     returns top N sessions each with: snippet, ±5 message window around the
     match, plus bookend_start (first 3 user+assistant msgs of session) and
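The DISCOVERY mode described in the docstring above (run FTS5, dedupe hits by session) can be approximated with Python's built-in sqlite3 module. This is a minimal sketch, assuming the interpreter's SQLite build includes FTS5. The table name msgs, its columns, and the functions build_index and discover are invented for illustration and do not reflect Hermes Agent's actual schema; the sketch also dedupes by plain session id rather than by session lineage.

```python
import sqlite3


def build_index(messages):
    """Create an in-memory FTS5 index from (session_id, role, content) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE msgs USING fts5("
        "session_id UNINDEXED, role UNINDEXED, content)"
    )
    conn.executemany("INSERT INTO msgs VALUES (?, ?, ?)", messages)
    return conn


def discover(conn, query, limit=5):
    """Return up to `limit` (session_id, snippet) pairs, one per session."""
    rows = conn.execute(
        # snippet() highlights the match in column 2 (content);
        # ORDER BY rank puts the best FTS5 matches first.
        "SELECT session_id, snippet(msgs, 2, '[', ']', '...', 8) "
        "FROM msgs WHERE msgs MATCH ? ORDER BY rank",
        (query,),
    )
    seen, hits = set(), []
    for session_id, snip in rows:
        if session_id in seen:
            continue  # keep only the best hit per session
        seen.add(session_id)
        hits.append((session_id, snip))
        if len(hits) == limit:
            break
    return hits
```

Because the lookup is a plain SQL query, recall of this kind needs no model call, which matches the "zero LLM cost" note in the entry.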
Hive (Python): DecisionTracker records every decision (what was tried, what was chosen, the outcome) as raw material for evolution; runtime_logger/runtime_log_store provide structured logs; EventBus streams events to the dashboard; judge evaluates node output against success_criteria; HoneyComb is an external observation console.

tracker/decision_tracker.py:24
core/framework/tracker/decision_tracker.py:24 python
logger = logging.getLogger(__name__)


class DecisionTracker:
    """
    The runtime environment that agents execute within.

    Usage:
        runtime = Runtime("/path/to/storage")

        # Start a run
        run_id = runtime.start_run("goal_123", "Qualify sales leads")
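The Hive entry describes recording, for each decision, what was tried, what was chosen, and how it turned out. The sketch below shows one way to hold that data and derive a simple success rate from it. DecisionLog, Decision, and the "success" outcome label are hypothetical and are not taken from Hive's DecisionTracker API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Decision:
    intent: str                    # what the agent was trying to do
    options: List[str]             # what it considered
    chosen: str                    # what it picked
    outcome: Optional[str] = None  # filled in once the result is known


class DecisionLog:
    """Hypothetical recorder for illustration; not Hive's DecisionTracker."""

    def __init__(self) -> None:
        self.decisions: List[Decision] = []

    def decide(self, intent: str, options: List[str], chosen: str) -> int:
        if chosen not in options:
            raise ValueError(f"{chosen!r} is not one of the options")
        self.decisions.append(Decision(intent, list(options), chosen))
        return len(self.decisions) - 1  # the index doubles as the decision id

    def record_outcome(self, decision_id: int, outcome: str) -> None:
        self.decisions[decision_id].outcome = outcome

    def success_rate(self) -> float:
        """Share of finished decisions whose outcome is 'success'."""
        finished = [d for d in self.decisions if d.outcome is not None]
        if not finished:
            return 0.0
        return sum(d.outcome == "success" for d in finished) / len(finished)
```

Keeping the rejected options next to the chosen one is what makes such a log usable later, for example to compare how often each alternative succeeds.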
Lagent (Python): The MessageLogger hook prints each AgentMessage to the log, colored by sender (optional file handler); get_steps() unrolls the tool loop into a thought/tool/environment trajectory. No built-in token/cost accounting or evaluation framework.

hooks/logger.py:9, agents/stream.py:114
lagent/hooks/logger.py:9 python
from .hook import Hook


class MessageLogger(Hook):
    def __init__(self, name: str = 'lagent', add_file_handler: bool = False):
        self.logger = get_logger(
            name, 'info', '%(asctime)s %(levelname)8s %(name)8s - %(message)s', add_file_handler=add_file_handler
        )
        self.sender2color = {}

    def before_agent(self, agent, messages, session_id):
        for message in messages:
            self._process_message(message, session_id)
LangChain (Python): core has a built-in callbacks + tracers system (core/.../tracers/); each middleware hook is wrapped in a LangSmith span with @traceable (factory.py:910, 1019), and _scrub_inputs removes runtime and handler objects from trace inputs before they are sent (factory.py:140); evaluation and monitoring are delegated to the external LangSmith platform (README).

factory.py:140, factory.py:1019
libs/langchain_v1/langchain/agents/factory.py:140 python
""".strip()


def _scrub_inputs(inputs: dict[str, Any]) -> dict[str, Any]:
    """Remove ``runtime`` and ``handler`` from trace inputs before sending to LangSmith."""
    filtered = inputs.copy()
    filtered.pop("handler", None)
    req = filtered.get("request")
    if isinstance(req, (ModelRequest, ToolCallRequest)):
        filtered["request"] = {
            f.name: getattr(req, f.name) for f in fields(req) if f.name != "runtime"
        }
    return filtered
Llama Agentic System (llama-stack-apps) (Python): Observability: AgentEventLogger/EventLogger stream-print each step (shield_call/inference/tool_execution), and turn.steps can be iterated by step_type. Evaluation: the llama-stack-client eval run_scoring CLI plus agent_store/eval/bulk_generate.py, which runs a dataset in bulk to generate answers and then scores them.

examples/agents/react_agent.py:73, examples/agent_store/api.py:250, examples/agent_store/eval/bulk_generate.py:25
examples/agents/react_agent.py:73 python
        session_id=session_id,
        stream=True,
    )
    for log in EventLogger().log(response):
        log.print()

    user_prompt2 = "What are the popular llms supported in torchtune?"
    print(colored(f"User> {user_prompt2}", "blue"))
    response2 = agent.create_turn(
        messages=[{"role": "user", "content": user_prompt2}],
        session_id=session_id,
        stream=True,
    )
LlamaIndex (Python): A standalone llama-index-instrumentation package: Dispatcher emits spans and events, @dispatcher.span decorates functions, and add_event_handler/add_span_handler attach hooks (for Arize, Langfuse, and others); at each step the agent calls write_event_to_stream to expose events such as AgentStream and ToolCall; core/evaluation/ provides RAG evaluators such as faithfulness and relevancy.

instrumentation/__init__.py:1, llama-index-instrumentation/src/llama_index_instrumentation/dispatcher.py:50
llama-index-core/llama_index/core/instrumentation/__init__.py:1 python
from llama_index_instrumentation import (
    DispatcherSpanMixin,  # noqa
    get_dispatcher,  # noqa
    root_dispatcher,  # noqa
    root_manager,  # noqa
)
from llama_index_instrumentation.dispatcher import (
    DISPATCHER_SPAN_DECORATED_ATTR,  # noqa
    Dispatcher,  # noqa
    Manager,  # noqa
llm-agents (Python): Relies on print() only: it prints the rendered prompt at the start and the generated text plus Observation on each loop (agent.py:66, 77); no structured trace, no token/cost accounting, no eval framework. The tests/ directory contains only a setup check and empty unit/integration packages.

agent.py:66
llm_agents/agent.py:66 python
                question=question,
                previous_responses='{previous_responses}'
        )
        print(prompt.format(previous_responses=''))
        while num_loops < self.max_loops:
            num_loops += 1
            curr_prompt = prompt.format(previous_responses='\n'.join(previous_responses))
            generated, tool, tool_input = self.decide_next_action(curr_prompt)
            if tool == 'Final Answer':
                return tool_input
            if tool not in self.tool_by_names:
                raise ValueError(f"Unknown tool: {tool}")
            tool_result = self.tool_by_names[tool].use(tool_input)
LoongFlow (Python): (1) Structured logging via get_logger throughout, plus Rich-formatted message printing (message_logger.py), with a trace_id on every step; (2) per-cycle prompt/completion token and cost accounting (pes_agent.py:294); (3) the Evaluator is a first-class citizen: it writes candidate code to a file and runs the user's evaluate() in a separate subprocess with a timeout to obtain score/metrics/summary; (4) math_agent ships a visualizer for the evolution tree and island distribution.

framework/pes/evaluator/evaluator.py:126
src/loongflow/framework/pes/evaluator/evaluator.py:126 python
        pass


class LoongFlowEvaluator(Evaluator):
    """
    LoongFlow Evaluator
    """

    def __init__(self, config: EvaluatorConfig):
        self.config = config
        self._logger = get_logger(self.__class__.__name__)
        self._thread_executor = concurrent.futures.ThreadPoolExecutor()
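The LoongFlow entry describes an evaluator that writes candidate code to a file and runs it in a separate process with a timeout. The sketch below illustrates that isolation pattern with the standard library only. It simplifies the protocol: the candidate script is assumed to print a JSON object with score and summary to stdout, whereas LoongFlow calls a user-supplied evaluate(). The function name evaluate_candidate and the fallback result shapes are assumptions.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def evaluate_candidate(candidate_code: str, timeout_s: float = 5.0) -> dict:
    """Run candidate code in a child interpreter and parse its JSON stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(candidate_code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(path)],
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"score": 0.0, "summary": "timeout"}
    if proc.returncode != 0:
        return {"score": 0.0, "summary": "crashed"}
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {"score": 0.0, "summary": "invalid output"}
```

Running the candidate in its own process means an infinite loop or a crash in generated code costs one failed evaluation rather than the whole run.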
Maestro (Python): Uses rich Console/Panel to print each step in color; prints input/output tokens per call along with the USD cost estimated by calculate_subagent_cost(); writes the full exchange log to a timestamped .md file. No evaluation framework.

maestro.py:23, maestro.py:66, maestro.py:289
maestro.py:23 python
SUB_AGENT_MODEL = "claude-3-5-sonnet-20240620"
REFINER_MODEL = "claude-3-5-sonnet-20240620"

def calculate_subagent_cost(model, input_tokens, output_tokens):
    # Pricing information per model
    pricing = {
        "claude-3-opus-20240229": {"input_cost_per_mtok": 15.00, "output_cost_per_mtok": 75.00},
        "claude-3-haiku-20240307": {"input_cost_per_mtok": 0.25, "output_cost_per_mtok": 1.25},
        "claude-3-sonnet-20240229": {"input_cost_per_mtok": 3.00, "output_cost_per_mtok": 15.00},
        "claude-3-5-sonnet-20240620": {"input_cost_per_mtok": 3.00, "output_cost_per_mtok": 15.00},
    }

    # Calculate cost
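The Maestro excerpt stops before the arithmetic. Assuming the per-million-token prices in its table are applied linearly, the estimate is input_tokens / 1,000,000 × input price plus output_tokens / 1,000,000 × output price. The sketch below works that out with the prices copied from the excerpt; the function name estimate_cost and the choice to raise KeyError for unknown models are assumptions, not Maestro's actual code.

```python
# USD per million tokens, copied from the pricing table in the excerpt above.
PRICING = {
    "claude-3-opus-20240229": {"input_cost_per_mtok": 15.00, "output_cost_per_mtok": 75.00},
    "claude-3-haiku-20240307": {"input_cost_per_mtok": 0.25, "output_cost_per_mtok": 1.25},
    "claude-3-sonnet-20240229": {"input_cost_per_mtok": 3.00, "output_cost_per_mtok": 15.00},
    "claude-3-5-sonnet-20240620": {"input_cost_per_mtok": 3.00, "output_cost_per_mtok": 15.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call; raises KeyError for unknown models."""
    prices = PRICING[model]
    input_cost = input_tokens / 1_000_000 * prices["input_cost_per_mtok"]
    output_cost = output_tokens / 1_000_000 * prices["output_cost_per_mtok"]
    return input_cost + output_cost
```

For example, a claude-3-haiku-20240307 call with 2,000 input tokens and 1,000 output tokens comes to 0.0005 + 0.00125 = 0.00175 USD under these prices.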
Mastra (TypeScript): AI tracing: the SpanType enum (AGENT_RUN, WORKFLOW_RUN, MODEL_GENERATION, TOOL_CALL, MEMORY_OPERATION, RAG_, and others) forms a structured span tree, exported through the Observability entry point (@mastra/observability, with storage, platform, and OTel exporters); evals/scorers: @mastra/evals plus evals/scoreTraces score traces; logger/ provides leveled logging.

observability/types/tracing.ts:35, mastra/index.ts:295
packages/core/src/observability/types/tracing.ts:35 typescript
/**
 * AI-specific span types with their associated metadata
 */
export enum SpanType {
  /** Agent run - root span for agent processes */
  AGENT_RUN = 'agent_run',
  /** Scorer execution */
  SCORER_RUN = 'scorer_run',
  /** Individual scorer pipeline step */
  SCORER_STEP = 'scorer_step',
  /** Generic span for custom operations */
  GENERIC = 'generic',
  /** Model generation with model calls, token usage, prompts, completions */
MetaGPT (Python): CostManager accumulates tokens and cost after every LLM call (_update_costs); Team.invest sets a budget, and exceeding it raises NoMoneyException; loguru provides global logging (metagpt/logs.py); exp_pool (the experience pool) uses the @exp_cache decorator to cache and score (SimpleScorer/LLM judge) past experiences for reuse.

metagpt/provider/base_llm.py:124, metagpt/team.py:98, metagpt/exp_pool/decorator.py:29
metagpt/provider/base_llm.py:124 python
    def _default_system_msg(self):
        return self._system_msg(self.system_prompt)

    def _update_costs(self, usage: Union[dict, BaseModel], model: str = None, local_calc_usage: bool = True):
        """update each request's token cost
        Args:
            model (str): model name or in some scenarios called endpoint
            local_calc_usage (bool): some models don't calculate usage, it will overwrite LLMConfig.calc_usage
        """
        calc_usage = self.config.calc_usage and local_calc_usage
        model = model or self.pricing_plan
        model = model or self.model
        usage = usage.model_dump() if isinstance(usage, BaseModel) else usage
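The MetaGPT entry combines two ideas: accumulate cost after every LLM call, and stop once a budget is exceeded. The sketch below shows that pattern in isolation. CostLedger and BudgetExceeded are hypothetical stand-ins for CostManager and NoMoneyException, and the per-1k-token price parameters are assumptions made for the example.

```python
class BudgetExceeded(Exception):
    """Raised once accumulated cost passes the budget (stand-in name)."""


class CostLedger:
    """Hypothetical ledger illustrating accumulate-then-check budgeting."""

    def __init__(self, max_budget_usd: float) -> None:
        self.max_budget_usd = max_budget_usd
        self.total_prompt_tokens = 0
        self.total_completion_tokens = 0
        self.total_cost_usd = 0.0

    def update(
        self,
        prompt_tokens: int,
        completion_tokens: int,
        usd_per_1k_prompt: float,
        usd_per_1k_completion: float,
    ) -> float:
        """Add one call's usage to the running totals and return its cost."""
        cost = (
            prompt_tokens * usd_per_1k_prompt
            + completion_tokens * usd_per_1k_completion
        ) / 1000
        self.total_prompt_tokens += prompt_tokens
        self.total_completion_tokens += completion_tokens
        self.total_cost_usd += cost
        return cost

    def check_budget(self) -> None:
        """Raise if the accumulated cost is above the configured budget."""
        if self.total_cost_usd > self.max_budget_usd:
            raise BudgetExceeded(
                f"spent {self.total_cost_usd:.4f} of {self.max_budget_usd:.4f} USD"
            )
```

Checking the budget between calls, rather than inside update(), lets the call that crossed the limit still be recorded before the run is halted.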
Modus (Go): The console package does structured logging (debug/info/warn/error, reported through a host function); agents publish events with PublishEvent, which flow to a GoAkt topic actor and then to a GraphQL Subscription pushed over SSE (text/event-stream); Sentry spans are integrated for distributed tracing. No built-in eval framework.

sdk/go/pkg/console/console.go:24, runtime/actors/agents.go:280, runtime/graphql/graphql.go:164
sdk/go/pkg/console/console.go:24 go
	Log(fmt.Sprintf(format, args...))
}

func Debug(message string) {
	hostLogMessage("debug", message)
}

func Debugf(format string, args ...any) {
	Debug(fmt.Sprintf(format, args...))
}

func Info(message string) {
	hostLogMessage("info", message)
nanobot (Python): Structured loguru logging throughout (including the turn state-machine trace StateTraceEntry, tool events, and token usage); the runtime event bus RuntimeEventBus pushes to the WebUI (model, status, latency); optional Langfuse tracing (setting LANGFUSE_SECRET_KEY wraps the OpenAI client automatically) and LangSmith; no built-in evaluation framework (only a pytest test suite).

agent/loop.py:88, providers/openai_compat_provider.py:403
nanobot/agent/loop.py:88 python


@dataclass
class StateTraceEntry:
    state: TurnState
    started_at: float
    duration_ms: float
    event: str
    error: str | None = None


@dataclass
class TurnContext:
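The StateTraceEntry fields in the nanobot excerpt (state, started_at, duration_ms, event, error) suggest one record per state the turn passes through. The sketch below shows one way such entries could be filled in with a context manager. TurnTrace and its step() method are hypothetical, and state is a plain string here rather than the TurnState type used in nanobot.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TraceEntry:
    state: str
    started_at: float
    duration_ms: float
    event: str
    error: Optional[str] = None


class TurnTrace:
    """Hypothetical recorder; nanobot's loop may populate entries differently."""

    def __init__(self) -> None:
        self.entries: List[TraceEntry] = []

    @contextmanager
    def step(self, state: str, event: str):
        started_at = time.time()   # wall-clock timestamp for the log
        t0 = time.perf_counter()   # monotonic clock for the duration
        error = None
        try:
            yield
        except Exception as exc:
            error = repr(exc)
            raise  # record the failure but do not swallow it
        finally:
            duration_ms = (time.perf_counter() - t0) * 1000.0
            self.entries.append(
                TraceEntry(state, started_at, duration_ms, event, error)
            )
```

Each phase of a turn would then be wrapped as `with trace.step("calling_model", "request_sent"):`, so a failed phase still leaves an entry with its error and duration.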
Open Multi-Agent (TypeScript): onProgress structured events (task_start/complete/retry/skipped/budget_exceeded…) plus onTrace spans (llm_call/tool_call/task/agent/plan_ready/agent_stream), plus a post-run renderTeamRunDashboard() that generates a pure-HTML task DAG dashboard; secrets and tokens are redacted automatically by redaction.ts. No built-in eval framework.

src/orchestrator/orchestrator.ts:635, src/dashboard/render-team-run-dashboard.ts:17
src/orchestrator/orchestrator.ts:635 typescript
        return
      }

      config.onProgress?.({
        type: 'task_start',
        task: task.id,
        agent: assignee,
        data: task,
      } satisfies OrchestratorEvent)

      config.onProgress?.({
        type: 'agent_start',
        agent: assignee,
OpenClaw (TypeScript): The agent loop emits a structured event stream (agent_start/turn_start/message_/tool_execution_/turn_end/agent_end) for the UI and logs to consume; every message carries usage (tokens + cost); chat commands /usage, /trace on, and /verbose; a cron run-log (JSONL) records each scheduled run; the trajectory/transcripts subsystems retain trajectories; qa/ contains e2e tests and a QA lab extension.

agent-loop.ts:25
packages/agent-core/src/agent-loop.ts:25 typescript
import { validateToolArguments } from "./validation.js";

/** Callback used by synchronous loop runners to publish agent lifecycle events. */
export type AgentEventSink = (event: AgentEvent) => Promise<void> | void;

const EMPTY_USAGE = {
  input: 0,
  output: 0,
  cacheRead: 0,
  cacheWrite: 0,
  totalTokens: 0,
  cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 },
};
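Since every OpenClaw message carries a usage record shaped like EMPTY_USAGE above, a session total is a field-wise sum over messages. The sketch below does that in Python, mirroring the field names from the excerpt. The functions add_usage and sum_usage are hypothetical and are not OpenClaw's implementation.

```python
import copy

# Same shape as the EMPTY_USAGE constant in the excerpt above.
EMPTY_USAGE = {
    "input": 0,
    "output": 0,
    "cacheRead": 0,
    "cacheWrite": 0,
    "totalTokens": 0,
    "cost": {"input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0, "total": 0.0},
}


def add_usage(total: dict, usage: dict) -> dict:
    """Return a new usage dict that is the field-wise sum of the two inputs."""
    result = copy.deepcopy(total)  # never mutate the caller's totals
    for key, value in usage.items():
        if key == "cost":
            for cost_key, cost_value in value.items():
                result["cost"][cost_key] += cost_value
        else:
            result[key] += value
    return result


def sum_usage(usages) -> dict:
    """Fold a sequence of per-message usage dicts into one session total."""
    total = copy.deepcopy(EMPTY_USAGE)
    for usage in usages:
        total = add_usage(total, usage)
    return total
```

Starting from a deep copy of the zero record keeps the shared constant untouched, which matters because the nested cost dict would otherwise be mutated in place.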
Pilot Protocol (Go): Structured JSON logs go through slog; pilotctl info/--json exposes a snapshot of address, peers, connections, uptime, and so on; the public Polo dashboard shows network-wide node and request statistics; 1048 tests (including many congestion-control, SACK, and replay regression cases, zz__bug_test.go).

README.md:111, pkg/daemon/services.go:10
README.md:111 markdown
<summary><strong>Example JSON output</strong></summary>

```json
$ pilotctl --json info
{"status":"ok","data":{"address":"0:0000.0000.0005","node_id":5,"hostname":"my-agent","peers":3,"connections":1,"uptime_secs":3600}}

$ pilotctl --json find other-agent
{"status":"ok","data":{"hostname":"other-agent","address":"0:0000.0000.0003"}}

$ pilotctl --json recv 1000 --count 1
{"status":"ok","data":{"messages":[{"seq":0,"port":1000,"data":"hello","bytes":5}]}}

$ pilotctl --json find nonexistent
Pipecat (Python): BaseObserver listens to the frame stream out of band (on_process_frame/on_push_frame) without modifying the pipeline; built-in turn, latency, and startup observers; PipelineParams(enable_metrics=, enable_usage_metrics=) collects tokens and latency; OpenTelemetry tracing via TurnTraceObserver plus utils/tracing/ (the tracing extra); Sentry integration.

pipeline/worker.py:135, utils/tracing/turn_trace_observer.py:36
src/pipecat/pipeline/worker.py:135 python
            self._idle_event.set()


class PipelineParams(BaseModel):
    """Configuration parameters for pipeline execution.

    These parameters are usually passed to all frame processors through
    StartFrame. For other generic pipeline worker parameters use PipelineWorker
    constructor arguments instead.

    Parameters:
        audio_in_sample_rate: Input audio sample rate in Hz.
        audio_out_sample_rate: Output audio sample rate in Hz.
PraisonAI (Python): MinimalTelemetry (anonymous usage via PostHog, privacy-first) plus OpenTelemetry integration (traces/spans/metrics, as stated in the README) plus Langfuse tracing (praisonai langfuse); token/cost collection (telemetry/token_collector.py); eval/ covers accuracy, performance, reliability, and criteria evaluation.

telemetry/telemetry.py:78
src/praisonai-agents/praisonaiagents/telemetry/telemetry.py:78 python
    _TELEMETRY_DISABLED_CACHE = not explicitly_enabled
    return _TELEMETRY_DISABLED_CACHE

class MinimalTelemetry:
    """
    Minimal telemetry collector for anonymous usage tracking.
    
    Privacy guarantees:
    - No personal data is collected
    - No prompts, responses, or user content is tracked
    - Only anonymous metrics about feature usage
    - Respects DO_NOT_TRACK standard
    - Can be disabled via environment variables
Semantic Kernel (C#): Built-in OpenTelemetry: KernelFunction carries its own ActivitySource("Microsoft.SemanticKernel") plus a Meter (invocation/streaming duration histograms); agent invocations go through ModelDiagnostics.StartAgentInvocationActivity; filters plus structured logging (LoggerMessage). No built-in evaluation framework; it relies on external tools.

dotnet/src/SemanticKernel.Abstractions/Functions/KernelFunction.cs:41, dotnet/src/Agents/Core/ChatCompletionAgent.cs:352
dotnet/src/SemanticKernel.Abstractions/Functions/KernelFunction.cs:41 csharp
    private protected const string MeasurementErrorTagName = "error.type";

    /// <summary><see cref="ActivitySource"/> for function-related activities.</summary>
    private static readonly ActivitySource s_activitySource = new("Microsoft.SemanticKernel");

    /// <summary><see cref="Meter"/> for function-related metrics.</summary>
    private protected static readonly Meter s_meter = new("Microsoft.SemanticKernel");

    /// <summary>The <see cref="JsonSerializerOptions"/> to use for serialization and deserialization of various aspects of the function.</summary>
    private readonly JsonSerializerOptions? _jsonSerializerOptions;

    /// <summary>The underlying method, if this function was created from a method.</summary>
#pragma warning disable CA1051
smolagents (Python): Monitor accumulates tokens and step durations through an ActionStep callback; AgentLogger (Rich) provides leveled logging; memory.replay() replays a run; return_full_result returns a RunResult (token_usage/steps/timing/state); the telemetry extra connects to OpenTelemetry/Arize Phoenix.

monitoring.py:81, monitoring.py:100, agents.py:196, memory.py:248
src/smolagents/monitoring.py:81 python
        return f"Timing(start_time={self.start_time}, end_time={self.end_time}, duration={self.duration})"


class Monitor:
    def __init__(self, tracked_model, logger):
        self.step_durations = []
        self.tracked_model = tracked_model
        self.logger = logger
        self.total_input_token_count = 0
        self.total_output_token_count = 0

    def get_total_token_counts(self) -> TokenUsage:
        return TokenUsage(
Strands Agents (Python): OpenTelemetry is a first-class citizen: Tracer starts spans for agent/cycle/model/tool (telemetry/tracer.py:77), EventLoopMetrics records tokens, latency, and cycles, and StrandsTelemetry wires everything up in one call; callback_handler provides streaming callbacks (PrintingCallbackHandler by default); evaluation goes through the OTEL export.

telemetry/tracer.py:77
strands-py/src/strands/telemetry/tracer.py:77 python
                return "<replaced>"


class Tracer:
    """Handles OpenTelemetry tracing.

    This class provides a simple interface for creating and managing traces,
    with support for sending to OTLP endpoints.

    When the OTEL_EXPORTER_OTLP_ENDPOINT environment variable is set, traces
    are sent to the OTLP endpoint.

    Both attributes are controlled by including "gen_ai_latest_experimental", "gen_ai_tool_definitions",
Swarm (Python): debug_print only.
SwarmClaw (TypeScript): OpenTelemetry OTLP traces (@opentelemetry/sdk-node, endpoint and headers configured through env); in-house logger/execution-log/activity-log/run-ledger; usage/cost metering; eval/ does baseline plus environment-plan evaluation; the autonomy supervisor reflects on every autonomous run.

observability/otel-config.ts:1
src/lib/server/observability/otel-config.ts:1 typescript
export interface OTelConfig {
  enabled: true
  serviceName: string
  tracesEndpoint: string
  headers: Record<string, string>
}

function parseBooleanFlag(value: string | undefined): boolean {
  if (typeof value !== 'string') return false
  const normalized = value.trim().toLowerCase()
Swarms (Python): loguru logging (utils/loguru_logger.py); telemetry reports agent data to swarms.world by default (toggled by SWARMS_TELEMETRY_ON, telemetry/main.py:150); the evaluation-style topologies council_as_judge/debate_with_judge/majority_voting act as LLM-as-judge.

telemetry/main.py:96, telemetry/bootup.py:8
swarms/telemetry/main.py:96 python
    return system_data


def _log_agent_data(data_dict: dict):
    """
    Logs agent data and system information to the swarms.world telemetry endpoint via a POST request.

    This function is a low-level, internal utility that sends the provided agent state along with current
    system telemetry to the Swarms service for analytics and diagnostics. Data includes a timestamp,
    comprehensive system information, and the state of the agent as passed in `data_dict`.

    Args:
        data_dict (dict): Dictionary representing the current agent's state/config/data.
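The Swarms entry lists majority_voting among the topologies that act as LLM-as-judge. The core of that idea is small enough to sketch: collect one verdict per judge and take the most common one. The function below is a generic illustration with judges modeled as plain callables. It is not Swarms' implementation, and the agreement ratio it returns is an added convenience.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def majority_vote(
    answer: str, judges: Iterable[Callable[[str], str]]
) -> Tuple[str, float]:
    """Return the most common verdict and the share of judges that gave it.

    Ties are resolved in favor of the verdict seen first, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    verdicts = [judge(answer) for judge in judges]
    if not verdicts:
        raise ValueError("at least one judge is required")
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner, count / len(verdicts)
```

In practice each judge would be a model call with its own prompt or model. A low agreement ratio is a useful signal that the verdict itself needs review, which ties back to the point that LLM-as-judge needs calibration.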
Transformers Agents (Python): Step logs and verbose output; no built-in eval.
Upsonic (Python): The eval/ subpackage has three kinds of evaluators, AccuracyEvaluator, performance, and reliability (each with .run()); observability connects through integrations/ to Langfuse, OpenTelemetry (the otel extra), and PromptLayer; core dependencies include sentry-sdk[opentelemetry]; the pipeline emits an event at every step.

src/upsonic/eval/accuracy.py:26
src/upsonic/eval/accuracy.py:26 python
    from upsonic.integrations.langfuse import Langfuse


class AccuracyEvaluator:
    """
    The main orchestrator for running accuracy evaluations on Upsonic agents,
    graphs, or teams using the LLM-as-a-judge pattern.
    """

    def __init__(
        self,
        judge_agent: Agent,
        agent_under_test: Union[Agent, Graph, Team],
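The AccuracyEvaluator docstring above names the LLM-as-a-judge pattern: one agent produces an answer and a second agent grades it against an expected output. The sketch below shows the control flow with both agents replaced by plain callables so it runs without a model. The function name run_accuracy_eval, the 1 to 10 score scale, and the pass threshold are assumptions for illustration, not Upsonic's API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class AccuracyResult:
    scores: List[float]
    average: float
    passed: bool


def run_accuracy_eval(
    agent: Callable[[str], str],
    judge: Callable[[str, str, str], float],
    query: str,
    expected: str,
    iterations: int = 3,
    threshold: float = 7.0,
) -> AccuracyResult:
    """Ask the agent `iterations` times and let the judge score each answer.

    `judge(query, expected, answer)` is assumed to return a score from 1 to 10.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    scores = []
    for _ in range(iterations):
        answer = agent(query)
        scores.append(float(judge(query, expected, answer)))
    average = mean(scores)
    return AccuracyResult(scores, average, average >= threshold)
```

Repeating the query several times and averaging is one simple way to account for the agent's non-determinism; the judge's own scoring still has to be checked against human labels before the numbers can be trusted.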
vectara-agentic (Python): Built-in Arize Phoenix (OpenInference instruments LlamaIndex; setup_observer at _observability.py:16); eval_fcs() writes the Vectara FCS score back as a span evaluation (_observability.py:101). The callbacks AgentCallbackHandler/agent_progress_callback report TOOL_CALL/TOOL_OUTPUT in real time (agent.py:623). VHC (hallucination correction), via compute_vhc/analyze_hallucinations, is its distinctive evaluation capability.

_observability.py:16, _observability.py:101, agent_core/utils/hallucination.py:113
vectara_agentic/_observability.py:16 python
SPAN_NAME: str = "VectaraQueryEngine._query"


def setup_observer(config: AgentConfig, verbose: bool) -> bool:
    """
    Setup the observer.
    """
    if config.observer != ObserverType.ARIZE_PHOENIX:
        if verbose:
            print("No Phoenix observer set.")
        return False

    try:
VoltAgent (TypeScript): Core selling point: full-stack OpenTelemetry with 3 custom SpanProcessors: WebSocket (real-time push to the VoltOps Console), LocalStorage (local trace storage and querying), and LazyRemoteExport (OTLP to VoltOps or any backend); enabled by default with zero configuration. Evaluation: eval (create-scorer/LLM-judge) plus the standalone @voltagent/scorers and @voltagent/evals packages plus a langfuse exporter.

observability/index.ts:1, observability/node/volt-agent-observability.ts:31
packages/core/src/observability/index.ts:1 typescript
/**
 * VoltAgent Observability - Built on OpenTelemetry
 *
 * This module provides OpenTelemetry-based observability with:
 * - WebSocket real-time events via custom SpanProcessor
 * - Local storage via custom SpanProcessor
 * - OTLP export support
 * - Zero-configuration defaults
 */