OpenAI analyzes the impact of accidental chain-of-thought grading on its models

05-09 04:19

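Chain-of-thought monitors are a key layer of defense against AI agent misalignment. To preserve monitorability, we avoid penalizing misaligned reasoning during RL. We found that a small amount of accidental chain-of-thought grading affected released models, and we are sharing our analysis. https://alignment.openai.com/accidental-cot-grading/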

Original post


Author: OpenAI / @OpenAI
Published: 2026-05-08T20:19:04.000Z

Chain of thought monitors are a key layer of defense against AI agent misalignment. To preserve monitorability, we avoid penalizing misaligned reasoning during RL.

We found a limited amount of accidental CoT grading which affected released models, and are sharing our analysis.

Link preview (alignment.openai.com): We found limited accidental CoT grading in some released models, fixed the affected reward pathways, and found no clear evidence that monitorability degraded.