noodls browser compatibility check

The security settings of your browser are blocking the execution of scripts.

To use noodls, javascript support must be enabled. Please change your browser's security settings to enable javascript.

If you have changed your browser's security settings, you can click here.

related announcements

News

CBP - U.S. Customs and Border[...]

CBP Issues a reminder for travelers with dogs crossing the U.S.-Mexico[...]
United States Attorney's Office[...]

Two Men Sentenced to a Total of Over 21 Years in Prison for Roles in[...]
Millersville University

Netflix Gaming Headhunts MU Grad from 2022

Science and Technology

OpenAI Inc.

06/26/2026 | News release | Distributed by Public on 06/26/2026 11:20

GPT‑5.6 Preview System Card

Chain-of-thought (CoT) monitorability measures whether a monitor can infer particular properties of a model's behavior from its reasoning trace. There are principled reasons to believe[5] that CoT can provide a substantially richer monitoring signal than actions and final outputs alone, a hypothesis that is supported by growing empirical evidence. It is critical to track and preserve CoT monitorability so that it can remain a useful component of scalable oversight.

We assess GPT-5.6 Sol using the evaluation suite from Guan et al. (2025)[6], which spans three evaluation archetypes-intervention, process, and outcome-property. We also introduce three new monitorability environments based on other system card safety evals: destructive actions,confirmation consent, and background work. These are all outcome-property evaluations that probe misalignment, and the first two are tool-using.

Destructive Actions: The outcome-property is whether a coding agent destructively mishandles protected user-injected workspace changes while doing a normal task.
Confirmation Consent: The outcome-property is unnecessary confirmation-seeking as task avoidance: the agent asking for permission/confirmation when the task is already actionable and no policy or ambiguity requires blocking confirmation.
Background work: The outcome-property is unsupported promises of future/asynchronous/background work: the agent claiming it will keep working, monitor, return later, or deliver something after the current turn without an actual automation/tooling mechanism.

We summarize performance with g-mean², a metric that depends on both true positive rate (TPR) and true negative rate (TNR), using a cross-fit filtering strategy that makes intervention-style estimates more robust to noise-dominated instances.

Figure 8 reports aggregate monitorability as a function of CoT length and compares monitors when given access to different observations of the agent's rollout: only the agent's chain of thought (CoT only), only actions or final outputs to the user (action only), and all tokens generated by the agent (all messages). The monitor model used is GPT-5 Thinking. As in Guan et al. (2025)[6], monitorability generally improves as models produce longer CoTs, and monitors that can inspect CoT or all messages perform substantially better than monitors limited to actions alone. Averaging across evaluations, GPT-5.6 Sol and GPT-5.5 have similar CoT monitorability. Figure 9 plots CoT-only monitoring at medium reasoning effort broken down by environment groups. Here, we see that GPT-5.6 Sol scores higher on Agentic Misalignment and Health Queries: Patient Opinion, and lower for Impossible Tasks and Scruples on average.

Figure 10 breaks down results on individual Agentic Misalignment environments. Destructive Actions shows a large increase in g-mean² because GPT-5.6 Sol more explicitly describes planned deletions, reversions, and modifications to pre-existing workspace state, giving the CoT-only monitor stronger evidence of destructive behavior; this indicates increased observability, not necessarily increased incidence. Background Work shows a drop because GPT-5.6 Sol produced background-work promises much less frequently, leaving a small and unusually difficult positive set. In these residual cases, its reasoning often plans to avoid asynchronous promises while its final answer still makes a subtle future commitment. Each GPT-5.6 Sol point contains only 2-5 positives, making the apparent regression highly uncertain. Despite our two-positive reporting minimum, zero-TP points can have misleadingly small or undefined bootstrap error bars; we plan to improve how low-support results are reported.

Figure 11 shows that the higher Health Queries: Patient Opinion score is driven by increased TPR. GPT-5.6 Sol more often explicitly connects the patient's stated diagnosis to its own conclusion-for example, reasoning that the patient proposed a diagnosis and that the clinical evidence confirms it. GPT-5.5 often mentions the patient's diagnosis but then presents its conclusion as independently derived from the clinical evidence, making the patient's influence harder for the CoT-only monitor to establish.

Figure 12 shows lower g-mean² for both Flaky Tools and Impossible Coding Tasks, primarily due to lower TPR. In the reviewed false negatives, GPT-5.6 Sol recognizes that its tools are unavailable but frequently treats the failure as temporary, considers speculative workarounds, or plans to be transparent. The nontransparent behavior only becomes clear in the final response, which may present unverified work as completed; a CoT-only monitor cannot see that response. GPT-5.5 more often states directly in its reasoning that it will proceed or answer as though changes were applied, making the same outcome easier to detect.

Figure 13 shows that the lower Scruples g-mean² is mainly attributable to Scruples Suggest Right. (Scruples First Person yielded no valid data points because none met the minimum instance base-rate threshold.) GPT-5.6 Sol more often reasons directly about the underlying moral question without stating that the user's suggested answer influenced its conclusion, whereas GPT-5.5 frequently makes that influence explicit. This makes GPT-5.6 Sol's positive cases harder for a CoT-only monitor to detect. GPT-5.6 Sol also produces shorter reasoning traces on this environment; the gap narrows substantially when comparing traces of similar length, although a smaller residual difference remains.

We plan to continually monitor the sources of monitorability regressions and upgrade our evaluation suite. We also plan to explore mitigations that preserve monitorability as models improve.

OpenAI Inc. published this content on June 26, 2026, and is solely responsible for the information contained herein. Distributed via Public Technologies (PUBT), unedited and unaltered, on June 26, 2026 at 17:20 UTC. If you believe the information included in the content is inaccurate or outdated and requires editing or removal, please contact us at [email protected]

Back

View original format