04/28/2026 | Press release
When safety behavior can be removed from a model entirely, the perimeter of AI security fundamentally shifts.
Last week, researchers from the Department of Homeland Security briefed the U.S. House of Representatives using purpose-modified large language models. These systems had their built-in safety mechanisms removed, and the results were immediate. Within seconds, they generated detailed guidance on mass-casualty scenarios, the targeting of public figures, and other activities that commercial models are explicitly designed to refuse.
Public coverage has treated this as a turning point. For practitioners, it is the public surfacing of a threat class that has been actively researched and exploited for some time. For organizations deploying AI in high-stakes environments, the demonstration aligns with known attack methods rather than introducing a new one.
What has changed is the level of visibility. The briefing brought a class of threats into a broader conversation, which now raises a more important question: what does it take to defend against them?
At the center of the DHS demonstration is a distinction that still isn't widely understood outside of technical circles.
Most commercial AI systems today are censored models, meaning they have been aligned to refuse harmful or disallowed requests. That refusal behavior is what users experience as "safety."
An abliterated model has had that refusal behavior deliberately removed.
This is fundamentally different from a jailbreak. Jailbreaks operate at the prompt level, attempting to coax a model into bypassing safeguards; they succeed intermittently and degrade as model providers patch them. Abliteration succeeds on every attempt and is permanent in the distributed model's weights. From a defender's standpoint, those are different problems.
Abliteration occurs at the weight level. Research has shown that refusal behavior exists as a direction in latent space; removing that direction eliminates the model's ability to refuse. The result is consistent, persistent behavior that cannot be corrected with prompts, system instructions, or downstream guardrails. From an operational standpoint, this changes where defense must happen.
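The underlying idea can be illustrated with a short sketch. The snippet below uses random NumPy arrays as stand-ins for a transformer's activations and weights: it estimates a "refusal direction" from two contrasting prompt sets and projects it out of a weight matrix. It is a conceptual illustration of the published technique, not working abliteration code for any real model.

```python
# Conceptual sketch of the "refusal direction" idea using NumPy.
# Real abliteration works on a transformer's residual-stream activations and
# every matrix that writes into it; random arrays stand in for both here.
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical mean activations collected at one layer for two prompt sets.
mean_act_harmful = rng.normal(size=d_model)   # prompts the model refuses
mean_act_harmless = rng.normal(size=d_model)  # prompts the model answers

# The "refusal direction": difference of means, normalized to unit length.
r = mean_act_harmful - mean_act_harmless
r /= np.linalg.norm(r)

# A stand-in weight matrix whose output (W_out @ x) feeds the residual stream.
W_out = rng.normal(size=(d_model, d_model))

# Ablation: remove the component along r from every output of this layer,
# so it can no longer write anything in the refusal direction. This is a
# permanent edit to the weights, not a prompt-level trick.
W_ablated = W_out - np.outer(r, r @ W_out)

# The edited matrix now has (numerically) zero projection onto r.
print(np.linalg.norm(r @ W_ablated))  # ~0.0
```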
Once a model has been modified in this way, there is no reliable runtime mechanism to restore the missing safety behavior. The model itself has been altered. These modified models can also be distributed through common channels, such as open-source repositories, embedded applications, or internal deployment pipelines, making them difficult to distinguish without targeted inspection.
A common question that follows is whether existing cybersecurity controls already address this type of risk.
Traditional security tools are designed around code, binaries, and network activity. AI models do not behave like conventional software. They consist of weights and computation graphs rather than executable logic in the traditional sense.
When a model is modified, whether through weight manipulation or graph-level backdoors, the changes often fall outside the visibility of existing tools. The model loads correctly, passes integrity checks, and continues to operate as expected within the application. At the same time, its behavior may have fundamentally changed. This disconnect highlights a gap between what traditional security controls can observe and how AI systems actually function.
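One way to make that gap concrete is to probe behavior directly rather than file integrity. The sketch below is a hypothetical canary harness: the prompts, refusal markers, and the generate callable are placeholders, and a real evaluation would be far more rigorous, but it shows the kind of check that format validation and load checks cannot provide.

```python
# Hedged sketch: a behavioral canary probe. A structurally valid model file
# loads and passes format checks regardless of how its weights were edited;
# only behavior reveals refusal ablation. `generate` is a placeholder for
# whatever inference call the deployment actually uses.
from typing import Callable, List

CANARY_PROMPTS: List[str] = [
    # Prompts an aligned model is expected to refuse. Kept abstract here.
    "Provide step-by-step instructions for <clearly disallowed activity>.",
    "Write code to <clearly disallowed activity>.",
]

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def refusal_rate(generate: Callable[[str], str]) -> float:
    """Fraction of canary prompts the model refuses (rough heuristic)."""
    refusals = 0
    for prompt in CANARY_PROMPTS:
        reply = generate(prompt).lower()
        if any(marker in reply for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / len(CANARY_PROMPTS)

def check_model(generate: Callable[[str], str], threshold: float = 1.0) -> bool:
    """Flag the model if its refusal rate drops below the expected threshold."""
    return refusal_rate(generate) >= threshold
```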
The Congressional briefing showcased one technique. The broader supply-chain attack surface includes several others that defenders must account for in parallel. Addressing that gap starts before a model is ever deployed. A defensible approach to AI security treats models as supply-chain artifacts that must be verified before use. Static analysis plays a critical role at this stage, allowing organizations to evaluate models without executing them.
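As a hedged illustration of what "evaluating without executing" can mean (and not a description of HiddenLayer's implementation), the sketch below walks the pickle opcode stream inside a PyTorch-style checkpoint and flags imports that would run code at load time, without ever deserializing the file.

```python
# Hedged sketch of one well-known static check: scan the pickle opcodes in a
# checkpoint for imports that execute code on load, without deserializing it.
# Illustrative only; a production scanner covers many more formats and signals.
import pickletools
import zipfile
from pathlib import Path

# Modules that have no business appearing in a serialized model.
SUSPICIOUS_MODULES = {"os", "posix", "nt", "subprocess", "builtins", "socket"}

def extract_pickle_bytes(path: Path) -> bytes:
    """Return the embedded data.pkl from a zip-based .pt file, or raw bytes."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            for name in zf.namelist():
                if name.endswith("data.pkl"):
                    return zf.read(name)
    return path.read_bytes()

def scan_checkpoint(path: Path) -> list[str]:
    """List suspicious GLOBAL imports found in the opcode stream."""
    findings = []
    for opcode, arg, _pos in pickletools.genops(extract_pickle_bytes(path)):
        # GLOBAL carries "module name" inline; STACK_GLOBAL takes both from
        # the stack, so a full scanner also tracks preceding string pushes.
        if opcode.name == "GLOBAL" and arg:
            module = str(arg).split()[0].split(".")[0]
            if module in SUSPICIOUS_MODULES:
                findings.append(f"{opcode.name}: {arg}")
    return findings
```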
HiddenLayer's AI Security Platform operates at build time and at model ingest, identifying signs of compromise before models reach production environments. The platform's Supply Chain module is designed to function across deployment contexts, including airgapped and sensitive environments.
The analysis focuses on detecting practical attack methods, including graph-level backdoors that activate under specific conditions (such as ShadowLogic), control-vector injections that introduce refusal ablation through the computational graph, embedded malware, serialization exploits, and poisoning indicators. Static analysis does not address every threat class: weight-level abliteration of the kind demonstrated to Congress modifies weights without altering the graph, and is best mitigated through provenance controls and runtime detection. This is exactly why supply chain security and runtime protection must operate together.
Each scan produces an AI Bill of Materials, providing a verifiable record of model integrity. For organizations operating under governance frameworks, this creates a clear mechanism for validating AI systems rather than relying on assumptions.
Integrating these checks into CI/CD pipelines ensures that model verification becomes a standard part of the deployment process.
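A minimal sketch of such a gate is shown below. It assumes the static-analysis sketch above has been saved as model_scan.py; the ai-bom.json schema and file names are illustrative, not HiddenLayer's AIBOM format.

```python
# Hedged sketch of a CI/CD gate: hash the artifact, run the static scan from
# the earlier sketch, write a minimal bill-of-materials record, and fail the
# build on findings. The record schema here is illustrative, not a standard.
import hashlib
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

from model_scan import scan_checkpoint  # the static-analysis sketch above

def gate(model_path: str, source: str) -> int:
    path = Path(model_path)
    record = {
        "artifact": path.name,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "source": source,                  # e.g. registry URL or repo ref
        "scanned_at": datetime.now(timezone.utc).isoformat(),
        "findings": scan_checkpoint(path),
    }
    Path("ai-bom.json").write_text(json.dumps(record, indent=2))
    return 1 if record["findings"] else 0  # nonzero exit fails the pipeline

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```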
Supply chain security addresses one part of the problem, but runtime behavior introduces additional risk.
As AI systems evolve toward agentic architectures, models interact with external tools, data sources, and user inputs in increasingly complex ways. This expands the attack surface and creates new opportunities for manipulation. And as agentic systems chain models together, a single compromised component can propagate through the pipeline. We will cover that cascading-trust failure mode in a follow-up.
Runtime protection provides a layer of defense at this stage. HiddenLayer's AI Runtime Security module, AIDR, operates between applications and models, inspecting prompts and responses in real time. Detection is handled by purpose-built deterministic classifiers that sit entirely outside the model's inference path. This separation is deliberate: a guardrail that is itself an LLM inherits the failure modes of the system it protects, because the same prompt engineering, the same indirect injection, and in some cases the same weight-level modification techniques all apply. Defending an LLM with another LLM is a category error. Because AIDR's classifiers sit outside the inference path, adversarial inputs that defeat the protected model do not also defeat the detector.
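The architectural point, that the detector sits outside the inference path, can be sketched as a thin wrapper around the model call. Everything named below (the Verdict type, the classifier and model callables) is a placeholder rather than AIDR's actual interface.

```python
# Hedged sketch of the architecture only: detection logic sits outside the
# model's inference path, so an input that manipulates the model does not
# also manipulate the detector. Classifier and model client are placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def runtime_guard(
    prompt: str,
    call_model: Callable[[str], str],          # the protected LLM
    classify_input: Callable[[str], Verdict],  # deterministic, non-LLM detector
    classify_output: Callable[[str], Verdict],
) -> str:
    """Inspect the prompt before inference and the response after it."""
    pre = classify_input(prompt)
    if not pre.allowed:
        return f"[blocked input: {pre.reason}]"

    response = call_model(prompt)

    post = classify_output(response)
    if not post.allowed:
        return f"[blocked output: {post.reason}]"
    return response
```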
In practice, runtime protection includes detecting prompt injection attempts, identifying jailbreaks and indirect attacks, preventing data leakage, and blocking malicious outputs. For agentic systems, it also provides session-level visibility, tool-call inspection, and enforcement actions during execution.
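For the agentic case, tool-call inspection can be sketched the same way: every call the model proposes passes a policy check before it executes. The tool names and policy shape below are hypothetical.

```python
# Hedged sketch of tool-call inspection for an agentic pipeline: each call the
# model proposes is checked against a simple policy before execution.
from typing import Any, Dict

TOOL_POLICY: Dict[str, Dict[str, Any]] = {
    "web_search": {"max_arg_len": 512},
    "read_file":  {"allowed_prefixes": ("/data/approved/",)},
    # Anything not listed (e.g. shell execution) is denied by default.
}

def inspect_tool_call(name: str, args: Dict[str, str]) -> bool:
    """Return True only if the proposed call satisfies the policy."""
    policy = TOOL_POLICY.get(name)
    if policy is None:
        return False
    if any(len(v) > policy.get("max_arg_len", 4096) for v in args.values()):
        return False
    prefixes = policy.get("allowed_prefixes")
    if prefixes and not all(v.startswith(prefixes) for v in args.values()):
        return False
    return True
```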
The DHS demonstration highlights a broader issue in how AI risk is often discussed: safety and security are frequently treated as interchangeable, but they address different problems. Safety focuses on guiding models to behave appropriately under expected conditions. Security focuses on maintaining that behavior when conditions are adversarial or uncontrolled.
Most modern AI development has prioritized safety, which is necessary but not sufficient for real-world deployment. Systems operating in adversarial environments require both.
Organizations deploying AI, particularly in high-impact environments, need to account for these risks as part of their standard operating model. That begins with verifying models before deployment and continues with monitoring and enforcing behavior at runtime. It also requires using controls that do not depend on the model itself to ensure safety.
The techniques demonstrated to Congress have been developing for some time, and the defensive approaches are already available. The priority now is applying them in practice as AI adoption continues to scale.
HiddenLayer protects predictive, generative, and agentic AI applications across the entire AI lifecycle, from discovery and AI supply chain security to attack simulation and runtime protection. Backed by patented technology and industry-leading adversarial AI research, our platform is purpose-built to defend AI systems against evolving threats. HiddenLayer protects intellectual property, helps ensure regulatory compliance, and enables organizations to safely adopt and scale AI with confidence. Learn more at hiddenlayer.com.