In the Hands of the Classifier

Classifiers in the context of AI are systems that weigh the intention and risk of a Large Language Model prompt or response. In early 2026, classifiers were introduced to reduce the need to continuously approve every action taken by an AI agent. Previously, operators frequently had to grant their coding agent permission to read a specific file or invoke a program, which added friction to the process. The classifier takes ownership of the decision space on behalf of the user.

This essay examines what could happen when companies apply LLM-based classifiers to determine which capabilities of a product are available to its users.

The Classifier

Classifiers were introduced on a large scale with the release of Anthropic's Opus 4.7 and its new "Auto Mode" capability. The classifier has two main objectives: prevent responses to objectively harmful prompts and enable a safe and continuous agent workflow. Without the classifier, an overnight automatic research loop could come to a grinding halt because a harmless tool execution was prevented by the harness. To prevent this, the classifier observes each tool call made by the agent and judges whether it is in the user's interest. Classifiers are able to prevent destructive actions like deleting a file or dropping a database. The result is a useful tool that acts in defense of the user's interest: keeping the loop going while preventing harm.
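The gatekeeping loop described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the names ToolCall, classify, run_loop, and DESTRUCTIVE_MARKERS are hypothetical, and a simple keyword rule stands in for the LLM-based judgment a real classifier would make. It is not Anthropic's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str      # hypothetical tool name, e.g. "read_file" or "shell"
    argument: str  # a path or a command line


# Hypothetical markers of destructive intent. A real classifier would weigh
# intent and risk with a model rather than match substrings.
DESTRUCTIVE_MARKERS = ("rm -rf", "drop database", "drop table")


def classify(call: ToolCall) -> str:
    """Return "allow" or "block" for a single tool call."""
    if call.tool == "delete_file":
        return "block"
    text = call.argument.lower()
    if any(marker in text for marker in DESTRUCTIVE_MARKERS):
        return "block"
    return "allow"


def run_loop(calls):
    """Sort calls into allowed and blocked without ever halting the loop."""
    executed, blocked = [], []
    for call in calls:
        if classify(call) == "allow":
            executed.append(call)  # a real harness would run the tool here
        else:
            blocked.append(call)   # skipped, but the loop keeps going
    return executed, blocked
```

The point of the sketch is the control flow: a blocked call is set aside rather than raising a permission prompt, so an unattended run continues. Whoever writes the classify function decides what counts as acceptable, which is the question the rest of this essay turns on.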

The introduction of Anthropic's latest model, Fable, also brought with it a new classifier. According to two paragraphs buried in Fable's 319-page system card, Anthropic now aims to prevent the use of the model for legitimate use cases such as work on biology, chemistry, or frontier LLM development. To do that, they're employing Fable's new classifier. In most cases the LLM would refuse to continue; in other cases, the model would intentionally introduce errors, actively subverting and undermining the user's work. After a wave of customer complaints, Anthropic ended up disabling the subversive behaviors of their model. Yet the crater left by the impact of their policy decision remains, and it will not be correctable through a swift software update.

Unlike our interventions [...] these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning [...] Claude will still respond helpfully to user requests.

↪ Anthropic, Fable System Card: Claude Fable 5 & Claude Mythos 5, pg. 13

The Censorship of Capabilities

An AI model should always act in a directionally positive way: it may disagree, but it should not generate responses that could lead to harmful outcomes such as AI psychosis or worse. A model that censors information is naturally treading in a space prone to alignment with contemporary political opinions and influence. It is a thin line that must be walked carefully so as not to undermine liberal access to information.

However, censoring capability and actively subverting the user opens up a new chapter on the AI frontier that we have not been confronted with yet.

Extend the concept of capability censorship to other domains and it becomes obvious that it leads to a future where humans have less freedom and less ownership over their daily lives.

Consider getting into your car, only for it to refuse to drive you to your destination. The car notifies you that the destination is unsafe, because you are planning to drive to a meeting that the classifier deemed unacceptable. It based its decision on the context it had from your inbox and the email you received today. You attempt to drive manually, but the classifier refuses and your car is now disabled. This is not the reality that we live in today, but it is obvious now that it is a possible reality, not in ten or twenty years, but within the coming years.

The Ownership

Is the self-driving model provided by the manufacturer of your car aligned with your preferences and safety considerations, or could its classifier even prevent you from driving? As a driver you have ownership over your driving capabilities, and the same should hold true when it comes to the capabilities of the AI that drives on your behalf.

Artificial Intelligence will soon be an extension of human capabilities for almost any task. There is therefore a need for direct ownership and control over the models we engage with. Individuals need to be able to control the decision space of their AI systems; otherwise we risk handing over our civil liberties and security to the stakeholders of these AIs.

The landscape is shifting fast. The drivers for local AI inference used to be cost and privacy, but within the span of a week, a new pair emerged: capability and trust. Owning the decision space is no longer a niche concern for the security-minded. Soon it will be what stands between you and an appliance that can overrule you. The car, the drone, the security system, and any other AI-powered tool: each is becoming something you must hold a stake in, or else you leave your fate in the classifier's hands.
