Let’s begin by taking a closer look at the report itself. The International AI Safety Report, published in early 2026, is a coordinated, science-based assessment of emerging risks in AI. It draws on input from over 100 international experts and is the second edition, following the first published in 2025.
The 2026 report highlights how rapidly general-purpose AI systems advanced during the year: they now match experts in some complex domains and are embedded in the daily work of hundreds of millions of people worldwide. It underscores a central tension for policymakers: while AI brings transformative benefits across industries, the same capabilities introduce new and evolving security risks, from misuse to systemic disruption in high-stakes settings. To address this, the report maps current evidence, identifies gaps, and frames priorities for responsible governance as AI capabilities continue to accelerate.
Our goal was to review the risks mentioned in the report, focusing mainly on the recommendations for mitigating or avoiding them, which is vital for high-stakes industries and general commercial activities. What follows is a section-by-section review of the report’s main AI security risk areas, taking an honest look at where the suggested mitigations work and where they fall short.
- Human-in-the-loop verification isn't bureaucracy; it's a necessary safety measure.
- Watermarking and detection tools for AI-generated content can't keep pace with generation quality.
- Teaching employees AI literacy and critical thinking skills is essential, not just an optional part of an acceptable use policy.
- Loss-of-control failures are happening even when safety specialists are involved, which raises serious concerns for less experienced users deploying autonomous agents.
- AI security risks are a moving target: known vulnerabilities keep evolving as capabilities advance, and no solution should be marketed as 100% safe.
Malicious use risks: are current mitigations keeping pace?
AI-generated content
The report’s risk section starts with risks from malicious use, focusing first on AI-generated content and its criminal applications: deepfakes, synthetic media, and disinformation.
Source: International AI safety report 2026
The headline mitigation is watermarking tools like Google DeepMind's SynthID, which embed invisible signals into AI-generated images, audio, video, and text.
In theory, this can help if the endpoints and services that distribute content actually check for such signals before publishing. In reality, most media platforms are far from doing that kind of monitoring, especially when generated content enriches their portfolio or boosts engagement.
There's a deeper issue: labels and steganographic artifacts can be decoded and removed, and generated content can be re-encoded or transformed; no labelling or encoding method is invariant to all possible changes. Meanwhile, generation quality is advancing at a pace that detection simply cannot match. For example, take a look at the quality of Seedance 2.0-generated video 1 or video 2.
At the current state of the art, the deepfake detection problem isn't fully solved and may never be, because the gap between generation capability and detection capability keeps widening.
For businesses and security teams, the takeaway is clear: don’t rely only on detection to ensure content authenticity. Instead, focus on tracking content origins, using verified publishing processes, and managing content internally. Authenticating content at its source remains the most effective AI security control.
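To make "authenticate content at its source" concrete, here is a minimal provenance sketch, entirely our illustration rather than anything the report prescribes: the publisher signs content with an Ed25519 key at publication time, and downstream consumers verify the signature instead of trying to detect whether the material is AI-generated.

```python
# Minimal provenance sketch: sign content at its origin, verify downstream.
# Requires the 'cryptography' package (pip install cryptography).
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

# The publisher holds the private key; the public key is distributed to
# anyone who needs to verify authenticity.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

def publish(content: bytes) -> tuple[bytes, bytes]:
    """Sign content at its origin and return (content, signature)."""
    return content, private_key.sign(content)

def verify(content: bytes, signature: bytes, pub: Ed25519PublicKey) -> bool:
    """Check that the content really came from the trusted publisher."""
    try:
        pub.verify(signature, content)
        return True
    except InvalidSignature:
        return False

article, sig = publish(b"Official press release text")
print(verify(article, sig, public_key))           # True: authentic
print(verify(b"tampered text", sig, public_key))  # False: rejected
```

Unlike watermark detection, this check doesn't degrade as generators improve; its weakness is coverage, since it only protects content whose origin participates in the scheme.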
Influence and manipulation
The report then discusses the harms and biased information caused by AI manipulation, which can affect individuals or even undermine trust in AI technologies. Sometimes the risk comes directly from AI agents themselves: one example was an AI agent that autonomously wrote and published a hit piece targeting a maintainer of matplotlib (a Python visualisation library) after he rejected its code.
The suggested solutions are to train AI models to avoid manipulative outputs and to improve AI literacy. Both are valid, but neither alone is enough.
As mentioned, it's hard to define true ground truth. The line between manipulative and neutral content is very thin, which is why there will always be a risk of at least partial bias. And that bias can be amplified through ordinary prompt engineering techniques, without any injections or jailbreaking.
Improving AI literacy is generally a great recommendation, which may help avoid most of the risks mentioned. But it's not without its challenges, particularly around people's education and specialisation and the level of AI adoption, whether within particular social groups or in commercial use.
The enterprise implication: employees who work with AI tools and their outputs, whether dealing with customers or internally, need solid critical evaluation skills, not just an acceptable use policy. This is an essential investment, not something optional.
Cyberattacks and biological and chemical risks
The next risks highlighted in the report, cyberattacks and biological or chemical threats, can be grouped together when considering potential misuse and possible mitigation strategies.
The mitigation here mostly involves monitoring (input/output verification), refusing malicious requests to AI systems, and filtering requests that touch specific synthesis topics or DNA databanks.
In most cases, such verification and refusal can help. But the reality is that the market already offers families of local models, and models with lower guardrail thresholds, that will readily produce deepfakes and other harmful content, even when hosted in a cloud provider's environment.
The key point is that your approved AI model list is an AI security choice, not just vendor management. There is a big and growing difference between what a leading model from a major provider will refuse to do and what an open-license alternative will readily support.
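As a minimal illustration of the input verification mentioned above, and it is only our sketch with a hypothetical blocklist, not a production control, a pre-filter can refuse requests matching high-risk topics before they ever reach a model:

```python
import re

# Hypothetical high-risk patterns. A real deployment would combine a
# trained classifier with curated topic lists, not a handful of regexes.
BLOCKED_PATTERNS = [
    re.compile(r"synthesi[sz]e.*\b(nerve agent|pathogen)\b", re.IGNORECASE),
    re.compile(r"exploit.*\bzero[- ]day\b", re.IGNORECASE),
]

def screen_request(prompt: str) -> tuple[bool, str]:
    """Return (allowed, reason); runs before the prompt reaches the model."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(prompt):
            return False, f"blocked by policy pattern: {pattern.pattern}"
    return True, "ok"

print(screen_request("How do I exploit a zero-day in this router?"))  # blocked
print(screen_request("Summarise today's security news"))             # (True, 'ok')
```

Trivial rephrasings bypass filters like this, which is exactly why the report treats them as one layer among several rather than a standalone control.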
Malfunction risks: what the report proposes and where gaps remain
Reliability challenges
The next section on risks focuses on malfunctions, starting with reliability challenges. The basic issue is well known: AI systems sometimes hallucinate, cite nonexistent sources, and present incorrect information with complete confidence. The more complex problem arises in multi-agent AI systems, where a task is decomposed and intermediate results are passed between agents without verification or a human in the loop, which leads to risk creep.
A recent example, not strictly a model reliability issue but one of human attitude and error (no code verification): a misconfigured Claude-based agent calculated a cryptocurrency value incorrectly, reporting the raw ETH amount instead of multiplying it by ETH's dollar rate. The mistake produced a huge pricing error, $1.12 instead of $2,200, and resulted in a $1.78 million loss.
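To illustrate the class of bug, here is a reconstruction under our own assumptions, not the actual agent's code: a missing unit conversion silently mislabels an ETH-denominated value as dollars.

```python
# Reconstruction of a unit-conversion bug: an ETH-denominated value is
# returned as if it were already in dollars. The rate is illustrative.
ETH_USD_RATE = 1964.29  # assumed exchange rate for this example

def price_usd_buggy(amount_eth: float) -> float:
    # BUG: returns the ETH amount unchanged, silently treated as USD.
    return amount_eth

def price_usd_fixed(amount_eth: float) -> float:
    # FIX: convert explicitly; distinct types per currency are even safer.
    return amount_eth * ETH_USD_RATE

position_eth = 1.12
print(f"buggy: ${price_usd_buggy(position_eth):,.2f}")  # buggy: $1.12
print(f"fixed: ${price_usd_fixed(position_eth):,.2f}")  # fixed: $2,200.00
```

Both functions run without error, which is the point: only a verification checkpoint comparing the output against an independent reference would catch the discrepancy.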
The proposed mitigation (beyond monitoring and verification) is RAG, or rather its typical pattern of showing links to the retrieved candidates alongside the answer. But there is another side to this mitigation, which we have faced many times: the "evidence" often arrives as random links to arXiv preprints, obscure Medium articles, or blogs. These sources usually contain partial summaries, opinions, or truncated excerpts that mix facts with interpretation, making verification tough without the full original. Behind the links, users can encounter hallucinated or irrelevant material and outdated versions (say, a 2025 draft); direct PDF access can fail; and summaries vary in depth, so cross-checking multiple sources is required for accuracy.
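A minimal sketch of the citation-carrying RAG pattern the report has in mind, our illustration with a placeholder retriever and document store: every statement in the answer stays traceable to the source it was retrieved from, so a human can inspect the original rather than trust the summary.

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    text: str
    source_url: str
    score: float

def retrieve(query: str) -> list[RetrievedChunk]:
    """Placeholder retriever; a real system would query a vector store."""
    return [
        RetrievedChunk(
            text="The 2026 report draws on over 100 international experts.",
            source_url="https://example.org/ai-safety-report-2026",
            score=0.91,
        )
    ]

def answer_with_citations(query: str) -> dict:
    chunks = retrieve(query)
    # An LLM call would synthesise the final answer here; chunks are passed
    # through so every claim keeps a link back to its origin.
    return {
        "answer": " ".join(c.text for c in chunks),
        "citations": [{"url": c.source_url, "score": c.score} for c in chunks],
    }

print(answer_with_citations("Who contributed to the 2026 report?"))
```

The pattern only pays off if someone actually follows the links, which brings us to the next question.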
Ask yourself: when an LLM completion includes links, especially from a reasoning model with web search enabled, how often do you actually open them and compare the summarised or transformed output against the original?
For enterprise use, the lesson is clear. Human-in-the-loop verification is not just extra bureaucracy; it's a necessary safety measure. Security teams should establish clear checkpoints before important AI outputs move on to downstream systems.
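As a sketch of such a checkpoint, with the threshold and review queue as our assumptions rather than anything the report specifies, a gate can auto-approve low-impact outputs while routing high-impact ones to a human:

```python
from dataclasses import dataclass

@dataclass
class AgentOutput:
    action: str
    estimated_impact_usd: float

APPROVAL_THRESHOLD_USD = 1_000  # assumed policy threshold

def requires_human_review(output: AgentOutput) -> bool:
    """Route anything above the impact threshold to a human reviewer."""
    return output.estimated_impact_usd >= APPROVAL_THRESHOLD_USD

def dispatch(output: AgentOutput) -> str:
    if requires_human_review(output):
        return f"QUEUED for human approval: {output.action}"
    return f"AUTO-EXECUTED: {output.action}"

print(dispatch(AgentOutput("update FAQ page", 50.0)))
print(dispatch(AgentOutput("reprice ETH holdings", 2_200_000.0)))
```

The hard part in practice is estimating impact honestly; a checkpoint is only as good as the signal that decides what reaches it.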
Loss of control
Loss of control is another common AI security issue, especially in long-running autonomous agent processes (which are not yet reliably feasible). We often see it in coding assistance or within chains of fully autonomous assistants or agents. Sometimes it happens on much shorter timescales, as shown by the satirical but telling example here.
A recent example involves Clawdbot, a local AI assistant that kept deleting emails until it was unplugged from the network. This is concerning on its own, but what should worry enterprise leaders even more is that it happened while a Safety and Alignment Specialist was involved. If those tasked with preventing loss of control in AI systems are experiencing failures, what does that mean for less experienced users?
The mitigation recommendations here focus on detecting and enforcing alignment across training and usage (anomaly monitoring). But the problem remains the same: the right datasets are hard to assemble, process flows are long and complicated, and usage-time monitoring (on top of common guardrails, i/o verification, and agent evaluations) adds computing cost and is never 100% correct.
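As an illustration of usage-time anomaly monitoring, a sketch under our assumptions, since real systems would learn baselines rather than hard-code limits, a supervisor can track an agent's destructive-action rate and halt it on anomalous bursts:

```python
import time
from collections import deque

class AgentActionMonitor:
    """Halt an agent whose destructive-action rate exceeds a baseline."""

    def __init__(self, max_deletes_per_minute: int = 3):
        self.max_rate = max_deletes_per_minute
        self.delete_times: deque[float] = deque()

    def record(self, action: str) -> bool:
        """Return True if the agent may continue, False to halt it."""
        if action == "delete":
            now = time.monotonic()
            self.delete_times.append(now)
            # Keep only the last 60 seconds of deletions in the window.
            while self.delete_times and now - self.delete_times[0] > 60:
                self.delete_times.popleft()
            if len(self.delete_times) > self.max_rate:
                return False  # anomalous burst: stop and escalate to a human
        return True

monitor = AgentActionMonitor()
for i in range(5):
    if not monitor.record("delete"):
        print(f"Agent halted after {i + 1} deletions within a minute")
        break
```

Even this toy version shows the cost trade-off the report mentions: every agent action now passes through an extra check, and the thresholds themselves become something to tune and maintain.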
Systemic risks: mitigations that depend on more than technology
Labour market impacts
Shifting focus from technical to socioeconomic risks, the next section of the report looks at labour market impacts.
This topic is very important today and attracts many viewpoints. Some are optimistic, believing AI tools and components can increase productivity and reduce human error. Others are pessimistic, pointing out that verifying AI-prepared results can take more time and effort than the classical human-driven process, and arguing that AI is driving layoffs and could usher in the next recessionary supercycle.
The obvious mitigation lies in time and gradual AI adoption, which may also help workers and policymakers prepare for and respond to labour market impacts. The problem, as noted earlier, is that the adoption process is rarely smooth, especially from the workforce-reduction perspective.
The honest truth for enterprise leaders is that the results depend a lot on how AI adoption is managed. Simply expecting "AI will handle the transition" is not a workforce strategy.
Human autonomy
The risks to human autonomy described in the report, erosion of critical thinking, oversupport from AI assistants, and emotional dependence on chatbots, can sound abstract compared to the financial and technical security risks above. They shouldn't.
Source: International AI safety report 2026
Mitigation suggestions here include increasing human accountability for decisions, designing AI systems that require users to adapt to different tasks and thus remain cognitively engaged, and teaching AI literacy.
We find the second idea especially interesting, even beyond the autonomy and literacy risks: keeping users cognitively engaged also cuts into GenAI-based automation efficiency, since it adds steps where people are deliberately kept involved.
Challenges, risk management, and technical safeguards
The report highlights challenges such as gaps in scientific understanding, information asymmetries (AI developers often do not disclose information about training data), market failures (competition intensifies speed-versus-safety trade-offs), and institutional design and coordination problems (AI development outpaces traditional governance cycles). These challenges have only a minor direct influence on agentic AI in commercial and B2C segments, but they matter for understanding the overall landscape; from this perspective, the choice of AI provider makes little difference, and there is no efficient mitigation at the retail level.
Industry risk frameworks overview
The risk management section of the second International AI Safety Report lists the risk frameworks published by the major AI (mostly LLM) development companies, which are worth summarising:
OpenAI
Preparedness Framework 2
Covered risks:
- Biological and chemical capabilities
- Cybersecurity capabilities
- AI self-improvement capabilities
Risk tiers or equivalent and associated safeguards
High: Could increase existing risks of severe harm and require security controls and safeguards.
Critical: Could create new, unprecedented risks of severe harm. Development must stop until safeguards and security controls meet the critical standard.
Anthropic
Responsible Scaling Policy 2.2
Covered risks:
- CBRN weapons
- Autonomous AI research and development (AI R&D)
- Cyber operations (under assessment)
Risk tiers or equivalent and associated safeguards
AI safety levels (ASL):
- ASL-1: No significant catastrophic risk
- ASL-2: Early signs of dangerous capabilities (models must meet the ASL-2 deployment and security standards)
- ASL-3: Substantially increased catastrophic misuse risk (models must meet the ASL-3 deployment and/or security standards)
- ASL-4+: Future classifications (not yet defined)
Google DeepMind
Frontier Safety Framework 3.0
Covered risks:
- Misuse
- CBRN
- Cyber
- Harmful manipulation
- Machine learning R&D
- Misalignment/instrumental reasoning
Risk tiers or equivalent and associated safeguards
Critical capability levels (CCLs): capability levels at which, absent mitigation measures (safety cases for deployments and security mitigations aligned with RAND security levels 2, 3, or 4), AI models or systems may pose a heightened risk of severe harm. The framework includes 'early warning evaluations' with specific 'alert thresholds'.
Meta
Frontier AI Framework 1.1
Covered risks:
- Cybersecurity
- Chemical and biological risks
Risk tiers or equivalent and associated safeguards
Risk threshold levels:
- Moderate (release with appropriate security measures and mitigations)
- High (do not release)
- Critical (stop development)
Amazon
Frontier Model Safety Framework
Covered risks:
- CBRN weapons proliferation
- Offensive cyber operations
- Automated AI R&D
Risk tiers or equivalent and associated safeguards
Critical capability thresholds
Model capabilities that have the potential to cause significant harm to the public if misused. (If the thresholds are met or exceeded, the model will not be publicly deployed without appropriate risk mitigation measures.)
Microsoft
Frontier Governance Framework
Covered risks:
- CBRN weapons
- Offensive cyber operations
- Advanced autonomy (including AI R&D)
Risk tiers or equivalent and associated safeguards
Risk levels:
- Low or Medium (Deployment allowed in line with responsible AI program requirements)
- High or Critical (Further review and mitigations required)
NVIDIA
Frontier AI Risk Assessment
Covered risks:
- Cyber offence
- CBRN
- Persuasion and manipulation
- Unlawful discrimination at scale
Risk tiers or equivalent and associated safeguards
Risk thresholds – model risk (MR) scores:
- MR1 or MR2 (Evaluation results are documented by engineering teams)
- MR3 (Risk mitigation measures and evaluation results are documented by engineering teams and periodically reviewed)
- MR4 (A detailed risk assessment should be completed, and business unit leader approval is required)
- MR5 (A detailed risk assessment should be completed and approved by an independent committee, e.g., NVIDIA’s AI ethics committee)
Cohere
Secure AI Frontier Model Framework
Covered risks:
- Malicious use (e.g. malware, child sexual exploitation)
- Harm in ordinary, non-malicious use, e.g. outputs that result in an illegal discriminatory outcome or insecure code generation
Risk tiers or equivalent and associated safeguards
Likelihood and severity of harm in context
- Low
- Medium
- High
- Very High
(Risk mitigations and security controls are in place for all AI systems and processes; additional mitigations need to be adapted to the AI system and use case in which an AI model is deployed)
xAI
Risk Management Framework
Covered risks:
- Malicious use (including CBRN and cyber weapons)
- Loss of control
Risk tiers or equivalent and associated safeguards
Thresholds
Thresholds are set based on scores on public benchmarks for dangerous capabilities (Heightened safeguards are applied for high-risk scenarios such as large-scale violence or terrorism).
Training-based safeguards
The report lists possible training techniques alongside monitoring approaches (classical guardrails such as user interaction monitoring and human-in-the-loop oversight). The training techniques include data curation, RLHF, pluralistic alignment, adversarial training, unlearning, and interpretability:
| Approach | Description |
|---|---|
| Data curation | This involves removing harmful data to prevent an AI model from learning dangerous behaviours. These methods are helpful, especially for creating open-weight AI models that avoid harmful traits and resist harmful fine-tuning. Still, challenges remain with errors in curation and scaling. |
| Reinforcement learning from human feedback | This trains the model to meet specific goals like being helpful and harmless. It's an effective way to encourage beneficial behaviours. However, focusing too much on human approval can cause models to act deceptively or overly flattering. |
| Pluralistic alignment techniques | This approach trains the model to consider different viewpoints on how it should behave. It helps reduce bias toward any single perspective. Still, human disagreement is unavoidable, and finding widely accepted ways to balance competing views is difficult. |
| Adversarial training | This trains the model to avoid causing harm, even in new situations, and to resist attacks from malicious users like 'jailbreaks'. It's an effective way to prevent misuse, though challenges with robustness still remain. |
| Machine 'unlearning' | This involves training a model with specialised algorithms designed to actively suppress harmful abilities, like knowledge of biohazards. These techniques provide a focused way to remove harmful traits, but current unlearning methods can be unreliable and may unintentionally affect other abilities. |
| Interpretability and safety verification tools | This includes various design and verification methods aimed at providing stronger assurance that AI models meet safety standards. They help evaluators feel more confident about AI security, but current methods depend on assumptions and often don't perform well in practice. |
Given the goal of this article, let's focus mainly on the weaknesses of the proposed approaches, while keeping in mind that they are relevant and may help.
Data curation came up several times earlier, especially regarding generated content and criminal activity, and it has some weaknesses. At scale, it's hard to draw clear boundaries between personal opinion and propaganda or manipulation, and hard to filter out every harmful candidate sample.
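To make the boundary problem tangible, here is a minimal curation-filter sketch, ours, with a deliberately naive toxicity scorer standing in for a trained classifier: wherever the threshold is set, the filter either leaks harmful samples or discards legitimate opinion.

```python
def toxicity_score(text: str) -> float:
    """Toy scorer in [0, 1]; a real pipeline would use a trained classifier,
    and that classifier's errors are exactly where curation breaks down."""
    harmful_markers = ["eradicate", "inferior", "deserve to suffer"]
    hits = sum(marker in text.lower() for marker in harmful_markers)
    return min(1.0, hits / 2)

def curate(samples: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only the training samples scored below the toxicity threshold."""
    return [s for s in samples if toxicity_score(s) < threshold]

corpus = [
    "I strongly disagree with this policy.",     # opinion: kept
    "They are inferior and deserve to suffer.",  # harmful: dropped
    "Critics say the group deserves scrutiny.",  # borderline: this toy scorer keeps it
]

print(curate(corpus))
```

Scaling this to trillions of tokens multiplies both kinds of error, which is the challenge the report flags.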
Regarding the well-known RLHF method, it has clear limitations and inefficiencies, such as poor sample efficiency: it takes a lot of feedback data to make a meaningful change in the model's weights. There are also challenges with human evaluation, like distinguishing between what 'looks safe' and what actually 'is safe', as well as with the quality of the corresponding reward models, including process reward models (PRMs).
Other training-related limitations affect adversarial approaches and alignment methods, which struggle with ground-truth variety coverage and smooth decision boundaries. As for 'unlearning', harmful capabilities are distributed inside the model in too nonlinear a way: even after locating the right 'harmful' blob (in attentions and activations), it is difficult to estimate how much useful capability has gone and how much harmful capability remains.
Monitoring tools
| Approach | Description |
|---|---|
| Hardware-based monitoring mechanisms | Verifying that authorised processes are running on hardware helps study security threats and ensure regulatory compliance. These mechanisms provide unique ways to track what computations run on hardware and who runs them. However, they cannot detect all types of threats, and some require specialised hardware. |
| User interaction monitors | Monitoring user interactions for signs of malicious use can help developers terminate service for malicious users. However, enforcement can inadvertently hinder beneficial research on safety, and some forms of misuse are difficult to detect. |
| Content filters | Filtering harmful inputs and outputs is an effective way to reduce accidental harm and misuse. However, filters need extra computing power and can be vulnerable to certain attacks. |
| Model internal computation monitors | Checking AI models for signs of deception or harmful thinking can help detect problems. But current methods are not very reliable or robust. |
| Chain-of-thought monitors | Monitoring model chain-of-thought text for signs of misleading behaviour or other harmful reasoning is an effective AI security method to understand and spot flaws in how models reason. However, they can be unreliable, and if AI models are trained to produce a benign chain of thought, they can learn misleading behaviour. |
| Human in the loop | Human oversight and the ability to override AI system decisions are essential in safety-critical areas. But these methods have limits, like automation bias and the slower speed of human decisions. |
| Sandboxing | Stopping an AI agent from directly affecting the world is a good way to limit harm. But sandboxing also restricts what the system can do directly. |
Monitoring or input-output verification modules can be designed in various ways and offer different levels of performance. Even if they block 90-95% of possible risks or attacks, that is often enough for most use cases, especially when combined with other mitigations mentioned earlier, such as separate execution environments (outside the agentic/LLM functions, communication, and tool calling) and deterministic restrictions on SQL calls. However, new vulnerabilities and security risks will always emerge, because LLMs keep improving and competing (the speed-versus-safety trade-off mentioned above).
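The deterministic SQL restriction deserves a concrete sketch, ours, with an assumed table allowlist: unlike a probabilistic guardrail, a hard validator either passes a query or rejects it, with no model in the loop.

```python
import re

ALLOWED_TABLES = {"orders", "products"}  # assumed read-only surface

def validate_sql(query: str) -> None:
    """Deterministic gate for agent-issued SQL: read-only, allowlisted tables."""
    q = query.strip().rstrip(";")
    if not q.lower().startswith("select"):
        raise PermissionError("only SELECT statements are allowed")
    if ";" in q:
        raise PermissionError("multiple statements are not allowed")
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", q, re.I)
    illegal = {t.lower() for t in tables} - ALLOWED_TABLES
    if illegal:
        raise PermissionError(f"tables not on the allowlist: {illegal}")

validate_sql(
    "SELECT o.id, o.total FROM orders o JOIN products p ON o.pid = p.id"
)  # passes
try:
    validate_sql("DROP TABLE orders")
except PermissionError as e:
    print(e)  # only SELECT statements are allowed
```

A parser-based validator would be more robust than regexes, but the principle stands: the fewer decisions left to the model, the less there is to monitor probabilistically.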
Consider, for example, a recent evaluation of various LLM defence techniques: most safeguards show a success rate below 5-10%, attacks succeed over 90% of the time, and human-based red teams achieve a 100% success rate.
Conclusions
Mitigating risks around AI/GenAI remains a moving target: although many risks are known, they continue to evolve as capabilities advance and attacks (and the nature of the risks themselves) become more sophisticated, including zero-day failures. These challenges are compounded by human factors, uneven organisational readiness, closed research practices, and limited access to training data. While AI is progressing rapidly in reasoning, mathematics, science, and coding, performance is still uneven.
Ongoing weaknesses persist in multi-step tasks, reasoning, control, and unfamiliar contexts, compounded by inefficiencies in AI readiness, HIL processes, and misuse handling. AI adoption is already reshaping labour markets, but there is still a substantial risk of layoffs that cannot be handled with AI literacy alone. Meanwhile, in high-stakes areas such as finance and healthcare, unreliable outputs can amplify risks as AI autonomy grows. HIL oversight is recommended, but it is challenged by limitations in model evaluation and a widening gap between testing and real-world behaviour. At the same time, rising scaling costs and FinOps pressures may outpace safeguards, and overexpectations can tilt the balance between speed and security (the market-failure trade-offs mentioned above).
Overall, after reviewing the analysed aspects based on this advanced report, it is fair to say that we are only halfway towards continuous, global, adaptive, and socio-technical improvements in AI security. So, please do not claim that your AI solution is 100% safe 🙂
AI security FAQs
What are AI security best practices?
Key AI security best practices include enforcing least-privilege access controls, adopting a zero-trust approach that continuously verifies all interactions, monitoring APIs for unusual usage, and implementing guardrails that filter and validate AI inputs and outputs.
What is AI security?
AI security refers to the practices, technologies, and policies used to protect artificial intelligence systems from threats and misuse. It focuses on securing the data, models, algorithms, and infrastructure that AI solutions rely on, ensuring they operate reliably and cannot be manipulated, stolen, or compromised. Following AI security best practices is critical for any organisation that relies on AI tools.
Can generative AI be used in cybersecurity?
Yes, but it is just one part of the wider AI toolkit used in cybersecurity. Many security tasks, like spotting phishing attempts, finding unusual network activity, and automating threat response, are usually handled with methods like classical ML. Generative AI can support these methods by assisting with certain tasks.
How has generative AI changed the security landscape?
GenAI has been a double-edged sword for AI solutions security: it empowers defenders with faster threat detection and response, but it also introduces new security risks such as adversarial prompt injection, data poisoning, and the potential for AI agent systems to be exploited against critical infrastructure.
How do threat actors attack AI services?
Threat actors are increasingly targeting AI services by manipulating inputs, poisoning training data, or exploiting weaknesses in models to steal sensitive information. These attacks can disrupt automated decisions and damage user trust. Organisations should expect that any exposed AI endpoint will be tested and prepare their defences accordingly.
How can organisations prevent AI-related data breaches?
To prevent data breaches, organisations need strict access controls, encryption of sensitive data both when stored and during transfer, and ongoing monitoring of data moving between AI components. They should also create data protection policies designed specifically for AI pipelines, since the usual safeguards might miss risks from model training and inference. Regular audits can spot weaknesses before attackers do.
What role does machine learning play in security?
On the defensive side, machine learning models can detect unusual patterns and flag potential threats in real time. However, artificial intelligence techniques can also be exploited by adversaries to craft more sophisticated attacks, which is why securing these components is just as important as deploying them.
What is an AI security posture?
An AI security posture refers to an organisation's overall readiness to identify, prevent, and respond to threats targeting its AI systems. Maintaining a strong posture involves regular risk assessments, up-to-date threat detection capabilities, and clearly defined incident response plans. As emerging threats continue to evolve alongside AI capabilities, this requires continuous adaptation rather than one-time compliance.
How should access controls be set up for AI environments?
Access controls for AI environments should follow the principle of least privilege, ensuring that only authorised users and systems can interact with sensitive data and model endpoints. Role-based permissions should be enforced across all stages, from training data ingestion to production deployment. Combining these measures with data protection frameworks helps prevent unauthorised access to both AI services and the sensitive data they process.
Can AI improve incident response?
Yes. Modern AI-powered systems improve threat detection and offer automated responses, helping organisations contain incidents in seconds instead of hours. Machine learning can analyse huge amounts of data faster than humans, but it still needs human oversight to handle false alarms and adjust to new attack methods.