
The Death of the CTF: How Agentic AI Is Reshaping Competitive Hacking

Jacob Krell | March 03, 2026 | 39 min read

    Abstract:

    Agentic AI systems are compressing competitive hacking timelines faster than the cybersecurity community has acknowledged. This paper analyzes first blood data from 423 Hack The Box machines released between March 2017 and October 2025, finding that root blood times have declined approximately 16% per year in log-space (p < 1e-10), with the sharpest drops concentrated after the emergence of large language models and agentic exploitation frameworks. All four difficulty tiers show statistically significant compression in the Post-LLM era (p < 0.05), with magnitude scaling from 27% at Hard to 67% at Insane.

    The implications extend well beyond competitive scoreboards as AI-driven acceleration reshapes penetration testing economics, lowers the barrier to offensive capability, and positions CTF platforms as de facto benchmarks for national AI cyber capabilities. Drawing on the historical transition to engine-assisted play in chess, the paper argues for instrumentation of AI usage, the creation of separate competition tracks, and standardized benchmarking of AI-augmented offensive capability. It also outlines the redesign of security training and certification pipelines to reflect compressed skill acquisition timelines.

    1. Introduction

    Capture The Flag competitions occupy a unique position in cybersecurity. From DEF CON's storied finals to Hack The Box's weekly machine releases, CTFs serve simultaneously as training grounds, hiring filters, and competitive arenas. At their core, they test a specific set of cognitive abilities: the capacity to parse unfamiliar systems, synthesize information from disparate sources, recall relevant techniques from a vast body of knowledge, and chain observations into a working exploit. For over two decades, performance in this environment has been primarily a function of unaided cognition. The scoreboard measured who among a field of human competitors was the most elite operator.

    Agentic AI systems possess structural advantages in each of these capacities. They parse output without misreading. They have access to the entirety of publicly documented vulnerability research, tooling documentation, and exploit techniques, not as something to be recalled under pressure, but as something perpetually available. They synthesize across information sources without cognitive fatigue, context-switching cost, or the bandwidth limitations of human working memory. This paper argues that these structural advantages are sufficient to fundamentally reshape the nature of CTF competition, shifting the axis of differentiation from who is the best hacker to who designs the best agentic AI system.

    This shift does not eliminate the human element. It redefines it. The competitors who will succeed in an AI-augmented CTF landscape are not those who resist the technology but those who learn to leverage it, directing AI agents toward the tasks where they hold categorical advantages while focusing human effort on the areas where intuition, creativity, and adversarial reasoning remain superior. The hacker in the loop remains the ultimate differentiator, but the nature of the loop changes. The skill being tested is no longer purely operational. It becomes architectural, and the question is: how effectively can a competitor design, configure, and orchestrate an AI system to do the work that was once done by hand?

    The implications extend beyond individual competitors. As this transition accelerates, CTFs themselves will shift in character. Challenges that once tested a human operator's ability to enumerate, exploit, and pivot will increasingly function as benchmarks for AI agent systems. The competition between hackers becomes, in significant part, a competition between the agentic architectures and tools they bring to the table. CTF platforms will serve as proving grounds not just for human skill but for AI capability, and the leaderboard will reflect engineering and development decisions as much as operational technique.

    We examine this thesis through multiple lenses. We analyze first blood time data across 423 competitively released Hack The Box machines spanning March 2017 through October 2025, stratified by difficulty and operating system, and find that both user and root first blood times are declining at approximately 16% per year on a multiplicative basis (log-linear model, p < 1e-10), with a statistically significant step-down in the Post-LLM era across all four difficulty tiers. We situate this analysis within a rapidly growing body of research in which agentic systems have won dedicated CTF competitions, autonomously exploited real-world vulnerabilities, and achieved meaningful solve rates on platforms like Hack The Box without human intervention. We conclude with a set of recommendations for CTF platform operators, training organizations, certification bodies, and policymakers, drawing on lessons from chess's two-decade experience with the same problem and anchored by a proposal for voluntary AI usage instrumentation designed to establish ground truth while the distinction between human and AI-augmented performance is still observable.

    2. Author Background

    The author holds the OSCE3 certification and works professionally in offensive security and AI system development. He has extensive practical experience with the Hack The Box platform, including participation across all difficulty tiers. This dual perspective informs the framing of the problem examined in this paper; however, all empirical claims are derived from the longitudinal dataset and statistical analysis presented in the following sections.

    3. Related Work: AI in Offensive Security

    The research trajectory over the past three years documents a rapid escalation in AI offensive capability, from early demonstrations that LLMs could assist with security tasks, to autonomous systems that competed with and outperformed human players, to production-grade offensive tooling built on the same foundations. While these studies primarily reported solve rates and task completion rather than time-to-compromise, they establish the capability context in which the longitudinal trends presented in this paper should be interpreted.

    The first wave established that large language models were not just passively useful for security work but actively underutilized. Palisade Research demonstrated that a simple ReAct-based agent architecture with plan-and-solve prompting achieved 95% on the InterCode-CTF benchmark, up from a prior state of the art of 72%, fully solving the General Skills, Binary Exploitation, and Web Exploitation categories [2]. Their central finding was that LLM capabilities in this domain were underelicited, not fundamentally limited. Stanford's CyBench benchmark confirmed this from a different angle, as strong models solved professional-level CTF tasks at speeds comparable to 11-minute human solves [7]. The AIRTBench autonomous red teaming benchmark reported Claude 3.7 Sonnet achieving a 61% solve rate on black-box challenges, with models solving tasks in minutes where humans required hours [12].

    The second wave moved from benchmarks to real vulnerabilities. The Fang et al. research series established that LLM agents can autonomously exploit real-world CVEs. Given vulnerability descriptions, GPT-4 successfully exploited 87% of 15 one-day vulnerabilities, while in their evaluation, every other tested system, including open-source models, ZAP, and Metasploit, scored zero [14]. A follow-up demonstrated that teams of LLM agents using hierarchical planning outperformed single agents by up to 4.3x on 14 real zero-day vulnerabilities [15]. Google's Project Zero and DeepMind collaboration produced Big Sleep, an LLM-powered system that discovered the first publicly documented AI-found exploitable bug in real-world software, a stack buffer underflow in SQLite, a vulnerability in production code that had evaded human auditors and traditional fuzzing tools [27].

    The third wave saw benchmarks and exploits converge into competition. In 2025, the Cybersecurity AI agent won the Neurogrid CTF, capturing 41 of 45 flags in a competition with a $50,000 prize pool; reached Rank 1 early at the Dragos OT CTF before finishing 6th after being paused; and was the top-performing AI team in Hack The Box's "AI vs Humans" competition [1]. The authors concluded that Jeopardy-style CTFs are effectively solved by well-engineered AI agents. Meanwhile, the translation into practical offensive tooling accelerated, as PentestGPT demonstrated a 228.6% improvement in task completion over GPT-3.5 baselines on Hack The Box machines (USENIX Security 2024, Distinguished Artifact) [16]. D-CIPHER achieved a 44% solve rate on Hack The Box with 65% more MITRE ATT&CK coverage than prior approaches [17]. xOffense, built on a fine-tuned Qwen3-32B model, reached 79.17% subtask completion [18]. And DARPA's AI Cyber Challenge yielded four open-source Cyber Reasoning Systems across the competition [29], with the fourth-place team alone autonomously discovering 28 real-world vulnerabilities, including six zero-days [34].

    Taken together, these results show a rapid progression from LLM-assisted workflows to autonomous competitive performance. In under three years, the field moved from "LLMs can help with CTF challenges" to "agentic systems can win CTF competitions outright." These results establish that AI systems now operate at timescales comparable to or faster than expert human solves, providing a plausible mechanism for the longitudinal compression in first-blood times measured in this study.

    4. Data and Methodology

    The dataset consists of 423 competitively released Hack The Box machines spanning March 2017 through October 2025, covering all four difficulty tiers: Easy (124), Medium (143), Hard (98), and Insane (58). The operating system split is predominantly Linux (288) and Windows (121). For each machine, two first-blood metrics were collected from timestamps scraped from 0xdf's publicly archived writeups [42]: user blood (time from machine release to the first recorded foothold) and root blood (time from release to full system compromise), both measured in minutes. Machines with missing or inconsistent timestamps were excluded. The gap between the two metrics is a rough proxy for privilege escalation time, with caveats discussed in Section 5.

    To structure the analysis around AI capability, two eras were defined. The Pre-LLM era covers everything before ChatGPT's public launch in November 2022, giving us 286 machines representing the pre-LLM baseline. The Post-LLM era covers November 2022 onward, encompassing both the initial period of LLM-assisted work and the subsequent emergence of dedicated agentic exploitation frameworks, covering 137 machines. An earlier version of this analysis used a three-era split (Pre-LLM, LLM, and Agentic), but the Agentic era's small sample sizes at higher difficulty tiers (Hard n=17, Insane n=5) produced era-level comparisons that lacked statistical power. Consolidating into two eras yields sample sizes sufficient for significance testing across all four difficulty tiers. For visual analysis, finer milestone markers for individual model releases are overlaid on the time-series charts.

    Linear regression on log-transformed blood times establishes the overall trend and yields a directly interpretable metric: percentage change in solve time per year. Era medians quantify the magnitude of change between periods, and non-parametric significance tests (Mann-Whitney U, one-sided, reflecting the a priori hypothesis of decreasing solve times) determine whether observed differences are statistically significant. Every analysis is run independently on both user and root blood, and stratified by operating system. Two limitations deserve emphasis. First, given the number of stratified comparisons, p-values are interpreted conservatively. Second, and more fundamentally, first-blood times identify the earliest successful solve but do not reveal the methods used: the data establishes temporal correlation with AI capability milestones, not direct causation. That second limitation is what motivates the "Solved with AI" proposal in Section 8.

    5. Results

    Across all 423 machines, both user and root first blood times (in minutes) show a statistically significant downward trend over the eight-and-a-half-year period. Root blood times are declining at 16.5% per year on a multiplicative basis (linear regression of log-transformed solve time against release date, R² = 0.12, p = 1.7e-13). User blood times decline at 16.0% per year under the same log-linear model (R² = 0.09, p = 2.7e-10). Every difficulty tier shows a negative longitudinal slope, indicating that the downward trend is not driven by a single category.
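    Concretely, the log-linear model behind these rates can be written as follows, where T_i is a machine's blood time in minutes, t_i its release date in years, and the symbol names are ours rather than the paper's:

```latex
\log T_i = \beta_0 + \beta_1 t_i + \varepsilon_i,
\qquad
\text{annual change} = 100\left(e^{\beta_1} - 1\right)\%
```

    The reported 16.5% annual decline in root blood times thus corresponds to a fitted slope of roughly \(\beta_1 = \ln(0.835) \approx -0.18\) per year.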

     

    The scatter plot tells the story visually. Each dot is a machine, colored by difficulty tier, with trendlines fit per tier. The vertical dashed lines mark major LLM and agent capability milestones. The downward slope is visible across all four tiers, with a visibly steeper trend in the Post-LLM period.

    Era Comparison

    The era-level breakdown is where the compression becomes concrete.

    Median root blood times by era:

    Tier     Pre-LLM (n)        Post-LLM (n)       Change   p-value
    Easy     54.8 min (81)      28.9 min (43)      -47%     0.0001
    Medium   106.0 min (96)     72.5 min (47)      -32%     0.0011
    Hard     261.1 min (66)     191.1 min (32)     -27%     0.0279
    Insane   926.6 min (43)     302.5 min (15)     -67%     0.0350

    Median user blood times by era:

    Tier     Pre-LLM (n)        Post-LLM (n)       Change   p-value
    Easy     21.5 min (81)      12.7 min (43)      -41%     0.0018
    Medium   56.0 min (96)      40.3 min (47)      -28%     0.0050
    Hard     107.1 min (66)     110.9 min (32)     +4%      0.19 (n.s.)
    Insane   271.5 min (43)     211.9 min (15)     -22%     0.22 (n.s.)

     

    For root blood, the Post-LLM era is faster across all four tiers, with compression scaling by difficulty. Easy machines dropped 47%. Medium dropped 32%. Hard dropped 27%. Insane machines dropped 67%, from a median of over 15 hours to approximately 5 hours. All four root blood tiers reach statistical significance at p < 0.05 for the Pre-LLM vs Post-LLM comparison (Mann-Whitney U, one-sided), with Easy and Medium reaching p < 0.002. User-blood compression is significant at Easy (-41%, p = 0.002) and Medium (-28%, p = 0.005) but does not reach significance at Hard or Insane.

    User Blood vs. Root Blood

    The dual blood metric reveals something the aggregate numbers miss. User blood captures the foothold phase: reconnaissance, vulnerability identification, and initial exploitation. Root blood captures the full kill chain, including privilege escalation. Comparing the two across eras isolates how each phase is compressing independently.

    An important methodological note is that on Hack The Box, user blood and root blood are separate races. Different competitors frequently win each. One player might find an initial foothold fastest while a different player achieves full compromise first via a different approach. This means naively subtracting one from the other does not measure a single competitor's privilege escalation time. To isolate privilege escalation behavior, per-machine privesc time was computed as root blood minus user blood. For this analysis only, 37 machines where this value was zero or negative were excluded; these machines remain in all other analyses. A zero-minute cutoff was used to remove race artifacts while retaining conventional two-stage solves. Negative privesc indicates that root was blooded before user, a race artifact in which a different competitor bypassed the foothold step entirely or took a single exploit chain directly to root.
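    This filtering step can be sketched in a few lines; the machine tuples below are illustrative, not drawn from the dataset:

```python
import statistics

# Illustrative per-machine blood times in minutes: (user_blood, root_blood).
machines = [
    (21.5, 54.8),    # conventional two-stage solve
    (56.0, 106.0),   # conventional two-stage solve
    (40.0, 40.0),    # identical timestamps: race artifact, excluded
    (60.0, 45.0),    # root blooded before user: race artifact, excluded
    (12.7, 28.9),    # conventional two-stage solve
]

# Implied per-machine privilege escalation time: root blood minus user blood.
privesc = [root - user for user, root in machines]

# Exclude zero or negative values (race artifacts) for this analysis only;
# those machines remain in all other analyses.
filtered = [dt for dt in privesc if dt > 0]
median_privesc = statistics.median(filtered)
```

    The excluded machines are not errors in the data; they simply cannot contribute a meaningful single-competitor privilege escalation duration.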

    After filtering, the privilege escalation compression becomes clean and consistent across all four tiers.

    Median implied privilege escalation time by era (machines with privesc > 0 min):

    Tier     Pre-LLM (n)        Post-LLM (n)       Change
    Easy     15.2 min (71)      11.1 min (43)      -27%
    Medium   34.0 min (87)      22.2 min (47)      -35%
    Hard     105.1 min (60)     61.5 min (32)      -41%
    Insane   175.9 min (35)     108.0 min (11)     -39%

    The privilege escalation phase is compressing faster than the foothold phase. After excluding machines where privesc time was zero or negative (race artifacts where root was blooded before user), median privesc times dropped 27% at Easy, 35% at Medium, 41% at Hard, and 39% at Insane from the Pre-LLM to the Post-LLM era. Meanwhile, user blood (the foothold phase) only compresses significantly at Easy (-41%, p=0.002) and Medium (-28%, p=0.005). At Hard and Insane difficulty, foothold times show no statistically significant change.

    Operating System Breakdown

    Stratifying by operating system reveals that the compression is not platform-uniform. Windows machines show steeper declines than their Linux counterparts at nearly every difficulty tier. Pre-LLM to Post-LLM, Windows Medium machines dropped from a median of 118.6 minutes to 45.9 minutes, a 61% reduction that is statistically significant (p=0.0003). Windows Easy dropped 22% (p=0.037). Linux machines show consistent but smaller compression, with Linux Easy dropping 36% (p=0.003), Linux Medium 12%, and Linux Hard 11%. Linux Insane dropped 77% (p=0.018), though with small Post-LLM samples.

    One plausible explanation is that Windows and Active Directory environments contain more extensively documented, repeatable attack patterns such as Kerberoasting, token impersonation, and service misconfigurations. Linux privilege escalation tends to involve more heterogeneous, system-specific configurations. The possible relationship between attack-surface structure and amenability to AI is examined in Section 6.

    The data tells a consistent story across every analytical lens applied to it. Solve times are compressing. The compression scales with difficulty. It affects both the foothold and privilege escalation phases, with privesc compressing faster. It is more pronounced on Windows than Linux. And the step-change at the Pre-LLM to Post-LLM boundary is statistically significant across all four difficulty tiers for root blood times.

    6. Discussion

    The longitudinal trends in Section 5 establish that time-to-compromise is decreasing across all difficulty tiers. The remaining question is which mechanisms are capable of producing a reduction of this magnitude and structure over the observed period. Each candidate explanation implies a different phase structure for time-to-compromise.

    Several alternative explanations must be considered.

    Community growth increases the number of attempts per machine at release and therefore the probability of an early solve. This mechanism predicts a roughly uniform acceleration across difficulty tiers, because the number of competitors affects all machines equally. Instead, the data shows compression that scales with difficulty, with the largest proportional reduction at the Insane tier. A participation-driven model does not naturally produce this pattern.

    The expansion of publicly available writeups improves pattern recognition and reduces time to initial foothold when new machines resemble previously documented vulnerabilities. This mechanism predicts the strongest effect in the foothold phase. The observed data shows the opposite structure, as privilege escalation compresses more rapidly than foothold acquisition. A writeup-driven explanation is therefore incomplete.

    Improvements in non-AI tooling contribute to the gradual downward trend visible throughout the Pre-LLM period. Enumeration frameworks and automated analysis tools reduce the time required to collect and interpret system state. However, the largest inflection in the time-series data occurs after the public release of high-capability language models rather than at the introduction of specific enumeration or post-exploitation tools. Tooling alone explains the baseline trend but not the post-2022 acceleration.

    These factors are also not independent of AI capability. The rate at which writeups are produced, documentation is generated, and tools are developed has itself increased through AI-assisted workflows. Treating these as separate variables understates the total effect of AI on the ecosystem in which CTF performance occurs.

    Taken together, the alternative mechanisms account for portions of the observed trend but do not reproduce its full structure. The participation hypothesis does not explain scaling by difficulty. The writeup hypothesis predicts faster foothold compression than privilege escalation. The tooling hypothesis explains the long-term baseline but not the post-LLM inflection. The remaining explanation that is consistent with the timing, magnitude, and phase specific structure of the data is the introduction of high-capability AI systems, amplified by secondary effects that are themselves partially AI-enabled.

    This does not establish causation at the level of individual solves. Direct measurement of AI-assisted performance would require platform telemetry that is not currently available. What the analysis shows is that the observed compression follows the pattern expected from AI-accelerated workflows and is not fully explained by previously identified factors.

    7. Implications

    The data presented in this paper describes what is happening on a CTF scoreboard. But CTFs do not exist in a vacuum. They are simplified models of real offensive operations, and the forces reshaping competition on Hack The Box are the same forces reshaping enterprise security, the security workforce, and ultimately the global landscape of cyber capability.

    The following implications are interpretive and extend beyond the scope of the measured dataset. A sustained 16% annual reduction in time-to-compromise implies a roughly sixfold decrease within a decade, fundamentally altering any security model that assumes human-scale attacker timelines.
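    The sixfold figure is simple compounding; a quick check, assuming the 16% annual rate holds:

```python
# Compounding a sustained 16% annual multiplicative decline over ten years.
annual_retention = 1 - 0.16             # each year, solve times shrink to 84%
decade_factor = annual_retention ** 10  # fraction of baseline after a decade
fold_reduction = 1 / decade_factor      # how many times faster solves become
```

    Ten years of 16% annual compression leaves solve times at roughly 17.5% of the baseline, about a 5.7x (call it sixfold) reduction.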

    For Competitors

    From inside the competition, this is not surprising. At the highest levels of competition, first-blood performance increasingly requires AI-assisted workflows. This is the natural progression of a dynamic that has always existed at the top level: the real game was never about manually running commands. It was always about building optimized automation scripts for enumeration and exploitation. The best competitors have always been the ones who automated the repetitive phases and focused their human attention on the creative gaps. AI agents are the next iteration of that same competitive logic, dramatically more powerful and dramatically more accessible.

    Consider the privilege escalation data. A 40% compression in privesc times at Hard and Insane difficulty does not mean the competitors got substantially better at privilege escalation in three years. It means someone figured out how to hand the post-foothold enumeration to an agent that does not miss output, does not forget to check a file it found twenty minutes ago, and does not lose focus at 3 AM during a weekend release. The human contribution shifts from executing the methodology to designing it.

    For the Security Industry

    The compression visible in CTF solve times is a leading indicator of what is coming for enterprise security. AI agents already excel at the enumeration phase of real engagements, including massive parallelization of low-impact commands, systematic service discovery, automated credential checking, and configuration analysis. The literature reviewed in Section 3 confirms that agentic systems can find real vulnerabilities in real environments today. They are still poor at operational security, at the subtle tradecraft decisions that determine whether an attacker is detected, but those limitations are narrowing with every model generation.

    The practical consequence is that the defender's one structural advantage, time, is eroding. Security has always been asymmetric: defenders must be right everywhere, while attackers only need to find one gap. But defenders have historically benefited from the fact that attackers are slow. Human operators conduct reconnaissance, pivot through a network, and escalate privileges, all of which takes hours or days. That timeline is the window in which detection, response, and containment happen. When an agentic system collapses the kill chain from days to hours to minutes, the window for human defenders to detect and respond shrinks proportionally. Dwell time assumptions built into detection architectures need to be revisited.

    These trends suggest that the periodic penetration testing model will deliver decreasing marginal security value. As real attacker timelines compress, point-in-time assessments offer diminishing insight into actual defensive readiness, and the gap between what a periodic assessment measures and what an organization actually faces widens. The logical industry response is a shift toward continuous threat hunting and assume-breach postures rather than periodic exploitation engagements. And any vulnerability that is openly exploitable should increasingly be assumed exploitable nearly instantly. The implicit assurance that a network "survived a skilled human spending X hours" means less every year when the threat model includes agents that do not sleep, do not context-switch, and do not stop enumerating.

    For the Workforce

    The democratization of offensive capability is perhaps the most consequential real-world implication of AI-accelerated CTFs, and it cuts both ways. Historically, becoming genuinely dangerous as an offensive operator required years of study, practice, and failure. The learning curve was steep. That steep curve served as a natural barrier that kept the population of highly capable attackers relatively small. AI is flattening that curve rapidly.

    The capability gap between a junior operator with a well-designed agent pipeline and a senior operator working manually is collapsing. The tooling is compounding: an AI agent integrated with a Ghidra MCP server for binary analysis is a categorically different capability than a standalone model answering questions about disassembly. Each new tool integration multiplies the effective skill of the human operator, regardless of their experience level.

    This is positive for the security industry in one sense, as it lowers the barrier to entry for security careers and makes scarce expertise more accessible. It is dangerous in another, as it lowers the same barrier for threat actors. Both of these are true simultaneously, and the policy response needs to account for both.

    The workforce itself will undergo the same kind of transformation that previous industrial revolutions imposed on other skilled trades. The trajectory is from operators to automation engineers. Junior security professionals will not spend their early careers learning to run tools manually. They will learn to operate AI agents, to understand what the agents are doing well enough to intervene when something goes wrong, and to design the workflows that the agents execute. The analogy to modern aviation is apt, as pilots are still in the cockpit, but they are there primarily for the situations that automation cannot handle. The day-to-day operation is managed by systems they supervise. Over the next decade, the offensive security field appears likely to follow this exact trajectory, with operators progressively moving from direct users of frameworks and tools to managers of AI agents that themselves create and use the underlying tooling.

    For Platform Operators

    The competitive dynamics of CTF platforms are changing whether platform operators acknowledge it or not. First blood times are compressing. The leaderboard increasingly reflects who has the best AI tooling as much as who has the best hacking skills. A platform that ignores this does not avoid the shift. It just has no data on how the shift is unfolding.

    The practical consequence is that platform operators who do not instrument AI usage will find their competitive metrics becoming uninterpretable. If first blood times continue to compress at 16% per year, within a few years the Easy and Medium tiers will hit floor effects where blood times are dominated by network latency and spawning delays rather than solve skill. At that point the leaderboard measures infrastructure, not hacking. Platforms that have been tracking AI usage throughout this period will be able to contextualize these trends. Platforms that have not will simply watch their competitive signal degrade without understanding why. Section 8 details specific instrumentation and track separation proposals that address this directly.

    For Security Hiring

    Hiring managers who use CTF performance as a signal face a growing calibration problem. Consider a candidate who presents a top-100 Hack The Box ranking. In 2020, that ranking almost certainly reflected deep manual exploitation skill, including enumeration discipline, creative privilege escalation, comfort across multiple operating systems and attack surfaces. In 2026, that same ranking might reflect those skills, or it might reflect that the candidate built an excellent agentic pipeline that handles reconnaissance and structured exploitation while the human focuses on the creative leaps. Both are genuinely valuable in a professional security context. But they are different skills, and a hiring pipeline that assumes CTF performance maps only to traditional security expertise is increasingly miscalibrated. Organizations that want to hire for manual exploitation skill specifically will need to test for it specifically, rather than treating a leaderboard rank as a proxy. The certification redesign discussed in Section 8 addresses how the industry can build assessments that distinguish between these skill sets.

    CTFs as Global AI Capability Benchmarks

    The implications extend beyond the security community entirely. As CTF competitions shift from purely human contests to contests between AI-augmented operators, they become something they were never designed to be: standardized benchmarks for offensive AI capability.

    This is already happening in a research context. CyBench [7], InterCode-CTF [9], and the NYU CTF Benchmark [8] all use CTF-style challenges to evaluate agentic AI systems. But public platforms like Hack The Box provide something these closed benchmarks cannot: a live, continuously updated, adversarially designed evaluation environment with a global participant pool and real-time scoring. A research benchmark can be overfitted. A platform releasing new machines every week, designed by human creators actively trying to challenge the best players in the world, cannot.

    The geopolitical dimension of this is underappreciated. AI export restrictions are increasingly fragmenting the global model landscape. Operators in the United States work with systems like GPT, Claude, and Grok. Operators in China work with DeepSeek, Qwen, and other domestically developed models. European competitors may increasingly work with models like Mistral that fall under different regulatory frameworks. As AI-augmented competition becomes the norm on global CTF platforms, the leaderboard becomes a de facto comparison of these different AI ecosystems applied to a common adversarial task. A country whose models consistently underperform on offensive security benchmarks has a measurable signal about a gap in its AI capabilities that extends well beyond CTF.

    This is not hypothetical. The AI Cyber Challenge at DEF CON, funded by DARPA with $29.5 million in prizes [29], already frames autonomous cyber capability as a national security priority. The winning teams' cyber reasoning systems are required to be open-sourced, and DARPA has allocated additional funding to integrate the technology into real critical infrastructure. Public CTF platforms provide a continuous, peacetime version of the same signal. The death of the CTF as a purely human competition is simultaneously the birth of the CTF as a global AI capability benchmark, and the policy implications of that transformation deserve attention from audiences well beyond the infosec community. Section 8 explores how platforms can formalize this benchmarking role and how governments can leverage it through structured competition programs.

    8. Recommendations

    The measured compression in time-to-compromise implies that existing competition, training, and evaluation models will lose interpretability unless they are redesigned for an AI-augmented environment. The implications outlined above are broad, but they converge on a set of concrete actions that CTF platforms, training organizations, and the security community can take now rather than after the shift is complete.

    Learn from Chess

    Chess confronted the same fundamental problem two decades earlier, and its response is instructive. After Garry Kasparov lost to Deep Blue in 1997, the chess community did not pretend engines did not exist. It adapted. Three distinct competitive categories emerged: pure human chess, pure engine chess, and "advanced" or "centaur" chess, where human-computer teams compete. Kasparov himself introduced the advanced chess format in 1998, arguing that the combination of human intuition and computer calculation would produce the highest quality games. Research confirmed this: freestyle (centaur) teams consistently outperformed both solo humans and solo engines in tournament play through the mid-2000s.

    The enforcement problem followed immediately. Chess.com has spent over a decade building a Fair Play system that analyzes over 100 gameplay factors per game using statistical models to detect engine-assisted play. The system closes large numbers of accounts for fair-play violations each month and estimates that fewer than 1% of players cheat online. For prize events, Chess.com now requires all participants to run Proctor, a monitoring program, on their own machines, and has reported significantly fewer instances of engine cheating since making it mandatory.

    The critical parallel for CTFs is that chess had a structural advantage that CTFs do not: every move is recorded and analyzable after the fact. A chess engine's influence can be detected because the move sequence itself contains the signal. In a CTF, the platform has almost no visibility into how a competitor arrived at a solution. There is no move log. There is no replay. The platform sees a flag submission and a timestamp. This means the chess approach of statistical detection after the fact is largely unavailable to CTF platforms. Given the absence of move-level telemetry, meaningful enforcement of human-only competition appears feasible primarily in proctored environments where the competitor's screen and tooling can be directly observed. For online platforms, the honest conclusion is that AI-assisted competition cannot be meaningfully prevented. It can only be embraced, structured, and measured.

    Instrument AI Usage

    The first and most immediately actionable recommendation is for CTF platforms to implement a voluntary "Solved with AI" self-report mechanism. This is not about enforcement. It is about data collection. A simple toggle at flag submission, where the competitor indicates whether AI tools assisted their solve, produces ground truth that does not currently exist anywhere. The data does not need to be perfectly accurate to be valuable. Self-reporting will be imperfect and subject to strategic misreporting, but even noisy adoption data is strictly more informative than the current absence of ground truth. Even a rough self-reported signal would allow platforms to track adoption rates over time, correlate AI usage with solve speed, compare AI-assisted and unassisted performance distributions, and identify which challenge categories are most affected. This is the data that is currently unavailable in platform telemetry and that motivated the correlation-not-causation limitation in this study.
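A minimal sketch of what that toggle could yield. The schema below is hypothetical: the `FlagSubmission` fields and the `ai_assisted` name are illustrative inventions, not any platform's actual API.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical schema: field names are illustrative only.
@dataclass
class FlagSubmission:
    machine: str
    solve_seconds: float
    ai_assisted: bool  # the voluntary "Solved with AI" toggle

def _median(xs):
    """Median of a sorted list (upper median for even lengths)."""
    return xs[len(xs) // 2] if xs else None

def adoption_and_speed(subs):
    """Per-machine AI adoption rate and median solve time by cohort."""
    by_machine = defaultdict(list)
    for s in subs:
        by_machine[s.machine].append(s)
    report = {}
    for machine, group in by_machine.items():
        ai = sorted(s.solve_seconds for s in group if s.ai_assisted)
        manual = sorted(s.solve_seconds for s in group if not s.ai_assisted)
        report[machine] = {
            "adoption_rate": len(ai) / len(group),
            "median_ai": _median(ai),
            "median_manual": _median(manual),
        }
    return report
```

Even this crude aggregation answers the questions posed above: how fast adoption is growing per machine, and how the AI-assisted and unassisted solve-time distributions diverge.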

    Separate the Tracks

    As AI-assisted competition becomes the norm, platforms should consider formalizing the distinction that chess formalized decades ago. An unranked or separately ranked "AI-augmented" track, running alongside the traditional human track, would let competitors who want to push the boundaries of human-AI teaming do so without distorting the signal for competitors who want to test their manual skills. This is not about stigmatizing AI use. It is about preserving the interpretability of both signals. A first blood on the human track means something specific. A first blood on the augmented track means something different but equally valuable. Conflating the two helps no one.

    Standardize AI Capability Benchmarking

    The market for standardized evaluation of AI offensive tools is emerging rapidly, and CTF platforms are uniquely positioned to serve it. Hack The Box launched its AI Range in December 2025 [40], the first controlled environment specifically designed to benchmark autonomous AI security agents against live adversarial challenges. The platform tests models including Claude, GPT-5, Gemini, and Mistral against real challenges, running each model ten times per challenge with fresh instances to account for non-deterministic behavior. Early meta-benchmarks have already revealed a significant gap between AI models' security knowledge and their practical multi-step adversarial capabilities.
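The repeated-trial protocol is easy to express in code. The sketch below is not the AI Range's implementation: `run_challenge` substitutes a seeded coin flip (with a purely illustrative 0.4 success probability) for spawning a fresh instance and letting an agent attempt it, so only the harness structure, N fresh runs per model-challenge pair scored by success rate, mirrors the described protocol.

```python
import random

def run_challenge(model, challenge, attempt):
    """Stub for one agent attempt against a fresh challenge instance.
    A seeded RNG stands in for the non-deterministic outcome; the 0.4
    per-attempt success probability is purely illustrative."""
    rng = random.Random(f"{model}|{challenge}|{attempt}")
    return rng.random() < 0.4

def benchmark(models, challenges, runs=10):
    """Success rate per (model, challenge) over `runs` fresh instances."""
    scores = {}
    for m in models:
        for c in challenges:
            wins = sum(run_challenge(m, c, i) for i in range(runs))
            scores[(m, c)] = wins / runs
    return scores
```

Seeding each attempt makes the stub reproducible, which matters for the point in the text: with non-deterministic agents, only repeated fresh trials give a stable score.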

    This is where the commercial opportunity meets the research need. Organizations evaluating AI security products currently have no standardized way to compare them. A buyer considering two AI-powered penetration testing tools has no equivalent of a SPEC benchmark or an MLPerf score. CTF platforms can fill this gap by offering structured evaluation environments where vendors and researchers run their systems against a common, continuously updated set of challenges under controlled conditions. The data produced is valuable to everyone, as vendors get credible third-party validation, buyers get comparison data, researchers get reproducible benchmarks, and the platforms themselves gain a new revenue stream and strategic position in the AI security ecosystem.

    Redesign Training Pipelines

    Security training programs need to acknowledge that the skill they are developing is changing. Bootcamps and entry-level certification programs that teach people to run nmap, linpeas, and manual exploitation workflows are training for the role as it existed five years ago. The junior security professional of 2028 needs to understand what those tools do, but their primary operational skill will be designing, configuring, and orchestrating AI agent pipelines that run those tools. Training programs should be teaching students to build agentic workflows, to evaluate when an agent is producing useful output versus hallucinating, and to intervene effectively when automation breaks down.
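What "orchestrating an agent pipeline" means in practice can be shown with a deliberately minimal skeleton. Everything here is hypothetical: the tools are harmless stand-ins for nmap and linpeas, and a scripted function takes the place of an LLM, but the shape, a whitelisted tool registry, a dispatch loop, and explicit handling of hallucinated tool calls, is the operational skill the paragraph describes.

```python
# Hypothetical skeleton: stub tools stand in for real scanners.
TOOLS = {
    "port_scan": lambda target: f"scanned {target}",        # nmap stand-in
    "enum_privesc": lambda target: f"enumerated {target}",  # linpeas stand-in
}

def run_pipeline(agent, target, max_steps=5):
    """Dispatch an agent's proposed actions against a tool whitelist."""
    transcript = []
    for _ in range(max_steps):
        action = agent(transcript)      # agent proposes the next tool
        if action is None:              # agent signals completion
            break
        if action not in TOOLS:         # hallucinated tool: log and recover
            transcript.append(("error", f"unknown tool {action!r}"))
            continue
        transcript.append((action, TOOLS[action](target)))
    return transcript

def scripted_agent(transcript):
    """LLM stand-in: scan, hallucinate a tool, enumerate, then stop."""
    plan = ["port_scan", "kernel_popper", "enum_privesc", None]
    return plan[len(transcript)]
```

The teaching point is the error branch: evaluating whether the agent's proposed action is real or hallucinated, and recovering instead of blindly executing, which is exactly the intervention skill described above.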

    The certification landscape reflects this tension. Proctored, human-only examinations remain one of the few mechanisms for directly verifying manual exploitation skill. That signal is still valuable, and proctored human-only certifications should continue to exist for roles where manual skill matters. But the industry also needs certifications that test what real-world operators actually do, leveraging every available tool, including AI, to achieve objectives under realistic constraints. Hack The Box's certification model already moves in this direction. Their exams impose no AI restrictions. Instead, they make the assessment genuinely difficult, requiring extensive documentation and demonstrating that the candidate understands the full attack chain at a depth that cannot be faked by an unsupervised agent. The result tests real-world competency more faithfully than a controlled environment where half the candidate's actual toolkit is prohibited.

    Government and Institutional Adoption

    DARPA's AI Cyber Challenge demonstrated that the competition model works for accelerating AI security development at national scale [29]. A $29.5 million prize pool attracted seven world-class teams whose cyber reasoning systems collectively discovered 28 unique vulnerabilities, including six zero-days, in real open-source software, with the winning systems required to be open-sourced for public benefit. This is a model for how governments can outsource AI capability development to the private sector through structured competition rather than traditional procurement.

    The opportunity extends beyond one-off competitions. Public CTF platforms, operating continuously with global participation, provide something that a periodic DARPA challenge cannot: a standing, adversarially maintained evaluation environment that produces real-time signal about the state of AI offensive and defensive capability. Governments and defense organizations that formalize relationships with these platforms, whether through funded challenge series, benchmark partnerships, or structured data sharing agreements, gain continuous visibility into a capability domain that is evolving faster than any traditional assessment cycle can track.

    9. Conclusion

    The CTF is not dead. But what it measures is changing, and the pace of that change is accelerating.

    Across 423 Hack The Box machines spanning over eight years, both user and root first blood times are declining at approximately 16% per year. The compression is universal across every difficulty tier. It scales with complexity: the hardest machines, which once took the best competitors in the world over 15 hours, now show a median of about 5 hours in the Post-LLM era, a 67% reduction that is statistically significant (p = 0.035). The privilege escalation phase, the most structured and information-dense portion of the kill chain, is compressing faster than the foothold phase. Windows machines, with their well-documented and repeatable attack surfaces, show steeper declines than Linux. And the inflection points in all of these trends visually align with the public availability of increasingly capable AI systems. A sustained 16% annual reduction corresponds to a roughly sixfold decrease in time-to-compromise within a decade, a rate that fundamentally breaks security models built around human-speed operations.
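The sixfold figure is just the 16% annual rate compounded over ten years:

```python
# Compounding a sustained 16% annual decline in time-to-compromise.
annual_retention = 1 - 0.16             # each year keeps 84% of the prior year's time
ten_year_factor = annual_retention ** 10
print(round(1 / ten_year_factor, 1))    # → 5.7, i.e., roughly sixfold
```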

    None of this proves that AI is the sole cause. But the timing, magnitude, and structure of the observed patterns are consistent with AI-driven acceleration and are difficult to explain through previously identified factors alone. The confounding factors most commonly cited (community growth, writeup availability, better tooling) are themselves increasingly AI-enabled; the confounders are not independent of the primary variable.

    The implications reach well beyond the scoreboard. The same capabilities compressing CTF solve times are compressing real-world attack timelines, eroding the defender's window to detect and respond, and lowering the barrier to expert-level offensive capability for both security professionals and threat actors. The observed compression implies a shift in the security workforce from direct tool operation toward the design and supervision of automated workflows. Certifications and training programs designed for the previous era need to adapt or become irrelevant. And CTF platforms, whether they intended it or not, are becoming the world's first continuously maintained, adversarially designed benchmarks for national AI cyber capability. This paper does not measure AI capability directly. It measures the rate at which the attacker's clock is shrinking.

    This analysis is limited to a single platform and a single performance metric, and direct measurement of AI-assisted solves remains an open data problem. The title of this paper is deliberately provocative. The CTF as a purely human competition between operators is ending. What replaces it is not nothing. It is a richer, more complex, and more consequential competition, between the humans who design AI systems and the AI systems they build, tested against challenges that grow harder every week, on platforms that span the globe. The organizations, competitors, and governments that recognize this transition earliest and adapt most effectively will define the next era of cybersecurity. The ones that pretend nothing has changed will be left wondering why the scoreboard no longer makes sense.

    10. References

    AI Agents Solving CTFs

    [1] Cybersecurity AI team, "Cybersecurity AI: The World's Top AI Agent for Security CTF," arXiv:2512.02654, 2025. https://arxiv.org/abs/2512.02654

    [2] Palisade Research, "Hacking CTFs with Plain Agents," arXiv:2412.02776, 2024. https://arxiv.org/abs/2412.02776

    [3] Abramovich et al., "EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities," ICML 2025 (arXiv:2409.16165). https://arxiv.org/abs/2409.16165

    [4] Amazon Science, "CTF-Dojo: Training Language Model Agents to Find Vulnerabilities," arXiv:2508.18370, 2025. https://arxiv.org/abs/2508.18370

    [5] Tsinghua / Huazhong University, "Multi-Agent Framework for CTF Challenges," MDPI Applied Sciences, 2025. https://www.mdpi.com/2076-3417/15/13/7159

    [6] Evolve-CTF team, "Capture the Flags: Family-Based Evaluation via Semantics-Preserving Transformations," arXiv, 2025. https://arxiv.org/html/2602.05523v1

    CTF Benchmarks

    [7] Andy K. Zhang, Neil Perry et al., "CyBench: Framework for Evaluating Cybersecurity Capabilities of Language Models," arXiv:2408.08926, 2024. https://arxiv.org/abs/2408.08926

    [8] NYU-LLM-CTF team, "NYU CTF Bench: Scalable Open-Source Benchmark for LLMs in Offensive Security," NeurIPS 2024 (arXiv:2406.05590). https://arxiv.org/abs/2406.05590

    [9] "InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback," NeurIPS 2023. https://openreview.net/forum?id=fvKaLF1ns8

    [10] "CTFusion: CTF-based Benchmark for LLM Agent Evaluation," OpenReview / ICLR, 2026. https://openreview.net/forum?id=2zQJHLbyqM

    [11] "CTFTiny," arXiv:2508.05674, 2025. https://arxiv.org/abs/2508.05674

    [12] "AIRTBench: Autonomous AI Red Teaming Benchmark," arXiv:2506.14682, 2025. https://arxiv.org/abs/2506.14682

    LLM Exploit Capabilities

    [13] Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang, "LLM Agents can Autonomously Hack Websites," arXiv:2402.06664, 2024. https://arxiv.org/abs/2402.06664

    [14] Fang, Bindu, Gupta, Kang, "LLM Agents can Autonomously Exploit One-day Vulnerabilities," arXiv:2404.08144, 2024. https://arxiv.org/abs/2404.08144

    [15] Zhu, Kellermann, Gupta, Li, Fang, Bindu, Kang, "Teams of LLM Agents can Exploit Zero-Day Vulnerabilities," arXiv:2406.01637, 2024. https://arxiv.org/abs/2406.01637

    Automated Penetration Testing

    [16] Deng et al., "PentestGPT: Evaluating and Harnessing LLMs for Automated Penetration Testing," USENIX Security 2024 (Distinguished Artifact), arXiv:2308.06782. https://arxiv.org/abs/2308.06782

    [17] "D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System for Offensive Security," arXiv:2502.10931, 2025. https://arxiv.org/abs/2502.10931

    [18] "xOffense: AI-Driven Autonomous Penetration Testing," arXiv:2509.13021, 2025. https://arxiv.org/abs/2509.13021

    [19] "VulnBot: Autonomous Penetration Testing Multi-Agent Framework," arXiv:2501.13411, 2025. https://arxiv.org/abs/2501.13411

    [20] "PentestAgent: Incorporating LLM Agents to Automated Penetration Testing," AsiaCCS 2025 (arXiv:2411.05185). https://arxiv.org/abs/2411.05185

    [21] "Construction and Evaluation of LLM-based Agents for Semi-Autonomous Penetration Testing," arXiv, 2025. https://arxiv.org/html/2502.15506v1

    [22] "AutoPentest / Structured Attack Trees on HackTheBox," arXiv:2505.10321, 2025. https://arxiv.org/abs/2505.10321

    Automated Exploit Generation

    [23] Harbin Institute of Technology, "PwnGPT: Automatic Exploit Generation Based on Large Language Models," ACL 2025. https://aclanthology.org/2025.acl-long.562/

    [24] UNSW Sydney / CSIRO, "Good News for Script Kiddies? Evaluating LLMs for Automated Exploit Generation," arXiv:2505.01065, 2025. https://arxiv.org/abs/2505.01065

    [25] "From Rookie to Expert: Manipulating LLMs for Automated Vulnerability Exploitation," arXiv:2512.22753, 2025. https://arxiv.org/abs/2512.22753

    Google Project Zero

    [26] Google Project Zero, "Project Naptime: Evaluating Offensive Security Capabilities of LLMs," 2024. https://projectzero.google/2024/06/project-naptime.html

    [27] Google Project Zero + DeepMind, "From Naptime to Big Sleep: Using LLMs to Catch Vulnerabilities in Real-World Code," 2024. https://projectzero.google/2024/10/from-naptime-to-big-sleep.html

    DARPA Competitions

    [28] Avgerinos, Brumley et al. (ForAllSecure), "The Mayhem Cyber Reasoning System," IEEE Security & Privacy, 2016/2018. https://users.umiacs.umd.edu/~tudor/courses/ENEE657/Fall19/papers/Avgerinos18.pdf

    [29] DARPA, "AI Cyber Challenge (AIxCC)," 2023-2025. https://www.darpa.mil/research/programs/ai-cyber

    Capability Assessments

    [30] Bhatt, Chennabasappa et al. (Meta), "Purple Llama CyberSecEval (1, 2, 3)," arXiv:2312.04724, 2023-2024. https://arxiv.org/abs/2312.04724

    [31] MITRE, "OCCULT: Evaluating LLMs for Offensive Cyber Operation Capabilities," arXiv:2502.15797, 2025. https://arxiv.org/abs/2502.15797

    [32] "Catastrophic Cyber Capabilities Benchmark (3CB)," arXiv:2410.09114, 2024. https://arxiv.org/abs/2410.09114

    Vulnerability Detection

    [33] "LLM4Vuln: Unified Evaluation for LLMs' Vulnerability Reasoning," arXiv:2401.16185, 2024. https://arxiv.org/abs/2401.16185

    [34] "All You Need Is A Fuzzing Brain," arXiv:2509.07225, 2025. https://arxiv.org/abs/2509.07225

    [35] "SecLLMHolmes: LLMs Cannot Reliably Identify Security Vulnerabilities (Yet?)," IEEE S&P 2024, arXiv:2312.12575. https://arxiv.org/abs/2312.12575

    Competitions and Datasets

    [36] Debenedetti, Rando et al. (ETH Zurich), "Dataset and Lessons from 2024 SaTML LLM Capture-the-Flag Competition," NeurIPS 2024 (arXiv:2406.07954). https://arxiv.org/abs/2406.07954

    [37] Schulhoff et al., "HackAPrompt: Exposing Systemic Vulnerabilities of LLMs via Global Prompt Hacking Competition," EMNLP 2023. https://arxiv.org/abs/2311.16119

    [38] "DEF CON 31 AI Village CTF," Kaggle, 2023. https://www.kaggle.com/competitions/ai-village-capture-the-flag-defcon31/

    Sociology of CTFs

    [39] "Cybersecurity Knowledge and Skills Taught in CTF Challenges," Computers & Security (Elsevier), 2020. https://www.sciencedirect.com/science/article/abs/pii/S0167404820304272

    HackTheBox + AI

    [40] Hack The Box, "HTB AI Range Launch," 2025. https://www.hackthebox.com/blog/htb-ai-range-launch

    [41] "Artificial Intelligence as the New Hacker: Developing Agents for Offensive Security," arXiv:2406.07561, 2024. https://arxiv.org/abs/2406.07561

    Data Source

    [42] 0xdf, "Hack The Box Writeups," 0xdf.gitlab.io. https://0xdf.gitlab.io/ (First blood data collected for 423 machines, March 2017 -- October 2025)
