Stay ahead of the AI revolution.

21 new articles published today across the AI landscape.

Tools: 52 tracked
News: 1184 (+21 today)
Topics: 10 trending

Top Story

N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

AI Digest

Pro

Your personalized AI briefing — curated summaries of today's top AI news, tools, and research.

N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

N-Day-Bench tests whether frontier LLMs can find known security vulnerabilities in real repository code. Each month it pulls fresh cases from GitHub security advisories, checks out the repo at the last commit before the patch, and gives models a sandboxed bash shell to explore the codebase.

Static vulnerability discovery benchmarks become outdated quickly. Cases leak into training data, and scores start measuring memorization. The monthly refresh keeps the test set ahead of contamination, or at least makes the contamination window honest.

Each case runs three agents: a Curator reads the advisory and builds an answer key, a Finder (the model under test) gets 24 shell steps to explore the code and write a structured report, and a Judge scores the blinded submission. The Finder never sees the patch. It starts from sink hints and must trace the bug through actual code.

Only repos with 10k+ stars qualify. A diversity pass prevents any single repo from dominating the set. Ambiguous advisories (merge commits, multi-repo references, unresolvable refs) are dropped.

Currently evaluating GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, GLM-5.1, and Kimi K2.5. All traces are public.

Methodology: https://ndaybench.winfunc.com/methodology

Live Leaderboard: https://ndaybench.winfunc.com/leaderboard

Live Traces: https://ndaybench.winfunc.com/traces
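
The case-selection rules in the N-Day-Bench description above (10k+ stars, a diversity pass, ambiguous advisories dropped) could be sketched roughly as follows. This is only an illustration, not N-Day-Bench's actual code: the `Advisory` record, its field names, the `select_cases` function, and the per-repo cap of 2 are all assumptions made for the example. The description does not say how the diversity pass is implemented.

```python
from collections import Counter
from dataclasses import dataclass

MIN_STARS = 10_000       # "Only repos with 10k+ stars qualify" (from the description)
MAX_CASES_PER_REPO = 2   # assumed cap; the real diversity rule is not stated


@dataclass(frozen=True)
class Advisory:
    """Hypothetical record for one GitHub security advisory."""
    advisory_id: str
    repo: str
    stars: int
    is_merge_commit: bool = False  # the patch is a merge commit
    multi_repo: bool = False       # the advisory references several repos
    ref_resolvable: bool = True    # the patch ref can be resolved


def is_ambiguous(adv: Advisory) -> bool:
    """Mirror the three drop reasons named in the description."""
    return adv.is_merge_commit or adv.multi_repo or not adv.ref_resolvable


def select_cases(advisories: list[Advisory]) -> list[Advisory]:
    """Keep popular, unambiguous cases, capped per repo (assumed diversity pass)."""
    per_repo = Counter()
    selected = []
    for adv in advisories:
        if adv.stars < MIN_STARS or is_ambiguous(adv):
            continue
        if per_repo[adv.repo] >= MAX_CASES_PER_REPO:
            continue  # this repo already has its share of cases
        per_repo[adv.repo] += 1
        selected.append(adv)
    return selected
```

A simple per-repo cap is just one way to stop a single repository from dominating the set. A real pipeline might instead weight by language, vulnerability class, or advisory severity.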

Products & Releases | HackerNews | 4/13/2026
The looming college-enrollment death spiral

LLMs | HackerNews | 4/13/2026
Evaluation of Claude Mythos Preview's cyber capabilities

Products & Releases | HackerNews | 4/13/2026
Claude Mythos: The System Card

Products & Releases | HackerNews | 4/13/2026
Claude.ai down

Products & Releases | HackerNews | 4/13/2026
Anthropic loses appeals court bid to pause supply chain risk label

Research | HackerNews | 4/12/2026
How We Broke Top AI Agent Benchmarks: And What Comes Next

Use Cases | HackerNews | 4/11/2026
Cirrus Labs to join OpenAI; Cirrus CI to shut down on Monday, June 1, 2026

Research | HackerNews | 4/11/2026

Ecosystem Insights

Top categories by tool count

LLM: 10
Image Generation: 7
Code Generation: 6
Video Generation: 5
