Can Your SOC's AI Actually Think? Evaluating LLMs with the Vectra AI MCP Server

November 4, 2025
Fabien Guillot
Technical Marketing Director, Vectra

You know that moment when someone says, “Let’s just plug ChatGPT into the SOC” — and everyone nods like it’s totally fine? Yeah, this post is about what happens after that moment.

Because as cool as it sounds, adding GenAI to a SOC isn’t magic. It’s messy. It’s data-hungry. And if you don’t measure what’s really happening under the hood, you might just end up automating the confusion.

So… we decided to measure it.

GenAI in the SOC: cool idea, hard reality

Let’s start with the obvious: AI is everywhere in security right now.

Every SOC slide deck has a big “GenAI Assistant” bubble somewhere in it. But how those assistants actually perform when faced with real SOC workflows — that’s the real test.

Enter the Vectra MCP Server — the air traffic controller for all your AI agents.

It connects your LLM (say ChatGPT or Claude) to your security tools (and their data!) — in this case, Vectra AI.

The MCP orchestrates enrichment, correlation, containment, and context, letting your AI agent interact directly with the signals that matter instead of getting lost in dashboards.
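
To make that concrete, here is a minimal sketch of an MCP client session written with the open-source MCP Python SDK. The server launch command, the tool name, and its arguments are placeholders for illustration, not the Vectra MCP server's actual interface.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder launch command -- substitute however you actually start the MCP server.
server_params = StdioServerParameters(command="python", args=["vectra_mcp_server.py"])

async def main() -> None:
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tool catalog -- the same catalog your LLM agent sees.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Hypothetical tool name and arguments, purely for illustration.
            result = await session.call_tool(
                "list_hosts", arguments={"urgency": "critical"}
            )
            print(result.content)

if __name__ == "__main__":
    asyncio.run(main())
```

In a real agent loop, the LLM (not your code) decides which tool to call and with what arguments; the snippet just shows the plumbing underneath.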

And because we want everyone to be able to leverage these capabilities first-hand, we've released two MCP servers that let you connect any Vectra platform to your AI workflows.

So, if you’ve been thinking, “I wish I could just connect my LLM to my security stack and see what happens,” now you can. No license hoops, no NDAs, just plug it in and play.

At Vectra AI, we genuinely believe that GenAI + MCP will fundamentally change how SOCs operate.

This isn’t a “someday” idea — it’s already happening, and we are making sure that Vectra AI users are fully equipped to leverage this change.

That’s also why we spend a lot of time talking with customers, prospects, and partners — to understand how fast these technologies are moving, and what “LLM-ready” really means in a live SOC.

So… we decided to measure it.

Because if GenAI is going to reshape security operations, then we need to be absolutely sure our platform, our data, and our MCP integrations can plug into that new world seamlessly. Measuring efficacy isn’t a side project — it’s how we future-proof the SOC.

It’s not about more data — it’s about better data

We’ll be blunt: GenAI without good data is like hiring Sherlock Holmes and giving him a blindfold.

At Vectra AI, data is the differentiator. Two things make it special:

  1. AI-based detections: built on years of research into attacker behaviors, not anomalies. They're designed to be robust, meaning they stay effective even as attackers change tools. Each detection focuses on intent and behavior rather than static indicators, giving SOC teams confidence that what they're seeing is real and relevant.
  2. Enriched network metadata: high-context telemetry that spans hybrid environments, structured and correlated so it's machine-readable and immediately actionable.

That's the kind of data GenAI can actually use. Feed that into an LLM, and it starts reasoning like a seasoned analyst. Feed it raw logs, and you'll get a very confident hallucination about DNS.
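
To illustrate the difference (with a made-up structure, not Vectra's actual schema), compare a raw log line with an enriched, behavior-centric detection record. Everything in the second object is explicit, so an LLM can reason over named fields instead of parsing and guessing.

```python
# A raw log line: technically data, but the model has to infer everything from it.
raw_log = "2025-11-04T09:12:31Z fw01 TCP 10.1.4.23:51712 -> 185.220.101.7:443 ALLOW 1832 bytes"

# A hypothetical enriched detection record (illustrative field names only):
# the behavior, the entity, and the surrounding context are all first-class fields.
enriched_detection = {
    "detection_type": "hidden_https_tunnel",   # attacker behavior, not a raw anomaly
    "category": "command_and_control",
    "host": {"name": "piper-desktop", "urgency": "critical"},
    "destination": {"ip": "185.220.101.7", "domain": None, "first_seen": "2025-11-03"},
    "evidence": {"sessions": 47, "bytes_out": 9_834_201, "duration_minutes": 130},
    "related_detections": ["privilege-anomaly-0012"],
}

print(enriched_detection["detection_type"], "on", enriched_detection["host"]["name"])
```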

So, how do you evaluate an AI analyst anyway?

Turns out, you can’t just ask it to “find bad guys faster.”

You need to measure how it reasons. And with an AI agent wired into an MCP server, there are three main things you can influence:

  1. The model (GPT-5, Claude, Deepseek, etc.)
  2. The prompt (how you tell it to act — tone, structure, goals)
  3. The MCP itself (how it plugs into your detection stack)

Each of those can move the performance needle.

Change the prompt slightly, and suddenly your “confident” AI analyst forgets how to spell “PowerShell.”

Change the model, and latency doubles.

Change the MCP integration, and half your context disappears.
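
One way to keep those three variables under control is to treat them as an explicit matrix and run every combination. The sketch below is illustrative only: the model names come from this post, while the prompt variants, the MCP identifier, and the structure are stand-ins for whatever harness you use.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class EvalConfig:
    model: str       # which LLM does the reasoning
    prompt: str      # how the agent is told to behave
    mcp_server: str  # which MCP integration supplies tools and context

MODELS = ["gpt-5", "claude-sonnet-4.5", "deepseek-3.1"]
PROMPTS = {
    "minimal": "You're an AI analyst. Help out. If you don't know, say so.",
    "strict": "You're an AI analyst. Answer only from tool output. If you don't know, say so.",
}
MCP_SERVERS = ["vectra-mcp"]  # placeholder identifier

def build_matrix() -> list[EvalConfig]:
    """Cross every model with every prompt and every MCP integration."""
    return [
        EvalConfig(model=m, prompt=PROMPTS[p], mcp_server=s)
        for m, p, s in product(MODELS, PROMPTS, MCP_SERVERS)
    ]

if __name__ == "__main__":
    for cfg in build_matrix():
        print(cfg.model, "|", cfg.mcp_server)
```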

That’s why we built a repeatable testbed — automated evaluation, real SOC scenarios, and a dash of brutal honesty.

The testbed (a.k.a. “we actually tried it”)

For the first run, we kept things intentionally simple: tier-1 tasks, light reasoning (two hops max), no fancy multi-agent choreography.

The stack looked like this:

  • n8n for quick prototyping and automation
  • A minimal SOC prompt (basically: “You’re an AI analyst. Help out. If you don’t know, say so.”)

But this wasn’t a toy experiment. We tested 28 real SOC tasks — the kind analysts actually face every single day. Things like:

  • Listing hosts in high or critical status
  • Pulling detections for specific endpoints (piper-desktop, deacon-desktop, etc.)
  • Checking for command-and-control detections tied to IPs or domains
  • Finding exfiltration over 1GB
  • Tagging and deleting host artifacts
  • Looking up accounts in “high” or “critical” risk quadrants
  • Hunting for “Admin” accounts involved in EntraID operations
  • Querying detections with specific JA3 fingerprints
  • Assigning analysts to hosts or detections

Basically, everything a Tier-1 or Tier-2 SOC analyst would touch on a busy Tuesday morning.

Each run was scored on correctness (rated on a 1-5 scale), execution time, token use, and tool activity.
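
As a rough sketch of what scoring a run can look like, here is one way to record those four dimensions per task. The field names and example numbers are illustrative, not our actual harness or results.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RunResult:
    task: str          # e.g. "List hosts in high or critical status"
    correctness: int   # 1-5, graded against the expected answer
    seconds: float     # wall-clock execution time
    tokens: int        # total tokens consumed (prompt + completion)
    tool_calls: int    # number of MCP tool invocations the agent made

# Illustrative numbers only.
results = [
    RunResult("List hosts in high or critical status", 5, 21.4, 3_800, 1),
    RunResult("Find exfiltration over 1GB", 4, 35.0, 6_200, 2),
    RunResult("Assign an analyst to a detection", 3, 48.9, 9_100, 4),
]

print("avg correctness:", round(mean(r.correctness for r in results), 2))
print("avg seconds:    ", round(mean(r.seconds for r in results), 1))
print("total tokens:   ", sum(r.tokens for r in results))
print("tool calls/task:", round(mean(r.tool_calls for r in results), 1))
```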

What makes a good GenAI agent?

Evaluating GenAI inside a SOC isn’t about which model sounds smarter. It’s about how efficiently it thinks, acts, and learns. A good AI agent behaves like a sharp analyst — it doesn’t just get the right answer, it gets there efficiently. Here’s what to look for:

  1. Efficient token usage. The fewer words it needs to reason, the better. Long-winded models waste compute and context space.
  2. Smart tool calls. When a model keeps calling the same tool over and over, it’s basically saying “let me try again.” The best ones understand when and how to use a tool — minimal trial and error, maximum precision.
  3. Speed without sloppiness. Fast is good, but only if accuracy holds. The ideal model balances responsiveness with reasoning depth.

In short: your best AI analyst doesn’t just talk — it thinks efficiently.
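
If you want to turn those three traits into something you can actually sort models by, one crude option is to discount correctness by its cost in time, tokens, and tool-call churn. The weights and numbers below are arbitrary and only meant to show the shape of the trade-off, not how we scored the models in this post.

```python
from dataclasses import dataclass

@dataclass
class ModelSummary:
    name: str
    avg_correctness: float  # 1-5 across all tasks
    avg_seconds: float
    avg_tokens: int
    avg_tool_calls: float

def efficiency_score(m: ModelSummary) -> float:
    """Correctness per unit of cost: penalize slow runs, token burn, and repeated tool calls."""
    cost = 1.0 + m.avg_seconds / 60 + m.avg_tokens / 10_000 + 0.2 * m.avg_tool_calls
    return m.avg_correctness / cost

# Purely illustrative numbers, not the measurements discussed below.
candidates = [
    ModelSummary("model-a", avg_correctness=4.3, avg_seconds=90.0, avg_tokens=12_000, avg_tool_calls=2.0),
    ModelSummary("model-b", avg_correctness=4.1, avg_seconds=45.0, avg_tokens=7_000, avg_tool_calls=1.5),
]
for m in sorted(candidates, key=efficiency_score, reverse=True):
    print(f"{m.name}: {efficiency_score(m):.2f}")
```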

Here’s what we found:

Highlights and practical takeaways

  • GPT-5 wins on accuracy and reasoning depth, but it's slow and pricey. Use when precision matters more than speed.
  • Claude Sonnet 4.5 delivers the best overall balance: accuracy, speed, and efficiency. Great for production SOCs.
  • Claude Haiku 4.5 is perfect for fast triage: quick, cheap, and "good enough" for first-line decisions.
  • Deepseek 3.1 is the value champion: impressive performance at a fraction of the cost.
  • Grok Code Fast 1 is for tool-heavy workflows (automation, enrichment, etc.), but watch your token bill.
  • GPT-4.1... let's just say it's not invited back for another shift.

And because every good article needs graphs, here are a few:

Correctness score comparison

GPT-5 is technically the winner at 4.32/5, but honestly? Claude Sonnet 4.5 and Deepseek 3.1 are basically tied at 4.11 and you probably won't notice the difference. The real plot twist? GPT-4.1 absolutely faceplants with 2.61/5. Like, yikes. Don't use that one for security stuff.

Execution time

Claude Haiku 4.5 is flying through these queries at 38 seconds. Meanwhile GPT-5 is taking a leisurely 93-second stroll — literally 2.5x slower. When there's a potential security incident, those extra seconds feel like forever. Haiku gets it done.

Value proposition matrix

Bigger bubble = fewer tokens used. GPT-4.1's bubble is huge, but that's not a flex — it's like saying "I finished the test super fast" when you failed it. Cheap and wrong isn't a value proposition, it's just... wrong. The models you actually want are in the upper-right corner: Deepseek 3.1 (efficient AND accurate), Claude Sonnet 4.5 (balanced beast), and Grok Code Fast (solid all-around). GPT-5's micro-bubble confirms it's the expensive option.

So, what did we learn?

  1. Accuracy isn’t everything. A model that’s slightly more accurate but takes twice as long — and burns five times the tokens — might not be your best option. In a SOC, efficiency and scale are part of accuracy.
  2. Tool use is a window into reasoning. If an LLM needs ten tool calls to answer a simple question, it’s not being thorough — it’s lost. The best-performing models didn’t just get the answer right; they got there efficiently, using one or two smart queries through the MCP. Tool use isn’t about quantity — it’s about how quickly the model figures out the right path. And the LLM isn’t always the one to blame: a good MCP server is essential for efficient tool calling. But let’s keep MCP evaluation for another time.
  3. Prompt design is underrated. The tiniest tweak in wording can swing accuracy or hallucination rates wildly. We kept the prompt minimal on purpose — a baseline for future tuning — but it’s clear that small design choices have big effects (see the sketch below).
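
To make that last point concrete, here is the shape of a minimal prompt A/B check. The run_task function is a stub standing in for the real agent call (LLM plus MCP tools), and the second prompt variant is invented for illustration.

```python
# Two system prompts that differ only slightly in wording.
PROMPT_A = "You're an AI analyst. Help out. If you don't know, say so."
PROMPT_B = "You're an AI analyst. Answer only from tool results. If you don't know, say so."

def run_task(prompt: str, question: str) -> str:
    """Stub: in a real harness this would call the agent and return its answer for grading."""
    return f"[answer produced under prompt: {prompt[:40]}...]"

QUESTION = "List all hosts currently in high or critical status."

for name, prompt in [("A", PROMPT_A), ("B", PROMPT_B)]:
    print(f"Prompt {name}: {run_task(prompt, QUESTION)}")
```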

Wrapping up (and a little reality check)

So, here’s the thing — it’s not really about which model wins a beauty contest. Sure, GPT-5 might edge out Claude on one metric or another, but that’s missing the point.

The real lesson is that evaluating your AI agent is not optional.
If you’re going to rely on GenAI inside your SOC — to triage alerts, summarize incidents, or even call containment actions — then you need to know how it behaves, where it fails, and how it evolves over time.

AI without evaluation is just automation without accountability.

And equally important: your security tools need to speak LLM.

That means structured data, clean APIs, and context that’s machine-readable — not locked in dashboards or vendor silos. The most advanced model in the world can’t reason if it’s fed half-broken telemetry.

That’s why at Vectra AI, we’re obsessed with making sure our platform — and our MCP server — are LLM-ready by design. The signals we produce aren’t just meant for humans; they’re built to be consumed by machines, by AI agents that can reason, enrich, and act.

Because in the next wave of security operations, it’s not enough to use AI — your entire ecosystem has to be AI-compatible.

The SOC of the future isn’t just AI-powered. It’s AI-measured, AI-connected, and AI-ready.  
