Automatically prioritize security issues from different tools with an LLM


The security backlog of broken dreams

You know the one. That Jira board, spreadsheet, or Splunk dashboard where all the security issues go to die.

At some point, someone ran a scanner. Or three. They dumped the results into a queue and promised they’d “prioritize later.” That was six quarters ago. Now there are 742 issues, 61 marked “Critical,” and exactly zero people who understand how those labels were assigned.

Maybe the scanner said it was critical. Maybe it was copied from another tool. Maybe a junior analyst just vibed it.

The point is, it’s chaos.

Most orgs try to wrangle this mess with a few basic moves:

  • Assign severity levels and hope for the best
  • Filter by “production” and pretend the rest isn’t real
  • Open a spreadsheet, sort by column, and call it triage

But here’s the uncomfortable truth: your backlog isn’t prioritized. It’s just sorted alphabetically by despair.

Now imagine you could ask someone smart. Not spreadsheet smart. Actually smart. The kind of person who could read every issue, compare them, and tell you which ones matter most.

That’s the promise of large language models (LLMs). You feed them your scanner output, and they start making judgment calls. Real ones. Based on the details of each issue, not just the CVSS score or whether it came from Tool A or Tool B.

Sounds too good to be true? Maybe we aren't there yet. But it's not make-believe either. And in the rest of this post, I'll show you three ways to let an LLM do your triage. One naive. One painfully thorough. One that hopefully strikes a good balance.

Your vendors and consultants won’t love this. But your future self will.

Most triage is basically improv

Making a tool that finds security problems is easy. Making one that fixes them is damn near impossible. So the security industry spends most of its time making things that find problems. That's not the worst thing in the world. It's better to know than not know. But it does mean diligent security teams end up with huge lists of issues to fix.

The incentives for tool makers are also backwards. Which tool would your boss buy - the one that finds 10 criticals or the one that finds 7 lows and 3 mediums? Even if they're the same issues and the second set is far closer to reality, there's no way your boss is buying the second. Vendors are incentivized to make a fuss and tell you their findings are the most important. It's a race to the top of the severity stack.

Triage is supposed to answer one question: what should we fix next?

What it usually answers instead is: what did the scanner say, and which scanner is the most overconfident in its self-importance?

The average security backlog has issues from five different tools, each using their own severity scale. One calls something “high,” another calls the same thing “medium,” and the third tool decided it’s “informational” but also worth alerting on at 3 a.m.

And if you’ve got cloud findings in the mix? Good luck. Everything is a potential misconfiguration, a compliance violation, or possibly exploitable, depending on how the wind blows in your IAM policy.

So what do most teams do?

  • Trust the scanner labels, even when they conflict
  • Rely on static rules like “always fix criticals in prod”
  • Manually review top issues and guess based on gut feel
  • Or worst of all, ignore the backlog entirely and hope it goes away

It’s not laziness. It’s survival.

Triage is time-consuming. It’s subjective. And when everything is flagged as urgent, nothing actually gets fixed.

When you only have 5 buckets (critical, high, medium, low, informational) and thousands of issues, the buckets aren’t enough. You need buckets inside the buckets like one of those Russian Matryoshka dolls.

That’s the real problem. Prioritization at scale isn’t just sorting by severity. It’s making informed decisions about real-world risk.

And that’s where LLMs start to get interesting. They’re not just classifiers. They’re reasoners. They can look at two issues, consider context, and explain which one deserves attention first.

In the next section, I’ll show you why the quality of your input matters way more than you think.

Because even the smartest model can’t save you from a garbage JSON blob.

Input matters more than your model

Everyone loves talking about prompts. Nobody talks about the other half of the equation: your actual data.

You can craft the perfect LLM prompt. Add clever instructions. Balance your tone just right. But if the input issue looks like this:

{ "title": "Security problem detected", "severityLevel": "High" }

…then you’re basically asking the model to triage a fortune cookie.

The model isn’t magic. It can reason across context, but only if context exists. If your input has no details about where the issue was found, how it might be exploited, or what system it affects, then you’re just scoring vibes.

Want better output? Feed it better input. Here’s what that means in practice:

  • Use real titles and descriptions. Not tool-generated junk.
  • Add environmental details like other resources in the same network or cloud account.
  • Include impact details. Is this on an S3 bucket with customer data or a Minecraft server masquerading as a dev box?
  • Add metadata. Things like asset criticality, exposure paths, or owner tags make a huge difference.
  • Do some enrichment where possible (more on this later) like looking up CVE numbers and IP addresses.

Here’s a side-by-side example:

Bad input

{ "title": "High severity issue", "description": "Detected by scanner.", "severityLevel": "High" }

Better input

{ "title": "Publicly accessible S3 bucket with customer PII", "description": "The bucket 'client-data-backup' allows public read access and contains files matching known PII patterns. Detected in production.", "severityLevel": "High", "assetCriticality": "High", "foundIn": "prod-account-001", "remediation": "Remove public access by enabling 'Block Public Access' settings and replacing the bucket policy to restrict access to known, trusted IAM principals. Ensure server-side encryption is enforced and access logging is enabled.", "assetConfiguration": { "encryption": "SSE-KMS with customer-managed CMK", "versioning": true, "loggingEnabled": true, "lifecyclePolicy": "Retain for 90 days then archive to Glacier", "backupPolicy": "Daily backup to secondary region via AWS Backup", "publicAccessBlock": "false", "policy": { "Version": "2012-10-17", "Statement": [ { "Sid": "PublicReadGetObject", "Effect": "Allow", "Principal": "*", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::client-data-backup/*" } ] } }, "tags": ["S3", "data exposure", "prod"] }

Which one do you think the model is going to prioritize correctly?

Exactly.

Naive scoring is better than no scoring

Sometimes you just need to sort the pile. Not debate it. Not argue with it. Just sort it. That’s what naive scoring does.

Each issue gets fed to the LLM, and the model gives it a score from 1 to 100 (or whatever scale you like). Higher score means higher priority. That’s it. No pairwise comparisons. No fancy math. Just raw, unfiltered judgment from your friendly local language model.

It’s useful because:

  • It’s fast. Linear time fast. You can run it on 1,000 issues without lighting your AWS bill on fire. You might even be able to do it in one shot, depending on the LLM’s context window.
  • It’s simple. Every issue gets a score and a reason. No need to track matchups or comparisons.

It’s not perfect. In fact, this approach has some limitations I really can’t live with. The model has no sense of relative nuance between issues unless you teach it how to reason across multiple dimensions. It often feels like it’s spitting out random numbers, but round random numbers: you get a lot of “90” but not much “89,” for example. It needs a really strict rubric to avoid this score-in-a-vacuum problem.

Still, for a first pass, it’s probably ahead of “sort by scanner severity.”

How does it work

Each issue goes through a prompt like this:

You are a security expert assessing the priority of a single security issue.
Do not rely on the severity or risk level described in the issue as they come from different tools and aren't always consistent.
Think through why each issue is important and/or urgent and why it should have a given score.

Assign a priority score from 1 (lowest priority) to 100 (highest priority) based on the following factors:
1. Impact of exploitation
2. Likelihood of exploitation
3. Resource type and its importance
4. Reachability of the resource and issue
5. Whether there's a vendor fix available
6. How long the issue has been present
7. How long the issue will take to fix
8. Context of the issue like whether it is a test, demo, or production issue
And anything else you deem relevant.

Here is the issue:
{issue}

Respond in this JSON format:
{{
  "score": <1-100>,
  "reasoning": "Detailed explanation of the score"
}}

Only return valid JSON, no additional commentary.

The model responds with something like:

{ "score": 75, "reasoning": "This issue is regarding the lack of key rotation for an AWS KMS Customer Managed Key (CMK), which is a critical security control to protect sensitive data. The impact of exploitation can be high as the CMK is used to encrypt data, and if the key is compromised, the data may be exposed. The likelihood of exploitation is moderate, as the issue has been present for a significant period (over 2 years) and an adversary may discover and attempt to exploit the unrotated key over time. The resource type (KMS CMK) is highly important as it is used to protect sensitive data. The resource is reachable as it is located in a public AWS region, and there is no vendor fix available as this is a configuration setting that the customer is responsible for. The issue has been present for a long time, and the fix would likely be straightforward, but the impact of the issue is severe, warranting a high priority score." }

That’s it. You now have a stack-ranked list.

If you want to play with this, I’ve written score_sort.py in Python on top of AWS Bedrock. It expects an input-issues.json file with an issues list inside it. You can set the model to anything your Bedrock account is configured to use.
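If you just want a feel for the shape of that loop, here’s a minimal sketch. It assumes boto3’s Bedrock Converse API, the file names above, and a Sonnet model ID; the real script handles retries, batching, and output formatting that I’ve skipped.

# naive_score_sketch.py - a rough sketch of a naive scoring loop, not the actual score_sort.py
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # any Bedrock model you have access to

SCORING_PROMPT = """You are a security expert assessing the priority of a single security issue.
Assign a priority score from 1 (lowest priority) to 100 (highest priority).
Here is the issue: {issue}
Respond only with JSON: {{"score": <1-100>, "reasoning": "..."}}"""  # abbreviated version of the prompt above

def score_issue(issue: dict) -> dict:
    """Ask the model for a 1-100 priority score plus reasoning."""
    prompt = SCORING_PROMPT.format(issue=json.dumps(issue))
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])

with open("input-issues.json") as f:
    issues = json.load(f)["issues"]

for issue in issues:
    issue.update(score_issue(issue))  # adds "score" and "reasoning" to each issue

issues.sort(key=lambda i: i["score"], reverse=True)

with open("prioritized-issues.json", "w") as f:
    json.dump({"issues": issues}, f, indent=2)

And here’s what an actual run of score_sort.py looks like: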

./score_sort.py

Scored Issues Summary:
====================================================================================================
Rank  Score  Severity  Type           Title/Message
----------------------------------------------------------------------------------------------------
1     95     MEDIUM    FINDING        Policy that allows full administrative privileges is attached
2     90     CRITICAL  FINDING        Amazon S3 bucket is verified to be publicly accessible
3     90     HIGH      FINDING        The EC2 Instance has HIGH severity vulnerabilities
4     90     HIGH      FINDING        The Lambda function has HIGH severity vulnerabilities
5     80     MEDIUM    VULNERABILITY  python: tarfile module directory traversal
6     80     LOW       FINDING        AWS Key Management Service (KMS) Customer Managed Keys (CMK) does not have key-rotation e...
7     70     MEDIUM    FINDING        Amazon GuardDuty is not enabled in this region
8     65     LOW       VULNERABILITY  golang: crypto/tls: session tickets lack random ticket_age_add
9     60     CRITICAL  VULNERABILITY  golang: crypto/elliptic: IsOnCurve returns true for invalid field elements
10    35     LOW       FINDING        RDS database instance is using the default port
====================================================================================================
Total issues scored: 10
Total runtime: 29.52 seconds

It will generate a prioritized-issues.json file with all the issues, including the reasoning for each score.

Each one has a reasoning field explaining why it scored the way it did. That’s gold if you’re sending alerts or making tickets. It’s like having an intern who can actually write.

Bubble sort is dumb but useful

Remember bubble sort from university (college for the Americans)? The lecturer made you learn it even though it’s slow and no one uses it in real systems?

Turns out, it’s useful for this because if you abstract away the philosophical stuff, prioritization is just a form of sorting.

Take your list of issues. Compare them two at a time. Ask the LLM which one is higher priority. If the answer is “the second one,” swap them. Keep doing that until nothing needs to move. That’s bubble sort.

It’s slow. It’s repetitive. And it works better than you think.
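Here’s that loop as a sketch. The compare_issues helper stands in for the pairwise LLM call whose prompt is shown further down, and is assumed to return the JSON that prompt asks for; the "comparisons" key I use to store history is my own naming, not necessarily what bubble_sort.py does.

def bubble_sort_issues(issues: list, compare_issues) -> list:
    """Sort issues in place so the highest-priority issue ends up first."""
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(issues) - 1):
            verdict = compare_issues(issues[i], issues[i + 1])
            # Record the matchup on both issues so the final ranking is explainable later.
            for issue in (issues[i], issues[i + 1]):
                issue.setdefault("comparisons", []).append(verdict)
            if verdict["higher_priority_issue"] == 2:
                # The second issue won, so it bubbles up toward the top of the list.
                issues[i], issues[i + 1] = issues[i + 1], issues[i]
                swapped = True
    return issues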

Why this is actually good

  • Every comparison is deliberate. The model looks at two real issues and makes a choice.
  • You get a complete, fully ordered list based only on those decisions. There are no buckets of homogeneous issues. In theory, each issue is positioned intentionally.
  • Each issue ends up with a record of what it was compared against, what happened, and why.

If you need traceability, something to hand to your CISO or thread into a ticket, this is it. You can show exactly why issue 17 ended up higher than issue 23. The model explained it for you.

How does it work

What the prompt looks like

You are a security expert tasked with comparing two security issues and determining which one should be prioritized higher.
Do not rely on the severity or risk level described in the issues as they come from different tools and aren't always consistent.
Think through why each issue is important and/or urgent and why one should be prioritized higher than the other.

Consider at least the following factors in your decision:
1. Impact of exploitation
2. Likelihood of exploitation
3. Resource type and its importance
4. Reachability of the resource and issue
5. Whether there's a vendor fix available
6. How long the issue has been present
7. How long the issue will take to fix
8. Context of the issue like whether it is a test, demo, or production issue
And anything else you deem relevant.

Here are the two issues to compare:
Issue 1: {issue1}
Issue 2: {issue2}

Please provide your analysis in the following JSON format:
{{
  "higher_priority_issue": 1 or 2,
  "reasoning": "Detailed explanation of why this issue should be prioritized higher, considering all relevant factors"
}}

Only respond with the JSON, no additional text.

The model picks a winner, and the comparison gets recorded against each issue like this:

{ "compared_with": "prn:findings:xxx:aws-140:testing-bucket", "reasoning": "Issue 1 should be prioritized higher than Issue 2 due to the following factors:1. Impact of exploitation: The public accessibility of the S3 bucket in Issue 1 poses a significant risk, as it allows unauthorized access to potentially sensitive data stored in the bucket. This could lead to data breaches, theft, or misuse of the stored information.2. Likelihood of exploitation: The public accessibility of the S3 bucket in Issue 1 makes it highly likely to be exploited, as it is a well-known and easily identifiable vulnerability that can be easily discovered and targeted by malicious actors.3. Resource type and its importance: The S3 bucket in Issue 1 is a critical cloud storage resource that may contain important or sensitive data. The compromise of such a resource can have a significant impact on the organization's operations and reputation.4. Reachability of the resource and issue: The S3 bucket in Issue 1 is publicly accessible, making it easily reachable and exploitable by anyone on the internet. This increases the urgency of addressing this issue.5. Vendor fix availability: Issue 1 does not have a vendor fix, as the public accessibility of the S3 bucket is a configuration issue that the organization must address. In contrast, Issue 2 has a vendor fix available, reducing the urgency of this vulnerability.6. Duration of the issue: Issue 1 has been present for a longer duration, with the first observation being over 5 days ago, compared to Issue 2, which was first observed around 2 months ago.7. Time to fix: Addressing the public accessibility of the S3 bucket in Issue 1 is likely to be faster and easier to fix than the vulnerability in Issue 2, which may require more complex remediation efforts.", "was_higher_priority": false }

That reasoning gets attached to both issues as part of their history by bubble_sort.py.

Example CLI run

./bubble_sort.py

Bubble Sort Issues Summary:
====================================================================================================
Rank  Severity  Type           Title/Message
----------------------------------------------------------------------------------------------------
1     CRITICAL  FINDING        Amazon S3 bucket is verified to be publicly accessible
2     MEDIUM    VULNERABILITY  python: tarfile module directory traversal
3     HIGH      FINDING        The EC2 Instance has HIGH severity vulnerabilities
4     MEDIUM    FINDING        Policy that allows full administrative privileges is attached
5     CRITICAL  VULNERABILITY  golang: crypto/elliptic: IsOnCurve returns true for invalid field elements
6     LOW       VULNERABILITY  golang: crypto/tls: session tickets lack random ticket_age_add
7     HIGH      FINDING        The Lambda function has HIGH severity vulnerabilities
8     LOW       FINDING        RDS database instance is using the default port
9     MEDIUM    FINDING        Amazon GuardDuty is not enabled in this region
10    LOW       FINDING        AWS Key Management Service (KMS) Customer Managed Keys (CMK) does not have key-rotation enabled
====================================================================================================
Total issues prioritized: 10
Total runtime: 193.75 seconds

Fair warning: this will take time. Bubble sort makes O(n²) comparisons. That means 10 issues = 45 comparisons. 100 issues = 4,950 comparisons. You get the idea. Sorting just 10 issues took me 193 seconds in the example run above, as compared to the score sort which took 29 seconds for the same 10 issues.

If your input set is small or high-value, or you want to get some baseline confidence about a prompt, it might be worth it. You’ll walk away with a ranked list, detailed reasoning for every move, and confidence that the ranking wasn’t just a gut call.

One more thing. I ran this multiple times with the same prompt, input, and model, and I got the same result every time. That’s not true for most “AI ranking” systems. I don’t think that makes this approach deterministic but it gave me comfort that it was closer to deterministic than yolo.

So that’s the “slow but thorough” option. Next up, we go for balance. A system that’s faster than bubble sort, more nuanced than naive scoring, and weirdly reminiscent of competitive chess.

Yep. We’re doing Elo.

Elo is smarter than it sounds

If you’ve ever played online chess or any competitive video games, you’ve already used Elo.

The concept is simple: every player, or in our case, every security issue, starts with a score. When two of them “play” (get compared), the winner gains points and the loser drops a few. How much? That depends on how expected the outcome was.

If an issue with a high score loses to a low-ranked underdog, it takes a bigger hit. If it wins, no surprise, not much changes. That same logic works beautifully for security triage.

Why this method hits the sweet spot

  • It’s scalable. You don’t have to compare every pair. Just sample.
  • It’s adaptive. Rankings evolve as comparisons happen.
  • It’s tunable. You can control how many comparisons to make and how sensitive the rankings are.
  • It feels fair. Big wins mean more movement. Narrow wins mean less.

Bubble sort gives you full coverage. Naive scoring is fast but context-blind. Elo lets you explore the middle ground.

How does it work

Each issue starts with a base Elo score. Say 1200. The script randomly samples issue pairs and asks the model to compare them:

You are a security expert tasked with comparing two security issues and determining which one should be prioritized higher.
Do not rely on the severity or risk level described in the issues as they come from different tools and aren't always consistent.
Think through why each issue is important and/or urgent and why one should be prioritized higher than the other.

Consider at least the following factors in your decision:
1. Impact of exploitation
2. Likelihood of exploitation
3. Resource type and its importance
4. Reachability of the resource and issue
5. Whether there's a vendor fix available
6. How long the issue has been present
7. How long the issue will take to fix
8. Context of the issue like whether it is a test, demo, or production issue
And anything else you deem relevant.

Here are the two issues to compare:
Issue 1: {issue1}
Issue 2: {issue2}

Please provide your analysis in the following JSON format:
{{
  "higher_priority_issue": 1 or 2,
  "reasoning": "Detailed explanation of why this issue should be prioritized higher, considering all relevant factors"
}}

Only respond with the JSON, no additional text.

The model responds with:

{ "higher_priority_issue": 1, "reasoning": "Issue 1 allows lateral movement across accounts. Issue 2 is limited to a dev system." }

Then the scores get updated. The winner goes up, the loser goes down, and both issues log the reasoning.
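For reference, the standard Elo update looks like the sketch below. I’m assuming elo_sort.py applies something close to this; the "elo" key name and the K-factor of 32 are my assumptions, not values pulled from the script.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A 'wins' the comparison, given current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(winner: dict, loser: dict, k: float = 32.0) -> None:
    """Move the winner up and the loser down, scaled by how surprising the result was."""
    e_winner = expected_score(winner["elo"], loser["elo"])
    winner["elo"] += k * (1 - e_winner)  # big gain if the win was an upset
    loser["elo"] -= k * (1 - e_winner)   # symmetric loss keeps total rating points constant

An upset (a low-rated issue beating a high-rated one) produces a small e_winner and therefore a big swing; an expected win barely moves either score, which is exactly the behavior described above.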

Do that a few hundred times, and you’ve got a pretty reliable ranking with elo_sort.py.

./elo_sort.py --max-comparisons 0.8

Elo Issues Summary:
====================================================================================================
Rank  Elo   Severity  Type           Title/Message
----------------------------------------------------------------------------------------------------
1     1322  CRITICAL  FINDING        Amazon S3 bucket is verified to be publicly accessible
2     1260  MEDIUM    VULNERABILITY  python: tarfile module directory traversal
3     1246  MEDIUM    FINDING        Policy that allows full administrative privileges is attached
4     1245  HIGH      FINDING        The EC2 Instance has HIGH severity vulnerabilities
5     1212  CRITICAL  VULNERABILITY  golang: crypto/elliptic: IsOnCurve returns true for invalid field elements
6     1184  HIGH      FINDING        The Lambda function has HIGH severity vulnerabilities
7     1172  LOW       FINDING        RDS database instance is using the default port
8     1157  LOW       VULNERABILITY  golang: crypto/tls: session tickets lack random ticket_age_add
9     1116  MEDIUM    FINDING        Amazon GuardDuty is not enabled in this region
10    1086  LOW       FINDING        AWS Key Management Service (KMS) Customer Managed Keys (CMK) does not have key-rotation e...
====================================================================================================
Total issues prioritized: 10
Total runtime: 171.15 seconds

That last flag means “run 80% of all possible comparisons.” You can scale this up or down depending on time, budget, or caffeine levels.
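If you’re curious what that flag might translate to under the hood, here’s one way to sample a fraction of all possible pairs. This is an assumption about the implementation, not a line from elo_sort.py; compare_issues and update_elo are the helpers sketched earlier.

import itertools
import random

# Sample 80% of all possible pairs instead of comparing everything against everything.
all_pairs = list(itertools.combinations(issues, 2))
to_compare = random.sample(all_pairs, k=int(0.8 * len(all_pairs)))

for issue_a, issue_b in to_compare:
    verdict = compare_issues(issue_a, issue_b)  # same pairwise prompt as before
    winner, loser = (issue_a, issue_b) if verdict["higher_priority_issue"] == 1 else (issue_b, issue_a)
    update_elo(winner, loser)                   # Elo update sketched above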

Each issue also includes comparison history, just like in bubble sort. So you get the traceability, but without having to brute-force every possible matchup.

Elo is a better option when:

  • You want a solid ranking but have too many issues for full comparisons
  • You’re okay with a bit of randomness and some tuning
  • You want to run it regularly without waiting hours

How to optimize for your world

Write the prompt just for you

The LLM doesn’t know what’s critical to your company.

It doesn’t know you’ve got a particular customer you’re trying to please, or that the CTO has issued a mandate to fix a certain type of issue, or that your board wants to make sure you don’t fall victim to ransomware.

But you do. So tell it.

LLMs are flexible. You can feed them whatever extra context you want. Want to penalize unauthenticated internet exposure? Cool. Want to boost anything tagged payment-processing or prod? Easy. Just include the right fields in your input and tweak the prompt.

You can also adjust the model’s behavior by weighting certain phrases in the prompt. If you want to be more aggressive about public exposure, say that. If you care more about privilege escalation than data leaks, make it clear.
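For example, you might append a block like this to the comparison prompt. The wording and tags here are purely illustrative (prod-account-001 is just the account from the earlier example); tune them to whatever your organization actually cares about.

Additional organizational context:
- Resources tagged "payment-processing" or living in prod-account-001 are our most critical assets; weight issues affecting them heavily.
- Unauthenticated internet exposure of any production resource should outrank almost everything else.
- Privilege escalation paths matter more to us than data-exposure issues of similar severity.
- Issues in accounts tagged "sandbox" or "demo" are low priority unless they provide a path into production.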

Suddenly the model isn’t just guessing severity based on a title. It’s reasoning in your domain. The model is capable of nuance. You just have to steer it.

Pick a model that makes sense for your workflow

Some models are fast. Some are smart. Some are cheap. You don’t get all three.

There are plenty of models available in AWS Bedrock. Here are three options and how to actually think about using them:

anthropic.claude-3-5-haiku-20241022-v1:0

Fastest & cheapest

  • Latency: lowest of the family; Bedrock also offers a latency-optimized flavor (~60% faster).
  • Sweet spot: fire-and-forget triage on huge queues of issues, bulk pairwise comparisons, or any workflow where “good enough” beats “perfect” and cost/throughput matter most.
  • Limits: can gloss over edge-cases and will generalize if your prompt is vague.

anthropic.claude-3-5-sonnet-20240620-v1:0

Balanced middle ground

  • Performance: 2x the speed of the older Claude 3 Opus, with markedly better reasoning and coding than Haiku.
  • Sweet spot: everyday security triage where you need solid reasoning, fewer hallucinations, and still-snappy responses. Great for Elo-style ranking loops or layered comparisons.
  • Limits: a hair slower than Haiku; still not the deepest thinker available.

anthropic.claude-3-7-sonnet-20250219-v1:0

Heavyweight hybrid-reasoning model

  • Hybrid reasoning: can toggle between near-instant answers and an “extended-thinking” mode that self-reflects before responding (great for gnarly math, coding, or multi-step logic).
  • Context: 200K tokens, same as its 3.5 siblings, but markedly better at holding complex chains of thought.
  • Sweet spot: small-to-medium batches where airtight analysis matters more than raw speed.
  • Limits: higher latency than Haiku/3.5 Sonnet; if you’re constrained on time or budget, plan accordingly.

Plan for a future where models get better

It’s tempting to over-engineer everything now. Nail down the prompt. Normalize every field. Hardcode your logic.

But that just locks you into today’s limitations. At that point, you might as well write the code yourself instead of using an LLM.

LLMs are getting better, fast. If you design your system to rely on rigid field mappings and tight constraints, you’re betting against that improvement. You’re forcing a smart model to act like a dumb rule engine.

Instead, lean into the ambiguity. Let the model interpret. Use richer inputs. Leave room for nuance.

That way, when tomorrow’s model gets smarter, your system just gets smarter automatically.

Turn your comparison into an agent

Why stop at simple pairwise comparisons?

You’ve already got a reasoning engine. Let it do more. Give it tools. Let it research.

What if your comparison logic could also do enrichment and answer questions when data was missing, like:

  • Look up the latest CVE details in real time
  • Query your own environment to see if the affected asset is still live
  • Check CloudTrail or SIEM logs for signs of exploitation
  • Look at how long similar issues took to remediate last time
  • Adjust priority based on current incident context or attack campaigns

At that point, you’re not just ranking issues. You’re building a security analyst that never gets tired and actually reads the documentation.
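Here’s a hand-wavy sketch of what enrichment before comparison could look like. Every helper in it (lookup_cve, asset_is_live, recent_exploit_activity) is hypothetical and would need to be wired to your own CVE feed, asset inventory, and SIEM; the field names are assumptions too.

def enrich_issue(issue: dict) -> dict:
    """Attach extra context to an issue before it gets ranked."""
    if cve_id := issue.get("cveId"):
        issue["cveDetails"] = lookup_cve(cve_id)  # e.g. CVSS vector, known-exploited status
    if asset := issue.get("resourceArn"):
        issue["assetStillExists"] = asset_is_live(asset)                 # dead assets can sink to the bottom
        issue["signsOfExploitation"] = recent_exploit_activity(asset)    # CloudTrail / SIEM lookup
    return issue

# Enrich first, then hand the richer issues to whichever ranking loop you prefer.
issues = [enrich_issue(i) for i in issues]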

Make your backlog work for you

Security teams often don’t fail because they miss findings. They fail because they can’t act fast enough on the ones that matter.

Most backlogs are just piles of alerts. No context. No ranking. No hope. But with the right inputs and a little LLM-powered reasoning, you can start turning scanner noise into signal.

You’ve now got three interesting ways to do that (code on GitHub):

  • Naive scoring for speed and simplicity, but with occasionally weird results
  • Bubble sort for full-order confidence, but at great expense
  • Elo for scalable, nuanced rankings

Each one gives you explainability. Each one gives you structure. And each one is better than sorting by severityLevel and hoping for the best.

You don’t need to rebuild your whole process to get value here. Start small. Score your top 50 open issues. Feed them into Slack. See who complains.

Then keep going. Add metadata. Tune your prompts. Plug it into your ticketing flow. Over time, you’ll go from guessing what to fix to knowing what to fix, and being able to demonstrate why.

The scanners can scream all they want. You’ve got a system now.

And it’s not just smarter. It’s yours.

And if you don’t want to mess around with all that, or you just want something now, let our AI cloud security teammate Pleri do the prioritization for you, in Slack, in Jira, or wherever else you might work.
