Scraping vulnerability data from 100 different sources (without LLMs)


SecAlerts is a service that alerts users when vulnerabilities affect their software. It's now in its second iteration, with clients receiving vulnerability information from 100+ sources as soon as it's published.

However, this wasn't always the case. V1 of SecAlerts received the bulk of its data from NVD once CVEs had been analysed, which often meant a delay of several weeks, even months.

We began receiving questions from clients asking why a certain vulnerability hadn't been alerted when it was clearly in the public discourse. It became clear that the CVE process runs on its own "in due course" schedule, and when you're dealing with threats, delays aren't acceptable.

Looking at what else was on the market, there were no publicly available vulnerability feeds that aggregated vulnerabilities from all the well-known vendors' security advisories, and we believe this is still the case. So we knew we had to build our own.

One of the challenges is that each vendor tends to do things a little differently. Some will have an RSS feed or a JSON API. Some will have a blog. Some will have their own knowledge base system. Some will use bug trackers. And every now and then, some will actually follow a standard (anyone heard of CVRF?).

Due to the varied and inconsistent information sources, we opted to build a system that expected failure from the outset. Web scrapers (which we refer to as collectors) are inherently finicky, so the secret is to build accordingly.

Cheap collectors

We start with the assumption that a collector will need to be replaced. Why? Websites get rebuilt or redesigned, APIs get upgraded; the web is constantly changing shape. So, to make the process of building collectors cheap, we needed a framework that gave us quick development iteration. We developed a crawling engine that, while simple at its core, we could build upon with increasingly powerful middleware. One such middleware is the ability to snapshot entire HTTP requests and responses and replay them over and over, which is absolutely vital to quick iteration.
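As a rough sketch of the idea (the names and structure here are ours, not the engine's actual API), a snapshot middleware can wrap the fetch step, persisting each response the first time it's fetched and serving it from disk on every replay:

// Hypothetical snapshot/replay middleware; illustrative only, not the real engine code.
const fs = require("fs");
const path = require("path");
const crypto = require("crypto");

function snapshotMiddleware(fetchFn, snapshotDir) {
  return async (url, options = {}) => {
    const key = crypto.createHash("sha256").update(url).digest("hex");
    const file = path.join(snapshotDir, key + ".json");

    // Replay a stored response if one exists
    if (fs.existsSync(file)) {
      return JSON.parse(fs.readFileSync(file, "utf8"));
    }

    // Otherwise hit the network and persist the result for future runs
    const res = await fetchFn(url, options);
    const snapshot = { url, status: res.status, body: await res.text() };
    fs.writeFileSync(file, JSON.stringify(snapshot));
    return snapshot;
  };
}

With something like this in place, a collector can be re-run against a saved page as many times as needed while its selectors and mappers are refined.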

Extracting and finding data within DOM structures, then parsing said data, was absolutely tedious. To solve this we took a functional approach. We have two methods - extract and find - that take a path of selectors and functions. extract will return strings from the given path and find will return nodes. The result then gets mapped onto pure, reusable functions. For example, there is a mapping function called mappers.Date. If we want to extract a date given a CSS selector, it looks like this:

vuln.published = extract("#body .published/text").map(mappers.Date)[0]; // if the node wasn't found, we index on an empty array so end up with undefined

Notice that in our path we have a selector, then a slash, then a function named text. We can build up complex paths: selectors, then functions, then selectors again, eventually ending with a function that returns content - in this case the inner text, but it could be the innerHTML or an attribute. Most of these functions will be familiar to anyone who remembers the tree-traversal jQuery API, though we have some more complex ones, my favourite being a function that normalises a table by collapsing all of its colspans and rowspans.
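To make the path idea concrete, here is a minimal sketch of how a path like "#body .published/text" could be evaluated - splitting on slashes, treating known names as functions and everything else as a CSS selector. This is an illustration of the concept, not our actual engine, which is scoped to the page being collected and so doesn't take a root argument:

// Minimal illustration of path evaluation (assumed implementation, simplified).
const pathFunctions = {
  text: (nodes) => nodes.map((n) => n.textContent),
  html: (nodes) => nodes.map((n) => n.innerHTML),
  next: (nodes) => nodes.map((n) => n.nextElementSibling).filter(Boolean),
};

function extract(root, path) {
  let current = [root];
  for (const segment of path.split("/")) {
    if (pathFunctions[segment]) {
      // Known name: apply the function to the current set of nodes
      current = pathFunctions[segment](current);
    } else {
      // Anything else is treated as a CSS selector relative to the current nodes
      current = current.flatMap((n) => [...n.querySelectorAll(segment)]);
    }
  }
  return current; // an array of strings if the path ended with a content function
}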

We can then take this array of extracted text and map it against a huge library of functions that transform the text into a common format or extract something useful. Sometimes the mapping function itself returns an array, in which case we simply use flatMap instead.

vuln.referenceUrls = extract("#body h1/next/html").flatMap(mappers.URL);

In this example the mapping function could find many URLs, not just one. We have a growing number of mapping functions that collectors can share. They can be combined with other mapping functions and can be unit tested independently. Some more examples are:

  • mappers.Version — extract semver like patterns

  • mappers.CVSS — extract CVSS strings

  • mappers.Trim — remove trailing whitespace

  • mappers.StripHTML — remove HTML tags

  • mappers.ParseJSON — parse text as JSON

  • mappers.TrimPrefix — takes a string as an argument and returns a mapping function to trim a given prefix

This process makes development extremely quick. Whenever we encounter particularly hairy scenarios we can build testable, reusable mappers so we never need to solve the same problem again.
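For illustration, a mapper can be as small as a pure function from string to value, and argument-taking mappers like TrimPrefix become higher-order functions that return a mapping function. The bodies below are simplified guesses at what such mappers might look like, not the production implementations:

// Illustrative mapper shapes (simplified; assumed, not the production code).
const mappers = {
  Trim: (s) => s.trim(),
  StripHTML: (s) => s.replace(/<[^>]*>/g, ""),
  // Argument-taking mappers return a mapping function
  TrimPrefix: (prefix) => (s) =>
    s.startsWith(prefix) ? s.slice(prefix.length) : s,
  // Returning an array lets collectors use flatMap
  URL: (s) => s.match(/https?:\/\/[^\s"'<>]+/g) || [],
};

// Usage mirrors the collector code above:
// vuln.title = extract("h1/text").map(mappers.Trim)[0];
// vuln.referenceUrls = extract("#body h1/next/html").flatMap(mappers.URL);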

Parsing with reducers

A pattern we use to extract structured data from plain text fields is a concept for which I'm struggling to find a canonical name, so I'll call it the Regex Reducer pattern. We start with a source string and apply a list of regex replace functions, so on each iteration the source string gets simpler to parse. For example, say we want to find the author of a vulnerability and we have a free-text byline node that looks something like: "Louis Stowasser (@louisstow) from SecAlerts".

author = reduce(source, [
  {
    pattern: /(?:of|working with|from) ([\w\s']+)/i,
    replacer: (_, c) => { company = c.trim(); return ""; }
  },
  {
    pattern: /\(?(@[a-z0-9_]+)\)?/i,
    replacer: (_, t) => { social = t.trim(); return ""; }
  },
  { pattern: /\s+/, replacer: () => " " }
]).trim();

// author = Louis Stowasser
// company = SecAlerts
// social = @louisstow

In this example we work from top to bottom, matching against patterns, extracting parts of the string and removing what we matched so the next pattern can disregard it. This keeps each pattern much simpler than one giant regex and makes the whole thing more resilient to unexpected content. We use this pattern to do the heavy lifting for things like extracting product names, version strings and more.
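The reduce helper itself can be tiny. A minimal sketch, assuming the signature used above (each step's replacer captures data via closure and returns the replacement text), might be:

// Minimal sketch of the reduce helper (assumed implementation).
function reduce(source, steps) {
  return steps.reduce(
    (text, step) => text.replace(step.pattern, step.replacer),
    source
  );
}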

Health checks and canaries

Once we've built a new collector we need to observe it for the inevitable failure. We rely on two mechanisms to detect failure. The first is to mark the point in the collector where you would expect to find data; if we don't see anything by the end of the collector phase, we throw an error.

if (/* page we can ignore */) {
  return;
}

expectDataFromHere();

vuln.cveId = extract("h1/text").flatMap(mappers.CVE)[0];
// ...

throwIfNoData();

This way we can still exit early when we’re on a page that can be ignored, but when we know we should be expecting data we raise the flag and verify at the end of the collector.
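As a sketch of how these two guards could work (our reading of the idea, not the actual collector code), a flag is raised when data is expected and checked once the collector finishes:

// Illustrative implementation of the two guard functions (assumed).
let dataExpected = false;
const vuln = {};

function expectDataFromHere() {
  dataExpected = true;
}

function throwIfNoData() {
  const foundSomething = Object.values(vuln).some(
    (v) => v !== undefined && v !== null
  );
  if (dataExpected && !foundSomething) {
    throw new Error("Collector expected data but extracted nothing");
  }
}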

The second mechanism is to use selector paths as a canary. If the path does not resolve to a valid node, the page differs from what we expect and we throw an error. It's best to make this selector as specific as possible, so that any structural change to the page fails the canary test.

canary = [ "div#body div.cve h1" ];
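Checking the canaries is then just a loop over the declared selectors; a minimal sketch (again illustrative rather than the real implementation) could be:

// Illustrative canary check (assumed implementation).
function checkCanaries(document, canarySelectors) {
  for (const selector of canarySelectors) {
    if (!document.querySelector(selector)) {
      throw new Error('Canary failed: "' + selector + '" did not match any node');
    }
  }
}

// checkCanaries(document, ["div#body div.cve h1"]);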

No LLMs

Perhaps the most controversial part of our approach is that we use classical methods of data extraction. The reason is that we want to be able to follow our data collection trail from origin all the way to the final vulnerability object. We need consistency, and we need a clear, understandable mapping from input to output.

However, that's not to say AI has no place. We currently use AI to extract data from free-text fields when it isn't available in a structured form. But once we do have structured data from a collector, we prioritise it over any AI-sourced data, and we very clearly mark the latter as AI-derived.

In the future, we'd like to speed up development even further by using AI to generate the extract/find paths and mapping functions given an HTML page. We could even build a generic collector that tries to extract very basic information from arbitrary URLs using LLMs.

In the next instalment we'll cover what happens after the data has been collected: how we aggregate it and how we turn the collected semi-structured data into a final vulnerability object.
