Parsing Webpages with a LLM – Revisited

11 hours ago 1

I previously wrote about parsing websites and extract structured data, but that was in January 2025 and a lot has happened in the LLM sphere since then. The pace at which development moves forward is truly mind-boggling. Since then, I changed my mind about a couple of things, especially regarding the libraries that I would recommend.

llama.cpp > ollama

I won’t praise ollama anymore, instead I’ll recommend running llama.cpp. ollama is great to get a head start into the world of LLMs, because it makes installing and running your first LLM really easy, but the team behind it has shown behavior that raises red flags.

  • The project is essentially wrapping llama.cpp, but for a long time did not provide proper attribution, see ollama-3697. Even now you need to scroll down the whole README to find that attribution, which seems unfair as all the hard engineering work to make LLMs run fast on consumer hardware is done by the llama.cpp team.
  • ollama introduced their own format for storing LLMs on device for no particular reason, which is incompatible with the standard GGUF format, meaning that you cannot easily switch to other tools to run the same models that you already downloaded.
  • llama.cpp compiled from sources preforms better than ollama.
  • llama.cpp provides more feature and allows for greater control of said features.

PydanticAI > llama-index

In my post about RAG, I advertised llama-index, which was based on a survey of several AI libraries. I have since discovered PydanticAI, which is from the same team that brought us the fantastic Pydantic. Both libraries abstract away annoying details and boilerplate, while giving you layers of control from high-level down to the fundamentals, if you need it (and in the realm of LLMs, you often need to dig in deep). Most libraries only achieve the former, but fail at the latter. They also have excellent documentation. PydanticAI is great for extracting structured output, so we will use it here.

The task

With that out of the way, let’s revisit the task. In this post, I will let the LLM parse a web page to extract data and return it in a structured format. More specifically, I will read a couple of web pages from InspireHEP about a few scientific papers on which I am a co-author and then extract lists of references contained in these pages. Normally, one would write a parser to solve this task, but with LLMs we can skip that and just describe the task in human language. With the advent of strong coding models, there is also an interesting third option, the hybrid approach, where we let LLM write the grammar for a parser based on a bunch of example documents. The hybrid approach is arguably the best one if the structure of the source documents changes only rarely, because it provides deterministic outcomes and is much more energy efficient than using a LLM. LLMs are great for one-shot or few-shot tasks, where writing a parser would not make sense.

Disclaimer: I’ll note again that there are easier ways to solve this particular task: InspireHEP allows one to download information about papers in machine readable format (BibTeX and others). The point of this post is to show how to do it with an LLM, because that approach can also be used for other pages that do not offer access to their data in machine-readable format.

Converting dynamic web pages to Markdown

The code for this part was written by ChatGPT. We use Playwright to render the HTML a user would see in an actual browser. That’s important, because many websites are rendered dynamically with JavaScript, so that the raw HTML code does not contain the information we seek. Since the HTML downloaded by Playwright is still very cluttered and hard to read, we convert it with markdownify into simple Markdown, which is easier to read by humans and LLMs. This step removes lots of the HTML noise that deals with formatting. In signal processing terms, we increase the signal-to-noise ratio of the data. We save the Markdown files in the subdirectory scraped.

On Windows, the Playwright code cannot be run inside a Jupyter notebook, it is a long-standing issue. Playwright refuses to use its sync API when it detects that an event loop is running, and its async API fails on Windows with a NotImplementedError.

As a workaround, I run the code in a separate process, using joblib. If we weren’t running from a Jupyter notebook, we could also use a concurrent.future.ProcessPoolExecutor, but that doesn’t work in a notebook. joblib does some magic behind the scenes to enable this. As a sideeffect, this enables us to scrape multiple websites in parallel. We need to careful doing that too much, though, because websites, including Inspire, tend to block IPs that make too many calls in parallel.

from pathlib import Path import joblib def scrape_to_markdown(url: str, output_dir: Path): from playwright.sync_api import sync_playwright from markdownify import markdownify as md output_fn = url[url.index("://") + 3 :].replace("/", "_").replace(".", "_") + ".md" ofile = output_dir / output_fn if ofile.exists(): return f"Skipped {ofile}" with sync_playwright() as p: browser = p.chromium.launch(headless=True) page = browser.new_page() page.goto(url) # Wait for JavaScript-rendered content to load page.wait_for_load_state("networkidle") rendered_html = page.content() page.close() markdown_content = md(rendered_html) with open(ofile, "w", encoding="utf-8") as file: file.write(markdown_content) browser.close() return f"Saved {ofile}" scraped = Path() / "scraped" urls = """ https://inspirehep.net/literature/1889335 https://inspirehep.net/literature/2512593 https://inspirehep.net/literature/2017107 https://inspirehep.net/literature/2687746 https://inspirehep.net/literature/2727838 """.strip().split("\n") joblib.Parallel(n_jobs=4)( joblib.delayed(scrape_to_markdown)(url, scraped) for url in urls )
['Skipped scraped\\inspirehep_net_literature_1889335.md', 'Skipped scraped\\inspirehep_net_literature_2512593.md', 'Skipped scraped\\inspirehep_net_literature_2017107.md', 'Skipped scraped\\inspirehep_net_literature_2687746.md', 'Skipped scraped\\inspirehep_net_literature_2727838.md']

The content of an example files looks like this:

Measurement of prompt charged-particle production in pp collisions at $ \sqrt{\mathrm{s}} $ = 13 TeV - INSPIREYou need to enable JavaScript to run this app. [INSPIRE Logo](/) literature - Help - Submit - [Login](/user/login) [Literature](/literature) [Authors](/authors) [Jobs](/jobs) [Seminars](/seminars) [Conferences](/conferences) [Data](/data)BETA More... ## Measurement of prompt charged-particle production in pp collisions at s \sqrt{\mathrm{s}} s​ = 13 TeV - [LHCb](/literature?q=collaboration:LHCb) Collaboration - [Roel Aaij](/authors/1070843)( - [Nikhef, Amsterdam](/institutions/903832) ) Show All(972) Jul 28, 2021 35 pages Published in: - _JHEP_ 01 (2022) 166 - Published: Jan 27, 2022 e-Print: - [2107.10090](//arxiv.org/abs/2107.10090) [hep-ex] DOI: - [10.1007/JHEP01(2022)166](<//doi.org/10.1007/JHEP01(2022)166>) Report number: - LHCb-PAPER-2021-010, - CERN-EP-2021-110 Experiments: - [CERN-LHC-LHCb](/experiments/1110643) View in: - [CERN Document Server](http://cds.cern.ch/record/2777220), - [HAL Science Ouverte](https://hal.science/hal-03315290), - [ADS Abstract Service](https://ui.adsabs.harvard.edu/abs/arXiv:2107.10090) pdfciteclaim[datasets](/data/?q=literature.record.$ref:1889335) [reference search](/literature?q=citedby:recid:1889335)[32 citations](/literature?q=refersto:recid:1889335) ### Citations per year [...]

The web page also contains all the references cited by the paper. I skipped that part here, which is not of interest for us. In fact, one should cut that part away in order to help the model focus on the relevant text piece and to not waste time on processing irrelevant tokens.

The converted Markdown does not look perfect, the conversion garbled up the structure of the document. Let’s see whether the LLM can make sense of this raw text. We want it to extract the authors, the journal data, the title, and the DOI.

Extracting data from raw text with a LLM

In the original post, I used natural language to describe the structure of the output I want. Since then, models have become much better at returning structured output in form of JSON, and PydanticAI provides convenient tooling to return structured output with validation.

We don’t use ollama this time, but llama.cpp directly. For the model, we use the capable Qwen-2.5-coder-7b-instruct with the Q8_K quant and a 64000 token context window. It’s lauded on Reddit to be a good coding model for its size, and since a lot of computer code contains JSON, it should be good at producing that. I also experimented with some other models, see comments below.

We need to start the llama-server separately. There is no native support in PydanticAI for llama.cpp at this time, but since llama-server is OpenAI compatible, we merely need to adapt the OpenAI provider. This is an example of the great flexibility in PydanticAI that I mentioned at the beginning of the post.

from pydantic_ai import Agent, ModelSettings, capture_run_messages from pydantic_ai.providers.openai import OpenAIProvider from pydantic_ai.models.openai import OpenAIChatModel from pydantic import BaseModel, ConfigDict from pydantic.networks import HttpUrl from pydantic.types import PositiveInt from rich import print # Here we define the schema for the reference we want to extract. class Reference(BaseModel): title: str "Title of the paper" authors: list[str] "Authors, in the format 'First name Last name'" collaborations: list[str] "Collaborations involved in the paper, may be empty" journal: str | None "Journal name, leave empty if not published in a journal" volume: str | None "Volume, leave empty if not published in a journal" issue: PositiveInt | None "Issue number, leave empty if not provided" page: PositiveInt | None "Starting page number, leave empty if not published in a journal" year: PositiveInt "Year of publication" eprint: HttpUrl | None "URL to the arXiv preprint, leave empty if it is not provided" doi: str | None "DOI of the paper, leave empty if not provided" reports: list[str] "Report associated with the paper, leave empty if not provided" # enable docstrings for attributes model_config = ConfigDict(use_attribute_docstrings=True) agent = Agent( OpenAIChatModel( "", provider=OpenAIProvider(base_url="http://localhost:8080/v1"), settings=ModelSettings(temperature=0.5, max_tokens=1000), ), output_type=Reference, system_prompt="Extract a reference from the provided markdown.", instructions=""" - If you encounter LaTeX commands, copy them verbatim. - Journal references come in two formats: - *journal* volume (year) issue, page - *journal* volume (year) page [leave issue empty in this case] - Volume is not always numeric. - If two numbers follow the year in parentheses, and they are separated by a comma, the first is the issue, the second is the page. - Reports, if they exist, are listed after "Report number:". If this block doesn't exist, leave the `reports` field empty. """, ) documents = [fn.open(encoding="utf-8").read() for fn in scraped.glob("*.md")] # Trim off everything after the "### Citations per year" heading documents = [doc[: doc.index("### Citations per year")] for doc in documents] for doc in documents: with capture_run_messages() as messages: try: result = await agent.run(doc) print(result.output) except Exception as e: print(e) # If there is an error (typically a schema validation error), # print the messages for debugging. print(messages)
Reference( title='Measurement of prompt charged-particle production in pp collisions at s √{s} s\u200b = 13 TeV', authors=['Roel Aaij'], collaborations=['LHCb'], journal='JHEP', volume='01', issue=None, page=166, year=2022, eprint=HttpUrl('https://arxiv.org/abs/2107.10090'), doi='10.1007/JHEP01(2022)166', reports=['LHCb-PAPER-2021-010', 'CERN-EP-2021-110'] )
Reference( title='The Muon Puzzle in cosmic-ray induced air showers and its connection to the Large Hadron Collider', authors=['Johannes Albrecht', 'Lorenzo Cazon', 'Hans Dembinski', 'Anatoli Fedynitch', 'Karl-Heinz Kampert'], collaborations=[], journal='Astrophys.Space Sci.', volume='367', issue=3, page=27, year=2022, eprint=HttpUrl('https://arxiv.org/abs/2105.06148'), doi='10.1007/s10509-022-04054-5', reports=[] )
Reference( title='A new maximum-likelihood method for template fits', authors=['Hans Peter Dembinski', 'Ahmed Abdelmotteleb'], collaborations=[], journal='Eur.Phys.J.C', volume='82', issue=None, page=1043, year=2022, eprint=HttpUrl('https://arxiv.org/abs/2206.12346'), doi='10.1140/epjc/s10052-022-11019-z', reports=[] )
Reference( title='The muon measurements of Haverah Park and their connection to the muon puzzle', authors=['L. Cazon', 'H.P. Dembinski', 'G. Parente', 'F. Riehn', 'A.A. Watson'], collaborations=[], journal='PoS', volume='ICRC2023', issue=None, page=431, year=2023, eprint=None, doi='10.22323/1.444.0431', reports=[] )
Reference( title='Bias, variance, and confidence intervals for efficiency estimators in particle physics experiments', authors=['Hans Dembinski', 'Michael Schmelling'], collaborations=[], journal=None, volume=None, issue=None, page=None, year=2021, eprint=HttpUrl('https://arxiv.org/abs/2110.00294'), doi=None, reports=[] )

The results are amazing and much more consistent and detailed than those obtained in January. Thanks to LLM training and PydanticAI, the output of the LLM is constrained to follow the schema, and can be readily consumed by a program. The model is also clever enough to find the relevant information on the website, even without us describing were to look. I noticed issues with detecting the issue and reports, but instructions to the model in natural language fixed those.

PydanticAI enforces a good prompt structure and good practices, with its split of system_prompt and instructions and embedding information about the output format into the JSON schema.

For a proper validation of the accuracy, we would have to run validation tests against a ground truth.

Bonus: Testing other models

I tested this with some other models, all of which performed worse on this task that the chosen one. Some models work better or at all with the NativeOutput mode of PydanticAI.

  • Qwen-2.5-coder-7b-instruct: Q8_0
    • Works very well out of the box, some small issues were fixed with prompting. Reduced performance with Pydantic’s NativeOutput mode.
  • gpt-oss-20b: mxfp4
    • Couldn’t adhere to the format, it failed to produce valid URLs for the eprint field.
  • Gemma-3-12b-it: Q4_0
    • Doesn’t work with Pydantic, it’s complaining that user and model messages must alternate.
  • Qwen3-4B-Thinking-2507: Q6_K
    • Fails to return the result via tool call. It works with NativeOutput, but fails to produce valid
      URLs for the eprint field.

Some issues that these models have, like those with the eprint field, could probably be fixed with prompts that address the specific errors.

Seeing how sensitive the performance is to prompting, even when the task is to produce well-structured output, gives me pause about trusting public benchmarks.

Read Entire Article