
I previously wrote about parsing websites and extracting structured data, but that was in January 2025, and a lot has happened in the LLM sphere since then. The pace of development is truly mind-boggling. I have changed my mind about a couple of things, especially regarding the libraries that I would recommend.
llama.cpp > ollama
I won’t praise ollama anymore; instead, I recommend running llama.cpp. ollama is great for getting a head start in the world of LLMs, because it makes installing and running your first LLM really easy, but the team behind it has shown behavior that raises red flags.
- The project is essentially wrapping llama.cpp, but for a long time did not provide proper attribution, see ollama-3697. Even now you need to scroll down the whole README to find that attribution, which seems unfair as all the hard engineering work to make LLMs run fast on consumer hardware is done by the llama.cpp team.
- ollama introduced their own format for storing LLMs on device for no particular reason, which is incompatible with the standard GGUF format, meaning that you cannot easily switch to other tools to run the same models that you already downloaded.
- llama.cpp compiled from source performs better than ollama.
- llama.cpp provides more features and allows for greater control of said features.
PydanticAI > llama-index
In my post about RAG, I advertised llama-index, which was based on a survey of several AI libraries. I have since discovered PydanticAI, which is from the same team that brought us the fantastic Pydantic. Both libraries abstract away annoying details and boilerplate, while giving you layers of control from high-level down to the fundamentals, if you need it (and in the realm of LLMs, you often need to dig in deep). Most libraries only achieve the former, but fail at the latter. They also have excellent documentation. PydanticAI is great for extracting structured output, so we will use it here.
The task
With that out of the way, let’s revisit the task. In this post, I will let the LLM parse a web page to extract data and return it in a structured format. More specifically, I will read a couple of web pages from InspireHEP about a few scientific papers on which I am a co-author and then extract lists of references contained in these pages. Normally, one would write a parser to solve this task, but with LLMs we can skip that and just describe the task in human language. With the advent of strong coding models, there is also an interesting third option, the hybrid approach, where we let an LLM write the grammar for a parser based on a bunch of example documents. The hybrid approach is arguably the best one if the structure of the source documents changes only rarely, because it provides deterministic outcomes and is much more energy efficient than using an LLM. LLMs are great for one-shot or few-shot tasks, where writing a parser would not make sense.
Disclaimer: I’ll note again that there are easier ways to solve this particular task: InspireHEP allows one to download information about papers in machine readable format (BibTeX and others). The point of this post is to show how to do it with an LLM, because that approach can also be used for other pages that do not offer access to their data in machine-readable format.
Converting dynamic web pages to Markdown
The code for this part was written by ChatGPT. We use Playwright to render the HTML a user would see in an actual browser. That’s important, because many websites are rendered dynamically with JavaScript, so that the raw HTML code does not contain the information we seek. Since the HTML downloaded by Playwright is still very cluttered and hard to read, we convert it with markdownify into simple Markdown, which is easier to read by humans and LLMs. This step removes lots of the HTML noise that deals with formatting. In signal processing terms, we increase the signal-to-noise ratio of the data. We save the Markdown files in the subdirectory scraped.
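A minimal sketch of such a rendering step, not the code used for this post. It assumes that Playwright (with its Chromium browser) and markdownify are installed; the function names and the file naming scheme are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_path(url: str, out_dir: str = "scraped") -> Path:
    """Derive a Markdown file name from the URL path (hypothetical naming scheme)."""
    slug = urlparse(url).path.strip("/").replace("/", "_") or "index"
    return Path(out_dir) / f"{slug}.md"


def scrape_to_markdown(url: str, out_dir: str = "scraped") -> Path:
    """Render the page in headless Chromium and save it as Markdown."""
    # Third-party imports are kept local so that output_path stays usable
    # without Playwright and markdownify installed.
    from markdownify import markdownify
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # "networkidle" waits until JavaScript-driven requests have settled.
        page.goto(url, wait_until="networkidle")
        html = page.content()  # HTML after JavaScript has run
        browser.close()

    path = output_path(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdownify(html), encoding="utf-8")
    return path
```

Calling `scrape_to_markdown(url)` for a page would write one Markdown file into the scraped subdirectory and return its path.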
On Windows, the Playwright code cannot be run inside a Jupyter notebook, which is a long-standing issue. Playwright refuses to use its sync API when it detects that an event loop is running, and its async API fails on Windows with a NotImplementedError.
As a workaround, I run the code in a separate process, using joblib. If we weren’t running from a Jupyter notebook, we could also use a concurrent.futures.ProcessPoolExecutor, but that doesn’t work in a notebook. joblib does some magic behind the scenes to enable this. As a side effect, this enables us to scrape multiple websites in parallel. We need to be careful not to overdo this, though, because websites, including Inspire, tend to block IPs that make too many calls in parallel.
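The workaround could look roughly like the following sketch. It assumes that joblib is installed; `scrape_one` stands for any function that scrapes a single URL, and `capped_jobs` and the default limit of two parallel workers are hypothetical choices to stay polite towards the server.

```python
def capped_jobs(n_urls: int, max_parallel: int = 2) -> int:
    """Number of worker processes: never more than max_parallel, never fewer than 1."""
    return max(1, min(n_urls, max_parallel))


def scrape_all(urls, scrape_one, max_parallel: int = 2):
    """Call scrape_one(url) for every URL in worker processes managed by joblib."""
    from joblib import Parallel, delayed  # third-party, imported lazily

    urls = list(urls)
    n_jobs = capped_jobs(len(urls), max_parallel)
    # Results come back in the same order as the input URLs.
    return Parallel(n_jobs=n_jobs)(delayed(scrape_one)(url) for url in urls)
```

Keeping `max_parallel` small trades speed for a lower risk of being blocked.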
The content of an example file looks like this:
The web page also contains all the references cited by the paper. I skipped that part here because it is not of interest to us. In fact, one should cut that part away in order to help the model focus on the relevant piece of text and to not waste time on processing irrelevant tokens.
The converted Markdown does not look perfect; the conversion garbled the structure of the document. Let’s see whether the LLM can make sense of this raw text. We want it to extract the authors, the journal data, the title, and the DOI.
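Cutting away the reference list can be done with plain string handling before the text is sent to the model. The sketch below assumes that the unwanted part starts at a line beginning with the word "References", possibly behind Markdown heading markers; that marker is a guess and needs to be checked against the actual scraped files.

```python
def drop_after_marker(markdown: str, marker: str = "References") -> str:
    """Return the text before the first line that starts with the marker.

    Leading '#' characters and spaces are ignored when matching, so both
    'References' and '### References (42)' count as the start of the section.
    """
    kept = []
    for line in markdown.splitlines():
        if line.lstrip("# ").startswith(marker):
            break
        kept.append(line)
    return "\n".join(kept)
```

If the marker never occurs, the text is returned unchanged apart from a possible trailing newline.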
Extracting data from raw text with an LLM
In the original post, I used natural language to describe the structure of the output I want. Since then, models have become much better at returning structured output in form of JSON, and PydanticAI provides convenient tooling to return structured output with validation.
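To illustrate what such a schema can look like, here is a sketch of a Pydantic model for one paper. The class name, the field names, and the descriptions are assumptions based on the fields mentioned in this post, not the exact schema used for the results.

```python
from typing import Optional

from pydantic import BaseModel, Field, HttpUrl


class Paper(BaseModel):
    """Bibliographic data for one paper (hypothetical schema)."""

    title: str
    authors: list[str] = Field(description="Author names in the order given on the page")
    journal: Optional[str] = Field(default=None, description="Journal name, if published")
    issue: Optional[str] = Field(default=None, description="Journal issue, if given")
    doi: Optional[str] = Field(default=None, description="DOI without a URL prefix")
    # HttpUrl makes validation fail when the model returns something that is not a URL.
    eprint: Optional[HttpUrl] = Field(default=None, description="URL of the preprint")
    reports: list[str] = Field(default_factory=list, description="Report numbers")
```

The field descriptions end up in the JSON schema that the model sees, so they double as short instructions for each field.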
We don’t use ollama this time, but llama.cpp directly. For the model, we use the capable Qwen-2.5-coder-7b-instruct with the Q8_0 quant and a 64000-token context window. It’s lauded on Reddit as a good coding model for its size, and since a lot of computer code contains JSON, it should be good at producing that. I also experimented with some other models, see comments below.
We need to start the llama-server separately. There is no native support in PydanticAI for llama.cpp at this time, but since llama-server is OpenAI compatible, we merely need to adapt the OpenAI provider. This is an example of the great flexibility in PydanticAI that I mentioned at the beginning of the post.
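The following sketch shows one way to wire this up. It assumes that llama-server listens on its default port 8080 and that a recent PydanticAI release is installed; class names have changed between PydanticAI versions, so the imports may need adjusting. The model path, the model name "local-model", the dummy API key, and both prompt texts are placeholders.

```python
def llama_server_command(model_path: str, ctx_size: int = 64000, port: int = 8080) -> list[str]:
    """Command line for llama-server; run it in a terminal or via subprocess.Popen."""
    return ["llama-server", "-m", model_path, "-c", str(ctx_size), "--port", str(port)]


def make_agent(output_type, port: int = 8080):
    """Create a PydanticAI agent that talks to a local llama-server."""
    # Third-party imports are local so that the command helper works without them.
    from pydantic_ai import Agent
    from pydantic_ai.providers.openai import OpenAIProvider

    try:
        from pydantic_ai.models.openai import OpenAIChatModel
    except ImportError:  # older PydanticAI releases call it OpenAIModel
        from pydantic_ai.models.openai import OpenAIModel as OpenAIChatModel

    # llama-server exposes an OpenAI-compatible API under /v1.
    provider = OpenAIProvider(base_url=f"http://localhost:{port}/v1", api_key="unused")
    model = OpenAIChatModel("local-model", provider=provider)
    return Agent(
        model,
        output_type=output_type,
        system_prompt="You extract bibliographic data from web pages.",
        instructions="Return only information that appears in the text.",
    )
```

With the server running, `make_agent(SomeSchema).run_sync(markdown_text).output` should return a validated instance of the schema, where `SomeSchema` is any Pydantic model describing the desired output.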
The results are amazing and much more consistent and detailed than those obtained in January. Thanks to LLM training and PydanticAI, the output of the LLM is constrained to follow the schema, and can be readily consumed by a program. The model is also clever enough to find the relevant information on the website, even without us describing where to look. I noticed issues with detecting the issue and reports, but instructions to the model in natural language fixed those.
PydanticAI enforces a good prompt structure and good practices, with its split of system_prompt and instructions and embedding information about the output format into the JSON schema.
For a proper validation of the accuracy, we would have to run validation tests against a ground truth.
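Such a test could be as simple as comparing the extracted records field by field with hand-checked ones. The sketch below assumes that both are available as lists of dictionaries in the same order; the function name and the exact-match criterion are illustrative choices.

```python
def field_accuracy(predicted: list[dict], truth: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records for which each field matches the ground truth exactly."""
    if not truth or len(predicted) != len(truth):
        raise ValueError("need two non-empty lists of equal length")
    n = len(truth)
    return {
        field: sum(p.get(field) == t.get(field) for p, t in zip(predicted, truth)) / n
        for field in fields
    }
```

Exact matching is strict; for author names or titles a normalized comparison (case, whitespace) may be more appropriate.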
Bonus: Testing other models
I tested this with some other models, all of which performed worse on this task than the chosen one. Some models work better, or only work at all, with the NativeOutput mode of PydanticAI.
- Qwen-2.5-coder-7b-instruct: Q8_0
  - Works very well out of the box; some small issues were fixed with prompting. Reduced performance with PydanticAI’s NativeOutput mode.
- gpt-oss-20b: mxfp4
  - Couldn’t adhere to the format; it failed to produce valid URLs for the eprint field.
- Gemma-3-12b-it: Q4_0
  - Doesn’t work with PydanticAI; it complains that user and model messages must alternate.
- Qwen3-4B-Thinking-2507: Q6_K
  - Fails to return the result via tool call. It works with NativeOutput, but fails to produce valid URLs for the eprint field.
Some issues that these models have, like those with the eprint field, could probably be fixed with prompts that address the specific errors.
Seeing how sensitive the performance is to prompting, even when the task is to produce well-structured output, gives me pause about trusting public benchmarks.


