Here I attempt to use LLM tooling to parse the Spring Protocol Spec, which exists as an HTML page, and extract it to a JSON file.
This is my first serious attempt at deriving value from LLM usage.
I haven't had a good experience at my day job attempting to use LLMs for any productivity gain, but I recognize I didn't provide the best environment and tasks for the job.
Here we provide an environment and task that seem better suited for LLMs and:
- Assess tooling for LLM usage
- Assess LLM as a tool for this particular context
- Record the experience
- Conclude learnings and possible improvements
The end consumable must be a JSON file that specifies the Spring Protocol well enough to drive implementations. Think of the LSP JSON spec providing metadata for type generation with LSPCodegen as an example.
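To make the target concrete, one entry in such a file could look roughly like the sketch below. The field names and the trimmed-down LOGIN example are my own illustration of the idea, not the actual schema produced later:

```json
{
  "command": "LOGIN",
  "source": "client",
  "arguments": [
    { "name": "userName", "type": "string" },
    { "name": "password", "type": "string" }
  ],
  "description": "Sent by the client to log into the lobby server.",
  "responses": ["ACCEPTED", "DENIED"]
}
```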
Note
I don't claim to be providing a useful tutorial or rigorously scrutinized methodology and conclusions. I don't claim to be an expert at using the tools mentioned here, nor that the issues I encounter are necessarily issues with the tools themselves. This is just a report of my experience, for my particular context.
Spring is an RTS game engine that was soft-forked and de facto superseded by Recoil, with games eventually moving over to it (with few exceptions, if any).
The terms Spring and Recoil are used interchangeably here, since Recoil is mostly compatible with Spring (before 106, the version games could still run on) but does not have the same amount of documentation. For our purposes we assume we are in the context of Recoil.
The so-called "Spring Protocol" is not a native engine concept but rather a protocol for communication between lobby servers and clients, one that used to be dominant in most Recoil games.
Some reference implementations:
Good enough: the best approach provided 80% of what I needed in 30min-1hr, accounting for the time spent setting things up.
I wasn't able to get a precise estimate of the cost of the productive part of the session, but I'd say it was around $10-$15.
Maybe I could have gotten far with some smart sed-ing, but I believe the LLM provided more value than what I'd have managed to achieve manually.
See the Results and Conclusion sections for more details.
We use Neovim as our IDE, with avante.nvim and MCPHub.nvim available and properly configured.
We use Anthropic, with whatever model avante decides to use for the "claude" provider at the time we installed and configured it (the exact model should be visible in the prompts shared in sibling files).
We eventually discover that the agent wants to use some run_python MCP server. We install pydantic/mcp-run-python and, after a failed attempt by the agent to find the files we included in the context, we install radek-baczynski/mcp-run-python instead.
Note that I didn't sandbox the mcp-run-python MCP server properly and it's mounting my home directory at /workspace. Don't be lazy like me!
The installation process is relatively straightforward, but one concern is that the agent doesn't know up front that these MCP capabilities are unavailable; it would be better if it asked for them to be provisioned ahead of time, or offered an alternative, instead of just trying to run them directly.
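For reference, such a server ends up wired into MCPHub through its servers.json. Below is only a rough sketch of the shape of an entry, with the docker image and paths as placeholders rather than the real mcp-run-python invocation (check the server's README for the actual command); the point is that mounting a dedicated project directory instead of the whole home directory is a one-line change:

```json
{
  "mcpServers": {
    "run_python": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "-v", "/home/me/projects/spring-spec:/workspace",
        "example/mcp-run-python-image"
      ]
    }
  }
}
```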
We don't provide any particular prompt as context that I am aware of; avante can do this automatically (preset context prompts), but I haven't explicitly told it to do so.
We feed the protocol page HTML to some online HTML-to-Markdown converter, with usable results. We remove all content that is not the list of what the page describes as "commands", and provide the result as spring_commands.md.
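If you'd rather not depend on an online converter, the same conversion can be done locally. A minimal sketch, assuming the html2text package and a locally saved copy of the page (file names are placeholders):

```python
# Convert a saved copy of the protocol page to markdown locally.
# Assumes `pip install html2text`; file names are placeholders.
import html2text

with open("spring_protocol.html", encoding="utf-8") as f:
    html = f.read()

markdown = html2text.html2text(html)

with open("spring_commands.md", "w", encoding="utf-8") as f:
    f.write(markdown)
```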
We start simple: provide a small sample of the file and require only the command, source, and arguments fields to be parsed.
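For a sense of what the agent ends up writing through run-python, a minimal sketch of such a parser is shown below. The heading and list format assumed here (`## COMMAND:source` followed by a bullet per argument) is illustrative; the actual layout of spring_commands.md differs in the details:

```python
# Extract command, source and argument names from a markdown command list.
# The heading/list layout assumed here is illustrative, not the exact
# spring_commands.md format.
import json
import re

def parse_commands(markdown: str) -> list[dict]:
    commands = []
    current = None
    for line in markdown.splitlines():
        heading = re.match(r"^##\s+(\w+):(\w+)\s*$", line)
        if heading:
            current = {"command": heading.group(1),
                       "source": heading.group(2),
                       "arguments": []}
            commands.append(current)
        elif current is not None:
            arg = re.match(r"^\s*[-*]\s+`?(\w+)`?", line)
            if arg:
                current["arguments"].append({"name": arg.group(1)})
    return commands

if __name__ == "__main__":
    with open("spring_commands.md", encoding="utf-8") as f:
        print(json.dumps(parse_commands(f.read()), indent=2))
```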
We notice the agent wants to use the run-python MCP capability, but we don't have it available. The LLM still provides the correct result, even without the tool.
We make a few more attempts, trying to provide the run-python capability without changing the task, until we get a correct result.
Note
Consider "correct" above as "looks correct enough"
After we confirm run-python works as expected we gradually increase the complexity of the requests, confirming correct results at each attempt.
The increase in complexity stems from including fields that contain a mix of HTML and Markdown, and a few that require some relatively complex natural-language understanding (e.g. "read the description and figure out the possible responses for this command").
The agent seems to have been able to complete these tasks with flying colors.
A few notes:
- The approval requests for MCP operations are annoying; in my setup there is no "Approve All" option. Note this is not the same as the other avante-supported operations (which do have an "Approve All" option). It might be some configuration I'm not aware of.
- A few annoying issues with avante not playing nice with the rest of my Neovim setup.
- When I issue stop to avante, the internal request gets cancelled but the window still shows `* tool calling`. That's a bit concerning as I can't be sure processing has stopped, though it seems like it has.
I finally ask the agent to run on the full file, rewriting the file to include the rest of the content. That's where I start having some issues:
- Maybe avante didn't update the file's contents in the agent's session? It provided the correct result for the small sample again. I remove and re-add the file to the session, asking the agent how many lines the file contains.
- Simple things like asking how many lines the file contains cause the agent to produce an overly verbose response I don't require, even trying to do further processing after it found the answer. Maybe I should make the expected output clearer in the prompt?
Unfortunately I thought all the prompts and responses had been lost due to the frustration of the process and some network issue, but after I closed my Neovim session I was able to find the prompt logs in ~/.local/state/nvim/avante.
These are available at attempt-1-prompts.json.
Total cost for the session: $21.85
Let's try a one-shot attempt. Prompt at [attempt-2-prompt.md](https://gist.github.com/badosu/5c28379f5a9f218e5163717dd2a8d75e), logs at attempt-2-logs.json.
It seems like the agent tried to run Python directly on my machine without using MCP? I already went through the effort of installing the MCP server, so we'll need another attempt.
Further attempts: issues with setting up mcp-run-python and making the provided file available; mounting is apparently problematic when you try to use relative directories. It seems my previous attempts were injecting the contents of the spec file directly into the code to be executed.
We'll ignore all the work to figure things out and call the next useful attempt Attempt #3.
Same-ish prompt as Attempt #2 with some minor corrections.
We notice that things work out much better than in our frustrated attempts after the first one:
- A lot of iterative steps, refining the parser
- When saving the output it borks out, maybe related to the way the workspace is mounted by run_python_code or the usage of the tool itself.
The resulting JSON spec is available at spring-protocol-1.json. It looks alright but requires a few adjustments. After a few more prompts I realized I had provided a misleading instruction: that the response type could be figured out from the description (while it was actually present after a `### Response` heading).
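In hindsight that part didn't need natural-language inference at all; a section like that can be pulled out mechanically. A minimal sketch, assuming each command's block contains a `### Response` heading followed by the response command names (the exact layout in spring_commands.md may differ):

```python
# Pull the response command names listed under a "### Response" heading
# inside one command's markdown block. The block layout assumed here is
# illustrative.
import re

def extract_responses(command_block: str) -> list[str]:
    match = re.search(r"### Response\n(.*?)(?:\n#|\Z)", command_block, re.S)
    if not match:
        return []
    # Response command names in the spec are upper-case words, e.g. ACCEPTED, DENIED.
    return re.findall(r"\b[A-Z][A-Z0-9_]{2,}\b", match.group(1))
```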
Further prompts lead to spring-protocol-2.json, which is where I decided to end my journey.
Total cost for the session: about $10 counting the previous attempts.
Initial progress was good, even though the setup was not working as well as I thought: the MCP tools were not correctly configured.
Attempt 1 was very impressive; in that case we built up our intended result over multiple iterations. This attempt was very quick, taking about 30min-1hr even accounting for setting things up.
The one-shot attempts didn't work as well. I'm ashamed to admit I spent way too much time trying to get them "just right" instead of approaching the problem differently.
The final result is serviceable, requiring much fewer corrections than I expected.
Given the complexity of the task at hand I'd say the result is a success.
Quick, less ambitious first attempts: 30min-1hr. They provided 80% of what I wanted. They could probably have kept providing value quickly, until the LLM started to tunnel-focus and try ever more complex approaches, consuming resources and probably polluting the context (?).
All the fixing and the stubbornness of trying a one-shot solution: about 5-6 hours, though unfortunately this includes the writing of this piece (so I wouldn't lose my train of thought) as well as all the fixing. I'd say the actual time spent with the LLM was much lower, about 20min-1hr.
Total cost of the experiment: $32.
- It's better to start small and build up. Don't try to one-shot your requests.
- Focus on the 80%, accept imperfections when the complexity of the task is high.
- Imperfections in the initial prompt can cascade into long attempts to correct mistakes you caused yourself, gaslighting the LLM into an unproductive session.
- Pay attention when the agent starts to seemingly enter a loop of trying ever more complex tasks without asking for supervision.
- It's easy for the agent to consume a huge amount of resources when it enters "tunnel vision".
- Regarding the two items above: when this happens it's probably a "prompt smell". Approach the problem differently.
- Spend more time providing better context in the initial prompt instead of correcting in follow-up prompts, which increases the possibility of "tunnel vision" and "task loops".
- Consider prompting for how early you want results to be returned and how accurate they need to be, and ask the agent to request confirmation or corrections (see the example prompt after this list).
- Avante works well, though there are some annoying situations where closing some windows/buffers interacts badly with autocmds, spamming errors.
- MCPHub works well too, though it's hard at first to figure out how to use it best. There are a lot of hidden functionalities that can save you a lot of time. Expand the available MCP servers and their tools; you can issue tool invocations manually, so you don't have to spend resources on LLM calls and can debug the setup in a more focused way.
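As a hypothetical example of the "prompting how early you want the results" item above, an instruction along these lines could be added to the initial prompt (the wording is mine, not taken from the actual prompts used):

```
Parse only the first 5 commands and return a draft JSON for them, then stop.
Field names must be exact; descriptions may be rough for now.
Ask me to confirm the structure before processing the rest of the file,
and ask before attempting anything more complex than plain text parsing.
```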