AI agent development tools landscape study

If you want to implement AI in your automation processes, you can choose between AI workflows and AI agents - as defined by the folks at Anthropic - where workflows embed LLM calls within predefined conditional automation logic, while agents dynamically direct their own processes and tool usage.

Both AI workflows and AI agents depend on the non-deterministic outputs of large language models. These varied outputs make AI-based automation volatile, a behavior that is not suitable for enterprise-grade applications. To address the risks associated with unpredictability and to enable more complex reasoning flows, you need to define deterministic logic that controls the agents’ inputs and outputs. This is a natural part of AI workflows, but agents should also be subject to conditional logic written by humans.

I write this evaluation with the perspective that AI Agents have self-determination, but their inputs, outputs, and actions are confined by a set of acceptable use rules written by humans.

While the most common way of defining deterministic logic is through code, writing agentic systems from scratch purely in code is both time-consuming and expensive. This is a major deterrent, especially if you intend to bring your AI features to market sooner, and/or have not yet seen returns on any AI features deployed so far.

The most efficient tools for writing agentic AI systems offer the following:

  • No-code workflow development interfaces that allow users to define a series of logical steps using a drag-and-drop canvas
  • Code-based capabilities, where each step in a workflow can (but does not have to) be defined at a low level using scripting, configuration files, calls to external libraries, and other tools that developers are used to.
  • Integrations with third-party systems, which first and foremost include cloud-hosted large language models, but just as importantly include tools in your technology stack, such as ITSM, CRM, security, databases, and so on.

Almost every vendor I’ve encountered that positions itself as some flavor of agentic AI development tool offers all of the above. I found it interesting that almost every tool on the market has opted for a no-code workflow development GUI. This is to be expected from tools that have been around since before AI agents were a (big) thing, but it’s also the case for new tools. I believe this is because no-code workflow-based automation is the best way to write deterministic logic with minimal investment and training. Considering the above, I can split these vendors into two groups:

Native Agentic AI development tools

These are mostly startups that have built their platforms specifically (and exclusively) for developing AI agents.

Workflow automation tools that pivoted into AI agents

These vendors found themselves in a great position to hop onto the biggest bandwagon of the 2020s.

This is by no means an exhaustive list. For example, I found some tools that position themselves as AI agent development tools but do not use no-code/low-code workflow-based automation. These include tools such as LangSmith, Crew AI, Restack, and Writer.com. At the time of writing, I do not have more information on how these tools compare with the vendors included in this report, but if you are interested in developing AI agents without the workflow-based automation component, I would encourage you to look into them as well.

Workflow automation tools have some distinct maturity advantages, particularly around depth and breadth of integrations, which provide tried-and-tested methods of bringing a tech stack together without writing custom connectors for each application. Native Agentic AI dev tools, on the other hand, have the agility advantage and don’t have to retrofit AI to an existing product.

This piece of research aims to evaluate how extensive tools’ capabilities are for writing agentic AI systems and integrating them into an organization’s stack. The evaluation criteria revolve around coding capabilities, including configuring LLMs and associated frameworks, and the tools’ integrability, as measured by out-of-the-box content, ease of use, and reliability.

In short, we’re evaluating codability and integrability capabilities for writing AI agents.

This means that we’re only evaluating two of the three characteristics used above to define what an agentic AI development tool is. We’re not assessing the no-code interface component for three reasons. First, the scope would be too large. Second, we’re using a Cartesian chart with two dimensions, and it would be unfeasible to introduce a third. Last, the range of capabilities of the no-code interface will not have as much of an impact on the agentic system as the codability and integrability components.

If you are deciding between two tools that are comparable on codability and integrability, the extensiveness of the no-code interface will be a great tiebreaker.

Turing completeness

Many - if not all - of these tools are Turing complete, which means they can, in principle, perform any computation. This makes it difficult to prove or disprove that a tool can do something, because a Turing-complete tool can technically do anything. The same goes for integrations, where I often hear that “the tool can do anything because it integrates with APIs”.

While this is technically true, it makes it difficult and frustrating to select one tool over another. So, to circumvent some of these gotchas, all the definitions in this report lean as much as possible toward native, out-of-the-box capabilities. If I want to normalize some data, I don’t want to write a custom Python script from scratch. I want a normalization node, block, or action where I specify that all dates must be in yyyyMMdd format. Otherwise, there’s no point in using the tool over straight Python.
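
For contrast, here is roughly what the straight-Python route looks like: a minimal sketch of date normalization to yyyyMMdd, assuming hypothetical field names and a hand-maintained list of input formats. A native normalization node would collapse all of this into a single configuration setting.

```python
from datetime import datetime

# Hypothetical list of input formats; a hand-written script has to enumerate
# every format that shows up across the systems being integrated.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"]

def normalize_date(value: str) -> str:
    """Try each known format and return the date as yyyyMMdd."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

# Hypothetical record whose date fields arrive in different formats
record = {"created_at": "2024-03-07", "closed_at": "07/03/2024"}
normalized = {field: normalize_date(value) for field, value in record.items()}
print(normalized)  # {'created_at': '20240307', 'closed_at': '20240307'}
```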

As expected, solutions established before LLMs rose to popularity score much higher on the integrability axis. This means that these tools have more mature capabilities for integrating with your existing tech stack. Integrability is also a set of capabilities that takes a long time to develop.

Writing integrations, engaging with technology partners, building a community, and offering out-of-the-box content are long-term exercises. It is not an apples-to-apples comparison to look at a 2024-established Stack-AI against a 2013-established Workato in terms of integrability features. Tools that score well on integrability and lower on codability are also great contenders for less complex agentic systems. If you don’t need orchestrated multi-agent setups working in parallel, but still need to integrate with your wider system, these tools may very well be fully satisfactory.

On the other hand, AI-native tools such as Vellum, Dify, Langflow, and Flowise, with their exclusive focus on building AI agents, do very well on the codability axis. This gives users a great deal of control over agents’ behavior, but makes it harder to integrate with their IT stack. It makes these tools better suited for use cases where agents use web resources, SaaS apps, and documents rather than coordinating a set of on-premises enterprise applications.

As you’re reading this on the n8n website, I reckon you might question the tool’s high scores across both metrics, so I invite you to read the evaluation matrix, which contains references to all the docs used to assess the vendors.

Taking 50% as the threshold between a high and a low score, here’s a breakdown of these vendors’ applicability:

< 50% score on integrability

Suitable for organizations with a less complex IT stack, which are most often start-ups and small-to-medium sized businesses. Higher scores on this metric do not provide additional benefits for simpler IT stacks.

> 50% score on integrability

Suitable for organizations with more complex IT stacks that need to call upon various tools and integrate with legacy systems. These are often large organizations, which also benefit from out-of-the-box content that can reduce development and onboarding times.

< 50% score on codability

Suitable where AI agent use cases are simple, such as creating a support chatbot or summarizing unstructured documents. These tools are mainly used in internal use cases to automate processes which cannot be done with standard low-code workflows.

> 50% score on codability

Suitable where AI agent use cases are implemented for high-risk systems, real-time use cases, and customer-facing decisions. These tools offer a high degree of customization and require knowledge of both coding and AI development.

To address the top-right quadrant syndrome of market evaluations, I will present the upsides for vendors featured in the lower-left quadrant and the downsides for vendors in the upper-right quadrant:

  • Advantages for tools with <50% on both integrability and codability include ease of use, early adoption, and minimal training. I find these tools to be a great lightweight opportunity for startups and small businesses to implement AI agents while still retaining a good level of control over their configuration. As these tools are fairly new, they are a great opportunity for organizations to grow alongside them and adopt more mature features as their own use cases evolve.
  • Disadvantages for tools with >50% on both integrability and codability include training and onboarding times. These tools are not pick-up-and-go for writing agents, but require users to have a priori knowledge of how an agent can operate and how to use the tool. With the right investment and resources, these tools can prove immensely useful for all types of organizations.

This section defines the evaluation criteria used to compare a selection of tools in the AI Agent development market. It provides a comprehensive list of features that can support developers in creating production-ready AI Agent applications and integrating them in their existing business and technology stack.

For each feature, vendors will be scored as follows:

  • 0: Feature is absent or unstated
  • 1: Feature is partially available or achieved via third-party integrations
  • 2: Feature is available natively in the tool

Each feature is aggregated under a header. For example, scripting and native version control (along with four other features) sit under Code-based development environment. The Code-based development header will be assigned a weight depending on how many features it consists of and their overall importance. You can view Annex 1 for a detailed description of all the features evaluated.
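
To make the aggregation concrete, here is a minimal sketch of how a header score and a weighted total could be computed. The feature names, scores, and weight below are hypothetical placeholders for illustration only, not the values used in this report.

```python
# Illustrative feature scores on the 0/1/2 scale for one header.
code_based_dev_features = {
    "scripting": 2,
    "native_version_control": 1,
    # ...the remaining features under this header would be listed here
}

# Hypothetical header weight, reflecting feature count and overall importance.
HEADER_WEIGHTS = {"Code-based development environment": 0.3}

def header_score(features: dict) -> float:
    """Average the 0-2 feature scores under a header, expressed as a percentage."""
    return sum(features.values()) / (2 * len(features)) * 100

def weighted_total(header_scores: dict, weights: dict) -> float:
    """Combine header percentages into a single weighted score."""
    return sum(score * weights[header] for header, score in header_scores.items())

scores = {"Code-based development environment": header_score(code_based_dev_features)}
print(scores)                                  # {'Code-based development environment': 75.0}
print(weighted_total(scores, HEADER_WEIGHTS))  # 22.5, with only this one header counted
```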

To assess each vendor, I did the following:

First, I went through all the documentation manually and populated the spreadsheet with everything I could find.

Second, for the criteria where I did not find the information, I checked the website and other resources, used the search function in the docs, or wrote a site:tools.com query with the right keywords for the capabilities I was looking for.

Lastly, for vendors with AI-powered documentation search, I asked the AI whether the tool offers the features.

Vendors that cannot be assessed based on publicly available documentation will be excluded.

Here are three examples of how the assessment is conducted, applied to the Native implementation of IDEs capability.

  1. Windmill: Score of 2 - capability explicitly stated in great detail

    • Example: “The code editor is Windmill's integrated development environment. It allows you to write code in TypeScript, Python, Go, Bash, SQL, or even running any docker container through Windmill's Bash support. The code editor is the main component for building scripts in Windmill's script editor, but it is also used in Windmill's Flow editor to build steps or in the App editor to add inline scripts.”

    • Reference: https://www.windmill.dev/docs/code_editor

  2. Camunda: Score of 1 - capability not natively supported, but can be used in the platform via third-party integrations

  3. Vellum: Score of 1 - capability stated as supported, but documentation does not offer explicit statements and examples

Limitations

Since the research is based on vendors’ technical documentation, the scoring is directly tied to the quality of that documentation. This means that undocumented features may still be present in a tool, in which case they won’t be reflected in the scoring.

The assessment is not conducted through user testing, so user experience is not in scope. This is comparable to evaluating cars without driving them. We can have an intuitive understanding of how a hatchback, a performance SUV, or an electric people carrier differ in terms of both usage and experience. With enough low-level detail, we can also compare cars in the same category, such as the differences between an M5 F90 and a G90.

The assessment is not based on benchmarking, which means that the evaluation does not include the tools’ behavior under stress.

The evaluation criteria are intended to be as comprehensive as possible, which means that some - or many - of the features we evaluate may not be relevant or applicable to your use cases. This is why we recommend looking at the complete scores rather than the final average.

We did not engage with any of the vendors featured in the report prior to publishing it. If any vendors have corrections they want to make, I invite them to send me their comments, which I will evaluate and use to update the report.

  • Using AI to build workflows
  • Workflow building experience
  • Scalability
  • MCP

Report prepared by: Andrew Green
