Kentik bolsters network observability platform with autonomous investigation


Kentik launches agentic AI for autonomous network troubleshooting.

Network observability is changing quickly in the generative AI era, as vendors integrate increasingly sophisticated intelligence into their platforms.

Network observability provider Kentik, which got its start in 2014 as CloudHelix before rebranding in 2015, has always focused on helping networking professionals do their jobs, which often involve repetitive tasks. A year ago, the company announced its initial foray into genAI, providing a natural language interface known as Kentik AI Journeys for network troubleshooting and analysis.

Kentik is now taking the next step on its own AI journey with AI Advisor, which launched November 18. The agentic AI system formulates investigation plans, executes queries across telemetry sources, and presents findings with supporting evidence. The goal is to move beyond basic chatbot capabilities, where a user simply asks a question in natural language.

“This is more like, help me do my job,” Avi Freedman, CEO and co-founder of Kentik, told Network World. “And that’s exciting.”

Moving beyond query translation

Kentik’s previous genAI tool, called Journeys, operated in a request-response mode. Engineers asked questions in natural language, and the system translated them into queries against Kentik’s telemetry databases. Each query was independent. Multi-step investigations required the engineer to direct each step manually.

AI Advisor changes this dynamic by handling multi-step workflows autonomously. The difference becomes apparent in troubleshooting scenarios. With Journeys, investigating a customer outage meant asking about traffic patterns, then separately querying for recent changes, and then looking for correlations between the two. Each step required human direction.

“Advisor knows how to actually do networking things and can be more like a teammate,” Freedman explained. “It will go, reason, make a plan, use the different products, go look across the domains of telemetry and awareness, and say, ‘here’s what I think is going on, and here’s what you should do about it.’” In practice, an engineer can now ask, “What might be causing this customer to be down?” and the system will autonomously check traffic volumes, review recent firewall changes, examine the timing of events, and identify whether a specific rule change correlates with the traffic drop. It presents findings with the underlying data and suggests specific remediation steps.
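The autonomous workflow described above can be sketched as a simple plan-and-correlate loop. This is an illustrative sketch only: the functions `traffic_volume` and `recent_firewall_changes` are hypothetical stand-ins with canned data, not Kentik APIs, and the thresholds are assumptions.

```python
# Illustrative sketch of a multi-step outage investigation.
# traffic_volume() and recent_firewall_changes() are invented stand-ins
# for the telemetry queries described in the article.
def traffic_volume(customer: str) -> list[tuple[int, float]]:
    # (timestamp, bits-per-second) samples; canned data for illustration
    return [(100, 9.8e9), (160, 9.7e9), (220, 0.2e9), (280, 0.1e9)]

def recent_firewall_changes(customer: str) -> list[tuple[int, str]]:
    # (timestamp, rule) pairs; canned data for illustration
    return [(210, "deny 203.0.113.0/24 any")]

def investigate_outage(customer: str) -> dict:
    """Plan: fetch traffic, fetch recent changes, correlate their timing."""
    samples = traffic_volume(customer)
    changes = recent_firewall_changes(customer)
    # Step 1: find the first sample where traffic collapses >90% vs. the prior one
    drop_ts = next(
        (t2 for (t1, v1), (t2, v2) in zip(samples, samples[1:]) if v2 < 0.1 * v1),
        None,
    )
    finding = {"customer": customer, "drop_at": drop_ts, "suspect_change": None}
    # Step 2: flag any change made shortly (within 5 minutes) before the drop
    if drop_ts is not None:
        for ts, rule in changes:
            if 0 <= drop_ts - ts <= 300:
                finding["suspect_change"] = rule
    return finding

print(investigate_outage("acme"))
```

The point of the sketch is the shape of the workflow, not the heuristics: each step feeds the next without a human directing it.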

Data engine extensions for contextual analysis

The autonomous investigation capability required Kentik to extend its data platform beyond flow records and device metrics. The Kentik Data Engine processes approximately one trillion telemetry points daily from NetFlow, sFlow, device APIs, cloud provider APIs, and synthetic monitoring. But correlation analysis requires additional context that wasn’t previously captured.

“We needed configs, which we didn’t have,” Freedman said. “We needed graph and topology, which we had, but in places.”

The company added configuration tracking, topology modeling, and relationship mapping to the platform. This allows the system to answer questions like whether a firewall rule change affected specific customer IP addresses or whether an IGP metric adjustment could have influenced routing decisions. The context layer connects time series data with network state information.
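A check like "did this firewall rule change affect this customer's addresses" reduces to prefix overlap once the context layer knows both sides. A minimal sketch, using Python's standard `ipaddress` module; the function name and inputs are assumptions for illustration:

```python
import ipaddress

# Hypothetical context-layer check: does a changed firewall rule's prefix
# overlap any of a customer's known address blocks?
def rule_affects_customer(rule_prefix: str, customer_prefixes: list[str]) -> bool:
    rule_net = ipaddress.ip_network(rule_prefix)
    return any(
        rule_net.overlaps(ipaddress.ip_network(p)) for p in customer_prefixes
    )

# A /24 deny rule overlapping one of the customer's prefixes
print(rule_affects_customer("203.0.113.0/24", ["203.0.113.64/26", "198.51.100.0/24"]))  # True
```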

The underlying database architecture uses a columnar store for historical data and a streaming database for real-time analysis. Both use the same query language, which allows the system to correlate events across time windows without moving data between systems.

Foundation models and workflow training

Kentik uses commercial large language models (LLMs) rather than training its own from scratch. 

The company’s development effort focuses on training the system to understand network operations workflows rather than tuning model weights. This happens at two levels. First, the system learns what capabilities Kentik provides and how to access them through APIs. Second, it learns how network engineers actually use those capabilities.

“We have 500 customers and over 100,000 users that use it,” Freedman noted. “So, we can train on the kinds of things that they’re doing, the workflows that they’re doing, the things that we know as network experts that they do.”

This approach avoids training on customer telemetry data itself. The system learns patterns of investigation, which queries typically follow others, what correlations matter in different scenarios, and how experienced operators approach specific problem types. Customer network data remains private.

Validation through hybrid guardrails

Hallucination was an early concern with any type of genAI. While that risk remains for genAI in general, Kentik implements constraints at multiple levels to prevent unreliable recommendations.

Pre-execution rules define boundaries for what the system should attempt. Some changes are explicitly forbidden. 

“Never try to adjust someone’s IGP metrics for their global network,” Freedman cited as an example. “You’re just not going to know enough from what we see to be able to do that.”
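A pre-execution rule of this kind can be as simple as a denylist consulted before any planned action runs. This is a sketch under assumptions: the action schema and the names in the denylist are invented for illustration (the IGP example follows the quote above).

```python
# Illustrative pre-execution guardrail; action-type names are invented.
FORBIDDEN_ACTIONS = {
    "adjust_igp_metrics",        # per the example quoted above
    "modify_global_routing_policy",
}

def allowed(action: dict) -> bool:
    """Reject any planned action on the denylist before it executes."""
    return action["type"] not in FORBIDDEN_ACTIONS

print(allowed({"type": "adjust_igp_metrics", "scope": "global"}))   # False
print(allowed({"type": "query_flow_records", "scope": "customer"})) # True
```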

Post-generation validation checks determine whether recommendations make logical sense given the network state. This combines AI-based analysis with traditional algorithmic validation. 

“We have some non-AI guardrails that we use as well,” Freedman explained.

Because the system outputs structured data in JSON format rather than natural language, Kentik can programmatically verify recommendations. It checks whether suggested causes align with topology, whether timing supports causation, and whether correlations exceed noise thresholds. The validation layer uses graph analysis, feature extraction, and statistical correlation alongside generative AI.
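Because the recommendation is structured JSON, the three checks above can be expressed programmatically. A minimal sketch, assuming invented field names (`cause_device`, `correlation`, etc.) and a toy topology represented as a set of edges:

```python
import json

# Hypothetical post-generation validation of a structured recommendation.
# Field names and the edge-set topology model are assumptions for this sketch.
def validate(rec_json: str, topology_edges: set[tuple[str, str]],
             min_correlation: float = 0.8) -> bool:
    rec = json.loads(rec_json)
    cause, effect = rec["cause_device"], rec["affected_device"]
    # 1. Suggested cause must be adjacent to the affected device in the topology
    connected = (cause, effect) in topology_edges or (effect, cause) in topology_edges
    # 2. The cause must precede the effect in time
    timing_ok = rec["cause_ts"] <= rec["effect_ts"]
    # 3. The correlation must exceed the noise threshold
    strong = rec["correlation"] >= min_correlation
    return connected and timing_ok and strong

rec = json.dumps({
    "cause_device": "fw1", "affected_device": "edge2",
    "cause_ts": 1700000000, "effect_ts": 1700000030,
    "correlation": 0.93,
})
print(validate(rec, {("fw1", "edge2")}))  # True
```

None of these checks involve a language model, which is the point: they catch a recommendation that reads plausibly but contradicts the network state.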

Autonomous vs. automated operations

Kentik maintains that AI Advisor enables autonomous operations rather than fully automated ones. The distinction centers on control.

The system investigates problems autonomously but requires human approval before making changes. “Absolutely, 100% we call it the big red button,” Freedman said when asked whether humans remain in the loop for critical actions.

The company already operates in a mixed mode for DDoS mitigation. Some customers grant the system authority to automatically implement specific countermeasures, while others require approval for each action. This same model will extend to general network operations.
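The mixed mode amounts to a per-customer allowlist gating execution, with everything else held for a human. A hedged sketch; the action names and return strings are invented for illustration:

```python
# Sketch of the mixed approval model: some (category, action) pairs are
# pre-authorized by the customer; everything else waits for sign-off.
AUTO_APPROVED = {("ddos_mitigation", "rtbh")}  # illustrative pre-authorized pair

def execute(category: str, action: str, human_approved: bool = False) -> str:
    if (category, action) in AUTO_APPROVED or human_approved:
        return f"executed {action}"
    # The "big red button" stays with the operator
    return f"pending approval: {action}"

print(execute("ddos_mitigation", "rtbh"))       # executed rtbh
print(execute("routing", "update_igp_metric"))  # pending approval: update_igp_metric
```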

The stated goal is efficiency rather than replacement: automate repetitive investigation work while keeping strategic decisions and critical changes under human control.

“I think what we can achieve is taking 50% of the time out of networking so people can be doing the thinking stuff,” Freedman said.
