So you want to hear about debugging hell? Buckle up, because I’m about to tell you the story of a bug that took us months to track down. The solution was embarrassingly simple once we found it.
Let me paint you a picture of our delightfully over-engineered system. We had this web service running on Tomcat that would take UI filters and convert them into Spark-compatible SQL. Think of it as a translation layer between human-readable queries and Spark’s particular brand of SQL masochism.
-- User says: "Show me transactions after 2015"
-- We generate:
SELECT transaction_region, transaction_date
FROM fx_income
WHERE year(transaction_date) > 2015
Instead of running the SQL directly in the same Java process (which would have been easier), we had to send it to a separate Jetty service running on a Unix host, where the actual Spark session lived. The separation existed for a few reasons: the Spark session and the web server needed different, conflicting runtime libraries (JAR files), running Spark in cluster vs. client mode added its own complexity, caching had to be managed in one place, and, quite possibly, someone was trying to do microservices before they were cool. To keep things manageable, the Jetty service sat directly on top of the long-lived Spark session and handled incoming query requests.
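For context, the Jetty side looked roughly like this. This is a simplified sketch, not our production code: the class name, port, JDBC URL, and view handling are illustrative, and error handling is omitted.
// Simplified sketch of the Jetty-side service that owned the long-lived Spark session.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.eclipse.jetty.server.Request;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.AbstractHandler;

public class SparkQueryHandler extends AbstractHandler {

    // One long-lived session shared across requests: the whole reason this service existed
    private final SparkSession spark = SparkSession.builder()
            .appName("query-service")
            .master("local[*]")   // or your cluster master in client/cluster mode
            .getOrCreate();

    @Override
    public void handle(String target, Request baseRequest,
                       HttpServletRequest request, HttpServletResponse response)
            throws IOException, ServletException {
        String sql = request.getParameter("sql");
        String view = request.getParameter("view");   // e.g. "fx_income"

        // Load the backing table from a source system and register it as a temp view
        Dataset<Row> source = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://greenplum-host:5432/finance") // illustrative
                .option("dbtable", view)
                .load();
        source.createOrReplaceTempView(view);

        // Run the SQL that the Tomcat service generated from the UI filter
        Dataset<Row> result = spark.sql(sql);
        response.setContentType("application/json");
        response.getWriter().write(String.join("\n", result.toJSON().collectAsList()));
        baseRequest.setHandled(true);
    }

    public static void main(String[] args) throws Exception {
        Server server = new Server(8081);          // port is illustrative
        server.setHandler(new SparkQueryHandler());
        server.start();
        server.join();
    }
}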
The flow looked like this:
- Tomcat service receives filter from UI
- Converts filter to Spark SQL
- Sends SQL + view configuration to Jetty service
- Jetty loads required views from user databases (Greenplum, SQL Server, Oracle, REST APIs, you name it)
- Creates Spark views and executes the SQL
- Returns results back to Tomcat
- Tomcat sends results to UI
public class FilterToSparkService {

    public QueryResult executeFilter(FilterRequest filter) {
        // Convert UI filter to Spark SQL
        String sparkSQL = convertFilterToSQL(filter);
        List<String> requiredViews = extractRequiredViews(filter);

        // Send to Jetty Spark service
        SparkRequest request = new SparkRequest(sparkSQL, requiredViews);
        return jettyClient.execute(request); // This is where hell broke loose
    }
}
For other databases (DB2, Greenplum, SQL Server), we kept it simple and ran queries directly. But Spark? Oh no, Spark had to be special.
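For comparison, the direct path for those databases was just plain JDBC, roughly like this. A minimal sketch: the class name and the injected DataSource are assumptions, not our actual code.
// Sketch of the direct query path used for DB2 / Greenplum / SQL Server.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.sql.DataSource;

public class DirectQueryService {

    private final DataSource dataSource; // assumed: an already-configured connection pool

    public DirectQueryService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public List<Map<String, Object>> runDirect(String sql) throws SQLException {
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            List<Map<String, Object>> rows = new ArrayList<>();
            ResultSetMetaData meta = rs.getMetaData();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.put(meta.getColumnLabel(i), rs.getObject(i));
                }
                rows.add(row);
            }
            return rows;
        }
    }
}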
One fine day, our monitoring started screaming. One of our Tomcat instances was throwing ConnectionExceptions when trying to reach the Jetty Spark service. The requests just… weren’t getting through. Meanwhile, the other three instances were humming along perfectly fine.
“Must be a network glitch,” we thought. We restarted the misbehaving instance and, boom, it worked again. Problem solved, right?
Wrong.
The pattern repeated every couple of days. Sometimes it was instance A, sometimes B or C. The timing was completely random — could be a day, could be a week, could be all four instances going down simultaneously. It was like playing Russian roulette with our production environment.
// This is what we were dealing with
try {
    SparkResult result = jettyClient.executeSparkQuery(sql);
    return result;
} catch (ConnectionException e) {
    log.error("Failed to connect to Jetty service", e);
    throw new RuntimeException("Spark service unavailable", e); // keep the cause
}
Our brilliant solution? Train the production support team to restart the problematic instances. “Just restart it and it’ll work” became our unofficial motto. We had bigger fish to fry, after all.
First suspects: the network and OS teams. “Must be a connectivity issue,” we declared with the confidence of developers who’ve never touched a network cable. The infrastructure folks ran their diagnostics, checked the servers, and came back with the most dreaded response in IT:
“Everything looks fine on our end. It’s probably your application.”
We rolled up our sleeves and decided to stress test the hell out of our lower environments. Fired every conceivable type of Spark request, tried to reproduce the connection failures, and… nothing. The system worked flawlessly.
“Must be a production-only issue,” we rationalized, adding more detailed logging like it was going to solve our problems.
// Our logging got increasingly desperate
log.debug("About to send request to Jetty service...");
log.debug("Request payload: {}", request.toString());
log.debug("Jetty service URL: {}", jettyServiceUrl);
log.debug("Connection timeout: {}", connectionTimeout);

// This log statement was added in month 3 of debugging
log.debug("Current thread: {}, JVM uptime: {}",
        Thread.currentThread().getName(),
        ManagementFactory.getRuntimeMXBean().getUptime());
The logs were about as helpful as a chocolate teapot.
After months of living with this “minor inconvenience,” I decided to check our lower environments again. And, against all odds, that day it was finally happening there too (something we hadn’t seen in months): one host couldn’t reach Jetty while the other could. Finally, a reproducible case!
Time for some remote debugging. Thank God for lower environments where security policies actually let you debug without filling out 47 forms.
I stepped through the code, traced the execution path, and followed it all the way down to the native library calls. That’s when I noticed something peculiar:
- Host A (working): requests went directly to the Jetty service
- Host B (broken): requests were being routed through a proxy
Wait, what? A proxy? Since when do we have proxy settings enabled?
Digging deeper into the JVM system properties, I discovered that on the failing instance http.useProxy was mysteriously set to true, and http.proxyHost and http.proxyPort pointed at our corporate proxy. But why?
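If you ever need to check this yourself, dumping the proxy-related system properties (and asking the default ProxySelector what it would actually do) is enough to spot the difference between a healthy and a poisoned JVM. A small diagnostic sketch, not our actual tooling; the host name is made up:
// Diagnostic sketch: print the JVM-wide proxy settings that java.net honors.
// On the healthy host these were all null; on the broken one they pointed at the corporate proxy.
import java.net.ProxySelector;
import java.net.URI;
import java.util.Arrays;

public class ProxyStateDump {
    public static void main(String[] args) throws Exception {
        for (String key : Arrays.asList("http.proxyHost", "http.proxyPort",
                                        "https.proxyHost", "https.proxyPort",
                                        "http.nonProxyHosts", "java.net.useSystemProxies")) {
            System.out.printf("%s = %s%n", key, System.getProperty(key));
        }
        // What the default selector would do for our internal endpoint (host is illustrative)
        System.out.println(ProxySelector.getDefault()
                .select(new URI("http://internal-jetty-host:8081/")));
    }
}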
Then it hit me like a ton of caffeinated bricks. I went back through our audit logs and found the pattern:
Every failing instance had recently executed a Snowflake query.
Our Snowflake connection code looked innocent enough:
// The seemingly innocent code
public class SnowflakeQueryService {

    public ResultSet executeQuery(String sql) throws SQLException {
        // Set proxy for Snowflake connection
        System.setProperty("http.useProxy", "true");
        System.setProperty("http.proxyHost", proxyHost);
        System.setProperty("http.proxyPort", proxyPort);

        Connection conn = DriverManager.getConnection(snowflakeUrl, props);
        return conn.createStatement().executeQuery(sql);
    }
}
The poison pill revealed itself. Setting JVM-wide system properties for Snowflake was corrupting the proxy settings for the entire application. Once a Tomcat instance executed a Snowflake query, ALL subsequent HTTP requests (including our Spark service calls) started routing through the proxy.
And since our internal Jetty service wasn’t accessible through the corporate proxy, those requests were failing spectacularly.
// The pattern was crystal clear:
// 1. Fresh JVM starts up - proxy settings disabled
// 2. User runs Snowflake query - System.setProperty() enables proxy globally
// 3. Subsequent Spark requests try to use proxy
// 4. Proxy can't reach internal Jetty service
// 5. Connection fails until JVM restart
This is a textbook example of the poison pill pattern — where one corrupt request (the Snowflake query) poisons the entire system state, causing cascading failures until the system is restarted.
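To see why one System.setProperty call has such a wide blast radius: the default java.net HTTP stack reads http.proxyHost and http.proxyPort when a connection is opened, so every client in the JVM that hasn’t configured its own proxy explicitly is affected. A minimal sketch that reproduces the behavior; the host names are made up, and you need a host that is only reachable without the proxy to see the failure:
// Demo: the same HTTP call works before the properties are set and fails after,
// because the second attempt is silently routed through the (unreachable) proxy.
import java.net.HttpURLConnection;
import java.net.URL;

public class ProxyPoisonDemo {
    public static void main(String[] args) {
        // Before: goes straight to the internal service
        ping("http://internal-jetty-host:8081/health");

        // Simulate what the Snowflake code path did
        System.setProperty("http.proxyHost", "corporate-proxy.example.com");
        System.setProperty("http.proxyPort", "8080");

        // After: the SAME call now routes through the corporate proxy,
        // which cannot reach the internal host, so it fails with a connect error
        ping("http://internal-jetty-host:8081/health");
    }

    private static void ping(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(2000);
            System.out.println(url + " -> HTTP " + conn.getResponseCode());
        } catch (Exception e) {
            System.out.println(url + " -> " + e);
        }
    }
}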
The solution was beautifully simple. Instead of setting JVM-wide system properties, we needed to pass proxy settings directly to the Snowflake JDBC driver using connection properties:
// Before: The poison pill approach
System.setProperty("http.useProxy", "true");
System.setProperty("http.proxyHost", proxyHost);
System.setProperty("http.proxyPort", proxyPort);

// After: The civilized approach
Properties props = new Properties();
props.put("user", username);
props.put("password", password);
props.put("useProxy", "true");     // Driver-specific property
props.put("proxyHost", proxyHost); // Not a system property!
props.put("proxyPort", proxyPort); // Contained within the connection

Connection conn = DriverManager.getConnection(snowflakeUrl, props);
The Snowflake JDBC driver supports these proxy parameters as documented connection properties, so the settings stay scoped to the connection and don’t pollute global JVM state. Looking back, a few things would have saved us months:
- Audit logging from day one: If we had been tracking which requests hit which instances earlier, we would have spotted the Snowflake correlation much sooner.
- Better isolation of concerns: Setting global system properties for database-specific configuration is a code smell that should have been caught in code review.
- JVM state monitoring: We should have been watching key JVM system properties (like the proxy settings) as part of our health checks; see the sketch after this list.
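Here’s roughly what that last point could look like in practice. A minimal sketch: the alert hook is a placeholder, and the list of watched properties is an assumption based on our setup.
// Sketch of a lightweight health check that would have caught the poisoned state.
import java.util.Arrays;
import java.util.List;

public class ProxyStateHealthCheck implements Runnable {

    private static final List<String> WATCHED = Arrays.asList(
            "http.proxyHost", "http.proxyPort", "https.proxyHost", "https.proxyPort");

    @Override
    public void run() {
        for (String key : WATCHED) {
            String value = System.getProperty(key);
            if (value != null) {
                // In our setup these should never be set JVM-wide, so any value is a red flag
                alert("Unexpected JVM proxy property: " + key + "=" + value);
            }
        }
    }

    private void alert(String message) {
        // Placeholder: push to whatever monitoring/alerting system you already use
        System.err.println("[HEALTHCHECK] " + message);
    }
}
Scheduled with a ScheduledExecutorService or hung off an existing health endpoint, this turns a silent poisoning of global state into an immediate, attributable alert instead of a random restart ritual.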
Once we deployed the fix, the connection failures vanished completely. No more random restarts, no more phantom network issues, no more frustrated support teams.
This bug taught me that the most challenging problems often have embarrassingly simple solutions. The real skill isn’t in finding complex fixes — it’s in having the patience and methodology to systematically eliminate possibilities until you find the root cause.
The poison pill pattern is more common than you might think, especially in systems where global state can be modified by individual requests. It’s a reminder that when you share resources like JVM system properties, every piece of code can potentially affect every other piece.
Happy debugging, fellow code warriors. May your bugs be reproducible and your coffee be strong.