Encouraging the use of LLMs made interviews easier (for us as interviewers)


The advent of Cluely and the ever-growing arms race between interviewer and interviewee is a fascinating phenomenon to me. Betting that people would prefer cheating, or that companies need to screen for cheaters because the use of LLMs is heavily discouraged, seems to go against the common experience that LLMs do accelerate or improve the overall productivity of developers (with caveats, of course).

When we were hiring engineering interns, we faced the same issue: how do we screen candidates when the usage of LLMs is so rampant and "cheating" software is readily available? Several methods were proposed, including switching to face-to-face interviews. However, we are a remote-first company, and it did not feel right to make candidates go through a face-to-face interview when the main mode of communication in day-to-day BAU is remote. Our previous method was a simple take-home assignment that takes less than an hour to complete, with a week to submit, but this is obviously no longer a suitable method of evaluation.

The key insight was that LLMs excel at pattern matching, not first-degree analysis. LLMs are susceptible to context poisoning; humans have no issue dealing with this, as I've mentioned in my previous post here. The recent Apple paper on LLM reasoning explains this further as well. Humans are actually pretty good at first-degree analysis and reasoning.

So, why not just encourage the use of LLMs and/or search tools during the interview, and instead note down how the candidate approaches these tools? Think open-book interviews.

This was the protocol we arrived at for intern interviews:

  1. We tell them to be prepared to share their screen, and that they are free to use any language, IDE, AI tool, and/or search tool they prefer, although we would prefer Python or JavaScript because that is what we work with in the company.

  2. Depending on the resume provided and the curriculum we expect the candidate to have gone through, we pick a class of problems that would be challenging or extremely hard to solve on the spot. Something like a calculator solver with varying levels of difficulty (starting from integer arithmetic and working up to custom functions and beyond, for example) usually suffices; a minimal sketch of what a first level might look like is included after this list.

  3. We explicitly mention that the problems we provide are expected to be challenging because we want to observe how the candidate performs under stress. The candidate is encouraged to explain their thought process, and we also mention that we don't expect candidates to finish the solution, because the test itself has multiple levels of difficulty.

  4. Candidates usually like to start coding immediately. I would guide them to first design and analyze the problem, because internally we would also usually go through RFCs before implementing a solution, and candidates usually perform better once they have the space to think.

  5. When the candidate has successfully solved one level, they are prompted to start the next. The next level usually challenges assumptions made in the prior levels, unless the candidate accounted for them in the design phase.
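
To make the problem class concrete, here is a minimal sketch (in Python, and not the exact exercise we use) of what a first level might look like: evaluating integer arithmetic with the four operators and parentheses via a small recursive-descent parser. Later levels, such as variables or custom functions, would deliberately break assumptions baked into a design like this.

```python
import re


def tokenize(expr: str) -> list[str]:
    # Split the input into integer literals and single-character operators/parentheses.
    # Sketch-level simplification: invalid characters are silently ignored.
    return re.findall(r"\d+|[()+\-*/]", expr)


def evaluate(expr: str) -> int:
    tokens = tokenize(expr)
    pos = 0

    def parse_expr() -> int:  # lowest precedence: + and -
        nonlocal pos
        value = parse_term()
        while pos < len(tokens) and tokens[pos] in "+-":
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def parse_term() -> int:  # higher precedence: * and /
        nonlocal pos
        value = parse_factor()
        while pos < len(tokens) and tokens[pos] in "*/":
            op = tokens[pos]
            pos += 1
            rhs = parse_factor()
            value = value * rhs if op == "*" else value // rhs
        return value

    def parse_factor() -> int:  # integer literals and parenthesised groups
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            pos += 1  # consume the closing ")"
            return value
        return int(token)

    return parse_expr()


print(evaluate("2 + 3 * (4 - 1)"))  # 11
```

A candidate who separates tokenizing, parsing, and evaluation at this stage tends to have a much easier time when a later level asks for variables or user-defined functions.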

After that the candidate is scored based on:

  1. Levels completed.

  2. How the candidate designed the solution to scale when the requirements are expanded.

  3. How the candidate interacted with AI/search tools.

    a. The candidate would be asked how they designed the prompt or search keywords.

    b. The candidate would be asked to explain why they think the AI answer or the first search result is correct.

    c. The candidate would be asked how they would extract value out of the AI answer or search result.

Good candidates will provide a clear first-degree analysis of how they designed the information extraction process. They can articulate what they expected, or did not expect, to get out of the AI or search.

Bad candidates, on the other hand, have many failure modes: they may provide circular reasoning, they might skip understanding how the AI arrived at the answer, or they may simply assume the AI is correct.

While we usually tell the candidate that the whole process may take an hour, it is interesting to note that the decision can usually be made around the 15-minute mark, when the candidate approaches AI/search for the first time. Previously, with take-home assignments or real-time interviews (with AI banned), we would spend the full 30 minutes to an hour trying to understand whether the candidate actually did the assignment themselves and understood the concepts they used. With AI usage encouraged, we reach the decision threshold much faster, resulting in faster and more accurate assessments of the candidate.
