I did some data analytics for a Senior Data Analyst job interview – a bit of predictive analytics. Specifically, I did a market basket analysis (also known as association rule mining or association rule learning – they mean the same thing). It’s machine learning. And then I had to present my findings.
Yeah. It was a lot.
The goal was to identify why some customers had cancelled their subscription (or churned); who they were; and provide recommendations on how to prevent them from churning.
After presenting, I decided to write this up. Not just because I like the sound of my voice, but because:
- Hey, this was cool. A bit of predictive analytics, and a bit of machine learning? Of course, this was going on my portfolio.
- I’m not a Data Scientist. I’m just a guy who works with data (I was a Data Journalist for five years). If I can learn this, and you work with data on a day-to-day basis, you can probably learn how to do this too.
- I ran into a few encoding speed bumps. Writing about them helps me learn from the experience. Maybe someone else can learn from it too.
- Data can come across as mysterious or magical. By breaking this project down and explaining it, I hope I can make it less scary to some. More accessible.
In this visual data story, you’ll learn:
- What market basket analysis is, and why it’s one of the go-to predictive analytics solutions for business data problems.
- How to encode your data for market basket analysis.
- How to carry out market basket analysis in Python (using the mlxtend library).
Before I go on, I mentioned I did this for a job interview. You may wonder… did he get the job? What were the results of this gargantuan task?
Not only did the analysis identify how to recover $1 million from $5.6 million in lost revenue, but it also led to a job offer (despite my bad performance in the second stage of the interview). I ended up turning down the offer in favour of another job role in an industry more in line with my career direction. However, I believe this presentation task (and the analysis carried out) was what helped lead to an offer.
The Customer Churn Case Study Brief
I wrote the brief in detail, in case you’re interested in that sorta thing. If you want to skip straight to the problem solving, you can go here.
ERS provide access to Scientific, Technical and Medical research and insights through their platform. They’ve experienced rapid growth since their formation in 2012, with approximately $40 million in subscription revenue recorded in 2016.
ERS was recently acquired (2017) by TRS, a larger provider of research insights.
ERS growth between 2015 and 2016 was quite high, growing 28% year on year. During the acquisition process, the existing ERS management team were asked to conduct a business review to identify opportunities or risks to continued revenue growth.
During the review, the Head of Sales highlighted an alarming issue. He produced the graph below:
The management team were alarmed at the revenue they lost in 2016 through lapsed/churned accounts. With new business compensating so well for this revenue, they hadn’t realised the loss on existing accounts was so large, and were concerned this could be a major risk to future revenue growth.
Every customer receives the same treatment when they renew each year. The management team want to understand how to combat this attrition risk.
As the leading analyst within the business, you’ve been approached by the Head of Sales to help answer this question for the management team:
How can we identify accounts at risk of cancelling their subscription?
Customer Churn Data Provided
Through your contacts in the data team, you’ve been able to source five datasets to help answer the question. They cover a random sample of 10,000 ERS accounts between 2015 and 2016.
- ACCOUNTS.csv – high-level attributes for each account, including the year the customer subscribed
- NPS.csv – Net Promoter Score (NPS) survey results for account holders
- REVENUE.csv – what subscriptions accounts subscribe to, and for how much (there are three options)
- USAGE.csv – number of unique usage sessions for each subscription type
- USERS.csv – total number of unique users associated with each account per subscription year
Also:
- There are no duplicates
- The data is clean, with no data errors or inaccuracies, but there are NA values.
How I Approached the Problem – Churn Analysis
I chose to approach this as a churn analysis problem.
In a nutshell:
- Here are the characteristics of all our customers/employees.
- Here are those who churned.
- Find out what those who churned share in common.
- Provide recommendations on how we can stop them from churning.
And that’s lunch.
This requires predictive analytics – the use of data to predict future trends. In this case, looking at a subset of a population, and providing scenarios on how that set would behave, with probabilities attached.
Because the required output was categorical rather than continuous data, I decided a market basket analysis approach would be best.
(Don’t know the difference between continuous and categorical or discrete data? Don’t worry. I’ll explain later.)
What is Market Basket Analysis or Association Rules Mining?
If you want to learn more about market basket analysis or association rules mining, you can check out the resources here:
- Market Basket Analysis – Science Direct
- Association Rules with Python – Kaggle
- Market Basket Analysis (Association Rule Mining) with Excel – David Langer [Video]
The “association rules” or the series of factors are those things which the churned group will hold in common.
For our dataset, this could be the country they’re based in, their annual revenue, and the number of employees.
For instance, if we found out the following set of association rules has a lift of 2.1x:
- Country: Germany
- Annual Revenue Category: 500m-1bn
- Number of employees: 1001-5000
Then we know that customers from Germany, who make $500mn to $1bn, and have between 1001 and 5000 employees, are 2.1 times more likely to churn.
A rule is simply “is this factor true or false?”. So is this subscriber from Germany? True or false.
I’ll use “association rules”, “series of factors”, and “series of variables” interchangeably to refer to a set of rules that may or may not lead to churn.
If you see any of these words, just think: stuff the group share in common.
Market Basket Analysis Challenges to Overcome
Of course, it wasn’t smooth sailing.
In fact, there was one moment I thought I’d be done in an hour, only to realise I had to start from scratch. It took many more hours – a couple of days, in fact – to finish.
(A few swear words were also said).
Here are challenges I faced:
- Doing it outside of Microsoft Excel so I could get a list of association rules, instead of just the rule with maximum lift.
- Using the mlxtend library in Python, after failing to make another library work.
- One-hot encoding my data, after realising it wasn’t in the correct format for a market-basket analysis.
- Discovering I had to turn the continuous data into categorical data before one-hot encoding it (unless I wanted the computation to take forever).
I overcame them in this order:
1. Found out how to do a market basket analysis in Python using a few libraries.
2. Abandoned one library (pycaret), which promised to be easier but just plain didn’t work, and figured out how to use mlxtend.
3. Turned my continuous data into discrete data using pandas.
4. One-hot encoded my now discrete data using pandas.
Carrying out a market basket analysis outside of Excel
You can carry out market basket analysis in Microsoft Excel using the Solver function. David Langer has an excellent video tutorial here.
However, using Microsoft Excel with Solver only provides you with one series of factors (or association rules). The one that has maximum lift. I wanted more than one rule – a list of association rules – ordered by lift.
This would help me better understand what is driving churn. If one factor appears multiple times, that implies it has a large impact. It could also help identify several cohorts of customers that are at risk.
I would need to leave the safe shores of Microsoft Excel…
Market Basket Analysis (or Association Rules Mining) with the Mlxtend Library
I kept coming across tutorials for creating a list of association rules in R; however, I can’t program in R.
But…I can program in Python.
Reading this tutorial on DataCamp, I found a method to carry out a market basket analysis with Python. The tutorial uses pycaret, a user-friendly library intended for those who aren’t data scientists. It does a lot of the heavy lifting, e.g. auto-encoding the data.
However, the latest version of pycaret has deprecated association rules mining. I tried to use a previous version of pycaret, but its older dependencies weren’t compatible with my machine. In short, dependency hell. Move on.
A few searches led me to believe that while mlxtend isn’t as user-friendly as pycaret, it’s more popular, and thus has more documentation and support. It would also provide me with what I needed – a series of association rules with different levels of lift.
I knew the library I was going to use. I’d joined the data in Microsoft’s Power Query. Time to get going, right?
Wrong.
See – there’s one little problem with carrying out association rules mining. One thing which pycaret would’ve done for me in the background.
Remember when I wrote this earlier:
A rule is simply “is this factor true or false?”. So is this subscriber from Germany? True or false.
Yeaaaaaaah. So, about that. My data didn’t exactly come in the form:
| … | Country-Germany | Country-USA | … |
| --- | --- | --- | --- |
| … | True | False | … |
| … | False | True | … |
| … | False | False | … |
It came in the more common-sense form:
| ID | … | Country | … |
| --- | --- | --- | --- |
| 23553 | … | Germany | … |
| 23554 | … | USA | … |
| 23555 | … | France | … |
But to carry out an association rules mining analysis, the columns must be in true or false form only.
I was going to have my data prep work cut out for me.
Turning Continuous Data into Categorical Data using qcut in Pandas
Casting my data into true-or-false form is called one-hot encoding. I could have fast-forwarded this step and one-hot encoded my data straight away, but then I’d have run into a problem: one-hot encoding a continuous column creates a separate column for every unique value, which blows up the number of columns (and the computation time).
Instead, I wanted to bin my continuous data into some highly sophisticated levels:
- Low
- Medium
- High
Totally sophisticated.
After some reading, I used pandas.qcut to create bins in my continuous data.
qcut bins a data series into a chosen number of quantiles. This worked perfectly with my super sophisticated levels of Low, Medium, and High.
I also included one more level – ‘Zero or Missing’. If data is missing or zero, that’s important info. For instance, it can tell us that the user was a customer in 2015 but not in 2016, and is thus a lapsed/churned customer.
Want to see the full code? Beyond small sketches like the one above, I’m keeping this article code light, because:
- Stephen Hawking’s editor told him to remove the equations from A Brief History of Time, because “every equation will halve the market” – people switch off when they see code.
- I already have a Jupyter notebook write-up that you can read on my GitHub, complete with a table of contents. You can jump to the part where I use qcut there.
- I’d really like to move on to the next fun step.
One Hot Encoding – or Turning Categorical Data into True or False Binary Data
We’ve turned our lovely continuous data into four categories:
- Zero or missing
- Low
- Medium
- High
Now it’s time for the final step before we can carry out our association rules mining: one-hot encoding these categories into a series of columns with True or False as the result.
We can simply do this with the pandas get_dummies() function. By default, get_dummies only encodes string/categorical columns, so I made sure all my numbers were encoded as strings before running it.
Executing the Market Basket Analysis or Association Rules Learning
Now our data is encoded properly, we can finally – finally – do our analysis. The actual data work that gives us our insight. Everything before was getting our ingredients ready before we cook the main meal.
But, hold your horses. We don’t just run mlxtend on all our data. Association rules learning is computationally expensive. In fact, it’s exponentially expensive.
The more columns we have, the more it costs to compute.
Which makes sense: the algorithm computes every combination of columns and works out its lift.
The difference between running the right or wrong set of data can be 15 minutes versus a couple of hours on a MacBook Pro.
The adage “garbage in, garbage out” also holds here – if you don’t filter your columns right, you’ll just get a bunch of association rules which make you go “no sh*t Sherlock”.
The first time I ran this analysis, I forgot to filter out the relevant columns. Not only did it take 45 minutes, but I got 50 or so rules which associated the “zero or missing” columns with the lapsed customers. This told me nothing. Why? Those columns were what defined a lapsed customer.
I dropped all those columns and began my analysis.
Using Parameters and Thresholds to Guide the Machine Learning
Our objective is to find the variables, or the series of factors, that when associated with each other, will likely lead to a churned customer.
Not to get too mathsy, but three parameters identify the strength of an association rule:
- Support
- Confidence
- Lift
Support is the probability of the event occurring in our dataset – the share of all accounts that contain a given combination of factors.
Confidence is a tricky one to explain. Of the accounts that match the series of factors, it’s the share that also show the outcome (e.g. churn) – a conditional probability.
Lift is what we’re after. It tells us how much more likely the outcome is when the factors are present, compared to the baseline rate.
With mlxtend, we can define the thresholds these parameters must meet. The higher we set these thresholds, the fewer rules we’ll get, but the higher quality they’ll be.
Higher thresholds also shorten our computation time.
For a threshold, we can borrow the familiar 5% from p-value thinking (as a floor here, rather than a ceiling) – so 5% will be our absolute minimum support. I initially set this to 10% and received almost no rules. I dropped it to 5% afterwards.
For the confidence, I initially set that to 60%, and got no rules. I dropped it to 50% and got three rules (all with a lift above 2). Enough to give me the insights I needed for this case study.
For fun, I ran it again with a confidence level of 30%. The other rules that came up provided some context, but not enough to provide business recommendations.
Finally, I exported the results as a CSV.
The Results of My Market Basket Analysis (and telling the data story)
All that work…all that effort…but what were the results? What were the insights?
Older customers, with fewer than four users, and with a missing NPS score, were 3.12x more likely to churn.
What’s great about this result is that we can see how important the age of the account is. In all three cohorts, customers who joined in 2012 are likely to churn.
Job done right? Just send off this chart, and we can tick off this interview task?
Wrong.
Because, even if you have your insight, the best data people know it’s about the story you tell. That’s how stakeholders come away with actionable next steps.
My task was to create a presentation. Before I revealed my findings, I had to tell a story.
Building the Data Story
For our insights to have maximum impact, we need to build a story leading to the conclusion.
Now that I knew where to look (age of accounts, NPS scores, and number of users), I could quickly do some analysis to help build a credible story.
On the morning of the interview, I built out my presentation in PowerPoint and quickly ran some analysis using pivot tables in Excel. It took me less than half an hour. (Deadlines are powerful things.)
If I’d done this in the beginning, without knowing where to look, it would have taken me hours (possibly more than two days) to find these trends.
Predictive analytics isn’t always the answer, and it requires a certain skillset to carry out, but it can help sharpen the questions you ask of your data.
I created the following charts, comparing 2015 and 2016 customers, and lapsed and still subscribed customers:
- First, I established context: how much were these churned customers worth? Answer: $297,000, or 18% of revenue.
- Second, I looked at churn by number of users: the lower the user count, the more the churn.
- Third, I explored revenue by account age: older accounts brought in less revenue.
- Finally, I hinted at the scale of the issue: 14% of accounts were in the oldest cohort.
This data story helped build towards our main conclusion: old accounts were the most likely to churn.
With that data story, we could create the presentation.
Presenting My Market Basket Analysis
The task called for no more than eight slides of content, including my methodology.
The entire purpose of this task, this case study, was to provide business recommendations.
No one cares about how amazing your data stack is or the tools you used. All they care about is:
- What is the problem?
- How big is the problem?
- What is your recommendation/what are the takeaways?
- Is it grounded in data?
With that, I outlined my presentation:
- Introduction
- The problem and my approach
- Context (the data story in the previous section)
- Findings
- Recommendations
- Methodology (I)
- Methodology (II)
You can view the presentation below:
My recommendations boiled down to:
- Provide more support and attention to older accounts.
- Make sure the needs of accounts with 1-4 users are being met. E.g. ensure they aren’t overpaying.
- Encourage customers to do NPS surveys, as there’s a correlation between missing NPS surveys and churned accounts.
Throughout the presentation, I made sure to bring all my findings back to revenue. How much were these customers worth? If they acted on the recommendations, how much revenue could they save?
If half of the accounts identified at risk are prevented from churning, TRS would save $850,000 to $1 million.
And finally, in the methodology section, I discussed my approach in a way the nerds would appreciate.
The Business Result of My Market Basket Analysis
What was the result of my presentation?
Despite my poor second interview performance, this case study and presentation were a hit. I received a job offer 24 hours later.
Like I mentioned, I chose to go with another job offer, as it was better aligned with my career direction.
Honestly? This offer was a surprise. I thought I’d blown my chances after my second interview. I think this presentation was strong enough to recover my application.
Lessons Learned from My Association Rules Mining Adventure
This was one of the first pieces of predictive analytics I’d done in Python. I’d carried out analyses like this in training and in textbook exercises, but I had yet to carry one out in a business setting.
I spent too much time trying to make pycaret work with outdated dependencies on my machine. In hindsight, I should have cut my losses earlier and started figuring out how to work with mlxtend.
I was reminded of the importance of filtering and pruning your data before carrying out machine learning. My first execution resulted in over one hundred rules, which were tautological and garbage.
Finally, I learned the importance of transforming and encoding your data correctly for your project. This was my first time one-hot encoding data.
If I Were to Do Another Association Rules Learning or Predictive Analytics Project
My biggest mistake (other than wasting time trying to make pycaret work) was running my association rule mining script on “tautological” columns. As mentioned earlier, these created garbage output and lengthened my computation time.
If I were to carry out this exercise again, I would spend a bit of time thinking forward: which columns will be output in my one-hot encoded data? Which of these, by definition, means the user has churned? Mindful of these, I would exclude them at an appropriate stage in the data prep process.
Outro: The Data Iceberg
There’s something I’m going to call “the data iceberg”.

All that stakeholders and your audience see are the visualisations and the insights.
Maybe they’re aware of some of the underlying processes to get there. But, in my experience, most of them are not.
Some of the things they’re not aware of:
- The data gathering process. The arm wrestling you did with stakeholders to get that data.
- The hardware & software limitations you faced during the process, forcing you to make trade-offs.
- An encoding problem that you didn’t foresee, which can be so complicated to explain to a non-technical stakeholder that it feels like you’re saying, “I can’t do the magic thing because the magic other thing isn’t in the magic way I need it to be.”
- That email you had to chase, to get an answer to the question “when you say 2021 data, is that from January 1st, or from the financial year start?”, which no one replied to, because they thought they’d put that in the metadata file, but they didn’t.
Now and then, I like to do a write-up of a data project, not just to illustrate this data iceberg to non-data people. But also, to myself.
In the middle of a project, when it’s taking twice as long as expected, the imposter syndrome kicks in: “I must be stupid”, “if I was better this task would be done by now”, “I need to hide this mistake, otherwise they’ll think I’m just a fraud”.
Write-ups like these give me not just a chance to reflect and see where I could improve. They also provide a sense of perspective.
They show the unexpected challenges that I overcame. They provide a scale of the work done to get what seems like a few insights.
They show just how valuable those insights are. This was the work required to get them. Hindsight bias kicks in, and you think “of course that’s the answer!”, but you couldn’t have known that without the work.
They prove, just for a short period, that maybe I’m not a fraud. Maybe I’m not an impostor. Maybe I do know what I’m doing.
And maybe, just maybe, I can take on a bigger data challenge.
If you have a project or a role you think I’d be a good fit for, reach out to me on LinkedIn. Alternatively, you can contact me by clicking the “Contact” link in the top menu, but be aware I get a lot of spam, so your email could get lost in the noise.
(Credit to Tima Miroshnichenko & Anna Nekrashevich for two photo assets used in the featured image.)