Feature Engineering Case Study - Subscribe & Save Churn Prediction
How to think about designing features - static vs behavioural, and prioritization
This post will help you approach a data science case study question that focuses on feature engineering. It follows the multi-turn conversation you will typically see in the case study format, with asides along the way so you also understand the structure we are following.

Interviewer: You work at a subscription-based e-commerce company (e.g. Amazon Subscribe & Save model). The business wants to predict which users will churn so they can send retention offers.
In a subscription e-commerce setting, churn could mean different things. It could be an explicit cancellation, or it could be defined behaviorally, such as no recurring orders for a certain period of time, or pushing Subscribe & Save orders further and further out. Could you clarify how we are defining churn in this case?
Interviewer: Let’s treat an explicit cancellation of their subscription as churn.
Got it, so churn is when a customer explicitly cancels their Subscribe & Save subscription.
Next, I want to clarify the prediction window. Are we trying to predict whether a customer will cancel in the next 30 days, or are we looking at a different time horizon?
Interviewer: The business is interested in looking at it weekly
Sounds good. And what are we hoping to achieve with the prediction? Are we just ranking customers by risk, or are we going to intervene on the top X% with some retention action that has a cost?
Interviewer: We will be looking to take action on the top 5%
That makes sense. So we will run this weekly and predict whether a customer will cancel in the upcoming week.
Just to be precise, should I assume that every week we score all currently active subscribers using data available up to the end of that week, and then label churn as whether they cancel in the following week?
Interviewer: Yes, that sounds good
With that setup in place, I’ll start thinking about data.
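As a quick aside, here is a minimal sketch of that snapshot-and-label setup in Python. The table and column names (`subs`, `start_date`, `cancel_date`) are assumptions for illustration, not a real schema:

```python
import pandas as pd

# Hypothetical subscriptions table: one row per customer, with a
# cancel_date of NaT if they are still active (names are assumptions).
subs = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "start_date": pd.to_datetime(["2024-01-07", "2024-02-04", "2024-03-03"]),
    "cancel_date": pd.to_datetime(["2024-04-10", pd.NaT, "2024-04-03"]),
})

snapshot = pd.Timestamp("2024-04-07")          # weekly snapshot date T
label_end = snapshot + pd.Timedelta(weeks=1)   # label window: (T, T + 7 days]

# Score only customers who are still active as of the snapshot.
active = subs[(subs["start_date"] <= snapshot) &
              (subs["cancel_date"].isna() | (subs["cancel_date"] > snapshot))].copy()

# Label = 1 if they cancel within the following week.
active["churn_next_week"] = (
    active["cancel_date"].between(snapshot, label_end, inclusive="right")
).astype(int)
```

The key point the sketch encodes is that features may only use data up to `T`, while the label comes exclusively from the week after `T`.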
Before I list anything specific, I like to structure this around a few buckets. First, baseline static attributes. Second, behavioral aggregates over time. Third, changes in behavior, since churn is often driven by shifts rather than absolute levels.
Let me start simple. I’d include some basic structural features like how long the customer has been subscribed. Tenure often matters because very new subscribers might churn quickly if expectations are not met, while long tenure users might be more stable. I’d also look at how many active subscription items they currently have. Someone with multiple recurring products might be more embedded than someone with just one.
Then I’d move to behavioral features. Since we are predicting weekly churn, I’d think in terms of recent engagement. For example, number of subscription orders fulfilled in the last 4 weeks, number of skipped deliveries recently, frequency of rescheduling, or any recent pauses. Skips and pauses in particular could be strong early signals of dissatisfaction.
I would also look at changes. For example, has their order frequency decreased compared to their historical average? Has average basket size dropped relative to their baseline? Churn is often preceded by declining engagement, so comparing short term activity to a longer term baseline could be important.
Before I go deeper, I want to check something. Do we have access to customer service interaction data? Things like recent complaints, refunds, or delivery issues could be strong predictors if available.
Interviewer: Yes, you can assume we have that data
Perfect, that gives us more signal to work with.
In that case, I’d definitely include features capturing support interactions. For example, the number of support tickets opened in the last four weeks, whether they had any unresolved issues, the types of issues raised, or even repeated complaints about delivery or product quality. We could also track changes over time, like whether support interactions have spiked compared to their usual baseline. That kind of sudden increase could be an early warning sign for churn.
We could also look at refunds or order cancellations in the recent past. Even if they haven’t fully churned yet, frequent cancellations or refund requests can indicate dissatisfaction that might lead to an explicit cancellation soon.
At this point, I’m thinking of combining static attributes, recent engagement, change over time, and support signals as our primary feature buckets.
I’ll move into listing features and then prioritizing them. Does that sound good?
Interviewer: Yes, that sounds good
Perfect, let’s do it that way.
To start, I’d focus on the baseline static attributes. The obvious ones are tenure and number of active subscriptions, because they set context as we mentioned before.
Delivery region could be important. Some regions may have more frequent delays, higher shipping costs, or seasonal fluctuations that can subtly influence churn. Knowing the customer’s region helps contextualize their behavior.
Preferred delivery day is another one. Customers who consistently choose a certain day may behave differently if deliveries are occasionally missed or delayed on that day. It can help capture frustration that might lead to churn.
We could also look at device or platform used to manage subscriptions. Are they primarily using the mobile app, website, or even third-party integrations? This can sometimes correlate with engagement levels and ease of managing subscriptions.
Another one is payment method type. Customers using credit cards versus debit cards or bank transfers might have different cancellation patterns — for instance, automatic card declines might trigger churn.
Finally, account age relative to first subscription can be informative. Even if tenure on the current subscription is short, longer-standing accounts might be more resilient because the customer has a history with the platform.
If you want, I can start moving into behavioral features next and show how they interact with these static attributes.
Interviewer: The Delivery Region seems to be a feature with high cardinality. Can you explain how you would handle it?
Sure. Let’s dig into it more technically. Delivery region is high cardinality, so there are a few ways we could handle it.
1. One-hot encoding – the simplest approach, where each unique region becomes a binary column. The main advantage is that it is interpretable and works well with tree-based models. The downside is that if there are hundreds of regions, this can drastically increase feature dimensionality, slow down training, and create sparsity. For low-sample regions, it also risks overfitting.
2. Target encoding (mean encoding) – here, we replace each region with the historical churn rate for that region. This compresses the information into a single numeric feature and preserves the predictive signal. The advantage is low dimensionality and strong signal capture. The downside is that, if not done carefully, it can introduce target leakage. For example, computing the mean churn over the entire dataset, including the row being predicted, inflates performance. To avoid this, we’d compute the encoding within cross-validation folds during training and use smoothing techniques to shrink estimates toward the global mean for regions with few samples.
3. Embedding-based representation – if we had a lot of other categorical features or user interaction data, we could learn a low-dimensional embedding for regions using something like entity embeddings in a neural network. This can capture complex interactions, but it’s more complex to implement, less interpretable, and probably overkill at this stage.
Considering we are early in rolling this Subscribe & Save feature out and want something simple, interpretable, and low-risk, I would start with target encoding with smoothing. It captures predictive signal efficiently, is compact, and can be computed consistently at training and serving time without exploding the feature space. We could always experiment later with embeddings or finer-grained one-hot encodings if we see lift opportunities.
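As an aside, here is a minimal sketch of smoothed target encoding on a toy frame (the region labels and churn values are made up). In practice the encoding would be fit within cross-validation folds to avoid leakage:

```python
import pandas as pd

# Toy training data; region names and churn labels are made up.
df = pd.DataFrame({
    "region": ["NE", "NE", "NE", "SW", "SW", "W"],
    "churned": [1, 0, 0, 1, 1, 0],
})

def smoothed_target_encode(frame, cat_col, target_col, k=5.0):
    """Shrink each category's churn rate toward the global mean.

    k controls the strength of the prior: categories with far fewer
    than k rows stay close to the global rate, limiting overfitting
    on low-sample regions.
    """
    global_mean = frame[target_col].mean()
    stats = frame.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + k * global_mean) / (stats["count"] + k)
    return smoothed  # Series indexed by category

enc = smoothed_target_encode(df, "region", "churned")
df["region_enc"] = df["region"].map(enc)
```

Note how the single-sample region "W" ends up near the global churn rate of 0.5 rather than at its raw rate of 0, which is exactly the shrinkage behavior we want for rare regions.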
Should I walk through the behavioral features next?
Interviewer: Yup, let’s move into behavioral features.
Great. Since we’re predicting weekly churn, behavioral features are going to be really important because churn is often driven by changes in engagement rather than just static attributes. I like to think about them in terms of recency, frequency, and intensity.
A simple starting point is subscription fulfillment metrics. For example, the number of deliveries successfully fulfilled in the last week, last two weeks, and last month. Alongside that, we can track skipped deliveries, paused subscriptions, or rescheduled orders over the same windows. Skipped or paused deliveries are often early signals that a customer is disengaging.
We can also look at spending behavior. Average basket size or total spend over the last few weeks compared to the long-term average can indicate declining engagement or satisfaction. If someone suddenly orders fewer items or cheaper items than usual, that could precede churn.
Another behavioral angle is engagement with the platform. For example, email opens, clicks on promotions, or app logins in the last week versus longer-term averages. Drops in engagement often correlate with impending churn.
Finally, I’d include trend features. It’s not just the absolute numbers but the change relative to baseline. For example, percentage change in deliveries fulfilled compared to the previous month, or change in average spend. Capturing these deltas helps the model detect shifts that may indicate churn.
Interviewer: These sound promising, but let’s cover why you chose the windows you did.
Absolutely, the choice of time windows is really about balancing short-term signal with longer-term context.
For skipped or paused deliveries, I’d use a very short-term window, like the last week, because churn often follows recent disengagement. If a customer skips a delivery this week, it’s much more predictive than something they skipped two months ago. I’d also include a slightly longer window, like the last four weeks, to capture repeated patterns — for example, someone skipping one delivery occasionally might not be risky, but repeated skips over a month could be a stronger signal.
For fulfilled deliveries and spend, I’d use both short-term and longer-term windows. The last week shows immediate behavior, but comparing that to the last month or three months gives context. For instance, if a customer normally orders three items per week but only ordered one this week, that relative drop is a red flag. So we often use short-term windows for recency and longer-term windows for baseline behavior.
For platform engagement metrics like email clicks or app logins, short-term windows matter most, because sudden drops often precede churn. Weekly windows are fine, but for users with sparse activity, a rolling two- or four-week window helps smooth out noise.
The general principle is: short windows capture immediate signals, long windows provide context and help compute deltas or trends. Each feature’s window is chosen based on how quickly we expect the signal to reflect potential churn. Skips and complaints are fast-moving, while spend or order patterns may be slower to change, so they benefit from comparing short-term vs long-term aggregates.
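To make the windowing concrete, here is a sketch of a short-window count, a longer baseline window, and the delta between them, all anchored to a snapshot date. The event log schema (`events`, `order_date`) is an assumption:

```python
import pandas as pd

# Hypothetical delivery log: one row per fulfilled order (made-up data).
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2],
    "order_date": pd.to_datetime(
        ["2024-03-04", "2024-03-18", "2024-03-25", "2024-04-01", "2024-02-05"]),
})

snapshot = pd.Timestamp("2024-04-07")  # scoring snapshot T

def window_count(ev, weeks_back_start, weeks_back_end):
    """Orders per customer in the window (T - start, T - end]."""
    lo = snapshot - pd.Timedelta(weeks=weeks_back_start)
    hi = snapshot - pd.Timedelta(weeks=weeks_back_end)
    mask = (ev["order_date"] > lo) & (ev["order_date"] <= hi)
    return ev[mask].groupby("customer_id").size()

recent = window_count(events, 4, 0)     # the last 4 full weeks before T
baseline = window_count(events, 12, 4)  # the 8 weeks before that

features = pd.DataFrame({"orders_4w": recent, "orders_prev_8w": baseline}).fillna(0)
# Delta: recent activity vs the longer-term baseline, with the 8-week
# baseline halved to put it on a comparable 4-week scale.
features["orders_delta"] = features["orders_4w"] - features["orders_prev_8w"] / 2
```

Both windows end at or before the snapshot, so nothing leaks from the prediction week, and the delta captures decline relative to the customer's own baseline.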
If you like, I can now walk through customer support and service features and the windows I’d use for those.
Interviewer: Yes, sounds good. Let’s cover those as well.
Great. Customer service and support interactions can be really predictive of churn, but again the time window depends on the type of signal.
For number of support tickets opened, I’d focus on short-term windows like the last week or last two weeks. A sudden spike in tickets can indicate frustration or problems that may immediately trigger a cancellation. I’d also include a slightly longer window, like the last month, to capture recurring issues or chronic dissatisfaction.
For ticket resolution status, I’d look at unresolved tickets as a separate feature. An unresolved issue from last week or two could be more predictive than one that was resolved quickly. We could also compute trends, like whether the number of unresolved issues has increased compared to the previous month.
For ticket types, some categories may be more indicative of churn. For example, repeated delivery complaints or product quality issues might matter more than general questions about billing. We could one-hot encode the main categories, or use target encoding for high-cardinality categories if needed.
Finally, refunds or cancellations in the recent past are important. I’d track refunds in the last week and last month, and also compute changes compared to historical averages. A sudden increase in refunds could indicate dissatisfaction that often precedes explicit churn.
The overall idea is similar to behavioral features: short-term windows capture immediate risk, longer-term windows provide context and allow us to compute trends or deltas. Combining these signals with the baseline and behavioral features should give the model a strong foundation to predict weekly churn.
Interviewer: This could be a very sparse set of features. How would you handle this?
Customer service and support features are naturally sparse — not every customer opens tickets, requests refunds, or complains every week. So we need to handle sparsity carefully to make sure the model can still use the information without being biased or dominated by zeros.
One approach is to treat missing or zero interactions explicitly. For example, we could include a binary indicator for whether the customer had any support interaction in the window, alongside counts or normalized metrics. That way, the model can differentiate between “no interaction because everything is fine” and “low counts that actually carry signal.”
Another approach is smoothing or aggregation. For features like number of complaints or refunds, we could compute a rolling sum over multiple weeks or combine short-term windows with longer-term averages. That reduces noise and gives the model a signal even for customers with sparse activity.
We could also use categorical encoding for types of tickets only for customers who actually have interactions, and leave others as a default “none” category. That prevents creating hundreds of sparse columns with mostly zeros.
Finally, for very new users with little history, we can borrow strength from the population. For example, if a customer has no support history, we can use the average number of complaints or refund rate for their cohort as a proxy. This avoids leaving the feature blank while not introducing leakage.
Overall, the principle is to encode both the presence of interactions and their magnitude, smooth sparse signals, and fallback to cohort-level or population-level priors for users with little history.
In terms of rolling it out, I’d first include basic count features with a binary indicator. For example, the number of tickets opened in the last week, plus a 0/1 flag for whether they had any support interactions at all. This gives the model an immediate, low-complexity signal. Based on ablation results, I would then try out the other methods.
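A minimal sketch of the presence-flag plus cohort-fallback idea, on a made-up weekly support aggregate (column names and cohort labels are assumptions):

```python
import pandas as pd
import numpy as np

# Hypothetical weekly support aggregates. NaN means "no history yet"
# (e.g. a brand-new user), while 0 means an observed quiet window.
support = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "cohort": ["new", "tenured", "tenured", "new"],
    "tickets_4w": [2.0, 0.0, 1.0, np.nan],
})

# Presence flag lets the model separate "no tickets" from "some tickets".
support["any_ticket_4w"] = (support["tickets_4w"].fillna(0) > 0).astype(int)

# For customers with no history, fall back to their cohort's average
# instead of leaving the feature blank.
cohort_mean = support.groupby("cohort")["tickets_4w"].transform("mean")
support["tickets_4w_filled"] = support["tickets_4w"].fillna(cohort_mean)
```

Customer 4 has no support history, so their count is backfilled with the "new" cohort's average rather than a bare zero, which keeps "unknown" distinct from "observed and quiet".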
Interviewer: Ok, sounds reasonable. Let’s now figure out how to prioritize features.
Yeah — so if I think about shipping a v1 churn model, I’d try to be very disciplined. It’s tempting to throw in everything, but in production you want signal, stability, and simplicity first.
First, I’d start with strong static features. These are almost always worth including and they’re cheap to compute.
Tenure is a no-brainer. In most subscription businesses, churn risk is very different in week 2 versus week 80. Even if tenure alone isn’t highly predictive, it anchors the model.
Then I’d include number of active subscription items. Someone with multiple items is usually more embedded. That’s a structural commitment signal.
Subscription tier or plan type also matters — higher tiers often correlate with either stronger commitment or different churn behavior entirely.
Those three together form a really stable backbone. They won’t move week to week, so they won’t introduce noise.
Then I’d move to behavioral features — this is where the real signal usually lives. Since we’re predicting weekly churn, I’d center most features around a 4-week window. It’s long enough to smooth randomness, but short enough to capture recent dissatisfaction. So I’d include:
Orders fulfilled in last 4 weeks
Skips in last 4 weeks
Any recent pause
Reschedules
Maybe skip rate instead of just skip count
Skips and pauses are especially important. In subscription businesses, churn is often preceded by “soft disengagement.” People don’t cancel immediately — they start skipping. I’d also include a simple delta: activity in the last 4 weeks compared to the previous 4 weeks. Even just one engagement change feature can be powerful, because churn is often about decline, not absolute level. That would probably be the core predictive engine.
Then I’d layer in customer support features — but carefully. Support data can be strong, but it’s sparse and sometimes messy. So for v1, I’d keep it simple:
Binary flag: any support interaction in last 4 weeks
Ticket count in last 4 weeks
Refund count in last 4 weeks
I wouldn’t start with ticket categories, NLP sentiment, resolution times, etc. That’s second iteration work. First I’d just test whether “friction” signals move the needle.
So if I had to summarize how I’d structure v1: Start with static backbone → Add recent behavioral intensity → Add one or two decline signals → Layer simple support friction signals
I’ll keep the total feature count modest — maybe 20–30 clean features. Then I’d evaluate incremental lift by bucket. If support adds almost nothing, maybe churn is more about habit decay than service friction. If deltas dominate, that tells us churn is trajectory-driven. That’s how I’d phase it.
Interviewer: How would you calculate these features, and how often?
I’d start by anchoring the entire system around a clear scoring cadence. Since we’re predicting weekly churn, I’d score once per week — say every Sunday night. That becomes our snapshot date, T. Every feature we compute must only use data available up to T, and the label would be whether the customer churns in the following week. If we don’t lock that down clearly, leakage creeps in very easily.
For static features, computation is straightforward. Tenure is just the difference between the snapshot date and the subscription start date, expressed in weeks. Active item count is the number of items active as of that snapshot timestamp. Subscription tier is whatever plan they’re on at that moment. These would be recomputed at scoring time each week, but they don’t require heavy aggregation pipelines. They’re just state-based lookups.
Behavioral features require more care. Since we’re scoring weekly, I’d align everything to full-week windows instead of arbitrary rolling days. For example, “orders in the last four weeks” would mean the four complete weeks prior to the snapshot date. That avoids partial-week noise and keeps definitions consistent over time. I’d use the same logic for skips, pauses, reschedules, refunds, and support tickets — always the same window boundaries so the system is coherent.
For delta features, I’d compute adjacent windows. So I’d calculate orders in the most recent four weeks, then orders in the four weeks before that. The difference or percentage change between those two becomes the trend signal. The key is that both windows must end before the snapshot date. Nothing should overlap into the prediction window.
Ratios, like skip rate, would be calculated within the same window. So skips in the last four weeks divided by scheduled deliveries in the last four weeks. Ratios are helpful because they normalize across customers with different activity levels.
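A quick sketch of that skip-rate ratio with a zero-denominator guard, on made-up per-customer aggregates (both columns assumed to come from the same 4-week window ending at the snapshot):

```python
import pandas as pd

# Hypothetical per-customer 4-week aggregates (names are assumptions).
agg = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "skips_4w": [1, 0, 3, 0],
    "scheduled_4w": [4, 4, 3, 0],
})

# Skip rate normalizes across customers with different delivery cadences.
# where() turns a zero denominator into NaN so we never divide by zero,
# and fillna(0.0) treats "nothing scheduled" as a zero skip rate.
agg["skip_rate_4w"] = (
    agg["skips_4w"] / agg["scheduled_4w"].where(agg["scheduled_4w"] > 0)
).fillna(0.0)
```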
In terms of frequency, I would recompute everything weekly and generate a snapshot table each week with one row per customer per snapshot date. That makes training, backtesting, and debugging much cleaner. I wouldn’t recompute daily unless the business actually wants daily churn predictions — otherwise it just adds operational complexity without improving alignment.
Now, from here on, the interview could dive into choosing model metrics, experimenting with models, and productionizing the system. In the interest of keeping this post short, I’ll stop here. But if you want to see a Part 2, comment down below!



