
Cracking The AI Challenge: Evidence From The Academy
Author(s): Dor Meir
Originally published on Towards AI.
MODELING MADE SIMPLE

When I joined my most recent company, almost five years ago, my boss told me that an AI project is like a puzzle:
“Each project presents a challenge, a puzzle, and the job is to rigorously assemble the pieces of the puzzle — until you crack it and find the gold-producing model.”
About five years later, I know he was right. In this post, I’ll present four puzzles I solved for a college: enrolling more students, meeting diversity goals, optimizing course logistics, and preventing student dropout.
These will be our business challenges.
Underneath these business challenges lie the professional challenges. Those are the few crucial pieces that, once solved, do most of the work, the corners and frame of the puzzle, and they are also the most interesting ones! A puzzle might have lots of pieces, but here you’ll only read about the most critical ones: the 20% that, per the 80/20 rule, deliver 80% of the value.

Lastly, while solving puzzles is nice, in reality, we need actual results that’ll answer our business needs. These will be the model’s results.
So, without further ado… Enjoy!
Table of contents
- Leads prioritization — Leakage‑free future features
- Meeting diversity goals — Cross‑section → grouped multivariate time series
- Optimizing course logistics — Lag & rolling‑window features for panel TS
- Dropout prevention — Group‑aware time split + post‑stratified risk banding
- Conclusion
1. Leads prioritization
Let’s start at the beginning, and in the case of a college, the beginning is enrollment in studies. TL;DR: Safely adding future features improved the lead conversion rate by 54%.
The college’s admissions center actively contacts leads who have expressed interest in a degree and enrolls them in studies. The problem is, there’s an overwhelming number of leads each year (hundreds of thousands), and many are of low quality. How low? Before any model was used, the conversion rate was only 27%, meaning only 27% of those hundreds of thousands of leads had registered for studies.

When I joined this project, there was already a model in place, which had raised the conversion rate to 37%. Here’s an illustration of how the data looked:
There was general satisfaction with the existing model. And while the admissions center callers occasionally suggested features that might improve it, incorporating most of those features didn’t move the needle much.

The business challenge
At one point, though, after another periodic presentation of the model’s validation results in production, I got interesting feedback from the admissions center:
“That’s all good, but you must know there’s a case where if we get a lead with certain properties, we completely neglect the model predictions, right?”
It turned out that whenever the call center received a lead who had been advised to meet with an academic consultant, or who had already attended a few such meetings, the callers gave that lead the highest priority and disregarded what the model predicted for it.
This instantly rang all the bells for me. The model was obviously at risk of being neglected altogether: the users already had a habit of ignoring it in these cases, and those cases might become more prevalent as the data distribution shifts. I knew right away this crack had to be mended before the flood came in.

With this important insight, I added the two new consultation features to the model, retrained it, and… miraculously saw an astonishing improvement in the conversion rate! Hurray! 🚀
But wait — “too good to be true” results are not a good sign in an AI model, and should be treated with the utmost suspicion… Did I make a fundamental mistake here? Is there a leakage of some sort?
The professional challenge: Add leakage-free future features
Eventually, I got it: yes, those two new features were very indicative of registration, more than any other feature we had. The only problem was that before a lead enrolls, we cannot know whether she will ever be advised to take a consultation. The majority of leads will be missing that data at prediction time, and the model will perform poorly for them. In other words, these were future features, with the potential to cause data leakage.

How exactly did I crack that challenge?
First, I didn’t break the original model. Instead, I trained a separate model, only for those “older” leads that already had a consultation record. The pipeline now begins with an initial filter: does the lead have any consultation data? If not, the lead is sent down the original model’s path. If it does, it is sent to the second model, which includes the consultation features.
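Here’s a minimal sketch of that routing logic, assuming two already-trained scikit-learn-style classifiers and a pandas DataFrame of leads; the consultation column names are hypothetical, not the actual schema:

```python
import pandas as pd

def predict_enrollment(leads: pd.DataFrame, base_model, consultation_model) -> pd.Series:
    """Route each lead to the right model, so the consultation features
    never leak into predictions for leads that don't have them yet."""
    # Hypothetical consultation columns; any lead with a recorded consultation
    # (advised to meet a consultant, or already met one) goes to the second model.
    consult_cols = ["advised_consultation", "n_consultation_meetings"]
    has_consult = leads[consult_cols].notna().any(axis=1)

    # The original model only ever sees the original features.
    base_features = [c for c in leads.columns if c not in consult_cols]

    scores = pd.Series(index=leads.index, dtype=float)
    scores[~has_consult] = base_model.predict_proba(
        leads.loc[~has_consult, base_features])[:, 1]
    scores[has_consult] = consultation_model.predict_proba(
        leads.loc[has_consult])[:, 1]
    return scores  # a single enrollment probability per lead, as the callers see it
```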
The model’s results
In the end, the admissions center callers see only one probability of enrollment, produced by this ensemble of models, and it is much more accurate than the initial model. But by how much? Even two years later, we got a 57% conversion rate (compared to the old 37%). Meaning, for every lead our model predicts will enroll, there’s a 57% chance it’ll enroll! This allowed the call center to enroll more students while also cutting labor-hour costs.
More importantly, we addressed the concerns raised by the model’s users. They no longer had a reason to ignore the model’s results: it incorporates the new future features without leakage and delivers better predictions than before, while leaving intact the existing model they were generally happy with.
2. Meeting diversity goals
Now that we have increased the total number of people enrolling in the studies, we will crack the challenge of helping our college achieve diversity. TL;DR: Our data transformations have led to a daily accuracy of 92% in predicting the number of students who will register from each population segment.
The college’s CEO sets enrollment goals for various minority groups. More precisely, she’s interested in meeting the diversity goals of those minorities in four different categories:
- Population type: general population, Indian-speaking, Arabic-speaking.
- Student type: regular, international, high school, workplace-related studies, soldiers, prisoners, etc.
- Degree track: 1st degree, 2nd degree, 3rd degree, teaching certificate, non-academic diploma studies.
- Student tenure: freshmen, seniors.
For example, in one semester, the CEO might be interested in predicting the number of Arabic-speaking students, regular type, 2nd degree, and freshmen. At the same time, she might also need a prediction for general population students who are still in high school, 1st degree, and freshmen. Without these predictions, the CEO will have trouble setting goals for the number of students who’ll register for each of these specific groups.
The business challenge
There are 180 different possible combinations of these groups across the categories, and the challenge is that we don’t know, before each semester, which specific combinations the CEO will require predictions for. We also had to deliver predictions daily, taking into account the number of registrations per group from the previous day.
This is how one time-series, of one population segment in one semester, looks:
Hence, we are required to make predictions for specific groups in our data, not individuals, and we need to make those predictions daily, not just for a single point in time (as with the leads model). However, our data is at the individual level and records separate points in time, telling us which specific student enrolled or canceled enrollment on particular days. What should we do, then?
The professional challenge: Transform individual cross-sectional microdata into aggregated multivariate time series
The main solution here was to reconstruct the data altogether, from individual cross-sectional microdata into aggregated multivariate time series (in other words, into panel data grouped by population segment):
What do we have here? A lot has changed (see the technical details in the table description):
- population_segment: now the main identifier of each time series in the cohort.
- semester_id and registration_day: the unique identifiers of each observation, since each population segment appears only once per semester per registration day. The model can now learn what happened for each population segment on each registration day (remember, we need daily predictions!).
- sum_enrolled_per_day: the number of students from each population segment who enrolled on each registration day.
- cumulative_sum: how registration has progressed so far, as the end of the registration period approaches.
- final_enrolled_semester: for each population segment and registration day, the final number of students registered in that semester. Of course, this is obfuscated in the test data and is only used to validate the final selected model.
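A minimal sketch of that aggregation step with pandas, using a toy events table; the column names follow the description above, but the raw schema and values are illustrative:

```python
import pandas as pd

# Raw microdata: one row per enrollment/cancellation event of a single student.
# event is +1 for an enrollment and -1 for a cancellation.
raw = pd.DataFrame({
    "population_segment": ["arabic_regular_2nd_freshman"] * 4,
    "semester_id": ["2024A"] * 4,
    "registration_day": [1, 1, 2, 3],
    "event": [1, 1, -1, 1],
})

keys = ["population_segment", "semester_id", "registration_day"]
ts = (raw.groupby(keys)["event"].sum()
         .rename("sum_enrolled_per_day")
         .reset_index()
         .sort_values(keys))

# Running total of net enrollments within each segment and semester.
ts["cumulative_sum"] = (ts.groupby(["population_segment", "semester_id"])
                          ["sum_enrolled_per_day"].cumsum())

# Final count per semester: the training target (hidden at prediction time).
ts["final_enrolled_semester"] = (ts.groupby(["population_segment", "semester_id"])
                                   ["sum_enrolled_per_day"].transform("sum"))
print(ts)
```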
Now our model completely addresses the CEO’s needs: a daily prediction of how many students will register from every type of group.
Interestingly, this transformation also handles much larger population segments. A prediction for an aggregated group is simply the sum of the model’s predictions for the smaller groups that comprise it. For instance, if on registration day 73 the CEO needs a prediction for the total number of general-population prisoners who’ll enroll across all tenures and 1st and 2nd degree tracks, we can sum the predictions for all categories that are mutually exclusive and collectively exhaustive of that bigger category:

Naturally, we can also predict the entire population of students who’ll enroll by summing the predictions for all groups.
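For illustration, a tiny sketch of that roll-up, with made-up segment names and numbers:

```python
# Per-segment predictions for registration day 73 (illustrative names and numbers).
segment_preds = {
    "general_prisoner_1st_freshman": 12.3,
    "general_prisoner_1st_senior": 8.1,
    "general_prisoner_2nd_freshman": 4.6,
    "general_prisoner_2nd_senior": 2.0,
}
# The segments are mutually exclusive and collectively exhaustive of
# "general-population prisoners, 1st & 2nd degree, all tenures",
# so the aggregated forecast is simply their sum.
aggregated_forecast = sum(segment_preds.values())
print(round(aggregated_forecast, 1))  # 27.0
```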
The model’s results
Along with other engineered features, we achieved a one-year forecast accuracy of 92% across all days of the registration period, where accuracy is defined as one minus the normalized absolute error (MAE divided by the absolute mean of the target). More importantly, the CEO now has an automatic daily prediction for all those different population types, with maximum flexibility over the various possible intersections of population categories. It’s now much easier to set goals for the participation of minority groups, and to adhere to them.
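For clarity, a small sketch of that metric as I read it, assuming accuracy is one minus the MAE divided by the absolute mean of the target:

```python
import numpy as np

def forecast_accuracy(y_true, y_pred) -> float:
    """Accuracy as 1 - MAE / |mean(y_true)|, i.e. one minus the normalized MAE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    return 1 - mae / abs(y_true.mean())

print(forecast_accuracy([100, 120, 80], [92, 125, 85]))  # 0.94
```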
3. Optimizing course logistics
We’ve solved both general and specific population admission challenges, Hurray! 🏅 It’s now time to crack the challenge of predicting the enrollment of existing students in the variety of courses the college offers. TL;DR: Our feature engineering in this challenge resulted in an 88% accuracy in predicting the number of students for all 700 different courses.
The business challenge
The college’s Dean distributes textbooks and opens study groups based on course enrollments. However, previously inaccurate predictions of the number of enrollees per course (based on rules of thumb) led to serious logistical issues. It reached the point where the college had to cancel students’ participation in courses, resulting in both financial losses and student dissatisfaction.
As you can see, the college already has some interesting features to consider for each course that’ll help predict how many students will eventually enroll. But can we make more useful features out of those?
The professional challenge: Engineer lag and rolling features for multivariate time series
Each row contains categorical data and data about the course’s enrollees from the previous semester. More importantly, we also have historical data going back to when the course was established. For some courses, this means years of history.
As with the students’ diversity model, we again have aggregated multivariate time series, so we can treat each feature as a time series. That lets us compute moving averages, sums, differences, percentage changes, variances, and other statistics over lagged values. Put simply, we can capture both historical and ongoing patterns by calculating various statistics on both our numerical and categorical features (the latter turned numeric using target encoding):
A summary of what I did here (see the technicalities in the table description):
A. Aggregated categorical statistics — For every categorical field (e.g., population group, student category, department), we replaced the original label with numerical aggregates, and rolling and lagged variants, that expose the historical demand for the course. Identical sets were created for the three previous semesters and for every other high-cardinality category we track.
B. Aggregated numerical statistics — we added daily, cumulative, and final-semester aggregates and their lagged / rolling counterparts for each numerical feature.
These transformations roughly tripled the feature count, turning sparse categorical labels into dense, information-rich numeric predictors and allowing tree-based models to learn both short-term shocks and long-term trends. 💪 Since we had plenty of data, we were less worried that this abundance of features would make our model overfit to noise in the training data.
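Here is a minimal sketch of what such lag and rolling-window engineering can look like with pandas, assuming one row per course per semester; all column names (course_id, department, semester_idx, n_enrolled) are illustrative, not the actual schema:

```python
import pandas as pd

def add_lag_rolling_features(df: pd.DataFrame,
                             value_col: str = "n_enrolled") -> pd.DataFrame:
    """Add lagged and rolling enrollment statistics per course.
    Assumed (illustrative) columns: course_id, department, semester_idx, n_enrolled."""
    df = df.sort_values(["course_id", "semester_idx"]).copy()

    # Lags: enrollment 1, 2 and 3 semesters back (shift avoids using the current value).
    for lag in (1, 2, 3):
        df[f"{value_col}_lag{lag}"] = df.groupby("course_id")[value_col].shift(lag)

    # Rolling mean/std over the 3 previous semesters, excluding the current one.
    df["_past"] = df.groupby("course_id")[value_col].shift(1)
    roll = df.groupby("course_id")["_past"].rolling(3, min_periods=1)
    df[f"{value_col}_roll3_mean"] = roll.mean().reset_index(level=0, drop=True)
    df[f"{value_col}_roll3_std"] = roll.std().reset_index(level=0, drop=True)

    # Percent change between the two previous semesters.
    df[f"{value_col}_prev_pct_change"] = (
        df[f"{value_col}_lag1"] - df[f"{value_col}_lag2"]
    ) / df[f"{value_col}_lag2"]

    # A rough aggregated categorical statistic: historical mean demand per department,
    # computed as an expanding mean shifted by one so the current semester is never used.
    df["department_hist_mean"] = (
        df.sort_values("semester_idx")
          .groupby("department")[value_col]
          .transform(lambda s: s.expanding().mean().shift(1))
    )
    return df.drop(columns="_past")
```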
The model’s results
After engineering these new features, the model achieved an annual accuracy of 88% (again, one minus the normalized MAE) across all 700 courses. It’s worth mentioning that these courses range widely in size, from 10 to 3,700 students. These results were a major help to the college in optimizing course logistics and making sure all students can participate in their selected courses.

Besides these amazing results, the stakeholders were rather surprised by the model’s feature importance. Intuitively, they understood that the previous semester’s registration data might be important for predicting registration in the current one. However, they couldn’t imagine how important the history of that figure was to the prediction, even five semesters back! 😲
And with that, we cracked the course logistics optimization challenge and left the stakeholders excited for our next, final challenge.
4. Dropout prevention
We’ve now cracked the challenges for all new students (general admission and diversified admission) and existing students (enrollment in courses). Let’s crack the final challenge for graduating students, i.e., lowering the dropout rate and ensuring that everyone who can graduate does so successfully. TL;DR: Our careful data splitting and risk banding eventually doubled the accuracy of predicting, early in their studies, which students will graduate.
The business challenge
The college’s dropout rates are particularly high. There’s a persistence unit that supports students early on, actively approaching them and assisting in different ways, ultimately preventing dropout. The issue is that referrals are made using rules of thumb (a little déjà vu?), and filtering the relevant students to approach, out of tens of thousands of active students, is a major challenge. Additionally, the persistence unit aims to engage students at a very early stage of their studies, before the majority of dropouts occur. The problem is that at this point there’s very little data about the students, making it hard to differentiate them by their graduation chances.

This is how the raw data of graduating students looks:
And here it is after engineering lag and rolling features, with NAs not yet handled:
The professional challenge 1: Group‑aware Time Series Split
One crucial issue in training such a model is the nuanced split of the train and test sets. We have two things to consider to avoid data leakage:
- Students’ exclusivity to one dataset: Each student is, in fact, a time series of multiple observations, one per semester. If we split any of these time series, leaving some of a student’s semesters in the train set and others in the test set, we’ll have obvious target leakage. The model would learn about Bob’s semesters from 2023 to 2024, and final_status already tells us he eventually graduated. It would then be asked to predict Bob’s graduation status in his last semester, 2025, which sits in the test set, after already being informed of his graduation during training. Thus, the train and test sets should isolate all records of each student completely.
- “Last year of studies” exclusivity to one dataset: It is not enough to keep each student’s data exclusive to one set; we still need to put the right students in each set. If we add Carol, who graduated in 2025 (our last year of data), to the train set, our test results for Dave, who dropped out in 2025, will be over-optimistic, as they rely on knowledge about students who finished in 2025. We would essentially be training the model on data that will not be available in production. For our model to generalize well to the future, it can only train on students who finished their studies up to 2024 and test on students who finished in 2025.

These two considerations leave us with, first, a train set whose student IDs are completely disjoint from those in the test set. Second, although both sets may contain observations from the same older semesters, only the test set contains the latest semester, so the latest semester appearing anywhere in the train set is always earlier than the earliest final semester in the test set.

The easiest way to achieve these conditions is to split students by their final year, i.e., the year of max_semester (e.g., 2025 for the semesters 2025A, 2025B, and 2025C). Students who graduated or dropped out before 2025 go to the train set, and students who finished in 2025 go to the test set. Of course, each student takes all of their records with them to the relevant set.
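A minimal sketch of such a split, assuming a student_id column and semester ids like 2025A (both names illustrative):

```python
import pandas as pd

def group_aware_time_split(df: pd.DataFrame, test_year: int = 2025):
    """Split student-semester records so that (1) each student's records live in
    exactly one set and (2) only students who finished in `test_year` are tested on.
    Assumes illustrative columns student_id and semester_id (e.g. '2024A', '2025B')."""
    # Final year of each student = year of their latest semester.
    final_year = (df.groupby("student_id")["semester_id"]
                    .max()
                    .str[:4].astype(int))

    train_ids = final_year[final_year < test_year].index
    test_ids = final_year[final_year == test_year].index

    train = df[df["student_id"].isin(train_ids)]
    test = df[df["student_id"].isin(test_ids)]
    return train, test
```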
The professional challenge 2: Post‑stratified risk banding
Fast forward to the modeling phase of CRISP-ML: we’ve split the data correctly, trained various ML algorithms, selected the best model, and optimized our decision threshold. We now have the final model, producing, for each student at each point in time, both a probability and a final decision on whether the student will graduate:

Now what? If we take these results as they are, with a 0-to-1 probability of graduating for students at all levels, the model adds very little useful information for the persistence unit. Why is that? If you recall from the first plot for this model, the number of credits a student has acquired so far is a highly indicative feature of graduation, perhaps the most indicative one:

So our model doesn’t add much if it simply predicts a higher probability of graduation for students with more credits. What we can do instead is cluster together students at the same credit level, say in ranges of 10 credits, and train a model per range. That way, we neutralize the strong effect of credits and provide more useful predictions to our clients based on the model’s other features. You can see an example of multiple models trained on different segments of the data here:
Time Series: How to Beat SageMaker DeepAR with Random Forest
Improve your KPI by 15% with 3X faster, free & interpretable model
medium.com
The problem with the multiple-models approach is that, given a regular degree is usually completed after 120 credits, it means maintaining 12 different models (e.g., 0–10 credits, 11–20 credits, 21–30 credits, …, 111–120 credits), and it also produces confusing results across different observations of the same student. For instance, if Alice, with 6 credits so far, has excellent grades compared to her peers in the same credit range, she’ll be predicted to have an 80% chance of graduating at that point. If, however, her success index declines at 18 credits relative to her peers, she might now have only a 10% chance of graduating, even though we know for sure her chances are better after 18 credits than after only 6. The model becomes too local and hard to interpret without weighing the credit range in the calculation.
However, there is a more straightforward approach, easier to maintain and less prone to confusing results. We can take the initial model, stratify its probabilities by credit range, and cluster them into three color bands representing the probability (or risk) of graduating (or dropping out) relative to students at the same credit level. This way we achieve two things: Alice’s probability rises as she acquires more credits, and we can still compare Alice to her peers within any credit range.

How do we do it, then? The steps are below, followed by a small code sketch.
- Gather observations of students within the same credit range (e.g., 0–10, 11–20).
- For each credit range, find the tertiles (the thresholds that split the ordered graduation probabilities into three equal parts).
- Mark each probability prediction in each credit range with a “color”:
A. Red: the lowest tertile, students who are most likely to drop out compared to other students in this credit range. The college should probably not approach these students first, as they have very little chance of graduating either way.
B. Yellow: the middle tertile, students with a medium probability of dropping out or graduating compared to other students in this credit range. These are the students the college is most interested in approaching and assisting, as they are dangling between graduation and dropout.
C. Green: the highest tertile, students who are most likely to graduate compared to other students in this credit range. As with the red band, these students should probably not be approached first, as they already have a high chance of graduating without any help.
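Here is that small sketch in pandas; the credits, p_graduate, and band column names are illustrative, not the college’s actual schema:

```python
import pandas as pd

def band_by_credit_range(preds: pd.DataFrame) -> pd.DataFrame:
    """Assign red/yellow/green bands within each credit range.
    Assumes `preds` has a `credits` column and the model's `p_graduate` probability."""
    out = preds.copy()
    # 1. Group observations into ranges of 10 credits (0-10, 11-20, ..., 111-120).
    out["credit_range"] = pd.cut(out["credits"], bins=range(0, 130, 10),
                                 include_lowest=True)
    # 2 + 3. Rank each probability within its credit range and cut the ranks into
    # tertiles: lowest third = red (highest dropout risk), top third = green.
    pct_rank = out.groupby("credit_range", observed=True)["p_graduate"].rank(pct=True)
    out["band"] = pd.cut(pct_rank, bins=[0, 1 / 3, 2 / 3, 1],
                         labels=["red", "yellow", "green"], include_lowest=True)
    return out
```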
The model’s results
This method also does a decent job of validating our model’s results. For a model that separates dropouts from graduates well, the mean lift of each color band within each credit range (defined as the average graduation rate in the band divided by the average graduation rate of all students in that credit range) should behave very differently across bands (a small sketch of the calculation follows the list):
A. The red band lift should be low, meaning many fewer students graduating than the mean for this credit range.
B. The yellow band lift should be close to 100%, meaning close to the average number of students graduating for this credit range.
C. The green band lift should be high, meaning many more students graduating than the mean for this credit range.
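Here’s that lift calculation, continuing the banded table from the previous sketch; the 0/1 graduated column is illustrative:

```python
import pandas as pd

def band_lift(banded: pd.DataFrame) -> pd.DataFrame:
    """Lift per color band: the band's graduation rate divided by the graduation rate
    of all students in the same credit range (`graduated` is assumed to be 0/1)."""
    rate_per_band = (banded.groupby(["credit_range", "band"], observed=True)
                           ["graduated"].mean())
    rate_per_range = (banded.groupby("credit_range", observed=True)
                            ["graduated"].mean())
    lift = rate_per_band.div(rate_per_range, level="credit_range")
    return lift.rename("lift").reset_index()
```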
And indeed, our results show that:

→ while the number of students per color band is roughly the same,

→ The mean percentage of graduates in each color band differs greatly. But how does it compare to the grand mean of the credit range, or in other words, what’s the lift?

→ Hurray, the results we expected! 🎉 The red band has a very low lift (2%), the yellow band has a lift close to 100%, and the green band has a very high lift (190%). The college can safely approach the yellow-band students first, who, as we can see, are dangling between graduation and dropout.
Given these results, our model separates dropouts from graduates at a very early stage of their studies.
But how is it doing in higher credit ranges?

The red lift is still fairly low, yellow’s is a little higher than in the 0–10 range but still close to 100%, and green’s is almost 150%.

Still good results; yellow is even closer to the grand mean of the credit range than in the previous range.

These are still decent results, even though our model wasn’t trained on observations with more than 30 credits (the request was to focus on early-stage students). So we’ve learned our model generalizes well not only to the future but also to credit ranges it wasn’t trained on.
And our model still performs decently even in the much higher credit ranges, where the share of graduates is very high and it’s harder to find any dropouts:

Even in the last range, our model still manages to do some useful work, even though students in this credit range have a 91% chance of graduating! They are almost at their graduation ceremony, and our model still has a better chance of finding dropouts than a random guess. 🥳
Eventually, our novel approach succeeded in detecting dropouts early on and helping the college prioritize which students to approach first, both at each credit level and globally across all students. The model’s green-band lift for the very early-stage students is 190%, meaning it’s almost twice as good as a random guess at predicting which students will graduate. 🏆
Conclusion
All the pieces of the puzzle are now complete:
We’ve helped the college to: Prioritize leads by adding future features without data leakage; Meet diversity goals by transforming our cross‑section data to grouped multivariate time series; Optimize course logistics by engineering lag & rolling‑window features for time series data; Prevent dropout by group‑aware time splitting and post‑stratified risk banding.
By solving our puzzle, we’ve achieved some great successes in various aspects of the college’s business:

Consequently, the college’s chief of data has openly expressed his appreciation of our work together:
“We succeeded in shortening processes, improving the accuracy of the models, and optimizing the use of AI for better decision-making.”
Of course, our work isn’t done. These models live in production — constantly validated, retrained, re-engineered, and upgraded. Pieces of the puzzle shift, new ones appear, and fresh challenges emerge.
But that’s another story, yet to be told… 🔮
Feel free to share your feedback and contact me on LinkedIn.
If you like what I do, see my other posts:
– Time series modeling for demand forecasting:
Time Series: How to Beat SageMaker DeepAR with Random Forest
Improve your KPI by 15% with 3X faster, free & interpretable model
medium.com
– Developing a Machine learning Streamlit app:
10 Features Your Streamlit ML App Can’t Do Without — Implemented
Much has been written about Streamlit killer data apps, and it is no surprise to see Streamlit is the fastest growing…
medium.com
– A review of JetBrains’ IDE for data scientists:
The Good, the Bad and the DataSpell
An honest review of JetBrain’s Data Science IDE after a year of using it
medium.com
And some lighter stuff:
Can ChatGPT Think?
An answer from Leibowitz, Yovell, and ChatGPT
pub.towardsai.net
Will AI kill us all? GPT4 answers to Yudkowsky
I asked Bing about Yudkowsky’s argument. When it got too optimistic, I confronted it with some harder claims. It ended…
dormeir.medium.com
Who’s The Best Free Chatbot?
ChatGPT vs. Bing Chat vs. Bard
pub.towardsai.net
MJ or LBJ — who’s the GOAT? An answer by Bing Chat
Thank you for reading, and good luck! 🍀