
Cracking The AI Challenge: Evidence From The Academy

Author(s): Dor Meir

Originally published on Towards AI.

MODELING MADE SIMPLE

Cracking The AI Challenge: Evidence From The Academy
Image by the author.

When I joined my current company almost five years ago, my boss told me that an AI project is like a puzzle:

“Each project presents a challenge, a puzzle, and the job is to rigorously assemble the pieces of the puzzle — until you crack it and find the gold-producing model.”

About five years later, I know he was right. In this post, I’ll present four puzzles I solved for a college: enrolling more students, meeting diversity goals, optimizing course logistics, and preventing student dropout.

These will be our business challenges.

Our four business challenges, presented in this post. Image by the author.

Underneath these business challenges lie the professional challenges. These are the few crucial pieces that, once solved, do most of the work: the corners of the puzzle, the frame. They are also the most interesting ones! A puzzle might have many pieces, but here you’ll only read about the most critical ones, the vital 20% that, per the 80/20 rule, delivers 80% of the result.

From business challenges to model results: our thinking map. Image by the author.

Lastly, while solving puzzles is nice, in reality, we need actual results that’ll answer our business needs. These will be the model’s results.

So, without further ado… Enjoy!

Table of contents

  1. Leads prioritization: Leakage-free future features
  2. Meeting diversity goals: Cross-section → grouped multivariate time series
  3. Optimizing course logistics: Lag & rolling-window features for panel TS
  4. Dropout prevention: Group-aware time split + post-stratified risk banding
  5. Conclusion

1. Leads prioritization

Let’s start at the beginning, and in the case of a college, the beginning is enrollment in studies. TL;DR: Adding future features safely improved the conversion rate of leads by 54%.

The first challenge — enrollment in studies. Image by the author.

The college’s admissions center actively contacts leads who have expressed interest in a degree and enrolls them in studies. The problem is, there’s an overwhelming number of leads each year (hundreds of thousands), and many are of low quality. How low? Before any model was used, the conversion rate was only 27%, meaning only 27% of those hundreds of thousands of leads had registered for studies.

An illustration of how the college collects leads that have expressed interest in enrolling in studies. Image by the author.

When I joined this project, there was already a model in place, which had raised the conversion rate to 37%. Here’s an illustration of how the data looked:

Each lead, represented by the lead_id feature, was introduced to the system in a specific semester, represented by the semester feature, and our target is the enrolled field: did the lead eventually register for studies (1) or not (0)? The rest of the fields are different features we have for each lead, some of which were very predictive of the lead’s tendency to register for studies. Airtable (data by the author).

There was general satisfaction with the existing model. And while the admission center callers occasionally sent feedback about features that might improve the model, incorporating most of those suggested features didn’t move the needle by much.

An illustration of some branches of one decision tree of the original model, predicting which leads will register for studies using three features: the lead’s source website, the study track the lead expressed interest in, and the declared goal of studies. Image by the author.

The business challenge

At one point, though, after another periodic presentation validating the model’s results in production, I got interesting feedback from the admission center:

“That’s all good, but you must know there’s a case where if we get a lead with certain properties, we completely neglect the model predictions, right?”

It turned out that whenever the call center received a lead who was advised to take a meeting with an academic consultant or had made a few of those meetings, the callers gave the highest priority to this lead, while disregarding what the model predicted for that lead.

This instantly rang all the bells for me. The model was obviously at risk of being neglected altogether, as its users already had a habit of ignoring it, and those supposedly rare cases might become more prevalent as the data distribution shifts. I knew right away this crack had to be mended before the flood came in.

The initial model in place. Beginning in the top left, we have a lead on a student who is leaving trails in the system, including his source website, his stated interests in studies, the program he is showing interest in, his stated high school grades, and others. At this point, the data is stored in the data warehouse (arrow down) and fed into the AI Model (the brain), which makes a decision (and provides probabilities) on whether the student might enroll. Using this AI response, the Admission Center callers decide whether to call the lead and try to register him for studies. The problem occurs in the alternative path, where the lead decides to take a consultation meeting (the “take consultation?” rhombus) and the admission center staff prioritize this data over the AI Model’s response. Lucidchart (image by the author).

With this important insight, I added the two new consultation features to the model, retrained it, and… miraculously had an astonishing improvement in the conversion rate! Hurray! 🚀

But wait — “too good to be true” results are not a good sign in an AI model, and should be treated with the utmost suspicion… Did I make a fundamental mistake here? Is there a leakage of some sort?

The initial data + the two new features: 1. advice_consulation: whether the lead was already advised to take a professional consultation. 2. n_consultations: the number of consultation meetings the lead has already had. Airtable (data by the author).

The professional challenge: Add Leakage‑free future features

Eventually, I got it: yes, those two new features were very indicative of registration, more than any other feature we had. The only problem was that at no point before a lead enrolls can we know whether she will or will not be advised to take a consultation. The majority of leads will be missing that data at the time of prediction, and the model will perform poorly for them. In other words, these were future features, with the potential to cause data leakage.

The new architecture of the system. The alternative decision path, where a lead added consultation data and the admission center used this data without an AI Model, is no longer available. Instead, we have another model, AI Model 2, also considering this new indication of consultation. When a lead is in the first stage, before any consultation indication is acquired, he’s diverted to AI Model 1. And the moment the lead acquires new consultation data, he’s instantly transferred to AI Model 2, which incorporates the consultation features. In the eyes of the Admission Center, the process is transparent: they receive a response from the ensemble of models, whether the lead has consultation data or not. Lucidchart (image by the author).

How exactly did I crack that challenge?

First, I didn’t break the original model. Instead, I trained a separate model only for those “older” leads that already had a consultation record. The new pipeline now begins with an initial filter: does the lead contain any consultation data? If not, the lead is sent along the original model path. If it does, it is sent to the second model, which includes the consultation features.
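To make the routing concrete, here is a minimal sketch of the filter-and-route step, with hypothetical model objects and column names (advice_consultation, n_consultations); the production pipeline may differ:

```python
# A minimal sketch of the filter-and-route step, with hypothetical column names
# (advice_consultation, n_consultations) and already-trained model objects.
import pandas as pd

def predict_enrollment(leads: pd.DataFrame, model_base, model_consult) -> pd.Series:
    """Return one enrollment probability per lead, routed by whether consultation data exists."""
    has_consult = (leads["advice_consultation"].notna()
                   | (leads["n_consultations"].fillna(0) > 0))

    probs = pd.Series(index=leads.index, dtype=float)
    # Leads without any consultation signal keep the original model (and its features).
    base_cols = [c for c in leads.columns
                 if c not in ("advice_consultation", "n_consultations")]
    probs[~has_consult] = model_base.predict_proba(leads.loc[~has_consult, base_cols])[:, 1]
    # Leads that already have consultation data go to the second model, trained only
    # on such "older" leads, so the consultation features are never missing there.
    probs[has_consult] = model_consult.predict_proba(leads.loc[has_consult])[:, 1]
    return probs
```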

The model’s results

In the end, the callers of the admission center see only one probability of enrollment, which is the output of this ensemble of models, and it is much more accurate than the initial model we had. But by how much? Even two years later, we see a 57% conversion rate (compared to the old 37% rate). Meaning, for every lead our model predicts will enroll, there’s a 57% chance it’ll enroll! This allowed the call center to enroll more students while also cutting labor-hour costs.

More importantly, we addressed the concerns raised by the model’s users. They no longer had a reason to neglect the current model results: the model incorporates the new future features without leakage, providing better predictions than the prior ones, while not breaking the existing model, with which they were generally happy.

2. Meeting diversity goals

Now that we have increased the total number of people enrolling in studies, we will crack the challenge of helping our college meet its diversity goals. TL;DR: Our data transformations led to a daily accuracy of 92% in predicting the number of students who will register from each population segment.

Our second challenge: meeting the college’s goal of diversity of people registering for studies. Image by the author.

The college’s CEO sets enrollment goals for various minority groups. More precisely, she’s interested in meeting the diversity goals of those minorities in four different categories:

  1. Population type: general population, Indian-speaking, Arabic-speaking.
  2. Student type: regular, international, high school, workplace-related studies, soldiers, prisoners, etc.
  3. Degree track: 1st degree, 2nd degree, 3rd degree, teaching certificate, non-academic diploma studies.
  4. Student tenure: freshmen, seniors.
An illustration of the original dataset, as the data comes in from the registration system. Each student (student_id) either enrolls in or cancels enrollment for a specific semester (semester_id), as recorded in registration_activity, on a specific day before the start of the semester (registration_day), and has one value for each of our categories: population_type, student_type, degree_type, student_tenure, degree_track. Airtable (data by the author).

For example, in one semester, the CEO might be interested in predicting the number of Arabic-speaking students, regular type, 2nd degree, and freshmen. At the same time, she might also need a prediction for general population students who are still in high school, 1st degree, and freshmen. Without these predictions, the CEO will have trouble setting goals for the number of students who’ll register for each of these specific groups.

The business challenge

There are 180 different possible combinations of those groups in various categories, and the challenge is that we don’t know, before each semester, which specific combinations of students the CEO will require prediction for. We also had to deliver predictions daily, taking into account the number of registrations per group from the previous day.

This is how one time-series, of one population segment in one semester, looks:

An illustration of a population segment registration pattern in a semester. The X axis is the day in the registration period, and the Y axis is the number of registered students from this specific population segment. The green line represents the accumulated number of students registered so far, the yellow line represents the final number of registered students on the final day of registration (start of the semester, t=0), and the purple jittered line is the prediction we produced for the final number of registered students at the final day of registration. The distance between the purple and yellow lines on each day represents the error made in the prediction on every single day. Plot by the author.

Hence, we need to make predictions for specific groups in our data, not individuals, and we need to make those predictions daily, not just for a single point in time (as with the leads model). However, our data is at the individual level and at separate points in time, currently telling us which specific student enrolled or canceled enrollment on a particular day. What should we do, then?

The professional challenge: Transform individual cross-sectional microdata into aggregated multivariate time series

The main solution here was to reconstruct the data altogether, from individual cross-sectional microdata into aggregated multivariate time series (in other words, from individual-level records into grouped panel data):

An illustration of the dataset after transforming the individual cross-sectional microdata into aggregated multivariate time series. The 180 possible combinations of student categories were concatenated into a single column, called population_segment. Each population_segment series was inflated to all possible semester_id + registration_day combinations (only a sample of them is presented in the table). sum_enrolled_per_day: registration_activity from our input table, grouped and summed over each population_segment and each semester_id + registration_day. cumulative_sum: the accumulation of sum_enrolled_per_day from the beginning of the registration period (registration_day = 250) up to the current row’s registration_day. final_enrolled_semester: our target, the value of cumulative_sum at registration_day = 0. Airtable (data by the author).

What do we have here? A lot has changed (see the technical details in the table description):

population_segment is now the main identifier of each time series in a cohort.
semester_id and registration_day are the unique identifiers of each observation, as every population segment appears exactly once per semester per registration day. The model can now learn what happened for each population segment on each registration day (remember, we need to make daily predictions!).
sum_enrolled_per_day: a daily summary of the number of students enrolled from each population segment on each registration day.
cumulative_sum: how far registration has progressed so far, as the end of the registration period approaches.
final_enrolled_semester: for each population segment and each registration day, the final number of students registered in that semester. Of course, this is hidden in the test data and only used to validate the final selected model.
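For illustration, here is a minimal pandas sketch of this reshaping. The raw table name (raw), the numeric encoding of registration_activity (+1 for enrollment, -1 for cancellation), and the 250-day registration period are assumptions made for the example:

```python
# A minimal pandas sketch of the reshaping, assuming a raw registration table `raw`
# with the columns shown above and registration_activity encoded as +1 (enroll) / -1 (cancel).
import pandas as pd

segment_cols = ["population_type", "student_type", "degree_track", "student_tenure"]

df = raw.copy()
df["population_segment"] = df[segment_cols].astype(str).agg("_".join, axis=1)

# Net daily registrations per segment, semester, and registration day.
daily = (df.groupby(["population_segment", "semester_id", "registration_day"])
           ["registration_activity"].sum()
           .rename("sum_enrolled_per_day")
           .reset_index())

# Inflate each segment to every semester_id + registration_day combination (0 where no activity).
full_index = pd.MultiIndex.from_product(
    [daily["population_segment"].unique(),
     daily["semester_id"].unique(),
     range(int(daily["registration_day"].max()), -1, -1)],   # e.g., day 250 down to day 0
    names=["population_segment", "semester_id", "registration_day"])
daily = (daily.set_index(["population_segment", "semester_id", "registration_day"])
              .reindex(full_index, fill_value=0)
              .reset_index())

# Registration days count down toward the semester start (day 0), so accumulate in that order.
daily = daily.sort_values(["population_segment", "semester_id", "registration_day"],
                          ascending=[True, True, False])
daily["cumulative_sum"] = (daily.groupby(["population_segment", "semester_id"])
                                ["sum_enrolled_per_day"].cumsum())

# Target: the cumulative total on the last registration day (day 0).
finals = (daily.loc[daily["registration_day"] == 0]
               .set_index(["population_segment", "semester_id"])["cumulative_sum"]
               .rename("final_enrolled_semester"))
daily = daily.join(finals, on=["population_segment", "semester_id"])
```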

Now our model fully addresses the CEO’s needs: a daily prediction of how many students will register from every type of group.

Interestingly, this transformation can also handle much larger population segments. A prediction for an aggregated group is simply the sum of the model’s predictions for all the smaller groups comprising it. For instance, if on registration day 73 the CEO needs a prediction for the total number of general-population prisoners who’ll enroll across all tenures and the 1st and 2nd degree types, we can sum the predictions for all the segments that are mutually exclusive and collectively exhaustive of the bigger category in question:

Y^ is the prediction produced by the model, 73 is the day in the registration period where the model makes the prediction, and the subscript, delimited by commas, is the population segment comprised of the four population categories. The current model granularity level of 180 combinations of population segments allows for easily predicting a greater segment by a sum of the predictions for all the population segments comprising the bigger segment. latex2image (image by the author).
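A possible reconstruction of that identity in LaTeX, with shortened segment labels (the exact notation in the original image may differ):

```latex
\hat{Y}_{73,\ \text{general},\ \text{prisoners},\ \text{1st+2nd deg.},\ \text{all tenures}}
  \;=\; \sum_{d \,\in\, \{\text{1st, 2nd}\}} \;
        \sum_{t \,\in\, \{\text{freshman, senior}\}}
        \hat{Y}_{73,\ \text{general},\ \text{prisoners},\ d,\ t}
```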

Naturally, we can also predict the entire population of students who’ll enroll by summing up the predictions for all groups.

The model’s results

Along with other engineered features, we reached a one-year forecast accuracy of 92% across all days in the registration period, with accuracy based on the normalized absolute difference: MAE divided by the absolute mean of the target. More importantly, though, the CEO now has an automatic daily prediction for all those different population types, with maximum flexibility over the various possible intersections of population categories. It’s much easier to set goals for the participation of minority groups, and to adhere to them.
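As a reference point, here is a small sketch of how such an accuracy figure could be computed, assuming it is reported as one minus the normalized MAE; the post states only the normalization, so treat the exact formula as an assumption:

```python
# A small sketch of the accuracy figure, assuming it is reported as one minus the
# normalized MAE (MAE divided by the absolute mean of the target).
import numpy as np

def forecast_accuracy(y_true, y_pred) -> float:
    """Accuracy = 1 - MAE / |mean(y_true)|, computed over all days in the registration period."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y_true - y_pred))
    return float(1.0 - mae / abs(y_true.mean()))
```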

3. Optimizing course logistics

We’ve solved both the general and the specific population admission challenges. Hurray! 🏅 It’s now time to crack the challenge of predicting the enrollment of existing students in the variety of courses the college offers. TL;DR: Our feature engineering in this challenge resulted in an 88% accuracy in predicting the number of students for all 700 different courses.

Our third challenge — predicting the enrollment of existing students in the variety of courses the college offers. Image by the author.

The business challenge

The college’s Dean distributes textbooks and opens study groups based on course enrollments. However, the former rule-of-thumb predictions of the number of enrollees per course were inaccurate and led to serious logistical issues. It reached the point where the college had to cancel students’ participation in courses, resulting in both financial losses and student dissatisfaction.

An illustration of the original dataset of the course registrations. Each course (course_id) in each semester (semester) has a specific number of students enrolled (n_registered) — our target feature. Some of our features describe the metadata of the course: the department, an indicator for the course having a final exam or not, the difficulty level, and the course's academic credits. We also have some features indicating the previous semester's performance of students in the course: percentage of students completing the final assignment, percentage of students taking the exam, percentage of students passing the exam, and the percentage of enrollees who were freshmen. Airtable (data by the author).

As you can see, the college already has some interesting features to consider for each course that’ll help predict how many students will eventually enroll. But can we make more useful features out of those?

The professional challenge: Engineer lag and rolling features for multivariate time series

Each row contains categorical data and data about the course’s enrollees from the previous semester. More importantly, we also have historical data reaching back to when the course was established. For some courses, this means years of history.

As with the student-diversity model, we again have aggregated multivariate time series, and we can treat each feature as a time series. Thus, we can calculate moving averages, sums, differences, percentage changes, variances, and other statistics on the lag operators. More simply put, we can capture both historical and ongoing patterns by calculating various statistics on both our numerical and categorical features (turned numeric using target encoding):

An illustration of the dataset after feature engineering. The number of features grew almost threefold. The Department feature was replaced with a set of Department features encoded by the number of students registered in all courses of the department (Department_n_registered). It was then expanded with this number in the previous semester (Department_n_registered_1), the change in registrations from the prior semester (Department_n_registered_pct_change_1), the difference in registrations from two semesters before (Department_n_registered_diff_2) and the corresponding percentage change (Department_n_registered_pct_change_2), as well as the rolling mean of registrations to the department over those past semesters (Department_n_registered_rolling_mean_2). Similar statistics were also created for the three previous semesters, for our target feature (n_registered) and the performance statistics (pct_took_exam, pct_completed_assignment, etc.). NAs created in the first rows of each course by the rolling features were either dropped or filled. Airtable (data by the author).

A summary of what I did here (see the technicalities in the table description):

A. Aggregated categorical statistics — For every categorical field (e.g., population group, student category, department), we replaced the original label with numerical aggregates, and rolling and lagged variants, that expose the historical demand for the course. Identical sets were created for the three previous semesters and for every other high-cardinality category we track.

B. Aggregated numerical statistics — we added daily, cumulative, and final-semester aggregates and their lagged / rolling counterparts for each numerical feature.

These transformations roughly tripled the feature count, turning sparse categorical labels into dense, information-rich numeric predictors and allowing tree-based models to learn both short-term shocks and long-term trends. 💪 Since we had plenty of data, we were less worried that this abundance of features would make our model overfit to noise in the training set.
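For illustration, here is a hedged pandas sketch of this kind of per-course lag and rolling feature engineering; the function, column names, and exact set of statistics are assumptions, not the production code:

```python
# A hedged sketch of per-course lag / rolling feature engineering; names are illustrative.
import pandas as pd

def add_lag_rolling_features(df: pd.DataFrame, cols, group_key: str = "course_id",
                             lags=(1, 2, 3)) -> pd.DataFrame:
    """Per-course lagged, differenced, pct-change, and rolling-mean versions of `cols`."""
    df = df.sort_values([group_key, "semester"]).copy()
    g = df.groupby(group_key)
    for col in cols:
        for k in lags:
            df[f"{col}_{k}"] = g[col].shift(k)                  # value k semesters back
            df[f"{col}_diff_{k}"] = df[col] - df[f"{col}_{k}"]  # absolute change vs. k back
            df[f"{col}_pct_change_{k}"] = g[col].pct_change(k)  # relative change vs. k back
        past = g[col].shift(1)                                  # past values only, no leakage
        df[f"{col}_rolling_mean_2"] = (past.groupby(df[group_key])
                                           .rolling(2).mean()
                                           .reset_index(level=0, drop=True))
    return df

# e.g., the target and performance statistics (plus target-encoded categories):
# courses_fe = add_lag_rolling_features(
#     courses, ["n_registered", "pct_took_exam", "pct_completed_assignment"])
```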

The model’s results

After engineering these new features, the model achieved an annual 88% accuracy (again based on the normalized MAE) across all 700 courses. It’s worth mentioning that these courses range widely in size, from 10 to 3,700 students. These results were a major help to the college in optimizing course logistics and making sure all students can participate in their selected courses.

The average error per target bin. The X-axis marks the course size, with 20 relatively similar-sized bins of courses. The Y-axis is the number of students registered/predicted to register. The distance between the blue and red disks for each bin is the average error per bin. Plot by the author.

Besides these amazing results, the stakeholders were rather surprised by the model’s feature importance. Intuitively, they understood that the previous semester’s registration data might be important for predicting registrations in the current semester. However, they couldn’t imagine how important the history of that figure was to the prediction, even five semesters back! 😲

And with that, we cracked the course logistics optimization challenge and left the stakeholders excited for our next, final challenge.

4. Dropout prevention

We’ve now cracked the challenges for all new students (general admission and diversified admission) and existing students (enrollment in courses). Let’s crack the final challenge of graduating students, i.e., lowering the dropout rate and ensuring everyone who can will graduate successfully from the college. TL;DR: Our careful data splitting and risk banding eventually doubled the accuracy of predicting which students will graduate, at a very early stage of their studies.

Our final challenge: lowering the dropout rate, and ensuring everyone who can will graduate successfully from the college. Image by the author.

The business challenge

The college’s dropout rates are particularly high. There’s a persistence unit that supports students early on, actively approaching them and assisting in different ways, ultimately preventing dropout. The issue is that referrals to students are made using rules of thumb (a little déjà vu?), and filtering the relevant students to approach, out of the tens of thousands of active students, is a major challenge. Additionally, the persistence unit aims to engage with students at a very early stage of their studies, before most of them drop out. The problem is that, at this point, there’s very little data about the students, making it difficult to differentiate them based on their graduation chances.

An average of less than 30% of students with 0–6 credits will eventually graduate, while almost 60% of students who reach 24–31 credits will graduate. Plot by the author.

This is how the raw data of graduating students looks:

The raw data of graduating students. Each student (student_id) has an observation for each semester she took courses in, while semester_max is the final semester of that student, where she either graduated or dropped out, as stated in our target feature, final_status. We also have some personal information about the students, like age, gender, and mother tongue, which the model won’t use, to prevent bias. Other features describe the student’s performance up until that semester: her success_index (a number aggregating the student’s grades, attendance, and other success metrics), current and first English levels (there’s a basic level needed for graduation), the number of courses the student took in this semester, the student's program, and her declared intent in studies. Airtable (data by the author).

And after engineering lag and rolling features, with NAs not yet handled:

The graduating students dataset, after some time series feature engineering for the performance-based features (NAs have not yet been handled). Personal information features were excluded from the model, along with the last semester of the student (semester_max). Airtable (data by the author).

The professional challenge 1: Group‑aware Time Series Split

One crucial issue in training such a model is the nuanced split of the train and test sets. We have two things to consider to avoid data leakage:

  1. Students' exclusivity to one dataset: Each student is, in fact, a time series of multiple observations, each observed in a different semester. If we split any of these time series, leaving part of the same student’s semesters in the train set and part in the test set, we’ll have an obvious target leakage. The model will learn about Bob, who studied from 2023 to 2024, and we know he has graduated at some point in the future, as stated in final_status. Then, the model will be asked to predict Bob’s graduation status in his last semester, which is in the test set, 2025, when it was already informed of Bob’s graduation in the training set. Thus, the train and test sets should isolate all records of each student completely.
  2. “Last year of studies” exclusivity to one dataset: It is not enough to keep all the data on each student exclusive to one set, as we still need to put the right students in each set. If we add Carol, who graduated in 2025 (our last year of data), to the train set, our test results on Dave, who also dropped out in 2025, will be over-optimistic, as they relied on some knowledge of students graduating in 2025. We will essentially train the model on data that will not be available in production. For our model to generalize well to the future, it can only train on students who finished their studies up until 2024 and test on students who finished their studies in 2025.
The train-test split. Students are exclusive to one split, with all their observations belonging to only one set. The final semester of a student is also exclusive to the one split. Even though some historic semesters are shared between the sets (e.g., 2023 and 2024), all the test set students have the same final semester, which is our last year of data, 2025, whilst all students of the train set have a final semester of 2024 or earlier. Lucidchart (image by the author).

These two important considerations leave us with, first, a train set whose student IDs are completely disjoint from those in the test set. Second, even though both sets might contain observations from the same old semesters, only the test set contains the latest semester in the data, so the maximum last semester in the train set will always be older than the minimum last semester in the test set. By definition, this means the maximum of all semesters in the train set is older than the minimum last semester in the test set.

Our train-test split conditions: (1) Disjoint student IDs. (2) Some semesters of the train and test sets might be shared. (3) Disjoint last semesters of all students, where every train last semester is older (i.e., mathematically smaller) than every test last semester.

The easiest way to achieve these conditions is to split the students based on their final year, i.e., the year in semester_max (e.g., 2025 for the semesters 2025A, 2025B, and 2025C). Students who graduated or dropped out before 2025 go to the train set, and students who finished in 2025 go to the test set. Of course, each student takes all of their records with them to the relevant set.
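Here is a minimal sketch of such a group-aware split, assuming semester_max values like 2025A whose first four characters encode the year; the helper and its names are hypothetical:

```python
# A minimal sketch of the group-aware time split described above, assuming
# semester_max values like "2025A" whose first four characters encode the year.
import pandas as pd

def split_by_final_year(df: pd.DataFrame, test_year: int = 2025):
    """All records of a student go to exactly one set, chosen by the student's final year."""
    final_year = df["semester_max"].astype(str).str[:4].astype(int)
    test_students = df.loc[final_year == test_year, "student_id"].unique()

    is_test = df["student_id"].isin(test_students)
    train, test = df[~is_test], df[is_test]

    # Sanity checks: disjoint students, and every train student finished before the test year.
    assert set(train["student_id"]).isdisjoint(set(test["student_id"]))
    assert train["semester_max"].astype(str).str[:4].astype(int).max() < test_year
    return train, test
```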

The professional challenge 2: Post‑stratified risk banding

Fast forward to the modeling phase of CRISP-ML: we’ve split the data correctly, trained various ML algorithms, selected the best model, and optimized our decision threshold. We now have the final model, producing both probabilities and a final decision on whether a student will graduate, for each student at each point in time:

Metrics of the final model on unseen data, for all levels of students' credits. The distribution of graduating students is shown in green and, unsurprisingly, increases as the number of credits increases. The model’s precision metric is in blue, and it’s mostly above the green distribution of graduates; otherwise, the model won’t be better than a random guess. If we calculate the lift by dividing precision by the distribution, we’ll see it is highest for the lower credits. That’s exactly what we needed, as our main focus is students at the beginning of their degrees. Recall is orange, it is high on lower credits, and remains so for all credits. Plot by the author.

Now what? If we take these results as they are, with a 0 to 1 probability of graduating for students of all levels, the model adds very little useful information to the persistence unit. Why is that? If you recall from our first plot in this model, the number of credits acquired so far by the student is a very indicative feature of graduation, perhaps the most indicative one:

An average of less than 30% of students with 0–6 credits will eventually graduate, while almost 60% of students who reach 24–31 credits will graduate. Plot by the author.

So our model doesn’t add much if it simply predicts a higher probability of graduation for students with more credits. What we can do instead is cluster together students with the same level of credits, say in ranges of 10 credits, and train a model per range. That way, we neutralize the strong effect of credits and provide more useful predictions to our clients, based on the model’s other features. You can see an example of multiple models on different segments of the data here:

Time Series: How to Beat SageMaker DeepAR with Random Forest

Improve your KPI by 15% with 3X faster, free & interpretable model

medium.com

The problem with the multiple-models method is that, given a regular degree is usually completed after 120 credits, it means maintaining 12 different models (e.g., 0–10 credits, 11–20 credits, 21–30 credits, …, 111–120 credits), and also having confusing results for different observations of the same student. For instance, if Alice, with 6 credits so far, has excellent grades compared to her peers in the same credit range, she’ll be predicted to have an 80% chance of graduating at that point. If, however, her success index declines at 18 credits compared to her peers, she might now have only a 10% chance of graduating, even though we know for sure she has better chances of graduating after 18 credits than after only 6. Our model would become too local and hard to interpret without weighing the credit range in the calculation.

However, there is a more straightforward approach, easier to maintain and less prone to confusing results. We can take the initial model, stratify the probabilities by credit range, and cluster them into three color bands representing the probability (or risk) of graduating (or dropping out) compared to students of the same credit level. This way we achieve two things: Alice’s probability rises as she acquires more credits, and we can still compare Alice to her peers in any credit range.

Our post-processing solution to the dominance of the credits feature in the prediction: stratifying the probabilities over the credits ranges, and for each credit range, clustering the probabilities into three color bands, representing the probability (or risk) of graduating compared to students of the same credit level. This post-processing method has made the college’s stakeholders nickname the model the “traffic light model”. Image by the author.

How do we do it, then?

  1. Gather together observations of students with the same credits range (e.g., 0–10, 11–20).
  2. For each credit range, find the tertiles (thresholds of three equal parts when ordered) of the probabilities to graduate.
  3. Mark each probability prediction in each credit range, in a “color”:

A. Red: The lower tertile, students who are most likely to drop out, compared to other students in this credit range. Those are students the college should probably not approach first, as they have very little chance to graduate either way.

B. Yellow: The medium tertile, students have a medium probability of dropping out / graduating, compared to other students in this credit range. Those are the students the college is most interested in approaching and assisting, as they are dangling between graduation and dropping out.

C. Green: The highest tertile, students are most likely to graduate, compared to other students in this credit range. As with the red band, those students should probably not be approached first, as they already have high chances of graduating without any help.
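Here is a minimal sketch of this banding step, assuming prediction output with credits and p_graduate columns and coarse 10-credit bins; all names are illustrative rather than the production pipeline:

```python
# A minimal sketch of the post-stratified "traffic light" banding described above.
# Column names (credits, p_graduate) and the 10-credit binning are illustrative.
import pandas as pd

def assign_risk_bands(scores: pd.DataFrame, bin_size: int = 10) -> pd.DataFrame:
    """Add a credit_range bin and a red/yellow/green band to each scored observation."""
    out = scores.copy()
    # Coarse credit bins (0, 10, 20, ...), standing in for the 0-10 / 11-20 ranges.
    out["credit_range"] = (out["credits"] // bin_size) * bin_size

    # Rank each predicted graduation probability within its credit range, then cut
    # the ranks into tertiles: lowest third -> red, middle -> yellow, top -> green.
    within_rank = out.groupby("credit_range")["p_graduate"].rank(pct=True, method="first")
    out["band"] = pd.cut(within_rank, bins=[0, 1 / 3, 2 / 3, 1.0],
                         labels=["red", "yellow", "green"], include_lowest=True)
    return out
```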

The model’s results

This method also does a decent job of validating our model’s results. For a model that separates dropouts from graduates well, the mean lift of each color band in each credit range (defined as the graduation rate within the band divided by the graduation rate of all students in the credit range) should behave very differently:

A. The red band lift should be low, meaning many fewer students graduating than the mean for this credit range.

B. The yellow band lift should be close to 100%, meaning close to the average number of students graduating for this credit range.

C. The green band lift should be high, meaning many more students graduating than the mean for this credit range.
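Written out as a formula (reconstructed from the definition above, with p-hat denoting a graduation rate):

```latex
\text{lift}(\text{band},\ \text{credit range})
  = \frac{\hat{p}\left(\text{graduate} \mid \text{band},\ \text{credit range}\right)}
         {\hat{p}\left(\text{graduate} \mid \text{credit range}\right)}
```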

And indeed, our results show that:

The number of unique students per color band of the 0–10 credit range. The number of unique students is only roughly equal between the colors, since the tertiles are computed over observations of students in semesters, and the number of observations per student within a credit range is not uniform. That’s because a student might complete 10 credits in one semester or in three, so the number of appearances a student makes in one credit bin can differ. In any case, we have thousands of observations in each tertile, so we can expect our results to be statistically significant. Plot by the author.

→ while the number of students per color band is roughly the same,

Percentage of graduates per color in the 0–10 credits range. Plot by the author.

→ The mean percentage of graduates in each color differs greatly. But how does it compare to the grand mean of the credit range; in other words, what’s the lift?

The lift of graduates per band of color in the 0–10 credits range. Plot by the author.

→ Hurray, the results we expected! 🎉 The red band has a very low lift (2%), the yellow band has close to 100% lift, and the green band has a very high lift (190%). The college can safely approach first the yellow band students, who, as we can see, are dangling between graduation and dropout.

Given these results, our model separates dropouts from graduates at a very early stage of their studies.

But how is it doing in higher credit ranges?

The lift of graduates per band of color in the 11–20 credits range. Plot by the author.

The red lift is still fairly low, yellow’s a little higher than in the 0–10 range but still close to 100%, and green’s almost 150%.

The lift of graduates per band of color in the 21–30 credits range. Plot by the author.

Still good results; yellow is even closer to the grand mean of the credit range than in the previous range.

The lift of graduates per band of color in the 31–40 credits range. Plot by the author.

Those are still decent results, even though our model wasn’t trained on observations above 30 credits (since the request was to focus on early-stage students). So we’ve learned that our model not only generalizes well to the future, but also to credit ranges it was not trained on.

And our model still has decent performance even for the much higher credit ranges, where the average of graduates is very high, and it’s harder to find any dropouts:

The lifts of graduates per color band in the 51–60, 91–100, and 111–120 credits ranges. Plot by the author.

Even in the last range, our model still manages to do some useful work, even though students in this credit range have a 91% chance of graduating! They are almost at their graduation ceremony, and our model still finds dropouts better than a random guess. 🥳

Eventually, our novel approach succeeded in detecting dropouts early on and helped the college prioritize which students to approach first, both at each credit level and globally across all students. The model’s lift for the green band among the earliest-stage students is 190%, meaning it’s almost twice as good as a random guess at predicting which students will graduate. 🏆

Conclusion

All the pieces of the puzzle are now complete:

Our puzzle is now complete! ✔️Image by the author.

We’ve helped the college to: Prioritize leads by adding future features without data leakage; Meet diversity goals by transforming our cross‑section data to grouped multivariate time series; Optimize course logistics by engineering lag & rolling‑window features for time series data; Prevent dropout by group‑aware time splitting and post‑stratified risk banding.

By solving our puzzle, we’ve achieved some great successes in various aspects of the college’s business:

Our successes in the various aspects of the college’s business. Image by the author.

Consequently, the college’s chief of data has openly expressed his appreciation of our work together:

“We succeeded in shortening processes, improving the accuracy of the models, and optimizing the use of AI for better decision-making.”

Of course, our work isn’t done. These models live in production — constantly validated, retrained, re-engineered, and upgraded. Pieces of the puzzle shift, new ones appear, and fresh challenges emerge.

But that’s another story, yet to be told… 🔮

Feel free to share your feedback and contact me on LinkedIn.

If you like what I do, see my other posts:

– Time series modeling for demand forecasting:

Time Series: How to Beat SageMaker DeepAR with Random Forest

Improve your KPI by 15% with 3X faster, free & interpretable model

medium.com

– Developing a Machine learning Streamlit app:

10 Features Your Streamlit ML App Can’t Do Without — Implemented

Much has been written about Streamlit killer data apps, and it is no surprise to see Streamlit is the fastest growing…

medium.com

– A review of the JetBrains IDE for data scientists:

The Good, the Bad and the DataSpell

An honest review of JetBrain’s Data Science IDE after a year of using it

medium.com

And some lighter stuff:

Can ChatGPT Think?

An answer from Leibowitz, Yovell, and ChatGPT

pub.towardsai.net

Will AI kill us all? GPT4 answers to Yudkowsky

I asked Bing about Yudkowsky’s argument. When it got too optimistic, I confronted it with some harder claims. It ended…

dormeir.medium.com

Who’s The Best Free Chatbot?

ChatGPT vs. Bing Chat vs. Bard

pub.towardsai.net

MJ or LBJ — who’s the GOAT? An answer by Bing Chat

Thank you for reading, and good luck! 🍀


Published via Towards AI

