
Claude, Excel, and a 1991 Masterpiece
Last Updated on April 29, 2025 by Editorial Team
Author(s): Han Qi
Originally published on Towards AI.

The gist
- Transform a scanned PDF into a user-friendly Excel sheet.
- Validate extracted data by overlaying it onto the PDF.
- Debug issues using conditional breakpoints.
- Gain insights into your learning habits and those of others.
What’s the MSLQ?
The Motivated Strategies for Learning Questionnaire (MSLQ) is a 1991 gem — a five-year effort distilling 81 questions into 15 scales measuring motivation and cognitive habits.
As an edtech enthusiast, I wanted to use it to better understand myself and support my colleagues. But all I found was a grainy scanned PDF from ERIC (Education Resources Information Center: https://eric.ed.gov/?id=ED338122), stamped “BEST COPY AVAILABLE.”
Follow Along
This process takes about 15 minutes.
Pick some memorable courses, then start answering the 81 questions on page 36 of the PDF.
Avoid reading earlier pages to keep your responses unbiased.
I began jotting down answers like “14331 34456” on paper, but quickly realized this wasn’t scalable.
Ditching old habits, I turned to Claude, the best free AI model I know, to test its OCR prowess and streamline the process.
Extraction prompt
There are 81 questions in the MSLQ. Create an excel sheet where a user can fill in his responses for the 81 questions by typing numbers, and the 9 scales will be automatically calculated. The 9 scales each link to a set of non-overlapping, non-consecutive questions. I don’t have many respondents, so want 1 row per question, and columns represent respondents. I also want a readable description of the question for each row as the leftmost column. The results of the 9 scales for each respondent should appear in a new worksheet (tab) in excel.
Extraction code
Here’s the final Python code after tweaks: MSLQ_excel_template to create excel after OCR on MSLQ.pdf.
Generating 600 lines of near-perfect code in one go is very impressive!
Though the MSLQ has 15 scales, I focused on the 9 with sample feedback in the PDF. Recovering the other 6 is as simple as editing selected_scales.
The 15 scales group 81 questions in a mutually exclusive manner.



Give Claude a round of applause
- Extracted all 81 questions in order, with correct numbering (1-based, not Python’s 0-based), saving the time it would take to find each question and hardcode the strings.
- Found the 8 reverse-coded questions. Reverse coding means lower answers yield higher scores: a reverse-coded answer is subtracted from 8, mapping 1–7 onto 7–1 (see the sketch after the code block below).
- Grouped the questions into 9 scales, saving hours of manual scrolling through the 75-page PDF. Scrolling is unavoidable because, by design, questionnaires shuffle their questions so that items from the same scale do not appear consecutively.
- Good UX: instructions at the top left make the Excel self-explanatory, and headers are colored.
- Good UX: included a CLI for the output path and respondent count, which I later hardcoded for speed.
- Complex Excel formulas (AVERAGE, COUNTIF, IF, and cross-sheet references) are all correct.
- Set up the data structures, implicitly doing the data modelling:
mslq_questions = [
# Format: [question_number, question_text, scale, is_reversed]
[
1,
"In a class like this, I prefer course material that really challenges me so I can learn new things.",
"Intrinsic Goal Orientation",
False,
],
[
2,
"If I study in appropriate ways, then I will be able to learn the material in this course.",
"Control of Learning Beliefs",
False,
],
##################### TRUNCATED ########################
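To make the scoring concrete, here’s a minimal sketch of the rules above in plain Python (my own recap, not Claude’s generated code; score_scale and its arguments are illustrative names):

def score_scale(responses, items, reversed_items):
    """responses: {question_number: answer 1..7}; items: the scale's questions."""
    # Reverse-coded answers are subtracted from 8, mapping 1-7 onto 7-1;
    # each scale score is the mean of its items.
    vals = [8 - responses[q] if q in reversed_items else responses[q]
            for q in items]
    return sum(vals) / len(vals)

# Example: an answer of 1 on a reverse-coded item counts as 7.
print(score_scale({33: 1, 1: 5}, items=[33, 1], reversed_items={33}))  # 6.0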
Try It
Run the code to generate the Excel, or copy this template: https://docs.google.com/spreadsheets/d/1B9suxNdatBROsPIz8OrdKYoIla-VDR46
Minor Quirks
- A missing “=” in formulas was an easy fix (changed f"IF… to f"=IF…). It was easy to spot since the cells displayed the formula as a literal string.
- The code used fitz, the old name for pymupdf. Swapping all instances of fitz to pymupdf worked seamlessly.
- Initially, scale and reverse-coding info appeared on the Questions tab, potentially priming respondents. A follow-up prompt moved these to a new Metadata tab:
Scale and Reversed? should not be in the Questions tab because it will prime the respondent. show them in another tab
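The second quirk’s fix is mechanical, since pymupdf is the same package under a new top-level name; a minimal sketch:

import pymupdf  # previously: import fitz (the legacy module name)

doc = pymupdf.open("MSLQ.pdf")  # identical API under the new name
print(doc.page_count)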
Now what?
With the Excel ready, I needed to ensure Claude’s extraction was accurate. Could I trust the data pulled from a fuzzy PDF?
Validation with Visualization
I asked Claude to highlight where it found each question, reverse-coding status, and scale associations in the PDF, just like validating a translation by translating back to the source language.
Validation Prompt
The source MSLQ.pdf is attached.
I want to see where in the pdf is the information in code extracted from. Write python to highlight the pdf, and create a new pdf.
Information i want to check: 1. question text 2. whether each question is reverse coded 3. Associations between questions and scale
Validation Code
See the code here: Read MSLQ.pdf and add highlights and summary page to create MSLQ_highlighted.pdf.
It produced MSLQ_highlighted.pdf with a new page appended, and a validation report, MSLQ_highlighted_validation.xlsx:
Processing PDF: MSLQ.pdf
Scanning document for relevant information...
Questions not found:
Q1: I prefer class work that is challenging so I can learn new things.
Q50: When studying for this course, I often set aside time to discuss course material with a group of students from the class.
Q51: I treat the course material as a starting point and try to develop my own ideas about it.
Q61: I try to think through a topic and decide what I am supposed to learn from it rather than just reading it over when studying for this course.
Validation data saved to: MSLQ_highlighted_validation.xlsx
Highlighted PDF saved to: MSLQ_highlighted.pdf
Analysis Statistics:
- Questions found: 77
- Questions not found: 4
- Reversed items found: 8
- Scales found: 0
Process completed successfully!
I had to add REVERSED into search_terms to help the program highlight reversed items.
We can still see that 4 questions, and all of the scale associations, could not be found. Let’s investigate.
Debugging Missing Highlights
Question 1


We use conditional breakpoints to stop at page 12 (page_num == 11) and question 1 (q_num == 1), then compare the search term (q_text) against the PDF text (page.get_text()).
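If your editor doesn’t support conditional breakpoints, an inline guard achieves the same stop; a sketch assuming doc comes from pymupdf.open("MSLQ.pdf") and mslq_questions is the list shown earlier:

for page_num, page in enumerate(doc):
    for q_num, q_text, scale, is_reversed in mslq_questions:
        if page_num == 11 and q_num == 1:   # page 12, question 1
            breakpoint()                    # drop into pdb here
            # inside pdb: p q_text   /   p page.get_text()
        hits = page.search_for(q_text)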
You can see Claude has rephrased the question, thus breaking the match.
Why can Question 16 be found?
Sharp readers would have realized the text is split across 2 lines in the pdf, but the search term is on 1 line.
This still works because page.search_for is designed to search across PDF lines by default.
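For example, one search_for call returns a rectangle per matched line segment, so highlighting wrapped text needs no special handling (the needle string and output filename below are placeholders):

import pymupdf

doc = pymupdf.open("MSLQ.pdf")
page = doc[11]                                   # page 12, zero-based
needle = "question text that wraps onto a second pdf line"  # placeholder
for rect in page.search_for(needle):             # one rect per line segment
    page.add_highlight_annot(rect)
doc.save("MSLQ_highlighted_demo.pdf")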
Why is Question 24 not highlighted, yet not reported as not found?
- The PDF repeats each question twice: first under its associated scale, then in the ordered list of questionnaire items. If at least one of the two appearances matches, the question counts as found. Even when both match, question numbers are deduplicated in a set before the statistics are reported.
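In code terms, that bookkeeping might look like this (a sketch; doc and mslq_questions as before):

found = set()
for page in doc:
    for q_num, q_text, _, _ in mslq_questions:
        if page.search_for(q_text):   # non-empty hit list on this page
            found.add(q_num)          # a second appearance dedupes here
print(f"Questions found: {len(found)}")
print(f"Questions not found: {81 - len(found)}")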
Question 50

Following the same strategy, q_text was missing the “the” before “course material”.
q_text: When studying for this course, I often set aside time to discuss course material with a group of students from the class
page.get_text: When studying for this course, I often set aside time to discuss the course material with a group of students from the class
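A word-level diff makes this kind of discrepancy jump out immediately; a sketch using Python’s standard difflib:

import difflib

extracted = ("When studying for this course, I often set aside time to "
             "discuss course material with a group of students from the class")
pdf_text = ("When studying for this course, I often set aside time to "
            "discuss the course material with a group of students from the class")

# Only insertions/deletions are printed; here, a single "+ the".
for token in difflib.ndiff(extracted.split(), pdf_text.split()):
    if token[0] in "+-":
        print(token)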
Question 51

print(page.get_text()) saw “try” as “ery” due to bad scan quality: “treat the course material as a starting point and ery to develop my own ideas about it”.
Question 61

q_text had 3 extra words appended, the trailing “for this course”: “I try to think through a topic and decide what I am supposed to learn from it rather than just reading it over when studying for this course.”
Fixing the Bugs
In the LLM era, perfection isn’t the goal — empowerment is. Potential fixes:
- Rephrasing (Q1): Prompt Claude to avoid grammatical changes.
- Missing/extra words (Q50, Q61): Emphasize exact word count in prompts.
- Scan quality (Q51): Apply spellcheck tools or fuzzy matching (see the sketch below).
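A cheap alternative to a full spellcheck pass is fuzzy matching with the standard library; a sketch where fuzzy_found is a hypothetical helper, not part of the generated scripts:

import difflib

def fuzzy_found(q_text: str, page_text: str, cutoff: float = 0.85) -> bool:
    """Tolerate OCR noise like 'try' read as 'ery' by accepting near matches."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    # Join adjacent lines so questions split across two pdf lines still
    # form a single candidate string.
    candidates = lines + [a + " " + b for a, b in zip(lines, lines[1:])]
    return bool(difflib.get_close_matches(q_text, candidates, n=1, cutoff=cutoff))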
Claude’s Wins
- Consistent variable names (mslq_questions) across the extraction and highlighting scripts.
- Consistent question phrasing within mslq_questions across the scripts (the extraction errors in Q2 and Q4 appear in both).
- Clear data structures indexed by question number (reversed_items and scale_associations).
- Bonus: a validation Excel (MSLQ_highlighted_validation.xlsx) for non-technical teammates, and a new summary page 76 in the highlighted PDF (though pagination overflow hid some text).


The new page 76 lists each question number, the page where it was first found, and its description.
Possible Improvements
- Optimize the highlighting code’s 4 nested loops (e.g., convert O(n) list searches to O(1) set lookups, as sketched below, or use parallel processing).
- Set Excel column widths to the longest question text to avoid truncation.
- Address user impatience with lengthy questionnaires, the real bottleneck.
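The first improvement is a one-line change per lookup; a sketch with illustrative stand-ins for the scripts’ data structures:

# Hypothetical subsets of the real data; reversed_items was a list.
reversed_items = [33, 37, 40]
scale_associations = {"Rehearsal": [39, 46, 59, 72]}

# Build sets once; membership tests drop from O(n) to O(1).
reversed_set = set(reversed_items)
scale_sets = {scale: set(qs) for scale, qs in scale_associations.items()}

print(33 in reversed_set)   # True, constant-time lookup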
Interpreting the scores
You may want to use this feedback to do something about changing your study skills or motivation. All of the motivational and study skills mentioned on your feedback sheet are learnable.
If your scores are above 3, then you are doing well. If you are below 3 on more than six of the nine scales, you may want to seek help from your instructor or the counseling services at your institution.
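That rule is simple to automate once the scale scores are computed; a minimal sketch (the scores dict is illustrative):

scores = {"Test Anxiety": 2.6, "Rehearsal": 2.25, "Task Value": 5.0}  # etc.
low = [scale for scale, value in scores.items() if value < 3]
if len(low) > 6:
    print("Below 3 on more than six scales: consider seeking help.")
else:
    print(f"Scales below 3: {low or 'none'}; otherwise doing well.")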
What did I learn?
I completed the questionnaire by reflecting on how I behaved during university, averaging across a broad range of courses.

- Task Value 5
- Self-Efficacy for Learning & Performance 4.375
- Test Anxiety 2.6
- Rehearsal 2.25
- Elaboration 3.5
- Organization 4.5
- Metacognitive Self-Regulation 3.9
- Time & Study Environment 5.125
- Effort Regulation 6.25
Effort Regulation
My effort regulation score was surprisingly high, indicating that when I set goals, I tend to complete them. The real challenge arises when I either don’t set goals or set the wrong ones.
Metacognitive Self-Regulation
An average score here makes sense. In my first two years of university, I spent a lot of time obsessively planning, monitoring, and regulating my study sessions — sometimes down to 5-minute granularities over 3-hour blocks.
While it made me efficient, it also drained any sense of inspiration or spontaneity. I went through a phase of consuming self-help content, chasing constant productivity and brilliance, only to become jaded.
Later, after taking a leadership position in AIESEC, I shifted 20 hours per week into leadership development. My grades suffered, but the life skills I gained were worth it.
Elaboration
My average elaboration score reflects that I didn’t take particularly good notes. Studying two degrees in four years forced me into an unsustainable pace — I had little time to slow down, deeply understand concepts, or consolidate ideas.
I often rushed through recorded webcasts at 2–3× speed and had just 1.5 days of revision time per module to cover 13 weeks of material.
Still, the sheer intensity of that experience raised my learning ceiling to a level I haven’t quite reached since.
Rehearsal
My low rehearsal score isn’t surprising. I never understood the point of rote memorization.
I believed time was better spent engaging in higher-order thinking and connecting ideas rather than copying notes verbatim.
I did explore memory techniques for fun — once memorizing 30 random items within 90 seconds and recalling them after a 10-minute gap, but ultimately decided against investing the immense amount of brain rewiring needed to become truly proficient.
Moving Forward
Looking ahead, I want to tap into that high-performance mode from university once or twice a year, in a sustainable rhythm.
Now that I’m beyond university, I can choose the work I want to pour effort into, rather than being forced through arbitrary modules. As a result, strict effort regulation feels less critical.
Instead, my focus will be on sharpening Metacognitive Self-Regulation — particularly planning, pivoting effectively, and becoming more comfortable with uncertainty and change.
The unexamined life is not worth living — Socrates
Reigniting dormant wisdom
The MSLQ, like many research treasures, has its insights buried away in dusty PDFs. Decades-old frameworks still hold answers to today’s challenges — attention, focus, and growth.
Imagine the forgotten gems waiting to be rediscovered.
Pick one, dust it off, and let AI help you bring it to life to empower communities, schools, and workplaces.
We’re not just using LLMs to generate the future — we can use them to reclaim the best of the past.
Resources
- Pintrich, P. R., Smith, D. A. F., Garcia, T., & McKeachie, W. J. (1991). A manual for the use of the Motivated Strategies for Learning Questionnaire (MSLQ) (Technical Report №91-B-004). University of Michigan, Ann Arbor, MI. Available from https://eric.ed.gov/?id=ED338122
- Philip Guo’s Computational Pedagogy Research: https://pg.ucsd.edu/
- Proquest Education Resources: https://about.proquest.com/en/products-services/pq_ed_journals
- Aggregator of Education databases: https://www.bu.edu/library/pickering-educational/research/collections/databases
- Aggregator of Subject-based databases: https://browse.welch.jhmi.edu/searching/databases-by-subject