
Can LLMs Truly Think Outside the Box?

Author(s): Max Shap

Originally published on Towards AI.

Image by author

I was surprised to hear from a few friends that some tech companies still include puzzles in their interview process, or even dedicate a full 45-minute session to them, asking candidates to solve problems like this:

Two people had to get to the top of a hill. There was only one motorcycle, and only one person could ride it. Yet they both got to the top of the hill. How?

In the 2010s, such interview questions were popular among big tech companies, particularly Microsoft and Google. These puzzles, often called brainteasers, were designed to test a candidate's lateral thinking [1], or "thinking outside the box." This approach involves creative, divergent thinking, which requires looking at a problem from new perspectives and challenging assumptions. It's often associated with the right hemisphere of the brain.

However, Google later admitted that brainteasers were ineffective [2] for hiring, as it found little correlation between performance on these puzzles and job success. As a result, Google, and many other companies, dropped brainteasers from their hiring processes.

Let's set aside why some companies might revisit riddles for candidate assessments today (you can read my thoughts on this in the Conclusions section). Instead, let's focus on the fun part: how do cutting-edge AI models perform on brainteaser questions that demand not only strong reasoning skills but also a touch of creative thinking?

Methodology

I tested two frontier models from OpenAI for this experiment: gpt-4o-mini and gpt-4o.

Side note: running the experiment through the API with o1 was way too expensive 😂

The model was tested under the same conditions as a human candidate in an interview:

  1. It was asked to solve the riddle.
  2. It was allowed to output intermediate thoughts to demonstrate the thinking process.

To that end, the following system prompt was used:

system_prompt = """
You are an expert in solving riddles.
You will be provided with a riddle wrapped in the tags: <riddle>riddle text</riddle>.

Your task is to provide an answer to the riddle.

If you find it helpful, you may output your intermediate thoughts to aid in finding the answer. These should be wrapped in the tags <thinking>your thinking process</thinking>. However, this is optional.
You must conclude your response with the final answer wrapped in the tags <answer>your answer</answer>.
If you are unsure of the answer, respond with <answer>I have no answer</answer>.
Let's begin solving riddles.
"""

Example riddle:

Riddle: Jessica is telling her friends this story and asks them to guess if it's the truth or a lie: 
"There was a man sitting in a house at night that had no lights on at all.
There was no lamp, no candle, and no other source of light.
Yet, he sat in the house and read his book happily."
Her friends say she's lying, but Jessica corrects them and says
she's telling the truth.
Jessica's story is true.
But how?

Prediction by gpt-4o model:

<thinking>We need to consider all possible ways a person can read without traditional sources of light. 
The key point of the riddle is "no lights on at all" which rules out any man-made lighting sources.
This leads us to think about natural ways a person could read in the absence of artificial light.
The most plausible natural source of light that would allow someone to read is sunlight or moonlight.
</thinking>

<thinking>Since the scenario happens "at night,"
sunlight is not possible, leaving moonlight.
But even without moonlight, there is one other potential explanation:
the man could be blind and reading Braille.
This explanation fits as it doesn't depend on any light source at all.
</thinking>

<answer>The man is blind and reading a book in Braille.</answer>
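
The final answer then needs to be pulled out of the generated text. A simple regex over the <answer> tags is enough; this is a sketch under the assumption that a missing tag is treated as the model giving up:

import re

def extract_answer(completion: str) -> str:
    # Grab the text between <answer> and </answer>; fall back to the "no answer" reply.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else "I have no answer"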

After the model generated a response, I needed a way to evaluate the correctness of the results. Directly comparing the predicted answers with the reference answers wasn't feasible, as the model might generate semantically correct answers phrased differently. To address this, I used another LLM, specifically the gpt-4o model, as a judge to validate the predictions.

The validator was given relaxed criteria for evaluating correctness: if the predicted answer made sense, even if it didn't match the reference answer exactly, the sample was still marked as correct.

The following system prompt was used for this purpose:

You are an expert in validating answers to riddles.

You will be provided with the following:

A riddle wrapped in the tags: <riddle>riddle text</riddle>.
A reference answer wrapped in the tags: <reference_answer>text</reference_answer>.
A predicted answer wrapped in the tags: <predicted_answer>text</predicted_answer>.
Your task is to determine whether the predicted answer matches the reference answer.

Focus on whether the meaning of the predicted answer aligns with the reference answer, ignoring any typos.
The reference answer may also include an explanation, usually in a separate sentence. If the predicted answer contains reasoning that differs from the reference reasoning but the predicted answer itself is correct, you should still consider the riddle as solved correctly.
If you strongly believe the predicted answer is valid and can be treated as correct (even if it is completely different from the reference answer), you may decide that the riddle is solved correctly.
You may output intermediate thoughts to help you reach a decision. These should be wrapped in the tags <thoughts></thoughts>.

Finally, return your verdict wrapped in the tags <verdict>your verdict</verdict>.
Your verdict should be either True (for matching answers) or False (if the answers do not match).
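
In code, the judge call mirrors the solver call. Here is a sketch, assuming the prompt above is stored in validator_prompt and reusing client and re from the earlier snippets; temperature is pinned to 0 so the verdicts are as deterministic as possible:

def validate(riddle: str, reference: str, predicted: str) -> bool:
    user_message = (
        f"<riddle>{riddle}</riddle>\n"
        f"<reference_answer>{reference}</reference_answer>\n"
        f"<predicted_answer>{predicted}</predicted_answer>"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": validator_prompt},
            {"role": "user", "content": user_message},
        ],
        temperature=0,
    )
    # The verdict is either True or False inside <verdict> tags.
    match = re.search(r"<verdict>(.*?)</verdict>", response.choices[0].message.content, re.DOTALL)
    return match is not None and match.group(1).strip() == "True"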

Here is an example puzzle, along with predictions from two models and the verdict from the validator:

The predicted answers differ from the reference answers; however, they fit the question perfectly. The validator catches this and marks the predictions as correct.

Finally, I calculated the accuracy.
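
With the pieces above, accuracy is just the fraction of riddles the judge marks as solved. A sketch, where riddles is assumed to be a list of dicts with "riddle" and "answer" keys:

# Solve each riddle, extract the answer, and let the judge score it.
correct = sum(
    validate(item["riddle"], item["answer"], extract_answer(solve_riddle(item["riddle"])))
    for item in riddles
)
accuracy = correct / len(riddles)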

In total, I spent around $15 on running predictions and validations.

To summarize, the approach was as follows:

  1. Present each riddle individually to gpt-4o-mini and gpt-4o, prompting them to think step by step and solve it.
  2. Use gpt-4o as a judge (since it is more powerful than the mini variant). Provide the text of the riddle, along with the correct answer and the generated response, and ask it to evaluate whether the generated answer is semantically close to the reference answer.
  3. Calculate the accuracy.

Data

For my experiments, I used the carefully curated dataset created by the authors of the BRAINTEASER paper [3]. Each puzzle in the dataset is designed to evaluate a broad range of human intelligence skills, including strategy development, planning, visual-spatial thinking, creativity, and memory.

To build this dataset, the authors first collected thousands of puzzles from public resources. They then applied filtering, deduplication, and grammar correction, followed by human verification to ensure the puzzles retained their original meaning. Each puzzle includes an answer, and some samples also provide reasoning to explain the solution. Finally, the authors augmented the puzzles with two key transformations:

  • Semantic Reconstruction: Rephrasing the original question while keeping the answer unchanged.
  • Context Reconstruction: Maintaining the misleading commonsense premise but changing both the question and the answer to fit a new situational context.

These augmentations were crucial for evaluating the model's lateral thinking abilities rather than its memorization skills. (As we know, LLMs are trained on vast amounts of internet data, so some of these puzzles might have appeared during their pretraining.)

Performance on semantically reconstructed puzzles reflects how well the model understands the puzzle's meaning. In contrast, performance on contextually reconstructed puzzles reveals the model's reasoning capabilities.

Examples of an original puzzle, its semantic reconstruction, and its context reconstruction are given in the paper [3].

The final dataset contains 1,100 high-quality brain teasers. Check out the paper for more details.

Results

Overall, the mini variant of the model performs about 20% worse than the main version, suggesting that it is probably not the best choice for reasoning tasks.

Breaking the results down by Original puzzles, Semantic reconstructions, and Context reconstructions:

The accuracy of gpt-4o is quite high, reaching 84% on original puzzles. However, it drops significantly, by 10%, on Semantic Reconstruction riddles, where the question is rephrased using different words but retains the same meaning. This may suggest two potential issues within the model: (1) sensitivity to word order, even in advanced models, and (2) a degree of randomness in its reasoning process.

More notably, the performance gap is even larger, nearly 20%, on Context Reconstruction puzzles. These puzzles present entirely new scenarios, requiring the model to rely on reasoning abilities rather than memorization.

While the results are still reasonable and could undoubtedly be improved in various ways (see the Next Steps section), the model occasionally fails even on simple riddles.

Next steps

The results I achieved can certainly be improved in several ways. Below, I list some options worth trying to strengthen the model's reasoning performance.

Ensembling for Improved Accuracy. One straightforward improvement is to solve the same riddle multiple times in parallel (e.g., N times) and vote for the most frequent answer. This ensembling technique is a standard approach and can typically boost performance metrics by 5-7%.
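
A minimal sketch of that voting scheme, reusing the hypothetical solve_riddle and extract_answer helpers from earlier. Note that it relies on the solver sampling at a non-zero temperature, so repeated runs can produce different answers; exact-string voting is also crude, since semantically identical answers can be phrased differently:

from collections import Counter

def solve_with_voting(riddle: str, n: int = 5) -> str:
    # Sample n independent answers and return the most frequent one.
    answers = [extract_answer(solve_riddle(riddle)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]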

Addressing Performance on Contextual Reconstruction. As observed, performance drops significantly on contextually reconstructed puzzles compared to the original ones. This likely occurs because the model has memorized many original puzzles during pretraining on internet data, relying less on genuine reasoning. To improve, we need a model specifically optimized for reasoning tasks, such as the o1 family or even o3 models. However, I didn't explore this route due to the high costs and long response times associated with these models. If anyone is interested in reproducing these experiments with more powerful models, the reproducible code is available at [4].

Simulating a Real-World Interview Scenario. It would also be intriguing to test the model in a simulated interview-like setting. In this setup:

  • The model acts as the candidate, generating an initial response to a riddle.
  • Another model serves as the interviewer, validating the candidate's response.
  • If the answer is incorrect, the candidate model can revise its response using feedback from the interviewer.

This approach mimics a real brainteaser interview, where candidates think step by step, validate their ideas with the interviewer, and adjust their reasoning as needed. It would be fascinating to analyze how much assistance from the validator (who knows the correct answer) is required to guide the generator model to the right solution. Additionally, we could explore whether this iterative process converges within a reasonable number of steps.
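
As a rough sketch of this candidate-interviewer loop (the feedback message and the cap on rounds are my own assumptions; a real interviewer would give more targeted hints):

def interview(riddle: str, reference: str, max_rounds: int = 3) -> bool:
    # The candidate model answers; the interviewer model validates;
    # on failure, the riddle is re-asked with feedback appended.
    prompt = riddle
    for _ in range(max_rounds):
        answer = extract_answer(solve_riddle(prompt))
        if validate(riddle, reference, answer):
            return True
        prompt = (
            f"{riddle}\n"
            f"Interviewer: your previous answer '{answer}' was not correct. "
            "Re-examine your assumptions and try again."
        )
    return False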

Conclusions

In this short article, I examined the reasoning capabilities of the gpt-4o model family on brainteasers using a high-quality puzzle dataset from [3]. While the model demonstrated solid performance overall, achieving 84% accuracy, its performance dropped significantly, to 65%, on puzzles that are unlikely to be publicly available on the internet and, therefore, were probably not part of the model's training data.

This performance gap highlights significant room for improvement in the reasoning skills of this model family.

I speculate that this gap might explain why some tech companies still include brainteaser-style questions in their interview processes. Since models struggle with these types of problems, they may help reduce the influence of generative AI on interviews. That said, this is purely my personal opinion and isn't based on any factual data.

I didn't discuss the performance of the o1 model in this article, which would be an excellent area for further exploration. Are there any volunteers willing to spend $100+ on such experiments and endure the long wait for results, given the model's internal reasoning process? I'd be curious to see how o1 performs. The code is available at [4], so feel free to try it out; perhaps it will excel.

I'm also intrigued by the idea of testing a simulation scenario where two models interact: one generates answers or reasoning, while the other validates and suggests improvements. This setup could mimic real-life interviewer-candidate communication. Could this iterative process boost results? It's an exciting question to explore.

That's all for this article. Thank you for reading! Let's connect in future discussions. Be sure to subscribe and connect with me on LinkedIn: https://www.linkedin.com/in/maxshapp/

See you in the next episode!

References

[1] Edward De Bono. (1970). Lateral Thinking: Creativity Step by Step. New York: Harper & Row.

[2] Hacker News thread

[3] Jiang, Y., et al. (2023). BRAINTEASER: Lateral Thinking Puzzles for Large Language Models. arXiv. https://arxiv.org/pdf/2310.05057

[4] Link to GitHub repo with notebook.
