How to Identify AI & ML Research Problems Worth Solving
Author(s): Ayo Akinkugbe
Originally published on Towards AI.

In Search of a North Star
Formulating a research question is one of the earliest and most important steps in the research process. It is the single biggest determinant of whether your research effort becomes directionless chaos or a worthwhile venture with credible outcomes. Yet it is also the step students struggle with the most. As you venture into the creative abyss that is research, you must know where you want to go so you don't get lost in it. The challenge here is often not a lack of ideas: the AI and machine learning space evolves so quickly that new problems emerge every month. The real roadblock is learning how to transform a broad interest into a focused, testable, and meaningful question.
A well-crafted research question acts as the north star of your project: it guides your hypotheses, informs your method, shapes your experiments, and ultimately frames the contribution you will make. This article breaks down the fundamentals of what makes a research question testable, how to narrow a broad topic into a precise problem, and how to distinguish the key components that new researchers often confuse: the research problem, purpose, objectives, and contribution.
What Makes a Research Question Testable?
A testable research question is one that can be answered with evidence, whether empirical, theoretical, or computational. In AI and ML, evidence typically comes from experiments, simulations, model evaluations, or analytical proofs. A question becomes testable when it specifies what is being studied, how it can be measured, and under what conditions it can be validated or refuted. For example, the question “How can we improve AI fairness?” is too vague to test. It contains no measurable variables, no explicit context, and no definition of “improvement.” On the other hand, “Does training a classifier with counterfactual data augmentation reduce gender bias in toxicity detection models compared to standard fine-tuning?” is highly testable. It defines the intervention (counterfactual augmentation), the model (toxicity classifier), the metric (gender bias), and the comparison baseline (standard fine-tuning).
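To make the idea of a measurable variable concrete, here is a minimal sketch of one way “gender bias” could be operationalized for the toxicity example above. It assumes binary toxicity labels and a recorded group attribute per example; the false-positive-rate gap is one illustrative metric among many, and the toy data and numbers are invented.

```python
# Sketch: one way to make "gender bias" measurable, assuming binary
# toxicity labels (1 = toxic) and a per-example group attribute.
# The FPR gap is an illustrative fairness metric, not the only choice.

def false_positive_rate(y_true, y_pred):
    """Fraction of non-toxic examples wrongly flagged as toxic."""
    negatives = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    if not negatives:
        return 0.0
    return sum(p for _, p in negatives) / len(negatives)

def fpr_gap(y_true, y_pred, groups):
    """Absolute FPR difference between two groups (e.g. 'f' vs 'm')."""
    by_group = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        by_group[g] = false_positive_rate(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    a, b = sorted(by_group.values())
    return b - a

# Toy data: ground truth, baseline predictions, augmented-model predictions
y_true = [0, 0, 0, 0, 1, 1, 0, 0]
groups = ["f", "f", "m", "m", "f", "m", "f", "m"]
baseline = [1, 1, 0, 0, 1, 1, 1, 0]    # over-flags one group's negatives
augmented = [1, 0, 0, 0, 1, 1, 0, 0]   # fewer spurious flags

print(fpr_gap(y_true, baseline, groups))    # gap before augmentation (1.0)
print(fpr_gap(y_true, augmented, groups))   # smaller gap after (~0.33)
```

With the intervention, metric, and baseline pinned down like this, the question admits a "no" answer: the gap could just as easily fail to shrink.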
Testable questions also need a feasible experimental path. If a question requires data you cannot obtain, computational resources beyond your reach, or evaluation methods that don’t exist, the question may be interesting but not practical for your research work. Finally, a testable question must allow for the possibility that the answer is “no.” If you phrase a question so that the only acceptable outcome is the one you expect, you are not doing research — you are writing a marketing statement. Good research questions allow for multiple possible answers, each of which can be supported or rejected through rigorous inquiry.
From Broad Topic to Narrow Problem to Research Gap
Most doctoral or research candidates begin with a broad interest area — something like model interpretability, generative AI safety, data efficiency, or bias detection. These broad topics are far too large to serve as research questions, but they are excellent starting points. The next step is to narrow your interest down to a specific problem. For example, “interpretability” might narrow to “feature attribution stability,” “concept discovery,” or “mechanistic interpretability of transformers.” The key is to identify a sub-topic where existing methods struggle, show inconsistency, or fail entirely.
Once you have a specific problem, your goal is to examine the literature closely to pinpoint a gap. A research gap is not simply “something no one has done before.” Rather, it is a mismatch between what we need and what current methods provide. Gaps appear when models fail in certain scenarios, when metrics are misleading, when assumptions break under real-world conditions, or when tradeoffs (e.g. accuracy vs. efficiency) create unexplored regions. For example: “Existing sparse auto-encoder methods interpret LLM representations, but they are computationally expensive and do not scale to 70B+ models.” That is a research gap. It describes a specific, recognized limitation whose resolution would move the field forward.
Moving from topic to problem to gap requires iteration. You explore the literature, test small prototypes, talk with your advisor or an experienced researcher in your field, and gradually home in on the piece of the puzzle where you can make a meaningful contribution. When your gap is clear, your research question almost writes itself: “Can we design a scalable sparse autoencoder architecture that extracts interpretable features from large language models while reducing memory usage by at least 30%?” That is narrow, actionable, and rooted in a real gap.
Practically Narrowing a Broad AI/ML Topic Into a Specific Research Problem
We discussed the theory of starting broad, finding a gap, and defining a problem in the last section. But how exactly do we execute this in practice? Below is a time-tested, reliable, practical method.
Step 1: Start With a Broad Topic and Map the Sub-Areas
Take your broad interest — e.g. “interpretability”, “fairness”, “reinforcement learning”, “model robustness”, etc. Then create a high-level map of its subcomponents.
Example (Interpretability):
- Feature attribution
- Saliency maps
- Counterfactuals
- Concept discovery
- Mechanistic interpretability
- Attribution stability
- Explainers for large language models
- Evaluation metrics for explanations
- Human-AI alignment of explanations
To find subcomponents of a broad research interest like the one above, you can:
- Search “[topic] survey paper” — Survey papers almost always break the field into major subareas.
- Check conference tutorials (NeurIPS, ICML, ICLR) — Tutorials are designed to introduce a topic and usually contain a slide listing its core subfields.
- Look at how benchmark papers structure the field — Benchmarks (e.g. OpenXAI, RobustBench, GLUE, Wilds) categorize subproblems clearly.
- Scan the table of contents of authoritative textbooks — For instance, Interpretable Machine Learning or Fairness in ML books outline subcomponents by chapter.
- Use taxonomy sections in recent literature reviews — Most reviews explicitly include “Taxonomy of Methods” or “Categories of Approaches”.
- Look at Papers With Code task categories — Search your broad topic → see task group and subtask structure.
Practical Task: Create a 1-page mind map or outline that breaks the main topic into 5–10 subtopics. This prevents wandering and gives you a menu of directions.
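If you prefer to keep the map in code rather than on paper, a nested structure works just as well and can be extended as you read. The subtopics below simply mirror the interpretability example above; everything else is a suggestion.

```python
# A minimal, code-form version of the 1-page topic map. The broad topic
# is the key; the subtopics are the menu of directions to choose from.
subtopics = {
    "interpretability": [
        "feature attribution",
        "saliency maps",
        "counterfactuals",
        "concept discovery",
        "mechanistic interpretability",
        "attribution stability",
        "explainers for large language models",
        "evaluation metrics for explanations",
        "human-AI alignment of explanations",
    ]
}

for topic, subs in subtopics.items():
    print(topic)
    for s in subs:
        print(f"  - {s}")
```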
Step 2: Read 5–10 High-Influence Survey-Level Papers
You are not doing a full literature review yet. This is a scouting phase to understand the landscape. Look for:
- Meta-surveys (e.g. “A Survey on Explainable AI”)
- State-of-the-art reports (arXiv, NeurIPS, ICML)
- Benchmarks or evaluation papers
- “Open problems” sections in those papers
These types of papers often explicitly tell you where methods fail, what challenges remain, and where the field is moving.
Practical Task: For each paper, write 3 bullets:
- What the paper claims is solved
- What the paper admits is not solved
- What it suggests for future research
This isolates research gaps faster than reading blindly.
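The three bullets per paper can also be kept in a machine-readable form, which makes the pattern-spotting in the next step almost automatic. This is only a sketch: the field names and the example papers' bullet contents are placeholders, not real survey findings.

```python
# Lightweight structured notes: three lists per paper, matching the
# three bullets above. Titles and bullet contents are illustrative.
from dataclasses import dataclass, field

@dataclass
class PaperNote:
    title: str
    claims_solved: list = field(default_factory=list)
    admits_unsolved: list = field(default_factory=list)
    future_work: list = field(default_factory=list)

notes = [
    PaperNote(
        title="A Survey on Explainable AI",
        claims_solved=["post-hoc attribution for CNNs"],
        admits_unsolved=["attribution instability under perturbation"],
        future_work=["evaluation metrics for explanations"],
    ),
    PaperNote(
        title="Benchmarking Feature Attribution",
        claims_solved=["standardized attribution benchmarks"],
        admits_unsolved=["attribution instability under perturbation"],
        future_work=["stability-aware metrics"],
    ),
]

# Every "admits unsolved" bullet is a research-gap candidate.
gaps = [g for n in notes for g in n.admits_unsolved]
print(gaps)
```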
Step 3: Look for Patterns in the Pain Points
As you gather the bullets from the papers, you’ll start seeing recurring pain points. For example, in interpretability literature:
- Attribution methods are unstable across perturbations.
- Concept activation vectors are sensitive to how concepts are defined.
- LLM explanations do not match ground-truth reasoning steps.
- Mechanistic circuits are hard to generalize across architectures.
If several papers note the same limitation, that is a researchable problem.
Practical Task: Create a failure-pattern table: one row per recurring limitation, with columns for the pain point, the papers that report it, and the conditions under which it appears.
This type of failure-pattern table reveals problem candidates.
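Building that tally can be done in a few lines. Here is a hedged sketch: the paper names and pain points are invented placeholders standing in for whatever your scouting notes actually contain.

```python
# Sketch of the failure-pattern table: count how often each limitation
# recurs across the papers you scouted. All entries are placeholders.
from collections import Counter

pain_points = {
    "Survey A": ["attribution instability", "no ground-truth explanations"],
    "Benchmark B": ["attribution instability", "metrics disagree"],
    "Method C": ["attribution instability", "metrics disagree"],
}

counts = Counter(p for points in pain_points.values() for p in points)
for problem, n_papers in counts.most_common():
    print(f"{problem:35s} reported by {n_papers} paper(s)")

# The limitation at the top of the table is the strongest problem candidate.
```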
Step 4: Evaluate Feasibility (The Filter Test)
Now you must narrow down to problems that are MUFT:
- Meaningful — not trivial
- Underserved — not already solved
- Feasible — can be done in 3–6 months (or within your research period)
- Testable — can be turned into an experiment
The easiest way is to ask these 5 questions:
- Can I test this with publicly available datasets?
- Can I reproduce the current state of the art in < 2 weeks?
- Can I measure improvement with clear metrics?
- Is the failure mode already acknowledged in papers?
- Will a new method or metric be an obvious contribution?
If the answer is yes to at least three or four of these questions, it's a strong candidate.
Step 5: Convert a Gap Into a Research Problem Statement
Most gaps sound like this:
- “Attribution methods are unstable across runs.”
- “Concept vectors drift depending on dataset composition.”
- “LLM explanations don’t reflect internal reasoning.”
A research problem is more specific:
“Current feature attribution methods produce explanations that vary significantly under small input perturbations, making them unreliable for safety-critical applications.”
This is the level where a hypothesis can be formed.
Step 6: Validate the Problem With a Quick Reproduction Experiment
This is crucial and often skipped. Before committing months to the topic, do a 2–3 day “mini experiment”:
- Reproduce 1–2 relevant papers
- Test the failure mode they claim
- Observe if it truly happens
- See what variables influence the failure
- Check if evaluation metrics are easy to compute
If the failure is real and measurable, you now have a validated, empirically grounded problem.
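As a flavor of what such a mini experiment can look like, here is a toy sketch of testing the attribution-instability claim: a synthetic logistic model, a gradient-times-input attribution, and a perturbation-based stability score. The model, data, and noise scales are stand-ins for whatever the papers you are reproducing actually use.

```python
# A 30-minute version of the "mini experiment": check whether gradient
# attributions on a toy logistic model really are unstable under small
# input perturbations. Everything here is synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)  # fixed "trained" weights of a logistic model

def attribution(x):
    """Gradient-times-input attribution for f(x) = sigmoid(w @ x)."""
    s = 1.0 / (1.0 + np.exp(-w @ x))
    return s * (1 - s) * w * x  # df/dx_i * x_i

def stability(x, eps=0.01, n_trials=50):
    """Mean cosine similarity between attributions of x and x + noise."""
    base = attribution(x)
    sims = []
    for _ in range(n_trials):
        noisy = attribution(x + eps * rng.normal(size=x.shape))
        sims.append(
            float(base @ noisy / (np.linalg.norm(base) * np.linalg.norm(noisy)))
        )
    return float(np.mean(sims))

x = rng.normal(size=8)
print(f"stability at eps=0.01: {stability(x, eps=0.01):.3f}")
print(f"stability at eps=0.50: {stability(x, eps=0.50):.3f}")
# If similarity drops sharply as eps grows, the instability is real
# and measurable: a validated, empirically grounded problem.
```

Swapping the toy model for the actual models and attribution methods from the papers turns this sketch into the 2–3 day validation described above.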
Step 7: Turn the Problem Into a Specific Research Direction
Once you’ve confirmed the problem is real and tractable, define:
- A specific dataset
- A specific evaluation metric
- A specific comparison baseline
- A specific hypothesis (H1/H0)
Example:
- Topic: Interpretability
- Subtopic: Attribution stability
- Gap: Attribution methods inconsistent for similar inputs
- Problem: No unified metric exists to quantify cross-method stability
- Concrete direction: Propose a stability metric and test it across IG, SmoothGrad, GradCAM
- Hypothesis: H1: A metric based on local curvature better predicts stability across perturbations than existing sensitivity metrics.
Now you have a defendable, testable research question.
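Once H1 and H0 are written down, deciding between them is a statistical question. Here is a hedged sketch using a paired sign-flip permutation test; the per-input error scores for the two metrics are invented placeholder numbers, not real results.

```python
# Sketch of testing H1 vs H0 with a paired permutation test. The error
# scores below (how badly each metric predicts observed stability,
# lower is better) are purely illustrative placeholders.
import random

random.seed(42)

curvature_err   = [0.10, 0.12, 0.08, 0.15, 0.09, 0.11, 0.13, 0.07]
sensitivity_err = [0.18, 0.16, 0.14, 0.20, 0.12, 0.19, 0.17, 0.15]

# Per-input improvement of the curvature metric over the sensitivity one
diffs = [s - c for c, s in zip(curvature_err, sensitivity_err)]
observed = sum(diffs) / len(diffs)

# H0: the two metrics predict stability equally well (mean diff = 0).
# Under H0 the sign of each paired difference is arbitrary, so we
# randomly flip signs and see how often the mean is at least as large.
n_perm, extreme = 10000, 0
for _ in range(n_perm):
    perm = sum(d * random.choice((-1, 1)) for d in diffs) / len(diffs)
    if perm >= observed:
        extreme += 1
p_value = extreme / n_perm
print(f"mean improvement: {observed:.3f}, p = {p_value:.4f}")
# A small p lets you reject H0 in favour of H1.
```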

Difference Between Research Problem, Purpose, Objectives, and Contribution
Many early-stage researchers use these terms interchangeably, but they serve distinct roles in a well-formed research design.
The research problem describes the underlying issue in the world or in the literature that motivates your study. The problem must be observable, documented, or experienced. For example: “Current LLM-based text detectors fail when paraphrasing or synonym substitution is applied, reducing detection accuracy by up to 40%.” This qualifies as a research problem because it identifies a specific, documented shortcoming with real-world consequences.
The purpose of the research clarifies why you are studying the problem and what you intend to achieve. The purpose is directional, not testable. In our example: “The purpose of this research is to develop a more robust AI-generated text detection method that is resilient to paraphrasing attacks.” It explains the intended outcome without specifying how you will measure success.
The objectives break the purpose into measurable actions. These are the steps you will take. Good objectives often begin with verbs like evaluate, compare, design, measure, or analyze. For example:
- Objective 1: Evaluate the failure modes of current detectors under paraphrasing transformations.
- Objective 2: Design a curvature-based detection method inspired by DetectGPT.
- Objective 3: Compare robustness across paraphrasing strategies using MAUVE and KL-based metrics.
These are concrete, verifiable, and can guide your methodology.
The contribution is what you add to the field — something that did not exist before. Contributions can be technical (a new algorithm), empirical (new evidence or discovery), theoretical (a new formulation), methodological (a new benchmark), or practical (a toolkit or dataset). A strong contribution is specific: “This research introduces a paraphrase-resistant text detection method that reduces false negatives by 25% compared to state-of-the-art models.” Understanding these distinctions will help you produce writing that is organized, credible, and aligned with standard research-level expectations.
Final Thoughts
Formulating a research question is an iterative process that requires both creativity and discipline. You start with an area that excites you, narrow it into a specific problem, uncover a gap through literature and experimentation, and then articulate a testable question that guides everything that follows. When your research question is clearly defined, your purpose becomes easier to articulate, your objectives become straightforward, and your contribution emerges naturally. Mastering this process will not only strengthen your paper but also shape your identity and modus operandi as a researcher.
Note: Content contains the views of the contributing authors and not Towards AI.