
A Look at FinReflectKG: AI-Driven Knowledge Graph in Finance
Last Updated on September 29, 2025 by Editorial Team
Author(s): Marcelo Labre
Originally published on Towards AI.

Last week at the Quant x AI event here in New York, I had the pleasure of seeing
Fabrizio Dimino present a compelling paper he co-authored: “FinReflectKG: Agentic Construction and Evaluation of Financial Knowledge Graphs.”
The paper tackles a critical challenge holding back financial AI. While we have powerful language models, they lack the structured, reliable symbolic reasoning systems needed to truly understand the complex world of finance. More specifically, building knowledge graphs (KGs) on regulatory documents like SEC filings is a massive hurdle.
The FinReflectKG paper offers a powerful solution with two key contributions: a new, open-source financial KG dataset and a novel framework for building it.
Here, I’ll summarize their brilliant approach and then propose a way to improve their monitoring of global semantic diversity in the model.
The Core Innovation: A “Self-Reflecting” AI Agent
The centerpiece of the paper’s methodology is a sophisticated, three-mode pipeline for extracting knowledge “triples” (like (Nvidia, Produces, GPUs)) from financial documents.
While they test simpler Single-Pass and Multi-Pass methods, their most innovative approach is the Reflection-driven agentic workflow.
This agentic process works like a team:
- An Extraction LLM first takes a chunk of a document and extracts an initial set of knowledge triples.
- A Critic LLM then reviews these triples, providing structured feedback on any issues it finds, such as ambiguous entities (e.g., using “We” instead of the company ticker) or non-standard relationship types.
- A Correction LLM takes this feedback and refines the triples.
This feedback loop repeats until no more issues are found or a maximum number of steps is reached, systematically improving the quality of the extracted knowledge.
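The extract–critique–correct loop can be sketched in a few lines of Python. The `llm_*` functions below are hypothetical stand-ins for the paper's three LLM roles (in the real system each would be a prompted model call); the triple and the ambiguity rule are illustrative only.

```python
# Minimal sketch of the three-role reflection loop described above.
# The llm_* functions are hypothetical stand-ins for real LLM calls.

def llm_extract(chunk):
    # Stand-in: a real Extraction LLM would return (subject, relation, object)
    # triples mined from the document chunk.
    return [("We", "Produces", "GPUs")]

def llm_critique(triples):
    # Stand-in Critic: flag ambiguous subjects such as "We" or "the company".
    return [i for i, (s, _, _) in enumerate(triples)
            if s.lower() in {"we", "the company"}]

def llm_correct(triples, issues, ticker):
    # Stand-in Correction LLM: resolve flagged subjects to the company ticker.
    return [(ticker if i in issues else s, r, o)
            for i, (s, r, o) in enumerate(triples)]

def reflect(chunk, ticker, max_steps=3):
    triples = llm_extract(chunk)
    for _ in range(max_steps):
        issues = llm_critique(triples)
        if not issues:          # stop when the critic finds no remaining problems
            break
        triples = llm_correct(triples, issues, ticker)
    return triples
```

With the stand-ins above, `reflect("…", "NVDA")` resolves the ambiguous subject in one pass and then terminates because the critic finds nothing further to flag.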
Proving Its Worth Without a Ground Truth
A major challenge in KG construction is evaluation, as there’s often no perfect “answer key” to compare against. The authors developed a holistic evaluation framework to address this, using several complementary methods:
- CheckRules: A set of custom, rule-based checks to enforce quality and consistency. For example, rules automatically flag ambiguous subjects like “we” or “the company” and ensure all extracted entities and relationships comply with a predefined schema.
- Coverage Ratios: Metrics to measure how comprehensively the KG captures the diversity of entities and relationships present in the source documents.
- Semantic Diversity: An analysis using information theory (Shannon and Rényi entropy) to measure the balance and variety of the extracted knowledge, ensuring the graph isn’t overly skewed towards a few common concepts.
- LLM-as-a-Judge: A comparative evaluation where a powerful LLM assesses the outputs of the three different extraction modes (single-pass, multi-pass, and reflection) across four key dimensions: Precision, Faithfulness, Comprehensiveness, and Relevance.
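The Semantic Diversity check above can be made concrete with a short sketch. This is my own minimal implementation of the two entropy measures the paper names, assuming the extracted elements are represented as simple strings:

```python
import math
from collections import Counter

def shannon_entropy(items):
    # Shannon entropy (in bits) of the empirical distribution over elements;
    # higher means the extracted knowledge is more evenly spread.
    counts = Counter(items)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def renyi_entropy(items, alpha=2.0):
    # Rényi entropy of order alpha (alpha != 1); the limit alpha -> 1
    # recovers Shannon entropy.
    counts = Counter(items)
    n = sum(counts.values())
    return math.log2(sum((c / n) ** alpha for c in counts.values())) / (1 - alpha)
```

A relation set skewed toward a few common types (e.g. many repeated `Produces` edges) scores lower on both measures than a balanced one, which is exactly the skew the evaluation guards against.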
Key Findings and Verdict
The results clearly demonstrate the superiority of the reflection-agent mode. It consistently achieves the best balance of reliability and coverage.
- It achieved the highest compliance score, with 64.8% of extracted triples passing all four strict CheckRules.
- It extracted the most triples per document chunk (15.8) and had significantly higher entity and relationship coverage ratios than the other methods.
- In the LLM-as-a-Judge evaluation, it was the clear winner in Precision, Comprehensiveness, and Relevance.
The primary trade-off is speed. The iterative feedback loop requires more computation, making it less suitable for real-time applications where a single-pass approach might be preferred.
Future Directions
The authors conclude by outlining their plans to significantly expand the project, including:
- Enlarging the dataset to cover all S&P 500 companies over the last 10 years.
- Developing a schema-free pipeline that can create ontologies from scratch, inspired by the “Extract-Define-Canonicalize” (EDC) framework.
- Building Temporal Knowledge Graphs to capture the evolution of financial relationships over time, enabling causal reasoning for applications like thematic investing.
Suggestion: Monitoring Semantic Refinement via Cross-Entropy
The paper notes that the Reflection method, while improving compliance and coverage, reduces the diversity of the extracted elements as measured by absolute entropy. This is an expected outcome of a rule-constrained process. However, the authors’ proposal to monitor “when diversity falls below a predefined threshold” using absolute entropy presents a practical challenge: absolute entropy values are difficult to interpret and make thresholding arbitrary.
A more principled approach is to measure the relative change in the information landscape between the baseline and the refined output. I suggest a direct application of the Principle of Minimum Cross-Entropy (MinXEnt), an extension of Jaynes’ Maximum Entropy Principle (MaxEnt). In this case, we treat the distribution from the Single-Pass method as the prior and measure how the Reflection method’s distribution diverges from it at each step.
The ideal tool for this is the Kullback–Leibler (KL) divergence (also known as relative entropy), which quantifies the information gained or lost when one probability distribution is used to approximate another.
Proposed Methodology
The goal is to monitor the KL Divergence at each iteration t of the Reflection agent's refinement loop.
1. Establish the Prior Distribution (p): First, run the Single-Pass method across the corpus. From the full set of extracted elements (entities, types, and relationships), derive a normalized frequency distribution:

p(x_i) = n(x_i) / Σ_j n(x_j)

where n(x_i) is the number of times element x_i appears in the Single-Pass output. This represents our baseline, or prior, understanding of the data’s structure.
2. Calculate Iterative Distributions (q(t)): For the Reflection method, at each step t of the iterative feedback loop for a given chunk of text c, derive the corresponding normalized frequency distribution:

q_c(t)(x_i) = n_c(t)(x_i) / Σ_j n_c(t)(x_j)

where

t = 1, 2, …, m

and m is the stopping iteration for chunk c as per the stopping criteria in section 4.3.3.
3. Unify and Smooth: As the set of extracted elements will differ between the two methods for every t, a direct comparison is not possible. To solve this:
- Create a unified vocabulary that is the union of all unique elements found in both p and all q(t).
- Represent all distributions over this unified vocabulary. Any element not present in a given distribution will have an initial frequency of zero.
- Apply Laplace (add-one) smoothing to all frequencies. This is critical to avoid zero probabilities in the denominator of the KL Divergence formula, ensuring the metric is always well-defined.
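The unify-and-smooth step above can be sketched as follows. The element names and counts are purely illustrative, not taken from the paper's data:

```python
from collections import Counter

def smoothed_distribution(counts, vocab):
    # Laplace (add-one) smoothing over a unified vocabulary: every element,
    # including those never observed in this distribution, gets a strictly
    # positive probability, so KL divergence is always well-defined.
    total = sum(counts.values()) + len(vocab)
    return {x: (counts.get(x, 0) + 1) / total for x in vocab}

# Illustrative (hypothetical) element counts from each extraction method.
p_counts = Counter({"NVDA": 3, "GPUs": 2, "TSMC": 1})   # Single-Pass baseline
q_counts = Counter({"NVDA": 2, "GPUs": 1, "CUDA": 1})   # one Reflection step

vocab = set(p_counts) | set(q_counts)   # unified vocabulary (union of elements)
p = smoothed_distribution(p_counts, vocab)
q = smoothed_distribution(q_counts, vocab)
```

Note that "CUDA", unseen in the baseline, still receives probability 1/(6+4) = 0.1 under p after smoothing, which is what keeps the divergence computation finite.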
Applying Laplace (add-one) smoothing over the unified vocabulary then gives:

q̃(x_i) = (n(x_i) + 1) / (N + V)

where n(x_i) is the raw count of element x_i in a given distribution, N = Σ_j n(x_j) is that distribution’s total count, and V is the size of the unified vocabulary. Every element, including those with a raw count of zero, thus receives a strictly positive probability.
4. Compute KL Divergence: For each iteration t, compute the KL Divergence of the Reflection distribution q(t) from the Single-Pass prior p:

KL(q(t) ‖ p) = Σ_i q(t)(x_i) · log( q(t)(x_i) / p(x_i) )
Interpretation and Monitoring
By plotting KL(q(t) ‖ p) against the iteration step t, we can directly observe the refinement process. We would expect the KL divergence to reach a minimum at some point. This minimum represents the optimal point of refinement: the iteration at which the agent has extracted the most new information without beginning to overfit or degrade the quality of the graph.
This provides a principled, data-driven stopping criterion and a far more interpretable measure of the agent’s progress than absolute entropy. More reliable monitoring thresholds can also be derived.
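Putting steps 1–4 together, the proposed monitoring metric can be sketched end to end. This is a minimal illustration of the idea, assuming element counts are available as `Counter` objects per method and per reflection step:

```python
import math
from collections import Counter

def kl_divergence(q, p):
    # KL(q || p) = sum_i q_i * log2(q_i / p_i), in bits.
    return sum(q[x] * math.log2(q[x] / p[x]) for x in q)

def smoothed(counts, vocab):
    # Laplace (add-one) smoothing over the unified vocabulary.
    total = sum(counts.values()) + len(vocab)
    return {x: (counts.get(x, 0) + 1) / total for x in vocab}

def monitor(prior_counts, per_step_counts):
    # KL of each Reflection step's distribution q(t) from the Single-Pass
    # prior p, computed over the unified vocabulary so the supports match.
    vocab = set(prior_counts).union(*map(set, per_step_counts))
    p = smoothed(prior_counts, vocab)
    return [kl_divergence(smoothed(qc, vocab), p) for qc in per_step_counts]
```

Plotting the returned series against t and stopping near its minimum would implement the data-driven stopping criterion suggested above; a step identical to the prior yields zero divergence, and steps that drift from it yield positive values.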
The Next Frontier
The work in FinReflectKG is more than just an academic exercise. It represents a blueprint for the next generation of AI, showing us how to guide today’s powerful models from being fluent statistical parrots toward becoming disciplined, verifiable reasoners.
This is the work of building a true foundation for intelligence, grounded in a symbolic source of truth. It is the next frontier for intelligent systems.
References
[1] “FinReflectKG: Agentic Construction and Evaluation of Financial Knowledge Graphs,” available on arXiv and Hugging Face.
[2] FinReflectKG open-source KG dataset, available on Hugging Face.
