ChatGPT Combined with Graph Database to Predict a FIFA 2022 Winner but Went Wrong
Last Updated on July 24, 2023 by Editorial Team
Author(s): NebulaGraph Database
Originally published on Towards AI.
The Hype of ChatGPT
In the hype for FIFA 2022 World Cup, I saw a blog post from Cambridge Intelligence, in which they leveraged limited information and correlations among players, teams, and clubs to predict the final winning team. As a graph technology enthusiast, I always would like to try similar things with NebulaGraph to share the ideas of graph algorithms to extract hidden information from the overall connections in a graph in the community.
The initial attempt was to complete it in about 2 hours, but I noticed the dataset needed to be parsed carefully from Wikipedia, and I happened to be not good at doing this job, so I put the idea on hold for a couple of days.
In the meantime, another craze, the OpenAI ChatGPT, was announced. As I had been a user of DALL-E 2 already(to generate feature images of my blog posts), I also quickly gave it a try. And I witnessed how other guys(via Twitter, blogs, HackerNews) tried to convince ChatGPT to do so many things that are hard to believe they could do:
- Help to implement a piece of code at any time
- Simulate any prompt interface: shell, python, virtual machine, or even a language you create
- Act out almost any given persona, and chat with you
- Write poetry, rap, prose
- Find a bug in a piece of code
- Explain the meaning of a complex regular expression/Open Cypher Query
ChatGPTβs ability to contextualize and understand has never been greater before, so much so that everyone is talking about a new way of working: how to master asking/convincing/triggering machines to help us do our jobs better and faster.
So, after trying to get ChatGPT to help me write complex graph database query statements, explain the meaning of complex graph query statements, and a large chunk of Bison code, which he/she had done them WELL, I realized: why not let ChatGPT write the code that extracts the data for me?
Grabbing data
I really tried it and the result is⦠good enough.
The whole process was basically like a coding interviewer, or a product manager, presenting my requirements, then ChatGPT giving me the code implementation. I then try to run the code, find the things that donβt make sense in the code, point them out, and give suggestions, and ChatGPT quickly understands the points I point out and makes the appropriate corrections, like:
I wonβt list this whole process, but I will share the generated code and the whole discussion here.
The final generated data is a CSV file.
- Raw version world_cup_squads.csv
- Manually modified, separated columns for birthday and age world_cup_squads_v0.csv
- It contains information/columns of a team, group, number, position, player name, birthday, age, number of international matches played, number of goals scored, and club served.
Team,Group,No.,Pos.,Player,DOB,Age,Caps,Goals,Club
Ecuador,A,1,1GK,HernΓ‘n GalΓndez,(1987-03-30)30 March 1987,35,12,0,Aucas
Ecuador,A,2,2DF,FΓ©lix Torres,(1997-01-11)11 January 1997,25,17,2,Santos Laguna
Ecuador,A,3,2DF,Piero HincapiΓ©,(2002-01-09)9 January 2002,20,21,1,Bayer Leverkusen
Ecuador,A,4,2DF,Robert Arboleda,(1991-10-22)22 October 1991,31,33,2,SΓ£o Paulo
Ecuador,A,5,3MF,JosΓ© Cifuentes,(1999-03-12)12 March 1999,23,11,0,Los Angeles FC
- Final version with header removed world_cup_squads_no_headers.csv
Graph algorithm to predict the FIFA World Cup 2022
With the help of ChatGPT, I could finally try to predict the winner of the game with Graph Magic, before that, I need to map the data into the graph view.
If you donβt care about the process, go to the predicted result directly.
Graph modeling
Prerequisites: This article uses NebulaGraph(Open-Source) and NebulaGraph Explorer(Proprietary), which you can request a free trial of on AWS.
Graph Modeling is the abstraction and representation of real-world information in the form of a βvertex-> edgeβ graph. In our case, we will project the information parsed from Wikipedia as:
Vertices:
- player
- team
- group
- club
Edges:
- groupedin (the team belongs to which group)
- belongto (players belong to the national team)
- serve (players serve in the club)
The age of the players, the number of international caps, and the number of goals scored are naturally fit as properties for the player tag(type of vertex).
The following is a screenshot of this schema in NebulaGraph Explorer (I will just call it Explorer later).
Then, we can click the save icon in the upper right corner and the button: Apply to Space
to create a graph space with the defined schema:
Ingesting into NebulaGraph
With the graph modeling, we can upload the CSV file (the no-header version) into Explorer, by pointing and selecting the vid and properties that map the different columns to the vertices and edges.
Click Import, we then import the whole graph to NebulaGraph, and after it succeeded, we could also get the entire CSV β Nebula Importer configuration file: nebula_importer_config_fifa.yml, so that you reuse it in the future whenever to re-import the same data or share it with others.
Note: Refer to the Import Data Document
After importing, we can view the statistics on the schema view page, showing us that 831 players participated in the 2022 Qatar World Cup, serving in 295 different clubs.
Explore the graph
Letβs see what insights we could get from the information/ knowledge in the form of a graph.
Querying the data
Letβs start by showing all the data and see what we will get.
First, with the help of NebulaGraph Explorer, I simply did drag and drop to draw any vertex type (TAG) and any type of edge between vertex types (TAG), here we know that all the vertices are connected with others, so no isolated vertices will be missed by this query pattern:
Let it generate the query statement for me. Here, it defaults to LIMIT 100
, so let's change it to something larger (LIMIT 10000) and let it execute in the Console.
Initial observation
The result renders out like this, and you can see that it naturally forms a pattern of clusters.
These peripheral clusters are mostly made up of players from clubs that are not traditionally strong ones (now we learned that they could win, though, who knows!). Many of those clubs have only one or two players who serve in one national team or region, so they are somewhat isolated from other clusters.
Graph algorithm-based analysis
After I clicked on the two buttons(Sized by Degrees, Colored by Louvain Algorithm) in Explorer (refer to the document for details), in the browser, we can see that the entire graph has become something like this:
Two graph algorithms are utilized to analyze the insights here.
- change the display size of vertices to highlight importance using their degrees
- using Louvainβs algorithm to distinguish the community of the vertices
You can see that the big red circle is the famous Barcelona, and its players are marked in red, too.
Winner Prediction Algorithm
In order to be able to make full use of the graph magic(with the implied conditions and information on the graph), my idea(stolen/inspired from this post) is to choose a graph algorithm that considers edges for node importance analysis, to find out the vertices that have higher importance, iterate and rank them globally, and thus get the top team rankings.
These methods actually reflect the fact that excellent players have greater community and connectivity at the same time. Meanwhile, to increase the differentiation between traditionally strong teams, I am going to take into account the information on appearances and goals scored.
Ultimately, my algorithm is:
- Take all the
(player)-serve->(club)
relationships and filter them for players with too few goals and too few goals per game (to balance out the disproportionate impact of older players from some weaker teams) - Explore outwards from all filtered players to get national teams
- Run the Betweenness Centrality algorithm on the above subgraph to calculate the node importance scores
Process of the Prediction
First, we take out the subgraph in the pattern of (player)-serve->(club)
for those who have scored more than ten goals and have an average of more than 0.2 goals per game.
MATCH ()-[e]->()
WITH e LIMIT 10000
WITH e AS e WHERE e.goals > 10 AND toFloat(e.goals)/e.caps > 0.2
RETURN e
Note: For convenience, I have included the number of goals and caps as properties in the serve edge, too.
Then, we select all the vertices on the graph in the left toolbar, select the belongto
edge of the outgoing direction, expand the graph outwards (traverse), and select the icon that marks the newly expanded vertices as flags.
Now that we have the final subgraph, we use the graph algorithm function within the browser to execute BNC (Betweenness Centrality):
The graph canvas then looks like this:
Result
Ultimately, we sorted according to the value of Betweenness Centrality to get the final winning team: Brazil! U+1F1E7U+1F1F7, followed by Belgium, Germany, England, France, and Argentina.
It turned out that the result was way too wrongβ¦but worth a try, isnβt it? I hope you enjoy this small experiment!
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI