Symbolic Regression: When Regression Took It Seriously!
Last Updated on June 4, 2024 by Editorial Team
Author(s): Abhijith S Babu
Originally published on Towards AI.
In the grand mission to unravel the mysteries of the universe, humans have always sought patterns in vast abundances of data. Going back a few centuries, Johannes Kepler, the brilliant mind behind Kepler's laws of planetary motion, was not armed with a computer that could run a genetic algorithm to find the patterns in his data. Kepler relied on sheer mathematical brilliance: he fixed his eyes and mind on a huge dataset and unveiled the relationship between a planet's orbital period and radius. His method was simple: immerse himself in the data, trust his intuition, and let the patterns emerge. Scientists have worked this way throughout the history of science. But do not mistake simplicity for ease.
Scientists who used this method in the past had to pour enormous effort and dedication into their discoveries. In the modern world, where data grows in both size and dimensionality, uncovering complex patterns becomes ever more daunting. This is where symbolic regression comes in: a modern computational marvel that eases the pain in the pursuit of patterns.
Symbolic regression stands apart from classic regression by offering greater flexibility in identifying patterns within data. A traditional regression model looks for one specific kind of pattern: a model designed to find linear relationships cannot accurately fit data in which the variables are related exponentially. Symbolic regression thrives on this diversity. It is closely related to genetic algorithms; in fact, it is a modification of the evolutionary algorithm.
A normal evolutionary algorithm goes as follows:
- We have a population of individuals (mathematical functions), and a fitness function (accuracy of the function on our data)
- We randomly select an n-sized subset of the population
- We calculate the fitness function of all the individuals in the subset
- We select one individual from the subset. Selection is biased so that the fittest individual is the most likely to be chosen, but there is still a chance that another individual is selected (the higher the fitness, the higher the probability of selection)
- We create a copy of this selected individual and do a random mutation on it (randomly change an operation in the mathematical function)
- We replace the weakest member of the subset with this mutated copy.
These steps are repeated until the population reaches a high average fitness.
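The steps above can be sketched in a few lines of Python. Everything here is an illustrative toy, not from any real symbolic-regression library: individuals are drawn from a small hand-written candidate set instead of being expression trees, `mutate` simply jumps to a random candidate rather than editing an operation, and the 0.8 selection bias is an arbitrary choice.

```python
import random

random.seed(0)

# Target relationship we are trying to rediscover: y = x**2 + x
def target(x):
    return x ** 2 + x

DATA = [(x, target(x)) for x in range(-5, 6)]

# Candidate "mathematical functions", stored as (label, callable) pairs.
CANDIDATES = [
    ("x",        lambda x: x),
    ("2*x",      lambda x: 2 * x),
    ("x**2",     lambda x: x ** 2),
    ("x**2 + x", lambda x: x ** 2 + x),
    ("x**3",     lambda x: x ** 3),
]

def fitness(individual):
    """Negative mean squared error on DATA (higher is fitter)."""
    _, f = individual
    return -sum((f(x) - y) ** 2 for x, y in DATA) / len(DATA)

def mutate(individual):
    """Stand-in for a structural mutation: jump to a random candidate."""
    return random.choice(CANDIDATES)

def evolve_step(population, n=3):
    # 1. Randomly sample an n-sized subset of the population.
    subset = random.sample(range(len(population)), n)
    # 2. Rank the subset by fitness (ascending).
    ranked = sorted(subset, key=lambda i: fitness(population[i]))
    # 3. Usually pick the fittest, but sometimes pick another member.
    chosen = ranked[-1] if random.random() < 0.8 else random.choice(ranked)
    # 4. Copy the chosen individual and mutate the copy.
    child = mutate(population[chosen])
    # 5. Replace the weakest member of the subset with the mutated copy.
    population[ranked[0]] = child

population = [random.choice(CANDIDATES) for _ in range(10)]
for _ in range(200):
    evolve_step(population)

best = max(population, key=fitness)
print(best[0], fitness(best))
```

After a couple of hundred steps the population is dominated by the exact expression, whose fitness (negative MSE) is zero.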
Crossover between individuals is not performed in the algorithm above. You can introduce it by selecting more than one individual from the subset and crossing them over before mutation.
Symbolic regression modifies this evolutionary algorithm to improve its output. One important modification is age-regularization, which ushers in a chronological method of regeneration: instead of replacing the weakest member of the subset with the newly formed individual, we replace the oldest member, making way for a new generation. This change can have a profound effect on the evolutionary journey of the population. By maintaining genetic diversity, age-regularization prevents the population from converging too early and stagnating at a local maximum of the fitness function, encouraging a more robust search of the solution landscape.
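A minimal sketch of what age-regularized replacement changes: each individual carries a birth step, and on replacement we evict the oldest member of the sampled subset rather than the weakest. The dictionary layout and function name here are illustrative assumptions, not part of any library API.

```python
# Individuals carry a "born" timestamp alongside their expression.
population = [
    {"expr": "x",     "born": 0},
    {"expr": "x+1",   "born": 2},
    {"expr": "x*x",   "born": 1},
    {"expr": "2*x",   "born": 3},
]

def replace_oldest(population, subset_indices, child_expr, step):
    """Age-regularization: evict by age, not by fitness."""
    oldest = min(subset_indices, key=lambda i: population[i]["born"])
    population[oldest] = {"expr": child_expr, "born": step}
    return oldest

# Replace within the subset {0, 2, 3}; index 0 ("x", born at step 0) goes.
evicted = replace_oldest(population, [0, 2, 3], "x*x + x", step=5)
print(evicted, population[evicted])  # 0 {'expr': 'x*x + x', 'born': 5}
```

The rest of the loop is unchanged; only the replacement rule differs from the weakest-member version.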
Symbolic regression also uses a new variable called temperature, which controls the mutations in the algorithm. Based on the temperature, the algorithm can reject a mutation when the fitness of the mutated individual is lower than that of the original; this process is called simulated annealing. The temperature can also be coupled with the probability of choosing an individual from the population. This variable gives us fine control over how the search converges or diverges, and simulated annealing has been experimentally shown to speed up the search.
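One common way to write such a temperature-controlled acceptance rule is the Metropolis criterion: improvements are always kept, while worse mutations survive with a probability that shrinks as the temperature drops. The function name and exact formula below are illustrative choices, not taken from a specific symbolic-regression implementation.

```python
import math
import random

def accept_mutation(old_fitness, new_fitness, temperature, rng=random.random):
    """Decide whether to keep a mutated individual (higher fitness is better)."""
    if new_fitness >= old_fitness:
        return True   # improvements are always accepted
    if temperature <= 0:
        return False  # at zero temperature, reject every regression
    # Worse mutations survive with probability exp(delta / T), delta < 0.
    return rng() < math.exp((new_fitness - old_fitness) / temperature)

# The same slightly-worse mutation (fitness -1.0 -> -1.2) at two temperatures,
# with the random draw pinned to 0.5 for reproducibility:
print(accept_mutation(-1.0, -1.2, temperature=5.0,  rng=lambda: 0.5))  # True
print(accept_mutation(-1.0, -1.2, temperature=0.01, rng=lambda: 0.5))  # False
```

At high temperature the search explores freely; as the temperature is lowered, the expression is steadily forced to converge.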
In symbolic regression, the genetic algorithm goes through an evolve-simplify-optimize loop: after every round of evolution, the mathematical expressions are simplified and optimized. Simplification and optimization reduce the complexity of the expressions, but they are introduced only after several iterations of pure evolution, to avoid losing important individuals.
Suppose the equation we are searching for is (x*y) − (x/y). At some step, we reach (x*y) − (x*y): we are one correct mutation away from the solution. But if we simplify this expression, it collapses to 0, and we can no longer reach the target. Simplifying only occasionally keeps such redundant but useful expressions around while still reducing complexity.
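This pitfall is easy to reproduce with a toy expression tree. Below, expressions are nested tuples and the only rewrite rule is e − e → 0; the representation and the `simplify` function are illustrative, not a real computer-algebra system.

```python
def simplify(expr):
    """Recursively apply the single rule: e - e -> 0."""
    if isinstance(expr, tuple):
        op, left, right = expr
        left, right = simplify(left), simplify(right)
        if op == "-" and left == right:
            return 0  # the whole subtree collapses, structure is lost
        return (op, left, right)
    return expr

# One mutation away from the target ("-", ("*","x","y"), ("/","x","y")):
candidate = ("-", ("*", "x", "y"), ("*", "x", "y"))

print(simplify(candidate))  # 0
```

After simplification there is no second operand left to mutate from `*` to `/`, so the target can no longer be reached in one step, which is exactly why simplification is applied only occasionally.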
Symbolic regression is like the maverick scientist of the regression world, shaking things up with its cool genetic algorithm vibes. It's the rebel that doesn't settle for a "close enough" mathematical model, but instead crafts an actual, tangible model that scientific researchers can hold onto.
While traditional regression models might come close to fitting the data, they can sometimes play havoc with related theories and derivations. Symbolic regression, on the other hand, is the renegade that refuses to compromise, delivering precise and impactful results for the scientific community.