Ljubljana Reveals Its Secrets
Author(s): Tim Cvetko
Solving Ljubljana's rental prices using AI
Like most students, we faced the issue of finding the right apartment during our studies. Not just the right one, but the optimal one. The question that arose was:
"What is the best possible apartment we could find given our budget?"
For readers unfamiliar with it, these are the sights of the city center of Ljubljana, the Slovenian capital and the country's cornerstone of culture, politics, and education.
Web scraping with lxml in Python
Web scraping a web page involves fetching it and extracting data from it. Once fetched, the content can be parsed, searched, reformatted, and its data copied into a spreadsheet or similar structured format. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else. We scraped the listings and, with a little processing, ended up with a fully usable CSV file.
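A minimal sketch of that scraping step, assuming a hypothetical listing URL and XPath selectors (the real ones live in the linked repository) and a semicolon-delimited CSV:

```python
import csv
import requests
from lxml import html

# Hypothetical URL and XPath selectors -- placeholders for the real ones in the repo.
URL = "https://example.com/ljubljana-rentals"

page = requests.get(URL, timeout=30)
tree = html.fromstring(page.content)

rows = []
for offer in tree.xpath('//div[@class="offer"]'):
    rows.append({
        "description": offer.xpath('.//h2/text()')[0].strip(),
        "size": offer.xpath('.//span[@class="size"]/text()')[0].strip(),
        "price": offer.xpath('.//span[@class="price"]/text()')[0].strip(),
    })

# Write the scraped rows to a semicolon-delimited CSV file.
with open("rentals.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["description", "size", "price"], delimiter=";")
    writer.writeheader()
    writer.writerows(rows)
```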
Full code and data can be found on GitHub.
The data we got had merely 4,000 samples. New offers are uploaded to the website every day, so we came up with a way to keep scraping the site continuously. A bash script on a Raspberry Pi runs every night at 3 AM, automatically scraping the new offers and appending them to the CSV file, while Python logs the number of new rows to a txt file. The Raspberry Pi works independently and is accessible via SSH.
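The nightly job can be as small as the following sketch, launched by cron (for example with an entry like `0 3 * * * python3 scrape.py`); scrape_offers() is a hypothetical helper standing in for the scraping code above, and the file names are illustrative:

```python
import pandas as pd
from datetime import datetime

# Nightly job run by cron on the Raspberry Pi at 3 AM.
existing = pd.read_csv("rentals.csv", delimiter=";")
fresh = scrape_offers()  # hypothetical helper returning a DataFrame with the same columns
combined = pd.concat([existing, fresh]).drop_duplicates()
new_rows = len(combined) - len(existing)

combined.to_csv("rentals.csv", sep=";", index=False)

# Log how many new rows were appended tonight.
with open("scrape_log.txt", "a") as log:
    log.write(f"{datetime.now():%Y-%m-%d %H:%M} - {new_rows} new rows\n")
```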
Data Preprocessing
The data will be handled with pandas. There's quite a lot to be done, so let's cut to the chase. Note that we are using a different delimiter for the read_csv function rather than the default comma. All data types still come in as objects. At this point, we are holding 4,400 data rows in memory. Hopefully, the Raspberry Pi picks up enough data that someday we'll be able to work with a much larger dataset. Let's display the head of the data frame. Keep in mind that some words and phrases are in Slovene. Don't let that bother you.
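Roughly, the loading step looks like this (the semicolon delimiter and file name are assumptions):

```python
import pandas as pd

# The scraped file uses a non-default delimiter instead of a comma.
df = pd.read_csv("rentals.csv", delimiter=";")

print(len(df))    # ~4,400 rows at the time of writing
print(df.dtypes)  # every column still comes in as object
df.head()
```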
Room factor
First up is the room factor. It comes in as a string taken from the flat description and requires some handling. There are 13 different unique inputs, so we have to map the values by hand. Lastly, let's convert the column to a numeric one by calling pandas' to_numeric on the room factor column. The values are now integers and will be of crucial significance to the model.
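For illustration, the mapping can be done with a hand-written dictionary; the column name and the Slovene values shown here are assumptions, and the real column has 13 unique inputs:

```python
import pandas as pd

# A few hand-written mappings from the raw description strings to a room count.
room_map = {
    "garsonjera": 1,  # studio flat
    "2-sobno": 2,     # two-room flat
    "3-sobno": 3,     # three-room flat
}

df["rooms"] = df["rooms"].replace(room_map)
df["rooms"] = pd.to_numeric(df["rooms"])  # the column is now numeric
```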
Location
We knew the obvious parameter for predicting prices in the city was the location. What was not so obvious was how to feed it to the model so that the output would be correlated with it. After quite some time spent resolving Python errors, this is what we came up with.
GeoPy converts the given strings to locations, from which we get the latitude and longitude. We then apply the K-Means clustering algorithm to define 10 clusters and add the K-Means labels, so that each row carries the index of the cluster it belongs to.
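A sketch of that pipeline, assuming a "location" column and hypothetical names for the coordinate and cluster columns:

```python
import pandas as pd
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans

# Geocode each location description to latitude/longitude.
geolocator = Nominatim(user_agent="ljubljana-rentals")

def to_coords(description):
    place = geolocator.geocode(f"{description}, Ljubljana, Slovenia")
    return (place.latitude, place.longitude) if place else (None, None)

df[["lat", "lon"]] = df["location"].apply(to_coords).apply(pd.Series)
df = df.dropna(subset=["lat", "lon"])

# Cluster the coordinates into 10 areas; the cluster label becomes the feature.
kmeans = KMeans(n_clusters=10, random_state=42)
df["area"] = kmeans.fit_predict(df[["lat", "lon"]])
```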
Now we can drop both the latitude and the longitude, as well as the location description, as we won't be needing them anymore.
The size and price columns were simple enough. Using pandas' str accessor, we got rid of the units and converted the columns to numeric once again.
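Something along these lines, with the exact unit strings being assumptions about the raw data:

```python
import pandas as pd

# Strip the units and convert both columns to numbers.
df["size"] = pd.to_numeric(df["size"].str.replace("m2", "", regex=False).str.strip())
df["price"] = pd.to_numeric(df["price"].str.replace("€", "", regex=False).str.strip())
```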
The last two parameters indicate the year of construction and the year of renovation. They have a lot of null values. Using fillna with the "pad" method on the sorted columns fixed that.
As it turns out, another useful thing to do was to subtract the year of construction and renovation from the current year (2020). The result indicates how many years ago the building was constructed or renovated. This, along with normalization, substantially helps the loss metrics.
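Put together, the treatment of the two year columns looks roughly like this (the column names are assumptions):

```python
import pandas as pd

for col in ["year_built", "year_renovated"]:  # assumed column names
    # Pad-fill missing years on the sorted column, then convert to "years ago".
    df[col] = df[col].sort_values().fillna(method="pad")
    df[col] = 2020 - pd.to_numeric(df[col])
    # Normalize to the [0, 1] range.
    df[col] = df[col] / df[col].max()
```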
Splitting the data
We set aside 10% of the data to serve as the test set. The dataset is narrowed down from the original 7 columns to 5. As you'll see later, 10% of the training data is used for validation.
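With scikit-learn this is a one-liner; the target column name is an assumption, and the features are whatever survived the preprocessing above:

```python
from sklearn.model_selection import train_test_split

# Features and target; everything is numeric after the preprocessing above.
X = df.drop(columns=["price"]).values.astype("float32")
y = df["price"].values.astype("float32")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
```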
Hyperparameter tuning
Tuning machine learning hyperparameters is a tedious yet crucial task, as the performance of an algorithm can be highly dependent on the choice of hyperparameters. Manual tuning takes time away from important steps of the machine learning pipeline like feature engineering and interpreting results. Grid and random search are hands-off, but require long run times because they waste time evaluating unpromising areas of the search space. Increasingly, hyperparameter tuning is done by automated methods that aim to find optimal hyperparameters in less time using an informed search, with no manual effort necessary beyond the initial set-up.
This tutorial will focus on the following steps:
- Experiment setup and HParams summary
- Adapting TensorFlow runs to log hyperparameters and metrics
- Starting runs and logging them all under one parent directory
- Visualizing the results in TensorBoard's HParams dashboard
We will tune 4 hyperparameters: the dropout rate, the optimizer along with its learning rate, and the number of units of a specific Dense layer. TensorFlow makes this easy with its TensorBoard plugins. There's a lot of code, but essentially what we're doing is iterating over the intervals of our 4 hyperparameters, and the optimal hyperparameters are stored when the model reaches its best score. We also keep a log of everything, and the TensorBoard callback makes sure all the information is visible in TensorBoard. The run function stores the metrics and keeps a log of every evaluation of the hyperparameters.
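A condensed sketch of that setup using the TensorBoard HParams plugin; the search values are illustrative rather than the exact ones used in the project, it reuses the train/test split from above, and mean absolute error stands in here for the score being tracked:

```python
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

# The four hyperparameters and their (illustrative) search spaces.
HP_UNITS = hp.HParam("num_units", hp.Discrete([32, 64, 128]))
HP_DROPOUT = hp.HParam("dropout", hp.Discrete([0.1, 0.2, 0.3]))
HP_OPTIMIZER = hp.HParam("optimizer", hp.Discrete(["adam", "sgd"]))
HP_LR = hp.HParam("learning_rate", hp.Discrete([0.01, 0.001]))
METRIC_MAE = "mean_absolute_error"

with tf.summary.create_file_writer("logs/hparam_tuning").as_default():
    hp.hparams_config(
        hparams=[HP_UNITS, HP_DROPOUT, HP_OPTIMIZER, HP_LR],
        metrics=[hp.Metric(METRIC_MAE, display_name="MAE")],
    )

def train_test_model(hparams):
    """Build, train, and evaluate one model for a single hyperparameter combination."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hparams[HP_UNITS], activation="relu"),
        tf.keras.layers.Dropout(hparams[HP_DROPOUT]),
        tf.keras.layers.Dense(1),
    ])
    if hparams[HP_OPTIMIZER] == "adam":
        optimizer = tf.keras.optimizers.Adam(learning_rate=hparams[HP_LR])
    else:
        optimizer = tf.keras.optimizers.SGD(learning_rate=hparams[HP_LR])
    model.compile(optimizer=optimizer, loss="mse", metrics=["mae"])
    model.fit(X_train, y_train, epochs=10, validation_split=0.1, verbose=0)
    _, mae = model.evaluate(X_test, y_test, verbose=0)
    return mae

def run(run_dir, hparams):
    """Log the hyperparameters used in one run together with the resulting metric."""
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)
        mae = train_test_model(hparams)
        tf.summary.scalar(METRIC_MAE, mae, step=1)
    return mae
```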
And this is the section where it all comes together: a big for loop iterates over all the values and keeps a log of the model's evaluation for every combination of hyperparameters.
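The loop itself, sketched so that it also tracks the best combination seen so far:

```python
# Iterate over every combination, log each run, and remember the best one.
session_num = 0
best_mae, best_hparams = float("inf"), None

for units in HP_UNITS.domain.values:
    for dropout in HP_DROPOUT.domain.values:
        for optimizer in HP_OPTIMIZER.domain.values:
            for lr in HP_LR.domain.values:
                hparams = {HP_UNITS: units, HP_DROPOUT: dropout,
                           HP_OPTIMIZER: optimizer, HP_LR: lr}
                mae = run(f"logs/hparam_tuning/run-{session_num}", hparams)
                if mae < best_mae:
                    best_mae, best_hparams = mae, hparams
                session_num += 1
```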
The best combination of hyperparameters is stored in a dictionary called hparams, which can be fed straight into a model. If you think about it, the hyperparameter tuning could by itself serve as a complete model. I decided to build the model again with the best hparams, but with a substantially larger number of epochs. There was no need to run hyperparameter tuning with that many epochs, as all we needed were the parameters that give the best result.
According to hyperparameter tuning, the optimal parameters are 64 units, a dropout rate of 0.3, stochastic gradient descent as the optimizer, and a learning rate of 0.001.
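Rebuilding the final model with those values might look like the sketch below; the number of epochs and the layers around the tuned Dense layer are assumptions, and the Huber loss discussed further down is used as the objective:

```python
import tensorflow as tf

# Final model with the winning hyperparameters: 64 units, dropout 0.3, SGD @ 0.001.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
    loss=tf.keras.losses.Huber(),  # see the Huber loss section below
    metrics=["mae"],
)

# Train for substantially more epochs than during tuning; 10% goes to validation.
history = model.fit(X_train, y_train, epochs=500, validation_split=0.1, verbose=0)
```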
Regularization
We utilized two methods of regularization; let's take a brief look at them (a minimal Keras sketch follows the list).
- Dropout is a regularization technique that prevents complex co-adaptations on the training data by randomly dropping units during training. With a dropout rate of 0.2, 20% of the units' outputs are set to 0 at each training step, so those units contribute nothing to that pass.
- L1, or Lasso, regression adds the absolute value of each coefficient's magnitude as a penalty term to the loss. It shrinks the coefficients of the less important features toward zero.
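In Keras, both can be applied directly at the layer level; the layer sizes and the L1 penalty factor below are illustrative, not the project's exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

regularized = tf.keras.Sequential([
    # L1 (Lasso) penalty on this layer's weights.
    layers.Dense(64, activation="relu", kernel_regularizer=regularizers.l1(0.01)),
    # Zero out 20% of the activations at each training step.
    layers.Dropout(0.2),
    layers.Dense(1),
])
```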
The Huber loss is a loss function used in robust regression that is less sensitive to outliers in the data than the squared error loss. The function is quadratic for small values of the residual a and linear for large values.
(Figure source: Wikipedia.)
As you can see, for target = 0, the loss increases as the error increases. However, the speed with which it increases depends on the δ value. In fact, Grover (2019) writes about this as follows: the Huber loss approaches MAE when δ ~ 0 and MSE when δ ~ ∞ (large values).
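In symbols, following the standard definition (with a = y − ŷ being the residual):

```latex
L_\delta(a) =
\begin{cases}
  \frac{1}{2}\,a^{2} & \text{for } |a| \le \delta, \\
  \delta\left(|a| - \frac{1}{2}\,\delta\right) & \text{otherwise,}
\end{cases}
```

so for residuals smaller than δ the loss behaves like MSE, while larger residuals only grow the loss linearly, like MAE.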
Model performance
We can conclude, with 80% accuracy, that our predictions are no more than 30% off. This might seem like a lot, but note that this is our first preliminary model with merely 4,400 data points. We expect that in three months' time, when the data entries should exceed 5,000, our results will be much better.
Conclusion
AI continues to become the go-to tool for neat solutions to everyday problems. In our case, we built an efficient machine learning model that can, to some extent, predict the rental prices of apartments, making the student experience all the more enjoyable.
What started out as a hobby project turned out to be quite a successful TensorFlow model for predicting rental prices in Slovenia's capital.
Connect and read more
Feel free to contact me via any social media. I'd be happy to get your feedback, acknowledgment, or criticism. Let me know what you create!
LinkedIn, Medium, GitHub, Gmail