Geocode Millions of Locations Without Being Sued

Last Updated on January 6, 2023 by Editorial Team

Author(s): Paul Kinsvater

Originally published on Towards AI.

With GeoBatchPy and geospatial analytics in mind

Exponential smoothing (ETS) is a technique widely used in the time series forecasting community that down-weights signals from increasingly distant times. Similar methods are standard in predictive models outside time series forecasting whenever time is part of the problem. Do you want to predict customer churn? Try a weighted average of past purchase amounts.

Signals from increasingly distant times decay in importance. This is reflected in a churn prediction model using a weighted average of past purchase amounts as a predictor. Image created by the author using Excalidraw.
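To make the idea concrete, here is a minimal sketch of such a recency-weighted average. The function name, the decay parameter alpha, and the purchase amounts are our own illustrative assumptions, not part of any forecasting library.

```python
def exp_weighted_average(values, alpha=0.5):
    """Weighted average of `values`, ordered oldest to newest.

    The newest value gets weight 1, the one before it (1 - alpha),
    then (1 - alpha)**2, and so on. Weights are normalized to sum to 1.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    n = len(values)
    weights = [(1 - alpha) ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical purchase amounts of one customer, oldest first.
purchases = [100.0, 80.0, 20.0]
recency_weighted = exp_weighted_average(purchases, alpha=0.5)
print(round(recency_weighted, 2))
```

A larger alpha forgets the past faster. With alpha=1, only the most recent purchase counts.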

The geospatial equivalent of temporal weights is a spatial weight matrix. The same idea applies: compute a weighted average, with weights depending on spatial distance on a two-dimensional coordinate system. But, unlike with time, distance in space requires significant preliminary work. Most likely, you start with (structured) address texts as your location records. So, how do we place them on a coordinate system?

Below we will use the longitude-latitude coordinate system. The transition from texts to longitude-latitude is called geocoding. There are dozens, if not hundreds, of geocoding service providers. But the choice is risky for cost and legal reasons.
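Once locations have longitude and latitude, a distance between any two of them can be computed. The sketch below uses the haversine great-circle formula as one common choice. The text does not prescribe a distance metric, so treat the function, the Earth radius constant, and the approximate city coordinates as illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # min() guards against rounding slightly above 1 for near-antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate (longitude, latitude) of Berlin and Munich.
berlin = (13.405, 52.52)
munich = (11.582, 48.135)
distance = haversine_km(*berlin, *munich)
print(f"{distance:.0f} km")
```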

Read the license if you don’t want to be banned or sued

Be careful when picking a geo service provider. Google Maps and Mapbox are two prominent yet poor choices for our analytics use cases due to their restrictive licenses: you are not allowed to store and redistribute results. For example, Google has created mechanisms to detect abuse of its APIs, which, in the best case, results in a ban on your account. So, storing millions of geocodes for later analytics is not allowed.

Fortunately, a startup called Geoapify fills this niche with bravura. It is not just about legal reasons. It is also about costs, comfort to use (batch processing), and good integration into open data and open source standards.

Why Geoapify for geospatial analytics

In short, this is why we chose Geoapify:

  • We need to geocode millions of location records without going bust. Geoapify offers batch geocoding at a 50% discount.
  • Geoapify uses OpenStreetMap and other open data sources with a friendly license so that we can store results for analytics. As a side effect, we can link our internal location records with a rich ecosystem of open data sources like Wikidata.
  • Geoapify is not just about geocoding. It also offers place details, isolines, travel distances, and more, with most services covered by their Batch API at a 50% discount.

Start at zero cost today using Geoapify’s services, including commercial use. Their free tier allows batch geocoding of 6k addresses per day. Sign up at geoapify.com and generate your API key in no time.

Tutorial: GeoBatchPy for batch geocoding and more

We love Python and the command line. When we started, no Geoapify API client fulfilled our needs. So, we created GeoBatchPy.

GeoBatchPy is a Python client for the Geoapify API. It comes with a command line interface for their Batch API, which shines when you need to process large numbers of locations. You can install the latest release from PyPI with pip install geobatchpy. But we recommend creating a new Conda environment covering GeoPandas and PySAL if you want to follow along with the tutorial below using your data.

This tutorial shows how to integrate GeoBatchPy into a geospatial analytics workflow, starting from simple address records, followed by batch geocoding, to computing spatial weight matrices. We conclude with a simple analytics use case.

Part 1: data preprocessing

Our dataset for this tutorial consists of 1081 sports stadiums, mainly across Germany, Belgium, and the Netherlands. We generated the data using Geoapify’s Places API.

Image created by the author.

Geoapify’s geocoding service accepts free text search and structured input, the latter being helpful only if we have a lot of faith in our data quality. I have seen too many data quality issues in real-world structured address records, so my conclusion is to go for the free text search. Here, we parse the structured data into one string per row.

Image created by the author.
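The parsing step can be sketched in plain Python as follows. The field names (name, street, housenumber, postcode, city, country) and the grouping are our assumptions about a typical structured address record. Adapt them to the columns in your own data.

```python
# Fields that belong together are joined by a space, groups by a comma.
# These field names are assumptions, not a fixed schema.
ADDRESS_GROUPS = (
    ("name",),
    ("street", "housenumber"),
    ("postcode", "city"),
    ("country",),
)


def to_address_line(record):
    """Join the non-empty parts of a structured address into one free-text string."""
    chunks = []
    for group in ADDRESS_GROUPS:
        words = [str(record[f]).strip() for f in group if record.get(f) is not None]
        words = [w for w in words if w]  # drop fields that were only whitespace
        if words:
            chunks.append(" ".join(words))
    return ", ".join(chunks)


# Illustrative record; missing or empty fields are simply skipped.
stadium = {
    "name": "Olympiastadion",
    "street": "Olympischer Platz",
    "housenumber": "3",
    "postcode": "14053",
    "city": "Berlin",
    "country": "Germany",
}
print(to_address_line(stadium))
```

Skipping empty fields rather than failing on them is deliberate: the point of free text search is to stay robust against incomplete structured records.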

Part 2: geocoding

It is time to geocode our addresses. You can do this using our Python API, but we prefer the command line. First, we prepare the input file using Python.

Now we switch to the CLI. To make the following two commands work, you need to either set your GEOAPIFY_KEY environment variable or add the option --api-key <your-key> to the end of every geobatch command. First, we submit jobs to Geoapify servers.

The output of the first step, tutorial-geocode-urls.json, is the input for the next step, which retrieves the results.

Processing our requests takes time, depending on the request size, the subscription plan, and how busy Geoapify servers are.
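The two CLI calls can be assembled as in the sketch below. It rests on assumptions: the subcommand layout (geobatch submit geocode, then geobatch receive) is how we recall the GeoBatchPy CLI, so verify it with geobatch --help. Only tutorial-geocode-urls.json comes from the text above. The input and results file names are hypothetical. The helper only builds and prints the commands and does not contact Geoapify.

```python
import shlex


def geobatch_commands(path_input, path_urls, path_results, api_key=None):
    """Build the submit and receive command lines as argument lists.

    Assumed CLI layout (please verify with `geobatch --help`):
      geobatch submit geocode <input> <urls>
      geobatch receive <urls> <results>
    """
    submit = ["geobatch", "submit", "geocode", path_input, path_urls]
    receive = ["geobatch", "receive", path_urls, path_results]
    if api_key is not None:
        # Only needed when the GEOAPIFY_KEY environment variable is not set.
        submit += ["--api-key", api_key]
        receive += ["--api-key", api_key]
    return submit, receive


submit_cmd, receive_cmd = geobatch_commands(
    "tutorial-geocode-input.json",    # hypothetical file name
    "tutorial-geocode-urls.json",     # name used in the text
    "tutorial-geocode-results.json",  # hypothetical file name
)
for cmd in (submit_cmd, receive_cmd):
    print(shlex.join(cmd))
```

The argument lists can be passed to subprocess.run as they are, or the printed lines can be pasted into a shell.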

We convert the results into a simplified list of GeoJSON-like Python dictionaries.

Image created by the author.

GeoPandas helps us transform the data into a tabular format. The method parses the geometry into a Shapely geometric object, puts all properties into separate columns, and ignores the rest. We also set the coordinate reference system (CRS) to 'EPSG:4326', meaning that the geometries' tuples are interpreted as longitude and latitude.
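To illustrate what such a conversion does, here is a plain-Python sketch that flattens GeoJSON-like features into table rows. It is not GeoPandas itself, and the sample feature is made up. In GeoPandas, the constructor GeoDataFrame.from_features(features, crs='EPSG:4326') matches the description above, to our knowledge.

```python
def features_to_rows(features):
    """Flatten GeoJSON-like point features into one dict per row.

    Every property becomes a column, the geometry becomes a
    (longitude, latitude) tuple, and all other keys are ignored.
    """
    rows = []
    for feature in features:
        lon, lat = feature["geometry"]["coordinates"][:2]
        row = dict(feature.get("properties", {}))
        row["geometry"] = (lon, lat)
        rows.append(row)
    return rows


# Made-up example feature; GeoJSON stores coordinates as [longitude, latitude].
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [13.2395, 52.5147]},
        "properties": {"name": "Olympiastadion", "country_code": "de"},
        "bbox": [13.23, 52.51, 13.25, 52.52],
    }
]
rows = features_to_rows(features)
print(rows[0])
```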

Image created by the author.

Part 3: spatial weight matrices

We are going to use PySAL and its distance-based approach to compute a spatial weight matrix. PySAL comes with several methods, each with its own requirements. The distance-based method accepts our data frame of geocodes as input and computes weights which, by default, decay linearly with increasing distance.

For any given location L0 we identify a set of its closest neighbors and compute weights as a function of distance. A spatial weights matrix consists of all weights for all locations L0, organized row-wise with the target location L0 weight on the diagonal. We set the diagonal weights to zero for our analytics purpose. Image created by the author with Excalidraw.

We apply three changes to the default behavior:

  • Parameters fixed=False and k=10 result in variable strength of decay per target location. This way the number of non-zero weights is k=10 in every location’s neighborhood.
  • We set weights on the diagonal to 0. That excludes every target location from its set of neighbors with non-zero weights. This will be relevant to our use case.
  • Setting attribute transform='R' normalizes weights for any given target location so that their sum equals 1.

A summary of neighbors and corresponding weights for location 0, where the weight is positive. By construction, every set of weights consists of k=10 neighbors, and the sum of every set of weights equals 1. Image created by the author.
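The three choices above can be mimicked in a few lines of plain Python. This is a sketch under our own assumptions, not PySAL's implementation: it uses a linearly decaying (triangular) kernel, a per-location bandwidth just above the distance to the k-th nearest neighbor, and straight-line distance on already projected coordinates.

```python
import math


def adaptive_kernel_weights(points, k=2, eps=1e-7):
    """Row-normalized, linearly decaying weights over each point's k nearest neighbors.

    Returns {i: {j: weight}}. A point is never its own neighbor, which
    corresponds to a zero diagonal in the weight matrix.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    weights = {}
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        # Bandwidth slightly beyond the k-th neighbor keeps its weight positive.
        bandwidth = nearest[-1][0] * (1 + eps)
        if bandwidth == 0:
            raw = {j: 1.0 for _, j in nearest}  # all neighbors coincide with p
        else:
            raw = {j: 1 - d / bandwidth for d, j in nearest}
        total = sum(raw.values())
        weights[i] = {j: w / total for j, w in raw.items()}  # each row sums to 1
    return weights


# Toy locations on a line, in arbitrary projected units.
stadiums = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
weights = adaptive_kernel_weights(stadiums, k=2)
print(weights[0])
```

Note that the farthest of the k neighbors sits almost exactly at the bandwidth, so its weight is positive but tiny. Most of the mass goes to closer neighbors.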

Part 4: a simple analytics use case

Say we want to predict a location’s property price per square meter from available prices in its neighborhood. We compute a weighted average reusing our spatial weight matrix from the previous section.

Image created by the author.
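A minimal sketch of this weighted average follows, with made-up weights and prices. PySAL offers this operation as a spatial lag (for example libpysal.weights.lag_spatial). The version below is our own plain-Python stand-in, which also renormalizes when some neighbors have no known price.

```python
def neighborhood_average(weights_row, prices):
    """Weighted average of neighbor prices for one target location.

    weights_row maps neighbor id -> weight, prices maps location id -> price.
    Neighbors without a known price are skipped and the remaining weights
    are renormalized. Returns None if no neighbor has a known price.
    """
    known = {j: w for j, w in weights_row.items() if j in prices}
    total = sum(known.values())
    if total == 0:
        return None
    return sum(w * prices[j] for j, w in known.items()) / total


# Made-up prices per square meter for locations 0 to 2.
prices = {0: 3000.0, 1: 4000.0, 2: 5000.0}

# Location 0 with neighbors 1 and 2; its own price is not used.
estimate_known = neighborhood_average({1: 0.75, 2: 0.25}, prices)

# A new location whose neighbor 4 has no known price.
estimate_new = neighborhood_average({0: 0.5, 1: 0.25, 4: 0.25}, prices)
print(estimate_known, round(estimate_new, 2))
```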

The next plot indicates that the weighted average alone is an unbiased predictor of the price.

A scatterplot of known location prices and weighted averages from their neighborhoods. The red line is the diagonal Price = WeightedAverage. Image created by the author.

Now it becomes obvious why we set the diagonal weights to 0, excluding the target from its neighborhood weights. This way, we can predict the price of any new location with the weighted average of its neighbors with known prices.

In a more realistic scenario, you would want to consider more than neighboring prices in your prediction model to reflect significant variations between nearby locations. For example, two neighboring locations can be priced very differently if one is directly exposed to a lot of traffic noise. The weighted average can then be used as one of many predictors in a regression model fitted to our data.

Conclusion and outlook

Most (or all?) businesses process address data for day-to-day operational purposes: invoicing, delivery of goods, customer visits, etc. Internal location data very quickly amount to thousands, if not hundreds of thousands, of location records. Utilizing that same data for geospatial analytics usually relies on significant preparation work, like geocoding. We show how to avoid unnecessary expenses and legal risks using Geoapify and our Python package, with just a few lines of code. Practically every analytics project has the potential to benefit from a geospatial dimension.

We motivate spatial weights by starting with the temporal equivalent, which finds wide adoption in the analytics space. It is not just about one or the other. Models can combine time and space to account for both. First, down-weight individual signals by their temporal distance. Second, spatially combine those temporally down-weighted signals. For example, when we study customer churn, we can compute average loyalty in every customer’s neighborhood. We identify the date and location of every churner from the past; those are the (binary) signals. And we compute weighted averages taking into account the number of active customers for every current customer’s neighborhood, weighted by distance in time and space. This indicates, for example, if a customer lives in an area of recently increased local competition, something we would want to act on quickly.
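One way to sketch this combination is to multiply a spatial kernel by a temporal decay. Everything in the sketch below is an illustrative assumption on our side: the linear spatial kernel, the half-life decay, the parameter values, and the definition of the churn signal as a weighted share of churners.

```python
def space_time_weight(distance_km, age_days, bandwidth_km=5.0, half_life_days=90.0):
    """Weight that decays linearly in space and exponentially in time."""
    spatial = max(0.0, 1.0 - distance_km / bandwidth_km)
    temporal = 0.5 ** (age_days / half_life_days)
    return spatial * temporal


def local_churn_signal(churn_events, active_distances):
    """Weighted share of churners around one customer.

    churn_events: (distance_km, age_days) of past churners near the customer.
    active_distances: distance_km of currently active customers nearby.
    """
    churn_mass = sum(space_time_weight(d, age) for d, age in churn_events)
    active_mass = sum(space_time_weight(d, 0.0) for d in active_distances)
    total = churn_mass + active_mass
    return churn_mass / total if total > 0 else 0.0


# Made-up neighborhood: two past churners and three active customers.
signal = local_churn_signal(
    churn_events=[(1.0, 0.0), (2.5, 90.0)],
    active_distances=[0.0, 2.5, 10.0],
)
print(round(signal, 3))
```

A recent churner close by moves the signal more than an old or distant one, which is the behavior described above.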


Geocode Millions of Locations Without Being Sued was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
