Working with Spatial Data in Table Storage

Beat Schwegler

Several of our customers needed to correlate and aggregate large volumes of location-tagged data. While many higher-level databases (such as SQL Server or MongoDB) provide rich capabilities for spatial data, most large-scale, cloud-based table stores (e.g. Microsoft Azure Table Storage) don’t provide native support for spatial queries. In this case study, we discuss how we implemented spatial capabilities by mapping entities to unique grid numbers. While our implementation uses Table Storage, the approach is platform independent and can be reused with other storage technologies.

Overview of the Solution

In one of our projects, we needed to overlay additional geo-tagged data on top of an interactive map. For instance, we visualized building construction sites and overlaid them with demographic information and heat maps showing the movement of people during a specific period. With a storage system that supports spatial data, this would be a simple query – just filter for the results that are visible within the current zoom window. However, if the storage technology doesn’t support spatial data, the query is not so simple. Because our data is stored in Azure Table Storage or ingested through a real-time analytics pipeline (using Event Hubs and Stream Analytics), we had to implement our own approach to querying spatial data.

One of the key challenges with spatial data is that each geo-tagged entity is described by at least two properties (latitude and longitude). Retrieving all entities contained within a rectangular boundary therefore requires filtering on both properties and intersecting the two result sets:

[(Entity.Lon ≥ Query.NorthWest.Lon) AND (Entity.Lon ≤ Query.SouthEast.Lon)]
AND
[(Entity.Lat ≥ Query.SouthEast.Lat) AND (Entity.Lat ≤ Query.NorthWest.Lat)]
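Conceptually the filter is simple; here is a minimal, language-agnostic sketch (in Python, with hypothetical names – `in_bounds` and the `(lat, lon)` tuple layout are illustration choices, not part of our implementation):

```python
def in_bounds(entity, north_west, south_east):
    """Return True if an entity's coordinates fall inside the rectangle
    spanned by the north-west and south-east corners. Entities and
    corners are (lat, lon) tuples; the rectangle is assumed not to
    cross the antimeridian."""
    lat, lon = entity
    nw_lat, nw_lon = north_west
    se_lat, se_lon = south_east
    # Longitude grows eastward, latitude grows northward.
    return (nw_lon <= lon <= se_lon) and (se_lat <= lat <= nw_lat)

# Example: a Seattle-area bounding box keeps the first point only.
entities = [(47.61, -122.33), (40.71, -74.00)]
visible = [e for e in entities if in_bounds(e, (48.0, -123.0), (47.0, -122.0))]
# → [(47.61, -122.33)]
```

The hard part is not this predicate, but evaluating it efficiently when the store does not index the two columns – which is the problem the rest of this article addresses.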

This is straightforward if the storage technology indexes all properties (columns); it is more challenging if the columns used for storing spatial data are not indexed. In Azure Table Storage, for instance, only the partition key and the row key are indexed. (That’s why every query should contain at least one of these keys – otherwise it will result in a full table scan.) Furthermore, row keys must be unique within a partition, so only one discrete spatial entity can be stored for a given location. Also, both partition and row keys are strings, so numeric range filters on latitude and longitude aren’t directly supported.

To address these constraints, we decided not to work with latitude and longitude independently, but to create a combined property that refers to a geographical boundary; in our case, a grid. Each set of latitude and longitude values belongs to a discrete grid cell, and each cell is identified by a unique grid number. We simply use this grid number as the partition key. This allows us to efficiently look up all entities that belong to a certain rectangular boundary by filtering the partition key for the required grid identification numbers. Because Table Storage filters are restricted to 15 discrete comparisons, we decided to create a new table for each required grid size. For instance, we store the building construction entities in two tables: one with a grid size of 50 meters, and one with a grid size of 1000 meters.
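The core mapping idea can be sketched in a few lines. The Python below is a deliberately simplified stand-in: it numbers fixed-degree cells, whereas our actual implementation uses fixed-metre cells derived from a UTM projection (see the Geodesy library in the next section). The function name and cell size are illustration choices.

```python
import math

def grid_id(lat, lon, cell_deg=0.01):
    """Map a lat/lon pair to a single integer cell number.
    Cells here are fixed fractions of a degree (a simplification);
    a production scheme would use fixed-metre, projection-based cells."""
    cols = math.ceil(360 / cell_deg)      # cells per row of the global grid
    row = int((lat + 90) / cell_deg)      # row 0 at the south pole
    col = int((lon + 180) / cell_deg)     # column 0 at the antimeridian
    return row * cols + col

# Nearby points share a cell (and hence a partition key); distant points don't.
a = grid_id(47.6101, -122.3301)
b = grid_id(47.6102, -122.3302)
c = grid_id(48.0, -120.0)
# a == b, a != c
```

Storing this single number as the partition key turns the two-column range filter into a lookup on the one indexed column the store actually has.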


To map latitude and longitude to a discrete grid, we used Jürgen Pfeifer’s Geodesy library, which is available on GitHub as well as via NuGet.

Geodesy handles the Universal Transverse Mercator (UTM) projection, which divides the earth into smaller grids that are each mapped to a flat plane. On top of this, Jürgen implemented an algorithm that puts a finer-grained mesh over the projected area of the earth, so that a geo-location can be classified by a discrete, globally unique mesh number – which we use as our grid number.

Using the library is straightforward: We just instantiate the mesh for a specific grid size and retrieve the grid number for each set of latitude and longitude values. This grid number becomes the partition key for the entity to be stored. Below is a walkthrough of the key steps to implement this approach:

  • Add the Geodesy NuGet package
  • Define the using statement for the library:
using Geodesy;
  • Define the spatial entity, which stores the gridId as its PartitionKey:
class SpatialEntity : TableEntity
{
    public SpatialEntity() { }

    public SpatialEntity(long gridId, double lat, double lon)
    {
        this.PartitionKey = gridId.ToString();
        this.RowKey = Guid.NewGuid().ToString();
        this.lat = lat;
        this.lon = lon;
    }

    public double lon { get; set; }
    public double lat { get; set; }
    public string additionalProperty { get; set; }
}
  • Create the mesh(es). In our example, we create one for 50 meters and another one for 1000 meters:
GlobalMesh mesh50m = new GlobalMesh(50);
GlobalMesh mesh1000m = new GlobalMesh(1000);
  • Get the table references for the two tables which are used to store the spatial data:
CloudTableClient tableClient = storageAccount.CreateCloudTableClient();
CloudTable spatialTable50m = tableClient.GetTableReference("spatial50m");
CloudTable spatialTable1000m = tableClient.GetTableReference("spatial1000m");
  • Retrieve the grid ids for the lat/lon pair, one for each grid size:
long gridId50m = mesh50m.MeshNumber(lat, lon);
long gridId1000m = mesh1000m.MeshNumber(lat, lon);
  • Create two spatial entities which will be stored in Table Storage:
var entity50m = new SpatialEntity(gridId50m, lat, lon);
var entity1000m = new SpatialEntity(gridId1000m, lat, lon);
  • Insert the two spatial entities into their respective tables:
spatialTable50m.Execute(TableOperation.Insert(entity50m));
spatialTable1000m.Execute(TableOperation.Insert(entity1000m));
  • Create the table query by filtering the gridId that maps to the requested latitude and longitude:
var spatialQuery50m = new TableQuery<SpatialEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal,
        mesh50m.MeshNumber(lat, lon).ToString()));
var spatialQuery1000m = new TableQuery<SpatialEntity>().Where(
    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal,
        mesh1000m.MeshNumber(lat, lon).ToString()));
  • Query the entities from their respective tables:
var entities50m = spatialTable50m.ExecuteQuery(spatialQuery50m);
var entities1000m = spatialTable1000m.ExecuteQuery(spatialQuery1000m);
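When the visible rectangle spans several grid cells, the query ORs one PartitionKey comparison per cell – which is exactly where the 15-comparison limit mentioned earlier bites, and why coarser-grained tables exist. A sketch of the filter construction (in Python for brevity; `covering_filter` is a hypothetical helper, but the `PartitionKey eq '…'` syntax is standard Table Storage OData):

```python
def covering_filter(grid_ids):
    """Build an OData-style Table Storage filter that ORs one
    PartitionKey comparison per grid id. Azure Table Storage caps a
    filter at 15 discrete comparisons, so a rectangle covering more
    cells should be queried against a coarser-grained table instead
    (e.g. the 1000 m table rather than the 50 m one)."""
    if len(grid_ids) > 15:
        raise ValueError("too many grid cells; use a coarser-grained table")
    return " or ".join(f"PartitionKey eq '{g}'" for g in grid_ids)

print(covering_filter([101, 102, 103]))
# → PartitionKey eq '101' or PartitionKey eq '102' or PartitionKey eq '103'
```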

If you’re not developing in .NET, an alternative library that takes a slightly less accurate approach is available, with source code in C#, JavaScript, and Python.


This approach works very well for rectangle-based spatial queries. If you need to query polygons or other geometric shapes, a two-step approach is required:

  1. Calculate the bounding rectangle for the polygon and query all entities that belong to grids that are contained within, or intersect with, the bounding rectangle
  2. Run a second local geometric function to select the entities that actually belong to the polygon
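The second step is a standard computational-geometry routine. A self-contained sketch of one common choice, the ray-casting (even-odd) test, in Python (the article does not prescribe a particular algorithm; this is one reasonable option):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how often a horizontal ray from (x, y)
    crosses the polygon's edges; an odd count means the point is inside.
    `polygon` is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the ray's height
            # x-coordinate where the edge crosses that height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

triangle = [(0, 0), (10, 0), (5, 10)]
point_in_polygon(5, 3, triangle)   # → True
point_in_polygon(0, 9, triangle)   # → False
```

This runs locally over the (small) candidate set returned by the grid query, so its cost is independent of the total table size.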

To simplify the geometric functions, it can be useful to project latitude and longitude onto a Mercator map, transforming them into X and Y coordinates in meters. X and Y are then stored as part of the entity, in addition to latitude and longitude. This can easily be done using one of the projections provided by Geodesy:

SphericalMercatorProjection geoTransform = new SphericalMercatorProjection();
double x = geoTransform.LongitudeToX(lon);
double y = geoTransform.LatitudeToY(lat);
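The spherical Mercator projection itself is just two closed-form formulas. The Geodesy snippet above is C#; the math behind it, sketched self-contained in Python (assuming, as the common spherical Mercator convention does, the WGS84 equatorial radius):

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 equatorial radius, in meters

def lon_to_x(lon_deg):
    """Project longitude onto a spherical Mercator x coordinate in meters."""
    return EARTH_RADIUS_M * math.radians(lon_deg)

def lat_to_y(lat_deg):
    """Project latitude onto a spherical Mercator y coordinate in meters."""
    return EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))

# The prime meridian maps to x = 0; the antimeridian to about 20,037,508 m,
# half the earth's equatorial circumference.
lon_to_x(0.0)
lon_to_x(180.0)
```

With X and Y in meters, distance and containment checks become plain planar geometry instead of spherical trigonometry.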

Opportunities for Reuse

This approach can be reused across a wide range of storage technologies, even those without native spatial capabilities. And even when there is no data to store, the Geodesy library is a great help for spatial computations.

