Spatialite: Speedup your query with spatial indexing

Spatialite, like any good spatial data management system, can build a spatial index for your layers. Using this index in your spatial queries will dramatically shorten the runtime for that query. The latest version of Spatialite offers a nice compact format for using a spatial index. To demonstrate, I created a point layer of 20,000 theoretical store locations, with sales data for each store, and a polygon layer of over 280 “local councils”. My mission it to sum up the total sales in each local council. So I need to find which stores are located in each local council and aggregate sales for those stores.

We begin with two tables, councils (multipolygon) and stores (point) that look like this:

spatialite> SELECT * FROM councils LIMIT 10;
pk_uid Name Geometry
---------- ---------- ----------
1 Ronneby
2 Sölvesbor
3 Ă„lvdalen
....

spatialite> SELECT * FROM stores LIMIT 10;
pk_uid sales Geometry
---------- ---------------- ----------
1 192617.423921806
2 75489.1390022072
3 163013.639152632
....

Let’s build the query. First, we are using both tables, so the FROM part of our SQL looks like:
FROM councils AS c, stores AS s
and since we want to aggregate total sales for each council, we require a GROUP BY clause:
GROUP BY c.pk_uid
What results do we want? The council name and total sales, so the SELECT part of our query will be:
SELECT c.Name AS Name, SUM(s.sales) AS "Total Sales"
Finally, we formulate the criteria, stores contained in each council, like so:
WHERE ST_Contains(c.Geometry, s.Geometry)

Putting it all together, our “naive” query (no spatial index yet) goes:
SELECT c.Name AS Name, SUM(s.sales) AS "Total Sales"
FROM councils AS c, stores AS s
WHERE ST_Contains(c.Geometry, s.Geometry)
GROUP BY c.pk_uid;

Now this query will run successfully on our two tables as described, but you better go for lunch when you start the query. It might finish on a fast computer by the time you get back… The query loops thru all 20,000 store locations for each of the 280 councils. So that’s over 5,000,000 table scans, and checks for each point in each polygon.

So how do we utilize spatial indexing to speed things up? The R*Tree indexing and its implementation in Spatialite are covered in the Spatialite Cookbook. It’s important to note that Spatialite does NOT make use of a spatial index unless you explicitly incorporate it into the query statement. (BTW, this is unlike PostGIS which automatically uses a spatial index, if it’s available.)
The options to implement a spatial index in a query structure were clearly explained in this thread from the spatialite maillist.

First we create a spatial index for both tables:
spatialite> SELECT CreateSpatialIndex('councils','Geometry');
1
spatialite> SELECT CreateSpatialIndex('stores','Geometry');
1

Now the additional WHERE clause, in it’s new incarnation (requires spatialite >=3.0) looks like this for our query:
s.ROWID IN
(SELECT ROWID FROM SpatialIndex WHERE f_table_name='stores' AND search_frame=c.Geometry)

“SpatialIndex” is a new virtual table, and using the search_frame component of that table we filter out the store locations in the bounding box (AKA Mbr=Minimum Bounding Rectangle). Searching for points in a rectangle is much faster than the full ST_Contains (or equivalent ST_Within) function. After this filtering, the ST_Contains() function runs on only a small subset of the 20,000 store locations.

So the spatial index enabled query will now look like:
SELECT c.Name AS Name, SUM(s.sales) AS "Total Sales"
FROM councils AS c, stores AS s
WHERE ST_Contains(c.Geometry, s.Geometry) AND
s.ROWID IN
(SELECT ROWID FROM SpatialIndex WHERE f_table_name='stores' AND search_frame=c.Geometry)
GROUP BY c.pk_uid;

On my machine, this query, aided by the spatial index, completed in less than 1.5 minutes. The first “naive” query took over 35 minutes. Pretty cool, no?

7 thoughts on “Spatialite: Speedup your query with spatial indexing

  1. Micha,

    Can you make the test data set available?

    I’m quite concerned that it takes 90 seconds for this query. We should be able to do quite a lot faster than that, but perhaps there is something special about the data…

  2. I would really appreciate if you could somehow elaborate on how are we supposed to display geometries that are retrieved as WKT through queries. And also i need to add the display to the android mapview from Esri’s API. Any kind of help would be a miracle. Thank You

  3. No miracles here…
    From your question I understand that you have a table of data that includes a column with a WKT representation of the geometry. And you want to display the data in a GIS program. Is that the problem?
    If so, then in spatialite, you would import that table including the WKT column into a spatialite database. Then you have two steps to perform:
    1- Use the function AddGeometryColumn() with all its parameters to make the table spatial.
    2- Now use the function GeomFromText() to update the spatialite “geometry” column from the WKT column.

    Suppose you have a table of “Points Of Interest” named “poi”, with a WKT formatted column called “wkt”. It might go something like this:
    SELECT AddGeometryColumn('poi','geometry',4326,'POINT',2);
    UPDATE poi SET geometry=GeomFromText(wkt);

    Now the spatialite database can be loaded into QGIS, for example, and the poi layer will be available to display.

    As for android mapview, and ESRI’s API, I have no idea.

  4. Hi Micha,

    I’m trying to use the spatial index table as you explained but when I select row from SpatialIndex table, nothing is returned. Do you have any idea why the virtual table is empty ? Do I need to do something to update it ?
    I’m using spatialite 4.0.0 (and I tried with 4.1.1 with the same result).

    Thank a lot for your help.

  5. Hi Flavien
    Can you give us some idea of the structure of your tables, and what SELECT statement you used when nothing was returned?

  6. Hi Micha,

    I have a layer of Points (containing 7200 Nodes representing intersections in a routing network) and a layer of Polygons (1570 areas representing earthquake damages in each area, so there is a column “damage”). I want to associate the damage value to each intersection.
    My first query was :
    SELECT node.id, damages.damage
    FROM nodes , damages
    where CONTAINS(damages.the_geom, node.the_geom)

    Using no spatial index, it was very very very… very slow. So I search for optimization and find your article explaining how to use spatial indexes. The problem is that the virtual table SpatialIndex is empty. I tried to recreate the table, to add the geometry columns and spatial indexes but in vain. I’m wondering right now if I missed something ?

    I found an alternative solution that works well :
    SELECT n.id, d.D5 as damage
    FROM nodesTable n, damagesTable d
    WHERE Within(n.the_geom, d.the_geom)
    AND n.ROWID IN
    ( SELECT pkid FROM idx_nodesTable_the_geom
    WHERE xmin > MbrMinX(d.the_geom)
    AND xmax MbrMinY(d.the_geom)
    AND ymax < MbrMaxY(d.the_geom)
    )

    If you need more information about the structure of the tables, let me know.

    Thank you for your help.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>