samedi 11 mars 2017

Dealing with huge vector GeoPackage databases in GDAL/OGR

Recently, I've fixed a bug in the OGR OpenFileGDB driver, the driver made from the reverse engineering the ESRI FileGeoDatabase format, so as to be able to read tables whose section that enumerates and describes fields is located beyond the first 4 GB of the file. This table from the 2016 TIGER database is indeed featuring all linear edges of the USA and is 15 GB large (feature and spatial indexes included), with 85 million features.

Some time before, I had to deal with a smaller database - 1.7 GB as GeoPackage - with 5.4 million polygons (bounding box) from the cadastre of an Italian province. One issue I noticed is that when you want to get the summary of the layer, with ogrinfo -al -so the.gpkg, it was very slow. The reason is that this summary includes the feature count, and there's no way to get this metadata quickly, apart from running the "SELECT COUNT(*) FROM the_table" request, which causes a full scan of the table. For small databases, this runs fast, but when going into the gigabyte realm, this can take several dozains of seconds. But getting the spatial extent of the layer, which is one of the other information displayed by the summary mode of ogrinfo, is fast since the gpkg_contents "system" table of a GeoPackage database includes the bounding box of the table. So my idea was to extend the definition of the gpkg_contents table to add a new column, ogr_feature_count, to store the feature count. I went to implement that, and it worked fine. The synchronization of the value of ogr_feature_count after edits can be done with 2 SQLite triggers, on row insertion and deletion, and that  works with implementations that are not aware of the existence of this new column. Like older OGR versions. Unfortunately it appears that at least one other implementation completely rejected such databases. There is some inconsistency in the GeoPackage specification if additional columns are accepted or not in system tables. From the /base/core/contents/data/table_def test case, "Column order, check constraint and trigger definitions, and other column definitions in the returned sql are irrelevant.", it would seem that additional columns should still be considered as a valid GeoPackage. Anyway, that's only the theory and we don't want to break interoperability for just a nice-to-have feature... So I went to change the design a bit and created a new table, gpkg_ogr_contents, with a table_name and feature_count columns. I'm aware that I should not borrow the gpkg_ prefix, but I felt it was safer to do so since other implementations will probably ignore any unknown gpkg_ prefixed table. And the addition of the ogr_ prefix makes collisions with future extension of the GeoPackage specification unlikely. The content of this table is also maintained in synchronization with the data table thanks to two triggers, and this makes the other software that rejected my first attempt happy. Problem solved.

Let's come back to our 13 GB FileGeoDatabase. My first attempt to convert is to GeoPackage with ogr2ogr resulted in converting the features in about half an hour, but once this 100% stage reached, the finalization, which includes building the spatial index took ages. So long, that after a whole night it wasn't yet finished and seriously making the computer non responsive due to massive I/O activity. In the GeoPackage driver, the spatial index is indeed created after feature insertion, so that the feature table and spatial index tables are well separated in the file, and from previous experiment with the Spatialite driver, it proved to be the right strategy. Populating the SQLite R-Tree is done with a simple statement: INSERT INTO my_rtree SELECT fid, ST_MinX(geom), ST_MaxX(geom), ST_MinY(geom), ST_MaxY(geom) FROM the_table. Analyzing what happens in the SQLite code is not easy when you are not familiar with that code base, but my intuition is that there was constant back and forth between the geometry data area and the RTree area in the file, making the SQLite page cache inefficient. So I decided to experiment with a more progressive approach. Let's iterate over the feature table and collect the fid, minx, max, miny, maxy by chunks of 100 000 rows, and the insert those 100 000 bounding boxes into the R-Tree, and loop again unil we have completely read the feature table. With such a strategy, the spatial index can now be built in 4h30. The resulting GeoPackage file weights 31.6 GB, so twice at large than the FileGeoDatabase. One of the reasons for the difference must be due to geometries in FileGeoDatabase being compressed (quantization for coordinate precision, delta encoding and use of variable integer) whereas GeoPackage uses a uncompressed SQLite BLOB based on OGC WKB.
My first attempt at opening it in QGIS resulted in the UI to be frozen, probably for hours. The reason is that QGIS always issues a spatial filter, even when requesting on a area of interest that is at least as large as the extent of the layer, where there is no performance gain to expect from using it. So the first optimization was in the OGR GeoPackage to detect that situation and to not translate the OGR spatial filter as SQLite R-Tree filter. QGIS could now open the database and progressively displays the features. Unfortunately when zooming in, the UI became frozen again. When applying a spatial filter, the GeoPackage driver created a SQL request like the following one:

SELECT * FROM the_table WHERE fid IN 
       (SELECT id FROM the_rtree WHERE 
        xmin <= bbox_xmax AND xmax >= bbox_xmin AND
        ymin <= bboy_ymay AND ymay >= bboy_ymin)

It turns out that the sub-select (the one that fetches the feature ID from the spatial index) is apparently entirely run before the outer select (the one that returns geometry and attributes) starts being evaluated. This way of expressing the spatial filter came from the Spatialite driver (since GeoPackage and Spatialite use the exact same mechanisms for spatial indexing), itself based on examples from an old Spatialite tutorial. For not too big databases, this runs well. After some experiment, it turns out that doing a JOIN between the feature table and the RTree virtual table makes it possible to have a non blocking request:

SELECT * FROM the_table t JOIN the_rtree r ON t.fid = r.id
WHERE r.xmin <= bbox_xmax AND r.xmax >= bbox_xmin AND
      r.ymin <= bboy_ymax AND r.ymax >= bboy_ymin

Now QGIS is completely responsive, although I find that even on high zoom levels, the performance is somehow disappointing, ie features appear rather slowly. There seems to be some threshold effect on the size of the database, since the performance is rather good on the Italian province cadastral use case.

Another experiment showed that increasing the SQLite page size from 1024 bytes (the default in SQLite 3.11 or earlier) to 4096 bytes (the default since SQLite 3.12) decreases the database size to 28.8 GB. This new page size of 4096 bytes is now used by default by the OGR SQLite and GPKG drivers (unless OGR_SQLITE_PRAGMA=page_size=xxxx is specified as a configuration option).

I also discovered that increasing the SQLite page cache from its 2 MB default to 2 GB (with --config OGR_SQLITE_CACHE 2000) significantly improved the time to build the spatial index, decreasing the total conversion time from 4h30 to 2h10. 2GB is just a value selected at random. It might be too large or perhaps a larger value would help further.

All improvements mentionned above (faster spatial index creation, better use of spatial index and change of default page size) are now in GDAL trunk, and will be available in the upcoming 2.2.0 release.