# Spatial indexes coming to sf

Spatial indexes give you fast results on spatial queries,
such as finding whether pairs of geometries intersect
or touch, or finding their intersection. They reduce
the time to get results from quadratic in the number of
geometries to linear in the number of geometries. A recent
commit
brings spatial indexes to sf for the binary logical predicates
(`intersects`

, `touches`

, `crosses`

, `within`

, `contains`

,
`contains_properly`

, `overlaps`

, `covers`

, `covered_by`

,
`relate_pattern`

, `equals`

, `disjoint`

), as well as `intersection`

,
which returns intersecting geometries.

The spatial join function `st_join`

uses a logical predicate to
join features, and is also also affected by this speedup.

## Antecedents

There have been attempts to use spatial planar indices, including
enhancement issue sfr:76.
In rgeos, GEOS STRtrees were used in
rgeos/src/rgeos_poly2nb.c, which is mirrored in a modern Rcpp setting
sf/src/geos.cpp, around lines 276 and 551.
The STRtree is constructed by building envelopes (bounding boxes) of input entities,
which are then queried for intersection with envelopes of another set of entities
(in rgeos, R functions `gUnarySTRtreeQuery`

and `gBinarySTRtreeQuery`

). The use case
was to find neighbours of all the about 90,000 US Census entities in Los Angeles, via
spdep::poly2nb(), which received an argument to enter the candidate neighbours found
by Unary querying the STRtree of entities by the same entities.

## Benchmark

A simple benchmark shows the obvious: `st_intersects`

without spatial
index behaves quadratic in the number of geometries (black line),
and is much faster for the case where a spatial index is created,
stronger so for larger number of polygons:

The polygon datasets used are simple checker boards with square polygons (showing a nice Moiré pattern):

The black small square polygons are essentially matched to the red ones; the number of polygons along the x axis is the number of a single geometry set (black).

To show that the behaviour of `intersects`

and `intersection`

is indeed linear in the number of polygons, we show runtimes for
both, as a function of the number of polygons (where `intersection`

was divided by 10 for scaling purposes):

## Implementation

Spatial indexes are available in the
GEOS library used by `sf`

, through the
functions
starting with `STRtree`

. The algorithm
implements a Sort-Tile-Recursive R-tree, according to the JTS
documentation
described in *P. Rigaux, Michel Scholl and Agnes Voisard. Spatial
Databases With Application To GIS. Morgan Kaufmann, San Francisco,
2002*.

The sf implementation
(some commits to follow this one) excludes some binary operations.
`st_distance`

, and `st_relate`

without pattern, are excluded as
these need to go through all combinations, returning a dense
matrix. `st_equals_exact`

is excluded as approximate matches are
sought, without constraining the degree of approximation.

# On which argument is an index built?

The R-tree is built on the first argument (`x`

), and used to
match all geometries over the second argument (`y`

) of binary
functions. This could give runtime differences, but for instance
for the dataset that triggered this development in
sfr:394, we see hardly
any difference:

```
library(sf)
# Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.2 r15302
load("test_intersection.Rdata")
nrow(test)
# [1] 16398
nrow(hsg2)
# [1] 6869
system.time(int1 <- st_intersection(test, hsg2))
# user system elapsed
# 105.712 0.040 105.758
system.time(int2 <- st_intersection(hsg2, test))
# user system elapsed
# 107.756 0.060 107.822
# Warning messages:
# 1: attribute variables are assumed to be spatially constant throughout all geometries
# 2: attribute variables are assumed to be spatially constant throughout all geometries
```

The resulting feature sets `int1`

and `int2`

are identical, only
the order of the features (records) and of the attribute columns
(variables) differs. Runtime without index is 35 minutes, 20 times
as long.

# Is the spatial index always built?

In the current implemenation (version >= 0.5-1) the index is always
built, except for `st_distance`

, and `st_relate`

without pattern.
This means it is also built e.g. for rook or queen neighborhood
selections (sfr:234),
which use `st_relate`

with a specified pattern.

# What about `prepared`

geometries?

Prepared geometries
in GEOS are essentially indexes over single geometries and not
over sets of geometries; they speed things up in particular when
single geometries are very complex, and only for a single geometry
to single geometry comparison. The spatial indexes are indexes over
*collections* of geometries; they make a cheap preselection based on
bounding boxes before the expensive pairwise comparison takes place.

## Script used

The followinig script was used to create the benchmark plots. It no longer works; in the version where it worked, `prepared = FALSE`

would take a branch where no trees were built, this is no longer the case.

```
library(sf)
sizes = c(10, 20, 50, 100, 160, 200)
res = matrix(NA, length(sizes), 4)
for (i in seq_along(sizes)) {
g1 = st_make_grid(st_polygon(list(rbind(c(0,0),c(0,1),c(1,1),c(0,1),c(0,0)))), n = sizes[i]) * sizes[i]
g2 = g1 + c(.5,.5)
res[i, 1] = system.time(i1 <- st_intersects(g1, g2))[1]
res[i, 2] = system.time(i2 <- st_intersects(g1, g2, prepare = FALSE))[1]
res[i, 3] = system.time(i1 <- st_intersection(g1, g2))[1]
res[i, 4] = identical(i1, i2)
}
plot(sizes^2, res[,2], type = 'b', ylab = 'time [s]', xlab = '# of polygons')
lines(sizes^2, res[,1], type = 'b', col = 'red')
legend("topleft", lty = c(1,1), col = c(1,2), legend = c("st_intersects without index", "st_intersects with spatial index"))
plot(sizes^2, res[,3]/10, type = 'b', ylab = 'time [s]', xlab = '# of polygons')
lines(sizes^2, res[,1], type = 'b', col = 'red')
legend("topleft", lty = c(1,1), col = c(1,2), legend = c("st_intersection * 0.1", "st_intersects"))
```