# Tidying feature geometries with sf

- Introduction
- Corrup or invalid geometries?
- Making invalid polygons valid
- Empty geometries
- Tidying feature geometries
- References [references]

### Introduction

Spatial line and polygon data are often messy; although *simple
features* formally follow a standard, there is no guarantee that data is
clean when imported in R. This blog shows how we can identify,
(de)select, or repair broken and invalid geometries. We also show how
empty geometries arise, and can be dealt with. Literature on invalid
polygons and correcting them is found in Ramsey
(2010), Ledoux, Ohori, and Meijers
(2014), Ledoux, Ohori, and Meijers
(2012), and Van Oosterom, Quak, and
Tijssen (2005); all these come with excelent figures
illustrating the problem cases.

We see that from version 0.4-0, `sf`

may be linked to `lwgeom`

,

```
library(sf)
## Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.1 r15264
```

where `lwgeom`

stands for the *light-weight geometry* library that
powers postgis. This library is not present on CRAN, so binary packages
installed from CRAN will not come with it. It is only linked to `sf`

when it is detected during a build from source. When `lwgeom`

is
present, we will have a working version of `st_make_valid`

, which is
essentially identical to PostGIS’ `ST_makeValid`

.

### Corrup or invalid geometries?

There are two types of things that can go wrong when dealing with
geometries in `sf`

. First, a geometry can be corrupt, which is for
instance the case for a `LINESTRING`

with one point, or a `POLYGON`

with
more than zero and less than 4 points:

```
l0 = st_linestring(matrix(1:2,1,2))
p0 = st_polygon(list(rbind(c(0,0),c(1,1),c(0,0))))
```

These cases *could* of course be easily caught by the respective
constructor functions, but they are not because we want to see what
happens. Also, if we would catch them, it would not prevent us from
running into them, because the majority of spatial data enters R through
GDAL, and `sf`

’s binary interface (reading well-known
binary).
Also, for many purposes corrupt may not be a problem, e.g. if we only
want to plot them. In case we want to use them however in geometrical
operations, we’ll typically see a message like:

```
IllegalArgumentException: Invalid number of points in LinearRing found 3 - must be 0 or >= 4
```

which points to GEOS not accepting a geometry as a possible geometry.
Such an error message however does not point us to *which* geometry
caused this. We could of course write a loop over all geometries to find
this out, but can also use `st_is_valid`

which returns by default `NA`

on corrupt geometries:

```
l0 = st_linestring(matrix(1:2,1,2))
p0 = st_polygon(list(rbind(c(0,0),c(1,1),c(0,0))))
p = st_point(c(0,1)) # not corrupt
st_is_valid(st_sfc(l0, p0, p))
## [1] NA NA TRUE
```

Simple feature *validity* refers to a number of properties that polygons
should have, such as non-self intersecting, holes being inside polygons.
A number of different examples for invalid geometries are found in
Ledoux, Ohori, and Meijers (2014), and were taken from
their prepair github repo:

```
# A 'bowtie' polygon:
p1 = st_as_sfc("POLYGON((0 0, 0 10, 10 0, 10 10, 0 0))")
# Square with wrong orientation:
p2 = st_as_sfc("POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))")
# Inner ring with one edge sharing part of an edge of the outer ring:
p3 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0),(5 2,5 7,10 7, 10 2, 5 2))")
# Dangling edge:
p4 = st_as_sfc("POLYGON((0 0, 10 0, 15 5, 10 0, 10 10, 0 10, 0 0))")
# Outer ring not closed:
p5 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10))")
# Two adjacent inner rings:
p6 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (1 1, 1 8, 3 8, 3 1, 1 1), (3 1, 3 8, 5 8, 5 1, 3 1))")
# Polygon with an inner ring inside another inner ring:
p7 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (2 8, 5 8, 5 2, 2 2, 2 8), (3 3, 4 3, 3 4, 3 3))")
p = c(p1, p2, p3, p4, p5, p6, p7)
(valid = st_is_valid(p))
## [1] FALSE TRUE FALSE FALSE NA FALSE FALSE
```

Interestingly, GEOS considers `p5`

as corrupt (`NA`

) and `p2`

as valid.

To query GEOS for the reason of invalidity, we can use the
`reason = TRUE`

argument to `st_is_valid`

:

```
st_is_valid(p, reason = TRUE)
## [1] "Self-intersection[5 5]" "Valid Geometry"
## [3] "Self-intersection[10 2]" "Self-intersection[10 0]"
## [5] NA "Self-intersection[3 1]"
## [7] "Holes are nested[3 3]"
```

### Making invalid polygons valid

As mentioned above, in case `sf`

was linked to `lwgeom`

, which is
confirmed by

```
sf_extSoftVersion()["lwgeom"]
## lwgeom
## "2.3.1 r15264"
```

not printing a `NA`

, we can use `st_make_valid`

to make geometries
valid:

```
st_make_valid(p)
## Geometry set for 7 features
## geometry type: GEOMETRY
## dimension: XY
## bbox: xmin: 0 ymin: 0 xmax: 15 ymax: 10
## epsg (SRID): NA
## proj4string: NA
## First 5 geometries:
## MULTIPOLYGON(((0 0, 0 10, 5 5, 0 0)), ((5 5, 10...
## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))
## GEOMETRYCOLLECTION(POLYGON((10 7, 10 2, 10 0, 0...
## GEOMETRYCOLLECTION(POLYGON((10 0, 0 0, 0 10, 10...
## POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))
```

A well-known “trick”, which may be your only alternative if is to buffer the geometries with zero distance:

```
st_buffer(p[!is.na(valid)], 0.0)
## Geometry set for 6 features
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 0 ymin: 0 xmax: 10 ymax: 10
## epsg (SRID): NA
## proj4string: NA
## First 5 geometries:
## POLYGON((0 0, 0 10, 5 5, 0 0))
## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))
## POLYGON((0 0, 0 10, 10 10, 10 7, 5 7, 5 2, 10 2...
## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))
## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0), (1 1, 3 ...
```

but we see that, apart from the fact that this only works for non-corrupt geometries, we end up with different results.

A larger example from the prepair site is this:

```
x = read_sf("/home/edzer/git/prepair/data/CLC2006_2018418.geojson")
st_is_valid(x)
## [1] FALSE
st_is_valid(st_make_valid(x))
## [1] TRUE
plot(x, col = 'grey', axes = TRUE, graticule = TRUE)
```

The corresponding paper, Ledoux, Ohori, and Meijers
(2012) zooms in on problematic points.
The authors argue to use constrained triangulation instead of the (less
documented) approach taken by `lwgeom`

; Mike Sumner also explores this
here. It builds upon
RTriangle, which cannot
be integrated in `sf`

as it is distributed under license with a
non-commercial clause. Ledoux uses CGAL, which would
be great to have an interface to from R!

### Empty geometries

Empty geometries exist, and can be thought of as zero-length vectors,
`data.frame`

s without rows, or `NULL`

values in lists: in essence,
there’s place for information, but there is no information. An empty
geometry arises for instance if we ask for the intersection of two
non-intersecting geometries:

```
st_intersection(st_point(0:1), st_point(1:2))
## GEOMETRYCOLLECTION()
```

In principle, we could have designed `sf`

such that empty geometries
were represented a `NULL`

value, but the standard prescrives that every
geometry type has an empty instance:

```
st_linestring()
## LINESTRING()
st_polygon()
## POLYGON()
st_point()
## POINT(NA NA)
```

and thus the empty geometry is typed. This guarantees clean roundtrips from a database to R back into a database: no information (on type) gets lost in case of presence of empty geometries.

How can we detect, and filter on empty geometries? We can do that with
`st_dimension`

:

```
lin = st_linestring(rbind(c(0,0),c(1,1)))
pol = st_polygon(list(rbind(c(0,0),c(1,1),c(0,1),c(0,0))))
poi = st_point(c(1,1))
p0 = st_point()
pol0 = st_polygon()
st_dimension(st_sfc(lin, pol, poi, p0, pol0))
## [1] 1 2 0 NA NA
```

and see that empty geometries return `NA`

.

The standard however prescribes that an empty polygon still has
dimension two, and we can override the `NA`

convenience to get
standard-compliant dimensions by

```
st_dimension(st_sfc(lin, pol, poi, p0, pol0), NA_if_empty = FALSE)
## [1] 1 2 0 0 2
```

### Tidying feature geometries

When you analyse your spatial data with `sf`

and you don’t get any
warnings or error messages, all may be fine. In case you do, or your are
curious, you can check for

- empty geometries, using
`any(is.na(st_dimension(x)))`

- corrupt geometries, using
`any(is.na(st_is_valid(x)))`

- invalid geometries, using
`any(na.omit(st_is_valid(x)) == FALSE)`

; in case of corrupt and/or invalid geometries, - in case of invalid geometries, query the reason for invalidity by
`st_is_valid(x, reason = TRUE)`

- you may be succesful in making geometries valid using
`st_make_valid(x)`

or, if`st_make_valid`

is not supported by `st_buffer(x, 0.0)`

on non-corrupt geometries (but beware of the bowtie example above, where`st_buffer`

removes one half).- After succesful a
`st_make_valid`

, you may want to select a particular type subset using`st_is`

, or cast`GEOMETRYCOLLECTIONS`

to`MULTIPOLYGON`

by

```
st_make_valid(p) %>% st_cast("MULTIPOLYGON")
## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of
## geometrycollection is retained
## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of
## geometrycollection is retained
## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of
## geometrycollection is retained
## Geometry set for 7 features
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 0 ymin: 0 xmax: 10 ymax: 10
## epsg (SRID): NA
## proj4string: NA
## First 5 geometries:
## MULTIPOLYGON(((0 0, 0 10, 5 5, 0 0)), ((5 5, 10...
## MULTIPOLYGON(((0 0, 0 10, 10 10, 10 0, 0 0)))
## MULTIPOLYGON(((10 7, 10 2, 10 0, 0 0, 0 10, 10 ...
## MULTIPOLYGON(((10 0, 0 0, 0 10, 10 10, 10 0)))
## MULTIPOLYGON(((0 0, 10 0, 10 10, 0 10, 0 0)))
```

For longer explanations about what makes a polygons invalid, do read one of the references below, all are richly illustrated

### References [references]

Ledoux, Hugo, Ken Arroyo Ohori, and Martijn Meijers. 2012. “Automatically Repairing Invalid Polygons with a Constrained Triangulation.” In. Agile. https://3d.bk.tudelft.nl/ken/files/12_agile.pdf.

———. 2014. “A Triangulation-Based Approach to Automatically Repair GIS
Polygons.” *Computers & Geosciences* 66: 121–31.
https://pdfs.semanticscholar.org/d9ec/b32a7844b436fcd4757958e5eeca9563fcd2.pdf.

Ramsey, Paul. 2010. “PostGIS-Tips for Power Users.” *Presentation on:
FOSS4G*. http://2010.foss4g.org/presentations/3369.pdf.

Van Oosterom, Peter, Wilko Quak, and Theo Tijssen. 2005. “About Invalid,
Valid and Clean Polygons.” In *Developments in Spatial Data Handling*,
1–16. Springer. http://excerpts.numilog.com/books/3540267727.pdf.