Disclaimer: This tutorial was created by Daniel Lakeland

It is part of his series on Julia Data Tutorials

Visualizing Spatial Data with GMT

GMT is the "Generic Mapping Tools" a powerful toolset for geospatial visualizations. With great power comes great learning curves. The goal here is to make it possible for someone who has a data analysis background but not much specific knowledge of geospatial software to make maps that show their data in 2D maps, and simultaneously to learn something about the framework and what it is capable of, and how to use it.

Problem 1: a Choropleth map of US Census Data

One of the most common tasks in basic geospatial data analysis is to create a map in which regions of the world are colored based on the value of some statistic pertaining to the region. We will start with a very simple map of population for each county in the US. The data for this can be found from the Census bureau at https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv

using CSV,DataFrames,GMT,Printf,DataFramesMeta

countypopurl = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv"

download(countypopurl,"data/countypops.csv")

cpopall = CSV.read("data/countypops.csv",DataFrame)

cpop = cpopall[:,[:STATE,:COUNTY,:STNAME,:CTYNAME,:POPESTIMATE2020]]

3,194 rows × 5 columns

	STATE	COUNTY	STNAME	CTYNAME	POPESTIMATE2020
	Int64	Int64	String31	String	Int64
1	1	0	Alabama	Alabama	4921532
2	1	1	Alabama	Autauga County	56145
3	1	3	Alabama	Baldwin County	229287
4	1	5	Alabama	Barbour County	24589
5	1	7	Alabama	Bibb County	22136
6	1	9	Alabama	Blount County	57879
7	1	11	Alabama	Bullock County	9976
8	1	13	Alabama	Butler County	19504
9	1	15	Alabama	Calhoun County	113469
10	1	17	Alabama	Chambers County	32865
11	1	19	Alabama	Cherokee County	26294
12	1	21	Alabama	Chilton County	44397
13	1	23	Alabama	Choctaw County	12418
14	1	25	Alabama	Clarke County	23291
15	1	27	Alabama	Clay County	13112
16	1	29	Alabama	Cleburne County	14967
17	1	31	Alabama	Coffee County	53230
18	1	33	Alabama	Colbert County	55411
19	1	35	Alabama	Conecuh County	11851
20	1	37	Alabama	Coosa County	10650
21	1	39	Alabama	Covington County	36930
22	1	41	Alabama	Crenshaw County	13681
23	1	43	Alabama	Cullman County	84515
24	1	45	Alabama	Dale County	48959
⋮	⋮	⋮	⋮	⋮	⋮

Now we have a large number of different estimates for each county in the US. For simplicity let's use the POPESTIMATE2020 variable. The COUNTY variable is the FIPS county code for the county.

In order to build a Choropleth map, we will need to have a file that defines the spatial boundaries of each county! GMT is a generic mapping tool It will read various geospatial data formats which can be used to define the regions to be plotted. The Census bureau has a file which defines the boundaries of the counties at 3 different resolutions. These are in "shapefile" format. The files and other related shapefiles are available at the Census website. We will use the medium resolution version of the county shapefiles (5 million meter resolution).

countyshapeurl = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip"

download(countyshapeurl,"data/cb_2018_us_county_5m.zip")
cd("data")
run(`unzip cb_2018_us_county_5m.zip`)
cd("..")

Inside the zip file are a number of files, the one ending in ".shp" is in the shape file format. We can read this using GMT's function gmtread

counties = gmtread("data/cb_2018_us_county_5m.shp")

This file has islands (holes in polygons).
	Use `gmtread(..., no_islands=true)` to ignore them.
Vector{GMTdataset} with 3571 segments
PROJ: +proj=longlat +datum=NAD83 +no_defs
WKT: GEOGCS["NAD83",
    DATUM["North_American_Datum_1983",
        SPHEROID["GRS 1980",6378137,298.257222101]],
    PRIMEM["Greenwich",0],
    UNIT["degree",0.0174532925199433]]
Header1:	39,071,01074048,0500000US39071,39071,Highland,06,1432479992,121949
83
Header2:	06,003,01675840,0500000US06003,06003,Alpine,06,1912292630,12557304
Header3:	12,033,00295737,0500000US12033,12033,Escambia,06,1701544502,563927
612
Header4:	17,101,00424252,0500000US17101,17101,Lawrence,06,963936864,5077783
Header5:	28,153,00695797,0500000US28153,28153,Wayne,06,2099745573,7255476
Header6:	28,141,00695791,0500000US28141,28141,Tishomingo,06,1098938845,5236
0190
Header7:	30,089,01719578,0500000US30089,30089,Sanders,06,7149405165,7639414
4
Header8:	36,001,00974099,0500000US36001,36001,Albany,06,1354120790,27124553
Header9:	42,105,01213684,0500000US42105,42105,Potter,06,2800612694,559778
Header10:	29,077,00758493,0500000US29077,29077,Greene,06,1749081115,6620035
...
Header3563:	22,007,00558414,0500000US22007,22007,Assumption,15,877029986,67
099711
Header3564:	39,161,01074092,0500000US39161,39161,Van Wert,06,1059702141,327
8633
Header3565:	54,019,01560095,0500000US54019,54019,Fayette,06,1713569368,1744
3841
Header3566:	08,009,00198120,0500000US08009,08009,Baca,06,6617400546,6142193
Header3567:	42,055,01213670,0500000US42055,42055,Franklin,06,2000048804,154
7614
Header3568:	12,001,00308548,0500000US12001,12001,Alachua,06,2266324954,2428
77007
Header3569:	48,247,01383909,0500000US48247,48247,Jim Hogg,06,2942674729,925
65
Header3570:	29,099,00758504,0500000US29099,29099,Jefferson,06,1700345322,20
143587
Header3571:	13,307,00352287,0500000US13307,13307,Webster,06,542230980,23481
62
First segment DATA
Attributes: Dict("LSAD" => "06", "COUNTYFP" => "071", "GEOID" => "39071", "
COUNTYNS" => "01074048", "AWATER" => "12194983", "ALAND" => "1432479992", "
AFFGEOID" => "0500000US39071", "STATEFP" => "39", "NAME" => "Highland")
Global BoundingBox:    [-179.14733999999999, 179.77847, -14.552548999999999
, 71.352561, 0.0, 0.0]
First seg BoundingBox: [-83.872214, -83.343479, 39.01889, 39.37873599999999
6]
25×2 Matrix{Float64}:
 -83.8698  39.0555
 -83.8657  39.2473
 -83.8344  39.2457
 -83.8359  39.2233
 -83.8139  39.223
 -83.8013  39.2316
 -83.7848  39.2629
 -83.5909  39.3787
 -83.3727  39.3774
 -83.3751  39.3478
   ⋮       
 -83.3435  39.2332
 -83.3535  39.1976
 -83.3856  39.0552
 -83.6116  39.0189
 -83.673   39.0204
 -83.7053  39.0214
 -83.8169  39.0202
 -83.8722  39.0213
 -83.8698  39.0555

Now, what kind of thing is "counties?"

typeof(counties)

Vector{GMTdataset} (alias for Array{GMTdataset, 1})

It's a Vector{GMTdataset}. Basically GMT treats the shape file as a collection of shapes, each shape gets its own GMTdataset, which is a structure

search: GMTdataset

  No documentation found.

  Summary
  ≡≡≡≡≡≡≡≡≡

  mutable struct GMTdataset{T<:Real, N}

  Fields
  ≡≡≡≡≡≡≡≡

  data     :: Array{T<:Real, N}
  ds_bbox  :: Vector{Float64}
  bbox     :: Vector{Float64}
  attrib   :: Dict{String, String}
  colnames :: Vector{String}
  text     :: Vector{String}cpop.CFIPS = format("%03d",cpop.COUNTY)

  geom     :: Int64

  Supertype Hierarchy
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  GMTdataset{T<:Real, N} <: AbstractArray{T<:Real, N} <: Any

In addition to "data" which is lattitude and longitude values in this case, there are a variety of metadata fields. one of which is "header".

Let's look at the header of the first county:

counties[1].header
counties[1].attrib

Dict{String, String} with 9 entries:
  "LSAD"     => "06"
  "COUNTYFP" => "071"
  "GEOID"    => "39071"
  "COUNTYNS" => "01074048"
  "AWATER"   => "12194983"
  "ALAND"    => "1432479992"
  "AFFGEOID" => "0500000US39071"
  "STATEFP"  => "39"
  "NAME"     => "Highland"

This shows that the header is a comma separated values string. We want to "join" our data to these counties such that the county FIPS identifier is the same. The FIPS identifier is a unique numerical code assigned to each county (the names are not necessarily unique).

In this case, we have the value 39 being the FIPS code for the state of Ohio, and 071 being the county FIPS code for Highland County.

So we'd like to join field 2 of the header to the COUNTY column of our population data. However our population data represents the COUNTY field as a number, not a 0 padded string. Let's fix that then do the join

cpop.COUNTY = Printf.format.(Ref(Printf.Format("%03d")),cpop.COUNTY)
cpop.STATE = Printf.format.(Ref(Printf.Format("%02d")),cpop.STATE)

cpop = cpop[cpop.COUNTY .!= "000",:] ## eliminate county "000" which is "the whole state" for each state

dfc = DataFrame(STATE = map(x->x.attrib["STATEFP"],counties),COUNTY=map(x -> x.attrib["COUNTYFP"],counties),ORDER=1:length(counties))


joineddata = @chain leftjoin(dfc,cpop,on= [:STATE,:COUNTY],makeunique=true) begin
@orderby(:ORDER)
end

cptvallog = makecpt(range=(log(1000),log(11e6)),C=:plasma)

#imshow(counties,level=joineddata.POPESTIMATE2020,cmap=cptval,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),proj=:guess,colorbar=true)

GMT.GMTcpt([0.050980392156862744 0.03137254901960784 0.5294117647058824; 0.
06274509803921569 0.027450980392156862 0.5333333333333333; … ; 0.9411764705
882353 0.9686274509803922 0.1411764705882353; 0.9411764705882353 0.94117647
05882353 0.9411764705882353], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [6.907755278982
137 6.944233429145116; 6.944233429145116 6.980711579308096; … ; 16.14044953
0436687 16.17692768059967; 16.17692768059967 16.213405830762646], [6.907755
278982137, 16.213405830762646], [0.0 0.0 0.0; 1.0 1.0 1.0; 0.50196078431372
55 0.5019607843137255 0.5019607843137255], 24, NaN, [0.050980392156862744 0
.03137254901960784 … 0.027450980392156862 0.5333333333333333; 0.06274509803
921569 0.027450980392156862 … 0.027450980392156862 0.5372549019607843; … ; 
0.9450980392156862 0.9607843137254902 … 0.9686274509803922 0.14117647058823
53; 0.9411764705882353 0.9686274509803922 … 0.9764705882352941 0.1294117647
0588237], ["", "", "", "", "", "", "", "", "", ""  …  "", "", "", "", "", "
", "", "", "", ""], ["", "", "", "", "", "", "", "", "", ""  …  "", "", "",
 "", "", "", "", "", "", ""], "rgb", String[])

There are missing values of POPESTIMATE2020 so let's replace them with NaN so we at least still get the polygon drawn. GMT will draw NaN as a special color.

joineddata.POPESTIMATE2020 = replace(joineddata.POPESTIMATE2020,missing => NaN)



GMT.plot(counties,level=log.(1.0 .+ joineddata.POPESTIMATE2020),cmap=cptvallog,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethlog.png",title="Continental US County log(Population)")

Choropleth of log(population) at county level

Let's see what it looks like if we don't take the logarithm...

cptval = makecpt(range=(0,11_000_000),C=:plasma)

GMT.plot(counties,level=joineddata.POPESTIMATE2020,cmap=cptval,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethnolog.png",title="Continental US County Population")

Continental Population

It may be interesting to work further with the polygons that represent the counties in some additional ways. For example we could calculate the population Density by dividing the population by the area of the polygon. We can calculate the area using the function gmtspatial.

measures = gmtspatial(counties, area="k")[1] # to get area in km^2
print(typeof(measures))
length(measures)
print(measures)

GMTdataset{Float64, 2}BoundingBox:  [-179.10882508557128, 179.6234549210534
2, -14.542124124603314, 69.40079782409146, 0.0001071433475167924, 384291.37
54448318]
3571×3 Matrix{Float64}:
  -83.6009  39.1847  1446.7
 -119.821   38.5971  1925.5
  -87.362   30.6678  1939.81
  -87.7269  38.7203   968.306
  -88.6959  31.6407  2107.22
  -88.2392  34.74    1151.02
 -115.13    47.6745  7235.18
  -73.9735  42.6001  1385.33
  -77.8956  41.7446  2803.22
  -93.3419  37.258   1756.51
    ⋮                
  -91.0626  29.9009   944.669
  -84.5861  40.8556  1063.1
  -81.0811  38.0287  1732.05
 -102.56    37.3194  6624.41
  -77.7217  39.9272  2000.25
  -82.3582  29.6747  2509.54
  -98.6973  27.0433  2937.68
  -90.538   38.2606  1722.83
  -84.5497  32.0458   544.366

But some counties will consiste of several polygons, like some along the coast may include some islands, etc, to get the total area, we need to group by county identifier, and sum all the areas.

countyareas = @chain dfc begin
  @transform(:area = measures[:,3])
  groupby([:STATE,:COUNTY])
  @combine(:totarea = sum(:area))
  @select(:STATE,:COUNTY,:totarea)
end


cpop = @chain leftjoin(cpop,countyareas,on = [:STATE, :COUNTY]) begin
  @transform(:density = :POPESTIMATE2020 ./ :totarea)
  @orderby(:STATE,:COUNTY)
end

print(cpop[1:10,:])

joineddata = @chain leftjoin(dfc,cpop,on = [:STATE,:COUNTY]) begin 
  @orderby(:ORDER)
end

10×7 DataFrame
 Row │ STATE   COUNTY  STNAME    CTYNAME          POPESTIMATE2020  totarea 
  density
     │ String  String  String31  String           Int64            Float64?
  Float64?
─────┼─────────────────────────────────────────────────────────────────────
───────────
   1 │ 01      001     Alabama   Autauga County             56145   1560.78
  35.9725
   2 │ 01      003     Alabama   Baldwin County            229287   4352.7 
  52.677
   3 │ 01      005     Alabama   Barbour County             24589   2346.53
  10.4789
   4 │ 01      007     Alabama   Bibb County                22136   1622.71
  13.6414
   5 │ 01      009     Alabama   Blount County              57879   1686.26
  34.3238
   6 │ 01      011     Alabama   Bullock County              9976   1617.2 
   6.16867
   7 │ 01      013     Alabama   Butler County              19504   2014.3 
   9.68279
   8 │ 01      015     Alabama   Calhoun County            113469   1582.32
  71.7105
   9 │ 01      017     Alabama   Chambers County            32865   1563.07
  21.0259
  10 │ 01      019     Alabama   Cherokee County            26294   1554.12
  16.9189

3,571 rows × 8 columns (omitted printing of 1 columns)

	STATE	COUNTY	ORDER	STNAME	CTYNAME	POPESTIMATE2020	totarea
	String	String	Int64	String31?	String?	Int64?	Float64?
1	39	071	1	Ohio	Highland County	43304	1446.7
2	06	003	2	California	Alpine County	1119	1925.5
3	12	033	3	Florida	Escambia County	322364	1939.81
4	17	101	4	Illinois	Lawrence County	15467	968.306
5	28	153	5	Mississippi	Wayne County	20317	2107.22
6	28	141	6	Mississippi	Tishomingo County	19275	1151.02
7	30	089	7	Montana	Sanders County	12157	7235.18
8	36	001	8	New York	Albany County	303654	1385.33
9	42	105	9	Pennsylvania	Potter County	16453	2803.22
10	29	077	10	Missouri	Greene County	294997	1756.51
11	29	205	11	Missouri	Shelby County	5919	1298.11
12	02	100	12	Alaska	Haines Borough	2614	6272.01
13	02	100	13	Alaska	Haines Borough	2614	6272.01
14	02	100	14	Alaska	Haines Borough	2614	6272.01
15	04	015	15	Arizona	Mohave County	217206	34881.9
16	13	143	16	Georgia	Haralson County	30383	734.927
17	21	145	17	Kentucky	McCracken County	65644	693.749
18	27	133	18	Minnesota	Rock County	9301	1253.7
19	41	047	19	Oregon	Marion County	349204	3089.11
20	47	125	20	Tennessee	Montgomery County	214251	1406.65
21	48	243	21	Texas	Jeff Davis County	2220	5858.32
22	55	061	22	Wisconsin	Kewaunee County	20386	891.309
23	05	051	23	Arkansas	Garland County	99789	1902.12
24	06	109	24	California	Tuolumne County	54515	5891.94
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮

denscpt = makecpt(range=(0,11e6/(100*100)),C=:plasma)


GMT.plot(counties,level=Vector{Float64}(replace(joineddata.density,missing=>NaN)),cmap=denscpt,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethdens.png",title="Continental US County Pop Density (1/km^2)")

Continental US Density

Smallest Area Containing ~50% of the population

Our approach to this question will be to sort the counties by density in decreasing order, cumsum the populations... and then create an indicator variable for whether the cumsum is less than 50% of the total, then plot this indicator variable.

totpop = sum(cpop[cpop.COUNTY .!= "000",:POPESTIMATE2020]) ## ignore county 0, that's the "whole state"
print(totpop)

areads = @chain cpop begin
@subset(:COUNTY .!= "000")
@orderby(-:density)
@transform( :csum = cumsum(:POPESTIMATE2020))
@transform( :inset = :csum .< totpop/2.0,:in80set = :csum .< 0.8 * totpop)
end

joineddata = @chain leftjoin(dfc,areads,on=[:STATE,:COUNTY]) begin
  @orderby(:ORDER)
end

insetcmap = makecpt(range(0,1),C=:plasma)

GMT.plot(counties,level=Float64.(replace(joineddata.inset,missing => NaN)),cmap=insetcmap,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethhpdset.png",title="Continental US smallest 50% set ")

329484123

high probability density set 50%

80%tile smallest area

GMT.plot(counties,level=Float64.(replace(joineddata.in80set,missing => NaN)),cmap=insetcmap,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethhpd80set.png",title="Continental US smallest 80% set ")

high probability density set 50%