It is part of his series on Julia Data Tutorials
.
GMT is the "Generic Mapping Tools" a powerful toolset for geospatial visualizations. With great power comes great learning curves. The goal here is to make it possible for someone who has a data analysis background but not much specific knowledge of geospatial software to make maps that show their data in 2D maps, and simultaneously to learn something about the framework and what it is capable of, and how to use it.
One of the most common tasks in basic geospatial data analysis is to create a map in which regions of the world are colored based on the value of some statistic pertaining to the region. We will start with a very simple map of population for each county in the US. The data for this can be found from the Census bureau at https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv
using CSV,DataFrames,GMT,Printf,DataFramesMeta countypopurl = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv" download(countypopurl,"data/countypops.csv") cpopall = CSV.read("data/countypops.csv",DataFrame) cpop = cpopall[:,[:STATE,:COUNTY,:STNAME,:CTYNAME,:POPESTIMATE2020]]
3,194 rows × 5 columns
STATE | COUNTY | STNAME | CTYNAME | POPESTIMATE2020 | |
---|---|---|---|---|---|
Int64 | Int64 | String31 | String | Int64 | |
1 | 1 | 0 | Alabama | Alabama | 4921532 |
2 | 1 | 1 | Alabama | Autauga County | 56145 |
3 | 1 | 3 | Alabama | Baldwin County | 229287 |
4 | 1 | 5 | Alabama | Barbour County | 24589 |
5 | 1 | 7 | Alabama | Bibb County | 22136 |
6 | 1 | 9 | Alabama | Blount County | 57879 |
7 | 1 | 11 | Alabama | Bullock County | 9976 |
8 | 1 | 13 | Alabama | Butler County | 19504 |
9 | 1 | 15 | Alabama | Calhoun County | 113469 |
10 | 1 | 17 | Alabama | Chambers County | 32865 |
11 | 1 | 19 | Alabama | Cherokee County | 26294 |
12 | 1 | 21 | Alabama | Chilton County | 44397 |
13 | 1 | 23 | Alabama | Choctaw County | 12418 |
14 | 1 | 25 | Alabama | Clarke County | 23291 |
15 | 1 | 27 | Alabama | Clay County | 13112 |
16 | 1 | 29 | Alabama | Cleburne County | 14967 |
17 | 1 | 31 | Alabama | Coffee County | 53230 |
18 | 1 | 33 | Alabama | Colbert County | 55411 |
19 | 1 | 35 | Alabama | Conecuh County | 11851 |
20 | 1 | 37 | Alabama | Coosa County | 10650 |
21 | 1 | 39 | Alabama | Covington County | 36930 |
22 | 1 | 41 | Alabama | Crenshaw County | 13681 |
23 | 1 | 43 | Alabama | Cullman County | 84515 |
24 | 1 | 45 | Alabama | Dale County | 48959 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Now we have a large number of different estimates for each county in the US. For simplicity let's use the POPESTIMATE2020 variable. The COUNTY variable is the FIPS county code for the county.
In order to build a Choropleth map, we will need to have a file that defines the spatial boundaries of each county! GMT is a generic mapping tool It will read various geospatial data formats which can be used to define the regions to be plotted. The Census bureau has a file which defines the boundaries of the counties at 3 different resolutions. These are in "shapefile" format. The files and other related shapefiles are available at the Census website. We will use the medium resolution version of the county shapefiles (5 million meter resolution).
countyshapeurl = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip" download(countyshapeurl,"data/cb_2018_us_county_5m.zip") cd("data") run(`unzip cb_2018_us_county_5m.zip`) cd("..")
Inside the zip file are a number of files, the one ending in ".shp" is in the shape file format. We can read this using GMT's function gmtread
counties = gmtread("data/cb_2018_us_county_5m.shp")
This file has islands (holes in polygons). Use `gmtread(..., no_islands=true)` to ignore them. Vector{GMTdataset} with 3571 segments PROJ: +proj=longlat +datum=NAD83 +no_defs WKT: GEOGCS["NAD83", DATUM["North_American_Datum_1983", SPHEROID["GRS 1980",6378137,298.257222101]], PRIMEM["Greenwich",0], UNIT["degree",0.0174532925199433]] Header1: 39,071,01074048,0500000US39071,39071,Highland,06,1432479992,121949 83 Header2: 06,003,01675840,0500000US06003,06003,Alpine,06,1912292630,12557304 Header3: 12,033,00295737,0500000US12033,12033,Escambia,06,1701544502,563927 612 Header4: 17,101,00424252,0500000US17101,17101,Lawrence,06,963936864,5077783 Header5: 28,153,00695797,0500000US28153,28153,Wayne,06,2099745573,7255476 Header6: 28,141,00695791,0500000US28141,28141,Tishomingo,06,1098938845,5236 0190 Header7: 30,089,01719578,0500000US30089,30089,Sanders,06,7149405165,7639414 4 Header8: 36,001,00974099,0500000US36001,36001,Albany,06,1354120790,27124553 Header9: 42,105,01213684,0500000US42105,42105,Potter,06,2800612694,559778 Header10: 29,077,00758493,0500000US29077,29077,Greene,06,1749081115,6620035 ... Header3563: 22,007,00558414,0500000US22007,22007,Assumption,15,877029986,67 099711 Header3564: 39,161,01074092,0500000US39161,39161,Van Wert,06,1059702141,327 8633 Header3565: 54,019,01560095,0500000US54019,54019,Fayette,06,1713569368,1744 3841 Header3566: 08,009,00198120,0500000US08009,08009,Baca,06,6617400546,6142193 Header3567: 42,055,01213670,0500000US42055,42055,Franklin,06,2000048804,154 7614 Header3568: 12,001,00308548,0500000US12001,12001,Alachua,06,2266324954,2428 77007 Header3569: 48,247,01383909,0500000US48247,48247,Jim Hogg,06,2942674729,925 65 Header3570: 29,099,00758504,0500000US29099,29099,Jefferson,06,1700345322,20 143587 Header3571: 13,307,00352287,0500000US13307,13307,Webster,06,542230980,23481 62 First segment DATA Attributes: Dict("LSAD" => "06", "COUNTYFP" => "071", "GEOID" => "39071", " COUNTYNS" => "01074048", "AWATER" => "12194983", "ALAND" => "1432479992", " AFFGEOID" => "0500000US39071", "STATEFP" => "39", "NAME" => "Highland") Global BoundingBox: [-179.14733999999999, 179.77847, -14.552548999999999 , 71.352561, 0.0, 0.0] First seg BoundingBox: [-83.872214, -83.343479, 39.01889, 39.37873599999999 6] 25×2 Matrix{Float64}: -83.8698 39.0555 -83.8657 39.2473 -83.8344 39.2457 -83.8359 39.2233 -83.8139 39.223 -83.8013 39.2316 -83.7848 39.2629 -83.5909 39.3787 -83.3727 39.3774 -83.3751 39.3478 ⋮ -83.3435 39.2332 -83.3535 39.1976 -83.3856 39.0552 -83.6116 39.0189 -83.673 39.0204 -83.7053 39.0214 -83.8169 39.0202 -83.8722 39.0213 -83.8698 39.0555
Now, what kind of thing is "counties?"
typeof(counties)
Vector{GMTdataset} (alias for Array{GMTdataset, 1})
It's a Vector{GMTdataset}. Basically GMT treats the shape file as a collection of shapes, each shape gets its own GMTdataset, which is a structure
search: GMTdataset
No documentation found.
Summary
≡≡≡≡≡≡≡≡≡
mutable struct GMTdataset{T<:Real, N}
Fields
≡≡≡≡≡≡≡≡
data :: Array{T<:Real, N}
ds_bbox :: Vector{Float64}
bbox :: Vector{Float64}
attrib :: Dict{String, String}
colnames :: Vector{String}
text :: Vector{String}cpop.CFIPS = format("%03d",cpop.COUNTY)
geom :: Int64
Supertype Hierarchy
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
GMTdataset{T<:Real, N} <: AbstractArray{T<:Real, N} <: Any
In addition to "data" which is lattitude and longitude values in this case, there are a variety of metadata fields. one of which is "header".
Let's look at the header of the first county:
counties[1].header counties[1].attrib
Dict{String, String} with 9 entries: "LSAD" => "06" "COUNTYFP" => "071" "GEOID" => "39071" "COUNTYNS" => "01074048" "AWATER" => "12194983" "ALAND" => "1432479992" "AFFGEOID" => "0500000US39071" "STATEFP" => "39" "NAME" => "Highland"
This shows that the header is a comma separated values string. We want to "join" our data to these counties such that the county FIPS identifier is the same. The FIPS identifier is a unique numerical code assigned to each county (the names are not necessarily unique).
In this case, we have the value 39 being the FIPS code for the state of Ohio, and 071 being the county FIPS code for Highland County.
So we'd like to join field 2 of the header to the COUNTY column of our population data. However our population data represents the COUNTY field as a number, not a 0 padded string. Let's fix that then do the join
cpop.COUNTY = Printf.format.(Ref(Printf.Format("%03d")),cpop.COUNTY) cpop.STATE = Printf.format.(Ref(Printf.Format("%02d")),cpop.STATE) cpop = cpop[cpop.COUNTY .!= "000",:] ## eliminate county "000" which is "the whole state" for each state dfc = DataFrame(STATE = map(x->x.attrib["STATEFP"],counties),COUNTY=map(x -> x.attrib["COUNTYFP"],counties),ORDER=1:length(counties)) joineddata = @chain leftjoin(dfc,cpop,on= [:STATE,:COUNTY],makeunique=true) begin @orderby(:ORDER) end cptvallog = makecpt(range=(log(1000),log(11e6)),C=:plasma) #imshow(counties,level=joineddata.POPESTIMATE2020,cmap=cptval,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),proj=:guess,colorbar=true)
GMT.GMTcpt([0.050980392156862744 0.03137254901960784 0.5294117647058824; 0. 06274509803921569 0.027450980392156862 0.5333333333333333; … ; 0.9411764705 882353 0.9686274509803922 0.1411764705882353; 0.9411764705882353 0.94117647 05882353 0.9411764705882353], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [6.907755278982 137 6.944233429145116; 6.944233429145116 6.980711579308096; … ; 16.14044953 0436687 16.17692768059967; 16.17692768059967 16.213405830762646], [6.907755 278982137, 16.213405830762646], [0.0 0.0 0.0; 1.0 1.0 1.0; 0.50196078431372 55 0.5019607843137255 0.5019607843137255], 24, NaN, [0.050980392156862744 0 .03137254901960784 … 0.027450980392156862 0.5333333333333333; 0.06274509803 921569 0.027450980392156862 … 0.027450980392156862 0.5372549019607843; … ; 0.9450980392156862 0.9607843137254902 … 0.9686274509803922 0.14117647058823 53; 0.9411764705882353 0.9686274509803922 … 0.9764705882352941 0.1294117647 0588237], ["", "", "", "", "", "", "", "", "", "" … "", "", "", "", "", " ", "", "", "", ""], ["", "", "", "", "", "", "", "", "", "" … "", "", "", "", "", "", "", "", "", ""], "rgb", String[])
There are missing values of POPESTIMATE2020 so let's replace them with NaN so we at least still get the polygon drawn. GMT will draw NaN as a special color.
joineddata.POPESTIMATE2020 = replace(joineddata.POPESTIMATE2020,missing => NaN) GMT.plot(counties,level=log.(1.0 .+ joineddata.POPESTIMATE2020),cmap=cptvallog,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), proj=:guess,colorbar=true,figname="choroplethlog.png",title="Continental US County log(Population)")
Let's see what it looks like if we don't take the logarithm...
cptval = makecpt(range=(0,11_000_000),C=:plasma) GMT.plot(counties,level=joineddata.POPESTIMATE2020,cmap=cptval,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), proj=:guess,colorbar=true,figname="choroplethnolog.png",title="Continental US County Population")
It may be interesting to work further with the polygons that represent the counties in some additional ways. For example we could calculate the population Density by dividing the population by the area of the polygon. We can calculate the area using the function gmtspatial.
measures = gmtspatial(counties, area="k")[1] # to get area in km^2 print(typeof(measures)) length(measures) print(measures)
GMTdataset{Float64, 2}BoundingBox: [-179.10882508557128, 179.6234549210534 2, -14.542124124603314, 69.40079782409146, 0.0001071433475167924, 384291.37 54448318] 3571×3 Matrix{Float64}: -83.6009 39.1847 1446.7 -119.821 38.5971 1925.5 -87.362 30.6678 1939.81 -87.7269 38.7203 968.306 -88.6959 31.6407 2107.22 -88.2392 34.74 1151.02 -115.13 47.6745 7235.18 -73.9735 42.6001 1385.33 -77.8956 41.7446 2803.22 -93.3419 37.258 1756.51 ⋮ -91.0626 29.9009 944.669 -84.5861 40.8556 1063.1 -81.0811 38.0287 1732.05 -102.56 37.3194 6624.41 -77.7217 39.9272 2000.25 -82.3582 29.6747 2509.54 -98.6973 27.0433 2937.68 -90.538 38.2606 1722.83 -84.5497 32.0458 544.366
But some counties will consiste of several polygons, like some along the coast may include some islands, etc, to get the total area, we need to group by county identifier, and sum all the areas.
countyareas = @chain dfc begin @transform(:area = measures[:,3]) groupby([:STATE,:COUNTY]) @combine(:totarea = sum(:area)) @select(:STATE,:COUNTY,:totarea) end cpop = @chain leftjoin(cpop,countyareas,on = [:STATE, :COUNTY]) begin @transform(:density = :POPESTIMATE2020 ./ :totarea) @orderby(:STATE,:COUNTY) end print(cpop[1:10,:]) joineddata = @chain leftjoin(dfc,cpop,on = [:STATE,:COUNTY]) begin @orderby(:ORDER) end
10×7 DataFrame Row │ STATE COUNTY STNAME CTYNAME POPESTIMATE2020 totarea density │ String String String31 String Int64 Float64? Float64? ─────┼───────────────────────────────────────────────────────────────────── ─────────── 1 │ 01 001 Alabama Autauga County 56145 1560.78 35.9725 2 │ 01 003 Alabama Baldwin County 229287 4352.7 52.677 3 │ 01 005 Alabama Barbour County 24589 2346.53 10.4789 4 │ 01 007 Alabama Bibb County 22136 1622.71 13.6414 5 │ 01 009 Alabama Blount County 57879 1686.26 34.3238 6 │ 01 011 Alabama Bullock County 9976 1617.2 6.16867 7 │ 01 013 Alabama Butler County 19504 2014.3 9.68279 8 │ 01 015 Alabama Calhoun County 113469 1582.32 71.7105 9 │ 01 017 Alabama Chambers County 32865 1563.07 21.0259 10 │ 01 019 Alabama Cherokee County 26294 1554.12 16.9189
3,571 rows × 8 columns (omitted printing of 1 columns)
STATE | COUNTY | ORDER | STNAME | CTYNAME | POPESTIMATE2020 | totarea | |
---|---|---|---|---|---|---|---|
String | String | Int64 | String31? | String? | Int64? | Float64? | |
1 | 39 | 071 | 1 | Ohio | Highland County | 43304 | 1446.7 |
2 | 06 | 003 | 2 | California | Alpine County | 1119 | 1925.5 |
3 | 12 | 033 | 3 | Florida | Escambia County | 322364 | 1939.81 |
4 | 17 | 101 | 4 | Illinois | Lawrence County | 15467 | 968.306 |
5 | 28 | 153 | 5 | Mississippi | Wayne County | 20317 | 2107.22 |
6 | 28 | 141 | 6 | Mississippi | Tishomingo County | 19275 | 1151.02 |
7 | 30 | 089 | 7 | Montana | Sanders County | 12157 | 7235.18 |
8 | 36 | 001 | 8 | New York | Albany County | 303654 | 1385.33 |
9 | 42 | 105 | 9 | Pennsylvania | Potter County | 16453 | 2803.22 |
10 | 29 | 077 | 10 | Missouri | Greene County | 294997 | 1756.51 |
11 | 29 | 205 | 11 | Missouri | Shelby County | 5919 | 1298.11 |
12 | 02 | 100 | 12 | Alaska | Haines Borough | 2614 | 6272.01 |
13 | 02 | 100 | 13 | Alaska | Haines Borough | 2614 | 6272.01 |
14 | 02 | 100 | 14 | Alaska | Haines Borough | 2614 | 6272.01 |
15 | 04 | 015 | 15 | Arizona | Mohave County | 217206 | 34881.9 |
16 | 13 | 143 | 16 | Georgia | Haralson County | 30383 | 734.927 |
17 | 21 | 145 | 17 | Kentucky | McCracken County | 65644 | 693.749 |
18 | 27 | 133 | 18 | Minnesota | Rock County | 9301 | 1253.7 |
19 | 41 | 047 | 19 | Oregon | Marion County | 349204 | 3089.11 |
20 | 47 | 125 | 20 | Tennessee | Montgomery County | 214251 | 1406.65 |
21 | 48 | 243 | 21 | Texas | Jeff Davis County | 2220 | 5858.32 |
22 | 55 | 061 | 22 | Wisconsin | Kewaunee County | 20386 | 891.309 |
23 | 05 | 051 | 23 | Arkansas | Garland County | 99789 | 1902.12 |
24 | 06 | 109 | 24 | California | Tuolumne County | 54515 | 5891.94 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
denscpt = makecpt(range=(0,11e6/(100*100)),C=:plasma) GMT.plot(counties,level=Vector{Float64}(replace(joineddata.density,missing=>NaN)),cmap=denscpt,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), proj=:guess,colorbar=true,figname="choroplethdens.png",title="Continental US County Pop Density (1/km^2)")
Our approach to this question will be to sort the counties by density in decreasing order, cumsum the populations... and then create an indicator variable for whether the cumsum is less than 50% of the total, then plot this indicator variable.
totpop = sum(cpop[cpop.COUNTY .!= "000",:POPESTIMATE2020]) ## ignore county 0, that's the "whole state" print(totpop) areads = @chain cpop begin @subset(:COUNTY .!= "000") @orderby(-:density) @transform( :csum = cumsum(:POPESTIMATE2020)) @transform( :inset = :csum .< totpop/2.0,:in80set = :csum .< 0.8 * totpop) end joineddata = @chain leftjoin(dfc,areads,on=[:STATE,:COUNTY]) begin @orderby(:ORDER) end insetcmap = makecpt(range(0,1),C=:plasma) GMT.plot(counties,level=Float64.(replace(joineddata.inset,missing => NaN)),cmap=insetcmap,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), proj=:guess,colorbar=true,figname="choroplethhpdset.png",title="Continental US smallest 50% set ")
329484123
GMT.plot(counties,level=Float64.(replace(joineddata.in80set,missing => NaN)),cmap=insetcmap,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50), proj=:guess,colorbar=true,figname="choroplethhpd80set.png",title="Continental US smallest 80% set ")