Downloads and reads the dataset identified by dataset_id.
Unzips the download if necessary. Only CSV files are loaded; for any other file type, the path to the downloaded file is returned.
Converts column types where they are known, e.g. value columns to numeric.
czso_get_table(
dataset_id,
dest_dir = NULL,
force_redownload = FALSE,
resource_num = 1
)
dataset_id: character. Found in the czso_id column of the data frame returned by czso_get_catalogue().
dest_dir: character. Directory in which downloaded files will be stored. If left unset, the czso.dest_dir option is used if it is set, and tempdir() otherwise. The directory will be created if it does not exist.
force_redownload: logical. Whether to redownload the data source file even if it is already cached. Defaults to FALSE.
resource_num: integer. Order of the resource in the resource list for the given dataset. Defaults to 1, the normal value for CZSO datasets.
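The fallback order described for dest_dir can be pictured with a small base-R sketch. Note that resolve_dest_dir() is a hypothetical helper written for illustration only; it is not exported by the package, and the package's actual implementation may differ.

```r
# Hypothetical helper mirroring the documented fallback order for dest_dir;
# not part of the czso package.
resolve_dest_dir <- function(dest_dir = NULL) {
  if (is.null(dest_dir)) {
    # Use the czso.dest_dir option if set, otherwise the session temp directory
    dest_dir <- getOption("czso.dest_dir", default = tempdir())
  }
  # Create the directory if it does not exist yet
  if (!dir.exists(dest_dir)) dir.create(dest_dir, recursive = TRUE)
  dest_dir
}

# Set the option, resolve the directory, then restore the previous setting
old <- options(czso.dest_dir = file.path(tempdir(), "czso-cache"))
cache_dir <- resolve_dest_dir()
options(old)
```

Setting the czso.dest_dir option to a directory outside tempdir() is one way to keep downloaded files between R sessions, since tempdir() is cleared when the session ends.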
a tibble, or a vector of file paths if the file is not CSV or if there are multiple files in the dataset. See Details for the columns contained in the tibble.
CZSO provides its open data as tidy data, so each row contains only one value, in the hodnota column, and the remaining columns give details on how that value is defined. See "Included columns" below on how these work.
The schema of the dataset is not yet used, so some columns may be mistyped and are by default returned as character vectors.
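Where a column comes back as character but is known to hold numbers, it can be converted by hand. The sketch below uses a toy data frame with made-up values standing in for a downloaded table; the column names follow the guide on this page.

```r
# Toy data standing in for a downloaded table; the values are made up
tbl <- data.frame(
  rok = c("2011", "2011", "2012"),
  ctvrtleti = c("1", "2", "1"),
  stringsAsFactors = FALSE
)

# Convert the columns known to hold whole numbers
tbl$rok <- as.integer(tbl$rok)
tbl$ctvrtleti <- as.integer(tbl$ctvrtleti)

str(tbl)
```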
The range of columns present in the output varies from one dataset to another, so the package does not attempt to provide English-language names for the known subset, as that would result in a jumble of Czenglish.
Instead, here is a guide to some of the common column names you will encounter:
idhod: a unique ID of the value in the CZSO database. As far as I know, this does not allow you to link to any other (meta)data, but it does provide unique identification should you need it.
hodnota: the value.
stapro_kod: code of the statistic/indicator/variable as listed in the SMS UKAZ register (https://www.czso.cz/csu/czso/statistical-variables-indicators); this register has Czech-English documentation, which you can access by clicking the UK flag at the top right. You can also get a data table with the definitions if you search for "statistické proměnné" in the title field of the catalogue. Last I checked, the ID of this table was "990124-17".
rok: the year as YYYY.
ctvrtleti: the quarter, if available.
Other metadata will come in the form {variable}_[txt|cis|kod]. The _txt column holds the Czech text name for the category. The _cis column holds the ID of the codelist (register) you need to decode the code in _kod.
The English codelists are at http://apl.czso.cz/iSMS/en/cislist.jsp, the Czech ones at http://apl.czso.cz/iSMS/cs/cislist.jsp.
You can find the Czech-language codelists in the catalogue retrieved with czso_get_catalogue(), where their IDs begin with "cis" followed by the number; the English ones can also be retrieved from the link above using a permalink URL.
More conveniently, you can use the czso_get_codelist() function to retrieve the codelist.
Units are denoted in a separate column.
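The {variable}_[txt|cis|kod] naming pattern described above can be used to work out which variables a table contains. The following is a minimal base-R sketch, using the column names from the example output on this page; the matching ignores case because that example mixes upper and lower case.

```r
# Column names taken from the example output on this page
cols <- c("idhod", "hodnota", "stapro_kod", "SPKVANTIL_cis", "SPKVANTIL_kod",
          "POHLAVI_cis", "POHLAVI_kod", "rok", "uzemi_cis", "uzemi_kod",
          "STAPRO_TXT", "uzemi_txt", "SPKVANTIL_txt", "POHLAVI_txt")

# Match the _txt / _cis / _kod suffixes regardless of case
suffix <- "_(txt|cis|kod)$"
meta_cols <- cols[grepl(suffix, cols, ignore.case = TRUE)]

# Strip the suffix and normalise case to get the underlying variables
variables <- unique(toupper(sub(suffix, "", meta_cols, ignore.case = TRUE)))
variables

# All columns belonging to one variable, here gender (pohlavi)
pohlavi_cols <- cols[grepl("^pohlavi_", cols, ignore.case = TRUE)]
pohlavi_cols
```

The value in a _cis column (102 for POHLAVI in the example output) is the codelist ID that can then be looked up with czso_get_codelist().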
A helper on common breakdowns with their associated columns:
uzemi: territory
vek: age
pohlavi: gender
NAs in "breakdown" columns (e.g. gender or age) denote the total.
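Because NA marks the total in a breakdown column, totals and breakdown rows can be separated with is.na(). The sketch below uses a toy data frame holding three rows and three columns copied from the example output on this page, rather than a live download.

```r
# Three rows copied from the example output on this page (selected columns)
tbl <- data.frame(
  hodnota = c(25625, 28431, 22133),
  POHLAVI_cis = c(NA, "102", "102"),
  POHLAVI_kod = c(NA, "1", "2"),
  stringsAsFactors = FALSE
)

# Rows where the gender code is NA hold the total across genders
totals <- tbl[is.na(tbl$POHLAVI_kod), ]

# Rows with a gender code hold the breakdown by gender
by_gender <- tbl[!is.na(tbl$POHLAVI_kod), ]
```

In a real table with several breakdown columns, the same check needs to be applied to each of them to isolate the overall total.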
Do not use this for harvesting datasets from CZSO en masse.
Other Core workflow: czso_filter_catalogue(), czso_get_catalogue(), czso_get_codelist()
# \donttest{
czso_get_table("110080")
#> # A tibble: 1,170 × 14
#> idhod hodnota stapro_kod SPKVANTIL_cis SPKVANTIL_kod POHLAVI_cis POHLAVI_kod
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
#> 1 73662… 21782 5958 7636 Q5 NA NA
#> 2 73662… 25625 5958 NA NA NA NA
#> 3 73662… 28431 5958 NA NA 102 1
#> 4 73662… 22133 5958 NA NA 102 2
#> 5 73662… 23533 5958 7636 Q5 102 1
#> 6 73662… 19731 5958 7636 Q5 102 2
#> 7 74595… 26033 5958 NA NA NA NA
#> 8 74595… 28873 5958 NA NA 102 1
#> 9 74595… 22496 5958 NA NA 102 2
#> 10 74595… 21997 5958 7636 Q5 NA NA
#> # ℹ 1,160 more rows
#> # ℹ 7 more variables: rok <int>, uzemi_cis <chr>, uzemi_kod <chr>,
#> # STAPRO_TXT <chr>, uzemi_txt <chr>, SPKVANTIL_txt <chr>, POHLAVI_txt <chr>
# }