The high-level functions that form the core of {statnipokladna} lend themselves most easily to a workflow optimised for creating results based on the latest state of the data provided by the data published.
Depending on your objectives, you might prefer a workflow geared for speed (minimising downloads) or for reproducibility (tracking where exactly data came from and when, and perhaps keeping copies of raw input data inside your project).
This matters for two reasons. First, you might have different reproducibility needs. Second, you might want to hedge against undocumented changes made to the data structures
In this basic workflow, you use
sp_get_table()
sp_get_codelist()
/sp_add_codelist
without worrying about where the input data is downloaded to
(i.e. the dest_dir
parameter and the corresponding
statnipokladna.dest_dir
global option, together with the
redownload
parameter). This means:
tempdir()
and hence
redownloaded in every new sessionThat said, {statnipokladna} does not yet allow you to download the latest edition of a table (e.g. “get me the latest balance sheet for organisation X that is available”), so you will normally get the latest version of a given period’s data, as opposed to getting the latest period’s data. The latter option would be quite easy to script though. (Note that there is an unofficial API at https://monitor.statnipokladna.cz/api/transakcni-data and an official (but less convenient one) at https://monitor.statnipokladna.cz/data/dataset.json and also a linked-data endpoint at https://opendata.mfcr.cz/ - these would allow you to programmatically identify the latest available time period for a given table/dataset, and this logic should over time make its way into {statnipokladna} as well.)
Set dest_dir
or the statnipokladna.dest_dir
global option in your project-level or user-level .Rprofile
to:
…and leave the redownload
param at its default of
FALSE
.
Both limit redownloading the same data.
In the first scenario, you improve reproducibility at the expense of disk space and downloading the same data for each project.
In the second scenario, you save disk space but are vulnerable to
using outdated data downloaded earlier by one project in another
project, or, if you occasionally change the redownload
parameter, updating your “central” datastore and thus breaking your
other project(s).
A good compromise might be storing codelists on a per-project basis and keeping the “transactional” data in a central per-machine location. This is because breaking changes upstream have most often been made to codelists, and these are also harder to recover from.
Especially when storing data locally inside a project, you will want to keep track of package versions with {renv}, as {statnipokladna} sometimes must introduce breaking changes to adapt to upstream changes made by the data provider.
The workflow most suited for rigorous reproducibility will include the following principles:
{statnipokladna} has a set of lower-lever functions that allow you to proceed step by step, keeping track of intermediate data files, URLs and paths.
For codelists:
sp_cl_url <- sp_get_codelist_url("paragraf")
sp_cl_path <- sp_get_codelist_file(url = sp_cl_url)
sp_cl <- sp_load_codelist(sp_cl_path)
For data:
rozv_url <- sp_get_dataset_url("rozv", 2019, 12,
check_if_exists = T)
rozv_path <- sp_get_dataset("rozv", 2019, 12)
rozv_stat_path <- sp_get_table_file("balance-sheet",
rozv_path)
rozv_stat <- sp_load_table(rozv_stat_path)
This should be complemented by setting the
statnipokladna.dest_dir
option in a project-level
.Rprofile
file to a directory inside the project,
e.g. options(statnipokladna.dest_dir = "sp-data")
.
Note that we keep redownload
set to its default of
FALSE
to avoid being surprised by changes to the upstream
data on the provider’s server. (It is up to you whether to commit the
downloaded data to version control - in the case of some core codelists,
this would make sense.)
This step-by-step workflow is suitable for being turned into an explicit pipeline e.g. using the {targets} framework.
Below is the content of a hypothetical _targets.R
file
that will allow you to keep track of the pipeline for using codelists,
from initial function calls, via URLs to the resulting objects.
library(targets)
library(tarchetypes)
tar_option_set(packages = c("dplyr", "statnipokladna"))
# keep downloaded data in project directory
options(statnipokladna.dest_dir = "sp_data")
codelist_names <- c("druhuj", "poddruhuj", "nuts",
"paragraf", "paragraf_long",
"polozka", "polvyk")
targets_codelists <- list(
tar_target(codelists, codelist_names),
# track changes at URL via {targets}/{tarchetypes}
tar_url(sp_cl_urls, sp_get_codelist_url(codelists),
pattern = map(codelists)),
# track changes to file: if deleted/changed, redownload it
tar_file(sp_cl_paths,
sp_get_codelist_file(url = sp_cl_urls),
pattern = map(sp_cl_urls)),
# keep all codelists in one list tracked by {targets}:
tar_fst_tbl(sp_cl, sp_load_codelist(sp_cl_paths),
pattern = map(sp_cl_paths), iteration = "list")
# here, you might want to save the codelists to individual
# Rds or parquet files and track those via {targets}
)
list(targets_codelists)
And for data, it might look like this:
(Note: still need a branching example here)
library(targets)
library(tarchetypes)
# keep downloaded data in project directory
options(statnipokladna.dest_dir = "sp_data")
tar_option_set(packages = c("dplyr", "statnipokladna"))
# Define targets
targets_spdata <- list(
tar_target(d_year, 2019),
tar_target(d_month, 12),
tar_target(d_id, "rozv"),
tar_url(d_url, sp_get_dataset_url(d_id, d_year, d_month)),
tar_file(d_file, {is.character(d_url)
sp_get_dataset(d_id, d_year, d_month)}), # to make sure target runs when data at URL changes
tar_target(table_file, sp_get_table_file("balance-sheet", d_file), format = "file"),
tar_target(table_praha, sp_load_table(table_file, "00064581"))
# here, you might want to save the data to individual
# Rds or parquet files (or an Arrow dataset, see below) and track those via {targets}.
)
list(targets_spdata)
Note that we still keep redownload
set to its default of
FALSE
. This means that when the upstream data changes
(i.e. a request to the URL indicates that the file has changed on the
server), the sp_cl_url
target will run (thus notifying you
of an upstream change) but the XML file in your dest_dir
will not be overwritten. You can then choose to get an updated version
of that codelist from the server by changing the redownload
parameter.
The targets workflow will also improve performance of codelist
reading, since codelists will be stored in {targets}’ fast datastore. In
contrast, sp_get_codelist()
must reread the XML file every
time it is called, so even if the XML file had already been downloaded
by previous function call, there is some overhead coming from the XML
parsing.
You could also use {targets} to track the more high-level workflow in part 2 above, but you would have less visiblity inside the targets workflow of intermediate URLs and files.
For data files, the pipeline might look like this:
TODO
In a more advanced approach, the branching logic in {targets} could be used to handle the time periods as well.