Note for Czech speakers: a previous note covering similar information, sometimes in more depth, is available in the Czech guide to the data.
This vignette provides background on the data that
statnipokladna
provides access to. Make sure you read the
relevant parts before analysing the data: it can be quite complicated
and it is easy to make mistakes and end up with wrong numbers.
The data published on https://monitor.statnipokladna.cz/datovy-katalog/ are exports from a much larger database used by Státní pokladna (SP), a comprehensive budgeting and financial/accounting reporting system for most Czech public organisations.
In some ways the data is a secondary product not always dogfooded by the data maintainers, so sometimes (though not often) there are small inconsistencies on formats and content.
Several datasets are not published in open data but are available in the analytical tool (only in Czech) or the web-based overviews. This relates to data on budget preparation and budgetary responsibility, respectively.
How data for different time periods is exported differs by dataset. This has significant implications for how you get to usable full-year numbers or time series in different tables.
For budget datasets finm
and finu
(tables
budget-central-old*
and budget-local*
) and all
accounting reports, the data dump for month N will contain data
describing what was reported at month M. The period_vykaz
column will contain only one unique value. (For budget data, the
budget_adopted
column does not change throughout the year,
but the others do. The budget_spending
column contains
spending up to month M.) To see time series of spending throughout a
year, you need to download multiple datasets (with
sp_get_table()
). To get a full-year’s spending, run
sp_get_table("table_id", year = "YYYY")
to get the 12th
month.
In the newer central budget reports, budget-central
(dataset misris
) and budget-state-funds
(dataset finsf
), data for month M will contain separate
rows for spending in (not up to) each month of the
year up to month M. To get cumulative spending up to month M in year Y,
you need to only download the dataset for month M and then sum the rows
in that table. To get a full-year’s spending, run
sp_get_table("table_id", year = "YYYY")
to get the 12th
month’s release, containing data for all months, then sum the
values across the months contained in the resulting data frame.
If you need consolidation (i.e. summing up data from multiple
organisations), you need to do this before summing anything
up.
Zooming in on the data that is available from the open data dumps, there are two kinds:
Codelists are generally long lists of metadata items. These allow you to “translate” codes contained in return data. Codes identify individual lines, telling you which categories the value in that line belongs to, often across multiple categorisations (see Working with budget data.)
One thing to bear in mind is that codelist data contains all items in that codelist, regardless of the time periods for which the item is applicable. That means that for a given code ID in a codelist, there may be duplicates distinguishable only by their validity range.
The sp_add_codelist()
function tries to take care of
this for you, but may not success if there are oddities in the data,
e.g. one code with two items valid for the same date. In that case you
get an informative error message; you should then resolve the duplicates
manually using data returned by sp_get_codelist()
and
supply the edited data to sp_add_codelist()
as an
object.
Note also that there are two kinds of codelists: some can be joined
to core data, others are joined to another codelist, in effect decoding
some items in that codelist. Currently the package cannot handle the
latter type automatically, so you need to save the output of
sp_get_codelist()
into an object and joit it yourself using
dplyr
.
You can see the list of available codelists at https://monitor.statnipokladna.cz/datovy-katalog/ciselniky
and in sp_codelists
. When I refer to codelists below, I use
the ID which you can supply as the "codelist_id"
parameter
to sp_add_codelist()
.
The codelist that you will generally want for comparing organisations is called “ucjed”; this contains metadata on all organisations covered by the data. It is huge so you will want to store it somewhere after you have retrieved (and processed and filtered if relevant.)
Note that some codelists can only be joined onto other codelists,
e.g. the ucjed
codelist, which contains most metadata on
reporting organisations, has several columns named
somethings_id
. Those columns can be “decoded” using using
the respective something
codelist.
Return data come broadly in two kinds: budgetary and
accounting - see below for how to work with each of them, as the logic
differs markedly. Return data come in ZIP files, which can contain on or
more files, each with a return (I call them tables for more generality,
hence the sp_get_table()
function.)
You can see the list of returns which statnipokladna
can
process in the sp_tables
data frame.
While the sp_get_table()
returns a data frame with
column names understandable in English, codelists only exist in
Czech.
You should not have to worry about this if you use the
sp_get_table()
function, but it explains what you see when
you use sp_get_dataset()
directly.
Generally, each CSV file will contain data for all organisations which provided the return; sometimes this will be split into two files, as in some accounting returns.
The column named ico
in the output of
sp_get_table()
is the unique identifier of an
organisation.
Often, there will be multiple returns in one dataset (ZIP file), each in a differently named CSV file. This is particularly the case for budget data.
All CSV files contain some common columns, notably those identifying the organisation, its geography, and its place in the public sector. Note that the geography is derived from the seat of the organisation, not the location of the spending.
You can see the list of datasets (ZIP files) which
statnipokladna
can process in the sp_datasets
data frame.
The more recent central budgets stored in the
budget-central
table (misris
dataset) require
a different codelist for sectoral (“paragraf”) breakdown. Use the
paragraf_long
codelist, not paragraf
. If you
use paragraf
, sp_add_codelist()
will
technically work but the joined columns will be empty.
Put simply, the data covers organisations which report into the SP system, which broadly means all public organisations (I am sure there is a legal definition and I am sure there are exceptions, but I do not know the detail.) There are codelists “druhuj” (organisation type) and “poddruhuj” (subtype) which let you see what type an organisation belongs to, and “forma” will let you know about its legal form.
This means the data covers both state (I call them “central” in
sp_tables
) and local organisations incl. all 6000+
municipalities; and both public organisations themselves and their
related organisations (both “podřízené” and “příspěvkové”,
i.e. subsidiary and sponsored, the latter being more independent).
Commercial entities owned by public organisations are included at least
as codelists items and presumably at least their accounting returns will
be included in the data, but I have not researched whether there are
cut-offs as to what stake counts as ownership etc.
State enterprises (a special legal form) are included. And Budvar, the Czech state brewery which also has a special legal form (national company) also makes an appearance in the codelist… NB: I am not sure which return data, if any, they are included in.
State funds - a special kind of non-commercial public organisation which disburses money for specific purposes, such as transport infrastructure, are included.
One special distinction to note is Chapters (kapitola) and OSS (organizační složky státu). Chapters are top-level budget lines, typically managed by a ministry but including many organisations. OSS are quasi-organisations, some are ministries, some not, and some of which manage chapters.
Report/return data is published in differing periodicities, depending on the kind of return and the kind of organisation.
Generally:
See the list of currently available data releases at https://monitor.statnipokladna.cz/datovy-katalog/transakcni-data.
A ZIP file is published for each time period of each data dump.
sp_get_table()
handles this for you and returns one long
data frame containing data for all time periods in the year
or month
parameters. In the resulting data frame, the
per*
columns mark the time period of the return.
The data becomes available approx. 3 months after the end of each period; budget data seems to be available more quickly than accounting data.
Budget data has a special form: the available data files provide all sorts of breakdowns in a single file, in long format, crossed between each other. This means that any individual value will make little sense - it will be e.g. money from a particular sector, from a particular source, either capital or current. Each of these categorisations - of which there are a few more - have multiple levels of detail.
What this means is that to get a meaningful number, you need to do a lot of summarising. It also means that if you are only interested in, say, the capital spend of an organisation, you will need to:
ico
sp_add_codelist("polozka")
for the “druhové členění”,
roughly meaning the capital x current breakdownpolozka
categorisationUnless you are interested in further detail, you do not need to add any other codelists.
The sectoral breakdown (“paragrafy”) is contained in codelist “paragraf” and the functional breakdown is in “polozka” (not to be confused with “polvyk”).
The budget datasets contain columns with monetary values at each
phase of the budgetary cycle: in the output of
sp_get_table()
, these are budget_adopted
for
the original budget, budget_amended
for plan after
amendments, budget_final
(where available) and
budget_spending
for the final reported spend. The first of
these does not change throughout the year.
If you are summing data across multiple organisations, you will need to take care of consolidation. Typically this concerns relations between levels of government, e.g. a region gives grants to municipalities which then spend them, and consolidation ensures you only count the money once as it flows outside the public sector.
The metadata allowing consolidation is contained in the “polozka”
codelist in columns kon_*
. These are TRUE/FALSE and you
consolidate data by excluding certain levels (i.e. filtering out items
for which that kon_*
column is TRUE).
This is necessary even for seemingly smaller entities, like some municipalities, if they have any organisations which they establish, like schools - because those report their own money. The correct way to get their budgetary figures is to filter for their geography, then summarise across all organisations included in that geography, and consolidate.
Note that for the more recent central budget data (dataset
misris
, tablebudget=central
) you need to join thepolozka
codelist to the raw data (i.e. containing rows for individual months), not to a table that has been summed by year.
Once done, double check your sum with the figure published on the Monitor if available.
Accounting data is a bit more straightforward as it follows generally known accounting practices, i.e. you will find balance sheets, profit-and-loss accounts, etc.
The only specialised codelist that you will need specifically for accounting data is called to make sense of this data ‘polvyk’; it contains something akin to a chart of accounts.
As noted above, not all open data dumps from SP can currently be
processed by statnipokladna
. See sp_datasets
for a list of those which can be.
As note above, some data available on the web presentation or analysis tool at https://monitor.statnipokladna.cz/ are not accessible in a documented way as open data:
In addition, there are some kinds of information which are held in other datasets:
As far as I am aware, there is no consistent dataset on the geographical breakdown of public/state spending by place of actual spend.