Iteratively collate the output of a function into an on-disk Arrow dataset

map + arrow: iterate a function over the elements of .x and collate the results into an Arrow dataset on disk. The whole dataset is never held in memory at once, so this is suitable for large data objects. The function must return a data.frame or tibble. marrow_dir returns the path to the directory containing the Arrow dataset, marrow_ds returns the dataset itself, and marrow_files returns the paths to the files in it.
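The idea of collating results chunk by chunk can be sketched with the arrow package alone. This is only an illustration of the general technique and is not necessarily how purrrow is implemented: the helper name collate_to_dataset, the file naming scheme, and the output directory are all hypothetical.

```r
library(arrow)

# Hypothetical helper, not part of purrrow: evaluate f on one element at a
# time and write each result straight to disk, so only one chunk is in memory.
collate_to_dataset <- function(x, f, path) {
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  for (i in seq_along(x)) {
    chunk <- f(x[[i]])
    stopifnot(is.data.frame(chunk))
    # One Parquet file per chunk; the naming scheme is an assumption
    write_parquet(chunk, file.path(path, sprintf("part-%05d.parquet", i)))
  }
  path
}

out <- collate_to_dataset(
  unique(airquality$Month),
  function(m) airquality[airquality$Month == m, ],
  file.path(tempdir(), "aq_sketch")
)

# The directory of Parquet files can be opened lazily as a single dataset
ds <- open_dataset(out)
nrow(ds)
```

Because each chunk is written before the next one is computed, peak memory use is governed by the largest single result of f rather than by the full dataset.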

marrow_dir(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

marrow_ds(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

marrow_files(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

Arguments

.x	vector or list of values for .f to iterate over
.f	function; must return a data.frame/tibble
...	other arguments to .f
.path	path to the directory where the collated Arrow dataset will be stored. It will be created if it does not exist.
.partitioning	character vector of columns to use for partitioning. Columns must exist in output of .f.
.format	"parquet" (the default) or "arrow".
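As an illustration of the .partitioning and .format arguments, the call below partitions the output by a column that every result of .f contains. It is a sketch that assumes the purrrow and arrow packages are installed; the output directory name aq_by_month is made up for the example.

```r
library(purrrow)

# Each call returns the rows for one month, so every chunk has a Month column
rows_for_month <- function(month) {
  airquality[airquality$Month == month, ]
}

aq_by_month <- marrow_dir(
  unique(airquality$Month),
  rows_for_month,
  .path = file.path(tempdir(), "aq_by_month"),  # hypothetical directory name
  .partitioning = "Month",                      # must be a column of the output of .f
  .format = "parquet"                           # the default; "arrow" is the alternative
)
```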

Value

marrow_dir: path to the new dataset directory; a character string of length one.

marrow_ds: an Arrow Dataset.

marrow_files: character vector containing the paths to all files in the dataset directory.

Functions

marrow_dir: Return path to directory containing dataset
marrow_ds: Return Arrow Dataset
marrow_files: Return paths to all files in dataset dir

Examples

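# One chunk per month: each call returns that month's rows as a data.frame
months <- unique(airquality$Month)
td <- tempdir()
part_of_aq <- function(month) {
  airquality[airquality$Month == month, ]
}

# Write the collated dataset under td; aq_arrow holds the directory path
aq_arrow <- purrrow::marrow_dir(months, part_of_aq,
                                .path = td)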
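The other two variants take the same arguments and differ only in what they return. The following sketch assumes purrrow, arrow, and dplyr are installed; the output directory names are hypothetical, and the dplyr query is just one possible way to read data back from the returned Dataset.

```r
library(purrrow)

months <- unique(airquality$Month)
part_of_aq <- function(month) {
  airquality[airquality$Month == month, ]
}

# marrow_ds returns an Arrow Dataset that can be queried without loading it fully
aq_ds <- marrow_ds(months, part_of_aq,
                   .path = file.path(tempdir(), "aq_ds"))
may <- dplyr::collect(dplyr::filter(aq_ds, Month == 5))

# marrow_files returns the paths of the files written to the dataset directory
aq_files <- marrow_files(months, part_of_aq,
                         .path = file.path(tempdir(), "aq_files"))
```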