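map + arrow: iterate over a function and collate the results into an Arrow dataset. Lifecycle: experimental. The results are collated without the whole dataset being held in memory, so these functions are suitable for large data objects. The function must return a data.frame or tibble. Depending on the variant, the returned value is the path to the directory containing the Arrow dataset (marrow_dir), the Arrow Dataset itself (marrow_ds), or the paths to all files in the dataset directory (marrow_files).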

marrow_dir(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

marrow_ds(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

marrow_files(.x, .f, ..., .path, .partitioning = c(), .format = "parquet")

Arguments

.x

vector or list of values for .f to iterate over

.f

function; must return a data.frame/tibble

...

other arguments to .f

.path

path to the directory where the collated Arrow dataset will be stored; it will be created if it does not exist

.partitioning

character vector of columns to use for partitioning. Columns must exist in output of .f.

.format

"parquet" (the default) or "arrow".

Value

marrow_dir: path to the new dataset directory; character string of length one.

marrow_ds: an Arrow Dataset.

marrow_files: character vector containing the paths to all files in the dataset directory.

Functions

  • marrow_dir: Return path to directory containing dataset

  • marrow_ds: Return Arrow Dataset

  • marrow_files: Return paths to all files in dataset dir

Examples

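# values to iterate over: one call to .f per month
months <- unique(airquality$Month)
td <- tempdir()

# .f: returns the rows of airquality for a single month as a data.frame
part_of_aq <- function(month) {
  airquality[airquality$Month == month, ]
}

# collate the per-month results into an Arrow dataset under td
aq_arrow <- purrrow:::marrow_dir(months, part_of_aq, .path = td)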
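
The .partitioning argument is not exercised above. The following is a minimal sketch of how it might be used, assuming the purrrow and arrow packages are installed and that the directory written by marrow_dir can be opened with arrow::open_dataset(). The subdirectory name aq_by_month is hypothetical, and the on-disk layout is not documented here, so no output is shown.

```r
# Same inputs as the example above
months <- unique(airquality$Month)
part_of_aq <- function(month) {
  airquality[airquality$Month == month, ]
}

# Guard so the sketch is skipped when the packages are not installed
if (requireNamespace("purrrow", quietly = TRUE) &&
    requireNamespace("arrow", quietly = TRUE)) {
  # Hypothetical subdirectory, so the dataset is kept apart from other temp files
  ds_path <- file.path(tempdir(), "aq_by_month")

  # Partition on Month, a column that exists in the output of .f
  aq_dir <- purrrow:::marrow_dir(months, part_of_aq,
                                 .path = ds_path,
                                 .partitioning = "Month")

  # Inspect what was written, then open it lazily as an Arrow Dataset
  print(list.files(aq_dir, recursive = TRUE))
  aq_ds <- arrow::open_dataset(aq_dir)
  print(aq_ds$schema)
}
```

Writing to a dedicated subdirectory rather than directly to tempdir() keeps unrelated temporary files out of the dataset directory when it is opened again.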