Introduction

The pointblankops package provides specialized data validation operations using lightweight operatives for focused intelligence gathering. Operatives are streamlined alternatives to pointblank agents, designed for efficient row-level failure detection without the overhead of full reporting capabilities.

The package targets the following use case:

  • The data is large and may not fit in memory.
  • Tests are run on the data to find out which rows fail which test, because downstream analyses exclude different rows depending on their purpose.
  • Per-row validation results are therefore needed for post-processing.

Extracting this information from an interrogated agent is tedious and memory-intensive.

To preserve memory and allow working on large datasets, operatives focus on extracting validation failures directly, without the full reporting overhead of pointblank agents.

  • Per-row validation results are returned in a tidy format, making it easy to integrate with other data processing workflows.
  • They can be stored directly in a database or saved to a file format like Parquet for further analysis, all done efficiently with minimal memory footprint.
  • Validation failure information can also be returned as a tibble for immediate use in R.

Creating Operatives

Operatives are created using the create_operative() function, which is a lightweight wrapper around pointblank’s create_agent():

library(pointblank)
library(pointblankops)

# Create test data
test_data <- data.frame(
  batch = c("A", "A", "B", "B", "C"),
  id = c(1, 2, 3, 4, 5),
  value = c(10, NA, 15, 8, 12),
  category = c("X", "Y", "X", "Z", "Y")
)

# Create an operative
operative <- create_operative(test_data, tbl_name = "test_data", label = "Test Operative")
operative

Adding Validation Steps

Just like pointblank agents, operatives can have validation steps added to them:

operative <- operative |>
  col_vals_not_null(columns = vars(value)) |>
  col_vals_between(columns = vars(value), left = 5, right = 20) |>
  col_vals_in_set(columns = vars(category), set = c("X", "Y", "Z"))

Debriefing Operatives

The core functionality is the debrief() function, which extracts only the validation failures:

# Get failures as a tibble
failures <- debrief(operative, row_id_col = c("batch", "id"))
failures
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details           
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>                     
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …
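
Because the result holds one row per failing row and test, it can drive the purpose-specific exclusions described in the introduction. The following is a minimal base R sketch under stated assumptions: the failures data frame is typed in by hand to mirror the output above rather than produced by debrief(), and exclude_failures() is a hypothetical helper, not part of pointblankops.

```r
# Test data from the example above
test_data <- data.frame(
  batch = c("A", "A", "B", "B", "C"),
  id = c(1, 2, 3, 4, 5),
  value = c(10, NA, 15, 8, 12),
  category = c("X", "Y", "X", "Z", "Y")
)

# Hand-written stand-in for the debrief() output shown above
# (id is a character column there, so it is character here too)
failures <- data.frame(
  batch = "A",
  id = "2",
  test_name = "step_1",
  test_type = "col_vals_not_null",
  column_name = "value"
)

# Hypothetical helper: drop the rows that failed any of the given tests
exclude_failures <- function(data, failures, tests, keys = c("batch", "id")) {
  bad <- failures[failures$test_name %in% tests, keys, drop = FALSE]
  # Build a composite text key so numeric and character ids compare equal
  data_key <- do.call(paste, c(lapply(data[keys], as.character), sep = "\r"))
  bad_key <- do.call(paste, c(lapply(bad[keys], as.character), sep = "\r"))
  data[!data_key %in% bad_key, , drop = FALSE]
}

# An analysis that needs non-missing values excludes step_1 failures
strict <- exclude_failures(test_data, failures, tests = "step_1")

# An analysis that does not depend on step_1 keeps every row
lenient <- exclude_failures(test_data, failures, tests = character(0))
```

The same failures table serves both analyses; only the set of tests passed to the helper changes.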

Output Options

The debrief() function supports multiple output formats:

1. Return as Tibble (default)

failures <- debrief(operative, row_id_col = c("batch", "id"))
failures
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details           
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>                     
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …

2. Save to Parquet File

debrief(operative, 
        row_id_col = c("batch", "id"), 
        parquet_path = "validation_failures.parquet")
arrow::read_parquet("validation_failures.parquet")
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details           
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>                     
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …

3. Save to Database

con <- DBI::dbConnect(duckdb::duckdb(), ":memory:")

# Copy test data to database
DBI::dbWriteTable(con, "test_data", test_data)

# Create operative from database table
db_operative <- create_operative(dplyr::tbl(con, "test_data")) |>
  col_vals_not_null(columns = vars(value)) |>
  col_vals_between(columns = vars(value), left = 5, right = 20)

# Save failures to database table
debrief(db_operative, 
        row_id_col = c("batch", "id"), 
        con = con, 
        output_tbl = "validation_failures")
dplyr::tbl(con, "validation_failures") |>
  dplyr::collect()
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details           
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>                     
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …

Memory Efficiency

For large datasets, debrief() processes data in chunks to maintain memory efficiency:

# Process in smaller chunks for memory efficiency
failures <- debrief(operative, 
                   row_id_col = c("batch", "id"),
                   chunk_size = 500)  # Process 500 rows at a time
failures
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details           
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>                     
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …
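
The general idea of chunked processing can be illustrated in base R. This is a sketch of the technique only, not the package's implementation: find_null_failures_chunked() is a hypothetical helper that checks a single column for missing values, chunk_size rows at a time, and keeps only the identifying columns of the failing rows.

```r
# Hypothetical helper: scan `data` in chunks and return the key columns
# of the rows where `column` is missing
find_null_failures_chunked <- function(data, column, keys, chunk_size = 500) {
  n <- nrow(data)
  if (n == 0) return(data[0, keys, drop = FALSE])
  starts <- seq(1, n, by = chunk_size)
  pieces <- lapply(starts, function(start) {
    end <- min(start + chunk_size - 1, n)
    chunk <- data[start:end, , drop = FALSE]
    # Keep only the identifiers of failing rows, not the whole chunk
    chunk[is.na(chunk[[column]]), keys, drop = FALSE]
  })
  do.call(rbind, pieces)
}

test_data <- data.frame(
  batch = c("A", "A", "B", "B", "C"),
  id = c(1, 2, 3, 4, 5),
  value = c(10, NA, 15, 8, 12),
  category = c("X", "Y", "X", "Z", "Y")
)

# A chunk size of 2 splits the five rows into three chunks
null_failures <- find_null_failures_chunked(
  test_data, "value", keys = c("batch", "id"), chunk_size = 2
)
```

Only the accumulated failures grow with the size of the data; each chunk can be discarded once it has been checked, which is what keeps the footprint small when failures are rare.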

Database Compatibility

Operatives work seamlessly with database tables via dbplyr:

con <- DBI::dbConnect(duckdb::duckdb(), ":memory:")
DBI::dbWriteTable(con, "large_table", test_data)

# Create operative from database table
db_operative <- create_operative(dplyr::tbl(con, "large_table")) |>
  col_vals_not_null(columns = vars(value)) |>
  col_vals_gt(columns = vars(value), value = 8)

# Debrief processes the query efficiently in the database
failures <- debrief(db_operative, row_id_col = c("batch", "id"))
failures
#> # A tibble: 2 × 6
#>   batch id    test_name test_type         column_name failure_details           
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>                     
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …
#> 2 B     4     step_2    col_vals_gt       value       Failed col_vals_gt on col…

Supported Validation Types

The following pointblank validation functions are supported:

  • col_vals_not_null() / col_vals_null()
  • col_vals_between() / col_vals_not_between()
  • col_vals_in_set() / col_vals_not_in_set()
  • col_vals_gt() / col_vals_gte() / col_vals_lt() / col_vals_lte()
  • col_vals_equal() / col_vals_not_equal()

Unsupported validation types are automatically skipped with a message.
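
Before building an operative, a planned set of steps can be compared against the list above to see which ones would be skipped. The sketch below is plain base R and does not call pointblankops; supported_validations is copied from the list above, and planned_steps is a hypothetical plan chosen for illustration.

```r
# Validation functions listed as supported above
supported_validations <- c(
  "col_vals_not_null", "col_vals_null",
  "col_vals_between", "col_vals_not_between",
  "col_vals_in_set", "col_vals_not_in_set",
  "col_vals_gt", "col_vals_gte", "col_vals_lt", "col_vals_lte",
  "col_vals_equal", "col_vals_not_equal"
)

# Hypothetical plan: names of the pointblank steps we intend to add
planned_steps <- c(
  "col_vals_not_null", "col_vals_regex", "col_vals_gt", "rows_distinct"
)

# Steps that are not on the supported list and would therefore be skipped
would_be_skipped <- setdiff(planned_steps, supported_validations)
```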