```r
library(pointblankops)

# Create test data
test_data <- data.frame(
  batch = c("A", "A", "B", "B", "C"),
  id = c(1, 2, 3, 4, 5),
  value = c(10, NA, 15, 8, 12),
  category = c("X", "Y", "X", "Z", "Y")
)

# Create an operative
operative <- create_operative(test_data, tbl_name = "test_data", label = "Test Operative")
```
## Introduction

The pointblankops package provides specialized data validation operations using lightweight operatives for focused intelligence gathering. Operatives are streamlined alternatives to pointblank agents, designed for efficient row-level failure detection without the overhead of full reporting capabilities.
This addresses the following use case:

- the data is large and may not fit in memory
- we run tests on the data to record which rows fail which test, because downstream we exclude different rows in different situations depending on the purpose of the analysis
- so we need per-row validation results to use in post-processing

Extracting this information from an interrogated agent is tedious and memory-intensive. To preserve memory and allow working on large datasets, operatives extract validation failures directly, without the full reporting overhead of pointblank agents.
- Per-row validation results are returned in a tidy format, making them easy to integrate with other data-processing workflows.
- Failures can be written directly to a database table, or saved to a file format such as Parquet, with a minimal memory footprint.
- Failure information can also be returned as a tibble for immediate use in R.
## Creating Operatives

Operatives are created using the `create_operative()` function, a lightweight wrapper around pointblank's `create_agent()`:

```r
operative
```
## Adding Validation Steps

Just like pointblank agents, operatives can have validation steps added to them:
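For example, a minimal sketch using the same pointblank verbs that appear in the later examples (the column choices here are illustrative):

```r
# Pipe validation steps onto the operative, pointblank-style
operative <- operative |>
  col_vals_not_null(columns = vars(value)) |>
  col_vals_between(columns = vars(value), left = 5, right = 20) |>
  col_vals_in_set(columns = vars(category), set = c("X", "Y", "Z"))
```

Each step is recorded on the operative and evaluated later, at debrief time.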
## Debriefing Operatives

The core functionality is the `debrief()` function, which extracts only the validation failures:

```r
failures <- debrief(operative, row_id_col = c("batch", "id"))
failures
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …
```
## Output Options

The `debrief()` function supports multiple output formats:

### 1. Return as Tibble (default)

```r
failures <- debrief(operative, row_id_col = c("batch", "id"))
failures
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …
```
### 2. Save to Parquet File

```r
read_parquet("validation_failures.parquet")
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …
```
### 3. Save to Database

```r
con <- DBI::dbConnect(duckdb::duckdb(), ":memory:")

# Copy test data to database
DBI::dbWriteTable(con, "test_data", test_data)

# Create operative from database table
db_operative <- create_operative(dplyr::tbl(con, "test_data")) |>
  col_vals_not_null(columns = vars(value)) |>
  col_vals_between(columns = vars(value), left = 5, right = 20)

# Save failures to a database table
debrief(db_operative,
        row_id_col = c("batch", "id"),
        con = con,
        output_tbl = "validation_failures")
```
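To confirm the write, the stored failures can be read back with standard DBI calls (a usage sketch; the table name matches the `output_tbl` argument above):

```r
# Read the stored failures back into R
DBI::dbReadTable(con, "validation_failures")

# Clean up the connection when done
DBI::dbDisconnect(con)
```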
## Memory Efficiency

For large datasets, `debrief()` processes data in chunks to maintain memory efficiency:

```r
failures
#> # A tibble: 1 × 6
#>   batch id    test_name test_type         column_name failure_details
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …
```
## Database Compatibility

Operatives work seamlessly with database tables via dbplyr:

```r
con <- DBI::dbConnect(duckdb::duckdb(), ":memory:")
DBI::dbWriteTable(con, "large_table", test_data)

# Create operative from database table
db_operative <- create_operative(dplyr::tbl(con, "large_table")) |>
  col_vals_not_null(columns = vars(value)) |>
  col_vals_gt(columns = vars(value), value = 8)

# Debrief processes the query efficiently in the database
failures <- debrief(db_operative, row_id_col = c("batch", "id"))
failures
#> # A tibble: 2 × 6
#>   batch id    test_name test_type         column_name failure_details
#>   <chr> <chr> <chr>     <chr>             <chr>       <chr>
#> 1 A     2     step_1    col_vals_not_null value       Failed col_vals_not_null …
#> 2 B     4     step_2    col_vals_gt       value       Failed col_vals_gt on col…
```
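These per-row results slot directly into the exclusion workflow described in the introduction. A sketch using plain dplyr, assuming the `failures` tibble returned above (note that the `id` column comes back as character, so it is coerced before joining):

```r
library(dplyr)

# Exclude every row that failed any test
clean_all <- test_data |>
  mutate(id = as.character(id)) |>
  anti_join(failures, by = c("batch", "id"))

# For a different analysis, exclude only rows failing the not-null check
clean_nonnull <- test_data |>
  mutate(id = as.character(id)) |>
  anti_join(
    filter(failures, test_type == "col_vals_not_null"),
    by = c("batch", "id")
  )
```

Because different analyses filter on different `test_type` values, the same debrief output can drive several exclusion rules without re-running validation.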
## Supported Validation Types

The following pointblank validation functions are supported:

- `col_vals_not_null()` / `col_vals_null()`
- `col_vals_between()` / `col_vals_not_between()`
- `col_vals_in_set()` / `col_vals_not_in_set()`
- `col_vals_gt()` / `col_vals_gte()` / `col_vals_lt()` / `col_vals_lte()`
- `col_vals_equal()` / `col_vals_not_equal()`

Unsupported validation types are automatically skipped with a message.