Validate model predictions against gold labels and peer agreement
Source: R/annotations.R
`validate()` is a convenience wrapper that runs both [compute_confusion_matrices()] and [intercoder_reliability()], so that a single call yields per-model confusion matrices (against gold labels and, optionally, pairwise between models) as well as Cohen's Kappa / Krippendorff's Alpha scores.
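Conceptually, one `validate()` call bundles the two underlying functions and collects their results in a list; a minimal sketch (which arguments are forwarded to each function is an assumption here, not the package's actual internals):

# Sketch only: roughly what validate() assembles
result <- list(
  confusion   = compute_confusion_matrices(annotations),  # vs gold and/or pairwise
  reliability = intercoder_reliability(annotations)       # Kappa / Alpha scores
)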
Usage
validate(
  annotations,
  gold = NULL,
  pairwise = TRUE,
  label_levels = NULL,
  sample_col = "sample_id",
  model_col = "model_id",
  label_col = "label",
  truth_col = "truth",
  method = c("auto", "cohen", "krippendorff"),
  include_confusion = TRUE,
  include_reliability = TRUE
)
Arguments
- annotations
Output from [explore()] or a compatible data frame with at least `sample_id`, `model_id`, and `label` columns.
- gold
Optional vector of gold labels. Overrides the `truth` column when supplied.
- pairwise
When `TRUE`, cross-model confusion tables are returned even if no gold labels exist.
- label_levels
Optional factor levels to enforce a consistent ordering in the resulting tables.
- sample_col, model_col, label_col, truth_col
Column names to use when `annotations` is a custom data frame (see the sketch after this list).
- method
One of `"auto"`, `"cohen"`, or `"krippendorff"`. The `"auto"` setting computes both pairwise Cohen's Kappa and Krippendorff's Alpha (when applicable).
- include_confusion
When `TRUE` (default) the confusion matrices section is included in the output.
- include_reliability
When `TRUE` (default) the intercoder reliability section is included in the output.
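For annotations stored with non-default column names, the `*_col` arguments remap them; a sketch using hypothetical column names (`doc`, `coder`, `answer`):

# Hypothetical data frame with non-default column names
custom <- data.frame(
  doc    = rep(1:3, times = 2),                # hypothetical sample column
  coder  = rep(c("llama", "qwen"), each = 3),  # hypothetical model column
  answer = c("pos", "neg", "pos", "pos", "neg", "neg")
)
result <- validate(
  custom,
  sample_col = "doc",
  model_col  = "coder",
  label_col  = "answer"
)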
Value
A list containing up to two elements: `confusion` (the full result of [compute_confusion_matrices()]) and `reliability` (the result of [intercoder_reliability()]). Elements are omitted when the corresponding `include_*` argument is `FALSE`.
Examples
# Two models annotating the same three samples, with gold labels in `truth`
annotations <- data.frame(
  sample_id = rep(1:3, times = 2),
  model_id  = rep(c("llama", "qwen"), each = 3),
  label     = c("pos", "neg", "pos", "pos", "neg", "neg"),
  truth     = rep(c("pos", "neg", "pos"), times = 2),  # same gold label for each sample_id
  stringsAsFactors = FALSE
)
result <- validate(annotations)
names(result)
#> [1] "confusion" "reliability"