Validate model predictions against gold labels and peer agreement
Source: R/annotations.R
`validate()` is a convenience wrapper that runs both [compute_confusion_matrices()] and [intercoder_reliability()], so that a single call yields per-model confusion matrices (against gold labels and, optionally, pairwise between models) as well as Cohen's Kappa / Krippendorff's Alpha scores.
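Conceptually, one `validate()` call bundles the two underlying functions and collects their results in a list; a minimal sketch (which arguments are forwarded to each function is an assumption here, not the package's actual internals):

# Sketch only: roughly what validate() assembles
result <- list(
  confusion   = compute_confusion_matrices(annotations),  # vs gold and/or pairwise
  reliability = intercoder_reliability(annotations)       # Kappa / Alpha scores
)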
Usage
validate(
  annotations,
  gold = NULL,
  pairwise = TRUE,
  label_levels = NULL,
  sample_col = "sample_id",
  model_col = "model_id",
  label_col = "label",
  truth_col = "truth",
  method = c("auto", "cohen", "krippendorff"),
  include_confusion = TRUE,
  include_reliability = TRUE
)
Arguments
- annotations
Output from [explore()] or a compatible data frame with at least `sample_id`, `model_id`, and `label` columns.
- gold
Optional vector of gold labels. Overrides the `truth` column when supplied.
- pairwise
When `TRUE`, cross-model confusion tables are returned even if no gold labels exist.
- label_levels
Optional factor levels to enforce a consistent ordering in the resulting tables.
- sample_col, model_col, label_col, truth_col
Column names to use when `annotations` is a custom data frame (see the sketch after this list).
- method
One of `"auto"`, `"cohen"`, or `"krippendorff"`. The `"auto"` setting computes both pairwise Cohen's Kappa and Krippendorff's Alpha (when applicable).
- include_confusion
When `TRUE` (default) the confusion matrices section is included in the output.
- include_reliability
When `TRUE` (default) the intercoder reliability section is included in the output.
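For annotations stored with non-default column names, the `*_col` arguments remap them; a sketch using hypothetical column names (`doc`, `coder`, `answer`):

# Hypothetical data frame with non-default column names
custom <- data.frame(
  doc    = rep(1:3, times = 2),                # hypothetical sample column
  coder  = rep(c("llama", "qwen"), each = 3),  # hypothetical model column
  answer = c("pos", "neg", "pos", "pos", "neg", "neg")
)
result <- validate(
  custom,
  sample_col = "doc",
  model_col  = "coder",
  label_col  = "answer"
)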
Value
A list containing up to two elements: `confusion` (the full result of [compute_confusion_matrices()]) and `reliability` (the result of [intercoder_reliability()]). Elements are omitted when the corresponding `include_*` argument is `FALSE`.
Examples
# Two models annotating the same three samples, with gold labels in `truth`
annotations <- data.frame(
  sample_id = rep(1:3, times = 2),
  model_id  = rep(c("llama", "qwen"), each = 3),
  label     = c("pos", "neg", "pos", "pos", "neg", "neg"),
  truth     = rep(c("pos", "neg", "pos"), times = 2),  # same gold label for each sample_id
  stringsAsFactors = FALSE
)
result <- validate(annotations)
names(result)
#> [1] "confusion" "reliability"