`explore()` runs several models over the same prompts, captures their predictions, and returns both long- and wide-format annotation tables that can be fed into the confusion-matrix and reliability helpers.

Usage

explore(
  models,
  instruction = NULL,
  prompts = NULL,
  engine = c("auto", "parallel", "single"),
  batch_size = 8L,
  reuse_models = FALSE,
  sink = NULL,
  progress = interactive(),
  clean = TRUE,
  keep_prompts = FALSE,
  hash = TRUE,
  chat_template = TRUE,
  system_prompt = NULL
)

Arguments

models

Model definitions. Accepts one of the following formats (a sketch of all three appears after the specification keys below):

  • A single model path string (consistent with [model_load()] syntax)

  • A named character vector where names become `model_id`s

  • A list of model specification lists

Each model specification list supports the following keys:

id

(Required unless auto-generated) Unique identifier for this model

model_path

(Required unless using `predictor`) Path to local GGUF file, URL, or cached model name. Supports the same formats as [model_load()]

n_gpu_layers

Number of layers to offload to GPU. Use `"auto"` (default) for automatic detection, `0` for CPU-only, or `-1` for all layers on GPU

n_ctx

Context window size (default: 2048)

n_threads

Number of CPU threads (default: auto-detected)

cache_dir

Custom cache directory for model downloads

use_mmap

Enable memory mapping (default: TRUE)

use_mlock

Lock model in memory (default: FALSE)

check_memory

Check memory availability before loading (default: TRUE)

force_redownload

Force re-download even if cached (default: FALSE)

verify_integrity

Verify file integrity (default: TRUE)

hf_token

Hugging Face access token for gated models. Can also be set globally via [set_hf_token()]

verbosity

Backend logging level (default: 1)

chat_template

Override the global `chat_template` setting for this model

system_prompt

Override the global `system_prompt` for this model

instruction

Task instruction to use for this model

generation

List of generation parameters (max_tokens, temperature, etc.)

prompts

Custom prompts for this model

predictor

Function for mock/testing scenarios (bypasses model loading)
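
A minimal sketch of the three accepted `models` formats, assuming placeholder GGUF paths and model IDs (none of these files ship with the package):

# 1. A single model path string
models <- "models/llama-3-8b-instruct.Q4_K_M.gguf"

# 2. A named character vector; the names become model_ids
models <- c(
  llama   = "models/llama-3-8b-instruct.Q4_K_M.gguf",
  mistral = "models/mistral-7b-instruct.Q4_K_M.gguf"
)

# 3. A list of model specification lists
models <- list(
  list(
    id           = "llama",
    model_path   = "models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers = "auto",
    n_ctx        = 4096,
    generation   = list(max_tokens = 32, temperature = 0)
  ),
  list(
    id        = "mock",
    # predictor bypasses model loading; its exact signature is assumed here
    predictor = function(prompts) rep("positive", length(prompts))
  )
)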

instruction

Default task instruction applied to every model whose spec does not override it with its own `instruction` entry.

prompts

One of:

  • A function (for example `function(spec)`) that returns prompts, either as a character vector or as a data frame with a `prompt` column

  • A character vector of ready-made prompts

  • A template list where each named element becomes a section in the rendered prompt. Field names are used as-is for headers; vector fields with the same length as `sample_id` are treated as per-item values. Use `sample_id` to supply item IDs (kept as metadata, not rendered into the prompt)

When `NULL`, each model must provide its own `prompts` entry.
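
A sketch of the three formats; the texts and IDs are illustrative only:

# (1) A function receiving the model spec and returning prompts
prompts <- function(spec) {
  data.frame(prompt = paste("Classify the sentiment:",
                            c("Great product!", "Terrible service.")))
}

# (2) Ready-made prompts
prompts <- c(
  "Classify the sentiment: Great product!",
  "Classify the sentiment: Terrible service."
)

# (3) A template list: each named element becomes a section; vector fields
#     with the same length as sample_id are treated as per-item values
prompts <- list(
  Task      = "Classify the sentiment of the text as positive or negative.",
  Text      = c("Great product!", "Terrible service."),
  sample_id = c("item_1", "item_2")
)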

engine

One of `"auto"`, `"parallel"`, or `"single"`. Controls whether `generate_parallel()` or `generate()` is used under the hood.

batch_size

Number of prompts to send per backend call when the parallel engine is active. Must be >= 1.

reuse_models

If `TRUE`, model/context handles stay alive for the duration of the call (useful when exploring many prompts). When `FALSE` (default), handles are released after each model to minimise peak memory usage.

sink

Optional function that accepts `(chunk, model_id)` and is invoked after each model finishes. This makes it easy to stream intermediate results to disk via helpers such as [annotation_sink_csv()].
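
A minimal sketch of a custom sink; the column layout of `chunk` is whatever `explore()` produces for the long table, and the file naming is illustrative. For CSV streaming, the bundled [annotation_sink_csv()] helper can be used instead:

my_sink <- function(chunk, model_id) {
  # write each model's results to its own CSV file as it finishes
  utils::write.csv(chunk,
                   file = paste0("annotations_", model_id, ".csv"),
                   row.names = FALSE)
}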

progress

Whether to print progress messages for each model/batch.

clean

Forwarded to `generate()`/`generate_parallel()` to remove control tokens from the outputs.

keep_prompts

If `TRUE`, the generated prompts are preserved in the long-format output (useful for audits). Defaults to `FALSE`.

hash

When `TRUE` (default), computes SHA-256 hashes for each model's prompts and resulting labels so replication collaborators can verify inputs and outputs. Hashes are attached to the returned list via the `"hashes"` attribute.

chat_template

When `TRUE`, wraps prompts using the model's built-in chat template before generation. This uses [apply_chat_template()] to format prompts with appropriate special tokens for instruction-tuned models. Individual models can override this via their spec. Default: `TRUE`.

system_prompt

Optional system message to include when `chat_template = TRUE`. This is prepended as a system role message before the user prompt. Individual models can override this via their spec. Default: `NULL`.
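
A sketch of the global chat-template settings in action, assuming a placeholder instruction-tuned model; individual model specs can still override both values:

explore(
  models        = "models/llama-3-8b-instruct.Q4_K_M.gguf",
  instruction   = "Classify the sentiment as positive or negative.",
  prompts       = c("Great product!", "Terrible service."),
  chat_template = TRUE,
  system_prompt = "You are a careful annotation assistant. Answer with a single label."
)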

Value

A list with elements `annotations` (long table) and `matrix` (wide annotation matrix). When `sink` is supplied, the `annotations` and `matrix` entries are set to `NULL` to avoid duplicating the streamed output.
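
A sketch of a typical call and of how the returned pieces are accessed; the model paths and texts are placeholders:

res <- explore(
  models = c(
    llama   = "models/llama-3-8b-instruct.Q4_K_M.gguf",
    mistral = "models/mistral-7b-instruct.Q4_K_M.gguf"
  ),
  instruction = "Classify the sentiment as positive or negative.",
  prompts     = c("Great product!", "Terrible service."),
  engine      = "auto",
  batch_size  = 8L
)

res$annotations      # long table of model predictions
res$matrix           # wide annotation matrix
attr(res, "hashes")  # prompt/label hashes when hash = TRUE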