A high-level convenience function that provides one-line LLM inference. Automatically handles model downloading, loading, and text generation with optional chat template formatting and system prompts for instruction-tuned models.

Usage

quick_llama(
  prompt,
  model = .get_default_model(),
  n_threads = NULL,
  n_gpu_layers = "auto",
  n_ctx = 2048L,
  verbosity = 1L,
  max_tokens = 100L,
  top_k = 40L,
  top_p = 1,
  temperature = 0,
  repeat_last_n = 0L,
  penalty_repeat = 1,
  min_p = 0.05,
  system_prompt = "You are a helpful assistant.",
  auto_format = TRUE,
  chat_template = NULL,
  stream = FALSE,
  seed = 1234L,
  progress = interactive(),
  clean = TRUE,
  hash = TRUE,
  ...
)

Arguments

prompt

Character string or vector of prompts to process

model

Model URL or path (default: Llama 3.2 3B Instruct Q5_K_M)

n_threads

Number of threads (default: auto-detect)

n_gpu_layers

Number of GPU layers (default: auto-detect)

n_ctx

Context size (default: 2048)

verbosity

Backend logging verbosity (default: 1L). Higher values show more detail: 0 prints only errors, 1 adds warnings, 2 includes informational messages, and 3 enables the most verbose debug output.

max_tokens

Maximum tokens to generate (default: 100)

top_k

Top-k sampling (default: 40). Limits sampling to the k most likely tokens

top_p

Top-p (nucleus) sampling threshold (default: 1.0, which disables it). Lower values such as 0.9 restrict sampling to the smallest set of tokens whose cumulative probability exceeds p

temperature

Sampling temperature (default: 0.0, greedy and deterministic). Higher values increase randomness and creativity

repeat_last_n

Number of recent tokens considered for the repetition penalty (default: 0, which disables it)

penalty_repeat

Repetition penalty strength (default: 1.0, which disables it). Values greater than 1.0 discourage repetition

min_p

Minimum probability threshold (default: 0.05)

system_prompt

System prompt added to the conversation (default: "You are a helpful assistant.")

auto_format

Whether to automatically apply chat template formatting (default: TRUE)

chat_template

Custom chat template to use (default: NULL uses model's built-in template)

stream

Whether to stream output as it is generated (default: FALSE)

seed

Random seed for reproducibility (default: 1234)

progress

Show a console progress bar when running parallel generation. Default: interactive(). Has no effect for single-prompt runs.

clean

Whether to strip chat-template control tokens from the generated output. Defaults to TRUE.

hash

When `TRUE` (default), compute SHA-256 hashes for the prompts fed into the backend and the corresponding outputs. Hashes are attached via the `"hashes"` attribute for later inspection.

...

Additional parameters passed to generate() or generate_parallel()

Value

Character string (single prompt) or named list of character strings (multiple prompts). When `hash = TRUE`, the result carries prompt and output hashes in its `"hashes"` attribute

Examples

if (FALSE) { # \dontrun{
# Simple usage with default settings (deterministic)
response <- quick_llama("Hello, how are you?")

# Raw text generation without chat template
raw_response <- quick_llama("Complete this: The capital of France is",
                           auto_format = FALSE)

# Custom system prompt
code_response <- quick_llama("Write a Python hello world program",
                            system_prompt = "You are a Python programming expert.")

# Creative writing with higher temperature
creative_response <- quick_llama("Tell me a story",
                                 temperature = 0.8,
                                 max_tokens = 200)

# Prevent repetition
no_repeat <- quick_llama("Explain AI",
                        repeat_last_n = 64,
                        penalty_repeat = 1.1)
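
# Stream the response to the console as it is generated
# (illustrative prompt; all other settings use the defaults)
streamed <- quick_llama("Describe the R language in one sentence",
                        stream = TRUE)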

# Multiple prompts (parallel processing)
responses <- quick_llama(c("Summarize AI", "Explain quantum computing"),
                        max_tokens = 150)
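
# hash = TRUE (the default) attaches SHA-256 hashes of prompts and outputs
hashed <- quick_llama("Summarize AI")
attr(hashed, "hashes")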
} # }