A high-level convenience function that provides one-line LLM inference. Automatically handles model downloading, loading, and text generation with optional chat template formatting and system prompts for instruction-tuned models.

Usage

quick_llama(
  prompt,
  model = .get_default_model(),
  n_threads = NULL,
  n_gpu_layers = "auto",
  n_ctx = 2048L,
  verbosity = 1L,
  max_tokens = 100L,
  top_k = 40L,
  top_p = 1,
  temperature = 0,
  repeat_last_n = 0L,
  penalty_repeat = 1,
  min_p = 0.05,
  system_prompt = "You are a helpful assistant.",
  auto_format = TRUE,
  chat_template = NULL,
  stream = FALSE,
  seed = 1234L,
  progress = interactive(),
  clean = TRUE,
  hash = TRUE,
  ...
)

Arguments

prompt

Character string or vector of prompts to process

model

Model URL or path (default: Llama 3.2 3B Instruct Q5_K_M)

n_threads

Number of threads (default: auto-detect)

n_gpu_layers

Number of GPU layers (default: auto-detect)

n_ctx

Context size (default: 2048)

verbosity

Backend logging verbosity (default: 1L). Higher values show more detail: 0 prints only errors, 1 adds warnings, 2 includes informational messages, and 3 enables the most verbose debug output.

max_tokens

Maximum tokens to generate (default: 100)

top_k

Top-k sampling (default: 40). Limits sampling to the k most likely tokens

top_p

Top-p (nucleus) sampling threshold (default: 1.0, which disables it). Lower values such as 0.9 restrict sampling to the smallest set of tokens whose cumulative probability exceeds p

temperature

Sampling temperature (default: 0.0, greedy and deterministic). Higher values increase randomness and creativity

repeat_last_n

Number of recent tokens considered for the repetition penalty (default: 0, which disables it)

penalty_repeat

Repetition penalty strength (default: 1.0, which disables it). Values greater than 1.0 discourage repetition

min_p

Minimum probability threshold (default: 0.05)

system_prompt

System prompt added to the conversation (default: "You are a helpful assistant.")

auto_format

Whether to automatically apply chat template formatting (default: TRUE)

chat_template

Custom chat template to use (default: NULL uses model's built-in template)

stream

Whether to stream output as it is generated (default: FALSE)

seed

Random seed for reproducibility (default: 1234)

progress

Show a console progress bar when running parallel generation. Default: interactive(). Has no effect for single-prompt runs.

clean

Whether to strip chat-template control tokens from the generated output. Defaults to TRUE.

hash

When `TRUE` (default), compute SHA-256 hashes for the prompts fed into the backend and the corresponding outputs. Hashes are attached via the `"hashes"` attribute for later inspection.

...

Additional parameters passed to generate() or generate_parallel()

Value

Character string (single prompt) or named list of character strings (multiple prompts). When `hash = TRUE`, the result carries prompt and output hashes in its `"hashes"` attribute

Examples

if (FALSE) { # \dontrun{
# Simple usage with default settings (deterministic)
response <- quick_llama("Hello, how are you?")

# Raw text generation without chat template
raw_response <- quick_llama("Complete this: The capital of France is",
                           auto_format = FALSE)

# Custom system prompt
code_response <- quick_llama("Write a Python hello world program",
                            system_prompt = "You are a Python programming expert.")

# Creative writing with higher temperature
creative_response <- quick_llama("Tell me a story",
                                 temperature = 0.8,
                                 max_tokens = 200)

# Prevent repetition
no_repeat <- quick_llama("Explain AI",
                        repeat_last_n = 64,
                        penalty_repeat = 1.1)
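
# Stream the response to the console as it is generated
# (illustrative prompt; all other settings use the defaults)
streamed <- quick_llama("Describe the R language in one sentence",
                        stream = TRUE)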

# Multiple prompts (parallel processing)
responses <- quick_llama(c("Summarize AI", "Explain quantum computing"),
                        max_tokens = 150)
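
# hash = TRUE (the default) attaches SHA-256 hashes of prompts and outputs
hashed <- quick_llama("Summarize AI")
attr(hashed, "hashes")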
} # }