A high-level convenience function that provides one-line LLM inference. Automatically handles model downloading, loading, and text generation with optional chat template formatting and system prompts for instruction-tuned models.
Usage
quick_llama(
prompt,
model = .get_default_model(),
n_threads = NULL,
n_gpu_layers = "auto",
n_ctx = 2048L,
verbosity = 1L,
max_tokens = 100L,
top_k = 40L,
top_p = 1,
temperature = 0,
repeat_last_n = 0L,
penalty_repeat = 1,
min_p = 0.05,
system_prompt = "You are a helpful assistant.",
auto_format = TRUE,
chat_template = NULL,
stream = FALSE,
seed = 1234L,
progress = interactive(),
clean = TRUE,
hash = TRUE,
...
)
Arguments
- prompt
Character string or vector of prompts to process
- model
Model URL or path (default: Llama 3.2 3B Instruct Q5_K_M)
- n_threads
Number of threads (default: auto-detect)
- n_gpu_layers
Number of GPU layers (default: auto-detect)
- n_ctx
Context size (default: 2048)
- verbosity
Backend logging verbosity (default: 1L). Higher values show more detail:
0 prints only errors, 1 adds warnings, 2 includes informational messages, and 3 enables the most verbose debug output.
- max_tokens
Maximum tokens to generate (default: 100)
- top_k
Top-k sampling (default: 40). Limits vocabulary to k most likely tokens
- top_p
Top-p sampling (default: 1.0). Set to 0.9 for nucleus sampling
- temperature
Sampling temperature (default: 0.0). Higher values increase creativity
- repeat_last_n
Number of recent tokens to consider for repetition penalty (default: 0). Set to 0 to disable
- penalty_repeat
Repetition penalty strength (default: 1.0). Set to 1.0 to disable
- min_p
Minimum probability threshold (default: 0.05)
- system_prompt
System prompt to add to conversation (default: "You are a helpful assistant.")
- auto_format
Whether to automatically apply chat template formatting (default: TRUE)
- chat_template
Custom chat template to use (default: NULL uses model's built-in template)
- stream
Whether to stream output as it is generated (default: FALSE)
- seed
Random seed for reproducibility (default: 1234)
- progress
Show a console progress bar when running parallel generation. Default:
interactive(). Has no effect for single-prompt runs.- clean
Whether to strip chat-template control tokens from the generated output. Defaults to TRUE.
- hash
When `TRUE` (default), compute SHA-256 hashes for the prompts fed into the backend and the corresponding outputs. Hashes are attached via the `"hashes"` attribute for later inspection.
- ...
Additional parameters passed to generate() or generate_parallel()
Examples
if (FALSE) { # \dontrun{
# Simple usage with default settings (deterministic)
response <- quick_llama("Hello, how are you?")
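# CPU-only inference (illustrative: assumes the usual llama.cpp convention
# where offloading 0 layers disables GPU use)
cpu_response <- quick_llama("Hello, how are you?", n_gpu_layers = 0L)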
# Raw text generation without chat template
raw_response <- quick_llama("Complete this: The capital of France is",
                            auto_format = FALSE)
# Custom system prompt
code_response <- quick_llama("Write a Python hello world program",
                             system_prompt = "You are a Python programming expert.")
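# Quieter runs: verbosity = 0 prints only backend errors
quiet_response <- quick_llama("Hello, how are you?", verbosity = 0L)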
# Creative writing with higher temperature
creative_response <- quick_llama("Tell me a story",
                                 temperature = 0.8,
                                 max_tokens = 200)
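# Nucleus sampling: lower top_p to sample only from the most likely tokens
nucleus_response <- quick_llama("Suggest a book title",
                                temperature = 0.7,
                                top_p = 0.9)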
# Prevent repetition
no_repeat <- quick_llama("Explain AI",
                         repeat_last_n = 64,
                         penalty_repeat = 1.1)
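# Stream tokens to the console as they are generated
streamed <- quick_llama("Describe the ocean", stream = TRUE)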
# Multiple prompts (parallel processing)
responses <- quick_llama(c("Summarize AI", "Explain quantum computing"),
                         max_tokens = 150)
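# Inspect SHA-256 hashes of prompts and outputs (attached when hash = TRUE)
hashed <- quick_llama("Hello, how are you?", hash = TRUE)
attr(hashed, "hashes")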
} # }