Converts text into a sequence of integer token IDs that the language model can process. This is the first step in text generation, as models work with tokens rather than raw text. Different models may use different tokenization schemes (BPE, SentencePiece, etc.).
Arguments
- model
A model object created with model_load
- text
Character string or character vector to tokenize. Can be a single text or multiple texts
- add_special
Whether to add special tokens such as BOS (Beginning of Sequence) and EOS (End of Sequence) (default: TRUE). These tokens mark text boundaries for the model
Value
Integer vector of token IDs corresponding to the input text. These can be passed to
generate for text generation, or to detokenize to convert back to text
Examples
if (FALSE) { # \dontrun{
# Load model
model <- model_load("path/to/model.gguf")

# Basic tokenization
tokens <- tokenize(model, "Hello, world!")
print(tokens) # e.g., c(15339, 11, 1917, 0)

# Tokenize without special tokens (for model inputs)
raw_tokens <- tokenize(model, "Continue this text", add_special = FALSE)

# Tokenize multiple texts
batch_tokens <- tokenize(model, c("First text", "Second text"))

# Check tokenization of specific phrases
question_tokens <- tokenize(model, "What is AI?")
print(length(question_tokens)) # Number of tokens
} # }
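A round-trip sketch may help clarify the relationship between tokenize and the detokenize function mentioned under Value. This is not run: it assumes a valid model file at the illustrative path and that detokenize(model, tokens) returns the decoded text, as described above.

```r
if (FALSE) { # \dontrun{
model <- model_load("path/to/model.gguf")

# Tokenize, then decode back; the result should match the input,
# though special tokens (BOS/EOS) or whitespace normalization in the
# model's tokenizer may cause small differences.
tokens <- tokenize(model, "Hello, world!", add_special = FALSE)
text <- detokenize(model, tokens)
identical(text, "Hello, world!")
} # }
```

Because different models use different tokenization schemes (BPE, SentencePiece, etc.), the same string can map to different token IDs, and token counts, across models; always tokenize with the same model you intend to generate with.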