Converts text into a sequence of integer token IDs that the language model can process. This is the first step in text generation, as models work with tokens rather than raw text. Different models may use different tokenization schemes (BPE, SentencePiece, etc.).

Usage

tokenize(model, text, add_special = TRUE)

Arguments

model

A model object created with model_load

text

Character string or character vector to tokenize. Can be a single text or multiple texts.

add_special

Whether to add special tokens such as BOS (Beginning of Sequence) and EOS (End of Sequence) (default: TRUE). These tokens help the model recognize text boundaries.

Value

An integer vector of token IDs corresponding to the input text. The result can be passed to generate for text generation or to detokenize to convert back to text.

Examples

if (FALSE) { # \dontrun{
# Load model
model <- model_load("path/to/model.gguf")

# Basic tokenization
tokens <- tokenize(model, "Hello, world!")
print(tokens)  # e.g., c(15339, 11, 1917, 0)

# Tokenize without special tokens (for model inputs)
raw_tokens <- tokenize(model, "Continue this text", add_special = FALSE)

# Tokenize multiple texts
batch_tokens <- tokenize(model, c("First text", "Second text"))

# Check tokenization of specific phrases
question_tokens <- tokenize(model, "What is AI?")
print(length(question_tokens))  # Number of tokens
} # }
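A quick way to sanity-check tokenization is a round trip through detokenize, which the Value section names as the inverse operation. The sketch below assumes detokenize takes the model and a token-ID vector and returns a character string; adjust if the package's actual signature differs. Like the examples above, it requires a local GGUF model file and is not run automatically.

if (FALSE) { # \dontrun{
# Round-trip check: text -> tokens -> text
model <- model_load("path/to/model.gguf")

text   <- "Hello, world!"
tokens <- tokenize(model, text)

# detokenize(model, tokens) is assumed here to invert tokenize()
restored <- detokenize(model, tokens)
identical(text, restored)  # often TRUE, but some tokenizers normalize
                           # whitespace or apply other lossy transforms
} # }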