yera.models.interfaces.llms.llama_cpp
Interface to local LLMs via llama.cpp.
This module provides the LlamaCppLLM class for running large language models locally using the llama.cpp inference engine. Models are specified via a file path in the configuration. The interface supports optional GPU acceleration via the n_gpu_layers parameter, and provides both streaming chat completions and structured output generation using JSON schema constraints.
Symbols
LlamaCppLLM
BaseLLMInterface
Interface to local LLMs via the llama.cpp inference engine.
Provides a wrapper around the llama.cpp library for running quantised language models locally. The model file path is specified directly in the configuration. Supports optional GPU acceleration via the n_gpu_layers parameter, and provides both streaming chat completions and structured output generation with JSON schema constraints.
The model is lazily initialised on start() and must be explicitly shut down via stop(). All API methods require the model to be started first.
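The start/stop lifecycle described above lends itself to a context manager. The following is a minimal sketch and not part of yera: `running` is a hypothetical helper, and `FakeLLM` is a stand-in that only mimics the documented start()/stop() contract so the example runs without a model file. The exception type it raises is an assumption, since this page does not name one.

```python
from contextlib import contextmanager


class FakeLLM:
    """Stand-in that mimics the documented lifecycle; not the real LlamaCppLLM."""

    def __init__(self):
        self.started = False

    def start(self):
        if self.started:
            # Exception type is an assumption; the reference does not name it.
            raise RuntimeError("model is already running")
        self.started = True

    def stop(self):
        if not self.started:
            # Exception type is an assumption; the reference does not name it.
            raise RuntimeError("model has not been started")
        self.started = False


@contextmanager
def running(llm):
    """Start `llm`, yield it, and always stop it afterwards (hypothetical helper)."""
    llm.start()
    try:
        yield llm
    finally:
        # Runs even if the body raises, so the model is always released.
        llm.stop()


llm = FakeLLM()
with running(llm) as active:
    was_started_inside = active.started
```

Wrapping the calls this way keeps stop() paired with start() even when a request fails midway.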
Attributes
Configuration settings for the LLM, including the model path and inference parameters.
Connection configuration for llama.cpp.
Lazy-initialised llama.cpp model instance.
Methods
LlamaCppLLM.start
start() → None
Initialise the llama.cpp model.
Loads the quantised model from disk and initialises the llama.cpp inference engine. GPU acceleration is enabled by default (n_gpu_layers=-1) unless overridden in configuration. Must be called before any chat or structured output requests.
Raises
If the model is already running.
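The GPU default described for start() can be pictured as a fallback applied when the configuration leaves n_gpu_layers unset. This is only a sketch: `resolve_n_gpu_layers`, the dict-shaped `config`, and the `model_path` key are hypothetical and do not reflect how yera actually reads its configuration. In the llama-cpp-python bindings, a value of -1 requests offloading all layers to the GPU, while 0 keeps inference on the CPU.

```python
def resolve_n_gpu_layers(config: dict) -> int:
    """Return the configured n_gpu_layers, falling back to -1 (offload all layers)."""
    value = config.get("n_gpu_layers")
    # Only an explicit setting overrides the default; 0 is a valid override.
    return -1 if value is None else int(value)


# "model.gguf" is a placeholder path used purely for illustration.
default_layers = resolve_n_gpu_layers({"model_path": "model.gguf"})
cpu_only_layers = resolve_n_gpu_layers({"model_path": "model.gguf", "n_gpu_layers": 0})
```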
LlamaCppLLM.stop
stop() → None
Shut down and release the llama.cpp model.
Closes the model and frees associated resources. The model must have been started via start() before this method can be called.
Raises
If the model has not been started.
LlamaCppLLM.chat
chat(
messages: list[Message],
temperature: float = 0.2,
top_p: float = 0.95,
top_k: int = 40,
min_p: float = 0.05,
typical_p: float = 1.0,
stop: str | list[str] | None = None,
seed: int | None = None,
max_tokens: int | None = None,
presence_penalty: float = 0.0,
frequency_penalty: float = 0.0,
repeat_penalty: float = 1.0,
**llama_cpp_kw,
) → Iterator[str]
Stream a chat completion response from the local LLM.
Sends a conversation to the llama.cpp model and streams the response as text tokens. Supports fine-grained control over sampling behaviour through temperature, top-k, top-p, and other sampling parameters.
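To make the sampling parameters more concrete, the sketch below applies simplified top-k, top-p, and min-p filters to a toy next-token distribution. It is an illustration of the general techniques, not llama.cpp's implementation: real samplers also apply temperature, renormalise the surviving probabilities, and chain the filters in a configurable order. All function names and the toy vocabulary are made up for this example.

```python
def ranked(probs: dict[str, float]) -> list[tuple[str, float]]:
    """Return (token, probability) pairs sorted from most to least likely."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k most probable tokens."""
    return dict(ranked(probs)[:k])


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept: dict[str, float] = {}
    total = 0.0
    for token, prob in ranked(probs):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept


def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    return {token: prob for token, prob in probs.items() if prob >= threshold}


# Toy distribution over five candidate tokens (illustrative values only).
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.04, "qux": 0.01}

after_top_k = top_k_filter(probs, k=2)
after_top_p = top_p_filter(probs, p=0.9)
after_min_p = min_p_filter(probs, min_p=0.05)
```

Lower values of top_k and top_p, or higher values of min_p, narrow the candidate set and make output more predictable.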
Parameters
messages: List of Message objects representing the conversation history.
temperature: Sampling temperature controlling randomness (0.0-2.0). Defaults to 0.2.
top_p: Nucleus sampling parameter (0.0-1.0). Defaults to 0.95.
top_k: Top-k sampling limit. Defaults to 40.
min_p: Minimum probability constraint (0.0-1.0). Defaults to 0.05.
typical_p: Typical probability constraint (0.0-1.0). Defaults to 1.0.
stop: Stop sequence(s) to terminate generation. Defaults to None.
seed: Random seed for reproducibility. Defaults to None.
max_tokens: Maximum tokens to generate. Defaults to None.
presence_penalty: Presence penalty for token repetition. Defaults to 0.0.
frequency_penalty: Frequency penalty for token repetition. Defaults to 0.0.
repeat_penalty: Repeat penalty multiplier. Defaults to 1.0.
**llama_cpp_kw: Additional keyword arguments passed to llama.cpp.
Raises
If the model has not been started.
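Because chat() returns an iterator of text chunks, callers typically loop over it and either display each chunk as it arrives or join the chunks into the full reply. The sketch below shows that pattern with a stand-in generator so it runs without a model; `fake_chat` and `collect` are hypothetical names, and the chunk boundaries are invented for illustration.

```python
from typing import Callable, Iterable, Iterator, Optional


def fake_chat() -> Iterator[str]:
    """Stand-in for llm.chat(messages); yields text chunks like a streamed reply."""
    yield from ["Hel", "lo", ", ", "world", "!"]


def collect(chunks: Iterable[str], on_chunk: Optional[Callable[[str], None]] = None) -> str:
    """Consume a chunk stream, optionally reporting each chunk, and return the full text."""
    parts = []
    for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)  # e.g. print the chunk for a live display
        parts.append(chunk)
    return "".join(parts)


seen = []
reply = collect(fake_chat(), on_chunk=seen.append)
```

With a started model, passing `llm.chat(messages)` in place of `fake_chat()` and a printing callback as `on_chunk` would show the reply incrementally, assuming the iterator yields plain text as documented.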
LlamaCppLLM.make_struct
make_struct(
messages: list[Message],
temperature: float = 0.2,
top_p: float = 0.95,
top_k: int = 40,
min_p: float = 0.05,
typical_p: float = 1.0,
seed: int | None = None,
max_tokens: int | None = None,
presence_penalty: float = 0.0,
frequency_penalty: float = 0.0,
repeat_penalty: float = 1.0,
**llama_cpp_kw,
) → Iterator[str]
Stream a structured output response conforming to a provided schema.
Generates a response that strictly conforms to the structure defined by the provided Pydantic model class. Uses JSON schema constraints to enforce structural compliance. Supports the same sampling parameters as chat().
Parameters
messages: List of Message objects representing the conversation history.
A Pydantic model class defining the desired output structure.
temperature: Sampling temperature controlling randomness (0.0-2.0). Defaults to 0.2.
top_p: Nucleus sampling parameter (0.0-1.0). Defaults to 0.95.
top_k: Top-k sampling limit. Defaults to 40.
min_p: Minimum probability constraint (0.0-1.0). Defaults to 0.05.
typical_p: Typical probability constraint (0.0-1.0). Defaults to 1.0.
seed: Random seed for reproducibility. Defaults to None.
max_tokens: Maximum tokens to generate. Defaults to None.
presence_penalty: Presence penalty for token repetition. Defaults to 0.0.
frequency_penalty: Frequency penalty for token repetition. Defaults to 0.0.
repeat_penalty: Repeat penalty multiplier. Defaults to 1.0.
**llama_cpp_kw: Additional keyword arguments passed to llama.cpp.
Raises
If the model has not been started.
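Since make_struct() is documented as returning Iterator[str], one plausible reading is that it yields fragments of a single JSON document, which only becomes parseable once the stream is complete. The sketch below works under that assumption, using a stand-in generator and the standard library's json module; `fake_make_struct` and `parse_streamed_json` are hypothetical names. With Pydantic v2, the joined text could instead be passed to the model class's `model_validate_json` method to validate and parse in one step.

```python
import json
from typing import Iterable, Iterator


def fake_make_struct() -> Iterator[str]:
    """Stand-in for llm.make_struct(...); yields fragments of one JSON document."""
    yield from ['{"name": "Ada', '", "age": ', '36}']


def parse_streamed_json(chunks: Iterable[str]) -> dict:
    """Join all fragments first, then parse; individual fragments are not valid JSON."""
    return json.loads("".join(chunks))


person = parse_streamed_json(fake_make_struct())
```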