yera.models.interfaces.llms.llama_cpp

Interface to local llms via llama.cpp.

This module provides the LlamaCppLLM class for running large language models locally using the llama.cpp inference engine. Models are specified via a file path in the configuration. The interface supports optional GPU acceleration via the n_gpu_layers parameter, and provides both streaming chat completions and structured output generation using JSON schema constraints.

Symbols

class LlamaCppLLM — Interface to local llms via llama.cpp inference engine.

LlamaCppLLM

Interface to local llms via llama.cpp inference engine.

Provides a wrapper around the llama.cpp library for running quantised language models locally. The model file path is specified directly in the configuration. Supports optional GPU acceleration via the n_gpu_layers parameter, and provides both streaming chat completions and structured output generation with JSON schema constraints.

The model is lazily initialised on start() and must be explicitly shut down via stop(). All API methods require the model to be started first.

Attributes

config
type: LLMConfig

Configuration settings for the llm including model path and inference parameters.

connection
type: LlamaCppConnection

Connection configuration for llama.cpp.

model
type: Llama

Lazy-initialised llama.cpp model instance.

Methods

start — Initialise the llama.cpp model.
stop — Shut down and release the llama.cpp model.
chat — Stream a chat completion response from the local llm.
make_struct — Stream a structured output response conforming to a provided schema.

LlamaCppLLM.start

start() → None

Initialise the llama.cpp model.

Loads the quantised model from disk and initialises the llama.cpp inference engine. GPU acceleration is enabled by default (n_gpu_layers=-1) unless overridden in configuration. Must be called before any chat or structured output requests.

Raises

ValueError

If the model is already running.

LlamaCppLLM.stop

stop() → None

Shut down and release the llama.cpp model.

Closes the model and frees associated resources. The model must have been started via start() before this method can be called.

Raises

ValueError

If the model has not been started.

LlamaCppLLM.chat

chat(
    messages: list[Message],
    temperature: float = 0.2,
    top_p: float = 0.95,
    top_k: int = 40,
    min_p: float = 0.05,
    typical_p: float = 1.0,
    stop: str | list[str] | None = None,
    seed: int | None = None,
    max_tokens: int | None = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    repeat_penalty: float = 1.0,
    **llama_cpp_kw,
) → Iterator[str]

Stream a chat completion response from the local llm.

Sends a conversation to the llama.cpp model and streams the response as text tokens. Supports fine-grained control over sampling behaviour through temperature, top-k, top-p, and other sampling parameters.

Parameters

messages
type: list[Message]

List of Message objects representing the conversation history.

temperature
type: float = 0.2

Sampling temperature controlling randomness (0.0-2.0). Defaults to 0.2.

top_p
type: float = 0.95

Nucleus sampling parameter (0.0-1.0). Defaults to 0.95.

top_k
type: int = 40

Top-k sampling limit. Defaults to 40.

min_p
type: float = 0.05

Minimum probability constraint (0.0-1.0). Defaults to 0.05.

typical_p
type: float = 1.0

Typical probability constraint (0.0-1.0). Defaults to 1.0.

stop
type: str | list[str] | None = None

Stop sequence(s) to terminate generation. Defaults to None.

seed
type: int | None = None

Random seed for reproducibility. Defaults to None.

max_tokens
type: int | None = None

Maximum tokens to generate. Defaults to None.

presence_penalty
type: float = 0.0

Presence penalty for token repetition. Defaults to 0.0.

frequency_penalty
type: float = 0.0

Frequency penalty for token repetition. Defaults to 0.0.

repeat_penalty
type: float = 1.0

Repeat penalty multiplier. Defaults to 1.0.

**llama_cpp_kw
type: str | float | int | bool

Additional keyword arguments passed to llama.cpp.

Raises

ValueError

If the model has not been started.

LlamaCppLLM.make_struct

make_struct(
    messages: list[Message],
    temperature: float = 0.2,
    top_p: float = 0.95,
    top_k: int = 40,
    min_p: float = 0.05,
    typical_p: float = 1.0,
    seed: int | None = None,
    max_tokens: int | None = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    repeat_penalty: float = 1.0,
    **llama_cpp_kw,
) → Iterator[str]

Stream a structured output response conforming to a provided schema.

Generates a response that strictly conforms to the structure defined by the provided Pydantic model class. Uses JSON schema constraints to enforce structural compliance. Supports the same sampling parameters as chat().

Parameters

messages
type: list[Message]

List of Message objects representing the conversation history.

cls
type: type[TStruct]

A Pydantic model class defining the desired output structure.

temperature
type: float = 0.2

Sampling temperature controlling randomness (0.0-2.0). Defaults to 0.2.

top_p
type: float = 0.95

Nucleus sampling parameter (0.0-1.0). Defaults to 0.95.

top_k
type: int = 40

Top-k sampling limit. Defaults to 40.

min_p
type: float = 0.05

Minimum probability constraint (0.0-1.0). Defaults to 0.05.

typical_p
type: float = 1.0

Typical probability constraint (0.0-1.0). Defaults to 1.0.

seed
type: int | None = None

Random seed for reproducibility. Defaults to None.

max_tokens
type: int | None = None

Maximum tokens to generate. Defaults to None.

presence_penalty
type: float = 0.0

Presence penalty for token repetition. Defaults to 0.0.

frequency_penalty
type: float = 0.0

Frequency penalty for token repetition. Defaults to 0.0.

repeat_penalty
type: float = 1.0

Repeat penalty multiplier. Defaults to 1.0.

**llama_cpp_kw
type: str | float | int | bool

Additional keyword arguments passed to llama.cpp.

Raises

ValueError

If the model has not been started.