yera.models.interfaces.llms.llama_cpp

Interface to local llms via llama.cpp.

This module provides the LlamaCppLLM class for running large language models locally using the llama.cpp inference engine. Models are specified via a file path in the configuration. The interface supports optional GPU acceleration via the n_gpu_layers parameter, and provides both streaming chat completions and structured output generation using JSON schema constraints.

Symbols

class LlamaCppLLM — Interface to local llms via llama.cpp inference engine.

LlamaCppLLM

Inherits: BaseLLMInterface

Interface to local llms via llama.cpp inference engine.

Provides a wrapper around the llama.cpp library for running quantised language models locally. The model file path is specified directly in the configuration. Supports optional GPU acceleration via the n_gpu_layers parameter, and provides both streaming chat completions and structured output generation with JSON schema constraints.

The model is lazily initialised on start() and must be explicitly shut down via stop(). All API methods require the model to be started first.

Attributes

config

type: LLMConfig

Configuration settings for the llm including model path and inference parameters.

connection

type: LlamaCppConnection

Connection configuration for llama.cpp.

model

type: Llama

Lazy-initialised llama.cpp model instance.

Methods

start — Initialise the llama.cpp model.

stop — Shut down and release the llama.cpp model.

chat — Stream a chat completion response from the local llm.

make_struct — Stream a structured output response conforming to a provided schema.

LlamaCppLLM.start

start() → None

Initialise the llama.cpp model.

Loads the quantised model from disk and initialises the llama.cpp inference engine. GPU acceleration is enabled by default (n_gpu_layers=-1) unless overridden in configuration. Must be called before any chat or structured output requests.

Raises

ValueError

If the model is already running.

LlamaCppLLM.stop

stop() → None

Shut down and release the llama.cpp model.

Closes the model and frees associated resources. The model must have been started via start() before this method can be called.

Raises

ValueError

If the model has not been started.

LlamaCppLLM.chat

chat(
    messages: list[Message],
    temperature: float = 0.2,
    top_p: float = 0.95,
    top_k: int = 40,
    min_p: float = 0.05,
    typical_p: float = 1.0,
    stop: str | list[str] | None = None,
    seed: int | None = None,
    max_tokens: int | None = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    repeat_penalty: float = 1.0,
    **llama_cpp_kw,
) → Iterator[str]

Stream a chat completion response from the local llm.

Sends a conversation to the llama.cpp model and streams the response as text tokens. Supports fine-grained control over sampling behaviour through temperature, top-k, top-p, and other sampling parameters.

Parameters

messages

type: list[Message]

List of Message objects representing the conversation history.

temperature

type: float = 0.2

Sampling temperature controlling randomness (0.0-2.0). Defaults to 0.2.

top_p

type: float = 0.95

Nucleus sampling parameter (0.0-1.0). Defaults to 0.95.

top_k

type: int = 40

Top-k sampling limit. Defaults to 40.

min_p

type: float = 0.05

Minimum probability constraint (0.0-1.0). Defaults to 0.05.

typical_p

type: float = 1.0

Typical probability constraint (0.0-1.0). Defaults to 1.0.

stop

type: str | list[str] | None = None

Stop sequence(s) to terminate generation. Defaults to None.

seed

type: int | None = None

Random seed for reproducibility. Defaults to None.

max_tokens

type: int | None = None

Maximum tokens to generate. Defaults to None.

presence_penalty

type: float = 0.0

Presence penalty for token repetition. Defaults to 0.0.

frequency_penalty

type: float = 0.0

Frequency penalty for token repetition. Defaults to 0.0.

repeat_penalty

type: float = 1.0

Repeat penalty multiplier. Defaults to 1.0.

**llama_cpp_kw

type: str | float | int | bool

Additional keyword arguments passed to llama.cpp.

Raises

ValueError

If the model has not been started.

LlamaCppLLM.make_struct

make_struct(
    messages: list[Message],
    temperature: float = 0.2,
    top_p: float = 0.95,
    top_k: int = 40,
    min_p: float = 0.05,
    typical_p: float = 1.0,
    seed: int | None = None,
    max_tokens: int | None = None,
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
    repeat_penalty: float = 1.0,
    **llama_cpp_kw,
) → Iterator[str]

Stream a structured output response conforming to a provided schema.

Generates a response that strictly conforms to the structure defined by the provided Pydantic model class. Uses JSON schema constraints to enforce structural compliance. Supports the same sampling parameters as chat().

Parameters

messages

type: list[Message]

List of Message objects representing the conversation history.

cls

type: type[TStruct]

A Pydantic model class defining the desired output structure.

temperature

type: float = 0.2

Sampling temperature controlling randomness (0.0-2.0). Defaults to 0.2.

top_p

type: float = 0.95

Nucleus sampling parameter (0.0-1.0). Defaults to 0.95.

top_k

type: int = 40

Top-k sampling limit. Defaults to 40.

min_p

type: float = 0.05

Minimum probability constraint (0.0-1.0). Defaults to 0.05.

typical_p

type: float = 1.0

Typical probability constraint (0.0-1.0). Defaults to 1.0.

seed

type: int | None = None

Random seed for reproducibility. Defaults to None.

max_tokens

type: int | None = None

Maximum tokens to generate. Defaults to None.

presence_penalty

type: float = 0.0

Presence penalty for token repetition. Defaults to 0.0.

frequency_penalty

type: float = 0.0

Frequency penalty for token repetition. Defaults to 0.0.

repeat_penalty

type: float = 1.0

Repeat penalty multiplier. Defaults to 1.0.

**llama_cpp_kw

type: str | float | int | bool

Additional keyword arguments passed to llama.cpp.

Raises

ValueError

If the model has not been started.

← back to docs