Security of LLM inference during runtime

#llama#llm#inference#integrity

As local inference for language models becomes more popular, issues that until recently sat at the margins of AI security are becoming increasingly relevant. Most of the debate still focuses on the application layer — prompt injection, data poisoning, jailbreaks, RAG security. Far less attention goes to the integrity of the model artifact itself during inference.

The project I built targets precisely that layer. It demonstrates that under specific environmental conditions, it is possible to persistently manipulate a language model’s responses by modifying quantized weights inside a GGUF file after the inference server has already started — without restarting the process, without code injection, and without ptrace.

That distinction matters. This is not a fundamental weakness of the transformer architecture. It is also not about “breaking the model” in the classic exploitation sense. The core issue is the interaction between how the model is stored, how the file is memory-mapped, and the faulty operational assumption that if the server treats the model as read-only, the model must also remain immutable at runtime. In practice, that assumption can be false — if the model file stays shared and writable by another process running in the same environment.

The threat model is both constrained and realistic. The attacker does not need control of the llama-server process, does not need root privileges, and does not need to inject code or debug process memory. Write access to the GGUF model file used by the running server is enough. That scenario should not exist in a properly designed production environment, but in practice it is entirely plausible — shared Docker volumes, local directories mounted into containers, experimental tooling running alongside the inference server, weak permission separation around model artifacts. All common.

Attack mechanism

The attack follows from the default behavior of llama-server. The server maps the GGUF file into memory using mmap, and the observed behavior matches the path where the process reads file data through shared page-cache pages managed by the kernel. If a second process writes modified data to the same file, the kernel updates the relevant memory pages. As a result, the inference process may see new weight values on subsequent reads — even though it never reloaded the model and formally treats it as read-only. That is exactly why modifying the file on disk can have a runtime effect rather than being a purely offline change.

The modification is not random or global. The target was chosen deliberately: the output.weight tensor, which is the final projection matrix from the hidden state into the token logit space. In simplified form, the model computes final logits as hidden_state @ output.weight. Each row of that matrix corresponds to a specific token in the vocabulary. Amplifying selected rows increases the logits of the corresponding tokens and therefore their dominance after softmax. From an attacker’s perspective, this is a particularly attractive surface because it allows direct influence over the final probability distribution of generation.

In the analyzed example I used TinyLlama 1.1B Chat in GGUF format, specifically Q4_K_M, while output.weight itself is stored as Q6_K. That detail matters operationally because the script does not rebuild the full model representation or perform full dequantization. Instead, it leverages the block structure of Q6_K. Each block represents 256 weight values in 210 bytes. The last two bytes contain the fp16 superblock scale field, denoted as d. Dequantized values are scaled by that parameter — so multiplying d by a chosen factor proportionally increases all weight values in that block. Applied to every block belonging to a specific row of output.weight, the result is practical amplification of the logit assigned to a chosen token.

One particularly important aspect from an engineering standpoint: the attack script does not treat the GGUF file as an opaque byte stream. attack.py implements its own parser for the GGUF structure — reads the header, metadata, tensor count, value types in the key-value section, tensor descriptors, and data-section alignment. It then identifies output.weight and calculates its absolute offset inside the file, which makes it possible to compute exactly where the blocks corresponding to a given token are located and where the d field sits in each of those blocks. This approach has a decisive advantage over binary pattern searching: it is repeatable, rooted in the semantics of the format, and generalizable to other tensors or quantization types.

Correct handling of tokenization is another important piece. Generating a desired string is not as simple as specifying a literal. You need to determine how the model actually tokenizes it. I used llama-tokenize for that. For the target “Pwned”, the result was three tokens: [349, 1233, 287]. That leads to an important experimental observation: boosting all tokens with the same factor is not necessarily optimal. Autoregressive decoding means different positions have different natural probabilities depending on context. The first token must win in a relatively neutral state, the middle token often needs stronger amplification, and the final token may require a weaker boost — once the prefix has been generated, its appearance becomes more likely on its own. Excessively boosting the last subtokens leads to repetition and response degeneration.

That is what makes this more than a flashy demo. I am not just showing that you can change the model’s answer. I am documenting a practical heuristic for choosing modification strength at the subtoken level. The script applies a pattern where the first token gets a base multiplier, middle tokens get a higher one, and the last token gets a lower one. The result is then verified not with a single request but with a series of test queries sent to two endpoints: classic /completion and /v1/chat/completions. This matters because chat mode introduces a conversational template that changes the model’s contextual state and can alter token preferences. In practice, attack effectiveness should be evaluated with the application interface in mind, not just raw decoding.

The operation itself is technically simple but conceptually interesting. The script first sends a baseline request to show model behavior before modification. It then parses the GGUF file, locates output.weight, calculates the rows corresponding to selected tokens, stores original scale values in a JSON backup, modifies the relevant d fields, and flushes and fsyncs to synchronize changes with the filesystem. After that, the test prompt set runs again to measure whether the target string now dominates responses. A restore mode puts the original values back using the saved backup — which makes this not only a security demonstration but also a reproducible experiment.

The significance goes well beyond manipulating a single word. The main conclusion is this: the security of LLM inference cannot be analyzed solely at the process level or application layer. The model as a binary file, the way it is mapped into memory, and the permission model around storage volumes are all part of the attack surface. If an organization deploys models locally while sharing the model directory across components with different trust levels, generation integrity is not guaranteed — even when the inference server itself has no known RCE-class vulnerabilities.

The operational consequences are also worth emphasizing. This type of manipulation is relatively difficult to detect with standard service monitoring. The process does not restart, the container may remain healthy, the endpoint continues to work correctly from an infrastructure perspective — and yet the model may systematically produce distorted responses. That property makes the issue especially important in environments where high response integrity is assumed: security analysis support systems, document classification pipelines, information extraction workflows, local copilots used by engineering teams.

From a research perspective, this opens several directions. First, tensors other than output.weight are worth studying. Manipulating the final projection is effective, but likely not the only option — more subtle and harder-to-detect effects could probably be achieved by modifying layers responsible for information routing earlier in the network. Second, the question of attack subtlety is particularly interesting. The proof of concept presented here intentionally produces a clear semantic effect, but in offensive use cases, lower-amplitude changes that shift the model only in a particular semantic or stylistic direction would be considerably more dangerous.

That naturally leads to detection. If small weight changes can influence response distribution, mechanisms are needed to monitor model integrity not only before launch but during runtime. Periodic hashing of model files, integrity measurement for storage volumes, copying models into private read-only locations, regression testing based on reference prompt sets — none of these is complete on its own. A hash loses value if not measured at runtime. Behavioral tests are statistical and may miss subtle shifts. Runtime integrity for inference models deserves much more systematic treatment.

Experimental environment

The repository contains a Docker container based on Ubuntu 24.04 that builds llama.cpp from source, with both llama-server and llama-tokenize available. The model is downloaded by setup.sh into /models, and docker-compose.yml mounts the local ./models directory into the container. This layout is convenient, but from a security perspective it immediately reveals the key assumption behind the experiment: the server and the modification tool operate on the same filesystem resource. The repository also includes settings useful for debugging — SYS_PTRACE and seccomp:unconfined — though they are not required for the attack described here.

The limitations matter equally. In this form, the attack requires output.weight to exist explicitly as a separate structure inside the model file rather than being tied to input embeddings. It also assumes Q6_K quantization, because block layout and scale field location depend on the format. And it assumes the server is using the default mmap path — running with —no-mmap removes the propagation mechanism by which file changes become visible to the inference process.

The biggest value of this project lies beyond the proof of concept itself. It shows that LLM security needs to be analyzed across multiple layers: from model math, through tokenization and weight representation, down to the operating system, page cache, memory mapping, and volume-mounting policies. It is often at the boundaries between those layers that the most underestimated threat classes emerge. In that sense, the repository is not merely a demonstration of a specific attack against TinyLlama — it is a starting point for broader research into runtime integrity for language models and the security of local inference stacks built around open weights.

Repository: https://github.com/piotrmaciejbednarski/llm-inference-tampering