21. 03. 2026 Andrea Mariani AI

Reflections on Running LLMs Locally: Why It Is Worth Running Them on Your Own Infrastructure

Model selection, infrastructure sizing, vertical fine-tuning and MCP server integration. All explained without the fluff.

Why Run AI on Your Own Infrastructure?

Let’s be honest: over the past two years, LLMs have evolved from a tool perceived as experimental and reserved for researchers into something companies use every day for concrete, practical tasks. And with that widespread adoption came a question I hear more and more often: do I really have to use the cloud, or can I run this on my own servers?
The short answer is: it depends. But in most cases where the question comes up, the answer is yes, you should at least seriously consider it. Running an LLM locally, on your own hardware, on-premise or in a private datacenter, gives you things the cloud simply cannot: your data never leaves your perimeter, latency is predictable, and after a while the numbers work out much better than they might seem.
In this post I walk you through how all of this works in practice: how to choose the right model, how much GPU you actually need, when it makes sense to fine-tune smaller models, and how to connect your LLM to your business systems via MCP. No theory for its own sake, only practices already in use in real-world environments.

💡 WHO THIS IS FOR: If you already have a rough idea of what an LLM is and you have even just one machine with a decent GPU at home or in the office, you are already halfway there. You do not need to be an ML researcher, just curiosity and a willingness to experiment.

Two Paths, One Goal

Before diving into the details, it is worth being clear about something: this document speaks to two very different types of readers, who often start from the same question but have very different destinations.

  • The experimental path: you are a developer, a small technical team, or simply someone who wants to understand how this works. You have a GPU, you want to try Ollama, you want to see what a local LLM can do without spending a fortune. Your goal is to learn quickly with minimum friction.
  • The enterprise path: you work in a structured organization with security, compliance, high availability, and integration requirements. Ollama is fine to start with, but in production you need something more robust: vLLM, NVIDIA NIM, Kubernetes with the GPU Operator, RBAC, SSO, audit logging.

Throughout the post you will find both perspectives addressed together. When a topic changes significantly between the two contexts, I flag it explicitly. The idea is that you can follow the thread of your own path without having to skip entire sections.

1. Cloud vs. Local: What Are We Actually Talking About?

1.1 How the two approaches work

When you use ChatGPT, Claude, or any other cloud AI service, you are sending your questions and your data to a remote server you do not control. The model processes everything there, sends back a response, and technically someone somewhere has seen that request go through.
A local LLM works completely differently: the model, all its billions of parameters, lives on your servers. The request originates within your perimeter, is processed within your perimeter, and the response returns to the user without ever leaving. This is a fundamental difference, not just in terms of privacy but in terms of architecture.

| Feature | Local LLM | Cloud LLM |
|---|---|---|
| Data privacy | ✅ Total, nothing leaves the network | ⚠️ Depends on the provider’s policies |
| Latency | ✅ Predictable, often < 100ms | ⚠️ Variable (network, provider load) |
| Cost per query | ✅ Fixed (hardware already paid for) | 💲 Pay-per-token, grows with usage |
| Model updates | ⚠️ You decide when to update | ✅ Automatic (but you don’t always want that) |
| Customization | ✅ Full fine-tuning, your model | ⚠️ Limited, often prompt-only |
| Rapid scaling | ⚠️ Requires buying hardware | ✅ Scales in minutes |
| Compliance (GDPR, HIPAA…) | ✅ Much simpler to manage | ⚠️ Requires contracts and audits |

1.2 When local wins hands down

I am not saying the cloud is always the wrong choice: there are contexts where it makes perfect sense. But there are situations where the local option has no real competition:

  • You handle sensitive data: in healthcare, legal, finance, and government, you often simply cannot send certain data to an external provider. That is the end of the discussion. A local LLM is not an option; it is the only option.
  • You make a lot of requests: if your team runs hundreds of thousands of queries per day, the cost per cloud token grows quickly to a point that is hard to justify. Hardware is a one-time expense.
  • You need deep integration: if the model must access your internal databases, company documents, or proprietary systems in real time, doing all of that via the cloud is slow, expensive, and complicated.
  • Your domain is highly specific: no generic model, however large, will ever match one that has been specialized for your sector. And you can only build that specialist if you control the model.
  • Latency matters: real-time applications, voice assistants, embedded systems: an extra 200ms of network latency can be the difference between a product that works and one nobody uses.

2. Which Model Do You Choose?

2.1 The current open-weight model landscape

One very positive development of recent years is that open-weight models (the ones whose weights you can download and use as you see fit) have become genuinely competitive. For most practical tasks, the gap to proprietary cloud models has largely closed. These are the main ecosystems worth knowing:

| Model | Author | Sizes | Why choose it |
|---|---|---|---|
| Llama 3.x | Meta AI | 8B, 70B, 405B | The current benchmark reference: balanced and versatile |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B, 8x22B | Highly efficient MoE architecture, excellent at reasoning |
| Qwen 2.5 | Alibaba | 0.5B – 72B | Native multilingual (great in English and beyond), coding and math |
| Gemma 2 | Google DeepMind | 2B, 9B, 27B | Compact and fast, designed for local deployment |
| Phi-4 | Microsoft | 14B | Small but impressive: trained on exceptionally high-quality data |
| DeepSeek-R1 | DeepSeek | 1.5B – 671B | Best-in-class for chain-of-thought reasoning and complex problems |
| Command R+ | Cohere | 35B, 104B | Built for RAG and tool use: a natural fit with MCP |

2.2 How to choose the right one for you

Benchmarks are a starting point, not an answer

MMLU, HumanEval, GSM8K: you find them on every leaderboard and they are useful for getting a rough sense. But a model that dominates academic benchmarks might be mediocre for your specific use case. The only thing that really matters: build a small test set using real questions from your domain and run the candidates against it. It is not the most sophisticated method, but in practice it works far better than generic benchmarks.

Watch out for licenses

Not all models are free to use however you want. Llama 3, for instance, requires accepting a specific license once you reach certain usage volumes. Qwen 2.5, Mistral, and Gemma each have different terms. Before building anything in production, read the license: it is tedious but necessary.

GGUF, SafeTensors, Ollama: the format matters

For local use, GGUF (used by llama.cpp and Ollama) is the most convenient format: it supports native quantization and runs on any hardware with a few GB of VRAM or even just RAM. For more structured environments with vLLM or TGI, the original SafeTensors from HuggingFace are the standard.

⚙️ QUICK TIP: If you are just getting started and want to try right now: install Ollama, type ‘ollama run llama3.1’ and in five minutes you have a working local LLM. For 80% of experimental use cases, that is all you need to begin.

2.3 Quantization: fitting the model into your VRAM

A 7-billion parameter model in full precision takes around 28 GB of VRAM. Very few consumer GPUs get anywhere near that. Fortunately, quantization exists: you reduce the numerical precision of each parameter, the model takes up far less memory, and quality drops only marginally in almost all real-world use cases.

| Quantization | Memory usage | Quality loss | When to use it |
|---|---|---|---|
| Q8 (8-bit) | ~8 GB per 7B params | Nearly none | When you have plenty of VRAM and want the best quality |
| Q6_K | ~6 GB per 7B params | Negligible | Well-balanced option for pro GPUs |
| Q4_K_M | ~4.5 GB per 7B params | Small, acceptable | The most widely used in practice: works well almost everywhere |
| Q3_K_M | ~3.5 GB per 7B params | Noticeable | When VRAM is tight and you have no other option |
| Q2_K | ~2.7 GB per 7B params | Significant | Experimentation only, not for production |

The practical rule: use the highest-precision quantization your VRAM can hold. Q4_K_M is the most balanced option for the majority of local deployments.

3. How Much GPU Do You Actually Need?

3.1 Hardware: what matters and what does not

Here comes the question everyone asks sooner or later: how much does the hardware cost? The answer: it depends on how many users you have and how large a model you need. But first, let us clarify what actually matters:

  • VRAM: the real bottleneck: all the model weights must fit in VRAM. If they do not, part of the model spills onto system RAM (CPU offloading) and performance drops dramatically. VRAM is the first thing to look at.
  • Memory bandwidth: often more important than capacity: a GPU with a lot of VRAM but low bandwidth can actually be slower than one with less VRAM but very high bandwidth. The RTX 5090, for instance, delivers 1.8 TB/s bandwidth, which makes it exceptionally fast for inference.
  • System RAM: plan for at least twice your VRAM as system RAM, especially when working with long contexts.

3.2 Sizing by number of users

Three variables determine how much hardware you need: how many users send requests simultaneously (not how many total users you have), how long the average context is, and how many tokens per second you need (a fluid chatbot requires at least 15-20 tok/s to feel responsive).

| Concurrent users | Typical scenario | Recommended model | Indicative hardware | Expected throughput |
|---|---|---|---|---|
| 1-5 users | Dev team / prototype | Llama 3.1 8B Q4 | 1x RTX 5090 (32 GB) or 1x RTX PRO 6000 (96 GB) | ~40-60 tok/s |
| 5-20 users | Business team | Llama 3.3 70B Q4 | 2x RTX 5090 or 1x RTX PRO 6000 + 1x A100 40GB | ~20-30 tok/s |
| 20-100 users | SMB or department | Llama 3.3 70B Q6 | 2x RTX PRO 6000 (192 GB total) or 4x A100 40GB | ~25-35 tok/s |
| 100-500 users | Mid-size enterprise | Mixtral 8x22B or 70B | 4x RTX PRO 6000 or 4x A100 80GB | ~30-40 tok/s |
| >500 users | Large enterprise | Multi-node distributed architecture | H100 / A100 80GB cluster | Horizontal scaling |

📊 QUICK FORMULA: Want to estimate VRAM on the fly? (billions of parameters) × (quantization bits / 8) × 1.2 = GB of VRAM needed. Example: Llama 70B with Q4 → 70 × 0.5 × 1.2 = 42 GB. You need at least 2x RTX 5090 or 2x A100 40GB.
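The quick formula is easy to wrap in a few lines of Python. Treat the result as a lower bound: the 1.2 overhead factor only crudely accounts for the KV cache, which grows with context length.

```python
def vram_estimate_gb(params_billions: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM needed to load a model: parameters x bytes-per-param x overhead.

    The overhead factor covers KV cache and runtime buffers; real usage
    grows with context length, so measure before buying hardware.
    """
    return params_billions * (quant_bits / 8) * overhead

# Llama 70B at Q4: 70 x 0.5 x 1.2
print(round(vram_estimate_gb(70, 4), 1))  # 42.0
# Llama 8B at Q4 fits comfortably on a consumer card
print(round(vram_estimate_gb(8, 4), 1))   # 4.8
```
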

3.3 GPU comparison: consumer, pro, and datacenter

Not all GPUs are created equal, and the right choice depends heavily on your context. Here is how to navigate the main options available today:

| GPU | VRAM | Cost range | When to choose it |
|---|---|---|---|
| NVIDIA RTX 5090 | 32 GB GDDR7 | $$ | Excellent for small teams, top consumer performance, 1.8 TB/s bandwidth |
| NVIDIA RTX PRO 6000 Blackwell | 96 GB GDDR7 | $$$$ | The pro GPU par excellence: generous VRAM, ECC, built for continuous workloads |
| NVIDIA RTX 4090 | 24 GB GDDR6X | $$ | Still very capable, great if you find one at a good price |
| NVIDIA A100 40GB | 40 GB HBM2 | $$$$ | Mid-range datacenter GPU, easy NVLink for multi-GPU setups |
| NVIDIA A100 80GB | 80 GB HBM2e | $$$$ | The established reference for enterprise deployments |
| NVIDIA H100 80GB | 80 GB HBM3 | $$$$$ | The absolute top: for those with the budget and a need for maximum throughput |
| AMD RX 7900 XTX | 24 GB GDDR6 | $ | A valid alternative with ROCm, though the ML ecosystem is still less mature |

3.4 Scaling strategies

Vertical scaling

Add GPUs to the same server, or move to more powerful GPUs. This is the simplest approach to manage operationally. The RTX PRO 6000, with its 96 GB, can run a Llama 70B Q4 on a single GPU without any particular complications. The limit is physical: beyond a certain point you simply cannot fit more GPUs into one server.

Horizontal scaling

Two main approaches, with very different levels of complexity:

  • Model parallelism (tensor or pipeline): the model itself is distributed across multiple machines. Necessary for very large models (>70B with smaller GPUs). Requires fast interconnects, such as NVLink between GPUs in the same server and InfiniBand between nodes. Complex to configure.
  • Request parallelism: multiple instances of the same model, each serving a subset of users. Much simpler, ideal when the model fits on a single machine and you want to increase total throughput.

The recommended hybrid approach

For most organizations, the winning combination is: one or two machines with pro GPUs (RTX PRO 6000 or A100) for complex tasks that require the large model, plus a couple of machines with RTX 5090s running multiple instances of a smaller, fine-tuned model for routine requests. You optimize both quality and cost without overcomplicating the infrastructure.

3b. Which Serving Tool Should You Choose?

One of the most important decisions in a local LLM architecture is the serving engine: the software that loads the model into VRAM, handles incoming requests, and returns responses. Not all engines are equal, and the right choice depends heavily on where you are in your journey.

Ollama

Ollama is the ideal starting point for anyone who wants to get going quickly. You install a binary, type ‘ollama run llama3.1’, and in five minutes you have a working LLM with an OpenAI-compatible REST API. It automatically handles model downloads, versioning, quantization, and serving on CPU or GPU. Its limitations emerge when you scale: it does not support continuous batching (requests are processed sequentially), it lacks advanced native multi-GPU management, and its production monitoring and control features are limited. For a team of 5-10 people experimenting, it is perfect. For 100 concurrent users in production, it starts to show cracks.

vLLM

vLLM is the reference engine for production deployments. Developed at UC Berkeley, it implements PagedAttention, a technique that manages VRAM much more efficiently than traditional approaches, enabling significantly higher throughput with the same hardware. It supports continuous batching (multiple requests processed in parallel), multi-GPU with tensor parallelism, advanced quantization (AWQ, GPTQ, FP8), fully OpenAI-compatible API, and native Prometheus metrics. Configuration is more complex than Ollama, but the throughput gain, often 2-5x, more than justifies the investment in environments with real load.

TGI (Text Generation Inference)

TGI is the engine developed by HuggingFace, optimized for models on the Hub. It supports continuous batching, quantization, multi-GPU, and has excellent integration with the HuggingFace ecosystem (including native support for access tokens on gated models). In terms of performance it is comparable to vLLM for most models; the choice between the two often comes down to ecosystem preference or specific feature requirements.

| Feature | Ollama | vLLM | TGI |
|---|---|---|---|
| Ease of setup | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Production throughput | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Continuous batching | No | Yes | Yes |
| Native multi-GPU | Partial | Yes (tensor parallelism) | Yes |
| Advanced quantization | GGUF / Q4-Q8 | AWQ, GPTQ, FP8 | AWQ, GPTQ, EETQ |
| OpenAI-compatible API | Yes | Yes | Yes |
| Prometheus metrics | No | Yes (native) | Yes (native) |
| HuggingFace integration | Good | Excellent | Native |
| Best for | Experimentation, dev teams | Enterprise production, high load | Production, HF ecosystem |
🔀 HOW TO CHOOSE: Simple rule: start with Ollama. When load grows or you need production features (batching, metrics, serious multi-GPU), migrate to vLLM. If you work heavily with HuggingFace models or already have HF infrastructure, consider TGI as an equivalent alternative. All three tools have OpenAI-compatible APIs, so migrating from one to another requires minimal changes to client code.
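Because all three engines speak the same OpenAI API dialect, client code can stay engine-agnostic. A minimal sketch using only the Python standard library; the ports shown are the engines' usual defaults and the model names are illustrative, so adjust both to your deployment:

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (not yet sent).

    Ollama, vLLM, and TGI all accept this same shape, so switching engines
    means changing base_url and the model name, nothing else.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Same client code, different engine behind the URL:
req = chat_request("http://localhost:11434", "llama3.1", "Say hello")  # Ollama
# req = chat_request("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Say hello")  # vLLM
# body = urllib.request.urlopen(req).read()  # uncomment with a server running
```
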

4. Fine-Tuning: Smaller Models, Bigger Results

4.1 The myth of “bigger is always better”

There is a widespread assumption that if you have a hard problem, you need the largest model you can afford. In general terms, that is true. But in your specific domain? Probably not.

A 7B parameter model fine-tuned on thousands of examples from your specific use case will almost always outperform a generic 70B model on that task. The reason is intuitive: large models are generalists. They know a little about everything, but rarely excel in any specific domain. Fine-tuning takes a generalist and turns it into a specialist. And the specialist, in their own field, wins.

💡 WHY IT WORKS: Models specifically fine-tuned on domain data consistently outperform larger general-purpose models on narrow tasks. Published research in biomedical NLP, legal document processing, and code generation all confirm the same pattern: a well-trained 7-14B specialist beats a generic 70B generalist on its own turf, at a fraction of the inference cost.

4.2 Fine-tuning techniques

Full fine-tuning: powerful but expensive

You update all billions of model weights. Maximum adaptability, but requires a lot of VRAM, days of GPU time, and large amounts of data. For most business use cases it is overkill: there are far more efficient alternatives.

LoRA: the optimal trade-off

Low-Rank Adaptation is the technique everyone uses in practice. Instead of touching all the weights, you add small adaptation matrices (the “adapters”) that capture the necessary changes. Results are comparable to full fine-tuning for most tasks, at a fraction of the resource cost. The key parameter is the rank (r): values between 8 and 64 cover 95% of cases.

QLoRA: LoRA with quantization, for modest hardware

You combine LoRA with an already quantized model. The result: you can fine-tune a Llama 70B on a single 32 GB RTX 5090. Until a few years ago that would have been unthinkable.

Instruction tuning and DPO

Instruction tuning teaches the model to follow structured instructions: essential if you want an assistant that responds in a predictable way. DPO (Direct Preference Optimization) is the modern successor to RLHF: it aligns the model to human preferences without the complexity of classical reinforcement learning.

4.3 How to build a good dataset

The uncomfortable truth: 90% of fine-tuning work is not choosing hyperparameters, it is building a good dataset. Quality beats quantity, always.

  • Define the task precisely. Not ‘company assistant’ but ‘classification of support tickets into 15 categories’ or ‘generation of structured legal contract summaries’. The more specific you are, the better it works.
  • Collect real examples. The best examples come from your production history: real queries with responses validated by domain experts. Do not invent synthetic examples if you can use real data.
  • Use a consistent format. System prompt, user, assistant: and never change it within the dataset. Alpaca, ShareGPT, and ChatML are the most common formats; Ollama and vLLM support all of them natively.
  • Clean with rigor. Remove duplicates, ambiguous examples, and low-quality responses. 1,000 excellent examples are worth more than 10,000 mediocre ones, and often produce better results.
  • Always keep a held-out set. Set aside 10-20% of your data before you start. It is the only way to measure whether fine-tuning has actually improved anything or whether you are just overfitting.
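To make the format and held-out points concrete, here is a toy sketch in Python: synthetic ChatML-style records (the ticket text and categories are invented for the demo), a seeded shuffle, and an 85/15 split written out as JSONL.

```python
import json
import random

# Toy dataset in a ChatML-style message schema. Pick one format
# (Alpaca, ShareGPT, or ChatML) and never mix them within a dataset.
examples = [
    {"messages": [
        {"role": "system", "content": "You classify support tickets into categories."},
        {"role": "user", "content": f"Ticket {i}: the invoice PDF downloads blank."},
        {"role": "assistant", "content": "category: billing/documents"},
    ]}
    for i in range(100)
]

random.seed(42)                     # reproducible shuffle
random.shuffle(examples)
cut = int(len(examples) * 0.85)     # keep 15% held out, untouched by training
train, held_out = examples[:cut], examples[cut:]

with open("train.jsonl", "w") as f:
    for ex in train:
        f.write(json.dumps(ex) + "\n")
with open("heldout.jsonl", "w") as f:
    for ex in held_out:
        f.write(json.dumps(ex) + "\n")

print(len(train), len(held_out))    # 85 15
```
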

4.4 Tools to use

| Tool | Type | Why it is useful | When to use it |
|---|---|---|---|
| Unsloth | Optimized QLoRA/LoRA | 2-5x faster than standard HuggingFace, less VRAM | Fine-tuning on RTX 5090 or RTX PRO 6000: the default choice |
| Axolotl | General framework | YAML configuration, flexible, multi-GPU support | Teams with complex requirements or working across multiple models |
| LLaMA-Factory | UI + CLI | Graphical interface, dozens of supported models | Those who prefer a GUI or want to experiment quickly |
| TRL (HuggingFace) | Python library | SFT, DPO, PPO, RLHF: all integrated | ML engineers who want full control |
| MLX (Apple) | Apple Silicon framework | Optimized for Mac M-series (M3 Ultra, M4 Max/Ultra) | Those with a Mac Studio or Mac Pro who want to put it to good use |

5. MCP: Getting Your Model to Actually Do Things

5.1 What MCP is and why you should care

You have your local LLM running. It answers questions, generates text, great. But it remains a passive tool: you ask it something, it responds. What if it could actually do things? Search your database, update a Jira ticket, read a file from the filesystem, send an email?

That is exactly what MCP (Model Context Protocol) enables. It is an open standard, originally developed at Anthropic and now adopted by a growing ecosystem, that defines how an LLM can interact with external systems in a structured and secure way. The best analogy: MCP is to LLMs what USB is to computers, a universal connector that lets you plug any model into any tool without rewriting integration code every time.

🔌 IN PRACTICE: With MCP you can tell your LLM: ‘Search the CRM for customers who have not renewed in the last 90 days, draft a personalized email for each one, and save them to a folder on Google Drive.’ The model executes all three steps in sequence, using the tools you made available to it.

5.2 How MCP is structured

Three pieces work together:

  • MCP Host: your application (the chatbot, the IDE, the internal system). It is what the user sees and interacts with.
  • MCP Client: the component that connects the LLM to the available MCP servers. It sits inside the host and acts as a mediator.
  • MCP Server: each server exposes a set of capabilities: tools (functions the model can call), resources (data it can read), and predefined prompts. Each server is an independent microservice you can write in Python, Node.js, Go, or any other language you prefer.

Communication uses JSON-RPC 2.0, a simple and lightweight protocol. Local servers communicate via stdio; remote ones via HTTP+SSE. The beauty of it is that writing a new MCP server takes fewer than 50 lines of Python with FastMCP.
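To make the wire format concrete, here is a toy JSON-RPC 2.0 exchange in plain Python. The tool name and its arguments are invented for the demo and this is not the full MCP schema; a real server built with FastMCP hides this plumbing entirely.

```python
import json

# An illustrative JSON-RPC 2.0 envelope for a tool call.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "query_crm", "arguments": {"days_since_renewal": 90}},
}

def handle(raw: str) -> dict:
    """Toy server-side dispatch: parse, route, answer with a JSON-RPC result."""
    msg = json.loads(raw)
    if msg.get("method") == "tools/call" and msg["params"]["name"] == "query_crm":
        result = {"customers": ["ACME Srl", "Globex SpA"]}  # stubbed CRM data
        return {"jsonrpc": "2.0", "id": msg["id"], "result": result}
    return {"jsonrpc": "2.0", "id": msg["id"],
            "error": {"code": -32601, "message": "Method not found"}}

response = handle(json.dumps(request))  # round-trip over the "wire"
print(response["result"]["customers"])
```
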

5.3 What the model can concretely do with MCP

Tools: the model takes action

The model calls external functions with structured parameters and receives the result. Practical examples already in use in production:

  • Database queries: “Give me the top 10 customers by Q3 revenue” → the model generates and executes the SQL, returns the result already analyzed.
  • Company filesystem: reading, writing, semantic search across internal documents, without manual file uploads.
  • Business APIs: ERP, CRM, ticketing systems, calendars, email. Everything accessible as if the model had the credentials.
  • Code execution: Python scripts for data analysis, chart generation, transformations. The model writes the code and runs it in a sandbox.

Resources: data the model reads

Data sources that flow directly into the model’s context: updated documents, vector search results over internal knowledge bases, system logs. You do not need to retrain the model every time the data changes; MCP resources update in real time.

Predefined prompts

Reusable templates for recurring scenarios. Useful for standardizing output in contexts where structure matters: weekly reports, meeting summaries, document analysis in a fixed format.

5.4 Integrating MCP with a Local LLM

With cloud models it is straightforward because the API handles everything. With local models you need to do a bit more work, because not all models handle tool calling in the same way. Here is the typical flow:

  • The user makes a request through the UI (Open WebUI, a custom app, a Slack bot).
  • The orchestrator (LangChain, LlamaIndex, or custom code) formats the request, including the JSON definitions of available tools in the system prompt.
  • The local model responds indicating which tool it wants to use and with which parameters, in structured JSON format.
  • The orchestrator intercepts this response, calls the correct MCP server, and receives the result.
  • The result is inserted back into the context, and the model generates the final response for the user.
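The flow above can be sketched in a few lines of Python. Everything here is stubbed: the model, the tool, and the data are placeholders. A real orchestrator adds schema validation, retries, and streaming, but the control flow is the same.

```python
import json

def fake_model(context: list) -> str:
    """Stands in for the local LLM: first turn requests a tool, second answers."""
    if not any(m["role"] == "tool" for m in context):
        return json.dumps({"tool": "get_revenue", "args": {"quarter": "Q3"}})
    return "Top customer in Q3: ACME Srl."

# One stubbed tool; real ones would be MCP server calls.
TOOLS = {"get_revenue": lambda quarter: {"quarter": quarter, "top": "ACME Srl"}}

def run_agent(user_msg: str, max_steps: int = 5) -> str:
    context = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):          # hard cap: never loop forever
        out = fake_model(context)
        try:
            call = json.loads(out)      # structured JSON means a tool request
        except json.JSONDecodeError:
            return out                  # plain text means the final answer
        result = TOOLS[call["tool"]](**call["args"])
        context.append({"role": "tool", "content": json.dumps(result)})
    return "Step limit reached."

print(run_agent("Who was our top customer in Q3?"))
```
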
⚙️ MODELS WITH NATIVE TOOL USE: Not all models handle tool calling the same way. For serious MCP deployments, use models specifically trained for this: Llama 3.1/3.2/3.3, Mistral NeMo, Qwen 2.5, Hermes-3, Command R+. They make an enormous difference in reliability compared to generic models.

5.5 Security: do not overlook it

MCP gives the model the ability to act on real systems. That means you need to treat the model like an untrusted user who has access to your systems, because in a sense that is exactly what it is.

  • Least privilege for every MCP server: the server that reads the CRM should not be able to write to the filesystem. Each server exposes only what is needed, nothing more.
  • Always validate inputs: the parameters the model passes to tools must be validated exactly as you would with any untrusted user input. SQL injection, path traversal: the same rules apply.
  • Audit logging: every tool call must be logged with timestamp, parameters, and result. Not just for security, but for debugging when things go wrong.
  • Sandboxing for code execution: if you have a tool that runs arbitrary code, it must run in an isolated container. Always.
  • Rate limiting on the agent loop: a model that enters a tool-calling loop can cause real damage. Set a limit on the number of calls per session.
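Two of these rules, input validation and least privilege, fit in a short Python sketch. The sandbox root and the ID pattern are hypothetical examples, not a complete security layer.

```python
import re
from pathlib import Path

ALLOWED_ROOT = Path("/srv/llm-docs").resolve()   # hypothetical sandbox root

def safe_read(user_path: str) -> str:
    """Treat the model's path argument like any untrusted input."""
    target = (ALLOWED_ROOT / user_path).resolve()
    if not target.is_relative_to(ALLOWED_ROOT):  # blocks ../../etc/passwd
        raise PermissionError(f"path escapes sandbox: {user_path}")
    return target.read_text()

def safe_customer_id(value: str) -> str:
    """Whitelist validation beats escaping: reject anything unexpected."""
    if not re.fullmatch(r"[A-Z0-9-]{1,32}", value):
        raise ValueError(f"invalid customer id: {value!r}")
    return value
```
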

6. Putting It All Together: A Practical Stack

6.1 The stack for an SMB (10-50 users)

If you are a mid-sized organization looking for a complete local AI assistant, this is the stack that actually works in practice: all open source, all composable:

| Layer | Recommended tool | Why this and not something else |
|---|---|---|
| Model serving | Ollama | Extremely easy to start, OpenAI-compatible API: works as a drop-in replacement |
| Production serving | vLLM | Higher performance, optimized batching, multi-GPU distributed serving |
| Chat UI | Open WebUI | Modern interface, user management, RAG integrated out-of-the-box |
| Orchestration | LangChain / LlamaIndex | Tool calling, RAG pipeline, context management, agent loop |
| MCP server | FastMCP (Python) | 50 lines for a working server, huge ecosystem of examples |
| Vector database (RAG) | Qdrant or Chroma | Semantic search over company documents, fully local |
| Observability | Langfuse | Prompt tracing, latency, costs, quality: essential in production |
| Fine-tuning | Unsloth + Axolotl | Efficient QLoRA on RTX 5090 / RTX PRO 6000, YAML configuration |

6.2 How a typical project unfolds

There are no shortcuts, but the path is fairly standard. Here is how it works in practice:

  • Proof of Concept: install Ollama, pick Llama 3.1 8B or Qwen 2.5 7B, connect Open WebUI. Have 3-5 real users test it for a week. The feedback you collect at this stage is extremely valuable.
  • Model selection: build a benchmark with 50-100 real questions from your domain. Test the candidates, measure quality and speed. Choose based on your data, not on what you read in a blog post.
  • Fine-tuning: collect the dataset, run QLoRA with Unsloth, evaluate on the held-out set, iterate. Integrate the fine-tuned model. This phase alone often doubles the quality perceived by users.
  • MCP integration: identify the 3-5 tools that deliver the most value (typically: internal database, filesystem, CRM API), write the MCP servers, test the agent loop end-to-end.
  • Production and continuous improvement: set up Langfuse, define SLAs, create a process for collecting user feedback and improving the fine-tuning dataset over time. A local LLM that does not improve over time is a missed opportunity.

6.3 Let’s talk about costs: keeping it real

The economic advantage of an on-premise deployment becomes clear over time. Here is an indicative estimate for a team of 30 active users: treat these numbers as an order of magnitude, not as a quote:

| Cost item | Cost range | Notes |
|---|---|---|
| Server with 2x RTX 5090 (or 2x RTX PRO 6000) | $$$ | One-time cost, amortized over 3 years |
| Electricity | $ | Recurring monthly cost, 2x high-TDP GPUs |
| Maintenance and ops | $$ | Recurring monthly cost, part-time DevOps/MLOps |
| Total year 1 (indicative) | $$$$ | Includes hardware + all operating costs |
| Total years 2-3 (opex only) | $$$ | Operating costs only, hardware already amortized |
| Cloud equivalent GPT-4o (estimated) | $$$$$ | Estimated annual cost, 30 active users |
The typical break-even point is between 12 and 18 months. After that, the savings are real and grow proportionally with usage.

6.4 The Enterprise Context: On-Premise Operating Systems and Platforms

If you are working in a medium-to-large organization, the conversation does not end with choosing a GPU and a model. There is an underlying infrastructure layer that in enterprise environments makes all the difference between a deployment that holds up in production and one that becomes an operational headache within a few months.

Operating systems

In enterprise environments, Linux is the de facto standard for servers running LLMs. The most widely used distributions in this context are:

  • Red Hat Enterprise Linux (RHEL): the most common choice in large organizations, especially in regulated sectors like finance and healthcare. It offers commercial support, predictable update cycles, and security certifications (FIPS, Common Criteria). NVIDIA explicitly supports RHEL for its drivers and CUDA stack.
  • Ubuntu Server LTS: widely used in tech-focused companies and mid-sized organizations. The 5-year LTS cycle guarantees stability, and the ML tooling ecosystem on Ubuntu is the most mature available. Ollama, vLLM, and most frameworks treat Ubuntu as their primary reference platform.
  • Rocky Linux / AlmaLinux: open-source alternatives to RHEL, binary-compatible, designed for those who want Red Hat ecosystem stability without the commercial support cost. Heavily used in universities, public institutions, and SMBs with structured IT teams.
  • SUSE Linux Enterprise (SLES): present mainly in SAP environments and European manufacturing. Less common in pure ML contexts, but relevant if the deployment needs to integrate with existing SAP stacks.

Containerization and orchestration

In environments with more than one server or multiple teams sharing infrastructure, direct deployment on bare metal gives way to containerization. The typical layers in an enterprise context are:

  • Docker / Podman: the starting point. Each component of the stack (the model server, MCP servers, vector database) runs in isolated containers. Podman is preferred in RHEL environments for security reasons (daemon-less, rootless by default).
  • Kubernetes (K8s): when you have multiple GPU nodes and want to manage scheduling, scaling, and availability centrally, Kubernetes is the standard. The NVIDIA GPU Operator automates the installation and configuration of GPU drivers on every cluster node, eliminating manual management.
  • Red Hat OpenShift: the enterprise version of Kubernetes, with a management console, advanced RBAC, integrated CI/CD pipelines, and commercial support. Heavily present in banks, insurance companies, and public administration. OpenShift AI (formerly Red Hat OpenShift Data Science) adds ML-specific layers for model deployment.
  • NVIDIA NIM (NVIDIA Inference Microservices): NVIDIA-optimized containers that package model, runtime, and API into a single deployable unit. They significantly reduce time to production in Kubernetes environments and are certified for both RHEL and Ubuntu.

Kubernetes and the NVIDIA GPU Operator: how it works in practice

The GPU Operator is the component that makes Kubernetes truly GPU-aware. Without it, adding a GPU node to the cluster requires manual installation of drivers, the CUDA toolkit, the NVIDIA container runtime, and the device plugin: a lengthy, brittle process that is hard to standardize. The GPU Operator automates all of this as a set of DaemonSets that run on every node.

Once installed, you can request GPUs in your pods with a simple spec: ‘nvidia.com/gpu: 1’ in the resources block of your manifest. Kubernetes handles scheduling the pod onto the right node and guaranteeing exclusive access to the requested GPU. For large models requiring multiple GPUs, you can request ‘nvidia.com/gpu: 4’ and the system handles placement automatically.
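As a concrete illustration, a minimal pod manifest requesting one GPU might look like this (the image tag and model name are placeholders; pin exact versions in production):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest          # placeholder tag: pin a version in production
      args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
      resources:
        limits:
          nvidia.com/gpu: 1                   # the GPU Operator makes this schedulable
```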

Two advanced features particularly useful for LLM deployments are Time-Slicing and MIG (Multi-Instance GPU). Time-Slicing allows multiple pods to share a single GPU in a multiplexed way, useful for lightweight models or low-frequency inference tasks. MIG, available on A100 and H100 GPUs, physically partitions the GPU into isolated instances with dedicated VRAM and compute, guaranteeing full isolation between different workloads, which is essential in multi-tenant environments.
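As an illustration, Time-Slicing is enabled through a ConfigMap that the GPU Operator consumes; the sketch below (namespace and replica count are example values, check the GPU Operator documentation for your version) exposes each physical GPU as four schedulable units:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator      # example namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4        # each physical GPU appears as 4 allocatable GPUs
```

Note that Time-Slicing provides no memory isolation between the sharing pods; when isolation matters, MIG is the right tool.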

NVIDIA AI Enterprise and NIM: NVIDIA’s enterprise deployment platform

NVIDIA AI Enterprise is NVIDIA’s commercial software platform for production AI deployment. It includes enterprise support, guaranteed SLAs, security certifications, and priority access to NIM, the pre-configured containers that dramatically simplify LLM deployment in Kubernetes environments.

A NIM container includes everything needed to run a specific model in an optimized way: the inference engine (based on TensorRT-LLM for maximum performance on NVIDIA GPUs), optimized model weights, an OpenAI-compatible API server, and monitoring metrics. The difference compared to manually configuring vLLM or TGI is significant: a NIM starts with a single docker run command, immediately exposes an OpenAI-compatible endpoint, and guarantees performance optimized for the specific NVIDIA hardware it runs on.
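As an illustration only (the image name, tag, and port are placeholders; the exact invocation for each model is in NVIDIA's NIM documentation on NGC), launching a NIM and querying its OpenAI-compatible endpoint looks roughly like this:

```shell
# Launch a NIM container (requires an NGC API key for the registry and runtime)
docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest   # illustrative image name

# Query the OpenAI-compatible endpoint it exposes
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Because the endpoint speaks the OpenAI API, any existing client library or UI that targets OpenAI can be pointed at it by changing the base URL.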

The NIM catalog available on NGC (NVIDIA GPU Cloud) covers the most widely used models: Llama 3.x, Mistral, Gemma, Phi, and many others. Each NIM is available in variants optimized for different hardware configurations, from a single RTX PRO 6000 to multi-node H100 clusters. For those operating in enterprise contexts where operational simplicity and commercial support matter as much as performance, NIM is often the most pragmatic choice.

MLOps and lifecycle management

In an enterprise context, a model is not a static artifact: it gets updated, compared against previous versions, and monitored over time. The tools that manage this lifecycle are:

  • MLflow: experiment tracking, model versioning, centralized registry. It is the de facto standard for keeping track of which version of the fine-tuned model is in production and with what metrics.
  • Kubeflow: native MLOps platform on Kubernetes. Manages training, serving, and monitoring pipelines in an integrated way. More complex to configure than MLflow, but more powerful in multi-team contexts.
  • Ray / Ray Serve: distributed framework for training and inference on multi-node clusters. Particularly useful when working with large models that require parallelism across multiple GPUs or multiple machines.

Storage and networking

Two aspects often underestimated but critical in enterprise on-premise deployments:

  • Storage for model weights: the weights of a 70B model in Q8 take up around 70 GB. With multiple models and multiple versions, dedicated storage grows rapidly. The most commonly used solutions are Ceph (open-source distributed storage), NetApp ONTAP, or IBM Storage Scale (formerly GPFS) for environments with high performance requirements.
  • High-speed networking: for multi-GPU deployment across different nodes, network bandwidth is critical. InfiniBand (100-400 Gb/s) is the standard in serious HPC and datacenter environments. Alternatively, RoCE (RDMA over Converged Ethernet) delivers similar latencies on existing Ethernet infrastructure at lower cost.
| Layer | Key technologies | Why it matters |
|---|---|---|
| Production OS | RHEL / Ubuntu LTS / Rocky Linux | Stability, NVIDIA driver support, security certifications |
| Container runtime | Docker / Podman | Isolation, reproducibility, consistent deployment across environments |
| Orchestration | Kubernetes + NVIDIA GPU Operator | GPU scheduling, automatic scaling, high availability |
| Enterprise K8s | Red Hat OpenShift / OpenShift AI | RBAC, CI/CD, commercial support, ideal for regulated sectors |
| Optimized serving | NVIDIA NIM | Ready-to-use containers, optimized for inference on NVIDIA GPUs |
| MLOps and versioning | MLflow / Kubeflow | Experiment tracking, model registry, training pipelines |
| Distributed compute | Ray / Ray Serve | Multi-node parallelism for training and inference on clusters |
| Infra monitoring | Prometheus + Grafana + DCGM | GPU metrics (utilization, temperature, memory), alerts, dashboards |
| Model storage | Ceph / NetApp / IBM Storage Scale | High-capacity distributed storage for weights and datasets |
💡 WHERE TO START: If you are in an enterprise context evaluating a first structured on-premise deployment, the most pragmatic starting point is: Ubuntu Server LTS or RHEL + Docker for containers + MLflow for tracking. Kubernetes and more elaborate platforms should be introduced when the number of teams or models in production genuinely justifies it, not before.

6.5 Security and Enterprise Governance

When an LLM stops being an experimental tool and becomes a production business system, security and governance stop being optional. The model has access to sensitive data, generates output that influences decisions, and interacts with critical systems through MCP. Ignoring these aspects is not an option in regulated environments.

Authentication and access control (RBAC and SSO)

In a multi-user deployment, not everyone should have the same level of access to the model or to the available MCP tools. An RBAC (Role-Based Access Control) system lets you define who can do what: a standard user can use the chatbot, a power user can access advanced tools, an administrator can manage models and view logs.
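The three-tier role model described above can be sketched in a few lines; the role names and permission strings below are assumptions for illustration, not an existing API (Open WebUI and API gateways provide equivalent checks built in):

```python
# Minimal sketch of role-based access control for an LLM deployment.
# Roles and permission names are hypothetical placeholders.
from enum import Enum


class Role(Enum):
    USER = "user"              # standard user: chatbot only
    POWER_USER = "power_user"  # can also call MCP tools
    ADMIN = "admin"            # manages models and views logs


PERMISSIONS = {
    Role.USER: {"chat"},
    Role.POWER_USER: {"chat", "mcp_tools"},
    Role.ADMIN: {"chat", "mcp_tools", "manage_models", "view_logs"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the requested action."""
    return action in PERMISSIONS[role]
```

In practice this check sits in the API gateway or the UI layer, so the model server itself never sees an unauthorized request.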

SSO (Single Sign-On) integration via standard protocols like OIDC (OpenID Connect) or SAML 2.0 allows you to connect the LLM system to the existing corporate directory: Active Directory, Okta, Azure AD, Keycloak. Users authenticate with the same corporate credentials, access management is centralized, and when an employee leaves the organization their access is automatically revoked. Open WebUI supports OIDC natively; for vLLM and NIM, authentication is typically managed at the API gateway layer (Kong, Nginx, Traefik).

Audit logging: knowing what happened

In regulated environments (finance, healthcare, public administration), being able to demonstrate who did what, when, and with which data is often a legal requirement as well as an operational one. An audit logging system for LLMs should record at minimum: user identity, timestamp, prompt sent, response received, MCP tools invoked and with which parameters, and session duration.

The challenge is that prompt logs can contain personal or sensitive data, which creates a conflict with privacy regulations (GDPR in Europe, HIPAA in US healthcare). The typical solution is pseudonymization: logs are stored with anonymous identifiers, with a separate mapping table accessible only to authorized administrators and protected by additional access controls. Langfuse, already mentioned for observability, supports this approach natively and can be configured to automatically mask sensitive patterns (credit card numbers, tax IDs, and similar) before archiving.
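To make the pseudonymization idea concrete, here is an illustrative sketch: the regex patterns, salt handling, and log structure are assumptions for demonstration (Langfuse and similar tools provide equivalent masking hooks out of the box):

```python
# Illustrative sketch: pseudonymize the user identity and mask sensitive
# patterns before an audit log entry is archived.
import hashlib
import re

# Hypothetical patterns -- extend for your own data types.
SENSITIVE_PATTERNS = [
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # credit-card-like digit runs
    re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b"),  # Italian tax ID shape
]


def pseudonymize_user(user_id: str, salt: str) -> str:
    """Replace the real identity with a stable anonymous identifier.

    The salt-to-identity mapping lives in a separate, access-controlled table.
    """
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]


def mask_sensitive(text: str) -> str:
    """Mask known sensitive patterns before the prompt is stored."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


entry = {
    "user": pseudonymize_user("mario.rossi", salt="per-deployment-secret"),
    "prompt": mask_sensitive("My card is 4111 1111 1111 1111, can you help?"),
}
```

The key property is that the archived log alone cannot identify the user or expose the masked values, while an authorized administrator holding the mapping can still reconstruct who did what.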

Data security and data residency

One of the main reasons for choosing a local deployment is maintaining full control over data. But that control must be explicitly designed, not taken for granted. Some practical considerations:

  • Data classification: explicitly define which data can be sent to the model and which cannot. A classification system (public, internal, confidential, secret) applied to company documents allows automatically blocking the sending of sensitive data to certain models or endpoints.
  • Encryption at rest and in transit: model weights, logs, and vector database data must be encrypted at rest. Communications between client, API server, and MCP servers must run over TLS. In Kubernetes environments, use Network Policies to limit traffic between pods to only what is strictly necessary.
  • Data residency: in some jurisdictions (Europe in particular) data must physically remain within the territory. A local deployment solves the problem by definition, but make sure that backups, logs, and models are also stored in the correct locations.
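The data classification idea from the first bullet can be sketched as a simple gate; the labels and per-endpoint policy below are hypothetical (real systems read classification from document metadata):

```python
# Minimal sketch of a data classification gate: block documents classified
# above what a given endpoint is cleared to receive.
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    SECRET = 3


# Hypothetical ceilings: the local model may see confidential data,
# an external endpoint only public data.
ENDPOINT_CEILING = {
    "local-llm": Classification.CONFIDENTIAL,
    "external-api": Classification.PUBLIC,
}


def can_send(doc_level: Classification, endpoint: str) -> bool:
    """Return True only if the document's level is within the endpoint's ceiling."""
    return doc_level <= ENDPOINT_CEILING[endpoint]
```

The point of making this explicit in code (or in gateway policy) is that "we never send confidential data to X" becomes an enforced rule rather than a convention.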

Model security and abuse protection

The model itself is an attack surface. Prompt injection, jailbreaking, extracting data from the system context: these are real threats in environments where the model has access to sensitive information or can execute actions via MCP.

  • System prompt protection: the system prompt that defines the model’s behavior should not be visible to the end user or modifiable by them. Manage it server-side and version it to track changes over time.
  • Input and output filtering: implement guardrails on both input (detection of prompt injection or out-of-policy requests) and output (detection of inappropriate content or potential sensitive data leakage). NVIDIA’s NeMo Guardrails is one of the most comprehensive frameworks for this; lighter alternatives include LLM Guard or custom guardrails via a dedicated MCP server.
  • Per-user rate limiting: limit the number of requests per user over time to prevent abuse, systematic model scraping, or application-layer DoS attacks.
  • Container vulnerability scanning: NIM or vLLM containers must be included in the organization’s vulnerability scanning process (Trivy, Grype, or solutions integrated in platforms like OpenShift). A secure model in a vulnerable container is not a secure system.
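The per-user rate limiting bullet above is classically implemented with a token bucket; this is a self-contained sketch (in production the same logic usually lives in the API gateway, e.g. Kong or Nginx, rather than in application code):

```python
# Minimal token-bucket rate limiter, one bucket per user.
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per user id; rate and capacity here are example values.
buckets: dict[str, TokenBucket] = {}


def check_request(user_id: str) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(rate=1.0, capacity=5))
    return bucket.allow()
```

The capacity bounds bursts while the rate bounds sustained throughput, which is exactly what protects the GPU from one user monopolizing inference.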
| Area | Approach | Tools | When to apply |
|---|---|---|---|
| User authentication | OIDC / SAML 2.0 | Keycloak, Okta, Azure AD, Open WebUI OIDC | Required in multi-user setups |
| Access control | RBAC | Open WebUI roles, API Gateway policies, K8s RBAC | Required in enterprise |
| Audit logging | Prompt & tool logging | Langfuse (with pseudonymization), Elasticsearch | Required in regulated sectors |
| Data encryption | TLS + encryption at rest | Cert-manager (K8s), LUKS, encrypted storage | Always recommended |
| Model guardrails | Input/output filtering | NeMo Guardrails, LLM Guard, custom MCP server | Recommended in production |
| Vulnerability scanning | Container scanning | Trivy, Grype, OpenShift built-in | Required in certified environments |
| Rate limiting | Per-user throttling | API Gateway (Kong, Nginx), vLLM rate limits | Recommended in multi-user |
| Network isolation | Network policies | Kubernetes NetworkPolicy, OpenShift SDN | Required in multi-tenant |
⚠️ EXPERIMENTAL vs ENTERPRISE PATH: If you are experimenting with Ollama on a personal machine, you can skip most of this section. If you are building a system that will handle real company data with multiple users, every point on this list is relevant. You do not need to implement everything on day one, but you do need a clear plan for how you will get there.

Conclusions. Is It Worth It? Yes, and Here Is Why

If you have made it this far, you probably already have a concrete scenario in mind where a local LLM would make a real difference. Before closing, it is worth being transparent about how this post came together.
What you have read is the result of extensive research on the subject, built up over time through technical documentation, real-world use cases, comparison of existing architectures, and direct experimentation. This is not an analysis from an AI Engineer’s perspective: you will not find detailed benchmarks, copy-paste code snippets, or low-level optimizations here. The approach is deliberately systemic: the goal was to understand how these tools fit into a real organizational context, which decisions actually matter, and where the friction points lie between theory and production deployment.
This means that some of the architectural choices described here prioritize conceptual clarity over technical depth, and that references to specific technologies should always be verified against the current state of the ecosystem, which evolves rapidly.
That said, the core principles hold. Data privacy, cost control, domain specialization through fine-tuning, integration with internal systems via MCP: these are not arguments that change with the next framework release. They are structural reasons why local deployment is worth seriously considering, regardless of which model or tool happens to be trending six months from now.
The ecosystem has become surprisingly accessible. Ollama, Open WebUI, Unsloth, FastMCP are mature, well-documented tools with active communities. A competent person with a free weekend can have a working system in production. Not perfect, but working, and from there you improve.

Further Reading

  • Ollama: https://ollama.com: the simplest way to get started
  • vLLM: https://github.com/vllm-project/vllm: serious serving for production
  • Open WebUI: https://openwebui.com: the UI that makes you forget about ChatGPT
  • Unsloth: https://github.com/unslothai/unsloth: fast and efficient fine-tuning
  • FastMCP: https://github.com/jlowin/fastmcp: write MCP servers in Python in 5 minutes
  • MCP (official spec): https://modelcontextprotocol.io: everything about the protocol
  • Langfuse: https://langfuse.com: observability for LLMs in production
  • HuggingFace Hub: https://huggingface.co/models: where you download the models
  • Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard: up-to-date benchmarks