21. 03. 2026 Andrea Mariani AI

Reflections on Running LLMs Locally: Why It Is Worth Running Them on Your Own Infrastructure

Model selection, infrastructure sizing, vertical fine-tuning and MCP server integration. All explained without the fluff.

Why Run AI on Your Own Infrastructure?

Let’s be honest: over the past two years, LLMs have evolved from a tool perceived as experimental and reserved for researchers into something companies use every day for concrete, practical tasks. And with that widespread adoption came a question I hear more and more often: do I really have to use the cloud, or can I run this on my own servers?
The short answer is: it depends. But in most cases where the question comes up, the answer is yes, you should at least seriously consider it. Running an LLM locally, on your own hardware, on-premise or in a private datacenter, gives you things the cloud simply cannot: your data never leaves your perimeter, latency is predictable, and after a while the numbers work out much better than they might seem.
In this post I walk you through how all of this works in practice: how to choose the right model, how much GPU you actually need, when it makes sense to fine-tune smaller models, and how to connect your LLM to your business systems via MCP. No theory for its own sake, only practices already in use in real-world environments.

💡 WHO THIS IS FOR: If you already have a rough idea of what an LLM is and you have even just one machine with a decent GPU at home or in the office, you are already halfway there. You do not need to be an ML researcher, just curiosity and a willingness to experiment.

Two Paths, One Goal

Before diving into the details, it is worth being clear about something: this document speaks to two very different types of readers, who often start from the same question but have very different destinations.

  • The experimental path: you are a developer, a small technical team, or simply someone who wants to understand how this works. You have a GPU, you want to try Ollama, you want to see what a local LLM can do without spending a fortune. Your goal is to learn quickly with minimum friction.
  • The enterprise path: you work in a structured organization with security, compliance, high availability, and integration requirements. Ollama is fine to start with, but in production you need something more robust: vLLM, NVIDIA NIM, Kubernetes with the GPU Operator, RBAC, SSO, audit logging.

Throughout the post you will find both perspectives addressed together. When a topic changes significantly between the two contexts, I flag it explicitly. The idea is that you can follow the thread of your own path without having to skip entire sections.

1. Cloud vs. Local: What Are We Actually Talking About?

1.1 How the two approaches work

When you use ChatGPT, Claude, or any other cloud AI service, you are sending your questions and your data to a remote server you do not control. The model processes everything there, sends back a response, and technically someone somewhere has seen that request go through.
A local LLM works completely differently: the model, all its billions of parameters, lives on your servers. The request originates within your perimeter, is processed within your perimeter, and the response returns to the user without ever leaving. This is a fundamental difference, not just in terms of privacy but in terms of architecture.

| Feature | Local LLM | Cloud LLM |
|---|---|---|
| Data privacy | ✅ Total, nothing leaves the network | ⚠️ Depends on the provider’s policies |
| Latency | ✅ Predictable, often < 100ms | ⚠️ Variable (network, provider load) |
| Cost per query | ✅ Fixed (hardware already paid for) | 💲 Pay-per-token, grows with usage |
| Model updates | ⚠️ You decide when to update | ✅ Automatic (but you don’t always want that) |
| Customization | ✅ Full fine-tuning, your model | ⚠️ Limited, often prompt-only |
| Rapid scaling | ⚠️ Requires buying hardware | ✅ Scales in minutes |
| Compliance (GDPR, HIPAA…) | ✅ Much simpler to manage | ⚠️ Requires contracts and audits |

1.2 When local wins hands down

I am not saying the cloud is always the wrong choice: there are contexts where it makes perfect sense. But there are situations where the local option has no real competition:

  • You handle sensitive data: in healthcare, legal, finance, and government, you often simply cannot send certain data to an external provider. That is the end of the discussion. A local LLM is not an option; it is the only option.
  • You make a lot of requests: if your team runs hundreds of thousands of queries per day, the cost per cloud token grows quickly to a point that is hard to justify. Hardware is a one-time expense.
  • You need deep integration: if the model must access your internal databases, company documents, or proprietary systems in real time, doing all of that via the cloud is slow, expensive, and complicated.
  • Your domain is highly specific: no generic model, however large, will ever match one that has been specialized for your sector. And you can only build that specialist if you control the model.
  • Latency matters: real-time applications, voice assistants, embedded systems: an extra 200ms of network latency can be the difference between a product that works and one nobody uses.

2. Which Model Do You Choose?

2.1 The current open-weight model landscape

One very positive development of recent years is that open-weight models (the ones whose weights you can download and use as you see fit) have become genuinely competitive. For most practical tasks, the gap to proprietary cloud models has largely closed. These are the main ecosystems worth knowing:

| Model | Author | Sizes | Why choose it |
|---|---|---|---|
| Llama 3.x | Meta AI | 8B, 70B, 405B | The current benchmark reference: balanced and versatile |
| Mistral / Mixtral | Mistral AI | 7B, 8x7B, 8x22B | Highly efficient MoE architecture, excellent at reasoning |
| Qwen 2.5 | Alibaba | 0.5B – 72B | Native multilingual (great in English and beyond), coding and math |
| Gemma 2 | Google DeepMind | 2B, 9B, 27B | Compact and fast, designed for local deployment |
| Phi-4 | Microsoft | 14B | Small but impressive: trained on exceptionally high-quality data |
| DeepSeek-R1 | DeepSeek | 1.5B – 671B | Best-in-class for chain-of-thought reasoning and complex problems |
| Command R+ | Cohere | 35B, 104B | Built for RAG and tool use: a natural fit with MCP |

2.2 How to choose the right one for you

Benchmarks are a starting point, not an answer

MMLU, HumanEval, GSM8K: you find them on every leaderboard and they are useful for getting a rough sense. But a model that dominates academic benchmarks might be mediocre for your specific use case. The only thing that really matters: build a small test set using real questions from your domain and run the candidates against it. It is not the most sophisticated method, but in practice it works far better than generic benchmarks.

Watch out for licenses

Not all models are free to use however you want. Llama 3, for instance, requires accepting a specific license once you reach certain usage volumes. Qwen 2.5, Mistral, and Gemma each have different terms. Before building anything in production, read the license: it is tedious but necessary.

GGUF, SafeTensors, Ollama: the format matters

For local use, GGUF (used by llama.cpp and Ollama) is the most convenient format: it supports native quantization and runs on any hardware with a few GB of VRAM or even just RAM. For more structured environments with vLLM or TGI, the original SafeTensors from HuggingFace are the standard.

⚙️ QUICK TIP: If you are just getting started and want to try right now: install Ollama, type ‘ollama run llama3.1’ and in five minutes you have a working local LLM. For 80% of experimental use cases, that is all you need to begin.

2.3 Quantization: fitting the model into your VRAM

A 7-billion parameter model in full precision takes around 28 GB of VRAM. Very few consumer GPUs get anywhere near that. Fortunately, quantization exists: you reduce the numerical precision of each parameter, the model takes up far less memory, and quality drops only marginally in almost all real-world use cases.

| Quantization | Memory usage | Quality loss | When to use it |
|---|---|---|---|
| Q8 (8-bit) | ~8 GB per 7B params | Nearly none | When you have plenty of VRAM and want the best quality |
| Q6_K | ~6 GB per 7B params | Negligible | Well-balanced option for pro GPUs |
| Q4_K_M | ~4.5 GB per 7B params | Small, acceptable | The most widely used in practice: works well almost everywhere |
| Q3_K_M | ~3.5 GB per 7B params | Noticeable | When VRAM is tight and you have no other option |
| Q2_K | ~2.7 GB per 7B params | Significant | Experimentation only, not for production |

The practical rule: use the highest-precision quantization your VRAM can hold. Q4_K_M is the most balanced option for the majority of local deployments.

3. How Much GPU Do You Actually Need?

3.1 Hardware: what matters and what does not

Here comes the question everyone asks sooner or later: how much does the hardware cost? The answer: it depends on how many users you have and how large a model you need. But first, let us clarify what actually matters:

  • VRAM: the real bottleneck: all the model weights must fit in VRAM. If they do not, part of the model spills onto system RAM (CPU offloading) and performance drops dramatically. VRAM is the first thing to look at.
  • Memory bandwidth: often more important than capacity: a GPU with a lot of VRAM but low bandwidth can actually be slower than one with less VRAM but very high bandwidth. The RTX 5090, for instance, delivers 1.8 TB/s bandwidth, which makes it exceptionally fast for inference.
  • System RAM: plan for at least twice your VRAM as system RAM, especially when working with long contexts.

3.2 Sizing by number of users

Three variables determine how much hardware you need: how many users send requests simultaneously (not how many total users you have), how long the average context is, and how many tokens per second you need (a fluid chatbot requires at least 15-20 tok/s to feel responsive).

| Concurrent users | Typical scenario | Recommended model | Indicative hardware | Expected throughput |
|---|---|---|---|---|
| 1-5 users | Dev team / prototype | Llama 3.1 8B Q4 | 1x RTX 5090 (32 GB) or 1x RTX PRO 6000 (96 GB) | ~40-60 tok/s |
| 5-20 users | Business team | Llama 3.3 70B Q4 | 2x RTX 5090 or 1x RTX PRO 6000 + 1x A100 40GB | ~20-30 tok/s |
| 20-100 users | SMB or department | Llama 3.3 70B Q6 | 2x RTX PRO 6000 (192 GB total) or 4x A100 40GB | ~25-35 tok/s |
| 100-500 users | Mid-size enterprise | Mixtral 8x22B or 70B | 4x RTX PRO 6000 or 4x A100 80GB | ~30-40 tok/s |
| >500 users | Large enterprise | Multi-node distributed architecture | H100 / A100 80GB cluster | Horizontal scaling |

📊 QUICK FORMULA: Want to estimate VRAM on the fly? (billions of parameters) × (quantization bits / 8) × 1.2 = GB of VRAM needed. Example: Llama 70B with Q4 → 70 × 0.5 × 1.2 = 42 GB. You need at least 2x RTX 5090 or 2x A100 40GB.
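The quick formula is easy to wrap in a few lines of Python. Treat the result as a lower bound: the 1.2 overhead factor only crudely accounts for the KV cache, which grows with context length.

```python
def vram_estimate_gb(params_billions: float, quant_bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM needed to load a model: parameters x bytes-per-param x overhead.

    The overhead factor covers KV cache and runtime buffers; real usage
    grows with context length, so measure before buying hardware.
    """
    return params_billions * (quant_bits / 8) * overhead

# Llama 70B at Q4: 70 x 0.5 x 1.2
print(round(vram_estimate_gb(70, 4), 1))  # 42.0
# Llama 8B at Q4 fits comfortably on a consumer card
print(round(vram_estimate_gb(8, 4), 1))   # 4.8
```
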

3.3 GPU comparison: consumer, pro, and datacenter

Not all GPUs are created equal, and the right choice depends heavily on your context. Here is how to navigate the main options available today:

| GPU | VRAM | Cost range | When to choose it |
|---|---|---|---|
| NVIDIA RTX 5090 | 32 GB GDDR7 | $$ | Excellent for small teams, top consumer performance, 1.8 TB/s bandwidth |
| NVIDIA RTX PRO 6000 Blackwell | 96 GB GDDR7 | $$$$ | The pro GPU par excellence: generous VRAM, ECC, built for continuous workloads |
| NVIDIA RTX 4090 | 24 GB GDDR6X | $$ | Still very capable, great if you find one at a good price |
| NVIDIA A100 40GB | 40 GB HBM2 | $$$$ | Mid-range datacenter GPU, easy NVLink for multi-GPU setups |
| NVIDIA A100 80GB | 80 GB HBM2e | $$$$ | The established reference for enterprise deployments |
| NVIDIA H100 80GB | 80 GB HBM3 | $$$$$ | The absolute top: for those with the budget and a need for maximum throughput |
| AMD RX 7900 XTX | 24 GB GDDR6 | $ | A valid alternative with ROCm, though the ML ecosystem is still less mature |

3.4 Scaling strategies

Vertical scaling

Add GPUs to the same server, or move to more powerful GPUs. This is the simplest approach to manage operationally. The RTX PRO 6000, with its 96 GB, can run a Llama 70B Q4 on a single GPU without any particular complications. The limit is physical: beyond a certain point you simply cannot fit more GPUs into one server.

Horizontal scaling

Two main approaches, with very different levels of complexity:

  • Model parallelism (tensor or pipeline): the model itself is distributed across multiple machines. Necessary for very large models (>70B with smaller GPUs). Requires fast interconnects, such as NVLink between GPUs in the same server and InfiniBand between nodes. Complex to configure.
  • Request parallelism: multiple instances of the same model, each serving a subset of users. Much simpler, ideal when the model fits on a single machine and you want to increase total throughput.

The recommended hybrid approach

For most organizations, the winning combination is: one or two machines with pro GPUs (RTX PRO 6000 or A100) for complex tasks that require the large model, plus a couple of machines with RTX 5090s running multiple instances of a smaller, fine-tuned model for routine requests. You optimize both quality and cost without overcomplicating the infrastructure.

3b. Which Serving Tool Should You Choose?

One of the most important decisions in a local LLM architecture is the serving engine: the software that loads the model into VRAM, handles incoming requests, and returns responses. Not all engines are equal, and the right choice depends heavily on where you are in your journey.

Ollama

Ollama is the ideal starting point for anyone who wants to get going quickly. You install a binary, type ‘ollama run llama3.1’, and in five minutes you have a working LLM with an OpenAI-compatible REST API. It automatically handles model downloads, versioning, quantization, and serving on CPU or GPU. Its limitations emerge when you scale: it does not support continuous batching (requests are processed sequentially), it lacks advanced native multi-GPU management, and its production monitoring and control features are limited. For a team of 5-10 people experimenting, it is perfect. For 100 concurrent users in production, it starts to show cracks.

vLLM

vLLM is the reference engine for production deployments. Developed at UC Berkeley, it implements PagedAttention, a technique that manages VRAM much more efficiently than traditional approaches, enabling significantly higher throughput with the same hardware. It supports continuous batching (multiple requests processed in parallel), multi-GPU with tensor parallelism, advanced quantization (AWQ, GPTQ, FP8), fully OpenAI-compatible API, and native Prometheus metrics. Configuration is more complex than Ollama, but the throughput gain, often 2-5x, more than justifies the investment in environments with real load.

TGI (Text Generation Inference)

TGI is the engine developed by HuggingFace, optimized for models on the Hub. It supports continuous batching, quantization, multi-GPU, and has excellent integration with the HuggingFace ecosystem (including native support for access tokens on gated models). In terms of performance it is comparable to vLLM for most models; the choice between the two often comes down to ecosystem preference or specific feature requirements.

| Feature | Ollama | vLLM | TGI |
|---|---|---|---|
| Ease of setup | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Production throughput | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Continuous batching | No | Yes | Yes |
| Native multi-GPU | Partial | Yes (tensor parallelism) | Yes |
| Advanced quantization | GGUF / Q4-Q8 | AWQ, GPTQ, FP8 | AWQ, GPTQ, EETQ |
| OpenAI-compatible API | Yes | Yes | Yes |
| Prometheus metrics | No | Yes (native) | Yes (native) |
| HuggingFace integration | Good | Excellent | Native |
| Best for | Experimentation, dev teams | Enterprise production, high load | Production, HF ecosystem |
🔀 HOW TO CHOOSE: Simple rule: start with Ollama. When load grows or you need production features (batching, metrics, serious multi-GPU), migrate to vLLM. If you work heavily with HuggingFace models or already have HF infrastructure, consider TGI as an equivalent alternative. All three tools have OpenAI-compatible APIs, so migrating from one to another requires minimal changes to client code.
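Because all three engines speak the same OpenAI API dialect, client code can stay engine-agnostic. A minimal sketch using only the Python standard library; the ports shown are the engines' usual defaults and the model names are illustrative, so adjust both to your deployment:

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (not yet sent).

    Ollama, vLLM, and TGI all accept this same shape, so switching engines
    means changing base_url and the model name, nothing else.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Same client code, different engine behind the URL:
req = chat_request("http://localhost:11434", "llama3.1", "Say hello")  # Ollama
# req = chat_request("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", "Say hello")  # vLLM
# body = urllib.request.urlopen(req).read()  # uncomment with a server running
```
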

4. Fine-Tuning: Smaller Models, Bigger Results

4.1 The myth of “bigger is always better”

There is a widespread assumption that if you have a hard problem, you need the largest model you can afford. In general terms, that is true. But in your specific domain? Probably not.

A 7B parameter model fine-tuned on thousands of examples from your specific use case will almost always outperform a generic 70B model on that task. The reason is intuitive: large models are generalists. They know a little about everything, but rarely excel in any specific domain. Fine-tuning takes a generalist and turns it into a specialist. And the specialist, in their own field, wins.

💡 WHY IT WORKS: Models specifically fine-tuned on domain data consistently outperform larger general-purpose models on narrow tasks. Published research in biomedical NLP, legal document processing, and code generation all confirm the same pattern: a well-trained 7-14B specialist beats a generic 70B generalist on its own turf, at a fraction of the inference cost.

4.2 Fine-tuning techniques

Full fine-tuning: powerful but expensive

You update all billions of model weights. Maximum adaptability, but requires a lot of VRAM, days of GPU time, and large amounts of data. For most business use cases it is overkill: there are far more efficient alternatives.

LoRA: the optimal trade-off

Low-Rank Adaptation is the technique everyone uses in practice. Instead of touching all the weights, you add small adaptation matrices (the “adapters”) that capture the necessary changes. Results are comparable to full fine-tuning for most tasks, at a fraction of the resource cost. The key parameter is the rank (r): values between 8 and 64 cover 95% of cases.

QLoRA: LoRA with quantization, for modest hardware

You combine LoRA with an already quantized model. The result: you can fine-tune a Llama 70B on a single 32 GB RTX 5090. Until a few years ago that would have been unthinkable.

Instruction tuning and DPO

Instruction tuning teaches the model to follow structured instructions: essential if you want an assistant that responds in a predictable way. DPO (Direct Preference Optimization) is the modern successor to RLHF: it aligns the model to human preferences without the complexity of classical reinforcement learning.

4.3 How to build a good dataset

The uncomfortable truth: 90% of fine-tuning work is not choosing hyperparameters, it is building a good dataset. Quality beats quantity, always.

  • Define the task precisely. Not ‘company assistant’ but ‘classification of support tickets into 15 categories’ or ‘generation of structured legal contract summaries’. The more specific you are, the better it works.
  • Collect real examples. The best examples come from your production history: real queries with responses validated by domain experts. Do not invent synthetic examples if you can use real data.
  • Use a consistent format. System prompt, user, assistant: and never change it within the dataset. Alpaca, ShareGPT, and ChatML are the most common formats; Ollama and vLLM support all of them natively.
  • Clean with rigor. Remove duplicates, ambiguous examples, and low-quality responses. 1,000 excellent examples are worth more than 10,000 mediocre ones, and often produce better results.
  • Always keep a held-out set. Set aside 10-20% of your data before you start. It is the only way to measure whether fine-tuning has actually improved anything or whether you are just overfitting.
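To make the format and held-out points concrete, here is a toy sketch in Python: synthetic ChatML-style records (the ticket text and categories are invented for the demo), a seeded shuffle, and an 85/15 split written out as JSONL.

```python
import json
import random

# Toy dataset in a ChatML-style message schema. Pick one format
# (Alpaca, ShareGPT, or ChatML) and never mix them within a dataset.
examples = [
    {"messages": [
        {"role": "system", "content": "You classify support tickets into categories."},
        {"role": "user", "content": f"Ticket {i}: the invoice PDF downloads blank."},
        {"role": "assistant", "content": "category: billing/documents"},
    ]}
    for i in range(100)
]

random.seed(42)                     # reproducible shuffle
random.shuffle(examples)
cut = int(len(examples) * 0.85)     # keep 15% held out, untouched by training
train, held_out = examples[:cut], examples[cut:]

with open("train.jsonl", "w") as f:
    for ex in train:
        f.write(json.dumps(ex) + "\n")
with open("heldout.jsonl", "w") as f:
    for ex in held_out:
        f.write(json.dumps(ex) + "\n")

print(len(train), len(held_out))    # 85 15
```
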

4.4 Tools to use

| Tool | Type | Why it is useful | When to use it |
|---|---|---|---|
| Unsloth | Optimized QLoRA/LoRA | 2-5x faster than standard HuggingFace, less VRAM | Fine-tuning on RTX 5090 or RTX PRO 6000: the default choice |
| Axolotl | General framework | YAML configuration, flexible, multi-GPU support | Teams with complex requirements or working across multiple models |
| LLaMA-Factory | UI + CLI | Graphical interface, dozens of supported models | Those who prefer a GUI or want to experiment quickly |
| TRL (HuggingFace) | Python library | SFT, DPO, PPO, RLHF: all integrated | ML engineers who want full control |
| MLX (Apple) | Apple Silicon framework | Optimized for Mac M-series (M3 Ultra, M4 Max/Ultra) | Those with a Mac Studio or Mac Pro who want to put it to good use |

5. MCP: Getting Your Model to Actually Do Things

5.1 What MCP is and why you should care

You have your local LLM running. It answers questions, generates text, great. But it remains a passive tool: you ask it something, it responds. What if it could actually do things? Search your database, update a Jira ticket, read a file from the filesystem, send an email?

That is exactly what MCP (Model Context Protocol) enables. It is an open standard, originally developed at Anthropic and now adopted by a growing ecosystem, that defines how an LLM can interact with external systems in a structured and secure way. The best analogy: MCP is to LLMs what USB is to computers, a universal connector that lets you plug any model into any tool without rewriting integration code every time.

🔌 IN PRACTICE: With MCP you can tell your LLM: ‘Search the CRM for customers who have not renewed in the last 90 days, draft a personalized email for each one, and save them to a folder on Google Drive.’ The model executes all three steps in sequence, using the tools you made available to it.

5.2 How MCP is structured

Three pieces work together:

  • MCP Host: your application (the chatbot, the IDE, the internal system). It is what the user sees and interacts with.
  • MCP Client: the component that connects the LLM to the available MCP servers. It sits inside the host and acts as a mediator.
  • MCP Server: each server exposes a set of capabilities: tools (functions the model can call), resources (data it can read), and predefined prompts. Each server is an independent microservice you can write in Python, Node.js, Go, or any other language you prefer.

Communication uses JSON-RPC 2.0, a simple and lightweight protocol. Local servers communicate via stdio; remote ones via HTTP+SSE. The beauty of it is that writing a new MCP server takes fewer than 50 lines of Python with FastMCP.
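To make the wire format concrete, here is a toy JSON-RPC 2.0 exchange in plain Python. The tool name and its arguments are invented for the demo and this is not the full MCP schema; a real server built with FastMCP hides this plumbing entirely.

```python
import json

# An illustrative JSON-RPC 2.0 envelope for a tool call.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "query_crm", "arguments": {"days_since_renewal": 90}},
}

def handle(raw: str) -> dict:
    """Toy server-side dispatch: parse, route, answer with a JSON-RPC result."""
    msg = json.loads(raw)
    if msg.get("method") == "tools/call" and msg["params"]["name"] == "query_crm":
        result = {"customers": ["ACME Srl", "Globex SpA"]}  # stubbed CRM data
        return {"jsonrpc": "2.0", "id": msg["id"], "result": result}
    return {"jsonrpc": "2.0", "id": msg["id"],
            "error": {"code": -32601, "message": "Method not found"}}

response = handle(json.dumps(request))  # round-trip over the "wire"
print(response["result"]["customers"])
```
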

5.3 What the model can concretely do with MCP

Tools: the model takes action

The model calls external functions with structured parameters and receives the result. Practical examples already in use in production:

  • Database queries: “Give me the top 10 customers by Q3 revenue” → the model generates and executes the SQL, returns the result already analyzed.
  • Company filesystem: reading, writing, semantic search across internal documents, without manual file uploads.
  • Business APIs: ERP, CRM, ticketing systems, calendars, email. Everything accessible as if the model had the credentials.
  • Code execution: Python scripts for data analysis, chart generation, transformations. The model writes the code and runs it in a sandbox.

Resources: data the model reads

Data sources that flow directly into the model’s context: updated documents, vector search results over internal knowledge bases, system logs. You do not need to retrain the model every time the data changes; MCP resources update in real time.

Predefined prompts

Reusable templates for recurring scenarios. Useful for standardizing output in contexts where structure matters: weekly reports, meeting summaries, document analysis in a fixed format.

5.4 Integrating MCP with a Local LLM

With cloud models it is straightforward because the API handles everything. With local models you need to do a bit more work, because not all models handle tool calling in the same way. Here is the typical flow:

  • The user makes a request through the UI (Open WebUI, a custom app, a Slack bot).
  • The orchestrator (LangChain, LlamaIndex, or custom code) formats the request, including the JSON definitions of available tools in the system prompt.
  • The local model responds indicating which tool it wants to use and with which parameters, in structured JSON format.
  • The orchestrator intercepts this response, calls the correct MCP server, and receives the result.
  • The result is inserted back into the context, and the model generates the final response for the user.
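The flow above can be sketched in a few lines of Python. Everything here is stubbed: the model, the tool, and the data are placeholders. A real orchestrator adds schema validation, retries, and streaming, but the control flow is the same.

```python
import json

def fake_model(context: list) -> str:
    """Stands in for the local LLM: first turn requests a tool, second answers."""
    if not any(m["role"] == "tool" for m in context):
        return json.dumps({"tool": "get_revenue", "args": {"quarter": "Q3"}})
    return "Top customer in Q3: ACME Srl."

# One stubbed tool; real ones would be MCP server calls.
TOOLS = {"get_revenue": lambda quarter: {"quarter": quarter, "top": "ACME Srl"}}

def run_agent(user_msg: str, max_steps: int = 5) -> str:
    context = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):          # hard cap: never loop forever
        out = fake_model(context)
        try:
            call = json.loads(out)      # structured JSON means a tool request
        except json.JSONDecodeError:
            return out                  # plain text means the final answer
        result = TOOLS[call["tool"]](**call["args"])
        context.append({"role": "tool", "content": json.dumps(result)})
    return "Step limit reached."

print(run_agent("Who was our top customer in Q3?"))
```
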
⚙️ MODELS WITH NATIVE TOOL USE: Not all models handle tool calling the same way. For serious MCP deployments, use models specifically trained for this: Llama 3.1/3.2/3.3, Mistral NeMo, Qwen 2.5, Hermes-3, Command R+. They make an enormous difference in reliability compared to generic models.

5.5 Security: do not overlook it

MCP gives the model the ability to act on real systems. That means you need to treat the model like an untrusted user who has access to your systems, because in a sense that is exactly what it is.

  • Least privilege for every MCP server: the server that reads the CRM should not be able to write to the filesystem. Each server exposes only what is needed, nothing more.
  • Always validate inputs: the parameters the model passes to tools must be validated exactly as you would with any untrusted user input. SQL injection, path traversal: the same rules apply.
  • Audit logging: every tool call must be logged with timestamp, parameters, and result. Not just for security, but for debugging when things go wrong.
  • Sandboxing for code execution: if you have a tool that runs arbitrary code, it must run in an isolated container. Always.
  • Rate limiting on the agent loop: a model that enters a tool-calling loop can cause real damage. Set a limit on the number of calls per session.
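Two of these rules, input validation and least privilege, fit in a short Python sketch. The sandbox root and the ID pattern are hypothetical examples, not a complete security layer.

```python
import re
from pathlib import Path

ALLOWED_ROOT = Path("/srv/llm-docs").resolve()   # hypothetical sandbox root

def safe_read(user_path: str) -> str:
    """Treat the model's path argument like any untrusted input."""
    target = (ALLOWED_ROOT / user_path).resolve()
    if not target.is_relative_to(ALLOWED_ROOT):  # blocks ../../etc/passwd
        raise PermissionError(f"path escapes sandbox: {user_path}")
    return target.read_text()

def safe_customer_id(value: str) -> str:
    """Whitelist validation beats escaping: reject anything unexpected."""
    if not re.fullmatch(r"[A-Z0-9-]{1,32}", value):
        raise ValueError(f"invalid customer id: {value!r}")
    return value
```
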

6. Putting It All Together: A Practical Stack

6.1 The stack for an SMB (10-50 users)

If you are a mid-sized organization looking for a complete local AI assistant, this is the stack that actually works in practice: all open source, all composable:

| Layer | Recommended tool | Why this and not something else |
|---|---|---|
| Model serving | Ollama | Extremely easy to start, OpenAI-compatible API: works as a drop-in replacement |
| Production serving | vLLM | Higher performance, optimized batching, multi-GPU distributed serving |
| Chat UI | Open WebUI | Modern interface, user management, RAG integrated out-of-the-box |
| Orchestration | LangChain / LlamaIndex | Tool calling, RAG pipeline, context management, agent loop |
| MCP server | FastMCP (Python) | 50 lines for a working server, huge ecosystem of examples |
| Vector database (RAG) | Qdrant or Chroma | Semantic search over company documents, fully local |
| Observability | Langfuse | Prompt tracing, latency, costs, quality: essential in production |
| Fine-tuning | Unsloth + Axolotl | Efficient QLoRA on RTX 5090 / RTX PRO 6000, YAML configuration |

6.2 How a typical project unfolds

There are no shortcuts, but the path is fairly standard. Here is how it works in practice:

  • Proof of Concept: install Ollama, pick Llama 3.1 8B or Qwen 2.5 7B, connect Open WebUI. Have 3-5 real users test it for a week. The feedback you collect at this stage is extremely valuable.
  • Model selection: build a benchmark with 50-100 real questions from your domain. Test the candidates, measure quality and speed. Choose based on your data, not on what you read in a blog post.
  • Fine-tuning: collect the dataset, run QLoRA with Unsloth, evaluate on the held-out set, iterate. Integrate the fine-tuned model. This phase alone often doubles the quality perceived by users.
  • MCP integration: identify the 3-5 tools that deliver the most value (typically: internal database, filesystem, CRM API), write the MCP servers, test the agent loop end-to-end.
  • Production and continuous improvement: set up Langfuse, define SLAs, create a process for collecting user feedback and improving the fine-tuning dataset over time. A local LLM that does not improve over time is a missed opportunity.

6.3 Let’s talk about costs: keeping it real

The economic advantage of an on-premise deployment becomes clear over time. Here is an indicative estimate for a team of 30 active users: treat these numbers as an order of magnitude, not as a quote:

| Cost item | Cost range | Notes |
|---|---|---|
| Server with 2x RTX 5090 (or 2x RTX PRO 6000) | $$$ | One-time cost, amortized over 3 years |
| Electricity | $ | Recurring monthly cost, 2x high-TDP GPUs |
| Maintenance and ops | $$ | Recurring monthly cost, part-time DevOps/MLOps |
| Total year 1 (indicative) | $$$$ | Includes hardware + all operating costs |
| Total years 2-3 (opex only) | $$$ | Operating costs only, hardware already amortized |
| Cloud equivalent GPT-4o (estimated) | $$$$$ | Estimated annual cost, 30 active users |
The typical break-even point is between 12 and 18 months. After that, the savings are real and grow proportionally with usage.

6.4 The Enterprise Context: On-Premise Operating Systems and Platforms

If you are working in a medium-to-large organization, the conversation does not end with choosing a GPU and a model. There is an underlying infrastructure layer that in enterprise environments makes all the difference between a deployment that holds up in production and one that becomes an operational headache within a few months.

Operating systems

In enterprise environments, Linux is the de facto standard for servers running LLMs. The most widely used distributions in this context are:

  • Red Hat Enterprise Linux (RHEL): the most common choice in large organizations, especially in regulated sectors like finance and healthcare. It offers commercial support, predictable update cycles, and security certifications (FIPS, Common Criteria). NVIDIA explicitly supports RHEL for its drivers and CUDA stack.
  • Ubuntu Server LTS: widely used in tech-focused companies and mid-sized organizations. The 5-year LTS cycle guarantees stability, and the ML tooling ecosystem on Ubuntu is the most mature available. Ollama, vLLM, and most frameworks treat Ubuntu as their primary reference platform.
  • Rocky Linux / AlmaLinux: open-source alternatives to RHEL, binary-compatible, designed for those who want Red Hat ecosystem stability without the commercial support cost. Heavily used in universities, public institutions, and SMBs with structured IT teams.
  • SUSE Linux Enterprise (SLES): present mainly in SAP environments and European manufacturing. Less common in pure ML contexts, but relevant if the deployment needs to integrate with existing SAP stacks.

Containerization and orchestration

In environments with more than one server or multiple teams sharing infrastructure, direct deployment on bare metal gives way to containerization. The typical layers in an enterprise context are:

  • Docker / Podman: the starting point. Each component of the stack (the model server, MCP servers, vector database) runs in isolated containers. Podman is preferred in RHEL environments for security reasons (daemon-less, rootless by default).
  • Kubernetes (K8s): when you have multiple GPU nodes and want to manage scheduling, scaling, and availability centrally, Kubernetes is the standard. The NVIDIA GPU Operator automates the installation and configuration of GPU drivers on every cluster node, eliminating manual management.
  • Red Hat OpenShift: the enterprise version of Kubernetes, with a management console, advanced RBAC, integrated CI/CD pipelines, and commercial support. Heavily present in banks, insurance companies, and public administration. OpenShift AI (formerly Red Hat OpenShift Data Science) adds ML-specific layers for model deployment.
  • NVIDIA NIM (NVIDIA Inference Microservices): NVIDIA-optimized containers that package model, runtime, and API into a single deployable unit. They significantly reduce time to production in Kubernetes environments and are certified for both RHEL and Ubuntu.

Kubernetes and the NVIDIA GPU Operator: how it works in practice

The GPU Operator is the component that makes Kubernetes truly GPU-aware. Without it, adding a GPU node to the cluster requires manual installation of drivers, the CUDA toolkit, the NVIDIA container runtime, and the device plugin: a lengthy, brittle process that is hard to standardize. The GPU Operator automates all of this as a set of DaemonSets that run on every node.

Once installed, you can request GPUs in your pods with a simple spec: ‘nvidia.com/gpu: 1’ in the resources block of your manifest. Kubernetes handles scheduling the pod onto the right node and guaranteeing exclusive access to the requested GPU. For large models requiring multiple GPUs, you can request ‘nvidia.com/gpu: 4’ and the system handles placement automatically.
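As a concrete illustration, a minimal pod manifest requesting one GPU might look like this (the image tag and model name are placeholders; pin exact versions in production):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest          # placeholder tag: pin a version in production
      args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
      resources:
        limits:
          nvidia.com/gpu: 1                   # the GPU Operator makes this schedulable
```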

Two advanced features particularly useful for LLM deployments are Time-Slicing and MIG (Multi-Instance GPU). Time-Slicing allows multiple pods to share a single GPU in a multiplexed way, useful for lightweight models or low-frequency inference tasks. MIG, available on A100 and H100 GPUs, physically partitions the GPU into isolated instances with dedicated VRAM and compute, guaranteeing full isolation between different workloads, which is essential in multi-tenant environments.
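As an illustration, Time-Slicing is enabled through a ConfigMap that the GPU Operator consumes; the sketch below (namespace and replica count are example values, check the GPU Operator documentation for your version) exposes each physical GPU as four schedulable units:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator      # example namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4        # each physical GPU appears as 4 allocatable GPUs
```

Note that Time-Slicing provides no memory isolation between the sharing pods; when isolation matters, MIG is the right tool.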

NVIDIA AI Enterprise and NIM: NVIDIA’s enterprise deployment platform

NVIDIA AI Enterprise is NVIDIA’s commercial software platform for production AI deployment. It includes enterprise support, guaranteed SLAs, security certifications, and priority access to NIM, the pre-configured containers that dramatically simplify LLM deployment in Kubernetes environments.

A NIM container includes everything needed to run a specific model in an optimized way: the inference engine (based on TensorRT-LLM for maximum performance on NVIDIA GPUs), optimized model weights, an OpenAI-compatible API server, and monitoring metrics. The difference compared to manually configuring vLLM or TGI is significant: a NIM starts with a single docker run command, immediately exposes an OpenAI-compatible endpoint, and guarantees performance optimized for the specific NVIDIA hardware it runs on.
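As an illustration only (the image name, tag, and port are placeholders; the exact invocation for each model is in NVIDIA's NIM documentation on NGC), launching a NIM and querying its OpenAI-compatible endpoint looks roughly like this:

```shell
# Launch a NIM container (requires an NGC API key for the registry and runtime)
docker run --rm --gpus all \
  -e NGC_API_KEY=$NGC_API_KEY \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest   # illustrative image name

# Query the OpenAI-compatible endpoint it exposes
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Because the endpoint speaks the OpenAI API, any existing client library or UI that targets OpenAI can be pointed at it by changing the base URL.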

The NIM catalog available on NGC (NVIDIA GPU Cloud) covers the most widely used models: Llama 3.x, Mistral, Gemma, Phi, and many others. Each NIM is available in variants optimized for different hardware configurations, from a single RTX PRO 6000 to multi-node H100 clusters. For those operating in enterprise contexts where operational simplicity and commercial support matter as much as performance, NIM is often the most pragmatic choice.

MLOps and lifecycle management

In an enterprise context, a model is not a static artifact: it gets updated, compared against previous versions, and monitored over time. The tools that manage this lifecycle are:

  • MLflow: experiment tracking, model versioning, centralized registry. It is the de facto standard for keeping track of which version of the fine-tuned model is in production and with what metrics.
  • Kubeflow: native MLOps platform on Kubernetes. Manages training, serving, and monitoring pipelines in an integrated way. More complex to configure than MLflow, but more powerful in multi-team contexts.
  • Ray / Ray Serve: distributed framework for training and inference on multi-node clusters. Particularly useful when working with large models that require parallelism across multiple GPUs or multiple machines.

Storage and networking

Two aspects often underestimated but critical in enterprise on-premise deployments:

  • Storage for model weights: the weights of a 70B model in Q8 take up around 70 GB. With multiple models and multiple versions, dedicated storage grows rapidly. The most commonly used solutions are Ceph (open-source distributed storage), NetApp ONTAP, or IBM Storage Scale (formerly GPFS) for environments with high performance requirements.
  • High-speed networking: for multi-GPU deployment across different nodes, network bandwidth is critical. InfiniBand (100-400 Gb/s) is the standard in serious HPC and datacenter environments. Alternatively, RoCE (RDMA over Converged Ethernet) delivers similar latencies on existing Ethernet infrastructure at lower cost.
| Layer | Key technologies | Why it matters |
|---|---|---|
| Production OS | RHEL / Ubuntu LTS / Rocky Linux | Stability, NVIDIA driver support, security certifications |
| Container runtime | Docker / Podman | Isolation, reproducibility, consistent deployment across environments |
| Orchestration | Kubernetes + NVIDIA GPU Operator | GPU scheduling, automatic scaling, high availability |
| Enterprise K8s | Red Hat OpenShift / OpenShift AI | RBAC, CI/CD, commercial support, ideal for regulated sectors |
| Optimized serving | NVIDIA NIM | Ready-to-use containers, optimized for inference on NVIDIA GPUs |
| MLOps and versioning | MLflow / Kubeflow | Experiment tracking, model registry, training pipelines |
| Distributed compute | Ray / Ray Serve | Multi-node parallelism for training and inference on clusters |
| Infra monitoring | Prometheus + Grafana + DCGM | GPU metrics (utilization, temperature, memory), alerts, dashboards |
| Model storage | Ceph / NetApp / IBM Storage Scale | High-capacity distributed storage for weights and datasets |
💡 WHERE TO START: If you are in an enterprise context evaluating a first structured on-premise deployment, the most pragmatic starting point is: Ubuntu Server LTS or RHEL + Docker for containers + MLflow for tracking. Kubernetes and more elaborate platforms should be introduced when the number of teams or models in production genuinely justifies it, not before.

6.5 Security and Enterprise Governance

When an LLM stops being an experimental tool and becomes a production business system, security and governance stop being optional. The model has access to sensitive data, generates output that influences decisions, and interacts with critical systems through MCP. Ignoring these aspects is not an option in regulated environments.

Authentication and access control (RBAC and SSO)

In a multi-user deployment, not everyone should have the same level of access to the model or to the available MCP tools. An RBAC (Role-Based Access Control) system lets you define who can do what: a standard user can use the chatbot, a power user can access advanced tools, an administrator can manage models and view logs.
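The three-tier role model described above can be sketched in a few lines; the role names and permission strings below are assumptions for illustration, not an existing API (Open WebUI and API gateways provide equivalent checks built in):

```python
# Minimal sketch of role-based access control for an LLM deployment.
# Roles and permission names are hypothetical placeholders.
from enum import Enum


class Role(Enum):
    USER = "user"              # standard user: chatbot only
    POWER_USER = "power_user"  # can also call MCP tools
    ADMIN = "admin"            # manages models and views logs


PERMISSIONS = {
    Role.USER: {"chat"},
    Role.POWER_USER: {"chat", "mcp_tools"},
    Role.ADMIN: {"chat", "mcp_tools", "manage_models", "view_logs"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the requested action."""
    return action in PERMISSIONS[role]
```

In practice this check sits in the API gateway or the UI layer, so the model server itself never sees an unauthorized request.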

SSO (Single Sign-On) integration via standard protocols like OIDC (OpenID Connect) or SAML 2.0 allows you to connect the LLM system to the existing corporate directory: Active Directory, Okta, Azure AD, Keycloak. Users authenticate with the same corporate credentials, access management is centralized, and when an employee leaves the organization their access is automatically revoked. Open WebUI supports OIDC natively; for vLLM and NIM, authentication is typically managed at the API gateway layer (Kong, Nginx, Traefik).

Audit logging: knowing what happened

In regulated environments (finance, healthcare, public administration), being able to demonstrate who did what, when, and with which data is often a legal requirement as well as an operational one. An audit logging system for LLMs should record at minimum: user identity, timestamp, prompt sent, response received, MCP tools invoked and with which parameters, and session duration.

The challenge is that prompt logs can contain personal or sensitive data, which creates a conflict with privacy regulations (GDPR in Europe, HIPAA in US healthcare). The typical solution is pseudonymization: logs are stored with anonymous identifiers, with a separate mapping table accessible only to authorized administrators and protected by additional access controls. Langfuse, already mentioned for observability, supports this approach natively and can be configured to automatically mask sensitive patterns (credit card numbers, tax IDs, and similar) before archiving.
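To make the pseudonymization idea concrete, here is an illustrative sketch: the regex patterns, salt handling, and log structure are assumptions for demonstration (Langfuse and similar tools provide equivalent masking hooks out of the box):

```python
# Illustrative sketch: pseudonymize the user identity and mask sensitive
# patterns before an audit log entry is archived.
import hashlib
import re

# Hypothetical patterns -- extend for your own data types.
SENSITIVE_PATTERNS = [
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # credit-card-like digit runs
    re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b"),  # Italian tax ID shape
]


def pseudonymize_user(user_id: str, salt: str) -> str:
    """Replace the real identity with a stable anonymous identifier.

    The salt-to-identity mapping lives in a separate, access-controlled table.
    """
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]


def mask_sensitive(text: str) -> str:
    """Mask known sensitive patterns before the prompt is stored."""
    for pattern in SENSITIVE_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


entry = {
    "user": pseudonymize_user("mario.rossi", salt="per-deployment-secret"),
    "prompt": mask_sensitive("My card is 4111 1111 1111 1111, can you help?"),
}
```

The key property is that the archived log alone cannot identify the user or expose the masked values, while an authorized administrator holding the mapping can still reconstruct who did what.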

Data security and data residency

One of the main reasons for choosing a local deployment is maintaining full control over data. But that control must be explicitly designed, not taken for granted. Some practical considerations:

  • Data classification: explicitly define which data can be sent to the model and which cannot. A classification system (public, internal, confidential, secret) applied to company documents allows automatically blocking the sending of sensitive data to certain models or endpoints.
  • Encryption at rest and in transit: model weights, logs, and vector database data must be encrypted at rest. Communications between client, API server, and MCP servers must run over TLS. In Kubernetes environments, use Network Policies to limit traffic between pods to only what is strictly necessary.
  • Data residency: in some jurisdictions (Europe in particular) data must physically remain within the territory. A local deployment solves the problem by definition, but make sure that backups, logs, and models are also stored in the correct locations.
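The data classification idea from the first bullet can be sketched as a simple gate; the labels and per-endpoint policy below are hypothetical (real systems read classification from document metadata):

```python
# Minimal sketch of a data classification gate: block documents classified
# above what a given endpoint is cleared to receive.
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    SECRET = 3


# Hypothetical ceilings: the local model may see confidential data,
# an external endpoint only public data.
ENDPOINT_CEILING = {
    "local-llm": Classification.CONFIDENTIAL,
    "external-api": Classification.PUBLIC,
}


def can_send(doc_level: Classification, endpoint: str) -> bool:
    """Return True only if the document's level is within the endpoint's ceiling."""
    return doc_level <= ENDPOINT_CEILING[endpoint]
```

The point of making this explicit in code (or in gateway policy) is that "we never send confidential data to X" becomes an enforced rule rather than a convention.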

Model security and abuse protection

The model itself is an attack surface. Prompt injection, jailbreaking, extracting data from the system context: these are real threats in environments where the model has access to sensitive information or can execute actions via MCP.

  • System prompt protection: the system prompt that defines the model’s behavior should not be visible to the end user or modifiable by them. Manage it server-side and version it to track changes over time.
  • Input and output filtering: implement guardrails on both input (detection of prompt injection or out-of-policy requests) and output (detection of inappropriate content or potential sensitive data leakage). NVIDIA’s NeMo Guardrails is one of the most comprehensive frameworks for this; lighter alternatives include LLM Guard or custom guardrails via a dedicated MCP server.
  • Per-user rate limiting: limit the number of requests per user over time to prevent abuse, systematic model scraping, or application-layer DoS attacks.
  • Container vulnerability scanning: NIM or vLLM containers must be included in the organization’s vulnerability scanning process (Trivy, Grype, or solutions integrated in platforms like OpenShift). A secure model in a vulnerable container is not a secure system.
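The per-user rate limiting bullet above is classically implemented with a token bucket; this is a self-contained sketch (in production the same logic usually lives in the API gateway, e.g. Kong or Nginx, rather than in application code):

```python
# Minimal token-bucket rate limiter, one bucket per user.
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per user id; rate and capacity here are example values.
buckets: dict[str, TokenBucket] = {}


def check_request(user_id: str) -> bool:
    bucket = buckets.setdefault(user_id, TokenBucket(rate=1.0, capacity=5))
    return bucket.allow()
```

The capacity bounds bursts while the rate bounds sustained throughput, which is exactly what protects the GPU from one user monopolizing inference.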
| Area | Approach | Tools | When to apply |
|---|---|---|---|
| User authentication | OIDC / SAML 2.0 | Keycloak, Okta, Azure AD, Open WebUI OIDC | Required in multi-user setups |
| Access control | RBAC | Open WebUI roles, API Gateway policies, K8s RBAC | Required in enterprise |
| Audit logging | Prompt & tool logging | Langfuse (with pseudonymization), Elasticsearch | Required in regulated sectors |
| Data encryption | TLS + encryption at rest | Cert-manager (K8s), LUKS, encrypted storage | Always recommended |
| Model guardrails | Input/output filtering | NeMo Guardrails, LLM Guard, custom MCP server | Recommended in production |
| Vulnerability scanning | Container scanning | Trivy, Grype, OpenShift built-in | Required in certified environments |
| Rate limiting | Per-user throttling | API Gateway (Kong, Nginx), vLLM rate limits | Recommended in multi-user |
| Network isolation | Network policies | Kubernetes NetworkPolicy, OpenShift SDN | Required in multi-tenant |
⚠️ EXPERIMENTAL vs ENTERPRISE PATH: If you are experimenting with Ollama on a personal machine, you can skip most of this section. If you are building a system that will handle real company data with multiple users, every point on this list is relevant. You do not need to implement everything on day one, but you do need a clear plan for how you will get there.

Conclusions. Is It Worth It? Yes, and Here Is Why

If you have made it this far, you probably already have a concrete scenario in mind where a local LLM would make a real difference. Before closing, it is worth being transparent about how this post came together.
What you have read is the result of extensive research on the subject, built up over time through technical documentation, real-world use cases, comparison of existing architectures, and direct experimentation. This is not an analysis from an AI Engineer’s perspective: you will not find detailed benchmarks, copy-paste code snippets, or low-level optimizations here. The approach is deliberately systemic: the goal was to understand how these tools fit into a real organizational context, which decisions actually matter, and where the friction points lie between theory and production deployment.
This means that some of the architectural choices described here prioritize conceptual clarity over technical depth, and that references to specific technologies should always be verified against the current state of the ecosystem, which evolves rapidly.
That said, the core principles hold. Data privacy, cost control, domain specialization through fine-tuning, integration with internal systems via MCP: these are not arguments that change with the next framework release. They are structural reasons why local deployment is worth seriously considering, regardless of which model or tool happens to be trending six months from now.
The ecosystem has become surprisingly accessible. Ollama, Open WebUI, Unsloth, FastMCP are mature, well-documented tools with active communities. A competent person with a free weekend can have a working system in production. Not perfect, but working, and from there you improve.

Further Reading

  • Ollama: https://ollama.com: the simplest way to get started
  • vLLM: https://github.com/vllm-project/vllm: serious serving for production
  • Open WebUI: https://openwebui.com: the UI that makes you forget about ChatGPT
  • Unsloth: https://github.com/unslothai/unsloth: fast and efficient fine-tuning
  • FastMCP: https://github.com/jlowin/fastmcp: write MCP servers in Python in 5 minutes
  • MCP (official spec): https://modelcontextprotocol.io: everything about the protocol
  • Langfuse: https://langfuse.com: observability for LLMs in production
  • HuggingFace Hub: https://huggingface.co/models: where you download the models
  • Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard: up-to-date benchmarks