title: "OpenShift AI Platform - Complete Guide"
author: "Platform Engineering Documentation"
date: "November 2025"
- OpenShift AI Platform - Complete Guide
- Introduction
- Platform Engineering
- 1. What platform engineering is really about
- 2. Key principles from the CNCF whitepaper
- 3. Why OpenShift is a strong foundation for an internal platform
- 4. Mapping CNCF platform capabilities to OpenShift
- 5. Product & app teams vs capability & service providers
- 6. Building a “thinnest viable platform” on OpenShift
- 7. Making it secure and governed by default
- 8. Measuring success of your OpenShift-based platform
- 9. A practical adoption roadmap
- Installation
- Agent-based Installation in Air-Gapped Environments
- 1. What we’re building
- 2. Mirroring OpenShift and ecosystem images into Quay
- 3. Auth & trust: pull secrets and CA bundle
- 4. The install-config.yaml for air-gapped, bare-metal, agent-based
- 5. AgentConfig: describing your hosts
- 6. Telling the installer to use your mirrored release image
- 7. Generating the Agent ISO (air-gapped aware)
- 8. Booting the nodes & running the install
- 9. Verifying the cluster is using your Quay mirror
- 10. Troubleshooting on the rendezvous host
- 11. Summary: your end-to-end air-gapped flow
- Proxy Configuration for Installation
- 1. Why Agent-based installation with a proxy?
- 2. Where the proxy lives: install-config.yaml
- 3. Example: Agent-based install with your proxy
- 4. Creating the Agent-based config files (with proxy)
- 5. Generating the Agent ISO (with proxy embedded)
- 6. Authentication & custom CA during Agent-based install
- 7. How the proxy behaves during and after install
- 8. Verifying your Agent-based install with proxy
- 9. Tips & pitfalls specific to Agent-based installs with proxy
- 10. Summary
- Local Quay Registry Setup
- 1. Where to specify the local Quay registry
- 2. Example: full install-config.yaml with proxy + local Quay
- 3. Make sure auth and TLS match your Quay
- 4. What happens during agent-based install
- 5. Quick sanity checks
- Using oc-mirror with Quay
- 1. Why oc-mirror v2 + Quay?
- 2. Requirements & network allowlist
- 3. Preparing auth & environment for oc-mirror
- 4. Designing the ImageSetConfiguration (4.20 + operators)
- 5. Running oc-mirror v2: Mirror-to-Disk
- 6. Disk-to-Mirror: pushing into Quay
- 7. Applying cluster resources in OpenShift
- 8. Verifying everything is working with Quay
- 9. Operational tips with oc-mirror v2 and Quay
- 10. Common pitfalls & troubleshooting
- 11. Summary
- Cluster Configuration
- Cluster-Wide Proxy Configuration
- 1. What the OpenShift cluster-wide proxy actually does
- 2. The Proxy resource – core fields
- 3. Basic proxy configuration (no authentication)
- 4. Proxy with basic auth
- 5. Custom CA for the proxy (trustedCA)
- 6. Configuring a proxy at install time vs Day 2
- 7. How workloads interact with the cluster proxy
- 8. Verifying the proxy configuration
- 9. Updating or removing the proxy
- 10. Common pitfalls & best practices
- 11. Summary
- NFS Storage Configuration
- 1. What we’re building
- 2. OpenShift + NFS basics (very short theory)
- 3. Setting up the NFS server (RHEL 9 example)
- 4. Cluster-side requirements (OpenShift)
- 5. Deploying nfs-subdir-external-provisioner on OpenShift
- 6. Understanding the result on the NFS server
- 7. Alternative: using a values.yaml instead of --set
- 8. Creating and using a PVC from OpenShift
- 9. Troubleshooting common issues
- 10. Hardening & best practices
- 11. Static NFS PVs vs. dynamic (what you built)
- 12. Summary of your working configuration
- Air-Gapped Operations
- OperatorHub with Local Quay
- 1. Background: OperatorHub, CatalogSource, ClusterCatalog
- 2. Step 1 – Disable the default catalog sources
- 3. Step 2 – Enable Quay-backed catalogs
- 4. Step 3 – Verify catalog health in openshift-marketplace
- 5. Operational tips & gotchas
- 6. Summary: What you have now
- Operators
- Node Feature Discovery (NFD)
- 1. What Node Feature Discovery actually does
- 2. High-level GitOps structure for NFD
- 3. Namespace: isolating the NFD operator
- 4. OperatorGroup: scoping where the operator works
- 5. Subscription: installing NFD from your internal catalog
- 6. GitOps RBAC: allowing Argo CD to manage NFD CRs
- 7. The NodeFeatureDiscovery CR: how you configure NFD
- 8. How this ties into your GPU / platform stack
- 9. Summary
- NVIDIA GPU Operator
- 1. What the NVIDIA GPU Operator does on OpenShift
- 2. Namespace: where the operator and operands live
- 3. OperatorGroup: scoping the operator
- 4. Subscription: install the certified operator from your Quay-backed catalog
- 5. ClusterPolicy: the GPU Operator’s master configuration
- 6. ConfigMap: device plugin config placeholder
- 7. How this ties in with your NFD & GitOps stack
- 8. Summary
- Networking
- InfiniBand and RDMA Configuration
- InfiniBand + RDMA on OpenShift AI with SR-IOV and NVIDIA Network Operator (Legacy Mode)
- 0. Prerequisites & Assumptions
- 1. Node Feature Discovery: Label the Right Nodes
- 2. SR-IOV Network Operator: Preparing the Legacy SR-IOV Path
- 3. NVIDIA Network Operator: Enabling DOCA/OFED in Legacy Mode
- 4. NicClusterPolicy: Deploying DOCA/OFED for InfiniBand
- 5. Defining SR-IOV InfiniBand Resources
- 6. Test Pod: GPU + InfiniBand RDMA
- 7. Troubleshooting Checklist
- 8. Files in This Setup
- 9. Where to Go Next
- Observability
- NVIDIA DCGM for GPU Monitoring
- 1. What is NVIDIA DCGM?
- 2. Key Capabilities of DCGM
- 3. DCGM Exporter: Bridge to Prometheus
- 4. DCGM in Kubernetes and OpenShift
- 5. Prometheus Integration: Scraping DCGM Metrics
- 6. Common DCGM / DCGM Exporter Metrics to Watch
- 7. Use Cases: Beyond “Nice Dashboards”
- 8. Best Practices and Gotchas
- 9. Putting It All Together
- GitOps
- Repository Settings and Certificates
- 1. What “repository settings” mean in OpenShift GitOps
- 2. Step 1 – Add the GitLab TLS certificate (trust gitlab.example.local)
- 3. Step 2 – Add the GitLab repository with credentials and proxy
- 4. How proxy settings actually work for repositories
- 5. Declarative equivalent: TLS cert & repo with proxy as YAML
- 6. Quick health checks & troubleshooting
- 7. Summary: what you’ve achieved
- Applications and ApplicationSets
- 1. Quick mental model
- 2. Git layout: base / envs pattern
- 3. Argo CD Application: point to a single path
- 4. ApplicationSet: generate many Applications from one template
- 5. ApplicationSet for per-environment CSI Isilon
- 6. How this fits GitOps best practices
- 7. When to use Application vs ApplicationSet
- 8. Summary
- Benchmarking
- GPU Benchmarking with Inference-Benchmarker
- Deploying a GPU LLM Benchmark-as-Code Pipeline on Kubernetes with Inference-Benchmarker
- 1. Cluster preparation: GPUs & namespace
- 2. Clone the Inference-Benchmarker Helm chart
- 3. Configure values.yaml for your GPU benchmark
- 4. (Optional) Persist benchmark results with a PVC
- 5. Install the benchmark stack with Helm
- 6. Track benchmark progress
- 7. Collect the JSON results
- 8. Visualize GPU performance with the Gradio dashboard
- 9. Cleanup
- 10. Architecture overview
- 11. Troubleshooting guide
- 12. Next steps: turning this into a benchmark catalog
- Visualizing Benchmark Results
- Visualizing GPU Benchmark Results with the Inference-Benchmarker Dashboard
- 0. Prerequisites
- 1. Grab the result files from the cluster
- 2. Install the dashboard dependencies (one-time)
- 3. Launch the Gradio dashboard
- 4. Open the web UI
- 5. Pro tips for better visualization workflows
- 6. Cleanup
- 7. Summary
- Appendix
- Summary
OpenShift AI Platform Guide
Platform Engineering, GPUs, and Air-Gapped Clusters with OpenShift AI
Build a real AI platform on OpenShift, not just “another Kubernetes cluster.” This guide walks you through air-gapped installs, Quay mirroring, GPUs, InfiniBand, GitOps, and benchmarking, so platform and SRE teams can deliver a secure, observable, high-performance OpenShift AI environment that app teams actually want to use.
About the Book
OpenShift AI Platform Guide is a practical handbook for platform engineers who need to turn OpenShift into a real internal AI platform, not “just a Kubernetes cluster.”
Starting from the CNCF platform engineering whitepaper, the book shows how to apply those ideas on OpenShift: treating the platform as a product, reducing cognitive load for app teams, and building opinionated “golden paths” instead of one-off snowflakes.
From there, you’ll walk through end-to-end, production-grade scenarios:
- Installing OpenShift 4.20 in fully air-gapped environments with a local Quay registry
- Configuring cluster-wide proxies, NFS storage, and disconnected OperatorHub catalogs
- Deploying and managing key operators like Node Feature Discovery and the NVIDIA GPU Operator
- Enabling InfiniBand and RDMA networking with SR-IOV and the NVIDIA Network Operator
- Integrating observability with DCGM, Prometheus, and Grafana for GPU-aware monitoring
- Using GitOps (OpenShift GitOps / Argo CD + GitLab) for declarative, auditable platform config
- Running LLM performance benchmarks as code with Hugging Face’s Inference-Benchmarker and visualizing results with a Gradio dashboard
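To give a taste of those scenarios, here is a minimal sketch of the kind of manifest the book walks through, in this case a CatalogSource that points a disconnected OperatorHub at an operator index mirrored into a local Quay (the registry hostname, index tag, and poll interval below are illustrative placeholders, not values taken from the book):

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  # Operator index mirrored into the local Quay registry;
  # hostname and tag are illustrative placeholders.
  image: quay.example.local/olm/redhat-operator-index:v4.20
  displayName: Red Hat Operators (local Quay)
  publisher: Platform Engineering
  updateStrategy:
    registryPoll:
      interval: 30m   # re-poll the mirrored index for updates
```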
The guide is written in a “do this, then this” style, with YAML examples, command snippets, and explanations of why each piece matters for a modern AI platform.
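Another example in the same vein: the cluster-wide proxy chapters revolve around OpenShift's singleton Proxy resource, which looks roughly like this (the proxy endpoints, noProxy entries, and trustedCA ConfigMap name here are illustrative assumptions):

```yaml
apiVersion: config.openshift.io/v1
kind: Proxy
metadata:
  name: cluster   # the cluster-wide proxy object is a singleton named "cluster"
spec:
  httpProxy: http://proxy.example.local:3128    # placeholder endpoint
  httpsProxy: http://proxy.example.local:3128   # placeholder endpoint
  noProxy: .cluster.local,.svc,10.0.0.0/16      # illustrative exclusions
  trustedCA:
    name: user-ca-bundle   # ConfigMap in openshift-config holding the proxy CA
```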
If you are a platform engineer, SRE, or infrastructure-minded ML practitioner responsible for OpenShift-based GPU clusters, especially in regulated or disconnected environments, this book gives you a concrete, repeatable blueprint.
About the Author
Luca Berton
Luca Berton is an Ansible automation expert who works with JPMorgan Chase & Co. and previously spent three years on the Red Hat Ansible Engineer Team. He is the author of the best-sellers Ansible for VMware by Examples and Ansible for Kubernetes by Examples in the practical Ansible by Examples book series, and the creator of the Ansible Pilot project. With more than 15 years of experience as a system administrator, he has strong expertise in infrastructure hardening and automation. An open source enthusiast, he supports the community by sharing his knowledge at public events. Geek by nature, Linux by choice, Fedora, of course.
