Highly Available Storage in Proxmox - Ceph Guide

45,412 views

Jim's Garage

Comments: 111

@TechnoTim 7 months ago

Nice work! Thanks for making this easy! I need to try it out someday!

@Jims-Garage 7 months ago

Thanks, Tim. I'm finding it particularly useful for K3S Servers and my firewall. Having the VMs failover automatically means there's no disruption to the cluster, no pulling pods etc.

@Layer2Clouds 5 months ago

Great video - we support hosted Proxmox clusters in the US and your guides are a go-to for our clients! Thank you, Jim.

@Jims-Garage 5 months ago

@@Layer2Clouds wow, thanks for sharing. That's great to hear.

@ewenchan1239 7 months ago

1) You don't TECHNICALLY need a separate drive; you just need a separate PARTITION that Ceph can take over and fully control. For example, in my OASLOA Mini PC (N95, 16 GB, 512 GB NVMe 2242 M.2 SSD), I partitioned the 512 GB NVMe SSD on each of my 3 nodes so that 128 GB goes to the Proxmox install and local-lvm, and the rest is a separate partition given to Ceph. (My OASLOA Mini PC doesn't HAVE another slot where I can add additional storage devices, so I had to make do with what it has.) Once you have it partitioned like that, you can put the 3 nodes into a Proxmox HA cluster as usual, and then set up the Ceph cluster, also via the Proxmox GUI, for the initial install and your first monitor.

2) re: iGPU passthrough. This is why I DON'T recommend installing any VMs/CTs until the infrastructure has been set up the way you want it. Set up the clustering and Ceph first, THEN set up your VMs/CTs. That way, the IOMMU groups will stabilise and be USABLE for what you're trying to do before you deploy VMs/CTs/services.

@Jims-Garage 7 months ago

Thanks for the tips, I'll consider that on the next deployment.

@ewenchan1239 7 months ago

@@Jims-Garage No problem. In my case, the VM/CT disks are stored on Ceph RBD/CephFS, so the clustering and Ceph had to be up and running before I could do anything else. I know that you are storing the VM/CT disks on local storage rather than on the Ceph storage system, so you were able to start installing VMs/CTs before your Ceph system was set up.

@0xKruzr 7 months ago

yeah, but you don't want to write-exhaust the device if it's also booting the node.

@ewenchan1239 7 months ago

@@0xKruzr Depends on how much traffic you're putting on the system/cluster. In my case, my 3-node HA Proxmox cluster running Ceph exists only to serve Windows AD DC, DNS, and AdGuard Home, and none of that is intensive. The monthly backups are probably more write-intensive than anything else that happens for the rest of the month. (My N95 Mini PC, with only 16 GB of RAM, is too slow to really do much of anything else.)

@MrNGm 5 months ago

In the constrained setup ewenchan1239 describes, using a separate partition on a single drive may be acceptable. Readers with other setups and/or reliability wishes should take into account that Ceph's reliability stems from (among other things) being able to spread data chunks across a larger number of OSDs (object storage daemons), such that the unavailability of 1, 2, or 10 OSDs doesn't impact the cluster. The latter depends on the configured rules regarding failure domains (further reading in the Ceph documentation: CRUSH maps). I would always advise reading a bit more on Ceph, its high-level architecture, and its failure modes. In the setup ewenchan1239 describes (3-replicated Ceph with Proxmox), the cluster will become unavailable if, for example, you're performing maintenance on 1 host and the disk of another one fails. Nevertheless, with a setup where VM data is accessible on all hypervisors through shared (network) storage, maintenance on a single hypervisor becomes a lot simpler.
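The failure scenario described above can be sketched in a few lines. This is a minimal illustration, assuming a replicated pool with `size=3` and `min_size=2` on exactly 3 hosts with a host-level failure domain, so every placement group keeps one copy on each host. The function names are hypothetical, and this is a simplified model, not how Ceph itself evaluates availability.

```python
def replicas_left(hosts_down: int, size: int = 3) -> int:
    """Copies of a placement group still reachable on a 3-host cluster.

    Assumes one replica per host (host-level failure domain) and no
    recovery yet, so each host that is down removes exactly one copy.
    """
    return max(size - hosts_down, 0)


def pool_accepts_io(hosts_down: int, size: int = 3, min_size: int = 2) -> bool:
    """I/O continues only while at least min_size replicas are reachable."""
    return replicas_left(hosts_down, size) >= min_size


for down in range(3):
    print(f"{down} host(s) down -> I/O possible: {pool_accepts_io(down)}")
```

Under these assumptions, one host in maintenance is fine, but one host in maintenance plus a failed disk on a second host leaves a single copy, which is below `min_size`.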

@muhammadabidsaleem7048 7 months ago

Thank you, Jim. Keep posting new videos, especially on SDN please.

@Chris-rm1pn 7 months ago

MS-01s also have vPro, which supports Serial over LAN, so if you lock yourself out and the host doesn't have a GPU to use, you can use that to fix issues.

@Jims-Garage 7 months ago

Thanks, I'm still to get that working. It's quite buggy from my limited trialling.

@Chris-rm1pn 7 months ago

@@Jims-Garage I recommend using MeshCentral and their guides if you haven't tried it; it's the best working solution I've found so far.

@Andy-fd5fg 7 months ago

Long live the serial port! 'Tis a shame they don't have a physical 9-pin serial connector.

@cschwartz 7 months ago

@@Jims-Garage agreed. The implementation is unfortunately lacking and quirky. I loaded the MeshCommander firmware on it to get web-based KVM without needing the MeshCommander software running on a client or a hosted app. Even that had quirks, though it did add functionality. I ended up giving up, going LACP with the 2.5G ports, and reverting to a trusty Raritan IP KVM and a USB TTY console. I never could get the WoL aspect of it working, and it had to be in a booted state to function.

@cschwartz 7 months ago

@@Andy-fd5fg TTY to USB... No need for a DB9.

@davidbuchaca 6 months ago

Very nice and detailed tutorial! abbadon, sanguinius, dorn, proposing names for the following nodes: lion, khan, corax

@Jims-Garage 6 months ago

@@davidbuchaca awesome! Sage choices too!

@Insightfill 7 months ago

Oh! I've been looking forward to this one!

@Jims-Garage 7 months ago

Hope you like it!

@NickS34252 3 months ago

Excellent video - I've been following along while tinkering with my own cluster. When it comes to fast nodes like the MS-01, it's a bit tricky to figure out what to put into ceph vs local storage given the performance limitations.

@Jims-Garage 3 months ago

@@NickS34252 thanks. I totally agree! I'm often scratching my head thinking which should I use.

@johnwalshaw 7 months ago

I opted for 3x Nextorage NEM-PA2TB for 2GB DDR4 SDRAM. Very happy so far. It's great having a 3 node CEPH cluster.

@Jims-Garage 7 months ago

That's great, sounds like a solid setup.

@georgelza 1 month ago

... have you done a video where you expose Ceph storage to a K8S cluster via a CSI driver? I have a Proxmox cluster with Ceph configured over it, running a K8S cluster, and would like to place my shared block storage for the EBS onto my Ceph pool.

@hyperprotagonist 7 months ago

He’s only gone and bloody done it 👏

@Jims-Garage 7 months ago

Haha, thanks. A lot of late nights behind this one for something that on the surface is quite straightforward!

@hyperprotagonist 7 months ago

@@Jims-Garage kudos for persevering. On Twitter you highlighted the setbacks, on Discord you kept everyone reassured, and in the video your demeanour was as if it was merely a hiccup. You weren't lying when you said I didn't know half of it 😂

@nadtz 7 months ago

If I hadn't already built a new proxmox host before the MS01 came out I might have gone this route (though with dedicated hardware for opnsense), it's kind of crazy what minisforum was able to pack into the MS01 for the price and that ceph + proxmox HA is available for home users for free.

@Jims-Garage 7 months ago

I agree. There are quirks but it's impressive.

@Carlos-Rodrigues 2 months ago

I was waiting for this machine for so many years. Now I have 4 MS-01s: 3 for the cluster and another just for OPNsense. It's fast. It's stable. It's amazing. I just wonder if I can create a network with the MS-A1 through Thunderbolt so I can use it as a backup server with PBS.

@DS-ou7xm 7 months ago

It's OK, mate, nothing wrong with having cold and flu symptoms... And awesome video, thanks.

@JonatanCastro 7 months ago

This is amazing, I just got the MS-01 to create some content for my channel, but definitely would love to have the needed hardware to do a CEPH setup. Anyway, I digress; just want to ask you how quick it is to move a CT, considering you can't live migrate them, but on the other hand, the storage is already shared!

@zxxz-ob7ll 5 months ago

The grim reality of the universe requires a grim order. The machine requires perfection. Any error can become a catastrophe

@Jims-Garage 5 months ago

Prophetic

@orgind7778 7 months ago

Thanks great video

@Jims-Garage 7 months ago

Glad you enjoyed it

@johnvandenhurk8650 2 months ago

First of all, I love your videos and have watched many of them. I have had a similar Ceph configuration on an MSI Cubi Proxmox cluster using Samsung 990 Pro NVMe SSDs. I was pretty happy with this until I noticed that, less than six months in, SMART monitoring is failing on two of the NVMe drives. Wearout for the three 990 Pros is 150%, 255%, and 6%. On the Proxmox forum I'm told that this is due to consumer-grade SSDs. The 255% is from the node that does the most I/O, but by no means are these heavily loaded systems. I wonder what your experience is so far with wearout because of Ceph?

@Jims-Garage 2 months ago

@@johnvandenhurk8650 thanks. It does chew through consumer SSDs. Mine is on about 40%, I think it's good for about 4 years in total.

@johnvandenhurk8650 2 months ago

@@Jims-Garage Thanks for the swift response! Perhaps it is only mine that have an issue, but mine are failing within a year. I will reach out to my vendor and create a ticket. I hope yours are better! How happy are you with your MS-01s? I'm considering an upgrade to an MS-01 (i9-12900) cluster for the SFP+.
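A rough way to reason about wearout figures like the ones in this thread is a linear extrapolation from the SMART "percentage used" value. The sketch below assumes wear grows linearly with time, which real workloads only approximate. The function name and the sample drive ages are made up for illustration and are not measurements from this thread.

```python
def projected_lifetime_years(percent_used: float, age_years: float) -> float:
    """Linearly extrapolate the age at which a drive reaches 100% rated wear."""
    if percent_used <= 0:
        raise ValueError("percent_used must be positive to extrapolate")
    return age_years * 100.0 / percent_used


# Hypothetical drive reporting 40% used after 1.6 years of service.
print(round(projected_lifetime_years(40, 1.6), 1))

# Hypothetical drive reporting 150% used after half a year: it passed its
# rated endurance well before the half-year mark.
print(round(projected_lifetime_years(150, 0.5), 2))
```

A value above 100% does not necessarily mean the drive has failed, only that it is past the endurance the vendor rates it for.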

@dimitristsoutsouras2712 7 months ago

Nice presentation of the procedure and of your special-case scenario as well. At the part where you created a CephFS (after you created individual Ceph managers), where is that filesystem created? On the same 1TB NVMe storage? If yes, shouldn't it have some kind of partition separation between VM storage and ISOs, or do those object storage services arrange that automatically (what goes where)?

@rodneykahane4994 5 months ago

not sure what the performance implications are, but the nvme osds that were created were classified as ssd. in the advanced tab, you can manually select the drive type (hdd,ssd, or nvme).

@Jims-Garage 5 months ago

@@rodneykahane4994 thanks, let me check that!

@cschwartz 7 months ago

If you are going to continue to do iGPU passthrough, have you thought of passing a TTY console via USB to serial, that way you can connect up should HW change and pve wants to move around your NIC naming.

@Jims-Garage 7 months ago

Good idea, I'll look into that. Thanks

@MarkConstable 7 months ago

I'm pretty sure if you used the GUI to set up Ceph you would have had fewer problems. I've done it a number of times and did not have to use the CLI at all.

@Jims-Garage 7 months ago

The CLI is necessary for the backhaul network. If it were simply the vmbr0 route then you're right, the GUI would be a good choice.

@amateurwizard 23 days ago

Only the Warhammer nerds noticed... Nice! 😼

@amateurwizard 23 days ago

Cluster: Grimdarkfuture

@Jims-Garage 23 days ago

@@amateurwizard haha, have to let the inner nerd out occasionally

@jeffersonsantos4603 7 months ago

Great job, man. Do you have full network performance for Opnsense via the VirtIO bridges?

@Jims-Garage 7 months ago

Yeah, it maxes out 10Gb via iperf3 and full 2Gb up/down via speedtest.net

@romseaaccthree1448 7 months ago

@@Jims-Garage i'm assuming this is for the same VLAN iperf test. Would you also be able to test iperf results for inter VLAN traffic?

@Copernicus22 7 months ago

Hi, very impressive work! are those ceph benchmark speeds normal though? I was expecting more given 25gbit/NVMe?

@Jims-Garage 7 months ago

Normal for consumer devices. Ceph isn't about performance, it's about reliability. It's perfectly fine from my experience. Anything super heavy you want local.

@Copernicus22 7 months ago

@@Jims-Garage ok thanks, yeah I did it once years ago, I think I had similar results with Ceph using MicroK8s.

@fbifido2 7 months ago

@4:33 - the thunderbolt backhaul does not show up as a network bridge inside Proxmox ???

@Jims-Garage 7 months ago

Eno5 and eno6 are the thunderbolt adapters. You could create a bridge if you wanted.

@vonwerderc 7 months ago

Very interesting. I'm curious how HA with OPNsense would work. Wouldn't the WAN connection from your Modem only go into one node? If that one dies, how would the other nodes be connected?

@Jims-Garage 7 months ago

The WAN connection goes into a switch that splits the internet to the nodes via a VLAN. They are all members.

@headlibrarian1996 7 months ago

How does routing work then? Only one member of the cluster should get the traffic and the switch wouldn’t know which one that is.

@Jims-Garage 7 months ago

@@headlibrarian1996 well there's only one firewall at a time.

@majoryoshi 7 months ago

I could be mistaken on this, but in regards to your HA OPNsense, is there any reason why you couldn't run your WAN into a switch (even an unmanaged one would do the trick) and plug the WAN ports on your nodes into said switch? Since you're doing HA through Proxmox/Ceph and not through OPNsense, I see no reason why that wouldn't work. Please correct me if I'm wrong though.

@Jims-Garage 7 months ago

That's what I'm going to try.

@monish05m 7 months ago

May I ask for a video on how to set up that virtual NIC you have running on your OPNsense? Thanks, and really loved your video.

@sku2007 7 months ago

There's some PCIe passthrough translation in PVE 8, meaning you can set the hardware for each node and then use the "friendly name" in the VM (don't know their wording right now, it's in Datacenter somewhere).

@Jims-Garage 7 months ago

Thanks, wasn't aware of that. I'll take a look

@sku2007 7 months ago

it's called resource mappings, right below metric server

@Jims-Garage 7 months ago

@@sku2007 thanks, I took a look just now and the i226-v isn't on the node. Very odd!

@sku2007 7 months ago

@@Jims-Garage very odd! even when forwarded, the HW gets listed with lspci in host shell. with lspci -v you'll see a line with Kernel driver in use: vfio-pci

@Jims-Garage 7 months ago

@@sku2007 I've tried all of those to no avail. I'm going to load a live Linux installation. If I don't see it, I'll RMA.

@Eli-q5z9h 2 months ago

In the system file /etc/hosts, do I put the IP addresses of the public network or of the Ceph network?

@simuman 7 months ago

Hey Jim, really like your videos. I tried this a few months back and I'm not sure if I got this Ceph system wrong or not, but I couldn't get it to work with external NAS storage connected through a mapped CIFS mount, as HA did not recognize the IP address for the Plex media on failover. Do you know if this is possible, or have I got the wrong end of the stick about HA and how it works?

@Jayroglyph 4 months ago

How would one tap into this Ceph cluster from a Kubernetes cluster running on VMs in the HA Proxmox cluster?

@Jims-Garage 4 months ago

@@Jayroglyph you'd simply select the storage volume on the ceph as the storage volume for the VM. You can see that in my OPNSense video afterwards whereby the OPNSense uses the ceph storage to make it HA with a single node.

@RoiskiaFilms 7 months ago

I just noticed that naming scheme and i am confused. Failbaddon the Harmless and then the two primarchs? Anyway, great video. Looking forward to try this myself in the future.

@Jims-Garage 7 months ago

Thanks 👍 Cadia stands (oh wait!) 😲

@janstasik9094 4 months ago

Hello, may I ask about the stability of the MS-01 since you deployed TB4 and Ceph? I've ordered the boxes, but meanwhile I've read horror stories about the MS-01: how hard it is to deploy vPro, that Proxmox installation is a nightmare, that BIOS upgrades and microcode deployment are nearly unrealistic, how impossible it is to configure and run the TB4 ports, and that overall Ceph and box stability is a nightmare, with reboots every 3 days, etc. What is your real-life experience? Is it worth buying them? From my side, it's the best hardware for a homelab. Thank you.

@Jims-Garage 4 months ago

I haven't had a single issue since buying them about 3 months ago. They've been on all that time, are on stock BIOS, and are running Ceph via TB4. Proxmox installation is the same as on any other device. I don't use vPro as I don't have a need to, but I've heard it's a nightmare. The only issue I had was having to disable ASPM in the BIOS.

@janstasik9094 4 months ago

@@Jims-Garage Thanks...

@kienanvella 7 months ago

You can absolutely run with spinning disks with ceph, but you need quite a few of them, and definitely want some SSD DB/WAL devices. I'm running a cluster of 4 nodes, with 24 spinning disks, 6 per node. 3:1 OSD to DB/WAL drive ratio (3 OSDs share one DB/WAL SSD). Having said that, it's not stupendously fast - especially for my write-heavy workload, but it's fast 'enough'. I've got about 35 guests, which includes a Zabbix server with DB, 3x elasticsearch, and a graylog system. It was quite affordable however, buying used drives in bulk.

@Jims-Garage 7 months ago

That's awesome, thanks for sharing. I'll do some more testing.

@DavidC-rt3or 7 months ago

After setting up somewhat of a test PBS server and backing up the nodes of the cluster, I'm trying to find the steps for restoring a node that is in a cluster and has Ceph, just to make sure all of the needed information was backed up and to know how to restore ahead of time :) Ideas?

3 months ago

@9:33 Try to _ALWAYS_ have a serial console. That never fails.

@BenjaminBenStein 7 months ago

🎉

@snowballeffects 6 months ago

SO... that lock out problem when you pass through the GPU - I have a standby PCI (yup PCI 😂) GPU that I popped into that previously annoyingly unused slot - leaving the original gpu in place. plug in the SVGA monitor 😂 and boom - hello cli 😅

@Jims-Garage 6 months ago

@@snowballeffects nice, that's a good failsafe!

@Irish2086 7 months ago

I have been looking for this answer for a while... How would one figure out the right numbers for a 5-, 7-, or 9-node Ceph configuration? I've only found information about a 3-node config.

@headlibrarian1996 7 months ago

I like 5 more than 3, but 5 MS-01s is fairly pricey and you can't do a full-mesh Thunderbolt network with 5. With five, shutting down a node for maintenance doesn't completely degrade the cluster, and erasure coding works better with more nodes. A 5-node Qotom cluster is interesting because they have 2 SFP+ 10G ports, but I don't know how well it would actually perform. You could have one set of SFP+ interfaces on a dumb switch for the private backhaul network, and you need 5 ports on your main switch for the public-facing interfaces.
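For the question about 5, 7, or 9 nodes, two numbers are easy to work out up front: how many monitors can fail while a majority quorum survives, and how many direct links each node needs for a full mesh. A small sketch, assuming one Ceph monitor per node and a strict-majority quorum. The function names are illustrative, and the figure of 2 Thunderbolt ports per node is an assumption about the hardware.

```python
def tolerated_monitor_failures(monitors: int) -> int:
    """Monitors that may fail while a strict majority still remains."""
    quorum = monitors // 2 + 1
    return monitors - quorum


def full_mesh_ports_per_node(nodes: int) -> int:
    """Direct point-to-point links each node needs in a full mesh."""
    return nodes - 1


TB_PORTS_PER_NODE = 2  # assumption: two Thunderbolt/USB4 ports per mini PC

for n in (3, 5, 7, 9):
    ports = full_mesh_ports_per_node(n)
    fits = ports <= TB_PORTS_PER_NODE
    print(
        f"{n} nodes: tolerates {tolerated_monitor_failures(n)} monitor failure(s), "
        f"needs {ports} mesh ports per node, fits in {TB_PORTS_PER_NODE} ports: {fits}"
    )
```

Under these assumptions only the 3-node case fits a direct full mesh, which matches the point above about needing a switch for the backhaul network at 5 nodes.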

@lsimsdj 3 months ago

My mini PCs have one 512GB NVMe SSD each... Will this not work? Does it mean I need to buy one additional NVMe SSD for each mini PC in the cluster?

@Jims-Garage 3 months ago

Correct, CEPH requires a dedicated drive.

@voldllc9621 7 months ago

I did not see you creating a shared storage for vm and ct disks. Cephfs cannot host these because that gives you posix file storage only, not block storage. You need RADOS block storage.

@Jims-Garage 7 months ago

Thanks, as mentioned that was in the previous video.

@voldllc9621 7 months ago

Sorry, I missed that, probably because I saw you installing Ceph from scratch and, after creating a replicated pool, going straight to CephFS for ISO and CT template file storage. ISOs and CT templates are not crucial for HA.

@DavidC-rt3or 7 months ago

In my setup I've got one CRUSH rule and pool set up on SSDs for the VM disks, and another on HDDs for the VMs' data virtual disks. Not a high volume/performance need.

@cberthe067 7 months ago

Is there no erasure coding in the CRUSH rule?

@Jims-Garage 7 months ago

It's a trade off from my understanding. Erasure coding ensures better replication (data loss prevention) but impacts on performance. As I always abstract my data I'm less worried about it as a long term storage mechanism (more for failover capability).
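The trade-off can be put in numbers. A minimal sketch comparing usable capacity and tolerated losses for a replicated pool versus an erasure-coded pool with k data chunks and m coding chunks. The 3 x 1 TB raw capacity and the k=2, m=1 profile are example values chosen for illustration, not the configuration from the video, and the function names are hypothetical.

```python
def replicated(raw_tb: float, size: int) -> tuple[float, int]:
    """Usable capacity (TB) and copies that can be lost without losing data."""
    return raw_tb / size, size - 1


def erasure_coded(raw_tb: float, k: int, m: int) -> tuple[float, int]:
    """Usable capacity (TB) and chunks that can be lost without losing data."""
    return raw_tb * k / (k + m), m


raw = 3.0  # example: three nodes with one 1 TB OSD each
print("replicated size=3:", replicated(raw, 3))
print("erasure coded k=2, m=1:", erasure_coded(raw, 2, 1))
```

With these example values, erasure coding doubles the usable space but survives the loss of one chunk rather than two copies. With a host-level failure domain, a profile needs at least k + m hosts, so three nodes leave little room for larger profiles.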

@MelroyvandenBerg 6 months ago

is covid back again in the country? blehh.

@Jims-Garage 6 months ago

@@MelroyvandenBerg yeah, I think there has been a summer spike

@dazealex 6 months ago

@@Jims-Garage Even here in California.

@mridulranjan1069 6 months ago

You didn't show or guide through the setup of anything, just talked, showed your face and a couple of screenshots. Seriously man, what CRAP!

@Jims-Garage 6 months ago

@@mridulranjan1069 did you ensure that your monitor was on and that the sound wasn't muted?

@randallsalyer 5 months ago

The fix for your IPv4 is now in the setup documentation; you have it after your source line. Just FYI, hope you see this. Also add this as the last line of the interfaces file, unless there is a sources line, in which case put it immediately before the sources line (or delete the sources line):

/etc/network/interfaces

# This must be the last line in the file unless there is a sources line, in which case put this immediately above the sources line (or delete the sources line)
post-up /usr/bin/systemctl restart frr.service

@Jims-Garage 5 months ago

@@randallsalyer thanks, I will look at that!