In 2018, I built an 8x4TB WD Red (CMR) RAIDZ2 pool. I bought 4 disks from each of 2 different webstores to limit the risk of manufacturing defects in a particular batch. Recently I needed to buy another 4TB disk for a different system, so I swapped one of the NAS drives with the new purchase. I tested the replacement procedure first in a VM with 8 small virtual disks, and afterwards performed the same steps on the NAS. It was a quick and simple process. It's one thing to design a system a specific way; it's even more important to test whether it works in the real world. Thanks for sharing this process! Keep up the great content. Cheers!
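For anyone wanting to rehearse the same replacement steps without setting up a VM, here is a minimal sketch using throwaway file-backed vdevs on any ZFS-capable shell; the pool and file names are illustrative, not from the video or the comment above.
# create 8 small sparse files and build a disposable RAIDZ2 pool from them
truncate -s 1G /tmp/vdev1.img /tmp/vdev2.img /tmp/vdev3.img /tmp/vdev4.img /tmp/vdev5.img /tmp/vdev6.img /tmp/vdev7.img /tmp/vdev8.img
zpool create testpool raidz2 /tmp/vdev1.img /tmp/vdev2.img /tmp/vdev3.img /tmp/vdev4.img /tmp/vdev5.img /tmp/vdev6.img /tmp/vdev7.img /tmp/vdev8.img
# simulate a failed disk and replace it, mirroring the real procedure
zpool offline testpool /tmp/vdev3.img
truncate -s 1G /tmp/vdev9.img
zpool replace testpool /tmp/vdev3.img /tmp/vdev9.img
zpool status testpool        # watch the (near-instant) resilver complete
# clean up
zpool destroy testpool
rm /tmp/vdev*.img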
@geerliglecluse5297 (29 days ago)
That was actually a super easy job. I had expected a lot more steps in the process to get this job done, frankly.
@Jims-Garage (29 days ago)
Agreed, it's amazing how simple they've made the process.
@billybishop3099 (20 days ago)
Such great timing that you did this video! I'm away on holiday and my TrueNAS sent a notification two days ago about a degraded pool. I'm a little worried because I'm a week from being home, but so far it doesn't seem to be getting worse (and, like you, I'm on RAIDZ2). Plus I bought a spare (but stupidly didn't have it plugged in while I'm away). Thanks for the video and love the channel overall!
@Jims-Garage (20 days ago)
@@billybishop3099 you're most welcome. Hope you manage to replace it without issue on your return.
@Blolwtf (29 days ago)
I just did the same thing 4 days ago: one of the 2TB drives failed, so I figured I might as well upgrade the whole pool with 4 x 14TB enterprise drives. The whole resilvering process took me around 14 hours. Great video as always.
@Jims-Garage (29 days ago)
That's great news, thanks for sharing
@ChrisCebelenski (29 days ago)
Nice to see it in operation - I haven't had to do it in TrueNAS yet, but I've done it a few times on other machines. It's really nerve-wracking when you only have RAID5, with just one drive of buffer against failure. Since then I've always used two parity drives, since my arrays tend to be 8-wide at a minimum, and even three with my latest 12-wide.
@Jims-Garage (29 days ago)
Exactly why I went with raidz2; I'd be too anxious with only a single drive of redundancy.
@paullee107 (29 days ago)
I just got done recovering from an 18TB HDD failure in a ZFS RAIDZ1 pool - I was a bit more scared because I couldn't afford to lose another drive! I use Server Part Deals manufacturer-recertified drives with a 2-year warranty, and they had my RMA complete in 3 business days - very happy. I highly recommend them when users can't budget for new enterprise drives: you get enterprise drives for half the price with great customer service, but half the warranty. :) Glad you made it out alive!!!
@elements88xyz (1 month ago)
Great content as usual, Jim. I'm not running TrueNAS at the moment, but I'm planning to dive into it in the future :)
@Jims-Garage (29 days ago)
Thanks, appreciate the comment. It's a great tool, definitely worth having as part of the homelab.
@philippemiller4740 (29 days ago)
Good job Jim. Next time that happens, I'd suggest giving the new drive a thorough test before putting it in the pool. Unraid has the preclear plugin; I'm not sure what the TrueNAS equivalent is, but the procedure is worth it IMO.
@Jims-Garage (29 days ago)
Good tip, I will look into it.
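There isn't a built-in preclear in TrueNAS, but a rough equivalent (a sketch, not an official procedure) is a destructive badblocks write pass from the shell before the disk joins the pool; the device name below is a placeholder, and the test can take a day or more on a large drive.
# WARNING: -w overwrites the entire disk - only run on a drive that holds no data
badblocks -wsv -b 4096 /dev/sdX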
@homelabjohn (29 days ago)
As always, a great video - easy to follow, and even an idiot like me could follow along :). That case you're running your TrueNAS machine in is terrible though; hope someone offers you a new case, then we all might get a TrueNAS build video and you get a new case.
@Jims-Garage (29 days ago)
@@homelabjohn thank you! Here's hoping, unfortunately there aren't many manufacturers over here...
@wojtek-33 (29 days ago)
I've had a couple of those thin SATA cables go bad somehow; it was reported as the drive being bad. I swapped the cable and the drive was fine. Just offering this as a potential first step before replacing a drive, especially if you are not under warranty.
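One way to sanity-check for a cable problem (a suggestion, not something from the video) is to look at the drive's UDMA CRC error counter, which typically climbs with a bad cable or connector rather than a failing disk; the device name is a placeholder.
# SMART attribute 199 (UDMA_CRC_Error_Count) still rising after a cable reseat points at the cable or port
smartctl -A /dev/sdX | grep -i crc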
@shootinputin6332 (1 month ago)
Sorry to hear about the failed drive, Jim. Totally agree about going for the extra-warranty drives. But at the same time, having an enterprise drive fail within that 5-year period is not a good look for that drive/company. Maybe I've just gotten lucky *touches wood*, but I even have some basic WD Reds with 8+ years of power-on time and no issues. Guess it's just the luck of the draw. The only failure I've had in the last 10 years is a Samsung 850 Pro, which started getting reallocated sectors at 10 years power-on / 150TBW (rated for 300TBW, so a bit disappointing). I've never used Toshiba drives before, but it's good to see the warranty process was painless.
@Jims-Garage (29 days ago)
My Seagate Enterprise drives are now coming up to 8 years (I have 8 of those without a failure). I'm hoping I can squeeze a couple more years out of them, but it's definitely time to think about a replacement...
@shootinputin6332 (29 days ago)
@@Jims-Garage I actually dig your case, btw. I've been running my server in a 2008 Lian Li PC-P80 tower case with SilverStone FS305B (SST-FS305-12G) hot-swap bays. I only recently (Black Friday sale) purchased a SilverStone RM41-506 rackmount case that I can swap the hot-swap bays into (when I can be bothered).
@toddselby443 (29 days ago)
Nice sweater! I check out your videos just for the sweaters. I don't even own a computer.
@Jims-Garage (29 days ago)
Haha! 🤣 I have some surprises in store for the new year
@peterjackson6228 (29 days ago)
I've had to replace a failed drive in my Synology NASes a few times. Fortunately, they're hot-swappable; the drawback being that, unless you go through a process of determining which drive is which and then putting labels on the outside (disk tray), it's sometimes hard to know which disk to pull. I normally do this on a fresh NAS build, once the basic RAID setup has completed and there's no data on the volume/pool etc. It makes it much easier to perform a hot-swap later down the road. On more SMB/SME NAS/SAN solutions there will be disk lights, so the above won't matter so much. 99% of the time, hot-swapping disks in a setup that allows it is not an issue.

On old SANs, when Ops people were told by the SAN people to replace a failed drive in a disk shelf, it could be a real pain, as there were often full-height racks full of disk shelves, sometimes 5 or more cabinets deep. And there would ALWAYS be more than one disk failure in a shelf! I would always go back and say "yep, found the failed disk ID, but did you know that X, Y and Z disks are also showing faults?"... they would go and look and then say "oh, yeah, just as well you didn't just start pulling faulted disks, otherwise we would have lost some random assortment of arrays." Normally, a disk shelf has two or more hot spares; however, depending on the manufacturer of the SAN, those hot spares might not come online to replace a faulted disk - instead it takes a numpty... sorry, I mean a SAN person, to manually assign the hot spare into that storage pool/array. Oh, and never ever believe the different coloured lights either; firmware versions can change what those stupid lights mean!

The only storage crash I've personally been involved in was with a SAN disk shelf that only had one PSU powered (usually they have a minimum of two, and can go as high as six). Anyway, we started loading some additional disks into the live shelf, got the sixth disk in, and the power went off to the shelf. After much swearing and time on the phone with the OEM, we found out that the shelf needed 1 PSU for every 3 disks! Otherwise the power draw is too great and the shelf just powers off so that it doesn't draw too much and pop any breakers etc. It took about 5 days for the tape backups to restore all the data.
@Jims-Garage (29 days ago)
@@peterjackson6228 hey, thanks for posting your experiences. Crazy that the power draw was that high for 3 shelves. Guess they were dense?
@habana7638 (27 days ago)
Good practice is to always label your disks with the serial number, and preferably to only remove the defective disk once resilvering onto the new drive is complete.
@markstanchin1692 (29 days ago)
Looking Sharp Buddy!
@snowballeffects (29 days ago)
Haha, yes, resilvering - it took a week to upgrade all my drives one by one from 1TB to 4TB each - 5 of them! So it's pretty much the same process - mine were hot-swappable though.
@Jims-Garage (29 days ago)
@@snowballeffects that's a big upgrade!
@NetBandit70 (29 days ago)
No mention of the risk of a new drive being bad or going bad shortly after being put into use. The bathtub-shaped curve of drive failure is real.
@Jims-Garage (29 days ago)
No; however, all the remaining drives are well within the safe zone of the bathtub curve.
@XxSpYxX (29 days ago)
I've had this happen to me when I set up my TrueNAS VM (Proxmox) (4x Exos X18 18TB). A drive died after 3 hours of use. If I can find a re-certified seller, I would buy re-certified in the future just because the cost savings are quite high. You should label the serial numbers on the inside of the case instead of having to look through all the drives.
@Jims-Garage (29 days ago)
Thanks, that's a good suggestion. With my current setup I know which batch of 8 or 6 it's in, but if I had more I'd definitely need something like you suggest.
@LanceBryantGrigg (27 days ago)
Downed hard drives are always scary! I've had to deal with some of that in my time. It's pretty rare that they go, and it's pretty rare that they go together; they tend to last for 10+ years at basically constant usage, so when they go after a few days/weeks/months it's an outlier for sure... you basically never have two go at once. Only at "Amazon scale", with millions of drives, do you get daily work resolving them.
@vasquezmi (29 days ago)
Glad things worked out. What case are you using?
@markstanchin1692 (29 days ago)
Thanks Jim, this is a great real-world video. You got me looking at my TrueNAS setup. I'm running 4 disks in raidz1, which is basically RAID 5. And yes, I have heard that with something like Z2 (basically RAID 6), another drive can fail during the rebuild process and you're still covered; some say dual mirrors are better. What are your thoughts? My other question: are the alerts for disk failure automatic, or is it something you had to enable? And is the email (or whatever you use, like Gmail) for receiving the alerts set up automatically, or do you use a different service to receive the alerts?
@Jims-Garage (29 days ago)
I had to set up SMTP to my Gmail account (but I also have Homepage tied into the TrueNAS API). IMO I would always run raidz2; it just isn't worth the headache for me of losing data. I also do nightly cloud backups of essential data for a belt-and-braces approach.
@markstanchin1692 (29 days ago)
Yes, I agree on the Z2 for sure, and I would have gone that way if I had more disks, but with only four I decided on a Z1 raid. Your setup with the dashboard API is also an excellent idea. I don't have a dashboard set up as of yet - I'm sure you probably have a video on that I should check out. Just curious why you had to set up the SMTP with Gmail? And have you ever heard of running a dual-mirror setup - not sure what they call it, Z10 perhaps? From what I understand it's easier to rebuild when a failure occurs, as it's easier on the disks since it only needs to copy the data. Thanks for your time and knowledge, much appreciated!!
@john__johnson (29 days ago)
That's rough. I had a capacitor in my TV blow up this week. At least they were still on sale for the holidays.
@adfjasjhf (26 days ago)
7:55 Can you really combine drives with different amounts of storage? When I was researching TrueNAS vs UnRAID, people mentioned that you can't use different-capacity HDDs in TrueNAS. So technically, if I had a pool of 10TB drives, I could have 12TB, 14TB and 20TB drives along with the 10TB drives and it would all still work fine? This was the main reason I bought an UnRAID license for the future. I want expandability, as I don't care too much about speed.
@Jims-Garage (26 days ago)
Yes, correct - although all drives are limited in usable size to that of the smallest.
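As a rough illustration (the numbers are mine, not from the video): in a 4-wide RAIDZ1 vdev made of three 10TB drives and one 14TB drive, every member is treated as 10TB, so usable capacity is roughly 3 x 10TB = 30TB before filesystem overhead; the extra 4TB only becomes available once all four drives are 14TB or larger and the pool is expanded.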
@WickedFalcon (29 days ago)
Sorry to hear about the drive. It just happened a week ago for me too, but that was on OpenZFS with Proxmox, so I just had to make the changes in the CLI:

zpool status ("for status")
zpool clear tank ("to see if it was a false positive or a random loose connection. Since the error came back, it was very real")
zpool offline tank drive-gonebad
(hot-plug the new drive and initialise it via the Proxmox GUI)
zpool replace tank drive-gonebad newdrive ("to replace the drive and start the resilver process")

The drives were labelled with their serials on the drive bays of my rack unit. I'll reply to this comment with a tool you can run to compare drive health stats for your drive model against data from Backblaze, to get an early warning.
@Jims-Garage (29 days ago)
@@WickedFalcon great, glad you're sorted and thanks for sharing. I'll likely have to do this one day.
@WickedFalcon (29 days ago)
@@Jims-Garage You are so welcome - you've helped me a lot with my projects and prepped me for a move to RKE2 or K8s in general. The Docker container I use on the Proxmox machine to get a clearer view of disk health (it compares your metrics against data from Backblaze) is below; just add drives either by ID or by /dev/ path:

docker run -it --rm -p 21000:8080 -p 21001:8086 -v ~/smartdocker:/opt/scrutiny/config -v ~/smartdocker/influxdb:/opt/scrutiny/influxdb -v /run/udev:/run/udev:ro --cap-add SYS_RAWIO --device=/dev/sda --device=/dev/sdb --device=/dev/sdc --device=/dev/sdd --device=/dev/sde --device=/dev/sdf --device=/dev/sdg --device=/dev/sdh --device=/dev/sdi --device=/dev/sdj --device=/dev/sdk --device=/dev/sdl --device=/dev/sdm --device=/dev/sdn --device=/dev/sdo --device=/dev/sdp --device=/dev/sdq --device=/dev/sdr --device=/dev/sds --device=/dev/sdt --device=/dev/sdu --device=/dev/sdv --device=/dev/nvme0n1 --name scrutiny ghcr.io/analogj/scrutiny:master-omnibus

This is just an example from my own homelab.
@DigiDoc101 (29 days ago)
I had to do this process two weeks ago. Great video! Do you mind sharing the chassis model you use? It looks conveniently easy to replace a drive.
@MartinHiggs84 (29 days ago)
You could use a label printer to put serial numbers on the end of each drive 😊
@Jims-Garage (29 days ago)
@@MartinHiggs84 I have to unplug everything regardless, so it's not too much of an issue. Definitely a good idea if I had hotswap.
@humanglitch5864 (28 days ago)
Hey Jim! Amazing video. I was wondering about this myself and this video came right on time. One query: you mentioned that we can add a drive with a higher capacity than the other disks. In my use case, I have 4x 4TB drives currently running in RAIDZ1. In the future I plan to shift completely to 16TB/20TB drives. Can I just swap out the 4TB drives one by one whilst still maintaining my data? Thanks in advance!
@Jims-Garage (28 days ago)
@@humanglitch5864 yes, exactly that. Add one at a time and wait until the resilver completes.
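A minimal sketch of that one-at-a-time upgrade from the shell, assuming a pool named tank and placeholder disk IDs (the TrueNAS GUI exposes the same steps):
zpool set autoexpand=on tank                  # let the pool grow once every member has been upsized
# repeat for each old disk, waiting for the resilver to finish in between:
zpool replace tank old-4tb-disk-id new-16tb-disk-id
zpool status tank                             # confirm the resilver completed with no errors before the next swap
# if autoexpand was off during the swaps, expand each member afterwards:
zpool online -e tank new-16tb-disk-id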
@TrentSnyder308 (29 days ago)
Where possible, I've been told it's better to leave the old (failing) drive in place, add the new drive (hanging out the side, if need be), then do the Replace function in the UI with both drives attached and active. This will *clone* from the old drive to the new drive, rather than recalculating from all the drives, so there's less wear and tear on the rest of the drives in the vdev, it's faster, and it's a little safer. "So I've been told." Once the cloning is finished, *then* you offline the bad drive and remove it from the system. All that said, I don't know how ZFS deals with bad sectors during the cloning... that bit would *have* to be calculated from the rest of the data in the vdev, and I think ZFS handles this situation intelligently. And as others have mentioned, it's good to do a burn-in on the new drive before putting it "into production" - another plug for having a spare drive on hand that's pre-vetted. You just have to be able to afford the extra drive, right? :)
@Jims-Garage (29 days ago)
Definitely a good idea and what I've done in enterprise before. For a homelab it's not always possible to have a spare but it's something I'm considering.
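For reference, a sketch of what that in-place replacement looks like at the command line, with pool and device names as placeholders (the GUI's Replace button drives the same operation): with the old disk still attached, ZFS forms a temporary mirror of old and new, copies what it can from the old disk, reconstructs any unreadable blocks from parity, and detaches the old disk when the resilver finishes.
zpool replace tank failing-disk-id new-disk-id   # both disks connected; no prior offline needed
zpool status tank                                # shows a temporary "replacing" vdev while the resilver runs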
@SwaggerMeister72 (29 days ago)
Sorry if you said this in the video, but what case are you using for the NAS? I'm trying to find one to rackmount because I want it off the ground.
@Jims-Garage (29 days ago)
This is from servercaseuk; it's a cheap one. I don't recommend it if you can afford hotswap, but it does the job otherwise.
@Marty-e8s (29 days ago)
Great video, Jim. Curious about your thoughts on running badblocks. Did you run it before adding the drive to your pool, and is it a consideration to skip it when a drive fails, since the test is so long (potentially leaving you without redundancy for the duration)?
@Jims-Garage (29 days ago)
I tend to think adding the drive first and letting it attempt resilvering is the best approach. I believe it tests the drive during the process, so it's kind of a 2-in-1. The main thing for me is always to rebuild the pool.
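A lighter-weight compromise (my suggestion, not something from the video) is to kick off a SMART extended self-test on the new disk around the time it resilvers, so you get a surface scan without delaying the rebuild by a full badblocks pass; the device name is a placeholder.
smartctl -t long /dev/sdX    # start the extended self-test; it runs in the background on the drive itself
smartctl -a /dev/sdX         # later: check the self-test execution status and self-test log for errors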
@JaleTechAndGaming (29 days ago)
Almost no one likes to risk a journey into disaster-recovery land. Recovering from backups in a production environment is not fun. It's always great when the RAID recovers successfully. I've lost hundreds of enterprise drives over the past 25 years. Once I lost 3 enterprise drives at the same time on a main database server that was less than a year old. We had to recover from backups. It was a lot of fun (sarcasm). A few times I've had other enterprise drives in the RAID set fail while rebuilding, losing the RAID set - also lovely when that happens.
@Jims-Garage (29 days ago)
Yes, I've often heard horror stories from the enterprise world. Definitely a case of the more the merrier (in terms of drives and redundancy).
@davidr8424 (6 days ago)
Hi Jim, I use Unraid as well as the MS01 setup with Proxmox (Unraid as a NAS only). Do you have a video on sharing folders from TrueNAS to a Proxmox LXC (I expect it's the same as with TrueNAS)? Also, I have a new 48 Pro UniFi switch and I use pfSense on a Supermicro. I've set up the UniFi controller on one of the MS01 nodes, and I'm having a headache getting it going. I have the controller seeing the switch on its own bridged port, and I've set up pfSense with VLANs, but I can't get the UniFi switch to stay on a management VLAN - it keeps defaulting to 192.168.1.20. I try to set the switch's management IP to 10.10.10.2 and the controller wants to adopt it again, causing a reset of the switch and defaulting it back to 192.168.1.20. It's sending me bonkers; Cisco switches allow the default switch IP to be changed.
@try-that (1 month ago)
Where's your Christmas jumper? 😊
@insu_na (29 days ago)
If you find the time, consider creating an excel sheet with all the serial numbers and the positions of the drives, so that you don't have to check everything by hand again next time something fails 😅
@Jims-Garage (29 days ago)
A good tip - I've considered writing on the inside of the case.
@insu_na (29 days ago)
@@Jims-Garage Never hurts :) At work I have all serial numbers and WWNs in an Excel sheet laid out like the sled locations in the server, and I also have the serial numbers on a label on the drive caddies. Given that your server has no drive caddies, I agree it makes sense to also have it labelled there. It doesn't help much if you have it in an Excel sheet but have to memorise "x drives from the left!" while you walk from the room with Excel to the room with your server 😅 It's IMO still good to know at a glance where the issue lies, and it never hurts to have redundancy in labelling etc. :D
@Jims-Garage (29 days ago)
@insu_na great idea, thanks for sharing
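A quick way to generate that serial-to-slot sheet (a sketch; the column choice is up to you) is to dump the model and serial of every disk from the TrueNAS or Proxmox shell and paste the output into the spreadsheet:
lsblk -d -o NAME,MODEL,SERIAL,SIZE    # one row per physical disk, ready to copy into a sheet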
@StaceySchroederCanada (29 days ago)
Yup, been there a few times. I had a bad batch of defective Seagate HDDs... lost 5 in a 6-month period... Then the same with Western Digital - a bad batch; I only lost 2 drives, but I got freaked out and swapped out the rest... Both companies had recalls and I wasn't smart enough to check... Now I check before I buy 😂
@Jims-Garage (29 days ago)
Wow, that must have been an awful experience. Good tip on checking for recalls!
@Lunolux (29 days ago)
Nice video. That makes me a little scared for my RAID5, with only one HDD allowed to fail. Not a big deal since I have a backup on an external HDD.
@Jims-Garage (29 days ago)
@@Lunolux as long as you have another backup you're probably ok. Just plan ahead and think about how you will actually recover (and test it!)
@dimasshidqiparikesit1338 (29 days ago)
Is 1 day resilvering normal for a pool of this size?
@Jims-Garage (29 days ago)
It's typically how much of the drive is populated divided by the write speed (e.g., 15TB / ~200MB/s). That's obviously very crude, but it's pretty much it.
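Working that example through (the 15TB and ~200MB/s figures are from the reply above; the rest is just arithmetic): 15 TB is roughly 15,000,000 MB, and 15,000,000 MB / 200 MB/s = 75,000 s, which is about 21 hours - so a resilver on the order of a day is entirely normal for a well-filled drive of that size.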
@IgnoreMyChan (1 month ago)
5:47 Yuk. Begging. 😞
@counttoast2647 (22 days ago)
Does anyone know how to identify a drive in a server that uses caddies and that cannot be shut down (enterprise gear, i.e. a Supermicro server)?
@BenjaminBenStein (1 month ago)
🎉
@fataugie (29 days ago)
I just did this two days ago with a Micron SSD
@Jims-Garage (29 days ago)
Nice. I imagine SSD resilvering is a lot faster
@markkoops2611 (28 days ago)
SpinRite is the best chance of fixing the drive.
@SharkBait_ZA (29 days ago)
It should be mentioned that the data is not accessible during the resilvering process. Well, that's my experience anyway.
@Jims-Garage (29 days ago)
Interesting. I'm pretty certain it was available for me.
@SharkBait_ZA (29 days ago)
So I have 2 in production, and the one that is raidz1 definitely wasn't accessible during the resilver. The other is raidz2 with NVMe drives; I can't remember on that one, and the resilvering was quick…
@Jims-Garage (29 days ago)
@SharkBait_ZA yeah, I'd love to see how quick that finishes!
@SharkBait_ZA (29 days ago)
@@Jims-Garage 12x 2TB Samsung NVMe, and I think there was like 2-3TB of data on it… roughly 30 minutes.
@Jims-Garage (29 days ago)
@SharkBait_ZA amazing
@133col (29 days ago)
I'm poor, I have RAID5
@Jims-Garage (29 days ago)
Perhaps look at some free cloud storage for the really critical data as another backup.
@133col (29 days ago)
@Jims-Garage thx but nah, my data is my data. The raid has a backup.
@Jims-Garage (29 days ago)
@133col fair play. I use rclone so that my backups are encrypted in the cloud
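For anyone curious what that looks like in practice, here is a minimal sketch of an rclone encrypted backup run; the remote name "cloud-crypt" stands in for a crypt remote you would create beforehand with rclone config, and the source path is purely illustrative.
rclone sync /mnt/tank/essential cloud-crypt:nas-backup --progress   # client-side encrypted copy to the cloud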