zpool unavail cannot open

Techie blog post so I remember this: I did it once before after lots of research, and was frustrated when it happened to me again and I could not find the command – which means there is a dearth of comments about it on the internet. If you find this, may it be exactly what you are looking for… it will be next time I forget.

scenario:
Too many boring details in the history of why – and I have already written too much – odd, considering. The zfs pool I had needed a disk replacement, and here is the story of how I finally got it working.

Sun Thor x4540 Solaris 10.5
23 RAID 1 zfs sets
Greenplum database 3.6.3.1

A disk had too many errors and we needed to replace it – unfortunately they sent me a Hitachi replacement for a Seagate drive (SATA, 500GB, 7200rpm).

Standard replacement procedure:
– assuming failed disk is c3t5d0

# zpool status
– will show all disks in the zpool, including the one that failed

# hd
– will show all disks and which physical slot c3t5d0 is in on the x4540
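
(hd is Sun's disk-map utility from the SUNWhd package; if it is not in your PATH, it typically lives under /opt/SUNWhd/hd/bin – that path is from memory, so treat it as an assumption:)

# /opt/SUNWhd/hd/bin/hd
– prints the chassis disk map so you can find the physical slot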

# cfgadm -alv | grep c3t5d0
– will show the attachment point and state for the device

c3::dsk/c3t5d0                 connected    configured   unknown    ATA HITACHI HUA7250S
unavailable  disk         n        /devices/pci@0,0/pci10de,376@f/pci1000,1000@0:scsi::dsk/c3t5d0

# zpool offline <pool> c3t5d0
– takes the disk offline in the zfs raid set – errors if there is no redundancy… I like zfs.
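
(If the set had no redundancy left, zfs would refuse instead – the error wording below is from memory:)

# zpool offline <pool> c3t5d0
cannot offline c3t5d0: no valid replicas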

# cfgadm -c unconfigure   c3::dsk/c3t5d0
– unconfigures the device from the sun hardware

Remove and replace the device

# cfgadm -alv | grep c3t5d0
– check that the new drive shows up…

# cfgadm -c configure   c3::dsk/c3t5d0
– configures the device back into the sun hardware
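
(Worth a quick check that the new drive really came back before touching the pool – the output below is roughly what cfgadm -al prints for a healthy SATA disk, column spacing illustrative:)

# cfgadm -al c3::dsk/c3t5d0
Ap_Id                          Type         Receptacle   Occupant     Condition
c3::dsk/c3t5d0                 disk         connected    configured   unknown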

# zpool clear <poolname>  [c3t5d0]
– this clears the errors and brings the drive back into the pool, and it should start resilvering – on my machine about 80 minutes

***   except that it would not work…  I am pretty sure because the wwn (world wide name) of the disk changed during the swap (it shows up in dmesg)

and you get the output below from a # zpool status (the second RAID 1 set shows what a normal set looks like):

          mirror      DEGRADED     0     0     0
            c2t5d0    ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              c3t5d0  UNAVAIL      0     0     0  cannot open
              c6t1d0  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c2t6d0    ONLINE       0     0     0
            c3t6d0    ONLINE       0     0     0
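
(For the record, the disk attach is what you dig out of dmesg to confirm the identity change – the lines look roughly like the ones below, but the instance number, controller, target, and elided message IDs are illustrative, not from my box:)

# dmesg | grep sd40
... scsi: [ID ... kern.info] sd40 at mpt2: target 5 lun 0
... genunix: [ID ... kern.info] sd40 is /pci@0,0/pci10de,376@f/pci1000,1000@0/sd@5,0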

After much online searching I found it, then forgot, and the second time it happened I was frustrated trying to find the solution…   so blog it…  some public notes for you too.

In short, the disk needs to replace itself (same device name for old and new) so the pool accepts the new wwn.

# zpool replace <poolname>  c3t5d0 c3t5d0

Yeah, that simple…    then it starts resilvering.
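
You can watch it in zpool status while it runs – the scrub line reads something like this (numbers illustrative):

# zpool status <poolname>
 scrub: resilver in progress for 0h12m, 15.31% done, 1h6m to go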

But wait…   another interesting issue: if the drive shows as “removed” you may need to manually “remove” it again…   annoying.  Here are the commands:

List out the devices and see what the drive came back as:

  • # cfgadm -alv

In my case it was:

c6::dsk/c6t6d0                 connected    configured   unknown    ATA SEAGATE ST35002N
unavailable  disk         n        /devices/pci@3c,0/pci10de,376@f/pci1000,1000@0:scsi::dsk/c6t6d0

c6::sd40                       connected    configured   unknown    ATA SEAGATE ST35002N
unavailable  disk         n        /devices/pci@3c,0/pci10de,376@f/pci1000,1000@0:scsi::sd40

Then unconfigure it and let it “reinsert”.

  • # cfgadm -c unconfigure c6::sd40
  • # cfgadm -x remove_device c6::sd40

Removing SCSI device: /devices/pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@7,0

This operation will suspend activity on SCSI bus: c6

Continue (yes/no)? yes

SCSI bus quiesced successfully.

It is now safe to proceed with hotplug operation.

Enter y if operation is complete or n to abort (yes/no)? yes


Don’t forget to check:

#  fmadm faulty -a

(The -a is important since vanilla fmadm faulty hides problems that are already fixed, while the service light stays on until the fault is cleared)

to see if you need to clear the fault.
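
If a fault is still listed after the resilver, fmadm repair with the UUID from the faulty output clears it (the UUID below is made up for illustration):

# fmadm repair 7b1e6649-62f1-44a1-a893-0123456789ab
fmadm: recorded repair to 7b1e6649-62f1-44a1-a893-0123456789ab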

All this because there was bad firmware on the 48 drives in each of the three x4540s we use.   Oh yeah… and I did all this 12x on all the drives that had errors… just in case…

Off to Vegas.