* raid failure and LVM volume group availability
From: Tim Connors @ 2009-05-21  3:07 UTC
  To: linux-raid, dm-devel

I had a raid device (with LVM on top of it) that failed when its disks
were disconnected during a long power failure that outlasted the UPS
(the computer, being a laptop, had its own built-in UPS).

While I could just reboot the computer, I don't particularly want to
reboot it just yet.  Unfortunately, when a raid device fails like that,
the volume group half disappears in a stream of I/O errors.  You can't
stop the raid device because something (LVM) is still accessing it, but
you can't make LVM stop accessing it by deactivating the volume group,
because the volume group is suffering from I/O errors:

> mdadm -S /dev/md0
mdadm: fail to stop array /dev/md0: Device or resource busy
Perhaps a running process, mounted filesystem or active volume group?

> vgchange -an
  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  Can't deactivate volume group "500_lacie" with 2 open logical volume(s)
  Can't deactivate volume group "laptop_250gb" with 3 open logical volume(s)

> vgchange -an rotating_backup
  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/md0: read failed after 0 of 4096 at 1000204664832: Input/output error
  /dev/md0: read failed after 0 of 4096 at 1000204722176: Input/output error
  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
  /dev/md0: read failed after 0 of 4096 at 4096: Input/output error
  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 644245028864: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 644245086208: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 4096: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  Volume group "rotating_backup" not found

The lvm device file still exists,

> ls -lA /dev/rotating_backup /dev/mapper/rotating_backup-rotating_backup
brw-rw---- 1 root disk 254, 5 May 10 09:22 /dev/mapper/rotating_backup-rotating_backup

/dev/rotating_backup:
total 0
lrwxrwxrwx 1 root root 43 May 10 09:22 rotating_backup -> /dev/mapper/rotating_backup-rotating_backup

however lvdisplay, vgdisplay and pvdisplay can't access it:
> vgdisplay
  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  --- Volume group ---
  VG Name               500_lacie
...

but the raid device files don't exist (the drive I plugged back in later
was given a new device name, /dev/sda1) and obviously raid is not very
happy anymore:

> cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[0] sdb1[2](F)
      976762432 blocks [2/1] [U_]
      bitmap: 147/233 pages [588KB], 2048KB chunk
> ls -lA /dev/sdc1 /dev/sdb1 /dev/md0
ls: cannot access /dev/sdc1: No such file or directory
ls: cannot access /dev/sdb1: No such file or directory
brw-rw---- 1 root disk 9, 0 May 10 09:22 /dev/md0


Does anyone know a way out of this, sans rebooting?
I suspect I couldn't just add /dev/sda1 back into the array, because I'm
sure LVM would still complain about I/O errors even if raid let me (I
suspect raid itself would also refuse to add the disk back, because the
array is still trying to be active but has no live disks, so it would be
completely inconsistent).

Is it possible to force both lvm and md to give up on the device so I
can re-add everything without rebooting?  The devices shouldn't be any
more corrupt than you'd expect from an unclean shutdown, since there has
been no I/O to them yet, so I should just be able to re-add them, mount
and resync.
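
A sketch of how to check what is still holding the devices open
(assuming the usual dmsetup and fuser tools; device names as above):

  # open/reference counts for each device-mapper device
  dmsetup info -c
  # which device-mapper devices sit on top of md0
  dmsetup ls --tree
  # processes using the (possibly mounted) logical volume
  fuser -vm /dev/mapper/rotating_backup-rotating_backup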

-- 
TimC
"This company performed an illegal operation but they will not be shut
down."     -- Scott Harshbarger from consumer lobby group on Microsoft


* Re: raid failure and LVM volume group availability
From: NeilBrown @ 2009-05-21  3:55 UTC
  To: Tim Connors; +Cc: linux-raid, dm-devel

On Thu, May 21, 2009 1:07 pm, Tim Connors wrote:
> I had a raid device (with LVM ontop of it) that failed through the disks
> being disconnected in a long power failure that outlasted the UPS (the
> computer, being a laptop, had its own builtin UPS).
>
> While I could just reboot the computer, I don't particularly want to
> reboot it just yet.  Unfortunately, failing a raid device like that means
> that the volume group half disappears in a stream of I/O errors, but you
> can't stop the raid device because it still has something accessing it
> (LVM), but you can't make LVM stop accessing it by making the volume group
> unavailable because it is suffering from I/O errors:
>
>> mdadm -S /dev/md0
> mdadm: fail to stop array /dev/md0: Device or resource busy
> Perhaps a running process, mounted filesystem or active volume group?
>
>> vgchange -an
>   /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>   /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>   Can't deactivate volume group "500_lacie" with 2 open logical volume(s)
>   Can't deactivate volume group "laptop_250gb" with 3 open logical
> volume(s)
>
>> vgchange -an rotating_backup
>   /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>   /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>   /dev/md0: read failed after 0 of 4096 at 1000204664832: Input/output
> error
>   /dev/md0: read failed after 0 of 4096 at 1000204722176: Input/output
> error
>   /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>   /dev/md0: read failed after 0 of 4096 at 4096: Input/output error
>   /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>   /dev/dm-5: read failed after 0 of 4096 at 644245028864: Input/output
> error
>   /dev/dm-5: read failed after 0 of 4096 at 644245086208: Input/output
> error
>   /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>   /dev/dm-5: read failed after 0 of 4096 at 4096: Input/output error
>   /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>   Volume group "rotating_backup" not found
>
> The lvm device file still exists,
>
>> ls -lA /dev/rotating_backup /dev/mapper/rotating_backup-rotating_backup
> brw-rw---- 1 root disk 254, 5 May 10 09:22
> /dev/mapper/rotating_backup-rotating_backup
>
> /dev/rotating_backup:
> total 0
> lrwxrwxrwx 1 root root 43 May 10 09:22 rotating_backup ->
> /dev/mapper/rotating_backup-rotating_backup
>
> however lvdisplay, vgdisplay and pvdisplay can't access it:
>> vgdisplay
>   /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>   /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>   --- Volume group ---
>   VG Name               500_lacie
> ...
>
> but the raid device files don't exist (the drive I plugged back in later
> was given a new device name, /dev/sda1) and obviously raid is not very
> happy anymore:
>
>> cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdc1[0] sdb1[2](F)
>       976762432 blocks [2/1] [U_]
>       bitmap: 147/233 pages [588KB], 2048KB chunk
>> ls -lA /dev/sdc1 /dev/sdb1 /dev/md0
> ls: cannot access /dev/sdc1: No such file or directory
> ls: cannot access /dev/sdb1: No such file or directory
> brw-rw---- 1 root disk 9, 0 May 10 09:22 /dev/md0
>
>
> Does anyone know a way out of this, sans rebooting?
> I don't suspect I could just add /dev/sda1 back into the array because I'm
> sure LVM would still complain about IO errors even if raid would let me (I
> suspect raid itself will also fail to add the disk back because it is
> still trying to be active but has no live disks so would be completely
> inconsistent).
>
> Is it possible to force both lvm and md to give up on the device so I can
> readd them without rebooting (since they're not going to be anymore
> corrupt yet than you'd expect from an unclean shutdown, because there's
> been no IO to them yet, so I should just be able to readd them, mount and
> resync)?

For the md side, you can just assemble the drives into an array with
a different name.
e.g.
  mdadm -A /dev/md1 /dev/sda1 /dev/sd....

using whatever new names were given to the devices when you plugged them
back in.
Maybe you can do a similar thing with the LVM side, but I know nothing
about that.
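
For illustration, a minimal sketch of that approach; the device names
below (/dev/sda1, /dev/md1) are only examples, and the LVM half is
speculative:

  # assemble the surviving member under a new array name
  mdadm -A --run /dev/md1 /dev/sda1
  # let LVM rediscover the physical volume, now on the new array
  pvscan
  # may still require tearing down the stale md0 mappings first
  vgchange -ay rotating_backup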

NeilBrown



* Re: raid failure and LVM volume group availability
From: Goswin von Brederlow @ 2009-05-25 12:15 UTC
  To: NeilBrown; +Cc: Tim Connors, linux-raid, dm-devel

"NeilBrown" <neilb@suse.de> writes:

> On Thu, May 21, 2009 1:07 pm, Tim Connors wrote:
>> Is it possible to force both lvm and md to give up on the device so I can
>> readd them without rebooting (since they're not going to be anymore
>> corrupt yet than you'd expect from an unclean shutdown, because there's
>> been no IO to them yet, so I should just be able to readd them, mount and
>> resync)?
>
> For the md side, you can just assemble the drives into an array with
> a different name.
> e.g.
>   mdadm -A /dev/md1 /dev/sda1 /dev/sd....
>
> using whatever new names were given to the devices when you plugged them
> back in.
> Maybe you can do a similar thing with the LVM side, but I know nothing
> about that.
>
> NeilBrown

On the device mapper side (the layer below LVM) you can use dmsetup to
suspend, alter and resume the device mapper tables of your
devices. That should make the actual filesystems happy again.

Not sure what LVM tools will make of that though.
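
A sketch of what that might look like for the volume in question; the
table values and the 9:1 major:minor (the re-assembled array) are
hypothetical:

  # current table, still pointing at the dead md0 (major:minor 9:0)
  dmsetup table rotating_backup-rotating_backup
    0 1953524736 linear 9:0 384
  # suspend, load a table pointing at the re-assembled array, resume
  dmsetup suspend rotating_backup-rotating_backup
  dmsetup load rotating_backup-rotating_backup --table '0 1953524736 linear 9:1 384'
  dmsetup resume rotating_backup-rotating_backup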

MfG
        Goswin


* Re: raid failure and LVM volume group availability
From: hank peng @ 2009-05-25 16:09 UTC
  To: Tim Connors; +Cc: linux-raid, dm-devel

2009/5/21 Tim Connors <tconnors@rather.puzzling.org>:
> I had a raid device (with LVM ontop of it) that failed through the disks
> being disconnected in a long power failure that outlasted the UPS (the
> computer, being a laptop, had its own builtin UPS).
>
> While I could just reboot the computer, I don't particularly want to
> reboot it just yet.  Unfortunately, failing a raid device like that means
> that the volume group half disappears in a stream of I/O errors, but you
> can't stop the raid device because it still has something accessing it
> (LVM), but you can't make LVM stop accessing it by making the volume group
> unavailable because it is suffering from I/O errors:
>
>> mdadm -S /dev/md0
> mdadm: fail to stop array /dev/md0: Device or resource busy
> Perhaps a running process, mounted filesystem or active volume group?
>
>> vgchange -an
>  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>  Can't deactivate volume group "500_lacie" with 2 open logical volume(s)
>  Can't deactivate volume group "laptop_250gb" with 3 open logical volume(s)
>
>> vgchange -an rotating_backup
>  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>  /dev/md0: read failed after 0 of 4096 at 1000204664832: Input/output error
>  /dev/md0: read failed after 0 of 4096 at 1000204722176: Input/output error
>  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>  /dev/md0: read failed after 0 of 4096 at 4096: Input/output error
>  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>  /dev/dm-5: read failed after 0 of 4096 at 644245028864: Input/output error
>  /dev/dm-5: read failed after 0 of 4096 at 644245086208: Input/output error
>  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>  /dev/dm-5: read failed after 0 of 4096 at 4096: Input/output error
>  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>  Volume group "rotating_backup" not found
>
> The lvm device file still exists,
>
>> ls -lA /dev/rotating_backup /dev/mapper/rotating_backup-rotating_backup
> brw-rw---- 1 root disk 254, 5 May 10 09:22 /dev/mapper/rotating_backup-rotating_backup
>
> /dev/rotating_backup:
> total 0
> lrwxrwxrwx 1 root root 43 May 10 09:22 rotating_backup -> /dev/mapper/rotating_backup-rotating_backup
>
> however lvdisplay, vgdisplay and pvdisplay can't access it:
>> vgdisplay
>  /dev/md0: read failed after 0 of 4096 at 0: Input/output error
>  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
>  --- Volume group ---
>  VG Name               500_lacie
> ...
>
> but the raid device files don't exist (the drive I plugged back in later
> was given a new device name, /dev/sda1) and obviously raid is not very
> happy anymore:
>
>> cat /proc/mdstat
> Personalities : [raid1]
> md0 : active raid1 sdc1[0] sdb1[2](F)
>      976762432 blocks [2/1] [U_]
>      bitmap: 147/233 pages [588KB], 2048KB chunk
>> ls -lA /dev/sdc1 /dev/sdb1 /dev/md0
> ls: cannot access /dev/sdc1: No such file or directory
> ls: cannot access /dev/sdb1: No such file or directory
> brw-rw---- 1 root disk 9, 0 May 10 09:22 /dev/md0
>
>
> Does anyone know a way out of this, sans rebooting?
> I don't suspect I could just add /dev/sda1 back into the array because I'm
> sure LVM would still complain about IO errors even if raid would let me (I
> suspect raid itself will also fail to add the disk back because it is
> still trying to be active but has no live disks so would be completely
> inconsistent).
>
> Is it possible to force both lvm and md to give up on the device so I can
> readd them without rebooting (since they're not going to be anymore
> corrupt yet than you'd expect from an unclean shutdown, because there's
> been no IO to them yet, so I should just be able to readd them, mount and
> resync)?
>
Only one of the disks in this RAID1 failed, so it should continue to
work in a degraded state.
Why did LVM complain about I/O errors?
> --
> TimC
> "This company performed an illegal operation but they will not be shut
> down."     -- Scott Harshbarger from consumer lobby group on Microsoft



-- 
The simplest is not all best but the best is surely the simplest!


* Re: raid failure and LVM volume group availability
From: Goswin von Brederlow @ 2009-05-26 11:05 UTC
  To: hank peng; +Cc: Tim Connors, linux-raid, dm-devel

hank peng <pengxihan@gmail.com> writes:

> Only one of disks in this RAID1failed, it should continue to work with
> degraded state.
> Why LVM complained with I/O errors??

That is because the last drive in a raid1 cannot fail:

md9 : active raid1 ram1[1] ram0[2](F)
      65472 blocks [2/1] [_U]

# mdadm --fail /dev/md9 /dev/ram1
mdadm: set /dev/ram1 faulty in /dev/md9

md9 : active raid1 ram1[1] ram0[2](F)
      65472 blocks [2/1] [_U]

See, still marked working.
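
Ways to confirm that from userspace (a sketch; md9 is the test array
above, and the sysfs attributes assume a reasonably recent kernel):

  # md still reports the array as active/clean, just degraded
  mdadm --detail /dev/md9 | grep -i state
  cat /sys/block/md9/md/array_state
  # non-zero while members are missing or faulty
  cat /sys/block/md9/md/degraded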

MfG
        Goswin

PS: Why doesn't mdadm or the kernel give a message about not failing?


* Re: raid failure and LVM volume group availability
From: Neil Brown @ 2009-05-26 11:56 UTC
  To: Goswin von Brederlow; +Cc: hank peng, Tim Connors, linux-raid, dm-devel

On Tuesday May 26, goswin-v-b@web.de wrote:
> hank peng <pengxihan@gmail.com> writes:
> 
> > Only one of disks in this RAID1failed, it should continue to work with
> > degraded state.
> > Why LVM complained with I/O errors??
> 
> That is because the last drive in a raid1 can not fail:
> 
> md9 : active raid1 ram1[1] ram0[2](F)
>       65472 blocks [2/1] [_U]
> 
> # mdadm --fail /dev/md9 /dev/ram1
> mdadm: set /dev/ram1 faulty in /dev/md9
> 
> md9 : active raid1 ram1[1] ram0[2](F)
>       65472 blocks [2/1] [_U]
> 
> See, still marked working.
> 
> MfG
>         Goswin
> 
> PS: Why doesn't mdadm or kernel give a message about not failing?

-ENOPATCH :-)

You would want to rate limit any such message from the kernel, but it
might make sense to have it.

NeilBrown


* Re: raid failure and LVM volume group availability
From: Goswin von Brederlow @ 2009-05-28 18:48 UTC
  To: Neil Brown
  Cc: Goswin von Brederlow, hank peng, Tim Connors, linux-raid,
	dm-devel

Neil Brown <neilb@suse.de> writes:

> On Tuesday May 26, goswin-v-b@web.de wrote:
>> hank peng <pengxihan@gmail.com> writes:
>> 
>> > Only one of disks in this RAID1failed, it should continue to work with
>> > degraded state.
>> > Why LVM complained with I/O errors??
>> 
>> That is because the last drive in a raid1 can not fail:
>> 
>> md9 : active raid1 ram1[1] ram0[2](F)
>>       65472 blocks [2/1] [_U]
>> 
>> # mdadm --fail /dev/md9 /dev/ram1
>> mdadm: set /dev/ram1 faulty in /dev/md9
>> 
>> md9 : active raid1 ram1[1] ram0[2](F)
>>       65472 blocks [2/1] [_U]
>> 
>> See, still marked working.
>> 
>> MfG
>>         Goswin
>> 
>> PS: Why doesn't mdadm or kernel give a message about not failing?
>
> -ENOPATCH :-)
>
> You would want to rate limit any such message from the kernel, but it
> might make sense to have it.
>
> NeilBrown

There is no rate-limiting concern with mdadm --fail itself reporting
that it failed to fail the device.
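
Until it does, a rough way to notice the refusal from a script (a
sketch reusing the md9 example; mdadm's own output and exit status may
not reflect the refusal):

  mdadm --fail /dev/md9 /dev/ram1
  # if ram1 is still listed without (F), the fail request was refused
  grep -A 2 '^md9' /proc/mdstat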

MfG
        Goswin

