* Question: how to identify failing disk in a RAID1
@ 2008-04-13 19:14 Maurice Hilarius
2008-04-13 19:29 ` Justin Piszcz
0 siblings, 1 reply; 16+ messages in thread
From: Maurice Hilarius @ 2008-04-13 19:14 UTC
To: linux-raid
Hi there.
Recently I have frequently been seeing a damaged filesystem on a RAID1
on boot.
A lengthy fsck does get it working, but I am seeing files disappear
as a result.
I am pretty sure that one of the drives has developed some issues and
needs to be replaced.
How does one identify which of the 2 disks is the one that is failing?
The system has 2 identical disks, and / is on md0
fstab:
/dev/md0 / ext3 defaults 1 1
LABEL=/boot1 /boot ext2 defaults 1 2
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
LABEL=/boot11 /boot1 ext2 defaults 1 2
LABEL=SWAP-sdb3 swap swap defaults 0 0
LABEL=SWAP-sda2 swap swap defaults 0 0
fdisk -l shows me:
Disk /dev/sda: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14         535     4192965   82  Linux swap / Solaris
/dev/sda3             536       48641   386411445   fd  Linux raid autodetect
Disk /dev/sdb: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          13      104391   83  Linux
/dev/sdb2              14       48118   386403412+  fd  Linux raid autodetect
/dev/sdb3           48119       48640     4192965   82  Linux swap / Solaris
Disk /dev/md0: 395.6 GB, 395677007872 bytes
2 heads, 4 sectors/track, 96600832 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Anyone have a suggestion, please?
Responses off list are probably most appropriate.
Thanks for any help.
--
Regards, Maurice
mhilarius@gmail.com
* Re: Question: how to identify failing disk in a RAID1
2008-04-13 19:14 Question: how to identify failing disk in a RAID1 Maurice Hilarius
@ 2008-04-13 19:29 ` Justin Piszcz
2008-04-14 1:14 ` Bill Davidsen
0 siblings, 1 reply; 16+ messages in thread
From: Justin Piszcz @ 2008-04-13 19:29 UTC
To: Maurice Hilarius; +Cc: linux-raid
On Sun, 13 Apr 2008, Maurice Hilarius wrote:
> Hi there.
>
> Recently I have been frequently seeing a damaged filesystem on a RAID1 on
> boot.
> a lengthy fsck does get it working, but I am seeing files disappearing as a
> result.
>
> I am pretty sure that one of the drives has developed some issues and needs
> to be replaced.
>
> How does one identify which of the 2 disks is the one that is failing?
>
> The system has 2 identical disks, and / is on md0
>
> fstab:
> /dev/md0 / ext3 defaults 1 1
> LABEL=/boot1 /boot ext2 defaults 1 2
> tmpfs /dev/shm tmpfs defaults 0 0
> devpts /dev/pts devpts gid=5,mode=620 0 0
> sysfs /sys sysfs defaults 0 0
> proc /proc proc defaults 0 0
> LABEL=/boot11 /boot1 ext2 defaults 1 2
> LABEL=SWAP-sdb3 swap swap defaults 0 0
> LABEL=SWAP-sda2 swap swap defaults 0 0
>
> fdisk -l shows me:
> Disk /dev/sda: 400.0 GB, 400088457216 bytes
> 255 heads, 63 sectors/track, 48641 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Device Boot Start End Blocks Id System
> /dev/sda1 * 1 13 104391 83 Linux
> /dev/sda2 14 535 4192965 82 Linux swap / Solaris
> /dev/sda3 536 48641 386411445 fd Linux raid autodetect
>
> Disk /dev/sdb: 400.0 GB, 400088457216 bytes
> 255 heads, 63 sectors/track, 48641 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
> Device Boot Start End Blocks Id System
> /dev/sdb1 * 1 13 104391 83 Linux
> /dev/sdb2 14 48118 386403412+ fd Linux raid autodetect
> /dev/sdb3 48119 48640 4192965 82 Linux swap / Solaris
>
> Disk /dev/md0: 395.6 GB, 395677007872 bytes
> 2 heads, 4 sectors/track, 96600832 cylinders
> Units = cylinders of 8 * 512 = 4096 bytes
>
> Anyone have a suggestion, please?
> Responses off list are probably most appropriate.
>
> Thanks for any help.
>
> --
> Regards, Maurice
> mhilarius@gmail.com
>
>
>
>
smartctl -a /dev/sda
smartctl -a /dev/sdb
also, how come swap was not on the raid1?
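A minimal sketch of how the two smartctl reports could be narrowed down to the attributes that most often single out a failing disk. It assumes the usual ten-column attribute table printed by `smartctl -A` (attribute name in column 2, raw value in column 10); `smart_suspects` is a hypothetical helper name, not part of smartmontools.

```shell
# Print NAME=RAW_VALUE for the sector-health attributes.
# Reads `smartctl -A` output on stdin.
smart_suspects() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2 "=" $10 }'
}

# Typical use (as root, with smartmontools installed):
#   for d in /dev/sda /dev/sdb; do
#       echo "== $d"; smartctl -A "$d" | smart_suspects
#   done
```

A disk whose raw counts are non-zero and growing between runs is the more likely suspect; identical zero counts on both disks point away from the drives.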
* Re: Question: how to identify failing disk in a RAID1
2008-04-13 19:29 ` Justin Piszcz
@ 2008-04-14 1:14 ` Bill Davidsen
[not found] ` <4802CDA2.605@harddata.com>
[not found] ` <480F7105.9030405@harddata.com>
0 siblings, 2 replies; 16+ messages in thread
From: Bill Davidsen @ 2008-04-14 1:14 UTC
To: Justin Piszcz; +Cc: Maurice Hilarius, linux-raid
Justin Piszcz wrote:
>
>
> On Sun, 13 Apr 2008, Maurice Hilarius wrote:
>
>> Hi there.
>>
>> Recently I have been frequently seeing a damaged filesystem on a
>> RAID1 on boot.
>> a lengthy fsck does get it working, but I am seeing files
>> disappearing as a result.
>>
>> I am pretty sure that one of the drives has developed some issues and
>> needs to be replaced.
>>
>> How does one identify which of the 2 disks is the one that is failing?
>>
>> The system has 2 identical disks, and / is on md0
>>
>> fstab:
>> /dev/md0 / ext3
>> defaults 1 1
>> LABEL=/boot1 /boot ext2
>> defaults 1 2
>> tmpfs /dev/shm tmpfs
>> defaults 0 0
>> devpts /dev/pts devpts
>> gid=5,mode=620 0 0
>> sysfs /sys sysfs
>> defaults 0 0
>> proc /proc proc
>> defaults 0 0
>> LABEL=/boot11 /boot1 ext2
>> defaults 1 2
>> LABEL=SWAP-sdb3 swap swap
>> defaults 0 0
>> LABEL=SWAP-sda2 swap swap
>> defaults 0 0
>>
>> fdisk -l shows me:
>> Disk /dev/sda: 400.0 GB, 400088457216 bytes
>> 255 heads, 63 sectors/track, 48641 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>
>> Device Boot Start End Blocks Id System
>> /dev/sda1 * 1 13 104391 83 Linux
>> /dev/sda2 14 535 4192965 82 Linux swap /
>> Solaris
>> /dev/sda3 536 48641 386411445 fd Linux raid
>> autodetect
>>
>> Disk /dev/sdb: 400.0 GB, 400088457216 bytes
>> 255 heads, 63 sectors/track, 48641 cylinders
>> Units = cylinders of 16065 * 512 = 8225280 bytes
>>
>> Device Boot Start End Blocks Id System
>> /dev/sdb1 * 1 13 104391 83 Linux
>> /dev/sdb2 14 48118 386403412+ fd Linux raid
>> autodetect
>> /dev/sdb3 48119 48640 4192965 82 Linux swap /
>> Solaris
>>
>> Disk /dev/md0: 395.6 GB, 395677007872 bytes
>> 2 heads, 4 sectors/track, 96600832 cylinders
>> Units = cylinders of 8 * 512 = 4096 bytes
>>
>> Anyone have a suggestion, please?
>> Responses off list are probably most appropriate.
>>
>> Thanks for any help.
>>
>> --
>> Regards, Maurice
>> mhilarius@gmail.com
>>
>
> smartctl -a /dev/sda
> smartctl -a /dev/sdb
>
> also, how come swap was not on the raid1?
It is very unexpected that the data would be bad without any hardware errors.
Did you look at your logs to see if one of your drives, or perhaps
both, are getting hardware errors? I would run a 'check' and see
what mdadm finds on the array; you may have other problems.
Actually, I think I would run memtest86 for at least a few hours,
starting from a really cold system (not just a cold boot, off for a few
hours). The "on boot" symptom you describe may come from memory or
another component that needs to physically get up to temperature before
working reliably, particularly if you don't get additional errors after
you have been up for a while.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Question: how to identify failing disk in a RAID1
[not found] ` <4802CDA2.605@harddata.com>
@ 2008-04-14 16:38 ` Bill Davidsen
[not found] ` <4804CD4F.7080303@harddata.com>
0 siblings, 1 reply; 16+ messages in thread
From: Bill Davidsen @ 2008-04-14 16:38 UTC
To: Maurice Hilarius; +Cc: Linux RAID
Maurice Hilarius wrote:
> Bill Davidsen wrote:
>> ..
>>>> I am pretty sure that one of the drives has developed some issues
>>>> and needs to be replaced.
>>>> ..
>>
>> Very unexpected that the data would be bad without any hardware errors.
> I DID say:
> "I am pretty sure that one of the drives has developed some issues and
> needs to be replaced. "
>> Did you look at your logs to see if one of your drives, or perhasps
>> both, are getting hardware errors?
> Oh, I KNOW one does..
> The question is WHICH one?
>
I no longer have any old logs showing errors, but /var/log/messages
and/or dmesg should have an error message with a drive identification if
you are getting disk errors.
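As a rough sketch of that log check: the block layer normally reports failed requests with the text "I/O error, dev sdX, sector N", so counting those lines per device gives a quick answer to "which one". The helper name `io_errors_by_dev` is hypothetical, and the sketch assumes that message format; libata messages that name an ATA port (ata1.00 and so on) rather than a device will not be matched.

```shell
# Count kernel "I/O error, dev <name>," lines per device.
# Reads log text on stdin and prints "<device> <count>" per line.
io_errors_by_dev() {
    sed -n 's/.*I\/O error, dev \([a-z0-9]*\),.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Typical use:
#   dmesg | io_errors_by_dev
#   io_errors_by_dev < /var/log/messages
```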
>> I would run a 'check' and and see what mdadm finds on the array, you
>> may have other problems.
>>
> Pardon my stupidity, care to share some syntax for that?
cd /sys/block/md0/md
echo check >sync_action; cat mismatch_cnt
That's the count of errors found. Replace 'check' with 'repair' to make
the errors go away, reboot, run 'check' again.
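A sketch of the same procedure with the waiting made explicit, since mismatch_cnt is only meaningful once the check has run to completion. It assumes sync_action reads "idle" when no check, repair or resync is running; `mismatch_when_idle` is a hypothetical helper name. As far as the md documentation describes it, the value counts sectors rather than distinct errors, and it does not say which member disk holds the bad copy.

```shell
# Wait until the md sync thread is idle, then print mismatch_cnt.
# $1: sysfs directory of the array, e.g. /sys/block/md0/md
# $2: poll interval in seconds (default 10)
mismatch_when_idle() {
    md_dir=$1
    interval=${2:-10}
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "$interval"
    done
    cat "$md_dir/mismatch_cnt"
}

# Typical use, as root:
#   echo check > /sys/block/md0/md/sync_action
#   mismatch_when_idle /sys/block/md0/md
```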
>> Actually, I think I would run memtest86 for at least a few hours,
>> starting from a really cold system (not just a cold boot, off for a
>> few hours).
> Did that already.
>> Your comment "on boot" may come from memory or other component which
>> needs to physically get up to temperature before working reliably.
>> Particularly if you don't get additional errors after you have been
>> up for a while.
>>
> It happens cold or hot.
>
>
> --
> Regards, Maurice
>
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Question: how to identify failing disk in a RAID1
[not found] ` <4804CD4F.7080303@harddata.com>
@ 2008-04-15 18:14 ` Bill Davidsen
[not found] ` <48050DD6.7020404@harddata.com>
0 siblings, 1 reply; 16+ messages in thread
From: Bill Davidsen @ 2008-04-15 18:14 UTC
To: Maurice Hilarius; +Cc: Linux RAID
Maurice Hilarius wrote:
> Bill Davidsen wrote:
>> ..
>> cd /sys/block/md0/md
>> echo check >sync_action; cat mismatch_cnt
>>
> Hi Bill,
>
> I am doing this as root.
> I am seeing:
>
> [root@localhost md]# echo check >sync_action; cat mismatch_cnt
> -bash: echo: write error: Device or resource busy
> 0
>
> any suggestions?
First, did you cd to the /sys/block/mdX/md directory? And did you wait
for the check to finish, watching /proc/mdstat?
I left that out, assumed you had read it in the man pages...
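A "Device or resource busy" on the write is what one would expect if a resync or an earlier check is still in progress. A small sketch for spotting that from /proc/mdstat, assuming the usual progress line of the form "[=>....] check = 5.5% (...)"; `md_busy` is a hypothetical helper name, and older kernels may label a running check as "resync".

```shell
# List md arrays that show resync/recovery/check/reshape activity.
# Reads /proc/mdstat-style text on stdin.
md_busy() {
    awk '/^md[0-9]+ :/ { name = $1 }
         /(resync|recovery|check|reshape) *=/ { print name }'
}

# Typical use:
#   md_busy < /proc/mdstat
```

Empty output means no array is busy, so a new check can be started.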
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Question: how to identify failing disk in a RAID1
[not found] ` <480607F2.3060504@harddata.com>
@ 2008-04-17 13:12 ` Bill Davidsen
[not found] ` <48076096.2020804@harddata.com>
0 siblings, 1 reply; 16+ messages in thread
From: Bill Davidsen @ 2008-04-17 13:12 UTC
To: Maurice Hilarius, Linux RAID
Maurice Hilarius wrote:
> Morning Bill.
>
> BTW< I want to say "Thanks for your help with this" first.
> Just in case I forgot.
>
> So, I ran "check" once. It complained, and failed.
>
Does the failure provide any useful information?
> A few hours later, I ran it again, and it immediately returned "0"
>
I totally don't understand that, assuming that the first check was done.
> I am still puzzled:
> Why it failed the first time
> Why it returned a result in a couple of seconds the second time.
> What it tells me?
> I gather this means the md0 device is healthy.
>
I don't think so, I've never had a check *fail*, I just expect it to
tell me how bad things are.
> So, meanwhile back at the ranch, I still think sda is failing..
I think it's time to be keeping a good backup, and hopefully someone
else has a good thought on running this down more.
> Any thoughts on that?
The only thought I have at the moment is marginal power supply, and
that's just because it can generate all manner of odd behaviors, rather
than any other hints. Sorry.
If you aren't getting errors from SMART or logs, and I don't remember
you sending me that info, I'm not sure how you determine which drive is
the problem.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Question: how to identify failing disk in a RAID1
[not found] ` <48076096.2020804@harddata.com>
@ 2008-04-18 13:17 ` Bill Davidsen
0 siblings, 0 replies; 16+ messages in thread
From: Bill Davidsen @ 2008-04-18 13:17 UTC
To: Maurice Hilarius; +Cc: vger majordomo for lists
Maurice Hilarius wrote:
> Bill Davidsen wrote:
>> Maurice Hilarius wrote:
>>> Morning Bill.
>>>
>>> BTW< I want to say "Thanks for your help with this" first.
>>> Just in case I forgot.
>>>
>>> So, I ran "check" once. It complained, and failed.
>>>
>> Does the failure provide any useful information?
>>
> No.
> Here is what I got the first time:
>
> root@localhost md]# echo check >sync_action; cat mismatch_cnt
> -bash: echo: write error: Device or resource busy
> 0
>
> Later, on my second try, a few hours later, it worked, reporting no error.
> ..
> [maurice@localhost ~]$ su -
> Password:
> [root@localhost ~]# cd /sys/block/md0/md
> [root@localhost md]# cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md0 : active raid1 sda3[0] sdb2[1]
> 386403328 blocks [2/2] [UU]
>
> unused devices: <none>
> [root@localhost md]# echo check >sync_action; cat mismatch_cnt
> 0
>
>>
>> I think it's time to be keeping a good backup, and hopefully someone
>> else has a good thought on running this down more.
>>
> Thanks, updated that backup at the first sign of trouble
>>> Any thoughts on that?
>>
>> The only thought I have at the moment is marginal power supply, and
>> that's just because it can generate all manner of odd behaviors,
>> rather than any other hints. Sorry.
>>
> Yeah. I am going to replace *both* disks, and then run the
> manufacturers utility (Seatest) on them.
>> If you aren't getting errors from SMART or logs, and I don't remember
>> you sending me that info, I'm not sure how you determine which drive
>> is the problem.
> Exactly.
>
> Thanks a LOT for trying, Bill..
Actually, my thought is that you may not actually be getting hardware
errors, which is why they are not being reported by either the kernel or
SMART. That's why I thought of memory and/or power issues, either of
which could cause what you are seeing.
Guess I have to leave it there, maybe someone else will have a thought.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Question: how to identify failing disk in a RAID1
@ 2008-04-18 17:36 David Lethe
0 siblings, 0 replies; 16+ messages in thread
From: David Lethe @ 2008-04-18 17:36 UTC
To: Bill Davidsen, Maurice Hilarius; +Cc: vger majordomo for lists
The symptoms are indicative of a standard bad block reallocation. Depending on make, model, firmware rev and even location of the new defect it could take several seconds for the disk to grab a spare from the reserved area and fix the defect. No reason for concern ... The system worked like it was designed to.
-----Original Message-----
From: "Bill Davidsen" <davidsen@tmr.com>
Subj: Re: Question: how to identify failing disk in a RAID1
Date: Fri Apr 18, 2008 8:15 am
Size: 2K
To: "Maurice Hilarius" <maurice@harddata.com>
cc: "vger majordomo for lists" <linux-raid@vger.kernel.org>
Maurice Hilarius wrote:
> Bill Davidsen wrote:
>> Maurice Hilarius wrote:
>>> Morning Bill.
>>>
>>> BTW< I want to say "Thanks for your help with this" first.
>>> Just in case I forgot.
>>>
>>> So, I ran "check" once. It complained, and failed.
>>>
>> Does the failure provide any useful information?
>>
> No.
> Here is what I got the first time:
>
> root@localhost md]# echo check >sync_action; cat mismatch_cnt
> -bash: echo: write error: Device or resource busy
> 0
>
> Later, on my second try, a few hours later, it worked, reporting no error.
> ..
> [maurice@localhost ~]$ su -
> Password:
> [root@localhost ~]# cd /sys/block/md0/md
> [root@localhost md]# cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md0 : active raid1 sda3[0] sdb2[1]
> 386403328 blocks [2/2] [UU]
>
> unused devices: <none>
> [root@localhost md]# echo check >sync_action; cat mismatch_cnt
> 0
>
>>
>> I think it's time to be keeping a good backup, and hopefully someone
>> else has a good thought on running this down more.
>>
> Thanks, updated that backup at the first sign of trouble
>>> Any thoughts on that?
>>
>> The only thought I have at the moment is marginal power supply, and
>> that's just because it can generate all manner of odd behaviors,
>> rather than any other hints. Sorry.
>>
> Yeah. I am going to replace *both* disks, and then run the
> manufacturers utility (Seatest) on them.
>> If you aren't getting errors from SMART or logs, and I don't remember
>> you sending me that info, I'm not sure how you determine which drive
>> is the problem.
> Exactly.
>
> Thanks a LOT for trying, Bill..
Actually, my thought is that you may not actually be getting hardware
errors, which is why they are not being reported by either the kernel or
SMART. That's why I thought of memory and/or power issues, either of
which could cause what you are seeing.
Guess I have to leave it there, maybe someone else will have a thought.
--
Bill Davidsen <davidsen@tmr.com>
"Woe unto the statesman who makes war without a reason that will still
be valid when the war is over..." Otto von Bismark
* Re: Question: how to identify failing disk in a RAID1
[not found] ` <480F7105.9030405@harddata.com>
@ 2008-04-23 18:54 ` Justin Piszcz
[not found] ` <480F8830.6020207@harddata.com>
0 siblings, 1 reply; 16+ messages in thread
From: Justin Piszcz @ 2008-04-23 18:54 UTC
To: Maurice Hilarius; +Cc: Bill Davidsen, linux-raid
On Wed, 23 Apr 2008, Maurice Hilarius wrote:
> Hi all.
>
> With much appreciated help from Bell Davidsen and Justin Piszcz I recently
> dealt with a problem with a RAID1 set, caused by a failing hard disk.
>
> At the end, there is one question remaining, which I think is quite
> important:
> When one has a RAID5 or RAID6, and a disk starts "acting up" mdadm rapidly
> kicks out the offending device.
> Some might say "too easily" but that is another thread.
>
> On a RAID1 set, until the failing disk completely "packs it in" it remains
> part of the RAID.
>
> Why??
>
> Some more background:
> Since the issue was reported and explored I have recreated this on a test
> machine.
> Installed RAID1 with one known good and one know error prone drive.
> Easy to do as the error drive has a thermal issue.
> Keep it cold, no problems, but after 30 minutes use in a +25C room it start
> to generate data errors.
> I reproduced exactly the problem I saw before:
> Data errors occur, the other drive in the RAID1 set gets "infected" with the
> bad data, and the file system will get corrupted.
> On BOTH drives.
>
> This is highly reproducible.
>
> In summary:
> 1) RAID1 lacks significant protection from the effects of a data error
> condition on a failing drive
> 2) I recommend anyone using madadm refrain from using RAID1 until this issue
> is addressed and resolved.
>
> Thanks again.
I can confirm this: with RAID1, the bad disk gets kicked out only once
you actually REBOOT the host. With RAID5, when I experienced the same
thing, it kicked the disk out right away. We would need to wait for the
linux-raid developers to answer this one.
Justin.
* Re: Question: how to identify failing disk in a RAID1
[not found] ` <480F8830.6020207@harddata.com>
@ 2008-04-23 19:26 ` Justin Piszcz
2008-04-27 17:03 ` Keith Roberts
0 siblings, 1 reply; 16+ messages in thread
From: Justin Piszcz @ 2008-04-23 19:26 UTC
To: Maurice Hilarius; +Cc: Bill Davidsen, linux-raid
On Wed, 23 Apr 2008, Maurice Hilarius wrote:
> Justin Piszcz wrote:
>>
>>
>> On Wed, 23 Apr 2008, Maurice Hilarius wrote:
>>
>>> Hi all.
>>>
>>> With much appreciated help from Bell Davidsen and Justin Piszcz I recently
>>> dealt with a problem with a RAID1 set, caused by a failing hard disk.
>>>
>>> At the end, there is one question remaining, which I think is quite
>>> important:
>>> When one has a RAID5 or RAID6, and a disk starts "acting up" mdadm rapidly
>>> kicks out the offending device.
>>> Some might say "too easily" but that is another thread.
>>>
>>> On a RAID1 set, until the failing disk completely "packs it in" it remains
>>> part of the RAID.
>>>
>>> Why??
>>>
>>> Some more background:
>>> Since the issue was reported and explored I have recreated this on a test
>>> machine.
>>> Installed RAID1 with one known good and one know error prone drive.
>>> Easy to do as the error drive has a thermal issue.
>>> Keep it cold, no problems, but after 30 minutes use in a +25C room it
>>> start to generate data errors.
>>> I reproduced exactly the problem I saw before:
>>> Data errors occur, the other drive in the RAID1 set gets "infected" with
>>> the bad data, and the file system will get corrupted.
>>> On BOTH drives.
>>>
>>> This is highly reproducible.
>>>
>>> In summary:
>>> 1) RAID1 lacks significant protection from the effects of a data error
>>> condition on a failing drive
>>> 2) I recommend anyone using madadm refrain from using RAID1 until this
>>> issue is addressed and resolved.
>>>
>>> Thanks again.
>> I can confirm this, until you actually REBOOT the host with RAID1 only then
>> will it kick it out. Whereas with RAID5, I experienced the same thing, it
>> kicks it out right away, would need to wait for the linux-raid/developers
>> to answer this one.
>>
>> Justin.
>>
> Actually reboot does not help me.
> mdadm seems to NEVER "kick out" the bad disk.
> Even when it is horribly erroring.
>
> I think this is a CRITICAL problem, as, if one is using RAID1 thinking it
> will enhance their data reliability,
> they stand a very good chance of getting a nasty surprise.
Yikes. What kernel+mobo+chipset+drives are in use (the developers will
want to know)? Also, are you using drives on different channels, or,
e.g., two drives on one IDE cable? (To summarize for the developers.)
Justin.
* Re: Question: how to identify failing disk in a RAID1
2008-04-23 19:26 ` Justin Piszcz
@ 2008-04-27 17:03 ` Keith Roberts
2008-04-27 19:28 ` Richard Scobie
2008-04-27 21:53 ` Mark Hahn
0 siblings, 2 replies; 16+ messages in thread
From: Keith Roberts @ 2008-04-27 17:03 UTC
To: linux-raid
On Wed, 23 Apr 2008, Justin Piszcz wrote:
> To: Maurice Hilarius <maurice@harddata.com>
> From: Justin Piszcz <jpiszcz@lucidpixels.com>
> Subject: Re: Question: how to identify failing disk in a RAID1
>
>
>
> On Wed, 23 Apr 2008, Maurice Hilarius wrote:
>
>> Justin Piszcz wrote:
>>>
>>>
>>> On Wed, 23 Apr 2008, Maurice Hilarius wrote:
>>>
>>>> Hi all.
>>>>
>>>> With much appreciated help from Bell Davidsen and Justin Piszcz I
>>>> recently dealt with a problem with a RAID1 set, caused by a failing
>>>> hard disk.
>>>>
>>>> At the end, there is one question remaining, which I think is quite
>>>> important:
>>>> When one has a RAID5 or RAID6, and a disk starts "acting up" mdadm
>>>> rapidly kicks out the offending device.
>>>> Some might say "too easily" but that is another thread.
>>>>
>>>> On a RAID1 set, until the failing disk completely "packs it in" it
>>>> remains part of the RAID.
>>>>
>>>> Why??
>>>>
>>>> Some more background:
>>>> Since the issue was reported and explored I have recreated this on a
>>>> test machine.
>>>> Installed RAID1 with one known good and one know error prone drive.
>>>> Easy to do as the error drive has a thermal issue.
>>>> Keep it cold, no problems, but after 30 minutes use in a +25C room it
>>>> start to generate data errors.
>>>> I reproduced exactly the problem I saw before:
>>>> Data errors occur, the other drive in the RAID1 set gets "infected"
>>>> with the bad data, and the file system will get corrupted.
>>>> On BOTH drives.
>>>>
>>>> This is highly reproducible.
>>>>
>>>> In summary:
>>>> 1) RAID1 lacks significant protection from the effects of a data
>>>> error condition on a failing drive
>>>> 2) I recommend anyone using madadm refrain from using RAID1 until
>>>> this issue is addressed and resolved.
>>>>
>>>> Thanks again.
>>> I can confirm this, until you actually REBOOT the host with RAID1 only
>>> then will it kick it out. Whereas with RAID5, I experienced the same
>>> thing, it kicks it out right away, would need to wait for the
>>> linux-raid/developers to answer this one.
>>>
>>> Justin.
>>>
>> Actually reboot does not help me.
>> mdadm seems to NEVER "kick out" the bad disk.
>> Even when it is horribly erroring.
>>
>> I think this is a CRITICAL problem, as, if one is using RAID1 thinking
>> it will enhance their data reliability,
>> they stand a very good chance of getting a nasty surprise.
> Yikes, what kernel+mobo+chipset+drives are in use (the developers will
> want to know) also are you using drives on different channels? Or e.g.,
> two drives on one ide cable? (To summarize for the developers)
>
> Justin.
I'm now looking at using smartmontools to monitor my hard
drive's status, maybe instead of using RAID1 arrays.
http://en.wikipedia.org/wiki/S.M.A.R.T.
http://smartmontools.sourceforge.net/
It appears that smartmontools will not work with the linux
software RAID layer. So I guess I need to make a choice of
which one to use - smartmontools or RAID1 mirrors?
Obviously I don't want to be mirroring corrupted drive data.
It would be nice to be able to use smartmontools to monitor
the health of the drives in a RAID1 array. Get the best of
both worlds then.
Is there any way that the smartmontools code can be included
in the md driver code, to allow mdadm access to the SMART
data on a RAID1 set of disks please?
Kind Regards
Keith Roberts
-----------------------------------------------------------------
Websites:
http://www.php-debuggers.net
http://www.karsites.net
http://www.raised-from-the-dead.org.uk
All email addresses are challenge-response protected with
TMDA [http://tmda.net]
-----------------------------------------------------------------
* Re: Question: how to identify failing disk in a RAID1
2008-04-27 17:03 ` Keith Roberts
@ 2008-04-27 19:28 ` Richard Scobie
2008-04-28 5:29 ` Keith Roberts
2008-04-27 21:53 ` Mark Hahn
1 sibling, 1 reply; 16+ messages in thread
From: Richard Scobie @ 2008-04-27 19:28 UTC
To: Linux RAID Mailing List
Keith Roberts wrote:
> It would be nice to be able to use smartmontools to monitor the health
> of the drives in a RAID1 array. Get the best of both worlds then.
I have been doing this for years.
What problems are you seeing using smartd on a RAID1?
Regards,
Richard
* Re: Question: how to identify failing disk in a RAID1
2008-04-27 17:03 ` Keith Roberts
2008-04-27 19:28 ` Richard Scobie
@ 2008-04-27 21:53 ` Mark Hahn
1 sibling, 0 replies; 16+ messages in thread
From: Mark Hahn @ 2008-04-27 21:53 UTC
To: Keith Roberts; +Cc: linux-raid
> Is there any way that the smartmontools code can be included in the md driver
> code, to allow mdadm access to the SMART data on a RAID1 set of disks please?
eh? smart monitors disks. the disks that comprise your raids are still
available as disks, and smart can monitor them just fine...
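To make that concrete, here is a sketch that derives the whole-disk names behind an md array from /proc/mdstat, so each one can be handed to smartctl. It assumes sdX-style member names such as sda3 (it would mangle names like nvme0n1p1); `md_member_disks` is a hypothetical helper name.

```shell
# Print the whole-disk names behind one md array, one per line.
# $1: array name, e.g. md0. Reads /proc/mdstat-style text on stdin.
md_member_disks() {
    awk -v md="$1" '$1 == md && $2 == ":" {
        for (i = 3; i <= NF; i++) {
            if ($i ~ /\[[0-9]+\]/) {
                dev = $i
                sub(/\[.*$/, "", dev)    # drop the "[0]" slot suffix and any flag
                sub(/[0-9]+$/, "", dev)  # drop the partition number
                print dev
            }
        }
    }' | sort -u
}

# Typical use, as root:
#   for d in $(md_member_disks md0 < /proc/mdstat); do
#       smartctl -H "/dev/$d"
#   done
```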
* Re: Question: how to identify failing disk in a RAID1
2008-04-27 19:28 ` Richard Scobie
@ 2008-04-28 5:29 ` Keith Roberts
2008-04-28 6:06 ` Michael Tokarev
2008-04-28 7:01 ` Richard Scobie
0 siblings, 2 replies; 16+ messages in thread
From: Keith Roberts @ 2008-04-28 5:29 UTC
To: linux-raid
On Mon, 28 Apr 2008, Richard Scobie wrote:
> To: Linux RAID Mailing List <linux-raid@vger.kernel.org>
> From: Richard Scobie <richard@sauce.co.nz>
> Subject: Re: Question: how to identify failing disk in a RAID1
>
> Keith Roberts wrote:
>
>> It would be nice to be able to use smartmontools to monitor the health
>> of the drives in a RAID1 array. Get the best of both worlds then.
>
> I have been doing this for years.
>
> What problems are you seeing using smartd on a RAID1?
>
> Regards,
>
> Richard
Reading the documentation for smartmontools I got the
impression that it cannot work with RAID controllers, apart
from 3ware and some Highpoint. Maybe I'm getting mixed up
with hardware raid?
So is it safe to use all features of smartmontools,
including running tests, on a Linux software RAID1 array?
Kind Regards
Keith Roberts
-----------------------------------------------------------------
Websites:
http://www.php-debuggers.net
http://www.karsites.net
http://www.raised-from-the-dead.org.uk
All email addresses are challenge-response protected with
TMDA [http://tmda.net]
-----------------------------------------------------------------
* Re: Question: how to identify failing disk in a RAID1
2008-04-28 5:29 ` Keith Roberts
@ 2008-04-28 6:06 ` Michael Tokarev
2008-04-28 7:01 ` Richard Scobie
1 sibling, 0 replies; 16+ messages in thread
From: Michael Tokarev @ 2008-04-28 6:06 UTC
To: linux-raid
Keith Roberts wrote:
[]
i started writing a reply but later noticed this:
> All email addresses are challenge-response protected with
> TMDA [http://tmda.net]
And removed the reply. Don't outsource YOUR mail filtering
to everyone else. Thank you.
/mjt
* Re: Question: how to identify failing disk in a RAID1
2008-04-28 5:29 ` Keith Roberts
2008-04-28 6:06 ` Michael Tokarev
@ 2008-04-28 7:01 ` Richard Scobie
1 sibling, 0 replies; 16+ messages in thread
From: Richard Scobie @ 2008-04-28 7:01 UTC
To: linux-raid
Keith Roberts wrote:
> Reading the documentation for smartmontools I got the impression that it
> cannot work with RAID controllers, apart from 3ware and some Highpoint.
> Maybe I'm getting mixed up with hardware raid?
Hardware controllers are a different story (you can add LSI to the
above). There are no problems with md RAID.
> So is it safe to use all features of smartmontools, including running
> tests, on a Linux software RAID1 array?
No problems at all that I am aware of. I run smartd and perform long
self checks weekly on all drives I have in live RAID 1 and 5 arrays.
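A sketch of how the results of such long self-tests might be checked per drive. It assumes the table printed by `smartctl -l selftest`, where each entry starts with "# <n>" and failed runs carry a status such as "Completed: read failure"; `selftest_failures` is a hypothetical helper name, and status strings that do not contain the word "failure" are not counted.

```shell
# Count failed entries in a SMART self-test log.
# Reads `smartctl -l selftest` output on stdin and prints the count.
selftest_failures() {
    awk '/^# *[0-9]+ / && /failure/ { n++ } END { print n + 0 }'
}

# Typical use, as root:
#   smartctl -t long /dev/sda                      # start a long self-test
#   smartctl -l selftest /dev/sda | selftest_failures
```

Running the same pair of commands against each member disk of the array shows which drive, if any, has logged a failed self-test.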
Regards,
Richard