* [PATCH 001 of 5] md: Avoid oops when attempting to fix read errors on raid10
  2006-04-28  2:51 [PATCH 000 of 5] md: Introduction - assorted raid10/raid1 fixes NeilBrown
@ 2006-04-28  2:50 ` NeilBrown
  2006-05-01 21:52   ` [stable] " Greg KH
  2006-04-28  2:51 ` [PATCH 002 of 5] md: Fixed refcounting/locking when attempting read error correction in raid10 NeilBrown
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: NeilBrown @ 2006-04-28  2:50 UTC
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, stable


We should add to the counter for the rdev *after* checking
if the rdev is NULL !!!

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid10.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2006-04-28 12:13:20.000000000 +1000
+++ ./drivers/md/raid10.c	2006-04-28 12:13:20.000000000 +1000
@@ -1435,9 +1435,9 @@ static void raid10d(mddev_t *mddev)
 						sl--;
 						d = r10_bio->devs[sl].devnum;
 						rdev = conf->mirrors[d].rdev;
-						atomic_add(s, &rdev->corrected_errors);
 						if (rdev &&
 						    test_bit(In_sync, &rdev->flags)) {
+							atomic_add(s, &rdev->corrected_errors);
 							if (sync_page_io(rdev->bdev,
 									 r10_bio->devs[sl].addr +
 									 sect + rdev->data_offset,


* [PATCH 000 of 5] md: Introduction - assorted raid10/raid1 fixes
@ 2006-04-28  2:51 NeilBrown
  2006-04-28  2:50 ` [PATCH 001 of 5] md: Avoid oops when attempting to fix read errors on raid10 NeilBrown
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: NeilBrown @ 2006-04-28  2:51 UTC
  To: Andrew Morton; +Cc: linux-raid, linux-kernel, stable

Following are 5 patches to md that are suitable for 2.6.17.

The first is also suitable for 2.6.16.something as it is an obvious
trivial fix for an oops. So it and this introduction are cc:ed to stable@kernel.org.

The first two fix problems with the attempt-to-fix-read-errors code in raid10.
The next three fix problems with the handling of BIO_RW_BARRIER requests in raid1.

All patches created against 2.6.17-rc2-mm1. First patch checked to apply to 2.6.16.11.

Thanks,
NeilBrown


 [PATCH 001 of 5] md: Avoid oops when attempting to fix read errors on raid10
 [PATCH 002 of 5] md: Fixed refcounting/locking when attempting read error correction in raid10
 [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
 [PATCH 004 of 5] md: Improve detection of lack of barrier support in raid1
 [PATCH 005 of 5] md: Fix 'rdev->nr_pending' count when retrying barrier requests.


* [PATCH 002 of 5] md: Fixed refcounting/locking when attempting read error correction in raid10
  2006-04-28  2:51 [PATCH 000 of 5] md: Introduction - assorted raid10/raid1 fixes NeilBrown
  2006-04-28  2:50 ` [PATCH 001 of 5] md: Avoid oops when attempting to fix read errors on raid10 NeilBrown
@ 2006-04-28  2:51 ` NeilBrown
  2006-04-28  2:51 ` [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP NeilBrown
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2006-04-28  2:51 UTC
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


We need to hold a reference to the rdevs while reading from
and writing to them when attempting to correct read errors.  This
reference must be taken under the RCU read lock.
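
For illustration, the shape of the fix is: take the reference while still
holding the read-side lock, drop the lock, do the slow I/O, then drop the
reference.  A minimal userspace analogue of that pattern (a pthread rwlock
and an atomic counter stand in for the RCU read lock and ->nr_pending; the
names are made up, this is not the raid10 code):

/* Illustrative analogue only, not kernel code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

struct mirror_dev {                     /* hypothetical stand-in for an rdev */
	atomic_int nr_pending;
	int in_sync;
};

static pthread_rwlock_t conf_lock = PTHREAD_RWLOCK_INITIALIZER;
static struct mirror_dev *mirrors[4];   /* entries may be set to NULL by a "hot remove" */

static int slow_page_io(struct mirror_dev *dev)
{
	(void)dev;
	usleep(1000);                   /* pretend to do a synchronous read */
	return 1;                       /* success */
}

static int try_read_one_copy(int slot)
{
	struct mirror_dev *dev;
	int ok;

	pthread_rwlock_rdlock(&conf_lock);              /* ~ rcu_read_lock()   */
	dev = mirrors[slot];                            /* ~ rcu_dereference() */
	if (dev && dev->in_sync) {
		atomic_fetch_add(&dev->nr_pending, 1);  /* pin it first ...    */
		pthread_rwlock_unlock(&conf_lock);      /* ... then drop the lock */
		ok = slow_page_io(dev);                 /* blocking I/O without the lock held */
		atomic_fetch_sub(&dev->nr_pending, 1);  /* ~ rdev_dec_pending() */
		return ok;
	}
	pthread_rwlock_unlock(&conf_lock);
	return 0;
}

int main(void)
{
	struct mirror_dev d = { .nr_pending = 0, .in_sync = 1 };

	mirrors[0] = &d;
	printf("read from slot 0: %s\n", try_read_one_copy(0) ? "ok" : "failed");
	return 0;
}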


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid10.c |   44 ++++++++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 14 deletions(-)

diff ./drivers/md/raid10.c~current~ ./drivers/md/raid10.c
--- ./drivers/md/raid10.c~current~	2006-04-28 12:13:20.000000000 +1000
+++ ./drivers/md/raid10.c	2006-04-28 12:16:54.000000000 +1000
@@ -1407,36 +1407,45 @@ static void raid10d(mddev_t *mddev)
 				if (s > (PAGE_SIZE>>9))
 					s = PAGE_SIZE >> 9;
 
+				rcu_read_lock();
 				do {
 					int d = r10_bio->devs[sl].devnum;
-					rdev = conf->mirrors[d].rdev;
+					rdev = rcu_dereference(conf->mirrors[d].rdev);
 					if (rdev &&
-					    test_bit(In_sync, &rdev->flags) &&
-					    sync_page_io(rdev->bdev,
-							 r10_bio->devs[sl].addr +
-							 sect + rdev->data_offset,
-							 s<<9,
-							 conf->tmppage, READ))
-						success = 1;
-					else {
-						sl++;
-						if (sl == conf->copies)
-							sl = 0;
+					    test_bit(In_sync, &rdev->flags)) {
+						atomic_inc(&rdev->nr_pending);
+						rcu_read_unlock();
+						success = sync_page_io(rdev->bdev,
+								       r10_bio->devs[sl].addr +
+								       sect + rdev->data_offset,
+								       s<<9,
+								       conf->tmppage, READ);
+						rdev_dec_pending(rdev, mddev);
+						rcu_read_lock();
+						if (success)
+							break;
 					}
+					sl++;
+					if (sl == conf->copies)
+						sl = 0;
 				} while (!success && sl != r10_bio->read_slot);
+				rcu_read_unlock();
 
 				if (success) {
 					int start = sl;
 					/* write it back and re-read */
+					rcu_read_lock();
 					while (sl != r10_bio->read_slot) {
 						int d;
 						if (sl==0)
 							sl = conf->copies;
 						sl--;
 						d = r10_bio->devs[sl].devnum;
-						rdev = conf->mirrors[d].rdev;
+						rdev = rcu_dereference(conf->mirrors[d].rdev);
 						if (rdev &&
 						    test_bit(In_sync, &rdev->flags)) {
+							atomic_inc(&rdev->nr_pending);
+							rcu_read_unlock();
 							atomic_add(s, &rdev->corrected_errors);
 							if (sync_page_io(rdev->bdev,
 									 r10_bio->devs[sl].addr +
@@ -1444,6 +1453,8 @@ static void raid10d(mddev_t *mddev)
 									 s<<9, conf->tmppage, WRITE) == 0)
 								/* Well, this device is dead */
 								md_error(mddev, rdev);
+							rdev_dec_pending(rdev, mddev);
+							rcu_read_lock();
 						}
 					}
 					sl = start;
@@ -1453,17 +1464,22 @@ static void raid10d(mddev_t *mddev)
 							sl = conf->copies;
 						sl--;
 						d = r10_bio->devs[sl].devnum;
-						rdev = conf->mirrors[d].rdev;
+						rdev = rcu_dereference(conf->mirrors[d].rdev);
 						if (rdev &&
 						    test_bit(In_sync, &rdev->flags)) {
+							atomic_inc(&rdev->nr_pending);
+							rcu_read_unlock();
 							if (sync_page_io(rdev->bdev,
 									 r10_bio->devs[sl].addr +
 									 sect + rdev->data_offset,
 									 s<<9, conf->tmppage, READ) == 0)
 								/* Well, this device is dead */
 								md_error(mddev, rdev);
+							rdev_dec_pending(rdev, mddev);
+							rcu_read_lock();
 						}
 					}
+					rcu_read_unlock();
 				} else {
 					/* Cannot read from anywhere -- bye bye array */
 					md_error(mddev, conf->mirrors[r10_bio->devs[r10_bio->read_slot].devnum].rdev);


* [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-28  2:51 [PATCH 000 of 5] md: Introduction - assorted raid10/raid1 fixes NeilBrown
  2006-04-28  2:50 ` [PATCH 001 of 5] md: Avoid oops when attempting to fix read errors on raid10 NeilBrown
  2006-04-28  2:51 ` [PATCH 002 of 5] md: Fixed refcounting/locking when attempting read error correction in raid10 NeilBrown
@ 2006-04-28  2:51 ` NeilBrown
  2006-04-28 13:34   ` Molle Bestefich
  2006-04-28  2:51 ` [PATCH 004 of 5] md: Improve detection of lack of barrier support in raid1 NeilBrown
  2006-04-28  2:51 ` [PATCH 005 of 5] md: Fix 'rdev->nr_pending' count when retrying barrier requests NeilBrown
  4 siblings, 1 reply; 18+ messages in thread
From: NeilBrown @ 2006-04-28  2:51 UTC
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


Because that is what you get if a BIO_RW_BARRIER isn't supported !

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2006-04-28 12:17:27.000000000 +1000
+++ ./drivers/md/raid1.c	2006-04-28 12:17:27.000000000 +1000
@@ -315,7 +315,7 @@ static int raid1_end_write_request(struc
 		if (r1_bio->bios[mirror] == bio)
 			break;
 
-	if (error == -ENOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
+	if (error == -EOPNOTSUPP && test_bit(R1BIO_Barrier, &r1_bio->state)) {
 		set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
 		set_bit(R1BIO_BarrierRetry, &r1_bio->state);
 		r1_bio->mddev->barriers_work = 0;
@@ -1404,7 +1404,7 @@ static void raid1d(mddev_t *mddev)
 			unplug = 1;
 		} else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
 			/* some requests in the r1bio were BIO_RW_BARRIER
-			 * requests which failed with -ENOTSUPP.  Hohumm..
+			 * requests which failed with -EOPNOTSUPP.  Hohumm..
 			 * Better resubmit without the barrier.
 			 * We know which devices to resubmit for, because
 			 * all others have had their bios[] entry cleared.


* [PATCH 004 of 5] md: Improve detection of lack of barrier support in raid1
  2006-04-28  2:51 [PATCH 000 of 5] md: Introduction - assorted raid10/raid1 fixes NeilBrown
                   ` (2 preceding siblings ...)
  2006-04-28  2:51 ` [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP NeilBrown
@ 2006-04-28  2:51 ` NeilBrown
  2006-04-28  2:51 ` [PATCH 005 of 5] md: Fix 'rdev->nr_pending' count when retrying barrier requests NeilBrown
  4 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2006-04-28  2:51 UTC
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


Move the test for 'do barrier work' down a bit so that if the first
write to a raid1 is a BIO_RW_BARRIER write, the checking done by
superblock writes will cause the right thing to happen.


Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |   16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2006-04-28 12:17:27.000000000 +1000
+++ ./drivers/md/raid1.c	2006-04-28 12:17:40.000000000 +1000
@@ -753,18 +753,24 @@ static int make_request(request_queue_t 
 	const int rw = bio_data_dir(bio);
 	int do_barriers;
 
-	if (unlikely(!mddev->barriers_work && bio_barrier(bio))) {
-		bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
-		return 0;
-	}
-
 	/*
 	 * Register the new request and wait if the reconstruction
 	 * thread has put up a bar for new requests.
 	 * Continue immediately if no resync is active currently.
+	 * We test barriers_work *after* md_write_start as md_write_start
+	 * may cause the first superblock write, and that will check out
+	 * if barriers work.
 	 */
+
 	md_write_start(mddev, bio); /* wait on superblock update early */
 
+	if (unlikely(!mddev->barriers_work && bio_barrier(bio))) {
+		if (rw == WRITE)
+			md_write_end(mddev);
+		bio_endio(bio, bio->bi_size, -EOPNOTSUPP);
+		return 0;
+	}
+
 	wait_barrier(conf);
 
 	disk_stat_inc(mddev->gendisk, ios[rw]);


* [PATCH 005 of 5] md: Fix 'rdev->nr_pending' count when retrying barrier requests.
  2006-04-28  2:51 [PATCH 000 of 5] md: Introduction - assorted raid10/raid1 fixes NeilBrown
                   ` (3 preceding siblings ...)
  2006-04-28  2:51 ` [PATCH 004 of 5] md: Improve detection of lack of barrier support in raid1 NeilBrown
@ 2006-04-28  2:51 ` NeilBrown
  4 siblings, 0 replies; 18+ messages in thread
From: NeilBrown @ 2006-04-28  2:51 UTC
  To: Andrew Morton; +Cc: linux-raid, linux-kernel


When retrying a failed BIO_RW_BARRIER request, we need to
keep the reference in ->nr_pending over the whole retry.
Currently, we only hold the reference if the failed request is
the *last* one to finish - which is silly, because it would normally 
be the first to finish.

So move the rdev_dec_pending call up into the didn't-fail branch.
As the rdev isn't used in the later code, calling rdev_dec_pending
earlier doesn't hurt.
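
As a toy model of the intended reference lifetime (the names dev_get,
dev_put, end_write and retry_without_barrier are invented for illustration,
this is not the raid1 code): the completion handler keeps the reference on
the failure path, and only the retry path finally drops it.

/* Toy model only; dev_get/dev_put stand in for the nr_pending accounting. */
#include <stdatomic.h>
#include <stdio.h>

struct dev { atomic_int nr_pending; };

static void dev_get(struct dev *d) { atomic_fetch_add(&d->nr_pending, 1); }
static void dev_put(struct dev *d) { atomic_fetch_sub(&d->nr_pending, 1); }

/* Completion handler: on a barrier failure we do NOT drop the reference;
 * the retry below still needs the device pinned. */
static int end_write(struct dev *d, int barrier_failed)
{
	if (barrier_failed)
		return 1;               /* keep the reference for the retry */
	dev_put(d);                     /* normal completion: really done with the device */
	return 0;
}

static void retry_without_barrier(struct dev *d)
{
	/* ... resubmit the write here, then release the reference we kept */
	dev_put(d);
}

int main(void)
{
	struct dev d = { 0 };

	dev_get(&d);                    /* taken when the write was first issued */
	if (end_write(&d, 1))           /* pretend the barrier write failed */
		retry_without_barrier(&d);
	printf("nr_pending = %d\n", atomic_load(&d.nr_pending));    /* prints 0 */
	return 0;
}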

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/raid1.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff ./drivers/md/raid1.c~current~ ./drivers/md/raid1.c
--- ./drivers/md/raid1.c~current~	2006-04-28 12:17:40.000000000 +1000
+++ ./drivers/md/raid1.c	2006-04-28 12:17:58.000000000 +1000
@@ -319,6 +319,7 @@ static int raid1_end_write_request(struc
 		set_bit(BarriersNotsupp, &conf->mirrors[mirror].rdev->flags);
 		set_bit(R1BIO_BarrierRetry, &r1_bio->state);
 		r1_bio->mddev->barriers_work = 0;
+		/* Don't rdev_dec_pending in this branch - keep it for the retry */
 	} else {
 		/*
 		 * this branch is our 'one mirror IO has finished' event handler:
@@ -365,6 +366,7 @@ static int raid1_end_write_request(struc
 				}
 			}
 		}
+		rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
 	}
 	/*
 	 *
@@ -374,11 +376,9 @@ static int raid1_end_write_request(struc
 	if (atomic_dec_and_test(&r1_bio->remaining)) {
 		if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) {
 			reschedule_retry(r1_bio);
-			/* Don't dec_pending yet, we want to hold
-			 * the reference over the retry
-			 */
 			goto out;
 		}
+		/* it really is the end of this request */
 		if (test_bit(R1BIO_BehindIO, &r1_bio->state)) {
 			/* free extra copy of the data pages */
 			int i = bio->bi_vcnt;
@@ -393,8 +393,6 @@ static int raid1_end_write_request(struc
 		md_write_end(r1_bio->mddev);
 		raid_end_bio_io(r1_bio);
 	}
-
-	rdev_dec_pending(conf->mirrors[mirror].rdev, conf->mddev);
  out:
 	if (to_put)
 		bio_put(to_put);
@@ -1414,6 +1412,7 @@ static void raid1d(mddev_t *mddev)
 			 * Better resubmit without the barrier.
 			 * We know which devices to resubmit for, because
 			 * all others have had their bios[] entry cleared.
+			 * We already have a nr_pending reference on these rdevs.
 			 */
 			int i;
 			clear_bit(R1BIO_BarrierRetry, &r1_bio->state);


* Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-28  2:51 ` [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP NeilBrown
@ 2006-04-28 13:34   ` Molle Bestefich
  2006-04-28 16:48     ` Ric Wheeler
  2006-04-30  4:13     ` [PATCH 003 of 5] " Neil Brown
  0 siblings, 2 replies; 18+ messages in thread
From: Molle Bestefich @ 2006-04-28 13:34 UTC
  To: NeilBrown; +Cc: linux-raid

NeilBrown wrote:
> Change ENOTSUPP to EOPNOTSUPP
> Because that is what you get if a BIO_RW_BARRIER isn't supported !

Dumb question, hope someone can answer it :).

Does this mean that any version of MD up till now won't know that SATA
disks do not support barriers, and therefore won't flush SATA disks,
and therefore I need to disable the disks' write cache if I want to
be 100% sure that raid arrays are not corrupted?

Or am I way off :-).


* Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-28 13:34   ` Molle Bestefich
@ 2006-04-28 16:48     ` Ric Wheeler
  2006-04-29 13:50       ` Molle Bestefich
  2006-04-30  4:13     ` [PATCH 003 of 5] " Neil Brown
  1 sibling, 1 reply; 18+ messages in thread
From: Ric Wheeler @ 2006-04-28 16:48 UTC
  To: Molle Bestefich; +Cc: NeilBrown, linux-raid



Molle Bestefich wrote:

> NeilBrown wrote:
>
>> Change ENOTSUPP to EOPNOTSUPP
>> Because that is what you get if a BIO_RW_BARRIER isn't supported !
>
>
> Dumb question, hope someone can answer it :).
>
> Does this mean that any version of MD up till now won't know that SATA
> disks does not support barriers, and therefore won't flush SATA disks
> and therefore I need to disable the disks's write cache if I want to
> be 100% sure that raid arrays are not corrupted?
>
> Or am I way off :-).
>
You are absolutely right - if you do not have a validated, working 
barrier for your low level devices (or a high end, battery backed array 
or JBOD), you should disable the write cache on your RAIDed partitions 
and on your normal file systems ;-)

There is working support for SCSI (or libata S-ATA) barrier operations 
in mainline, but they conflict with queue-enabled targets, which ends up 
leaving queuing on and disabling the barriers.

ric



* Re: md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-28 16:48     ` Ric Wheeler
@ 2006-04-29 13:50       ` Molle Bestefich
  2006-04-29 20:23         ` Ric Wheeler
  0 siblings, 1 reply; 18+ messages in thread
From: Molle Bestefich @ 2006-04-29 13:50 UTC
  To: Ric Wheeler; +Cc: NeilBrown, linux-raid

Ric Wheeler wrote:
> You are absolutely right - if you do not have a validated, working
> barrier for your low level devices (or a high end, battery backed array
> or JBOD), you should disable the write cache on your RAIDed partitions
> and on your normal file systems ;-)
>
> There is working support for SCSI (or libata S-ATA) barrier operations
> in mainline, but they conflict with queue enable targets which ends up
> leaving queuing on and disabling the barriers.

Thank you very much for the information!

How can I check that I have a validated, working barrier with my
particular kernel version etc.?
(Do I just assume that since it's not SCSI, it doesn't work?)

I find it, hmm... stupefying?  horrendous?  completely brain dead?  I
don't know..  that no one warns users about this.  I bet there's a
million people out there, happily using MD (probably installed and
initialized it with Fedora Core / anaconda) and thinking their data is
safe, while in fact it is anything but.  Damn, this is not a good
situation..

(Any suggestions for a good place to fix this?  Better really really
really late than never...)


* Re: md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-29 13:50       ` Molle Bestefich
@ 2006-04-29 20:23         ` Ric Wheeler
  2006-05-01 16:14           ` Gil
  0 siblings, 1 reply; 18+ messages in thread
From: Ric Wheeler @ 2006-04-29 20:23 UTC
  To: Molle Bestefich; +Cc: NeilBrown, linux-raid



Molle Bestefich wrote:

> Ric Wheeler wrote:
>
>> You are absolutely right - if you do not have a validated, working
>> barrier for your low level devices (or a high end, battery backed array
>> or JBOD), you should disable the write cache on your RAIDed partitions
>> and on your normal file systems ;-)
>>
>> There is working support for SCSI (or libata S-ATA) barrier operations
>> in mainline, but they conflict with queue enable targets which ends up
>> leaving queuing on and disabling the barriers.
>
>
> Thank you very much for the information!
>
> How can I check that I have a validated, working barrier with my
> particular kernel version etc.?
> (Do I just assume that since it's not SCSI, it doesn't work?)

The support is in for all drive types now, but you do have to check.

You should look in /var/log/messages and see that you have something 
like this:

  Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: warning: reiserfs: 
option "skip_busy" is set
  Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: warning: allocator 
options = [00000020] 
  Mar 29 16:07:19 localhost kernel:
  Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: found reiserfs 
format "3.6" with standard journal
  Mar 29 16:07:19 localhost kernel: ReiserFS: sdc14: Using r5 hash to 
sort names
  Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: using ordered data mode
  Mar 29 16:07:20 localhost kernel: reiserfs: using flush barriers

You can also do a sanity check on the number of synchronous IO's/second 
and make sure that it seems sane for your class of drive.  For example, 
I use a simple test which creates files, fills them and then fsync's 
each file before close. 

With the barrier on and write cache active, I can write about 30 (10k) 
files/sec to a new file system.  I get the same number with no barrier 
and write cache off, which is what you would expect.

If you manually mount with barriers off and the write cache on however, 
your numbers will jump up to about 852 (10k) files/sec.  This is the one 
to look out for ;-)
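
In case it is useful, here is a rough sketch of that kind of test (not my
actual script; the file count and size are arbitrary):

/* Create N files, write ~10k to each, fsync, close, report files/sec.
 * Rough illustrative sketch only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define NFILES 200
#define FSIZE  (10 * 1024)

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";
	char buf[FSIZE], path[4096];
	struct timeval t0, t1;
	int i;

	memset(buf, 'x', sizeof(buf));
	gettimeofday(&t0, NULL);

	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "%s/fsync-test-%d", dir, i);
		int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) { perror("open"); return 1; }
		if (write(fd, buf, FSIZE) != FSIZE) { perror("write"); return 1; }
		if (fsync(fd) != 0) { perror("fsync"); return 1; }   /* the interesting part */
		close(fd);
	}

	gettimeofday(&t1, NULL);
	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%d files in %.2f s = %.1f files/sec\n", NFILES, secs, NFILES / secs);
	return 0;
}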

>
> I find it, hmm... stupefying?  horrendous?  completely brain dead?  I
> don't know..  that noone warns users about this.  I bet there's a
> million people out there, happily using MD (probably installed and
> initialized it with Fedora Core / anaconda) and thinking their data is
> safe, while in fact it is anything but.  Damn, this is not a good
> situation..

The widespread use of the write barrier is pretty new stuff.  In 
fairness, the accepted wisdom is (and has been for a long time) to 
always run with write cache off if you care about your data integrity 
(again, regardless of MD or native file system). Think of the write 
barrier support as a great performance boost (I can see almost a 50% win 
in some cases), but getting it well understood and routinely tested is 
still a challenge.

>
> (Any suggestions for a good place to fix this?  Better really really
> really late than never...)

Good test suites and lots of user testing...

ric



* Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-28 13:34   ` Molle Bestefich
  2006-04-28 16:48     ` Ric Wheeler
@ 2006-04-30  4:13     ` Neil Brown
  2006-04-30  5:33       ` Guy
  1 sibling, 1 reply; 18+ messages in thread
From: Neil Brown @ 2006-04-30  4:13 UTC
  To: Molle Bestefich; +Cc: linux-raid

On Friday April 28, molle.bestefich@gmail.com wrote:
> NeilBrown wrote:
> > Change ENOTSUPP to EOPNOTSUPP
> > Because that is what you get if a BIO_RW_BARRIER isn't supported !
> 
> Dumb question, hope someone can answer it :).
> 
> Does this mean that any version of MD up till now won't know that SATA
> disks does not support barriers, and therefore won't flush SATA disks
> and therefore I need to disable the disks's write cache if I want to
> be 100% sure that raid arrays are not corrupted?
> 
> Or am I way off :-).

The effect of this bug is almost unnoticeable.

In almost all cases, md will detect that a drive doesn't support
barriers when writing out the superblock - this is completely separate
code and is correct.  Thus md/raid1 will reject any barrier requests
coming from the filesystem and will never pass them down, and will not
make a wrong decision because of this bug.

The only cases where this bug could cause a problem are:
 1/ when the first write is a barrier write.  It is possible that
    reiserfs does this in some cases.  However only this write will be
    at risk.
 2/ if a device changes its behaviour from accepting barriers to
    not accepting barriers (which is very uncommon).

As md will be rejecting barrier requests, the filesystem will know not
to trust them and should use other techniques such as waiting for
dependent requests to complete, and calling blkdev_issue_flush where
appropriate.

Whether filesystems actually do this, I am less certain.
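
Roughly, the fallback I have in mind looks like this.  It is only a toy
model of the control flow - submit_write(), issue_flush() and the flag are
invented for illustration, it is not ext3/reiserfs/JBD code:

/* Toy model: "barrier rejected -> stop using barriers, retry plainly, flush". */
#include <errno.h>
#include <stdio.h>

#define WF_BARRIER 0x1

static int device_supports_barriers = 0;   /* pretend md has started rejecting barriers */
static int use_barriers = 1;               /* the filesystem's own flag */

static int submit_write(int flags)
{
	if ((flags & WF_BARRIER) && !device_supports_barriers)
		return -EOPNOTSUPP;            /* what raid1 now returns */
	return 0;
}

static int issue_flush(void)
{
	/* stand-in for blkdev_issue_flush() / waiting for dependent writes */
	return 0;
}

static int commit_write(void)
{
	if (use_barriers) {
		int err = submit_write(WF_BARRIER);
		if (err != -EOPNOTSUPP)
			return err;
		use_barriers = 0;              /* first failure: stop trusting barriers */
		fprintf(stderr, "barriers not supported, falling back\n");
	}
	/* traditional ordering: plain write followed by an explicit flush */
	if (submit_write(0) != 0)
		return -EIO;
	return issue_flush();
}

int main(void)
{
	int first = commit_write();

	printf("first commit:  %d (use_barriers now %d)\n", first, use_barriers);
	printf("second commit: %d\n", commit_write());
	return 0;
}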

NeilBrown


* RE: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-30  4:13     ` [PATCH 003 of 5] " Neil Brown
@ 2006-04-30  5:33       ` Guy
  2006-04-30  6:00         ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Guy @ 2006-04-30  5:33 UTC
  To: 'Neil Brown', 'Molle Bestefich'; +Cc: linux-raid

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of Neil Brown
} Sent: Sunday, April 30, 2006 12:13 AM
} To: Molle Bestefich
} Cc: linux-raid@vger.kernel.org
} Subject: Re: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
} 
} On Friday April 28, molle.bestefich@gmail.com wrote:
} > NeilBrown wrote:
} > > Change ENOTSUPP to EOPNOTSUPP
} > > Because that is what you get if a BIO_RW_BARRIER isn't supported !
} >
} > Dumb question, hope someone can answer it :).
} >
} > Does this mean that any version of MD up till now won't know that SATA
} > disks does not support barriers, and therefore won't flush SATA disks
} > and therefore I need to disable the disks's write cache if I want to
} > be 100% sure that raid arrays are not corrupted?
} >
} > Or am I way off :-).
} 
} The effect of this bug is almost unnoticeable.
} 
} In almost all cases, md will detect that a drive doesn't support
} barriers when writing out the superblock - this is completely separate
} code and is correct.  Thus md/raid1 will reject any barrier requests
} coming from the filesystem and will never pass them down, and will not
} make a wrong decision because of this bug.
} 
} The only cases where this bug could cause a problem are:
}  1/ when the first write is a barrier write.  It is possible that
}     reiserfs does this in some cases.  However only this write will be
}     at risk.
}  2/ if a device changes its behaviour from accepting barriers to
}     not accepting barrier (Which is very uncommon).
} 
} As md will be rejecting barrier requests, the filesystem will know not
} to trust them and should use other techniques such as waiting for
} dependant requests to complete, and calling blkdev_issue_flush were
} appropriate.
} 
} Whether filesystems actually do this, I am less certain.

What if a disk is hot-added while the filesystem is mounted, and the new
disk does not support barriers but the old disks do?  Or you have a mix?

If the new disk can't be handled correctly, maybe md should refuse to add
it.

Guy

} 
} NeilBrown



* RE: [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-30  5:33       ` Guy
@ 2006-04-30  6:00         ` Neil Brown
  0 siblings, 0 replies; 18+ messages in thread
From: Neil Brown @ 2006-04-30  6:00 UTC
  To: Guy; +Cc: 'Molle Bestefich', linux-raid

On Sunday April 30, bugzilla@watkins-home.com wrote:
> 
> What if a disk is hot added while the filesystem is mounted.  And the new
> disk does not support barriers but the old disks do?  Or you have a mix?

When a disk is added, md will write the superblock and determine if
barriers are supported or not at that time.
On discovering that barriers are not supported, md will flag the whole
array as not supporting barriers and will start rejecting barrier
requests.

Filesystems that I have looked at are written to cope with barriers
stopping working.  The first time they get a failure, they turn off
their own "use barriers" flag and retry the request in a more
traditional way.

> 
> If the new disk can't be handled correctly, maybe md should refuse to add
> it.

That isn't necessary.

NeilBrown


* Re: md: Change ENOTSUPP to EOPNOTSUPP
  2006-04-29 20:23         ` Ric Wheeler
@ 2006-05-01 16:14           ` Gil
  2006-05-02  1:54             ` Paul Clements
  0 siblings, 1 reply; 18+ messages in thread
From: Gil @ 2006-05-01 16:14 UTC
  To: linux-raid

So for those of us using other filesystems (e.g. ext2/3), is there
some way to determine whether or not barriers are available?

Neil says that mdadm detects this and writes it into the superblock
but I can't find a reference to this anywhere in the md output in
the logs or in any interactive output of mdadm.

FWIW I'm running mdadm 1.11 with 0.90.03 superblocks (e.g. stock
Fedora Core 4).

Any ideas?

TIA,

--Gil

Ric Wheeler wrote:
> 
> 
> Molle Bestefich wrote:
> 
>> Ric Wheeler wrote:
>>
>>> You are absolutely right - if you do not have a validated, working
>>> barrier for your low level devices (or a high end, battery backed array
>>> or JBOD), you should disable the write cache on your RAIDed partitions
>>> and on your normal file systems ;-)
>>>
>>> There is working support for SCSI (or libata S-ATA) barrier operations
>>> in mainline, but they conflict with queue enable targets which ends up
>>> leaving queuing on and disabling the barriers.
>>
>>
>> Thank you very much for the information!
>>
>> How can I check that I have a validated, working barrier with my
>> particular kernel version etc.?
>> (Do I just assume that since it's not SCSI, it doesn't work?)
> 
> The support is in for all drive types now, but you do have to check.
> 
> You should look in /var/log/messages and see that you have something
> like this:
> 
>  Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: warning: reiserfs:
> option "skip_busy" is set
>  Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: warning: allocator
> options = [00000020]  Mar 29 16:07:19 localhost kernel:
>  Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: found reiserfs
> format "3.6" with standard journal
>  Mar 29 16:07:19 localhost kernel: ReiserFS: sdc14: Using r5 hash to
> sort names
>  Mar 29 16:07:19 localhost kernel: ReiserFS: sda14: using ordered data mode
>  Mar 29 16:07:20 localhost kernel: reiserfs: using flush barriers
> 
> You can also do a sanity check on the number of synchronous IO's/second
> and make sure that it seems sane for your class of drive.  For example,
> I use a simple test which creates files, fills them and then fsync's
> each file before close.
> With the barrier on and write cache active, I can write about 30 (10k)
> files/sec to a new file system.  I get the same number with no barrier
> and write cache off which is what you would expect.
> 
> If you manually mount with barriers off and the write cache on however,
> your numbers will jump up to about 852 (10k) files/sec.  This is the one
> to look out for ;-)
> 
>>
>> I find it, hmm... stupefying?  horrendous?  completely brain dead?  I
>> don't know..  that noone warns users about this.  I bet there's a
>> million people out there, happily using MD (probably installed and
>> initialized it with Fedora Core / anaconda) and thinking their data is
>> safe, while in fact it is anything but.  Damn, this is not a good
>> situation..
> 
> The wide spread use of the write barrier is pretty new stuff.  In
> fairness, the accepted wisdom is (and has been for a long time) to
> always run with write cache off if you care about your data integrity
> (again, regardless of MD or native file system). Think of the write
> barrier support as a great performance boost (I can see almost a 50% win
> in some cases), but getting it well understood and routinely tested is
> still a challenge.
> 
>>
>> (Any suggestions for a good place to fix this?  Better really really
>> really late than never...)
> 
> Good test suites and lots of user testing...
> 
> ric
> 



* Re: [stable] [PATCH 001 of 5] md: Avoid oops when attempting to fix read errors on raid10
  2006-04-28  2:50 ` [PATCH 001 of 5] md: Avoid oops when attempting to fix read errors on raid10 NeilBrown
@ 2006-05-01 21:52   ` Greg KH
  0 siblings, 0 replies; 18+ messages in thread
From: Greg KH @ 2006-05-01 21:52 UTC
  To: NeilBrown; +Cc: Andrew Morton, linux-raid, linux-kernel, stable

On Fri, Apr 28, 2006 at 12:50:15PM +1000, NeilBrown wrote:
> 
> We should add to the counter for the rdev *after* checking
> if the rdev is NULL !!!
> 
> Signed-off-by: Neil Brown <neilb@suse.de>

Queued to -stable, thanks.

greg k-h


* Re: md: Change ENOTSUPP to EOPNOTSUPP
  2006-05-01 16:14           ` Gil
@ 2006-05-02  1:54             ` Paul Clements
  2006-05-02  2:17               ` Mike Hardy
  2006-05-02 11:37               ` Ric Wheeler
  0 siblings, 2 replies; 18+ messages in thread
From: Paul Clements @ 2006-05-02  1:54 UTC
  To: Gil; +Cc: linux-raid

Gil wrote:
> So for those of us using other filesystems (e.g. ext2/3), is there
> some way to determine whether or not barriers are available?

You'll see something like this in your system log if barriers are not 
supported:

Apr  3 16:44:01 adam kernel: JBD: barrier-based sync failed on md0 - 
disabling barriers


Otherwise, assume that they are. But like Neil said, it shouldn't matter 
to a user whether they are supported or not. Filesystems will work 
correctly either way.

--
Paul


* Re: md: Change ENOTSUPP to EOPNOTSUPP
  2006-05-02  1:54             ` Paul Clements
@ 2006-05-02  2:17               ` Mike Hardy
  2006-05-02 11:37               ` Ric Wheeler
  1 sibling, 0 replies; 18+ messages in thread
From: Mike Hardy @ 2006-05-02  2:17 UTC
  To: Paul Clements, Gil, linux-raid


Paul Clements wrote:
> Gil wrote:
> 
>> So for those of us using other filesystems (e.g. ext2/3), is there
>> some way to determine whether or not barriers are available?
> 
> 
> You'll see something like this in your system log if barriers are not
> supported:
> 
> Apr  3 16:44:01 adam kernel: JBD: barrier-based sync failed on md0 -
> disabling barriers
> 
> 
> Otherwise, assume that they are. But like Neil said, it shouldn't matter
> to a user whether they are supported or not. Filesystems will work
> correctly either way.

This seems very important to me to understand thoroughly, so please
forgive me if I'm being dense.

What I'm not sure of in the above is for what definition of "working"?

For the definition where the code simply doesn't bomb out, or for the
stricter definition that despite write caching at the drive level there
is no point where there could possibly be a data inconsistency between
what the filesystem thinks is written and what got written, power loss
or no?

My understanding to this point is that with write caching and no barrier
support, you would still care as power loss would give you a window of
inconsistency.

With the exception of the very minor situation Neil mentioned about the
first write through md not being a superblock write...

-Mike


* Re: md: Change ENOTSUPP to EOPNOTSUPP
  2006-05-02  1:54             ` Paul Clements
  2006-05-02  2:17               ` Mike Hardy
@ 2006-05-02 11:37               ` Ric Wheeler
  1 sibling, 0 replies; 18+ messages in thread
From: Ric Wheeler @ 2006-05-02 11:37 UTC
  To: Paul Clements; +Cc: Gil, linux-raid



Paul Clements wrote:

>
> You'll see something like this in your system log if barriers are not 
> supported:
>
> Apr  3 16:44:01 adam kernel: JBD: barrier-based sync failed on md0 - 
> disabling barriers
>
>
> Otherwise, assume that they are. But like Neil said, it shouldn't 
> matter to a user whether they are supported or not. Filesystems will 
> work correctly either way.
>
> -- 
> Paul
>
File systems will work correctly, but if you are running without 
barriers and with your write cache enabled, you are running the risk of 
data loss or file system corruption on any power loss.

It is an issue of concern since drive companies ship the write cache on 
by default. When we detect a non-supported drive (queuing enabled, lack 
of support for the low-level barrier mechanism), we disable future
barrier requests and leave the write cache enabled.

I guess that you could argue that this is what most home users want 
(i.e., best performance at the cost of some possible data loss on power 
outage since most people lose power rarely these days), but it is not 
good enough for critical data storage.

I would suggest that if you see this message on ext3 (or the reiserfs 
message for reiser users), you should run with your write cache disabled 
by default or disable queuing (which is often the reason the barrier ops 
get disabled).

ric



Thread overview: 18+ messages
2006-04-28  2:51 [PATCH 000 of 5] md: Introduction - assorted raid10/raid1 fixes NeilBrown
2006-04-28  2:50 ` [PATCH 001 of 5] md: Avoid oops when attempting to fix read errors on raid10 NeilBrown
2006-05-01 21:52   ` [stable] " Greg KH
2006-04-28  2:51 ` [PATCH 002 of 5] md: Fixed refcounting/locking when attempting read error correction in raid10 NeilBrown
2006-04-28  2:51 ` [PATCH 003 of 5] md: Change ENOTSUPP to EOPNOTSUPP NeilBrown
2006-04-28 13:34   ` Molle Bestefich
2006-04-28 16:48     ` Ric Wheeler
2006-04-29 13:50       ` Molle Bestefich
2006-04-29 20:23         ` Ric Wheeler
2006-05-01 16:14           ` Gil
2006-05-02  1:54             ` Paul Clements
2006-05-02  2:17               ` Mike Hardy
2006-05-02 11:37               ` Ric Wheeler
2006-04-30  4:13     ` [PATCH 003 of 5] " Neil Brown
2006-04-30  5:33       ` Guy
2006-04-30  6:00         ` Neil Brown
2006-04-28  2:51 ` [PATCH 004 of 5] md: Improve detection of lack of barrier support in raid1 NeilBrown
2006-04-28  2:51 ` [PATCH 005 of 5] md: Fix 'rdev->nr_pending' count when retrying barrier requests NeilBrown
