* Good and bad news on 2.1.110, and a fix
From: Stephen C. Tweedie @ 1998-07-23 12:48 UTC
  To: Linus Torvalds, Alan Cox, David S. Miller, Bill Hawes,
	Ingo Molnar, Mark Hemment
  Cc: linux-mm, linux-kernel, Stephen Tweedie

As the subject says, 2.1.110 is both very very promising and a stability
nightmare depending on what you are doing with it.  Fortunately, a very
simple failsafe mechanism against the observed problems seems to deal
with them extremely efficiently without any other performance impact,
and the resulting VM appears to be very stable.


The new memory code in get_free_pages is founded on a good principle: if
you don't let non-atomic consumers at the last few pages, then the last
high order pages get reserved for atomic use.
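
Roughly, the principle amounts to something like this (the names below
are made up for illustration; they are not the actual 2.1.110
identifiers):

/* Sketch only: refuse non-atomic callers once free memory falls to a
 * reserve threshold, so the remaining pages (including the higher
 * order ones) stay available for atomic users. */
static int may_allocate(int nr_free, int reserve, int atomic)
{
        if (atomic)
                return 1;       /* atomic callers may dip into the reserve */
        return nr_free > reserve;
}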

That's great as far as it goes.  When 2.1.110 gets going, it works a
treat: the performance on low memory is by *far* the best of any recent
kernels, and approaches 2.0 performance.  I'm seriously impressed.

However, if the memory usage pattern just happens by chance to fail to
leave any high order pages free, then free_memory_available() just
gives up in disgust.  My first attempt at booting 2.1.110 on a
low-memory setup failed because 4k NFS deadlocked during logon.
Shift-ScrollLock showed all the free memory in 4k and 8k pages, but 4k
NFS requires 16k pages.

The problem is twofold.  First of all, with low memory it is enormously
harder to get 16k free pages than 8k free pages.  With the default
SLAB_BREAK_GFP_ORDER setting of two, the slab allocator tries to
allocate 16k slabs for every object over 2048 bytes.  Setting this to
one instead improved things dramatically, and I haven't been able to
reproduce the problem since.  This does make some allocations less
efficient, but since it is primarily networking which creates atomic
demand for higher order pages than 8k, we can expect the allocations to
be sufficiently short-lived that the packing density is not important.
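
As a rough worked example of the packing tradeoff (slab overheads such
as colouring and bufctls are ignored, so the numbers are only
illustrative):

#include <stdio.h>

/* Illustrative only: how many objects of a given size fit in a slab of
 * the given gfp order, assuming 4k pages and ignoring slab overhead. */
static unsigned int objs_per_slab(unsigned int obj_size, unsigned int order)
{
        return (4096u << order) / obj_size;
}

int main(void)
{
        printf("order 2 (16k slab): %u objects\n", objs_per_slab(3000, 2)); /* 5 */
        printf("order 1 ( 8k slab): %u objects\n", objs_per_slab(3000, 1)); /* 2 */
        return 0;
}

The lower break order packs such objects less densely, but only needs
order-1 pages, which are far easier to find on a fragmented low-memory
box.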

The second problem is more serious: free_memory_available() simply
doesn't care about page orders any more, and if you don't have enough
high order pages then you won't be given any.  A "ping -s 3000" to a
2.1.110 box doing NFS will kill it, even on a 16MB configuration.
Depending on the amount of background VM activity (e.g. cron), it may
unstick itself after a few minutes once the attack stops, but a complete
session freeze for 4 or 5 minutes is still pretty bad.  Shift-ScrollLock
on the 16MB box in this mode shows all 97 free pages being of order 0;
there are no higher order pages available at all, even after the ping
flood.


Linus, the patch at the end fixes these two problems for me, in a
painless manner.  The patch to slab.c simply makes SLAB_BREAK_GFP_ORDER
dependent on the memory size, and defaults to 1 instead of 2 if the
machine has less than 16MB.

The patch to page_alloc.c is a minimal fix for the fragmentation
problem.  It simply records allocation failures for high-order pages,
and forces free_memory_available to return false until a page of at
least that order becomes available.  The impact should be low, since
with the SLAB_BREAK_GFP_ORDER patch 2.1.111-pre1 seems to survive pretty
well anyway (and hence won't invoke the new mechanism), but in cases of
major atomic allocation load, the patch allows even low memory machines
to survive the ping attack handsomely (even with 8k NFS on a 6.5MB
configuration).  I get tons of "IP: queue_glue: no memory for gluing
queue" failures, but enough NFS retries get through even during the ping
flood to prevent any NFS server unreachables happening.


2.1.110 has fixed most of the VM problems I've been tracking.  It
eliminates the rusting memory death: a "find /" can still increase the
inode cache enough to cause a small but perceptible and permanent
performance drop on low memory, but at least it is stable, does not
appear to be cumulative, and does not result in swap deaths.  The page
cache is trimmed very nicely, and compiles are running more smoothly
than ever on 2.1.  The only stability problems I found were the atomic
allocation failures: preventing boot is a _serious_ problem.  With these
fixes in place, even that problem appears to have vanished.

It's looking good.  Comments?

--Stephen
----------------------------------------------------------------
--- mm/page_alloc.c.~1~	Wed Jul 22 14:48:23 1998
+++ mm/page_alloc.c	Thu Jul 23 13:00:54 1998
@@ -31,6 +31,8 @@
 int nr_swap_pages = 0;
 int nr_free_pages = 0;
 
+static int max_failed_order;
+
 /*
  * Free area management
  *
@@ -114,6 +116,23 @@
 {
 	static int available = 1;
 
+	/* First, perform a very simple test for fragmentation */
+	if (max_failed_order) {
+		unsigned long flags;
+		struct free_area_struct * list;
+		spin_lock_irqsave(&page_alloc_lock, flags);
+		for (list = free_area+max_failed_order;
+		     list < free_area+NR_MEM_LISTS;
+		     list++) {
+			if (list->next != memory_head(list))
+				break;
+		}
+		spin_unlock_irqrestore(&page_alloc_lock, flags);
+		if (list == free_area+NR_MEM_LISTS)
+			return 0;
+		max_failed_order = 0;
+	}
+	
 	if (nr_free_pages < freepages.low) {
 		available = 0;
 		return 0;
@@ -209,6 +228,8 @@
 				nr_free_pages -= 1 << order; \
 				EXPAND(ret, map_nr, order, new_order, area); \
 				spin_unlock_irqrestore(&page_alloc_lock, flags); \
+				if (order >= max_failed_order) \
+					max_failed_order = 0; \
 				return ADDRESS(map_nr); \
 			} \
 			prev = ret; \
@@ -263,6 +284,8 @@
 	spin_lock_irqsave(&page_alloc_lock, flags);
 	RMQUEUE(order, (gfp_mask & GFP_DMA));
 	spin_unlock_irqrestore(&page_alloc_lock, flags);
+	if (order > max_failed_order)
+		max_failed_order = order;
 nopage:
 	return 0;
 }
--- mm/slab.c.~1~	Wed Jul  8 14:35:46 1998
+++ mm/slab.c	Thu Jul 23 12:41:57 1998
@@ -313,7 +313,9 @@
 /* If the num of objs per slab is <= SLAB_MIN_OBJS_PER_SLAB,
  * then the page order must be less than this before trying the next order.
  */
-#define	SLAB_BREAK_GFP_ORDER	2
+#define	SLAB_BREAK_GFP_ORDER_HI	2
+#define	SLAB_BREAK_GFP_ORDER_LO	1
+static int slab_break_gfp_order = SLAB_BREAK_GFP_ORDER_LO;
 
 /* Macros for storing/retrieving the cachep and or slab from the
  * global 'mem_map'.  With off-slab bufctls, these are used to find the
@@ -447,6 +449,11 @@
 	cache_cache.c_colour = (i-(cache_cache.c_num*size))/L1_CACHE_BYTES;
 	cache_cache.c_colour_next = cache_cache.c_colour;
 
+	/* Fragmentation resistance on low memory */
+	if ((num_physpages * PAGE_SIZE) < 16 * 1024 * 1024)
+		slab_break_gfp_order = SLAB_BREAK_GFP_ORDER_LO;
+	else
+		slab_break_gfp_order = SLAB_BREAK_GFP_ORDER_HI;
 	return start;
 }
 
@@ -869,7 +876,7 @@
 		 * bad for the gfp()s.
 		 */
 		if (cachep->c_num <= SLAB_MIN_OBJS_PER_SLAB) {
-			if (cachep->c_gfporder < SLAB_BREAK_GFP_ORDER)
+			if (cachep->c_gfporder < slab_break_gfp_order)
 				goto next;
 		}
 


* Re: Good and bad news on 2.1.110, and a fix
From: Bill Hawes @ 1998-07-23 16:08 UTC
  To: Stephen C. Tweedie
  Cc: Linus Torvalds, Alan Cox, David S. Miller, Ingo Molnar,
	Mark Hemment, linux-mm, linux-kernel

Stephen C. Tweedie wrote:
 
> The patch to page_alloc.c is a minimal fix for the fragmentation
> problem.  It simply records allocation failures for high-order pages,
> and forces free_memory_available to return false until a page of at
> least that order becomes available.  The impact should be low, since
> with the SLAB_BREAK_GFP_ORDER patch 2.1.111-pre1 seems to survive pretty
> well anyway (and hence won't invoke the new mechanism), but in cases of
> major atomic allocation load, the patch allows even low memory machines
> to survive the ping attack handsomely (even with 8k NFS on a 6.5MB
> configuration).  I get tons of "IP: queue_glue: no memory for gluing
> queue" failures, but enough NFS retries get through even during the ping
> flood to prevent any NFS server unreachables happening.

Hi Stephen,

Your change to track the maximum failed allocation looks helpful, as
this will focus extra swap attention when a problem actually occurs. So
assuming that the client has a retry capability (as with NFS), it should
improve recoverability.

One possible downside is that kswapd infinite looping may become more
likely, as we still have no way to determine when the memory
configuration makes it impossible to achieve the memory goal. I still
see this "swap deadlock" in 110 (and all recent kernels) under low
memory or by doing a swapoff. Any ideas on how to best determine an
infeasible memory configuration?

Under some conditions the most helpful action may be to let some
allocations fail, to shed load or kill processes. (But selecting the
right process to kill may not be easy ...)

Regards,
Bill

* Re: Good and bad news on 2.1.110, and a fix
From: Stephen C. Tweedie @ 1998-07-23 17:30 UTC
  To: Bill Hawes
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, David S. Miller,
	Ingo Molnar, Mark Hemment, linux-mm, linux-kernel

Hi,

On Thu, 23 Jul 1998 12:08:08 -0400, Bill Hawes <whawes@star.net> said:

> Your change to track the maximum failed allocation looks helpful, as
> this will focus extra swap attention when a problem actually occurs. So
> assuming that the client has a retry capability (as with NFS), it should
> improve recoverability.

> One possible downside is that kswapd infinite looping may become more
> likely, as we still have no way to determine when the memory
> configuration makes it impossible to achieve the memory goal. I still
> see this "swap deadlock" in 110 (and all recent kernels) under low
> memory or by doing a swapoff. Any ideas on how to best determine an
> infeasible memory configuration?

Yes.  One thing I had toyed with, and have implemented on test kernels
based on 2.1.108, was simply to keep a history of VM activity so that we
base swapping decisions only on recent requests.  Ageing the
max_failed_order variable so that it is reset every second or so would
at least prevent a swap deadlock if the large allocation was only a
one-off event, but won't help if there is something like NFS repeatedly
demanding the memory.
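
A minimal sketch of that ageing, on top of the max_failed_order patch
(the helper names and the one-second window are illustrative only):

static unsigned long max_failed_time;   /* jiffies of the last recorded failure */

/* called where the patch currently records a failure in __get_free_pages() */
static void note_failed_order(int order)
{
        if (order > max_failed_order)
                max_failed_order = order;
        max_failed_time = jiffies;
}

/* called at the top of free_memory_available(), before the
 * fragmentation test: forget a failure more than a second old */
static void age_failed_order(void)
{
        if (max_failed_order && jiffies - max_failed_time > HZ)
                max_failed_order = 0;
}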

That said, if NFS is deadlocked on a large allocation, then we have a
hung machine _anyway_, and if a swap storm is the only conceivable way
out of it, it's not clear that it's a bad thing to do!

> Under some conditions the most helpful action may be to let some
> allocations fail, to shed load or kill processes. (But selecting the
> right process to kill may not be easy ...)

Yes, and one thing we should perhaps do is to limit pageable allocations
such that they never exhaust the supply of higher order pages
completely.  However, that still won't help if it's atomic allocations
which are causing the shortage.  In this case, probably the only hope of
progress is a swapper which can actively return entire free zones.

Hmm, how about this for a thought: why not stall all pageable
allocations completely if we get into this situation, and give the
swapper enough breathing space to get a higher order page free?  The
situation should be sufficiently infrequent that it shouldn't impact
performance at all, and there are very few places which would need to
pass the new GFP_PAGEABLE flag into get_free_pages (or we could simply
apply it to all __GFP_LOW/__GFP_WAIT allocations).  
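
Something along these lines, as a sketch only (GFP_PAGEABLE, the wait
queue and the function names here are all hypothetical, and so is the
exact gating on __GFP_WAIT):

static struct wait_queue *pageable_wait = NULL;

/* at the top of __get_free_pages(), for pageable callers only: wait
 * until the outstanding high-order failure has been repaired */
static void stall_pageable(int gfp_mask)
{
        while ((gfp_mask & __GFP_WAIT) && max_failed_order)
                sleep_on(&pageable_wait);
}

/* called once a page of the failed order has been freed again */
static void unstall_pageable(void)
{
        max_failed_order = 0;
        wake_up(&pageable_wait);
}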

This will still fail if *all* user memory is fragmented, but the zoned
allocator would fix that too.  However, we're now getting into the
realms of the extremely unlikely, so it's probably not important to go
that far unless we have benchmarks which show it to be a problem.

I'm off at a wedding until Monday, so feel free to implement something
over the weekend yourself. :)

--Stephen

* Re: Good and bad news on 2.1.110, and a fix
From: Rik van Riel @ 1998-07-23 20:28 UTC
  To: Bill Hawes
  Cc: Stephen C. Tweedie, Linus Torvalds, Alan Cox, David S. Miller,
	Ingo Molnar, Mark Hemment, linux-mm, linux-kernel

On Thu, 23 Jul 1998, Bill Hawes wrote:
> Stephen C. Tweedie wrote:
>  
> > The patch to page_alloc.c is a minimal fix for the fragmentation
> > problem.  It simply records allocation failures for high-order pages,
> > and forces free_memory_available to return false until a page of at
> > least that order becomes available.  The impact should be low, since

This sounds suspiciously like the first version of
free_memory_available() that Linus introduced in
2.1.89...

> One possible downside is that kswapd infinite looping may become more
> likely, as we still have no way to determine when the memory

It will happen for sure; just think of what will happen
when that 64 kB DMA allocation fails on your 6 MB box :(

We saw the results in 2.1.89 and I don't see any reason
to repeat the experiments now, at least not until Bill's
patch for freeing inodes is merged...

> configuration makes it impossible to achieve the memory goal. I still
> see this "swap deadlock" in 110 (and all recent kernels) under low
> memory or by doing a swapoff. Any ideas on how to best determine an
> infeasible memory configuration?

Well, freepages.high should be a nice hint as to when to
stop; unfortunately it is used now without taking fragmentation
into account.

Maybe we want to count the number of order-3 memory structures
free and keep that number above a certain level (back to
Zlatko's 2.1.59 patch :-).
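
Something like this, reusing the conventions visible in Stephen's patch
(free_area, memory_head, page_alloc_lock); the function name and the
idea of feeding it into a tunable threshold are only illustrative:

/* Count the free areas on the order-3 list; a caller could then insist
 * that this number stays above some tunable minimum. */
static int nr_order3_free(void)
{
        struct free_area_struct *list = free_area + 3;
        struct page *p;
        unsigned long flags;
        int n = 0;

        spin_lock_irqsave(&page_alloc_lock, flags);
        for (p = list->next; p != memory_head(list); p = p->next)
                n++;
        spin_unlock_irqrestore(&page_alloc_lock, flags);
        return n;
}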

Rik.
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


* Re: Good and bad news on 2.1.110, and a fix
From: Stephen C. Tweedie @ 1998-07-27 11:07 UTC
  To: Rik van Riel
  Cc: Bill Hawes, Stephen C. Tweedie, Linus Torvalds, Alan Cox,
	David S. Miller, Ingo Molnar, Mark Hemment, linux-mm,
	linux-kernel

Hi,

On Thu, 23 Jul 1998 22:28:39 +0200 (CEST), Rik van Riel
<H.H.vanRiel@phys.uu.nl> said:

> On Thu, 23 Jul 1998, Bill Hawes wrote:
>> Stephen C. Tweedie wrote:
>> 
>> > The patch to page_alloc.c is a minimal fix for the fragmentation
>> > problem.  It simply records allocation failures for high-order pages,
>> > and forces free_memory_available to return false until a page of at
>> > least that order becomes available.  The impact should be low, since

> This sounds suspiciously like the first version of
> free_memory_available() that Linus introduced in
> 2.1.89...

No, it's very different; first, it is adaptive, and second, it only
waits for _one_ of the higher order free page lists to become
non-empty.  The patch carefully does absolutely nothing until we get a
definite failure to get a higher order page, and then it does the
minimum necessary work to satisfy one request before going inactive again.

It is the minimum necessary patch to keep the kernel from locking up,
but it does nothing at all most of the time.  

> It will happen for sure; just think of what will happen
> when that 64 kB DMA allocation fails on your 6 MB box :(

Which is one reason why we probably want to time out the condition
after a second or two.

> Maybe we want to count the number of order-3 memory structures
> free and keep that number above a certain level (back to
> Zlatko's 2.1.59 patch :-).

Again, it's arbitrary, and would result in unnecessary extra
activity.  What we'd like to do is make sure we only do _necessary_
paging work, and keep as much memory as possible in use the rest of
the time.

--Stephen
