Linux-mm Archive mirror
* [PATCH] swapin readahead v3 + kswapd fixes
@ 1998-12-01  6:55 Rik van Riel
  1998-12-01  8:15 ` Andrea Arcangeli
  1998-12-17  1:24 ` Linus Torvalds
  0 siblings, 2 replies; 32+ messages in thread
From: Rik van Riel @ 1998-12-01  6:55 UTC (permalink / raw)
  To: Linux MM; +Cc: Linus Torvalds, Linux-Kernel

Hi,

I just created a third version of my swapin readahead patch.

It has all sorts of other kswapd fixes too, so you should
probably take a look even if you aren't interested in
swapin readahead at all. I'd really like your opinions on
this...

cheers,

Rik -- now completely used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--- ./mm/vmscan.c.orig	Thu Nov 26 11:26:50 1998
+++ ./mm/vmscan.c	Tue Dec  1 07:12:28 1998
@@ -431,6 +431,8 @@
 	kmem_cache_reap(gfp_mask);
 
 	if (buffer_over_borrow() || pgcache_over_borrow())
+		state = 0;		
+	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster / 2)
 		shrink_mmap(i, gfp_mask);
 
 	switch (state) {
--- ./mm/page_io.c.orig	Thu Nov 26 11:26:49 1998
+++ ./mm/page_io.c	Thu Nov 26 11:30:43 1998
@@ -60,7 +60,7 @@
 	}
 
 	/* Don't allow too many pending pages in flight.. */
-	if (atomic_read(&nr_async_pages) > SWAP_CLUSTER_MAX)
+	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
 		wait = 1;
 
 	p = &swap_info[type];
--- ./mm/page_alloc.c.orig	Thu Nov 26 11:26:49 1998
+++ ./mm/page_alloc.c	Tue Dec  1 07:25:51 1998
@@ -370,9 +370,30 @@
 	pte_t * page_table, unsigned long entry, int write_access)
 {
 	unsigned long page;
-	struct page *page_map;
-	
-	page_map = read_swap_cache(entry);
+	int i;
+	struct page *page_map = lookup_swap_cache(entry);
+	unsigned long offset = SWP_OFFSET(entry);
+	struct swap_info_struct *swapdev = SWP_TYPE(entry) + swap_info;
+
+	if (!page_map) {	
+	  page_map = read_swap_cache(entry);
+
+	/*
+	 * Primitive swap readahead code. We simply read the
+	 * next 16 entries in the swap area. The break below
+	 * is needed or else the request queue will explode :)
+	 */
+	  for (i = 1; i++ < 16;) {
+		offset++;
+		if (!swapdev->swap_map[offset] || offset >= swapdev->max
+			|| nr_free_pages - atomic_read(&nr_async_pages) <
+				(freepages.high + freepages.low)/2)
+			break;
+		read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset), 0);
+			break;
+	  }
+	}
 
 	if (pte_val(*page_table) != entry) {
 		if (page_map)
--- ./mm/swap_state.c.orig	Thu Nov 26 11:26:49 1998
+++ ./mm/swap_state.c	Tue Dec  1 07:33:31 1998
@@ -258,7 +258,7 @@
  * incremented.
  */
 
-static struct page * lookup_swap_cache(unsigned long entry)
+struct page * lookup_swap_cache(unsigned long entry)
 {
 	struct page *found;
 	
--- ./include/linux/swap.h.orig	Tue Dec  1 07:29:56 1998
+++ ./include/linux/swap.h	Tue Dec  1 07:31:03 1998
@@ -90,6 +90,7 @@
 extern struct page * read_swap_cache_async(unsigned long, int);
 #define read_swap_cache(entry) read_swap_cache_async(entry, 1);
 extern int FASTCALL(swap_count(unsigned long));
+extern struct page * lookup_swap_cache(unsigned long); 
 /*
  * Make these inline later once they are working properly.
  */

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org


* Re: [PATCH] swapin readahead v3 + kswapd fixes
  1998-12-01  6:55 [PATCH] swapin readahead v3 + kswapd fixes Rik van Riel
@ 1998-12-01  8:15 ` Andrea Arcangeli
  1998-12-01 15:28   ` Rik van Riel
  1998-12-17  1:24 ` Linus Torvalds
  1 sibling, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 1998-12-01  8:15 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linux MM, Linus Torvalds, Linux-Kernel

On Tue, 1 Dec 1998, Rik van Riel wrote:

>--- ./mm/vmscan.c.orig	Thu Nov 26 11:26:50 1998
>+++ ./mm/vmscan.c	Tue Dec  1 07:12:28 1998
>@@ -431,6 +431,8 @@
> 	kmem_cache_reap(gfp_mask);
> 
> 	if (buffer_over_borrow() || pgcache_over_borrow())
>+		state = 0;		

This is _my_ patch, and it should be enough on its own. Did you try it
without the other stuff?

Andrea Arcangeli



* Re: [PATCH] swapin readahead v3 + kswapd fixes
  1998-12-01  8:15 ` Andrea Arcangeli
@ 1998-12-01 15:28   ` Rik van Riel
  0 siblings, 0 replies; 32+ messages in thread
From: Rik van Riel @ 1998-12-01 15:28 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linux MM, Linus Torvalds, Linux-Kernel

On Tue, 1 Dec 1998, Andrea Arcangeli wrote:
> On Tue, 1 Dec 1998, Rik van Riel wrote:
> 
> >--- ./mm/vmscan.c.orig	Thu Nov 26 11:26:50 1998
> >+++ ./mm/vmscan.c	Tue Dec  1 07:12:28 1998
> >@@ -431,6 +431,8 @@
> > 	kmem_cache_reap(gfp_mask);
> > 
> > 	if (buffer_over_borrow() || pgcache_over_borrow())
> >+		state = 0;		
> 
> This is _my_ patch, and it should be enough on its own. Did you try it
> without the other stuff?

Yes, I tried the other stuff, but something broke without
the little piece I added. All the piece I added to vmscan.c
does is make sure that we actually free memory when we
have done some swap_out()s.

Otherwise kswapd won't stop swapping when things 'go well'.

cheers,

Rik -- now completely used to dvorak kbd layout...
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+



* Re: [PATCH] swapin readahead v3 + kswapd fixes
  1998-12-01  6:55 [PATCH] swapin readahead v3 + kswapd fixes Rik van Riel
  1998-12-01  8:15 ` Andrea Arcangeli
@ 1998-12-17  1:24 ` Linus Torvalds
  1998-12-19 17:09   ` New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes) Stephen C. Tweedie
  1 sibling, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 1998-12-17  1:24 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Linux MM



On Tue, 1 Dec 1998, Rik van Riel wrote:
> 
> --- ./mm/vmscan.c.orig	Thu Nov 26 11:26:50 1998
> +++ ./mm/vmscan.c	Tue Dec  1 07:12:28 1998
> @@ -431,6 +431,8 @@
>  	kmem_cache_reap(gfp_mask);
>  
>  	if (buffer_over_borrow() || pgcache_over_borrow())
> +		state = 0;		
> +	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster / 2)
>  		shrink_mmap(i, gfp_mask);
>  
>  	switch (state) {

I really hate the above tests that make no sense at all from a conceptual
view, and are fairly obviously just something to correct for a more basic
problem. 

So I've removed them, and re-written the logic for the "state" in the VM
scanning. I made "state" be private to the invocation, and always start at
zero - and could thus remove it altogether. 

That means that the first thing freeing memory always tries to do is the
shrink_mmap() thing, and thus the problem becomes one of just making sure
that shrink_mmap() doesn't try _too_ aggressively to throw out stuff that
is still needed. So I changed shrink_mmap() a bit too, and simplified it
(so that it looks at, at most, 1/32nd of all memory on the first try,
and if it can't find anything to free there it lets the other memory
de-allocators have a go at it). 

It's a lot simpler, has no arbitrary heuristics like the above two tests,
and worked for me both with a small memory setup and my normal half gig
setup. Would you guys please test and comment? It's in the pre-2.1.132-1
patch. 

		Linus "arbitrary rules are bad rules" Torvalds



* New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-17  1:24 ` Linus Torvalds
@ 1998-12-19 17:09   ` Stephen C. Tweedie
  1998-12-19 18:41     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Stephen C. Tweedie @ 1998-12-19 17:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rik van Riel, Linux MM, Andrea Arcangeli, Alan Cox

[-- Attachment #1: Type: text/plain, Size: 6045 bytes --]

Hi,

On Wed, 16 Dec 1998 17:24:05 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> On Tue, 1 Dec 1998, Rik van Riel wrote:
>> 
>> --- ./mm/vmscan.c.orig	Thu Nov 26 11:26:50 1998
>> +++ ./mm/vmscan.c	Tue Dec  1 07:12:28 1998
>> @@ -431,6 +431,8 @@
>> kmem_cache_reap(gfp_mask);
>> 
>> if (buffer_over_borrow() || pgcache_over_borrow())
>> +		state = 0;		
>> +	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster / 2)
>> shrink_mmap(i, gfp_mask);
>> 
>> switch (state) {

> I really hate the above tests that make no sense at all from a conceptual
> view, and are fairly obviously just something to correct for a more basic
> problem. 

Agreed: I've been saying this for several years now. :)

Linus, I've had a test with your 132-pre2 patch, and the performance is
really disappointing in some important cases.  Particular effects I can
reproduce with it include:

* Extra file IO activity

  Doing a kernel build on a full (lots of applications have been loaded)
  but otherwise idle 64MB machine results in sustained 50 to 200kb/sec
  IO block read rates according to vmstat.  I've never seen this with
  older kernels, and it results in  a drop of about 10% in the cpu
  utilisation sustained over the entire kernel build.  I've had
  independent confirmation of this effect from other people.

* Poor swapout performance

  On my main development box, I've been able to sustain about 3MB/sec to
  swap quite easily when the VM got busy on most recent kernels since
  2.1.130 (including all the late ac* patches with my VM changes in).
  Swap out peaks at a little under 4MB/sec, and I can sustain about
  3MB/sec combined read+write traffic too.  It streams to/from swap very
  well indeed.

  The 132-pre2 peaks at about 800kb/sec to swap, and sustains between
  300 and 400kb/sec. 

* Swap fragmentation

  The reduced swap streaming means that swap does seem to get much more
  fragmented than under, say, ac11.  In particular, this appears to have
  two side effects: it defeats the swap clustered readin code in ac11
  (which I have ported forward to 132-pre2), resulting in much slower
  swapping behaviour if I start up more applications than I have ram for
  and swap between them; and, especially on low memory, the swap
  fragmentation appears to make successive compilation runs in 8MB ever
  slower as bits of my background tasks (httpd, cron) scatter
  themselves over swap.

The problem that we have with the strict state-driven logic in
do_try_to_free_page is that, for prolonged periods, it can bypass the
normal shrink_mmap() loop which we _do_ want to keep active even while
swapping.  However, I think that the 132-pre2 cure is worse than the
disease, because it penalises swap to such an extent that we lose the
substantial performance benefit that comes from being able to stream
both to and from swap rapidly.

The VM in 2.1.131-ac11+ seems to work incredibly well.  On my own 64MB
box it feels as if the memory has doubled.  I've had similar feedback
from other people, including reports of 300% performance improvement
over 2.0 in 4MB memory (!).  Alan reports a huge increase in the uptake
of his ac patches since the new VM stuff went in there.  

I've tried to port the best bits of that VM to 132-pre2, preserving your
do_try_to_free_page state change, but so far I have not been able to find a
combination which gives anywhere near the overall performance of ac11
for all of my test cases (although it works reasonably well on low
memory at first, until we start to fragment swap).

The patch below is the best I have so far against 132-pre2.  You will
find that it has absolutely no references to the borrow percentages, and
although it does honour the buffer/pgcache min percentages, those
default to 1%.

Andrea, I know you've seen odd behaviours since 2.1.131, although I'm
not quite sure exactly which VMs you've been testing on.  The one change
I've found which does have a significant effect on predictability here
is in do_try_to_free_page:

	if (current != kswapd_task)
		if (shrink_mmap(6, gfp_mask))
			return 1;

which means that even if kswapd is busy swapping, we can _still_ bypass
the swap and go straight to the cache shrinking if we need more memory.
The overall effect I observe is that large IO-bound tasks _can_ still
grow the cache, and I don't see any excessive input IO during a kernel
build, but that kswapd itself can still stream efficiently out to swap.

The patch also includes a few extra performance counters in
/proc/swapstats, and adds back the heuristic from a while ago that the
kswap wakeup has a hysteresis behaviour between freepages.high and
freepages.low: kswapd will remain inactive until nr_free_pages falls to
freepages.low, and will then swap until it is brought back up to
freepages.high.  Any failure of shrink_mmap immediately kicks kswapd
into action, though.  To be honest, I haven't been able to measure a
huge difference from this, but it's in my current tree so you are
welcome to it.

Finally, the patch includes the swap and mmap clustered read logic.
That is entirely responsible for my being able to sustain 2MB/sec or
more swapin performance, and disk performance (5MB/sec) when doing an
mmap-based grep.

Tested on 8MB, 64MB and with high filesystem and VM load.  Doing an
anonymous-page stress test (basically a memset on a region 3 times
physical memory) it sustains 1.5MB/sec to swap (and about 150KB/sec from
swap) for a couple of minutes until completion.  Performance sucks
during this, but X is still usable (although switching windows is slow),
"vmstat 1" in an xterm didn't miss a tick, and all the swapped-out
applications swapped back within a couple of seconds after the test was
complete.


Please test and comment.  Note that I'll be mostly offline until the new
year, so don't expect me to test it too much more until then.  However,
this VM is mostly equivalent to the one in ac11, except without the
messy borrow percentage rules and with the extra shrink_mmap for
foreground page stealing in do_try_to_free_page.


--Stephen


[-- Attachment #2: Clustered pagin/balancing patch to 2.1.132-pre2 --]
[-- Type: text/plain, Size: 15802 bytes --]

--- fs/proc/array.c.~1~	Sat Dec 19 00:44:22 1998
+++ fs/proc/array.c	Sat Dec 19 14:40:16 1998
@@ -60,6 +60,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/swap.h>
+#include <linux/swapctl.h>
 #include <linux/io_trace.h>
 #include <linux/slab.h>
 #include <linux/smp.h>
@@ -415,6 +416,28 @@
 		i.freeswap >> 10);
 }
 
+static int get_swapstats(char * buffer)
+{
+	unsigned long *w = swapstats.kswap_wakeups;
+	
+	return sprintf(buffer,
+		       "ProcFreeTry:    %8lu\n"
+		       "ProcFreeSucc:   %8lu\n"
+		       "ProcShrinkTry:  %8lu\n"
+		       "ProcShrinkSucc: %8lu\n"
+		       "KswapFreeTry:   %8lu\n"
+		       "KswapFreeSucc:  %8lu\n"
+		       "KswapWakeups:	%8lu %lu %lu %lu\n",
+		       swapstats.gfp_freepage_attempts,
+		       swapstats.gfp_freepage_successes,
+		       swapstats.gfp_shrink_attempts,
+		       swapstats.gfp_shrink_successes,
+		       swapstats.kswap_freepage_attempts,
+		       swapstats.kswap_freepage_successes,
+		       w[0], w[1], w[2], w[3]
+		       );
+}
+
 static int get_version(char * buffer)
 {
 	extern char *linux_banner;
@@ -1301,6 +1324,9 @@
 		case PROC_MEMINFO:
 			return get_meminfo(page);
 
+		case PROC_SWAPSTATS:
+			return get_swapstats(page);
+
 #ifdef CONFIG_PCI_OLD_PROC
   	        case PROC_PCI:
 			return get_pci_list(page);
@@ -1386,7 +1412,7 @@
 static int process_unauthorized(int type, int pid)
 {
 	struct task_struct *p;
-	uid_t euid;	/* Save the euid keep the lock short */
+	uid_t euid=0;	/* Save the euid keep the lock short */
 		
 	read_lock(&tasklist_lock);
 	
--- fs/proc/root.c.~1~	Sat Dec 19 00:44:22 1998
+++ fs/proc/root.c	Sat Dec 19 13:10:27 1998
@@ -494,6 +494,11 @@
 	S_IFREG | S_IRUGO, 1, 0, 0,
 	0, &proc_array_inode_operations
 };
+static struct proc_dir_entry proc_root_swapstats = {
+	PROC_SWAPSTATS, 9, "swapstats",
+	S_IFREG | S_IRUGO, 1, 0, 0,
+	0, &proc_array_inode_operations
+};
 static struct proc_dir_entry proc_root_kmsg = {
 	PROC_KMSG, 4, "kmsg",
 	S_IFREG | S_IRUSR, 1, 0, 0,
@@ -654,6 +659,7 @@
 	proc_register(&proc_root, &proc_root_loadavg);
 	proc_register(&proc_root, &proc_root_uptime);
 	proc_register(&proc_root, &proc_root_meminfo);
+	proc_register(&proc_root, &proc_root_swapstats);
 	proc_register(&proc_root, &proc_root_kmsg);
 	proc_register(&proc_root, &proc_root_version);
 	proc_register(&proc_root, &proc_root_cpuinfo);
--- include/linux/mm.h.~1~	Fri Nov 27 12:36:29 1998
+++ include/linux/mm.h	Sat Dec 19 15:05:14 1998
@@ -11,6 +11,7 @@
 extern unsigned long max_mapnr;
 extern unsigned long num_physpages;
 extern void * high_memory;
+extern int page_cluster;
 
 #include <asm/page.h>
 #include <asm/atomic.h>
--- include/linux/proc_fs.h.~1~	Sat Dec 19 00:55:10 1998
+++ include/linux/proc_fs.h	Sat Dec 19 15:20:25 1998
@@ -53,7 +53,8 @@
 	PROC_STRAM,
 	PROC_SOUND,
 	PROC_MTRR, /* whether enabled or not */
-	PROC_FS
+	PROC_FS,
+	PROC_SWAPSTATS
 };
 
 enum pid_directory_inos {
--- include/linux/swap.h.~1~	Sat Dec 19 00:42:54 1998
+++ include/linux/swap.h	Sat Dec 19 13:57:47 1998
@@ -61,6 +61,15 @@
 extern unsigned long page_cache_size;
 extern int buffermem;
 
+struct swap_stats 
+{
+	long	proc_freepage_attempts;
+	long	proc_freepage_successes;
+	long	kswap_freepage_attempts;
+	long	kswap_freepage_successes;
+};
+extern struct swap_stats swap_stats;
+
 /* Incomplete types for prototype declarations: */
 struct task_struct;
 struct vm_area_struct;
@@ -69,8 +78,12 @@
 /* linux/ipc/shm.c */
 extern int shm_swap (int, int);
 
+/* linux/mm/swap.c */
+extern void swap_setup (void);
+
 /* linux/mm/vmscan.c */
 extern int try_to_free_pages(unsigned int gfp_mask, int count);
+extern void try_to_shrink_cache(int);
 
 /* linux/mm/page_io.c */
 extern void rw_swap_page(int, unsigned long, char *, int);
@@ -87,6 +100,7 @@
 extern int add_to_swap_cache(struct page *, unsigned long);
 extern int swap_duplicate(unsigned long);
 extern int swap_check_entry(unsigned long);
+struct page * lookup_swap_cache(unsigned long);
 extern struct page * read_swap_cache_async(unsigned long, int);
 #define read_swap_cache(entry) read_swap_cache_async(entry, 1);
 extern int FASTCALL(swap_count(unsigned long));
--- include/linux/swapctl.h~	Sat Dec 19 00:55:55 1998
+++ include/linux/swapctl.h	Sat Dec 19 16:19:20 1998
@@ -22,11 +22,19 @@
 
 typedef struct swapstat_v1
 {
-	unsigned int	wakeups;
-	unsigned int	pages_reclaimed;
-	unsigned int	pages_shm;
-	unsigned int	pages_mmap;
-	unsigned int	pages_swap;
+	unsigned long	wakeups;
+	unsigned long	pages_reclaimed;
+	unsigned long	pages_shm;
+	unsigned long	pages_mmap;
+	unsigned long	pages_swap;
+
+	unsigned long	gfp_freepage_attempts;
+	unsigned long	gfp_freepage_successes;
+	unsigned long	gfp_shrink_attempts;
+	unsigned long	gfp_shrink_successes;
+	unsigned long	kswap_freepage_attempts;
+	unsigned long	kswap_freepage_successes;
+	unsigned long	kswap_wakeups[4];
 } swapstat_v1;
 typedef swapstat_v1 swapstat_t;
 extern swapstat_t swapstats;
--- include/linux/sysctl.h.~1~	Sat Dec 19 00:44:22 1998
+++ include/linux/sysctl.h	Sat Dec 19 00:45:09 1998
@@ -103,7 +103,8 @@
 	VM_BUFFERMEM=6,		/* struct: Set buffer memory thresholds */
 	VM_PAGECACHE=7,		/* struct: Set cache memory thresholds */
 	VM_PAGERDAEMON=8,	/* struct: Control kswapd behaviour */
-	VM_PGT_CACHE=9		/* struct: Set page table cache parameters */
+	VM_PGT_CACHE=9,		/* struct: Set page table cache parameters */
+	VM_PAGE_CLUSTER=10	/* int: set number of pages to swap together */
 };
 
 
--- kernel/sysctl.c.~1~	Fri Nov 27 12:36:42 1998
+++ kernel/sysctl.c	Sat Dec 19 00:45:09 1998
@@ -216,6 +216,8 @@
 	 &pager_daemon, sizeof(pager_daemon_t), 0644, NULL, &proc_dointvec},
 	{VM_PGT_CACHE, "pagetable_cache", 
 	 &pgt_cache_water, 2*sizeof(int), 0600, NULL, &proc_dointvec},
+	{VM_PAGE_CLUSTER, "page-cluster", 
+	 &page_cluster, sizeof(int), 0600, NULL, &proc_dointvec},
 	{0}
 };
 
--- mm/filemap.c.~1~	Sat Dec 19 00:43:23 1998
+++ mm/filemap.c	Sat Dec 19 13:37:37 1998
@@ -200,7 +200,11 @@
 	struct page * page;
 	int count;
 
+#if 0
 	count = (limit<<1) >> (priority);
+#else
+	count = (limit<<2) >> (priority);
+#endif
 
 	page = mem_map + clock;
 	do {
@@ -212,13 +216,26 @@
 		
 		if (shrink_one_page(page, gfp_mask))
 			return 1;
+		/* 
+		 * If the page we looked at was recyclable but we didn't
+		 * reclaim it (presumably due to PG_referenced), don't
+		 * count it as scanned.  This way, the more referenced
+		 * page cache pages we encounter, the more rapidly we
+		 * will age them. 
+		 */
+
+#if 1
+		if (atomic_read(&page->count) != 1 ||
+		    (!page->inode && !page->buffers))
+#endif
+			count--;
 		page++;
 		clock++;
 		if (clock >= max_mapnr) {
 			clock = 0;
 			page = mem_map;
 		}
-	} while (--count >= 0);
+	} while (count >= 0);
 	return 0;
 }
 
@@ -962,7 +979,7 @@
 	struct file * file = area->vm_file;
 	struct dentry * dentry = file->f_dentry;
 	struct inode * inode = dentry->d_inode;
-	unsigned long offset;
+	unsigned long offset, reada, i;
 	struct page * page, **hash;
 	unsigned long old_page, new_page;
 
@@ -1023,7 +1040,19 @@
 	return new_page;
 
 no_cached_page:
-	new_page = __get_free_page(GFP_USER);
+	/*
+	 * Try to read in an entire cluster at once.
+	 */
+	reada   = offset;
+	reada >>= PAGE_SHIFT;
+	reada   = (reada / page_cluster) * page_cluster;
+	reada <<= PAGE_SHIFT;
+
+	for (i=0; i<page_cluster; i++, reada += PAGE_SIZE)
+		new_page = try_to_read_ahead(file, reada, new_page);
+
+	if (!new_page)
+		new_page = __get_free_page(GFP_USER);
 	if (!new_page)
 		goto no_page;
 
@@ -1047,11 +1076,6 @@
 	if (inode->i_op->readpage(file, page) != 0)
 		goto failure;
 
-	/*
-	 * Do a very limited read-ahead if appropriate
-	 */
-	if (PageLocked(page))
-		new_page = try_to_read_ahead(file, offset + PAGE_SIZE, 0);
 	goto found_page;
 
 page_locked_wait:
@@ -1625,7 +1649,7 @@
 	if (!page) {
 		if (!new)
 			goto out;
-		page_cache = get_free_page(GFP_KERNEL);
+		page_cache = get_free_page(GFP_USER);
 		if (!page_cache)
 			goto out;
 		page = mem_map + MAP_NR(page_cache);
--- mm/page_alloc.c.~1~	Fri Nov 27 12:36:42 1998
+++ mm/page_alloc.c	Sat Dec 19 15:14:23 1998
@@ -241,7 +241,17 @@
 			goto nopage;
 		}
 
-		if (freepages.min > nr_free_pages) {
+		/* Try this if you want, but it seems to result in too
+		 * much IO activity during builds, and does not
+		 * substantially reduce the number of times we invoke
+		 * kswapd.  --sct */
+#if 0
+		if (nr_free_pages < freepages.high &&
+		    !(gfp_mask & (__GFP_MED | __GFP_HIGH)))
+			try_to_shrink_cache(gfp_mask);
+#endif
+						
+		if (nr_free_pages < freepages.min) {
 			int freed;
 			freed = try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX);
 			/*
@@ -359,6 +369,37 @@
 	return start_mem;
 }
 
+/* 
+ * Primitive swap readahead code. We simply read an aligned block of
+ * (page_cluster) entries in the swap area. This method is chosen
+ * because it doesn't cost us any seek time.  We also make sure to queue
+ * the 'original' request together with the readahead ones...  
+ */
+void swapin_readahead(unsigned long entry) {
+        int i;
+        struct page *new_page;
+	unsigned long offset = SWP_OFFSET(entry);
+	struct swap_info_struct *swapdev = SWP_TYPE(entry) + swap_info;
+	
+	offset = (offset/page_cluster) * page_cluster;
+	
+	for (i = 0; i < page_cluster; i++) {
+	      if (offset >= swapdev->max
+		              || nr_free_pages - atomic_read(&nr_async_pages) <
+			      (freepages.high + freepages.low)/2)
+		      return;
+	      if (!swapdev->swap_map[offset] ||
+		  swapdev->swap_map[offset] == SWAP_MAP_BAD ||
+		  test_bit(offset, swapdev->swap_lockmap))
+		      continue;
+	      new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), offset), 0);
+	      if (new_page != NULL)
+                      __free_page(new_page);
+	      offset++;
+	}
+	return;
+}
+
 /*
  * The tests may look silly, but it essentially makes sure that
  * no other process did a swap-in on us just as we were waiting.
@@ -370,10 +411,12 @@
 	pte_t * page_table, unsigned long entry, int write_access)
 {
 	unsigned long page;
-	struct page *page_map;
-	
-	page_map = read_swap_cache(entry);
+	struct page *page_map = lookup_swap_cache(entry);
 
+	if (!page_map) {
+                swapin_readahead(entry);
+		page_map = read_swap_cache(entry);
+	}
 	if (pte_val(*page_table) != entry) {
 		if (page_map)
 			free_page_and_swap_cache(page_address(page_map));
--- mm/page_io.c.~1~	Fri Nov 27 12:36:42 1998
+++ mm/page_io.c	Sat Dec 19 00:45:09 1998
@@ -60,7 +60,7 @@
 	}
 
 	/* Don't allow too many pending pages in flight.. */
-	if (atomic_read(&nr_async_pages) > SWAP_CLUSTER_MAX)
+	if (atomic_read(&nr_async_pages) > pager_daemon.swap_cluster)
 		wait = 1;
 
 	p = &swap_info[type];
--- mm/swap.c.~1~	Sat Dec 19 00:42:55 1998
+++ mm/swap.c	Sat Dec 19 12:49:51 1998
@@ -39,6 +39,9 @@
 	144	/* freepages.high */
 };
 
+/* How many pages do we try to swap or page in/out together? */
+int page_cluster = 16; /* Default value modified in swap_setup() */
+
 /* We track the number of pages currently being asynchronously swapped
    out, so that we don't try to swap TOO many pages out at once */
 atomic_t nr_async_pages = ATOMIC_INIT(0);
@@ -61,13 +64,13 @@
 swapstat_t swapstats = {0};
 
 buffer_mem_t buffer_mem = {
-	5,	/* minimum percent buffer */
+	1,	/* minimum percent buffer */
 	10,	/* borrow percent buffer */
 	60	/* maximum percent buffer */
 };
 
 buffer_mem_t page_cache = {
-	5,	/* minimum percent page cache */
+	1,	/* minimum percent page cache */
 	15,	/* borrow percent page cache */
 	75	/* maximum */
 };
@@ -77,3 +80,19 @@
 	SWAP_CLUSTER_MAX,	/* minimum number of tries */
 	SWAP_CLUSTER_MAX,	/* do swap I/O in clusters of this size */
 };
+
+
+/*
+ * Perform any setup for the swap system
+ */
+
+void __init swap_setup(void)
+{
+	/* Use a smaller cluster for memory <16MB or <32MB */
+	if (num_physpages < ((16 * 1024 * 1024) >> PAGE_SHIFT))
+		page_cluster = 4;
+	else if (num_physpages < ((32 * 1024 * 1024) >> PAGE_SHIFT))
+		page_cluster = 8;
+	else
+		page_cluster = 16;
+}
--- mm/swap_state.c.~1~	Fri Nov 27 12:36:42 1998
+++ mm/swap_state.c	Sat Dec 19 13:35:07 1998
@@ -258,7 +258,7 @@
  * incremented.
  */
 
-static struct page * lookup_swap_cache(unsigned long entry)
+struct page * lookup_swap_cache(unsigned long entry)
 {
 	struct page *found;
 	
@@ -305,7 +305,7 @@
 	if (found_page)
 		goto out;
 
-	new_page_addr = __get_free_page(GFP_KERNEL);
+	new_page_addr = __get_free_page(GFP_USER);
 	if (!new_page_addr)
 		goto out;	/* Out of memory */
 	new_page = mem_map + MAP_NR(new_page_addr);
--- mm/vmscan.c.~1~	Sat Dec 19 00:43:24 1998
+++ mm/vmscan.c	Sat Dec 19 14:58:49 1998
@@ -25,6 +25,11 @@
  */
 static struct task_struct * kswapd_task = NULL;
 
+/*
+ * Flag to start low-priorty background kswapping
+ */
+static int kswap_default_wakeup;
+
 static void init_swap_timer(void);
 
 /*
@@ -424,21 +429,36 @@
  */
 static int do_try_to_free_page(int gfp_mask)
 {
+	static int state = 0;
 	int i=6;
 
 	/* Always trim SLAB caches when memory gets low. */
 	kmem_cache_reap(gfp_mask);
-
-	do {
-		if (shrink_mmap(i, gfp_mask))
-			return 1;
-		if (shm_swap(i, gfp_mask))
-			return 1;
-		if (swap_out(i, gfp_mask))
+	
+	if (current != kswapd_task)
+		if (shrink_mmap(6, gfp_mask))
 			return 1;
-		shrink_dcache_memory(i, gfp_mask);
+
+	switch (state) {
+		do {
+		case 0:
+			if (shrink_mmap(i, gfp_mask))
+				return 1;
+			state = 1;
+		case 1:
+			if (shm_swap(i, gfp_mask))
+				return 1;
+			state = 2;
+		case 2:
+			if (swap_out(i, gfp_mask))
+				return 1;
+			state = 3;
+		case 3:
+			shrink_dcache_memory(i, gfp_mask);
+			state = 0;
 		i--;
-	} while (i >= 0);
+		} while (i >= 0);
+	}
 	return 0;
 }
 
@@ -453,6 +473,8 @@
        int i;
        char *revision="$Revision: 1.5 $", *s, *e;
 
+       swap_setup();
+       
        if ((s = strchr(revision, ':')) &&
            (e = strchr(s, '$')))
                s++, i = e - s;
@@ -514,9 +536,11 @@
 		/* max one hundreth of a second */
 		end_time = jiffies + (HZ-1)/100;
 		do {
+			swapstats.kswap_freepage_attempts++;
 			if (!do_try_to_free_page(0))
 				break;
-			if (nr_free_pages > freepages.high + SWAP_CLUSTER_MAX)
+			swapstats.kswap_freepage_successes++;
+			if (nr_free_pages > freepages.high + pager_daemon.swap_cluster)
 				break;
 		} while (time_before_eq(jiffies,end_time));
 	}
@@ -544,9 +568,11 @@
 	if (!(current->flags & PF_MEMALLOC)) {
 		current->flags |= PF_MEMALLOC;
 		do {
+			swapstats.gfp_freepage_attempts++;
 			retval = do_try_to_free_page(gfp_mask);
 			if (!retval)
 				break;
+			swapstats.gfp_freepage_successes++;
 			count--;
 		} while (count > 0);
 		current->flags &= ~PF_MEMALLOC;
@@ -556,6 +582,24 @@
 }
 
 /*
+ * Try to shrink the page cache slightly, on low-priority memory
+ * allocation.  If this fails, it's a hint that maybe kswapd might want
+ * to start doing something useful.
+ */
+void try_to_shrink_cache(int gfp_mask)
+{
+	int i;
+	for (i = 0; i < 16; i++) {
+		swapstats.gfp_shrink_attempts++;
+		if (shrink_mmap(6, gfp_mask))
+			swapstats.gfp_shrink_successes++;
+		else
+			kswap_default_wakeup = 1;
+	}
+}
+
+
+/*
  * Wake up kswapd according to the priority
  *	0 - no wakeup
  *	1 - wake up as a low-priority process
@@ -598,15 +642,22 @@
 		 * that we'd better give kswapd a realtime
 		 * priority.
 		 */
+
 		want_wakeup = 0;
 		pages = nr_free_pages;
 		if (pages < freepages.high)
-			want_wakeup = 1;
-		if (pages < freepages.low)
+			want_wakeup = kswap_default_wakeup;
+		if (pages < freepages.low) {
 			want_wakeup = 2;
+			kswap_default_wakeup = 1;
+		}
 		if (pages < freepages.min)
 			want_wakeup = 3;
-	
+
+		/* If you increase the maximum want_wakeup, expand the
+                   swapstats.kswap_wakeups[] table in swapctl.h */
+		swapstats.kswap_wakeups[want_wakeup]++;
+
 		kswapd_wakeup(p,want_wakeup);
 	}
 


* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-19 17:09   ` New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes) Stephen C. Tweedie
@ 1998-12-19 18:41     ` Linus Torvalds
  1998-12-19 19:41     ` Linus Torvalds
  1998-12-21  9:53     ` Andrea Arcangeli
  2 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 1998-12-19 18:41 UTC (permalink / raw)
  To: Stephen C. Tweedie; +Cc: Rik van Riel, Linux MM, Andrea Arcangeli, Alan Cox



On Sat, 19 Dec 1998, Stephen C. Tweedie wrote:
> 
> Linus, I've had a test with your 132-pre2 patch, and the performance is
> really disappointing in some important cases.  Particular effects I can
> reproduce with it include:

Try a really trivial change, which is to start the page-out priority at 8
instead of 6 (maybe 7 is the right balance, but let's try 8 first). 

The problem, I suspect, is simply that the new code obviously _always_
calls "shrink_mmap()", and with a priority of 6 shrink_mmap() is a bit too
good at throwing out page cache pages etc, so we tend to wait a bit too
long before we actually start to swap out user pages instead. 

Previously we didn't have that problem, because once we got past
shrink_mmap() due to any problem whatsoever, we didn't tend to
re-enter it very easily. Obviously sometimes it was _too_ hard to re-enter
it, which is why we had all the ugly hacks to magically sometimes force
our state to shrink_mmap(). 

> The problem that we have with the strict state-driven logic in
> do_try_to_free_page is that, for prolonged periods, it can bypass the
> normal shrink_mmap() loop which we _do_ want to keep active even while
> swapping.  However, I think that the 132-pre2 cure is worse than the
> disease, because it penalises swap to such an extent that we lose the
> substantial performance benefit that comes from being able to stream
> both to and from swap rapidly.

Right. It's essentially not likely enough to start swapping. 

> The patch below is the best I have so far against 132-pre2.  You will
> find that it has absolutely no references to the borrow percentages, and
> although it does honour the buffer/pgcache min percentages, those
> default to 1%.

Can you try the even simpler patch of just changing

	int i=6;

to

	int i=8;

in do_try_to_free_page()? I suspect that's actually enough.

Basically, let's think about the problem analytically before we add any
"magic rules". That's what I tried to do with the pre-2 patch, and
basically the pre-2 patch has a _very_ simple lay-out:

   Always start with "shrink_mmap()", because that's the "simple" case,
   and gets rid of excessive page caches etc. HOWEVER, make
   "shrink_mmap()" initially timid enough, that if it doesn't find a nice
   page quickly, we then try to really swap things out.

Basically, there are no magic rules, no made-up "in this case we do that" 
setup. The only issue is one of "how timid are we initially" to get a good
balance. 

With a value of 6, it means that we try to see if we can find a page we
can easily throw out in the first 1/32nd of the memory we test. That
sounds fairly timid, but it really isn't all that timid at all: if we have
even just a third of all pages being buffer cache pages, it's actually
fairly likely that we'd throw out that instead of trying to page anything
out. 

An initial "timidity" value of 8 means that we'd throw out a page from the
page map only if we find it really easily (ie we only look at 1/128th of
our memory). That may be too timid (and maybe 7 is right), but basically I
think this approach should work reasonably well for a wide range of memory
sizes. And I _really_ really want to try something without any silly magic
rules first. 
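The 1/32 and 1/128 figures above follow if the scan budget halves for each
unit of initial priority, i.e. roughly pages >> (priority - 1).  As a
sketch (the exact shift in the real shrink_mmap() code may differ):

```c
/* Sketch of the "timidity" knob: how much memory shrink_mmap()
 * examines before the caller falls through to real swap-out.
 * The shift is chosen so that priority 6 scans 1/32 of memory and
 * priority 8 scans 1/128, matching the figures discussed above;
 * the actual kernel code may use a slightly different shift. */
unsigned long pages_to_scan(unsigned long total_pages, int priority)
{
        return total_pages >> (priority - 1);
}
```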

In short, first prove to me somehow that the rule _has_ to be there. 
Either by some argument that makes it obvious, or by showing that the
above simple change really doesn't work. 

		Linus

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org


* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-19 17:09   ` New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes) Stephen C. Tweedie
  1998-12-19 18:41     ` Linus Torvalds
@ 1998-12-19 19:41     ` Linus Torvalds
  1998-12-19 22:01       ` Stephen C. Tweedie
  1998-12-21  9:53     ` Andrea Arcangeli
  2 siblings, 1 reply; 32+ messages in thread
From: Linus Torvalds @ 1998-12-19 19:41 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Rik van Riel, Linux MM, Andrea Arcangeli, Alan Cox



Btw, Stephen, there's another approach that might actually be the best one,
and that also makes a ton of sense to me. I'm not married to any specific
approach, but basically what I want is that whatever we do it should be
sensible, in a way that we can say "this is the basic approach", and then
when you read the code you see that yes, that's what it does. In other
words, something "pretty". 

If you're testing different approaches, how about this one (_reasoning_
first, not just some magic heuristic): 

 - kswapd and normal processes are decidedly different animals - that's
   fairly obvious. A normal process wants low latency in order to go on
   with what it's doing, while kswapd is meant to be this background
   daemon to make sure we can get memory with low latency.

 - as a result, it doesn't necessarily make sense to have the same
   "do_try_to_free_page()" for them both. For example, for a normal
   process, it makes sense to do a shrink_mmap() more aggressively to just
   try to get rid of some page without actually having to do any IO. In
   contrast, kswapd quite naturally wants to be more aggressive about
   paging things out so that when a regular process does need memory, it
   will get it easily without having to wait for it.

So with the above premise of _not_ trying to make one function work for
both cases, how about:

 - regular processes use something that looks very much like the
   "do_try_to_free_page()" in pre-2. No state crap, and it uses
   shrink_mmap() first (and then it can be reasonably aggressive, so
   forget about increasing "i" to make it timid)

 - kswapd uses something totally different, which essentially looks more
   like the previous loop that used a state to "stay" in a good mode for a
   while. We want kswapd to "stay" in the swap-out mode in order to get
   nice contiguous bursty page-outs that we can do efficiently. 
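A minimal sketch of that two-path split (all names hypothetical; this shows
the shape of the design, not actual kernel code):

```c
enum reclaim_source { RECLAIMED_CACHE, RECLAIMED_SWAP };

/* Foreground path: latency matters, so prefer the cheap, IO-free
 * route of dropping an easy page-cache page, and swap only as a
 * last resort. */
enum reclaim_source user_reclaim(int easy_cache_pages)
{
        if (easy_cache_pages > 0)
                return RECLAIMED_CACHE;
        return RECLAIMED_SWAP;
}

/* kswapd path: throughput matters, so once woken it stays in
 * swap-out mode until free memory is back above target (or it has
 * emitted a full cluster), giving the disk one contiguous burst of
 * page-outs rather than one page per wakeup. */
int kswapd_reclaim(int free_pages, int target, int cluster)
{
        int paged_out = 0;

        while (free_pages + paged_out < target && paged_out < cluster)
                paged_out++;    /* stand-in for one swap_out() call */
        return paged_out;
}
```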

Does the above make sense to you? It would quite naturally explain your
"magic heuristic" in your previous patch with "current != kswapd", but
would be more explainable and cleaner - be quite up front about the fact
that kswapd tries to generate nice page-out patterns, while normal
processes (when they have to call try_to_free_page() at all, which is
hopefully not too often) just want to get memory quickly. 

		Linus


* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-19 19:41     ` Linus Torvalds
@ 1998-12-19 22:01       ` Stephen C. Tweedie
  1998-12-20  3:05         ` Linus Torvalds
  1998-12-20 14:18         ` Linus Torvalds
  0 siblings, 2 replies; 32+ messages in thread
From: Stephen C. Tweedie @ 1998-12-19 22:01 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Rik van Riel, Linux MM, Andrea Arcangeli,
	Alan Cox

Hi Linus,

On Sat, 19 Dec 1998 11:41:56 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> If you're testing different approaches, how about this one (_reasoning_
> first, not just some magic heuristic): 

>  - kswapd and normal processes are decidedly different animals

>  - as a result, it doesn't necessarily make sense to have the same
>    "do_try_to_free_page()" for them both. 

Absolutely.  There's nothing "magic" in the patch I just sent you,
except for one special condition in do_try_to_free_page:

	if (current != kswapd_task)
		if (shrink_mmap(6, gfp_mask))
			return 1;

The entire point is that we really want swapping out to disk to be a
background task, but we still want memory to continue to be made
available to otherwise-stalled foreground processes.  

I have already been experimenting with this: having a fairly traditional
but shrink_mmap-biased do_try_to_free_page for foreground tasks and a
dedicated swap loop within kswapd, for example.  I simply failed to get
any such scheme to perform as well as the patch I sent you.

>    For example, for a normal process, it makes sense to do a
>    shrink_mmap() more aggressively to just try to get rid of some page
>    without actually having to do any IO. In contrast, kswapd quite
>    naturally wants to be more aggressive about paging things out so
>    that when a regular process does need memory, it will get it easily
>    without having to wait for it.

That is precisely the compromise I reached in the patch I sent you,
courtesy of the test above.  In fact, you'll see in that patch that
get_free_pages() also has a section

		/* Try this if you want, but it seems to result in too
		 * much IO activity during builds, and does not
		 * substantially reduce the number of times we invoke
		 * kswapd.  --sct */
#if 0
		if (nr_free_pages < freepages.high &&
		    !(gfp_mask & (__GFP_MED | __GFP_HIGH)))
			try_to_shrink_cache(gfp_mask);
#endif

In other words, I have _already_ tried this and it didn't work, for one
of the same reasons your own 132-pre2 didn't work: that it could far too
easily trap itself in a loop in which a kernel build was regularly doing
100 or 200 read IOs per second, indicating too small a cache.  This is a
regression over the mechanism we already had.

> Does the above make sense to you?  It would quite naturally explain
> your "magic heuristic" in your previous patch

Yes.  Linus, I actually tried a _huge_ number of such schemes back when
I was doing the original 1.2.13 kswap stuff.  I think I had about 24
separately tagged sets of heuristics in CVS at one point.  From what I
learnt then, and from what I've experienced recently when looking more
closely at new stuff like the effects on performance of Rik's swap
clustering ideas, I really do think that we need to find a balance which
allows us to expand a cache somewhat under heavy IO load, to shrink the
cache aggressively when under other loads, and which still allows us to
stream to swap very rapidly.  

The "magic heuristic" you talk about was a quite deliberate result of
the same line of thought you are taking now.  The problem is, I'm going
to be mostly offline now until the New Year, so right now I do not have
time to sit down and invent a completely new mechanism here: I want to
make sure that we have something which is sufficiently recognisable as
our tried and tested VM to have some confidence we are not introducing
new pathological behaviours before 2.2, while still removing the black
magic and using reasoned algorithms.  The patch I sent you is the result
of that.

Ultimately, I don't think we just want a separate set of routines for
background and foreground swapping: I think we really need more than
that.  We need the background swap task to be separate from the
background cache cleaner: swapping can potentially stall for a number of
reasons (especially when swapping to a file), but for the sake of IRQ
memory demand we still want to have some page stealing capacity when the
swapout-task blocks.  One of the old heuristics I have got coded
somewhere in fact does this, separating out the kswapd thread --- which
is now fully asynchronous --- from a kswapio thread which does the
actual IO.  Just in this past week I've been trying a scheme in which
kswapd only calls swap_out() if it knows that there are few enough async
pages in progress that the IO won't block.

The trouble is, every such scheme I have tried has some disadvantage
compared with ac11 or with the last patch I sent you.  In particular,
they tend to stream swap poorly, and often have very unbalanced or
unstable cache sizes.

I _could_ in theory tune one of these other mechanisms up until it
matched what we have today, but I've spent quite a bit of time over
the past week or two getting various problems out of the existing VM
now and benchmarking it under a variety of load conditions.
Unfortunately, that will have to wait until the new year if you want
it done.  If somebody else wants to, it might be useful to try putting
together your changes here with the clustered pagein stuff to see what
sort of performance results.

Anyway, so far I've done a very brief couple of tests with the s/6/8/
change you suggested.  It certainly seems to work about as well as many
of the previous VMs for the 64MB test case, although a bit slower in
8MB; however it appears to shrink the cache rather aggressively and I
noticed some rather odd amounts of IO during a kernel build.  Too
early to make a definitive judgement, but it is certainly _much_
better than the "i=6" version.


--Stephen

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-19 22:01       ` Stephen C. Tweedie
@ 1998-12-20  3:05         ` Linus Torvalds
  1998-12-20 14:18         ` Linus Torvalds
  1 sibling, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 1998-12-20  3:05 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Rik van Riel, Linux MM, Andrea Arcangeli, Alan Cox



On Sat, 19 Dec 1998, Stephen C. Tweedie wrote:
> 
> That is precisely the compromise I reached in the patch I sent you,
> courtesy of the test above. 

The problem I have with your version is that it's not at all obvious. It's
just another "magic test" rather than being clearly split out. We've had
too many of those already, and we've had too many people just adding more
and more magic tests on top of the old ones. 

I want a _design_, not just something that happens to work. See my point?

		Linus


* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-19 22:01       ` Stephen C. Tweedie
  1998-12-20  3:05         ` Linus Torvalds
@ 1998-12-20 14:18         ` Linus Torvalds
  1998-12-21 13:03           ` Andrea Arcangeli
  1998-12-21 13:39           ` Stephen C. Tweedie
  1 sibling, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 1998-12-20 14:18 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Rik van Riel, Linux MM, Andrea Arcangeli, Alan Cox


There's a new pre-patch on ftp.kernel.org.

This has Stephen's page-in read-ahead code, and I clearly separated the
cases where kswapd tries to throw something out vs a normal user - I
suspect Stephen can agree with the new setup. 

I expect that it needs to be tested in different configurations to find
the optimal values for various tunables, but hopefully this is it when it
comes to basic code.

It also has everything Alan has sent me so far integrated, along with
various other peoples patches. 

		Linus


* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-19 17:09   ` New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes) Stephen C. Tweedie
  1998-12-19 18:41     ` Linus Torvalds
  1998-12-19 19:41     ` Linus Torvalds
@ 1998-12-21  9:53     ` Andrea Arcangeli
  1998-12-21 16:37       ` Stephen C. Tweedie
  2 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 1998-12-21  9:53 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Linus Torvalds, Rik van Riel, Linux MM, Alan Cox

On Sat, 19 Dec 1998, Stephen C. Tweedie wrote:

>I've tried to port the best bits of that VM to 132-pre2, preserving your
>do_try_to_free_page state change, but so far I have not been able find a
>combination which gives anywhere near the overall performance of ac11
>for all of my test cases (although it works reasonably well on low
			   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>memory at first, until we start to fragment swap).

The good point of 132-pre2 is that you'll never see a thread on
linux-kernel saying "132-pre2 VM performance is jerky".  It may not be
the best, but it will surely work well for everybody out there, on
every kind of hardware.  The 132-pre2 policy is "if you need great
performance, buy more memory": swap will work fine, but it's not the
default action.  I agree to help improve it, though.

>The patch below is the best I have so far against 132-pre2.  You will
>find that it has absolutely no references to the borrow percentages, and
>although it does honour the buffer/pgcache min percentages, those
>default to 1%.

I also agree to drop every borrow/max check in the kernel, since we
don't want a limit on cache/buffer use while there is still free
memory.  If a special piece of software needs a lot of memory at once,
it can grab it slowly and then mlock it, I think.

Index: linux/fs/buffer.c
diff -u linux/fs/buffer.c:1.1.1.1 linux/fs/buffer.c:1.1.1.1.2.1
--- linux/fs/buffer.c:1.1.1.1	Fri Nov 20 00:01:06 1998
+++ linux/fs/buffer.c	Thu Dec 17 22:35:20 1998
@@ -725,8 +725,7 @@
 	/* We are going to try to locate this much memory. */
 	needed = bdf_prm.b_un.nrefill * size;  
 
-	while ((nr_free_pages > freepages.min*2) &&
-	        !buffer_over_max() &&
+	while (free_memory_available() == 2 &&
 		grow_buffers(GFP_BUFFER, size)) {
 		obtained += PAGE_SIZE;
 		if (obtained >= needed)


Alternatively we could set the default of max to 90% or something
similar... it would probably be more tunable, but I prefer the fully
autotuning approach...

Andrea Arcangeli


* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-20 14:18         ` Linus Torvalds
@ 1998-12-21 13:03           ` Andrea Arcangeli
  1998-12-21 13:39           ` Stephen C. Tweedie
  1 sibling, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 1998-12-21 13:03 UTC (permalink / raw
  To: Linus Torvalds; +Cc: Stephen C. Tweedie, Rik van Riel, Linux MM, Alan Cox

On Sun, 20 Dec 1998, Linus Torvalds wrote:

>I expect that it needs to be tested in different configurations to find
>the optimal values for various tunables, but hopefully this is it when it
>comes to basic code.

I've made some changes to your code.  The most experimental (easily
removable from the patch) is moving the check_pgt_cache() call to the
top of the try_to/kswapd engines.  This way we'll trim the page table
cache only when we are low on memory; there's no reason to reclaim
memory while other memory is still unused, I think.

The patch also adds a bit of statistics to the swap cache find case. 

The real difference is changing the priority in try_to_free_pages() to
4, since this way processes are more likely to be able to continue
their work without sleeping for a long time waiting for SYNC I/O
completion.

I also fall back to shrink_mmap() if the async IO request queue is saturated.

Seems to work fine here...
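That fallback can be read as a simple gate (a hypothetical simplification;
the function and constant names here are illustrative, not the patch's
code):

```c
/* Hypothetical simplification of the fallback: if too many async
 * page-outs are already in flight, don't queue more swap IO; steal
 * clean cache pages via shrink_mmap() instead. */
enum reclaim_action { DO_SWAP_OUT, DO_SHRINK_MMAP };

enum reclaim_action choose_reclaim(int nr_async_pages, int swap_cluster)
{
        if (nr_async_pages > swap_cluster / 2)
                return DO_SHRINK_MMAP;  /* async IO queue saturated */
        return DO_SWAP_OUT;
}
```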

Index: linux/mm/memory.c
diff -u linux/mm/memory.c:1.1.1.2 linux/mm/memory.c:1.1.1.1.2.6
--- linux/mm/memory.c:1.1.1.2	Fri Nov 27 11:19:10 1998
+++ linux/mm/memory.c	Sun Dec 20 23:42:16 1998
@@ -136,8 +136,10 @@
 	for (i = 0 ; i < USER_PTRS_PER_PGD ; i++)
 		free_one_pgd(page_dir + i);
 
+#if 0 /* let kswapd to do this */
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
+#endif
 	return;
 
 out_bad:
@@ -165,8 +167,10 @@
 		free_one_pgd(page_dir + i);
 	pgd_free(page_dir);
 
+#if 0 /* let kswapd to do this */
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
+#endif
 out:
 	return;
 
Index: linux/mm/swap_state.c
diff -u linux/mm/swap_state.c:1.1.1.3 linux/mm/swap_state.c:1.1.1.1.2.7
--- linux/mm/swap_state.c:1.1.1.3	Sun Dec 20 16:31:12 1998
+++ linux/mm/swap_state.c	Sun Dec 20 16:51:32 1998
@@ -261,6 +261,9 @@
 struct page * lookup_swap_cache(unsigned long entry)
 {
 	struct page *found;
+#ifdef	SWAP_CACHE_INFO
+	swap_cache_find_total++;
+#endif
 	
 	while (1) {
 		found = find_page(&swapper_inode, entry);
@@ -268,8 +271,12 @@
 			return 0;
 		if (found->inode != &swapper_inode || !PageSwapCache(found))
 			goto out_bad;
-		if (!PageLocked(found))
+		if (!PageLocked(found)) {
+#ifdef	SWAP_CACHE_INFO
+			swap_cache_find_success++;
+#endif
 			return found;
+		}
 		__free_page(found);
 		__wait_on_page(found);
 	}
Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.5 linux/mm/vmscan.c:1.1.1.1.2.33
--- linux/mm/vmscan.c:1.1.1.5	Sun Dec 20 16:31:12 1998
+++ linux/mm/vmscan.c	Mon Dec 21 10:25:22 1998
@@ -447,11 +447,12 @@
 
 	/* Always trim SLAB caches when memory gets low. */
 	kmem_cache_reap(0);
+	check_pgt_cache();
 
 	/* max one hundreth of a second */
 	end_time = jiffies + (HZ-1)/100;
 	do {
-		int priority = 7;
+		int priority = 6;
 		int count = pager_daemon.swap_cluster;
 
 		switch (kswapd_state) {
@@ -476,6 +477,12 @@
 	return kswapd_state;
 }
 
+static inline void enable_swap_tick(void)
+{
+	timer_table[SWAP_TIMER].expires = jiffies;
+	timer_active |= 1<<SWAP_TIMER;
+}
+
 /*
  * The background pageout daemon.
  * Started as a kernel thread from the init process.
@@ -523,6 +530,7 @@
 		current->state = TASK_INTERRUPTIBLE;
 		flush_signals(current);
 		run_task_queue(&tq_disk);
+		enable_swap_tick();
 		schedule();
 		swapstats.wakeups++;
 		state = kswapd_free_pages(state);
@@ -553,6 +561,7 @@
 
 	lock_kernel();
 
+	check_pgt_cache();
 	/* Always trim SLAB caches when memory gets low. */
 	kmem_cache_reap(gfp_mask);
 
@@ -562,7 +571,7 @@
 
 		current->flags |= PF_MEMALLOC;
 	
-		priority = 8;
+		priority = 4;
 		do {
 			free_memory(shrink_mmap(priority, gfp_mask));
 			free_memory(shm_swap(priority, gfp_mask));
@@ -593,7 +602,8 @@
 	if (priority) {
 		p->counter = p->priority << priority;
 		wake_up_process(p);
-	}
+	} else
+		enable_swap_tick();
 }
 
 /* 
@@ -631,9 +641,8 @@
 			want_wakeup = 3;
 	
 		kswapd_wakeup(p,want_wakeup);
-	}
-
-	timer_active |= (1<<SWAP_TIMER);
+	} else
+		enable_swap_tick();
 }
 
 /* 
Index: linux/arch/i386/kernel/process.c
diff -u linux/arch/i386/kernel/process.c:1.1.1.4 linux/arch/i386/kernel/process.c:1.1.1.1.2.32
--- linux/arch/i386/kernel/process.c:1.1.1.4	Thu Dec 17 16:33:27 1998
+++ linux/arch/i386/kernel/process.c	Mon Dec 21 10:35:52 1998
@@ -73,11 +73,11 @@
 
 #ifndef __SMP__
 
+#ifdef CONFIG_APM
 static void hard_idle(void)
 {
 	while (!current->need_resched) {
 		if (boot_cpu_data.hlt_works_ok && !hlt_counter) {
-#ifdef CONFIG_APM
 				/* If the APM BIOS is not enabled, or there
 				 is an error calling the idle routine, we
 				 should hlt if possible.  We need to check
@@ -87,44 +87,50 @@
 			if (!apm_do_idle() && !current->need_resched)
 				__asm__("hlt");
 			end_bh_atomic();
-#else
-			__asm__("hlt");
-#endif
 	        }
  		if (current->need_resched) 
  			break;
 		schedule();
 	}
-#ifdef CONFIG_APM
 	apm_do_busy();
-#endif
 }
+#endif
 
 /*
  * The idle loop on a uniprocessor i386..
  */ 
 static int cpu_idle(void *unused)
 {
+#ifdef CONFIG_APM
 	int work = 1;
 	unsigned long start_idle = 0;
+#endif
+	long * need_resched = &current->need_resched;
 
 	/* endless idle loop with no priority at all */
 	current->priority = 0;
 	current->counter = -100;
 	for (;;) {
+#ifdef CONFIG_APM
 		if (work)
 			start_idle = jiffies;
 
 		if (jiffies - start_idle > HARD_IDLE_TIMEOUT) 
 			hard_idle();
 		else  {
-			if (boot_cpu_data.hlt_works_ok && !hlt_counter && !current->need_resched)
+#endif
+			if (boot_cpu_data.hlt_works_ok && !hlt_counter && !*need_resched)
 		        	__asm__("hlt");
+
+#ifdef CONFIG_APM
 		}
 
-		work = current->need_resched;
+		work = *need_resched;
+#endif
 		schedule();
+#if 0
 		check_pgt_cache();
+#endif
 	}
 }
 
@@ -136,14 +142,18 @@
 
 int cpu_idle(void *unused)
 {
+	long * need_resched = &current->need_resched;
+
 	/* endless idle loop with no priority at all */
 	current->priority = 0;
 	current->counter = -100;
 	while(1) {
-		if (current_cpu_data.hlt_works_ok && !hlt_counter && !current->need_resched)
+		if (current_cpu_data.hlt_works_ok && !hlt_counter && !*need_resched)
 			__asm__("hlt");
 		schedule();
+#if 0
 		check_pgt_cache();
+#endif
 	}
 }
 


Andrea Arcangeli


* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-20 14:18         ` Linus Torvalds
  1998-12-21 13:03           ` Andrea Arcangeli
@ 1998-12-21 13:39           ` Stephen C. Tweedie
  1998-12-21 14:08             ` Andrea Arcangeli
  1 sibling, 1 reply; 32+ messages in thread
From: Stephen C. Tweedie @ 1998-12-21 13:39 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Rik van Riel, Linux MM, Andrea Arcangeli,
	Alan Cox

Hi,

On Sun, 20 Dec 1998 06:18:23 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> This has Stephen's page-in read-ahead code, and I clearly separated the
> cases where kswapd tries to throw something out vs a normal user - I
> suspect Stephen can agree with the new setup. 

It certainly looks OK, and it performs very well on my 64MB system.
Sadly, in low memory it stinks.  It has just given me the worst
benchmark _ever_ of any of the VMs I have tried for an 8MB NFS defrag
build, taking nearly twice as long as ac11.

Taking both the kswapd and foreground pageout priority initial values
down to 6, things improve: it is only 45% slower now.

> I expect that it needs to be tested in different configurations to
> find the optimal values for various tunables, but hopefully this is it
> when it comes to basic code.

Linus, I have tried this sort of thing before.  I have stopped believing
that one can write the VM balancing code just by thinking about it.
There is a very delicate balance between good performance in various
typical loads and reasonable worst-case behaviour, and ac11 is the best
I've tried for this.  You might well be able to tweak the new algorithm
for good performance on low-memory, but you may well upset larger-memory
behaviour in the process.

On the other hand, I will readily agree that the code in ac11 could be
better expressed: you are quite right when you point out that the
shrink_mmap() test, conditional on (current != kswap_task) would be
better written explicitly as a separate code path for the foreground
memory reclaim code.

As I've said, I'll not have any more time to fine-tune this stuff before
the New Year.  It's up to you what you decide to do about this, but if
you want things fine-tuned sooner than that you'll have to find somebody
else to do it; I've already tuned the ac11 VM and it works well overall
in every case I have tried.  132-pre3 seems OK on a larger memory
machine, but there's no way I'll be running it on my low-memory test
boxes.

--Stephen

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-21 13:39           ` Stephen C. Tweedie
@ 1998-12-21 14:08             ` Andrea Arcangeli
  1998-12-21 16:42               ` Stephen C. Tweedie
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 1998-12-21 14:08 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Linus Torvalds, Rik van Riel, Linux MM, Alan Cox

On Mon, 21 Dec 1998, Stephen C. Tweedie wrote:

>in every case I have tried.  132-pre3 seems OK on a larger memory
>machine, but there's no way I'll be running it on my low-memory test
>boxes.

Could you try applying the patch I sent you a few minutes ago, too?
It seems to perform well, at least on 32MB.  The point is that setting
prio = 4 in try_to_free_pages() avoids processes getting stuck in
SYNC IO. 

Andrea Arcangeli


* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-21  9:53     ` Andrea Arcangeli
@ 1998-12-21 16:37       ` Stephen C. Tweedie
  1998-12-21 17:58         ` Linus Torvalds
  0 siblings, 1 reply; 32+ messages in thread
From: Stephen C. Tweedie @ 1998-12-21 16:37 UTC (permalink / raw
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Linux MM,
	Alan Cox

Hi,

On Mon, 21 Dec 1998 10:53:35 +0100 (CET), Andrea Arcangeli
<andrea@e-mind.com> said:

> The good point of 132-pre2 is that you'll never see a thread on
> linux-kernel saying "132-pre2 VM performance is jerky". 

I haven't seen that for the current ac patches, either.

> It may not be the best, but it will surely work well for everybody
> out there, on every kind of hardware. 

Of course, you've tested this, haven't you?

pre2 works OK on low memory for me but its performance on 64MB sucks
here.  pre3 works fine on 64MB but its performance on 8MB sucks even
more.  You simply CANNOT tell from looking at the code that it "will
work well for everybody out there on every hardware".  

--Stephen

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-21 14:08             ` Andrea Arcangeli
@ 1998-12-21 16:42               ` Stephen C. Tweedie
  0 siblings, 0 replies; 32+ messages in thread
From: Stephen C. Tweedie @ 1998-12-21 16:42 UTC (permalink / raw
  To: Andrea Arcangeli
  Cc: Stephen C. Tweedie, Linus Torvalds, Rik van Riel, Linux MM,
	Alan Cox

Hi,

On Mon, 21 Dec 1998 15:08:35 +0100 (CET), Andrea Arcangeli
<andrea@e-mind.com> said:

> On Mon, 21 Dec 1998, Stephen C. Tweedie wrote:
>> in every case I have tried.  132-pre3 seems OK on a larger memory
>> machine, but there's no way I'll be running it on my low-memory test
>> boxes.

> Could you try to apply my patch I sent to you too some minutes ago? 

No.  I'm in thesis mode until the new year (I really shouldn't be
writing this!).  I've already tested the VM, and have something which
works.  What is in ac* has been tuned and gives overall good
behaviour.  Every single proposal I've seen since, without exception,
has performed worse on low memory, worse on 64MB, has trashed the
cache, has resulted in large amounts of read IO during kernel builds, or
has had some other such regression against the VM I've been tuning for
the last two or three weeks.  I am _not_ about to go starting that
tuning process all over again.

If you want my attention, then benchmark your own patch and show me
that it is better than what we have.  So far, I have been benchmarking
everybody else's patches, and they are all worse than what I already
have.

--Stephen

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-21 16:37       ` Stephen C. Tweedie
@ 1998-12-21 17:58         ` Linus Torvalds
  1998-12-21 18:59           ` Stephen C. Tweedie
  1998-12-22  7:56           ` Eric W. Biederman
  0 siblings, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 1998-12-21 17:58 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Rik van Riel, Linux MM, Alan Cox



On Mon, 21 Dec 1998, Stephen C. Tweedie wrote:
> 
> pre2 works OK on low memory for me but its performance on 64MB sucks
> here.  pre3 works fine on 64MB but its performance on 8MB sucks even
> more.

I'm testing it now - the problem is probably just due to my mixing up the
pre-2 and pre-3 patches, and pre-3 got the "timid" memory freeing
parameters even though the whole point of the pre-3 approach is that it
isn't needed any more.

>	  You simply CANNOT tell from looking at the code that it "will
> work well for everybody out there on every hardware".  

Agreed.

However, I very much believe that tweaking comes _after_ the basic
architecture is right. Before the basic architecture is correct, any
tweaking is useful only to (a) try to make do with a bad setup and (b) 
give hints as to what makes a difference, and what the basic architecture
_should_ be. 

As such, your "current != kswapd" tweak gave a whopping good hint about
what the architecture _should_ be. And we'll be zeroing in on something
that has both the performance and the architecture right. 

		Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-21 17:58         ` Linus Torvalds
@ 1998-12-21 18:59           ` Stephen C. Tweedie
  1998-12-21 19:38             ` Linus Torvalds
  1998-12-22  7:56           ` Eric W. Biederman
  1 sibling, 1 reply; 32+ messages in thread
From: Stephen C. Tweedie @ 1998-12-21 18:59 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Rik van Riel, Linux MM,
	Alan Cox

Hi,

On Mon, 21 Dec 1998 09:58:10 -0800 (PST), Linus Torvalds
<torvalds@transmeta.com> said:

> I'm testing it now - the problem is probably just due to my mixing up the
> pre-2 and pre-3 patches, and pre-3 got the "timid" memory freeing
> parameters even though the whole point of the pre-3 approach is that it
> isn't needed any more.

Yep, and although things did improve when I restored some of that
aggressiveness (initial priority = 6 again), it was still mondo slow
on 8MB.  I also restored the swapout loop (so that the foreground
try_to_free_page() takes a swap cluster argument again, rather than
always freeing just one page at a time); still no improvement (which
actually surprised me --- I guess that kswapd is doing clustering for
swapout well enough on its own).

>> You simply CANNOT tell from looking at the code that it "will
>> work well for everybody out there on every hardware".  

> Agreed.

> However, I very much believe that tweaking comes _after_ the basic
> architecture is right. 

Right.

> As such, your "current != kswapd" tweak gave a whopping good hint about
> what the architecture _should_ be. And we'll be zeroing in on something
> that has both the performance and the architecture right. 

Sure: I think we can agree that the most important principle in this
respect is that the foreground and background swapping tasks may be
similar but they do not _need_ to be the same, and they may well have
different requirements.

Linus, would it help at all if I just sat down and recoded the VM I'm
running now in a manner which makes the design obvious?  In other
words, clearly separate out the foreground and background paths as you
have done, with the "current != kswapd" test removed and the
foreground-specific code in its own, identifiable code path, but
preserving the actual algorithm?

--Stephen

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-21 18:59           ` Stephen C. Tweedie
@ 1998-12-21 19:38             ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 1998-12-21 19:38 UTC (permalink / raw
  To: Stephen C. Tweedie; +Cc: Andrea Arcangeli, Rik van Riel, Linux MM, Alan Cox



On Mon, 21 Dec 1998, Stephen C. Tweedie wrote:
>
> Yep, and although things did improve when I restored some of that
> aggressiveness (initial priority = 6 again), it was still mondo slow
> on 8MB.  I also restored the swapout loop (so that the foreground
> try_to_free_page() takes a swap cluster argument again, rather than
> always freeing just one page at a time);

Hmm.. It already does that. Maybe you didn't look at the "free_memory()"
macro?

>				 still no improvement (which
> actually surprised me --- I guess that kswapd is doing clustering for
> swapout well enough on its own).

You shouldn't be surprised, as I don't think you changed anything ;)

> Linus, would it help at all if I just sat down and recoded the VM I'm
> running now in a manner which makes the design obvious?  In other
> words, clearly separate out the foreground and background paths as you
> have done, with the "current != kswapd" test removed and the
> foreground-specific code in its own, identifiable code path, but
> preserving the actual algorithm?

Sure, send me patches.

		Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-21 17:58         ` Linus Torvalds
  1998-12-21 18:59           ` Stephen C. Tweedie
@ 1998-12-22  7:56           ` Eric W. Biederman
  1998-12-22 10:49             ` Andrea Arcangeli
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 1998-12-22  7:56 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Stephen C. Tweedie, Andrea Arcangeli, Rik van Riel, Linux MM,
	Alan Cox

>>>>> "LT" == Linus Torvalds <torvalds@transmeta.com> writes:

LT> On Mon, 21 Dec 1998, Stephen C. Tweedie wrote:
>> 
>> pre2 works OK on low memory for me but its performance on 64MB sucks
>> here.  pre3 works fine on 64MB but its performance on 8MB sucks even
>> more.

LT> I'm testing it now - the problem is probably just due to my mixing up the
LT> pre-2 and pre-3 patches, and pre-3 got the "timid" memory freeing
LT> parameters even though the whole point of the pre-3 approach is that it
LT> isn't needed any more.

>> You simply CANNOT tell from looking at the code that it "will
>> work well for everybody out there on every hardware".  

LT> Agreed.

LT> However, I very much believe that tweaking comes _after_ the basic
LT> architecture is right. Before the basic architecture is correct, any
LT> tweaking is useful only to (a) try to make do with a bad setup and (b) 
LT> give hints as to what makes a difference, and what the basic architecture
LT> _should_ be. 

LT> As such, your "current != kswapd" tweak gave a whopping good hint about
LT> what the architecture _should_ be. And we'll be zeroing in on something
LT> that has both the performance and the architecture right. 

In getting the architecture right, let's make it clear why the
foreground task should be more aggressive with shrink_mmap than the
background task. 

The semantics of shrink_mmap and swap_out are no longer the same,
and they should not be treated equally:

shrink_mmap actually frees memory.
swap_out never frees memory.

The background task doesn't really ever need to free memory unless memory
starts getting too low for atomic allocations, so only then should it call
shrink_mmap.

The foreground task always really wants memory, so it should never call
swap_out unless it needs to accelerate the swapping process (in which case
it could also just wake up the daemon).
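To make the distinction concrete, here is a tiny userspace model of those
semantics (a sketch only; the names, flags and counts are invented for
illustration, not the kernel's):

```c
#include <assert.h>

/* Illustrative model: "swap_out" only moves a mapped page into the
 * swap cache (no page becomes free), while "shrink_mmap" is the step
 * that actually releases an unmapped cache page. */

enum { NPAGES = 8 };
static int mapped[NPAGES];    /* page is mapped by a process */
static int in_cache[NPAGES];  /* page sits in the page/swap cache */
static int nr_free;

/* "swap_out": unmap one mapped page into the cache; frees nothing */
static int model_swap_out(void)
{
	int i;
	for (i = 0; i < NPAGES; i++)
		if (mapped[i]) {
			mapped[i] = 0;
			in_cache[i] = 1;
			return 1;  /* reports success, nr_free unchanged */
		}
	return 0;
}

/* "shrink_mmap": drop one unmapped cache page; the real free */
static int model_shrink_mmap(void)
{
	int i;
	for (i = 0; i < NPAGES; i++)
		if (in_cache[i] && !mapped[i]) {
			in_cache[i] = 0;
			nr_free++;
			return 1;
		}
	return 0;
}

/* start with every page mapped, then run the two passes */
static int freed_after(int swaps, int shrinks)
{
	int i;
	nr_free = 0;
	for (i = 0; i < NPAGES; i++) {
		mapped[i] = 1;
		in_cache[i] = 0;
	}
	for (i = 0; i < swaps; i++)
		model_swap_out();
	for (i = 0; i < shrinks; i++)
		model_shrink_mmap();
	return nr_free;
}
```

Running the swap_out pass alone leaves nr_free at zero no matter how often
it "succeeds"; only the shrink_mmap pass raises it, which is why a
foreground task that wants a free page right now should lean on
shrink_mmap.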

To date I have only studied one very specific case: what happens when
a process dirties pages faster than the system can handle. 

The results I have are:
1) Using the stated logic and staying with swap_out (and never calling
   shrink_mmap) locks the machine until all dirty pages are cleaned.

2) Calling shrink_mmap anytime during a swap_out cycle gives slow
   performance, but the machine doesn't lock.

3) The vm I was playing with had no way to limit the total vm size,
   so processes that are thrashing will slow other processes as well.
   So we have a potential worst-case scenario; the only solution
   would be to implement RLIMIT_RSS.  
   If I can find enough time I'm going to look at implementing
   RLIMIT_RSS in handle_pte_fault; it should be fairly simple.

Eric






^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22  7:56           ` Eric W. Biederman
@ 1998-12-22 10:49             ` Andrea Arcangeli
  1998-12-22 15:32               ` Eric W. Biederman
  1998-12-22 17:23               ` [patch] swap_out now really free (the right) pages [Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)] Andrea Arcangeli
  0 siblings, 2 replies; 32+ messages in thread
From: Andrea Arcangeli @ 1998-12-22 10:49 UTC (permalink / raw
  To: Eric W. Biederman
  Cc: Linus Torvalds, Stephen C. Tweedie, Rik van Riel, Linux MM,
	Alan Cox

On 22 Dec 1998, Eric W. Biederman wrote:

>To date I have only studied one very specific case,  what happens when
>a process dirties pages faster than the system can handle. 

Me too.

>3) The vm I was playing with had no way to limit the total vm size,
>   so processes that are thrashing will slow other processes as well.
>   So we have a potential worst-case scenario; the only solution
>   would be to implement RLIMIT_RSS.  

Hmm, no: limiting the resident size is a workaround, I think...

I agree that the fact that swap_out returns 1 and really has not freed
a page is a bit messy, though. Should we always do a shrink_mmap()
after every successful swapout? 

Andrea Arcangeli


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 10:49             ` Andrea Arcangeli
@ 1998-12-22 15:32               ` Eric W. Biederman
  1998-12-22 15:40                 ` Andrea Arcangeli
  1998-12-22 20:03                 ` Rik van Riel
  1998-12-22 17:23               ` [patch] swap_out now really free (the right) pages [Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)] Andrea Arcangeli
  1 sibling, 2 replies; 32+ messages in thread
From: Eric W. Biederman @ 1998-12-22 15:32 UTC (permalink / raw
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, Linus Torvalds, Stephen C. Tweedie,
	Rik van Riel, Linux MM, Alan Cox

>>>>> "AA" == Andrea Arcangeli <andrea@e-mind.com> writes:

AA> On 22 Dec 1998, Eric W. Biederman wrote:
>> To date I have only studied one very specific case,  what happens when
>> a process dirties pages faster than the system can handle. 

AA> Me too.

>> 3) The vm I was playing with had no way to limit the total vm size,
>> so processes that are thrashing will slow other processes as well.
>> So we have a potential worst-case scenario; the only solution
>> would be to implement RLIMIT_RSS.  

AA> Hmm, no: limiting the resident size is a workaround, I think...

Not totally, though there may be another way.

The worst case is very simple: a program eating pages at the maximum
possible rate, and out-competing every other program for pages.

The goal is to keep one single rogue program from outcompeting all of
the others. With an RSS limit this is accomplished by, at some point,
forcing free pages to come from the program that needs the memory
(via swap_out) instead of directly.

What currently happens when such a program starts thrashing is that
whenever it wakes up it steals all of the memory, then sleeps until it
can steal some more, because the program is a better competitor than
the others. With an RSS limit we would guarantee that there is some
memory left over for other programs to run in.

Eventually we should attempt to autotune a program's RSS by its
workload, and if giving a program a larger RSS doesn't help (that is,
the program continues to thrash with the RSS we give it) we should
scale back its RSS, so as not to compete with other programs.

Implementing simple RSS limits is a first approximation of the above.

Implementing arbitrary RSS limits should have little effect on
performance, because all of the pages simply go to the swap cache.

Implementing RSS limits is only a means of preventing a denial of
service attack, and it should not be a case we autotune for.
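As a sketch of that first approximation (all names, values and the
placement are hypothetical; the real check would sit somewhere like
handle_pte_fault()):

```c
#include <assert.h>

/* Hypothetical model of a per-task RSS limit: once a task is at its
 * limit, each new fault first recycles one of the task's own pages
 * (a stand-in for swapping out against itself), so a thrashing
 * program stops eating everyone else's memory. */

struct task_model {
	int rss;        /* resident pages */
	int rss_limit;  /* hypothetical RLIMIT_RSS, in pages */
};

static int nr_free_pages;

/* take one page back from the task itself (models swap_out on it) */
static void swap_out_self(struct task_model *t)
{
	t->rss--;
	nr_free_pages++;
}

/* fault one page in, enforcing the per-task limit first */
static void fault_in_page(struct task_model *t)
{
	if (t->rss >= t->rss_limit)
		swap_out_self(t);  /* over limit: pay with own pages */
	t->rss++;
	nr_free_pages--;
}

/* run a burst of faults and report the resulting resident size */
static int rss_after_faults(int limit, int faults)
{
	struct task_model t = { 0, limit };
	int i;
	nr_free_pages = 16;
	for (i = 0; i < faults; i++)
		fault_in_page(&t);
	return t.rss;
}
```

However many pages the task touches, its resident set never grows past
its limit, which leaves the remaining free pages for other programs.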

AA> I agree that the fact that swap_out returns 1 and really has not
AA> freed a page is a bit messy, though. Should we always do a
AA> shrink_mmap() after every successful swapout? 

No.  That doesn't buy you anything; let the routines have different
semantics, and stop trying to treat them the same.

This is simply one reason why everyone's trick of calling shrink_mmap at
strange times worked.  

My suggestion (again) would be to not call shrink_mmap in the swapper
(unless we are endangering atomic allocations).  And to never call
swap_out in the memory allocator (just wake up kswapd).

Since we are into getting the architecture right, let's stop trying
to force square pegs through round holes.  It's OK to make a square
hole too.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 15:32               ` Eric W. Biederman
@ 1998-12-22 15:40                 ` Andrea Arcangeli
  1998-12-22 16:26                   ` Linus Torvalds
  1998-12-22 20:10                   ` Rik van Riel
  1998-12-22 20:03                 ` Rik van Riel
  1 sibling, 2 replies; 32+ messages in thread
From: Andrea Arcangeli @ 1998-12-22 15:40 UTC (permalink / raw
  To: Eric W. Biederman
  Cc: Linus Torvalds, Stephen C. Tweedie, Rik van Riel, Linux MM,
	Alan Cox

On 22 Dec 1998, Eric W. Biederman wrote:

>My suggestion (again) would be to not call shrink_mmap in the swapper
>(unless we are endangering atomic allocations).  And to never call
>swap_out in the memory allocator (just wake up kswapd).

Ah, I had exactly the same idea yesterday, but there's a good reason I
neither proposed nor tried it. The point is real-time tasks: kswapd is
not realtime, and a realtime task must be able to swap out a little by
itself in try_to_free_pages() when there's nothing left to free from
the cache. 

Since I agree with you that the foreground freeing should mainly run
shrink_mmap(), I proposed yesterday to use a higher priority in
try_to_free_pages() (see my patch: it starts with priority = 4, while
Linus's now starts with priority = 5). This way we are pretty sure that
the foreground freeing will be done by shrink_mmap(), so some memory
will really be freed (and this also avoids having tasks other than
kswapd sleep waiting for slow synchronous I/O). 

I agree with your argument that it's a bogus architecture to treat the
current swap_out() and shrink_mmap() the same way, since swap_out()
doesn't really free pages...

Linus's pre-4 seems to work well here though...

Andrea Arcangeli


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 15:40                 ` Andrea Arcangeli
@ 1998-12-22 16:26                   ` Linus Torvalds
  1998-12-22 19:55                     ` Eric W. Biederman
  1998-12-22 20:25                     ` Rik van Riel
  1998-12-22 20:10                   ` Rik van Riel
  1 sibling, 2 replies; 32+ messages in thread
From: Linus Torvalds @ 1998-12-22 16:26 UTC (permalink / raw
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, Stephen C. Tweedie, Rik van Riel, Linux MM,
	Alan Cox



On Tue, 22 Dec 1998, Andrea Arcangeli wrote:
>
> On 22 Dec 1998, Eric W. Biederman wrote:
> 
> >My suggestion (again) would be to not call shrink_mmap in the swapper
> >(unless we are endangering atomic allocations).  And to never call
> >swap_out in the memory allocator (just wake up kswapd).
> 
> Ah, I had exactly the same idea yesterday, but there's a good reason
> I neither proposed nor tried it. The point is real-time tasks: kswapd
> is not realtime, and a realtime task must be able to swap out a
> little by itself in try_to_free_pages() when there's nothing left to
> free from the cache. 

There's another one: if you never call shrink_mmap() in the swapper, the
swapper at least currently won't ever really know when it should finish.

> Linus's pre-4 seems to work well here though...

I'm still trying to integrate some of the stuff from Stephen in there: the
pre-4 contained some rewrites to shrink_mmap() to make Stephen's
PG_referenced stuff cleaner, but it didn't yet take it into account for
"count", for example. The aim certainly is to have something clean that
essentially does what Stephen was trying to do. 

		Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [patch] swap_out now really free (the right) pages [Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)]
  1998-12-22 10:49             ` Andrea Arcangeli
  1998-12-22 15:32               ` Eric W. Biederman
@ 1998-12-22 17:23               ` Andrea Arcangeli
  1 sibling, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 1998-12-22 17:23 UTC (permalink / raw
  To: Eric W. Biederman
  Cc: Linus Torvalds, Stephen C. Tweedie, Rik van Riel, Linux MM,
	Alan Cox

On Tue, 22 Dec 1998, Andrea Arcangeli wrote:

>page is a bit messy, though. Should we always do a shrink_mmap() after
>every successful swapout? 

Tried it, and it seems to work great here! This new mm patch of mine
improves things because swap_out() is now able to really free pages, so
processes get blocked in try_to_free_pages() much less frequently, since
kswapd is now able to take the free pages over the min limit. It seems
to _not_ hurt the aging at all. And btw, this patch of mine makes _tons_
of sense to me.

Could you try the patch and send feedback?

Andrea Arcangeli

PS. As usual I don't know if adding a Copyright there makes sense or is
    legal...

Patch against 2.1.132-4

Index: filemap.c
===================================================================
RCS file: /var/cvs/linux/mm/filemap.c,v
retrieving revision 1.1.1.1.2.24
diff -u -r1.1.1.1.2.24 filemap.c
--- filemap.c	1998/12/22 11:07:28	1.1.1.1.2.24
+++ linux/mm/filemap.c	1998/12/22 17:03:55
@@ -181,26 +181,6 @@
 }
 
 /*
- * This is called from try_to_swap_out() when we try to get rid of some
- * pages..  If we're unmapping the last occurrence of this page, we also
- * free it from the page hash-queues etc, as we don't want to keep it
- * in-core unnecessarily.
- */
-unsigned long page_unuse(struct page * page)
-{
-	int count = atomic_read(&page->count);
-
-	if (count != 2)
-		return count;
-	if (!page->inode)
-		return count;
-	if (PageSwapCache(page))
-		panic ("Doing a normal page_unuse of a swap cache page");
-	remove_inode_page(page);
-	return 1;
-}
-
-/*
  * Update a page cache copy, when we're doing a "write()" system call
  * See also "update_vm_cache()".
  */
Index: swap_state.c
===================================================================
RCS file: /var/cvs/linux/mm/swap_state.c,v
retrieving revision 1.1.1.1.2.7
diff -u -r1.1.1.1.2.7 swap_state.c
--- swap_state.c	1998/12/20 15:51:32	1.1.1.1.2.7
+++ linux/mm/swap_state.c	1998/12/22 16:33:29
@@ -248,7 +248,7 @@
 		delete_from_swap_cache(page);
 	}
 	
-	free_page(addr);
+	__free_page(page);
 }
 
 
Index: vmscan.c
===================================================================
RCS file: /var/cvs/linux/mm/vmscan.c,v
retrieving revision 1.1.1.1.2.39
diff -u -r1.1.1.1.2.39 vmscan.c
--- vmscan.c	1998/12/22 11:07:28	1.1.1.1.2.39
+++ linux/mm/vmscan.c	1998/12/22 17:19:17
@@ -10,6 +10,16 @@
  *  Version: $Id: vmscan.c,v 1.5 1998/02/23 22:14:28 sct Exp $
  */
 
+/*
+ * Changed swap_out() to have really freed one page when it returns 1
+ * (that was no longer true since 2.1.130).
+ * The trick is done doing a fast pass of shrink_mmap() and freeing
+ * the swapped out page by hand from the swap cache only if shrink_mmap()
+ * has failed. This way we are swapping out and freeing ram but taking care
+ * of the page aging (PG_referenced).
+ *			Copyright (C) 1998  Andrea Arcangeli
+ */
+
 #include <linux/slab.h>
 #include <linux/kernel_stat.h>
 #include <linux/swap.h>
@@ -27,6 +37,8 @@
 
 static void init_swap_timer(void);
 
+#define	SWAPOUT_SHRINK_PRIORITY	6
+
 /*
  * The swap-out functions return 1 if they successfully
  * threw something out, and we got a free page. It returns
@@ -162,7 +174,12 @@
 			 * copy in memory, so we add it to the swap
 			 * cache. */
 			if (PageSwapCache(page_map)) {
-				free_page(page);
+				if (shrink_mmap(SWAPOUT_SHRINK_PRIORITY, 0))
+				{
+					__free_page(page_map);
+					return 1;
+				}
+				free_page_and_swap_cache(page);
 				return (atomic_read(&page_map->count) == 0);
 			}
 			add_to_swap_cache(page_map, entry);
@@ -180,7 +197,11 @@
 		 * asynchronously.  That's no problem, shrink_mmap() can
 		 * correctly clean up the occassional unshared page
 		 * which gets left behind in the swap cache. */
-		free_page(page);
+		if (shrink_mmap(SWAPOUT_SHRINK_PRIORITY, 0))
+			__free_page(page_map);
+		else
+			free_page_and_swap_cache(page);
+
 		return 1;	/* we slept: the process may not exist any more */
 	}
 
@@ -194,8 +215,14 @@
 		set_pte(page_table, __pte(entry));
 		flush_tlb_page(vma, address);
 		swap_duplicate(entry);
-		free_page(page);
-		return (atomic_read(&page_map->count) == 0);
+		if (shrink_mmap(SWAPOUT_SHRINK_PRIORITY, 0))
+		{
+			__free_page(page_map);
+			return 1;
+		} else {
+			free_page_and_swap_cache(page);
+			return (atomic_read(&page_map->count) == 0);
+		}
 	} 
 	/* 
 	 * A clean page to be discarded?  Must be mmap()ed from
@@ -210,9 +237,15 @@
 	flush_cache_page(vma, address);
 	pte_clear(page_table);
 	flush_tlb_page(vma, address);
-	entry = (atomic_read(&page_map->count) == 1);
+	entry = atomic_read(&page_map->count);
 	__free_page(page_map);
-	return entry;
+	if (entry == 2 && page_map->inode)
+	{
+		if (!shrink_mmap(SWAPOUT_SHRINK_PRIORITY, 0))
+			remove_inode_page(page_map);
+		return 1;
+	}
+	return entry == 1;
 }
 
 /*


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 16:26                   ` Linus Torvalds
@ 1998-12-22 19:55                     ` Eric W. Biederman
  1998-12-22 20:25                     ` Rik van Riel
  1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 1998-12-22 19:55 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Stephen C. Tweedie, Rik van Riel, Linux MM,
	Alan Cox

>>>>> "LT" == Linus Torvalds <torvalds@transmeta.com> writes:

LT> On Tue, 22 Dec 1998, Andrea Arcangeli wrote:
>> 
>> On 22 Dec 1998, Eric W. Biederman wrote:
>> 
>> >My suggestion (again) would be to not call shrink_mmap in the swapper
>> >(unless we are endangering atomic allocations).  And to never call
>> >swap_out in the memory allocator (just wake up kswapd).
>> 
>> Ah, I had exactly the same idea yesterday, but there's a good
>> reason I neither proposed nor tried it. The point is real-time
>> tasks: kswapd is not realtime, and a realtime task must be able to
>> swap out a little by itself in try_to_free_pages() when there's
>> nothing left to free from the cache. 

LT> There's another one: if you never call shrink_mmap() in the swapper, the
LT> swapper at least currently won't ever really know when it should finish.

Unless there are foreground allocations that free a little too much memory.

With respect to real time tasks:
A) They don't generally swap.
B) There is code in __get_free_pages() to put the real time task to
   sleep, if it must, while waiting for memory.
C) We are currently examining all of the code and seeing if it is
   comprehensible.  Do we want to free memory to freepages.high in
   kswapd?

>> Linus's pre-4 seems to work well here though...

LT> I'm still trying to integrate some of the stuff from Stephen in
LT> there: the pre-4 contained some rewrites to shrink_mmap() to make
LT> Stephen's PG_referenced stuff cleaner, but it didn't yet take it
LT> into account for "count", for example. The aim certainly is to have
LT> something clean that essentially does what Stephen was trying to do. 

If the aim is to make Stephen's code comprehensible, I won't push too
hard.  But I will push to make sure that the code is comprehensible.
And with the change to swap_out to only half-free memory, there is code
that used to make sense but no longer does. 

As for pre-4, I am still baffled by treating swap_out the same as
shrink_mmap; they aren't the same.

swap_out is an investment in free memory to come, and shrink_mmap
capitalizes on that investment.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 15:32               ` Eric W. Biederman
  1998-12-22 15:40                 ` Andrea Arcangeli
@ 1998-12-22 20:03                 ` Rik van Riel
  1 sibling, 0 replies; 32+ messages in thread
From: Rik van Riel @ 1998-12-22 20:03 UTC (permalink / raw
  To: Eric W. Biederman
  Cc: Andrea Arcangeli, Linus Torvalds, Stephen C. Tweedie, Linux MM,
	Alan Cox

On 22 Dec 1998, Eric W. Biederman wrote:

> The goal is to keep one single rogue program from outcompeting all
> of the others. With an RSS limit this is accomplished by, at some
> point, forcing free pages to come from the program that needs the
> memory (via swap_out) instead of directly.
> 
> What currently happens when such a program starts thrashing is that
> whenever it wakes up it steals all of the memory, then sleeps until
> it can steal some more, because the program is a better competitor
> than the others.  With an RSS limit we would guarantee that there is
> some memory left over for other programs to run in.
> 
> Eventually we should attempt to autotune a program's RSS by its
> workload, and if giving a program a larger RSS doesn't help (that
> is, the program continues to thrash with the RSS we give it) we
> should scale back its RSS, so as not to compete with other
> programs.

I have a better idea:

if (current->mm->rss > hog_pct && total_mapped > syshog_pct) {
    ... swap_out_process(current, GFP)  swap_cluster pages ...
}

We can easily do something like this because swap_out() only
unmaps the pages and they can easily be mapped in again.

I know we tried it before and it horribly failed back then,
but now pages are not freed on swap_out(). Things have changed
in such a way that it could probably work now...

We want the above routine in one of the functions surrounding
mm/page_alloc.c::swap_in() -- this way we 'throttle at the
source'.

I know some of you think throttling at the source is a bad
thing (even for buffer cache), but you'll have to throttle
eventually and not doing it will mean you also 'throttle'
the (innocent) rest of the system...
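For what it's worth, the threshold test in the fragment above could look
something like this as compilable C (hog_pct, syshog_pct and their values
are stand-ins; the real version would go on to swap_out_process() the
faulting task):

```c
#include <assert.h>

/* Hypothetical thresholds, as percentages of total RAM */
static int hog_pct = 25;     /* a task mapping more than this is a hog */
static int syshog_pct = 75;  /* ...but only throttle when this much of
                              * RAM is mapped system-wide */

/* decide whether the faulting task should be made to self-swap;
 * all sizes are in pages, compared cross-multiplied to stay integer */
static int should_throttle(int task_rss, int total_mapped, int total_ram)
{
	return task_rss * 100 > hog_pct * total_ram &&
	       total_mapped * 100 > syshog_pct * total_ram;
}
```

Only the hog pays, and only under global memory pressure, which is
exactly the "throttle at the source" behaviour: other tasks never hit
the slow path.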

cheers,

Rik -- the flu hits, the flu hits, the flu hits -- MORE
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 15:40                 ` Andrea Arcangeli
  1998-12-22 16:26                   ` Linus Torvalds
@ 1998-12-22 20:10                   ` Rik van Riel
  1998-12-22 22:35                     ` Andrea Arcangeli
  1 sibling, 1 reply; 32+ messages in thread
From: Rik van Riel @ 1998-12-22 20:10 UTC (permalink / raw
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, Linus Torvalds, Stephen C. Tweedie, Linux MM,
	Alan Cox

On Tue, 22 Dec 1998, Andrea Arcangeli wrote:
> On 22 Dec 1998, Eric W. Biederman wrote:
> 
> >My suggestion (again) would be to not call shrink_mmap in the swapper
> >(unless we are endangering atomic allocations).  And to never call
> >swap_out in the memory allocator (just wake up kswapd).
> 
> Ah, I had exactly the same idea yesterday, but there's a good
> reason I neither proposed nor tried it. The point is real-time
> tasks: kswapd is not realtime, and a realtime task must be able
> to swap out a little by itself in try_to_free_pages() when
> there's nothing left to free from the cache.

- kswapd should make sure that there is enough in the cache
  (we should keep track of how many 1-count cache pages there
  are in the system)
- realtime tasks shouldn't go around allocating huge amounts
  of memory -- this totally ruins the realtime aspect anyway

> (and this will avoid also tasks other than kswapd to
> sleep waiting for slowww SYNC IO). 

Some tasks (really big memory hogs) are better left sleeping
for I/O because they otherwise completely overpower the rest
of the system. But that's a slightly different story :)

cheers,

Rik -- the flu hits, the flu hits, the flu hits -- MORE
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 16:26                   ` Linus Torvalds
  1998-12-22 19:55                     ` Eric W. Biederman
@ 1998-12-22 20:25                     ` Rik van Riel
  1998-12-22 21:56                       ` Linus Torvalds
  1 sibling, 1 reply; 32+ messages in thread
From: Rik van Riel @ 1998-12-22 20:25 UTC (permalink / raw
  To: Linus Torvalds
  Cc: Andrea Arcangeli, Eric W. Biederman, Stephen C. Tweedie, Linux MM,
	Alan Cox

On Tue, 22 Dec 1998, Linus Torvalds wrote:
> On Tue, 22 Dec 1998, Andrea Arcangeli wrote:
> > On 22 Dec 1998, Eric W. Biederman wrote:
> > 
> > >My suggestion (again) would be to not call shrink_mmap in the swapper
> > >(unless we are endangering atomic allocations).  And to never call
> > >swap_out in the memory allocator (just wake up kswapd).
> > 
> > Ah, I had exactly the same idea yesterday, but there's a good
> > reason I neither proposed nor tried it. The point is real-time
> > tasks: kswapd is not realtime, and a realtime task must be able
> > to swap out a little by itself in try_to_free_pages() when
> > there's nothing left to free from the cache. 
> 
> There's another one: if you never call shrink_mmap() in the swapper, the
> swapper at least currently won't ever really know when it should finish.

Remember 2.1.89, when you solemnly swore off any kswapd solution
that had anything to do with nr_freepages?

I guess it's time to just let kswapd finish when there are enough
pages that can be 'reaped' by shrink_mmap(). This is a somewhat less
arbitrary way than what we have now, since those clean pages can be
mapped back in any time.

And when we don't have enough memory for DMA buffers or something
like that, we can just set a flag that:
- orders kswapd to unmap XX pages a second
- modifies shrink_mmap() to look for contiguous areas that it
  can free -- and free them

regards,

Rik -- the flu hits, the flu hits, the flu hits -- MORE
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+

--
This is a majordomo managed list.  To unsubscribe, send a message with
the body 'unsubscribe linux-mm me@address' to: majordomo@kvack.org

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 20:25                     ` Rik van Riel
@ 1998-12-22 21:56                       ` Linus Torvalds
  0 siblings, 0 replies; 32+ messages in thread
From: Linus Torvalds @ 1998-12-22 21:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, Eric W. Biederman, Stephen C. Tweedie, Linux MM,
	Alan Cox



On Tue, 22 Dec 1998, Rik van Riel wrote:
> > 
> > There's another one: if you never call shrink_mmap() in the swapper, the
> > swapper at least currently won't ever really know when it should finish.
> 
> Remember 2.1.89, when you solemnly swore off any kswapd solution
> that had anything to do with nr_freepages?

The problem is that we have to have _something_ to go by. I tried for the
longest time to use the memory queues, but eventually gave up. 

> I guess it's time to just let kswapd finish when there are enough
> pages that can be 'reaped' by shrink_mmap(). This is a somewhat less
> arbitrary criterion than what we have now, since those clean pages can
> be mapped back in at any time.

If we had a count of "freeable pages", that would certainly work for
me. I only asked for _some_ way to know when it should finish.

Btw, I just made a 2.1.132. I would have liked to get this issue put to
death, but it didn't look likely, and I had all the other patches pending
that I wanted out (the irda stuff etc), so 2.1.132 is reality, and I hope
we can work based on that.

Logically 2.1.132 should be reasonably close to Stephen's patches, but
as the code actually looks very different it's hard for me to judge
whether it performs comparably. And an 8MB machine feels so sluggish to
me these days that I can't make any judgement at all from that.

		Linus


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 20:10                   ` Rik van Riel
@ 1998-12-22 22:35                     ` Andrea Arcangeli
  1998-12-23  8:45                       ` Rik van Riel
  0 siblings, 1 reply; 32+ messages in thread
From: Andrea Arcangeli @ 1998-12-22 22:35 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Eric W. Biederman, Linus Torvalds, Stephen C. Tweedie, Linux MM,
	Alan Cox

On Tue, 22 Dec 1998, Rik van Riel wrote:

>- kswapd should make sure that there is enough in the cache
>  (we should keep track of how many 1-count cache pages there
>  are in the system)
>- realtime tasks shouldn't go around allocating huge amounts
>  of memory -- this totally ruins the realtime aspect anyway

What if netscape is iconified and the realtime task wants to allocate
some memory to mlock, but has to swap out netscape to do that?

>> (and this will avoid also tasks other than kswapd to
>> sleep waiting for slowww SYNC IO). 
>
>Some tasks (really big memory hogs) are better left sleeping
>for I/O because they otherwise completely overpower the rest
>of the system. But that's a slightly different story :)

The point here is that `free` gets blocked on I/O because the malicious
process is thrashing the VM.

Andrea Arcangeli


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)
  1998-12-22 22:35                     ` Andrea Arcangeli
@ 1998-12-23  8:45                       ` Rik van Riel
  0 siblings, 0 replies; 32+ messages in thread
From: Rik van Riel @ 1998-12-23  8:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Eric W. Biederman, Linus Torvalds, Stephen C. Tweedie, Linux MM,
	Alan Cox

On Tue, 22 Dec 1998, Andrea Arcangeli wrote:
> On Tue, 22 Dec 1998, Rik van Riel wrote:
> 
> >- kswapd should make sure that there is enough in the cache
> >  (we should keep track of how many 1-count cache pages there
> >  are in the system)
> >- realtime tasks shouldn't go around allocating huge amounts
> >  of memory -- this totally ruins the realtime aspect anyway
> 
> What if netscape is iconified and the realtime task wants to allocate
> some memory to mlock, but has to swap out netscape to do that?

While the realtime task is still setting up its resources, it's not
yet busy with its real job and shouldn't be considered RT yet -- but I
agree with your general idea that tasks should be able to swap_out()
too in emergencies...

> >> (and this will avoid also tasks other than kswapd to
> >> sleep waiting for slowww SYNC IO). 
> >
> >Some tasks (really big memory hogs) are better left sleeping
> >for I/O because they otherwise completely overpower the rest
> >of the system. But that's a slightly different story :)
> 
> The point here is that `free` gets blocked on I/O because the
> malicious process is thrashing the VM.

No, the idea is that the big task swap_out()s itself when it
exceeds its RSS limit _and_ the systemwide RSS 'thrash' limit
is exceeded.

With the current (independent swap cache freeing) scheme,
RSS limits are fairly unobtrusive and only add noticeable
overhead when the rest of the system would have been
bothered anyway.

cheers,

Rik -- the flu hits, the flu hits, the flu hits -- MORE
+-------------------------------------------------------------------+
| Linux memory management tour guide.        H.H.vanRiel@phys.uu.nl |
| Scouting Vries cubscout leader.      http://www.phys.uu.nl/~riel/ |
+-------------------------------------------------------------------+


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~1998-12-23 19:05 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
1998-12-01  6:55 [PATCH] swapin readahead v3 + kswapd fixes Rik van Riel
1998-12-01  8:15 ` Andrea Arcangeli
1998-12-01 15:28   ` Rik van Riel
1998-12-17  1:24 ` Linus Torvalds
1998-12-19 17:09   ` New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes) Stephen C. Tweedie
1998-12-19 18:41     ` Linus Torvalds
1998-12-19 19:41     ` Linus Torvalds
1998-12-19 22:01       ` Stephen C. Tweedie
1998-12-20  3:05         ` Linus Torvalds
1998-12-20 14:18         ` Linus Torvalds
1998-12-21 13:03           ` Andrea Arcangeli
1998-12-21 13:39           ` Stephen C. Tweedie
1998-12-21 14:08             ` Andrea Arcangeli
1998-12-21 16:42               ` Stephen C. Tweedie
1998-12-21  9:53     ` Andrea Arcangeli
1998-12-21 16:37       ` Stephen C. Tweedie
1998-12-21 17:58         ` Linus Torvalds
1998-12-21 18:59           ` Stephen C. Tweedie
1998-12-21 19:38             ` Linus Torvalds
1998-12-22  7:56           ` Eric W. Biederman
1998-12-22 10:49             ` Andrea Arcangeli
1998-12-22 15:32               ` Eric W. Biederman
1998-12-22 15:40                 ` Andrea Arcangeli
1998-12-22 16:26                   ` Linus Torvalds
1998-12-22 19:55                     ` Eric W. Biederman
1998-12-22 20:25                     ` Rik van Riel
1998-12-22 21:56                       ` Linus Torvalds
1998-12-22 20:10                   ` Rik van Riel
1998-12-22 22:35                     ` Andrea Arcangeli
1998-12-23  8:45                       ` Rik van Riel
1998-12-22 20:03                 ` Rik van Riel
1998-12-22 17:23               ` [patch] swap_out now really free (the right) pages [Re: New patch (was Re: [PATCH] swapin readahead v3 + kswapd fixes)] Andrea Arcangeli

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).