* [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3
@ 2008-02-27 21:47 Lee Schermerhorn, Mel Gorman
  2008-02-27 21:47 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Lee Schermerhorn, Mel Gorman
                   ` (7 more replies)
  0 siblings, 8 replies; 37+ messages in thread
From: Lee Schermerhorn, Mel Gorman @ 2008-02-27 21:47 UTC
  To: akpm; +Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

[PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3

This is a rebase of the two-zonelist patchset to 2.6.25-rc2-mm1.

Mel, who was still on vacation last I checked, asked me to repost these
as I'd already rebased them. I've been testing them continually on each
-mm tree for months, hoping to see them in -mm for wider testing.

I have a series of mempolicy cleanup patches, including a rework of the
reference counting, that depends on this series. David R. has a series
of mempolicy enhancements out for review that, IMO, will benefit from
this series. In both cases, the removal of the custom zonelist for
MPOL_BIND is the important feature.

Lee

---

Changelog since V11r2
  o Rebase to 2.6.25-rc2-mm1

Changelog since V10
  o Rebase to 2.6.24-rc4-mm1
  o Clear up warnings in fs/buffer.c early in the patchset

Changelog since V9
  o Rebase to 2.6.24-rc2-mm1
  o Lookup the nodemask for each allocator callsite in mempolicy.c
  o Update NUMA statistics based on preferred zone, not first zonelist entry
  o When __GFP_THISNODE is specified with MPOL_BIND and the current node is
    not in the allowed nodemask, the first node in the mask will be used
  o Stick with using two zonelists instead of one because of excessive
    complexity with corner cases

Changelog since V8
  o Rebase to 2.6.24-rc2
  o Added ack for the OOM changes
  o Behave correctly when GFP_THISNODE and a node ID are specified
  o Clear up warning over type of nodes_intersects() function

Changelog since V7
  o Rebase to 2.6.23-rc8-mm2

Changelog since V6
  o Fix build bug in relation to memory controller combined with one-zonelist
  o Use while() instead of a stupid-looking for()
  o Instead of encoding zone index information in a pointer, this version
    introduces a structure that stores a zone pointer and its index 

Changelog since V5
  o Rebase to 2.6.23-rc4-mm1
  o Drop patch that replaces inline functions with macros

Changelog since V4
  o Rebase to -mm kernel. A host of collisions with the memoryless patches dealt with
  o Do not call wakeup_kswapd() for every zone in a zonelist
  o Dropped the FASTCALL removal
  o Have cursor in iterator advance earlier
  o Use nodes_and in cpuset_nodes_valid_mems_allowed()
  o Use defines instead of inlines; noticeably better performance on gcc-3.4,
    no difference on later compilers such as gcc 4.1
  o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
    currently inconclusive, but it definitely consumes at least one cache
    line

Changelog since V3
  o Fix compile error in the parisc change
  o Calculate gfp_zone only once in __alloc_pages
  o Calculate classzone_idx properly in get_page_from_freelist
  o Alter check so that zone id embedded may still be used on UP
  o Use Kamezawa-san's suggestion for skipping zones in zonelist
  o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
    removes the need for MPOL_BIND to have a custom zonelist
  o Move zonelist iterators and helpers to mm.h
  o Change _zones from struct zone * to unsigned long
  
Changelog since V2
  o shrink_zones() uses zonelist instead of zonelist->zones
  o hugetlb uses zonelist iterator
  o zone_idx information is embedded in zonelist pointers
  o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test

The following patches replace the multiple zonelists per node with two
zonelists that are filtered based on the GFP flags. As a set, the patches
fix a bug in the interaction of MPOL_BIND and ZONE_MOVABLE: with this
patchset, MPOL_BIND applies to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered an alternative fix for the
MPOL_BIND+ZONE_MOVABLE problem in 2.6.23, in place of the previously
discussed hack that filters only custom zonelists.

The first patch cleans up an inconsistency where direct reclaim used
zonelist->zones while other places used the zonelist itself.
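
For illustration, the interface change boils down to the following (a
minimal sketch taken from the diff in patch 1):

/* before: direct reclaim took a bare, NULL-terminated array of zones */
unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask);

/* after: callers pass the zonelist itself */
unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask);

/* e.g. free_more_memory() in fs/buffer.c becomes */
zonelist = &pgdat->node_zonelists[gfp_zone(GFP_NOFS)];
if (zonelist->zones[0])
        try_to_free_pages(zonelist, 0, GFP_NOFS);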

The second patch introduces a helper function, node_zonelist(), for looking
up the appropriate zonelist for a GFP mask, which simplifies patches later
in the set.
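
The helper is small; as defined in patch 2:

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        return NODE_DATA(nid)->node_zonelists + gfp_zone(flags);
}

/* call sites such as alloc_pages_node() then reduce to */
return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));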

The third patch defines and remembers the "preferred zone" for NUMA
statistics, as it is no longer always the first zone in a zonelist.
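
A sketch of the resulting interface, condensed from patch 3:

/* compare against an explicit preferred zone instead of assuming
 * that zonelist->zones[0] is the preferred one */
void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
        if (z->zone_pgdat == preferred_zone->zone_pgdat)
                __inc_zone_state(z, NUMA_HIT);
        else {
                __inc_zone_state(z, NUMA_MISS);
                __inc_zone_state(preferred_zone, NUMA_FOREIGN);
        }
        if (z->node == numa_node_id())
                __inc_zone_state(z, NUMA_LOCAL);
}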

The fourth patch replaces the multiple zonelists with two filtered zonelists.
Two are needed because the memoryless-nodes patchset introduced a second set
of zonelists for __GFP_THISNODE.
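
In code terms (condensed from patch 4), zonelist selection becomes:

/* node_zonelists[0]: all populated zones, ordered by distance (fallback)
 * node_zonelists[1]: only this node's zones (__GFP_THISNODE) */
static inline int gfp_zonelist(gfp_t flags)
{
        if (NUMA_BUILD && unlikely(flags & __GFP_THISNODE))
                return 1;
        return 0;
}

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
        return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}

/* and callers walk only the zones the GFP mask allows: */
for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
        /* try to allocate from zone */;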

The fifth patch introduces helper macros for retrieving the zone and node
indices of entries in a zonelist.
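
From patch 5, the zonelist entries and their accessors look like:

struct zoneref {
        struct zone *zone;      /* pointer to the actual zone */
        int zone_idx;           /* cached zone_idx(zoneref->zone) */
};

static inline struct zone *zonelist_zone(struct zoneref *zoneref)
{
        return zoneref->zone;
}

static inline int zonelist_zone_idx(struct zoneref *zoneref)
{
        return zoneref->zone_idx;
}

static inline int zonelist_node_idx(struct zoneref *zoneref)
{
        return zoneref->zone->node;     /* NUMA case; 0 otherwise */
}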

The final patch introduces filtering of the zonelists based on a nodemask. Two
zonelists exist per node, one for normal allocations and one for __GFP_THISNODE.
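
Patch 6 itself is not quoted in this excerpt; going by the changelog entry
about __alloc_pages_nodemask(), an MPOL_BIND allocation would end up looking
roughly like the sketch below (the exact signature and the policy field name
are assumptions, not copied from the patch):

/* hypothetical sketch: walk the regular fallback zonelist but skip zones
 * whose node is not in the MPOL_BIND nodemask, instead of building a
 * custom zonelist per policy */
struct page *page;

page = __alloc_pages_nodemask(gfp_mask, order,
                node_zonelist(numa_node_id(), gfp_mask),
                &policy->v.nodes);      /* assumed field name */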

Performance results varied depending on the machine configuration. In real
workloads, the gain/loss will depend on how much the userspace portion of
the workload benefits from having more cache available due to reduced
referencing of zonelists.

These are the ranges of performance losses/gains when running against
2.6.24-rc4-mm1 on a mix of i386, x86_64 and ppc64 machines, both NUMA
and non-NUMA.

			     loss   to  gain
Total CPU time on Kernbench: -0.86% to  1.13%
Elapsed   time on Kernbench: -0.79% to  0.76%
page_test from aim9:         -4.37% to  0.79%
brk_test  from aim9:         -0.71% to  4.07%
fork_test from aim9:         -1.84% to  4.60%
exec_test from aim9:         -0.71% to  1.08%


* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages
  2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
@ 2008-02-27 21:47 ` Lee Schermerhorn, Mel Gorman
  2008-02-27 21:47 ` [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask Lee Schermerhorn, Mel Gorman
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Lee Schermerhorn, Mel Gorman @ 2008-02-27 21:47 UTC
  To: akpm; +Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

[PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages

V11r3 against 2.6.25-rc2-mm1

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/buffer.c          |    8 ++++----
 include/linux/swap.h |    2 +-
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |   21 ++++++++++++---------
 4 files changed, 18 insertions(+), 15 deletions(-)

Index: linux-2.6.25-rc2-mm1/fs/buffer.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/fs/buffer.c	2008-02-27 16:28:07.000000000 -0500
+++ linux-2.6.25-rc2-mm1/fs/buffer.c	2008-02-27 16:28:09.000000000 -0500
@@ -368,16 +368,16 @@ void invalidate_bdev(struct block_device
  */
 static void free_more_memory(void)
 {
-	struct zone **zones;
+	struct zonelist *zonelist;
 	pg_data_t *pgdat;
 
 	wakeup_pdflush(1024);
 	yield();
 
 	for_each_online_pgdat(pgdat) {
-		zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
-		if (*zones)
-			try_to_free_pages(zones, 0, GFP_NOFS);
+		zonelist = &pgdat->node_zonelists[gfp_zone(GFP_NOFS)];
+		if (zonelist->zones[0])
+			try_to_free_pages(zonelist, 0, GFP_NOFS);
 	}
 }
 
Index: linux-2.6.25-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/swap.h	2008-02-27 16:28:07.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/swap.h	2008-02-27 16:28:09.000000000 -0500
@@ -181,7 +181,7 @@ extern int rotate_reclaimable_page(struc
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 							gfp_t gfp_mask);
Index: linux-2.6.25-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/page_alloc.c	2008-02-27 16:28:07.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/page_alloc.c	2008-02-27 16:28:09.000000000 -0500
@@ -1635,7 +1635,7 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;
Index: linux-2.6.25-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/vmscan.c	2008-02-27 16:28:07.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/vmscan.c	2008-02-27 16:28:09.000000000 -0500
@@ -1268,10 +1268,11 @@ static unsigned long shrink_zone(int pri
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static unsigned long shrink_zones(int priority, struct zone **zones,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	unsigned long nr_reclaimed = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 
@@ -1323,8 +1324,8 @@ static unsigned long shrink_zones(int pr
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
  */
-static unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask,
-					  struct scan_control *sc)
+static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
+					gfp_t gfp_mask, struct scan_control *sc)
 {
 	int priority;
 	int ret = 0;
@@ -1332,6 +1333,7 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 	if (scan_global_lru(sc))
@@ -1356,7 +1358,7 @@ static unsigned long do_try_to_free_page
 		sc->nr_io_pages = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, sc);
+		nr_reclaimed += shrink_zones(priority, zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
@@ -1421,7 +1423,8 @@ out:
 	return ret;
 }
 
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+								gfp_t gfp_mask)
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
@@ -1434,7 +1437,7 @@ unsigned long try_to_free_pages(struct z
 		.isolate_pages = isolate_pages_global,
 	};
 
-	return do_try_to_free_pages(zones, gfp_mask, &sc);
+	return do_try_to_free_pages(zonelist, gfp_mask, &sc);
 }
 
 #ifdef CONFIG_CGROUP_MEM_CONT
@@ -1452,11 +1455,11 @@ unsigned long try_to_free_mem_cgroup_pag
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
-	struct zone **zones;
+	struct zonelist *zonelist;
 	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
 
-	zones = NODE_DATA(numa_node_id())->node_zonelists[target_zone].zones;
-	if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
+	zonelist = &NODE_DATA(numa_node_id())->node_zonelists[target_zone];
+	if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))
 		return 1;
 	return 0;
 }


* [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask
  2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
  2008-02-27 21:47 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Lee Schermerhorn, Mel Gorman
@ 2008-02-27 21:47 ` Lee Schermerhorn, Mel Gorman
  2008-02-27 21:47 ` [PATCH 3/6] Remember what the preferred zone is for zone_statistics Lee Schermerhorn, Mel Gorman
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 37+ messages in thread
From: Lee Schermerhorn, Mel Gorman @ 2008-02-27 21:47 UTC
  To: akpm; +Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

[PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask

V11r3 against 2.6.25-rc2-mm1

This patch introduces a node_zonelist() helper function. It is used to look
up the appropriate zonelist given a node and a GFP mask. The patch on its own
is a cleanup but it helps clarify parts of the two-zonelist-per-node patchset.
If necessary, it can be merged with the next patch in this set without
problems.

Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 drivers/char/sysrq.c      |    3 +--
 fs/buffer.c               |    6 +++---
 include/linux/gfp.h       |    8 ++++++--
 include/linux/mempolicy.h |    2 +-
 mm/mempolicy.c            |    6 +++---
 mm/page_alloc.c           |    3 +--
 mm/slab.c                 |    3 +--
 mm/slub.c                 |    3 +--
 8 files changed, 17 insertions(+), 17 deletions(-)

Index: linux-2.6.25-rc2-mm1/drivers/char/sysrq.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/drivers/char/sysrq.c	2008-02-27 16:28:05.000000000 -0500
+++ linux-2.6.25-rc2-mm1/drivers/char/sysrq.c	2008-02-27 16:28:11.000000000 -0500
@@ -271,8 +271,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
-			GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
Index: linux-2.6.25-rc2-mm1/fs/buffer.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/fs/buffer.c	2008-02-27 16:28:09.000000000 -0500
+++ linux-2.6.25-rc2-mm1/fs/buffer.c	2008-02-27 16:28:11.000000000 -0500
@@ -369,13 +369,13 @@ void invalidate_bdev(struct block_device
 static void free_more_memory(void)
 {
 	struct zonelist *zonelist;
-	pg_data_t *pgdat;
+	int nid;
 
 	wakeup_pdflush(1024);
 	yield();
 
-	for_each_online_pgdat(pgdat) {
-		zonelist = &pgdat->node_zonelists[gfp_zone(GFP_NOFS)];
+	for_each_online_node(nid) {
+		zonelist = node_zonelist(nid, GFP_NOFS);
 		if (zonelist->zones[0])
 			try_to_free_pages(zonelist, 0, GFP_NOFS);
 	}
Index: linux-2.6.25-rc2-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/gfp.h	2008-02-27 16:28:05.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/gfp.h	2008-02-27 16:28:11.000000000 -0500
@@ -154,10 +154,15 @@ static inline enum zone_type gfp_zone(gf
 /*
  * We get the zone list from the current node and the gfp_mask.
  * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
+ * There are many zonelists per node, two for each active zone.
  *
  * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
  * optimized to &contig_page_data at compile-time.
  */
+static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
+{
+	return NODE_DATA(nid)->node_zonelists + gfp_zone(flags);
+}
 
 #ifndef HAVE_ARCH_FREE_PAGE
 static inline void arch_free_page(struct page *page, int order) { }
@@ -178,8 +183,7 @@ static inline struct page *alloc_pages_n
 	if (nid < 0)
 		nid = numa_node_id();
 
-	return __alloc_pages(gfp_mask, order,
-		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
 #ifdef CONFIG_NUMA
Index: linux-2.6.25-rc2-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/mempolicy.h	2008-02-27 16:28:05.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/mempolicy.h	2008-02-27 16:28:11.000000000 -0500
@@ -241,7 +241,7 @@ static inline void mpol_fix_fork_child_f
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
  		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
 {
-	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
+	return node_zonelist(0, gfp_flags);
 }
 
 static inline int do_migrate_pages(struct mm_struct *mm,
Index: linux-2.6.25-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/mempolicy.c	2008-02-27 16:28:05.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/mempolicy.c	2008-02-27 16:28:11.000000000 -0500
@@ -1183,7 +1183,7 @@ static struct zonelist *zonelist_policy(
 		nd = 0;
 		BUG();
 	}
-	return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
+	return node_zonelist(nd, gfp);
 }
 
 /* Do dynamic interleaving for a process */
@@ -1297,7 +1297,7 @@ struct zonelist *huge_zonelist(struct vm
 
 		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
 		__mpol_free(pol);		/* finished with pol */
-		return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
+		return node_zonelist(nid, gfp_flags);
 	}
 
 	zl = zonelist_policy(GFP_HIGHUSER, pol);
@@ -1319,7 +1319,7 @@ static struct page *alloc_page_interleav
 	struct zonelist *zl;
 	struct page *page;
 
-	zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
+	zl = node_zonelist(nid, gfp);
 	page = __alloc_pages(gfp, order, zl);
 	if (page && page_zone(page) == zl->zones[0])
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
Index: linux-2.6.25-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/page_alloc.c	2008-02-27 16:28:09.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/page_alloc.c	2008-02-27 16:28:11.000000000 -0500
@@ -1783,10 +1783,9 @@ EXPORT_SYMBOL(free_pages);
 static unsigned int nr_free_zone_pages(int offset)
 {
 	/* Just pick one node, since fallback list is circular */
-	pg_data_t *pgdat = NODE_DATA(numa_node_id());
 	unsigned int sum = 0;
 
-	struct zonelist *zonelist = pgdat->node_zonelists + offset;
+	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
 	struct zone **zonep = zonelist->zones;
 	struct zone *zone;
 
Index: linux-2.6.25-rc2-mm1/mm/slab.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/slab.c	2008-02-27 16:28:05.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/slab.c	2008-02-27 16:28:11.000000000 -0500
@@ -3251,8 +3251,7 @@ static void *fallback_alloc(struct kmem_
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	zonelist = &NODE_DATA(slab_node(current->mempolicy))
-			->node_zonelists[gfp_zone(flags)];
+	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
 retry:
Index: linux-2.6.25-rc2-mm1/mm/slub.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/slub.c	2008-02-27 16:28:05.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/slub.c	2008-02-27 16:28:11.000000000 -0500
@@ -1324,8 +1324,7 @@ static struct page *get_any_partial(stru
 			get_cycles() % 1024 > s->remote_node_defrag_ratio)
 		return NULL;
 
-	zonelist = &NODE_DATA(
-		slab_node(current->mempolicy))->node_zonelists[gfp_zone(flags)];
+	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 	for (z = zonelist->zones; *z; z++) {
 		struct kmem_cache_node *n;
 


* [PATCH 3/6] Remember what the preferred zone is for zone_statistics
  2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
  2008-02-27 21:47 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Lee Schermerhorn, Mel Gorman
  2008-02-27 21:47 ` [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask Lee Schermerhorn, Mel Gorman
@ 2008-02-27 21:47 ` Lee Schermerhorn, Mel Gorman
  2008-02-27 22:00   ` Christoph Lameter
  2008-02-29  2:30   ` KAMEZAWA Hiroyuki
  2008-02-27 21:47 ` [PATCH 4/6] Use two zonelists that are filtered by GFP mask Lee Schermerhorn, Mel Gorman
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 37+ messages in thread
From: Lee Schermerhorn, Mel Gorman @ 2008-02-27 21:47 UTC
  To: akpm; +Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

[PATCH 3/6] Remember what the preferred zone is for zone_statistics

V11r3 against 2.6.25-rc2-mm1

On NUMA, zone_statistics() is used to record events like NUMA hit, miss
and foreign. It assumes that the first zone in a zonelist is the preferred
zone. When the multiple zonelists are replaced by a single filtered one,
this is no longer the case.

This patch records what the preferred zone is rather than assuming the
first zone in the zonelist is it. This simplifies the reading of later
patches in this set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/vmstat.h |    2 +-
 mm/page_alloc.c        |    9 +++++----
 mm/vmstat.c            |    6 +++---
 3 files changed, 9 insertions(+), 8 deletions(-)

Index: linux-2.6.25-rc2-mm1/include/linux/vmstat.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/vmstat.h	2008-02-27 16:28:04.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/vmstat.h	2008-02-27 16:28:14.000000000 -0500
@@ -174,7 +174,7 @@ static inline unsigned long node_page_st
 		zone_page_state(&zones[ZONE_MOVABLE], item);
 }
 
-extern void zone_statistics(struct zonelist *, struct zone *);
+extern void zone_statistics(struct zone *, struct zone *);
 
 #else
 
Index: linux-2.6.25-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/page_alloc.c	2008-02-27 16:28:11.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/page_alloc.c	2008-02-27 16:28:14.000000000 -0500
@@ -1060,7 +1060,7 @@ void split_page(struct page *page, unsig
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
  */
-static struct page *buffered_rmqueue(struct zonelist *zonelist,
+static struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, int order, gfp_t gfp_flags)
 {
 	unsigned long flags;
@@ -1112,7 +1112,7 @@ again:
 	}
 
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
-	zone_statistics(zonelist, zone);
+	zone_statistics(preferred_zone, zone);
 	local_irq_restore(flags);
 	put_cpu();
 
@@ -1393,7 +1393,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	struct zone **z;
 	struct page *page = NULL;
 	int classzone_idx = zone_idx(zonelist->zones[0]);
-	struct zone *zone;
+	struct zone *zone, *preferred_zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
@@ -1405,6 +1405,7 @@ zonelist_scan:
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
 	z = zonelist->zones;
+	preferred_zone = *z;
 
 	do {
 		/*
@@ -1443,7 +1444,7 @@ zonelist_scan:
 			}
 		}
 
-		page = buffered_rmqueue(zonelist, zone, order, gfp_mask);
+		page = buffered_rmqueue(preferred_zone, zone, order, gfp_mask);
 		if (page)
 			break;
 this_zone_full:
Index: linux-2.6.25-rc2-mm1/mm/vmstat.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/vmstat.c	2008-02-27 16:28:04.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/vmstat.c	2008-02-27 16:28:14.000000000 -0500
@@ -365,13 +365,13 @@ void refresh_cpu_vm_stats(int cpu)
  *
  * Must be called with interrupts disabled.
  */
-void zone_statistics(struct zonelist *zonelist, struct zone *z)
+void zone_statistics(struct zone *preferred_zone, struct zone *z)
 {
-	if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
+	if (z->zone_pgdat == preferred_zone->zone_pgdat) {
 		__inc_zone_state(z, NUMA_HIT);
 	} else {
 		__inc_zone_state(z, NUMA_MISS);
-		__inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
+		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
 	}
 	if (z->node == numa_node_id())
 		__inc_zone_state(z, NUMA_LOCAL);


* [PATCH 4/6] Use two zonelists that are filtered by GFP mask
  2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
                   ` (2 preceding siblings ...)
  2008-02-27 21:47 ` [PATCH 3/6] Remember what the preferred zone is for zone_statistics Lee Schermerhorn, Mel Gorman
@ 2008-02-27 21:47 ` Lee Schermerhorn, Mel Gorman
  2008-02-28 21:32   ` Andrew Morton
  2008-02-27 21:47 ` [PATCH 5/6] Have zonelist contain structs with both a zone pointer and zone_idx Lee Schermerhorn, Mel Gorman
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Lee Schermerhorn, Mel Gorman @ 2008-02-27 21:47 UTC
  To: akpm; +Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

[PATCH 4/6] Use two zonelists that are filtered by GFP mask

V11r3 against 2.6.25-rc2-mm1

Currently a node has two sets of zonelists: one with a zonelist for each
zone type in the system, and a second set for GFP_THISNODE allocations.
Based on the zones allowed by a gfp mask, one of these zonelists is
selected. All of these zonelists consume memory and occupy cache lines.

This patch replaces the multiple zonelists per-node with two zonelists. The
first contains all populated zones in the system, ordered by distance, for
fallback allocations when the target/preferred node has no free pages.  The
second contains all populated zones in the node suitable for GFP_THISNODE
allocations.

An iterator macro, for_each_zone_zonelist(), is introduced that iterates
through each zone in the selected zonelist that is allowed by the GFP flags.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 arch/parisc/mm/init.c  |   11 ++-
 fs/buffer.c            |   10 +-
 include/linux/gfp.h    |   13 +++
 include/linux/mmzone.h |   65 ++++++++++++------
 mm/hugetlb.c           |    8 +-
 mm/oom_kill.c          |    8 +-
 mm/page_alloc.c        |  170 +++++++++++++++++++++----------------------------
 mm/slab.c              |    8 +-
 mm/slub.c              |    8 +-
 mm/vmscan.c            |   21 ++----
 10 files changed, 168 insertions(+), 154 deletions(-)

Index: linux-2.6.25-rc2-mm1/arch/parisc/mm/init.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/arch/parisc/mm/init.c	2008-02-27 16:28:03.000000000 -0500
+++ linux-2.6.25-rc2-mm1/arch/parisc/mm/init.c	2008-02-27 16:28:16.000000000 -0500
@@ -603,15 +603,18 @@ void show_mem(void)
 #ifdef CONFIG_DISCONTIGMEM
 	{
 		struct zonelist *zl;
-		int i, j, k;
+		int i, j;
 
 		for (i = 0; i < npmem_ranges; i++) {
+			zl = node_zonelist(i);
 			for (j = 0; j < MAX_NR_ZONES; j++) {
-				zl = NODE_DATA(i)->node_zonelists + j;
+				struct zone **z;
+				struct zone *zone;
 
 				printk("Zone list for zone %d on node %d: ", j, i);
-				for (k = 0; zl->zones[k] != NULL; k++) 
-					printk("[%d/%s] ", zone_to_nid(zl->zones[k]), zl->zones[k]->name);
+				for_each_zone_zonelist(zone, z, zl, j)
+					printk("[%d/%s] ", zone_to_nid(zone),
+								zone->name);
 				printk("\n");
 			}
 		}
Index: linux-2.6.25-rc2-mm1/fs/buffer.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/fs/buffer.c	2008-02-27 16:28:11.000000000 -0500
+++ linux-2.6.25-rc2-mm1/fs/buffer.c	2008-02-27 16:28:16.000000000 -0500
@@ -368,16 +368,18 @@ void invalidate_bdev(struct block_device
  */
 static void free_more_memory(void)
 {
-	struct zonelist *zonelist;
+	struct zone **zones;
 	int nid;
 
 	wakeup_pdflush(1024);
 	yield();
 
 	for_each_online_node(nid) {
-		zonelist = node_zonelist(nid, GFP_NOFS);
-		if (zonelist->zones[0])
-			try_to_free_pages(zonelist, 0, GFP_NOFS);
+		zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
+						gfp_zone(GFP_NOFS));
+		if (*zones)
+			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
+						GFP_NOFS);
 	}
 }
 
Index: linux-2.6.25-rc2-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/gfp.h	2008-02-27 16:28:11.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/gfp.h	2008-02-27 16:28:16.000000000 -0500
@@ -151,17 +151,26 @@ static inline enum zone_type gfp_zone(gf
  * virtual kernel addresses to the allocated page(s).
  */
 
+static inline int gfp_zonelist(gfp_t flags)
+{
+	if (NUMA_BUILD && unlikely(flags & __GFP_THISNODE))
+		return 1;
+
+	return 0;
+}
+
 /*
  * We get the zone list from the current node and the gfp_mask.
  * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
- * There are many zonelists per node, two for each active zone.
+ * There are two zonelists per node, one for all zones with memory and
+ * one containing just zones from the node the zonelist belongs to.
  *
  * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
  * optimized to &contig_page_data at compile-time.
  */
 static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
 {
-	return NODE_DATA(nid)->node_zonelists + gfp_zone(flags);
+	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
 }
 
 #ifndef HAVE_ARCH_FREE_PAGE
Index: linux-2.6.25-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/mmzone.h	2008-02-27 16:28:03.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/mmzone.h	2008-02-27 16:28:16.000000000 -0500
@@ -393,10 +393,10 @@ static inline int zone_is_oom_locked(con
  * The NUMA zonelists are doubled becausse we need zonelists that restrict the
  * allocations to a single node for GFP_THISNODE.
  *
- * [0 .. MAX_NR_ZONES -1] 		: Zonelists with fallback
- * [MAZ_NR_ZONES ... MAZ_ZONELISTS -1]  : No fallback (GFP_THISNODE)
+ * [0]	: Zonelist with fallback
+ * [1]	: No fallback (GFP_THISNODE)
  */
-#define MAX_ZONELISTS (2 * MAX_NR_ZONES)
+#define MAX_ZONELISTS 2
 
 
 /*
@@ -464,7 +464,7 @@ struct zonelist_cache {
 	unsigned long last_full_zap;		/* when last zap'd (jiffies) */
 };
 #else
-#define MAX_ZONELISTS MAX_NR_ZONES
+#define MAX_ZONELISTS 1
 struct zonelist_cache;
 #endif
 
@@ -486,24 +486,6 @@ struct zonelist {
 #endif
 };
 
-#ifdef CONFIG_NUMA
-/*
- * Only custom zonelists like MPOL_BIND need to be filtered as part of
- * policies. As described in the comment for struct zonelist_cache, these
- * zonelists will not have a zlcache so zlcache_ptr will not be set. Use
- * that to determine if the zonelists needs to be filtered or not.
- */
-static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
-{
-	return !zonelist->zlcache_ptr;
-}
-#else
-static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
-{
-	return 0;
-}
-#endif /* CONFIG_NUMA */
-
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
 struct node_active_region {
 	unsigned long start_pfn;
@@ -732,6 +714,45 @@ extern struct zone *next_zone(struct zon
 	     zone;					\
 	     zone = next_zone(zone))
 
+/* Returns the first zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+					enum zone_type highest_zoneidx)
+{
+	struct zone **z;
+
+	/* Find the first suitable zone to use for the allocation */
+	z = zonelist->zones;
+	while (*z && zone_idx(*z) > highest_zoneidx)
+		z++;
+
+	return z;
+}
+
+/* Returns the next zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **next_zones_zonelist(struct zone **z,
+					enum zone_type highest_zoneidx)
+{
+	/* Find the next suitable zone to use for the allocation */
+	while (*z && zone_idx(*z) > highest_zoneidx)
+		z++;
+
+	return z;
+}
+
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates though all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
+		zone;							\
+		z = next_zones_zonelist(z, highidx), zone = *z++)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
Index: linux-2.6.25-rc2-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/hugetlb.c	2008-02-27 16:28:03.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/hugetlb.c	2008-02-27 16:28:16.000000000 -0500
@@ -79,11 +79,11 @@ static struct page *dequeue_huge_page(st
 	struct mempolicy *mpol;
 	struct zonelist *zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol);
-	struct zone **z;
+	struct zone *zone, **z;
 
-	for (z = zonelist->zones; *z; z++) {
-		nid = zone_to_nid(*z);
-		if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
+	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
+		nid = zone_to_nid(zone);
+		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
 		    !list_empty(&hugepage_freelists[nid])) {
 			page = list_entry(hugepage_freelists[nid].next,
 					  struct page, lru);
Index: linux-2.6.25-rc2-mm1/mm/oom_kill.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/oom_kill.c	2008-02-27 16:28:03.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/oom_kill.c	2008-02-27 16:28:16.000000000 -0500
@@ -174,12 +174,14 @@ static inline enum oom_constraint constr
 						    gfp_t gfp_mask)
 {
 #ifdef CONFIG_NUMA
+	struct zone *zone;
 	struct zone **z;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	nodemask_t nodes = node_states[N_HIGH_MEMORY];
 
-	for (z = zonelist->zones; *z; z++)
-		if (cpuset_zone_allowed_softwall(*z, gfp_mask))
-			node_clear(zone_to_nid(*z), nodes);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
+			node_clear(zone_to_nid(zone), nodes);
 		else
 			return CONSTRAINT_CPUSET;
 
Index: linux-2.6.25-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/page_alloc.c	2008-02-27 16:28:14.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/page_alloc.c	2008-02-27 16:28:16.000000000 -0500
@@ -1388,42 +1388,29 @@ static void zlc_mark_zone_full(struct zo
  */
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, int alloc_flags)
+		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	struct zone **z;
 	struct page *page = NULL;
-	int classzone_idx = zone_idx(zonelist->zones[0]);
+	int classzone_idx;
 	struct zone *zone, *preferred_zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
-	enum zone_type highest_zoneidx = -1; /* Gets set for policy zonelists */
+
+	z = first_zones_zonelist(zonelist, high_zoneidx);
+	classzone_idx = zone_idx(*z);
+	preferred_zone = *z;
 
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	z = zonelist->zones;
-	preferred_zone = *z;
-
-	do {
-		/*
-		 * In NUMA, this could be a policy zonelist which contains
-		 * zones that may not be allowed by the current gfp_mask.
-		 * Check the zone is allowed by the current flags
-		 */
-		if (unlikely(alloc_should_filter_zonelist(zonelist))) {
-			if (highest_zoneidx == -1)
-				highest_zoneidx = gfp_zone(gfp_mask);
-			if (zone_idx(*z) > highest_zoneidx)
-				continue;
-		}
-
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		zone = *z;
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
@@ -1457,7 +1444,7 @@ try_next_zone:
 			zlc_active = 1;
 			did_zlc_setup = 1;
 		}
-	} while (*(++z) != NULL);
+	}
 
 	if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
 		/* Disable zlc cache for second zonelist scan */
@@ -1531,6 +1518,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 		struct zonelist *zonelist)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone **z;
 	struct page *page;
 	struct reclaim_state reclaim_state;
@@ -1556,7 +1544,7 @@ restart:
 	}
 
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
 
@@ -1600,7 +1588,8 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	page = get_page_from_freelist(gfp_mask, order, zonelist,
+						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
 
@@ -1613,7 +1602,7 @@ rebalance:
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
 			page = get_page_from_freelist(gfp_mask, order,
-				zonelist, ALLOC_NO_WATERMARKS);
+				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
 			if (gfp_mask & __GFP_NOFAIL) {
@@ -1648,7 +1637,7 @@ nofail_alloc:
 
 	if (likely(did_some_progress)) {
 		page = get_page_from_freelist(gfp_mask, order,
-						zonelist, alloc_flags);
+					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
@@ -1664,7 +1653,7 @@ nofail_alloc:
 		 * under heavy pressure.
 		 */
 		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
 			clear_zonelist_oom(zonelist);
 			goto got_pg;
@@ -1783,14 +1772,15 @@ EXPORT_SYMBOL(free_pages);
 
 static unsigned int nr_free_zone_pages(int offset)
 {
+	struct zone **z;
+	struct zone *zone;
+
 	/* Just pick one node, since fallback list is circular */
 	unsigned int sum = 0;
 
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
-	struct zone **zonep = zonelist->zones;
-	struct zone *zone;
 
-	for (zone = *zonep++; zone; zone = *zonep++) {
+	for_each_zone_zonelist(zone, z, zonelist, offset) {
 		unsigned long size = zone->present_pages;
 		unsigned long high = zone->pages_high;
 		if (size > high)
@@ -2148,17 +2138,15 @@ static int find_next_best_node(int node,
  */
 static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 {
-	enum zone_type i;
 	int j;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		for (j = 0; zonelist->zones[j] != NULL; j++)
-			;
- 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		zonelist->zones[j] = NULL;
-	}
+	zonelist = &pgdat->node_zonelists[0];
+	for (j = 0; zonelist->zones[j] != NULL; j++)
+		;
+	j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+							MAX_NR_ZONES - 1);
+	zonelist->zones[j] = NULL;
 }
 
 /*
@@ -2166,15 +2154,12 @@ static void build_zonelists_in_node_orde
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-	enum zone_type i;
 	int j;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + MAX_NR_ZONES + i;
-		j = build_zonelists_node(pgdat, zonelist, 0, i);
-		zonelist->zones[j] = NULL;
-	}
+	zonelist = &pgdat->node_zonelists[1];
+	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
+	zonelist->zones[j] = NULL;
 }
 
 /*
@@ -2187,27 +2172,24 @@ static int node_order[MAX_NUMNODES];
 
 static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
-	enum zone_type i;
 	int pos, j, node;
 	int zone_type;		/* needs to be signed */
 	struct zone *z;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		pos = 0;
-		for (zone_type = i; zone_type >= 0; zone_type--) {
-			for (j = 0; j < nr_nodes; j++) {
-				node = node_order[j];
-				z = &NODE_DATA(node)->node_zones[zone_type];
-				if (populated_zone(z)) {
-					zonelist->zones[pos++] = z;
-					check_highest_zone(zone_type);
-				}
+	zonelist = &pgdat->node_zonelists[0];
+	pos = 0;
+	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
+		for (j = 0; j < nr_nodes; j++) {
+			node = node_order[j];
+			z = &NODE_DATA(node)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				zonelist->zones[pos++] = z;
+				check_highest_zone(zone_type);
 			}
 		}
-		zonelist->zones[pos] = NULL;
 	}
+	zonelist->zones[pos] = NULL;
 }
 
 static int default_zonelist_order(void)
@@ -2334,19 +2316,15 @@ static void build_zonelists(pg_data_t *p
 /* Construct the zonelist performance cache - see further mmzone.h */
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zonelist *zonelist;
-		struct zonelist_cache *zlc;
-		struct zone **z;
+	struct zonelist *zonelist;
+	struct zonelist_cache *zlc;
+	struct zone **z;
 
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
-		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-		for (z = zonelist->zones; *z; z++)
-			zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
-	}
+	zonelist = &pgdat->node_zonelists[0];
+	zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
+	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+	for (z = zonelist->zones; *z; z++)
+		zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
 }
 
 
@@ -2360,45 +2338,43 @@ static void set_zonelist_order(void)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
-	enum zone_type i,j;
+	enum zone_type j;
+	struct zonelist *zonelist;
 
 	local_node = pgdat->node_id;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zonelist *zonelist;
-
-		zonelist = pgdat->node_zonelists + i;
 
- 		j = build_zonelists_node(pgdat, zonelist, 0, i);
- 		/*
- 		 * Now we build the zonelist so that it contains the zones
- 		 * of all the other nodes.
- 		 * We don't want to pressure a particular node, so when
- 		 * building the zones for node N, we make sure that the
- 		 * zones coming right after the local ones are those from
- 		 * node N+1 (modulo N)
- 		 */
-		for (node = local_node + 1; node < MAX_NUMNODES; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		}
-		for (node = 0; node < local_node; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		}
+	zonelist = &pgdat->node_zonelists[0];
+	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
 
-		zonelist->zones[j] = NULL;
+	/*
+	 * Now we build the zonelist so that it contains the zones
+	 * of all the other nodes.
+	 * We don't want to pressure a particular node, so when
+	 * building the zones for node N, we make sure that the
+	 * zones coming right after the local ones are those from
+	 * node N+1 (modulo N)
+	 */
+	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+		if (!node_online(node))
+			continue;
+		j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+							MAX_NR_ZONES - 1);
 	}
+	for (node = 0; node < local_node; node++) {
+		if (!node_online(node))
+			continue;
+		j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+							MAX_NR_ZONES - 1);
+	}
+
+	zonelist->zones[j] = NULL;
 }
 
 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		pgdat->node_zonelists[i].zlcache_ptr = NULL;
+	pgdat->node_zonelists[0].zlcache_ptr = NULL;
+	pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */
Index: linux-2.6.25-rc2-mm1/mm/slab.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/slab.c	2008-02-27 16:28:11.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/slab.c	2008-02-27 16:28:16.000000000 -0500
@@ -3245,6 +3245,8 @@ static void *fallback_alloc(struct kmem_
 	struct zonelist *zonelist;
 	gfp_t local_flags;
 	struct zone **z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
 	int nid;
 
@@ -3259,10 +3261,10 @@ retry:
 	 * Look through allowed nodes for objects available
 	 * from existing per node queues.
 	 */
-	for (z = zonelist->zones; *z && !obj; z++) {
-		nid = zone_to_nid(*z);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		nid = zone_to_nid(zone);
 
-		if (cpuset_zone_allowed_hardwall(*z, flags) &&
+		if (cpuset_zone_allowed_hardwall(zone, flags) &&
 			cache->nodelists[nid] &&
 			cache->nodelists[nid]->free_objects)
 				obj = ____cache_alloc_node(cache,
Index: linux-2.6.25-rc2-mm1/mm/slub.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/slub.c	2008-02-27 16:28:11.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/slub.c	2008-02-27 16:28:16.000000000 -0500
@@ -1300,6 +1300,8 @@ static struct page *get_any_partial(stru
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
 	struct zone **z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);
 	struct page *page;
 
 	/*
@@ -1325,12 +1327,12 @@ static struct page *get_any_partial(stru
 		return NULL;
 
 	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
-	for (z = zonelist->zones; *z; z++) {
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		struct kmem_cache_node *n;
 
-		n = get_node(s, zone_to_nid(*z));
+		n = get_node(s, zone_to_nid(zone));
 
-		if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
+		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > MIN_PARTIAL) {
 			page = get_partial_node(n);
 			if (page)
Index: linux-2.6.25-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/vmscan.c	2008-02-27 16:28:09.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/vmscan.c	2008-02-27 16:28:16.000000000 -0500
@@ -1271,15 +1271,13 @@ static unsigned long shrink_zone(int pri
 static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
+	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long nr_reclaimed = 0;
-	struct zone **zones = zonelist->zones;
-	int i;
-
+	struct zone **z;
+	struct zone *zone;
 
 	sc->all_unreclaimable = 1;
-	for (i = 0; zones[i] != NULL; i++) {
-		struct zone *zone = zones[i];
-
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		if (!populated_zone(zone))
 			continue;
 		/*
@@ -1333,8 +1331,9 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
-	struct zone **zones = zonelist->zones;
-	int i;
+	struct zone **z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 
 	if (scan_global_lru(sc))
 		count_vm_event(ALLOCSTALL);
@@ -1342,8 +1341,7 @@ static unsigned long do_try_to_free_page
 	 * mem_cgroup will not do shrink_slab.
 	 */
 	if (scan_global_lru(sc)) {
-		for (i = 0; zones[i] != NULL; i++) {
-			struct zone *zone = zones[i];
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
@@ -1409,8 +1407,7 @@ out:
 		priority = 0;
 
 	if (scan_global_lru(sc)) {
-		for (i = 0; zones[i] != NULL; i++) {
-			struct zone *zone = zones[i];
+		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;


* [PATCH 5/6] Have zonelist contain structs with both a zone pointer and zone_idx
  2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
                   ` (3 preceding siblings ...)
  2008-02-27 21:47 ` [PATCH 4/6] Use two zonelists that are filtered by GFP mask Lee Schermerhorn, Mel Gorman
@ 2008-02-27 21:47 ` Lee Schermerhorn, Mel Gorman
  2008-02-29  7:49   ` KOSAKI Motohiro
  2008-02-27 21:47 ` [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask Lee Schermerhorn, Mel Gorman
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 37+ messages in thread
From: Lee Schermerhorn, Mel Gorman @ 2008-02-27 21:47 UTC
  To: akpm; +Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

[PATCH 5/6] Have zonelist contain structs with both a zone pointer and zone_idx

V11r3 against 2.6.25-rc2-mm1

Filtering zonelists requires very frequent use of zone_idx(). This is costly
as it involves a lookup of another structure and a subtraction operation. As
the zone_idx is often required, it should be quickly accessible. The node
idx could also be stored here if it were found that accessing zone->node is
significant, which may be the case on workloads where nodemasks are heavily
used.

This patch introduces a struct zoneref to store a zone pointer and a zone
index.  The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary. Helpers are given for accessing the zone index
as well as the node index.

[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding
information in pointers]

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 arch/parisc/mm/init.c  |    2 -
 fs/buffer.c            |    6 ++--
 include/linux/mmzone.h |   64 ++++++++++++++++++++++++++++++++++++++--------
 include/linux/oom.h    |    4 +-
 kernel/cpuset.c        |    4 +-
 mm/hugetlb.c           |    3 +-
 mm/mempolicy.c         |   36 +++++++++++++++----------
 mm/oom_kill.c          |   45 +++++++++++++++-----------------
 mm/page_alloc.c        |   68 +++++++++++++++++++++++++++----------------------
 mm/slab.c              |    2 -
 mm/slub.c              |    2 -
 mm/vmscan.c            |    6 ++--
 12 files changed, 151 insertions(+), 91 deletions(-)

Index: linux-2.6.25-rc2-mm1/arch/parisc/mm/init.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/arch/parisc/mm/init.c	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/arch/parisc/mm/init.c	2008-02-27 16:28:17.000000000 -0500
@@ -608,7 +608,7 @@ void show_mem(void)
 		for (i = 0; i < npmem_ranges; i++) {
 			zl = node_zonelist(i);
 			for (j = 0; j < MAX_NR_ZONES; j++) {
-				struct zone **z;
+				struct zoneref *z;
 				struct zone *zone;
 
 				printk("Zone list for zone %d on node %d: ", j, i);
Index: linux-2.6.25-rc2-mm1/fs/buffer.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/fs/buffer.c	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/fs/buffer.c	2008-02-27 16:28:17.000000000 -0500
@@ -368,16 +368,16 @@ void invalidate_bdev(struct block_device
  */
 static void free_more_memory(void)
 {
-	struct zone **zones;
+	struct zoneref *zrefs;
 	int nid;
 
 	wakeup_pdflush(1024);
 	yield();
 
 	for_each_online_node(nid) {
-		zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
+		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
 						gfp_zone(GFP_NOFS));
-		if (*zones)
+		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
 	}
Index: linux-2.6.25-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/mmzone.h	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/mmzone.h	2008-02-27 16:28:17.000000000 -0500
@@ -469,6 +469,15 @@ struct zonelist_cache;
 #endif
 
 /*
+ * This struct contains information about a zone in a zonelist. It is stored
+ * here to avoid dereferences into large structures and lookups of tables
+ */
+struct zoneref {
+	struct zone *zone;	/* Pointer to actual zone */
+	int zone_idx;		/* zone_idx(zoneref->zone) */
+};
+
+/*
  * One allocation request operates on a zonelist. A zonelist
  * is a list of zones, the first one is the 'goal' of the
  * allocation, the other zones are fallback zones, in decreasing
@@ -476,11 +485,18 @@ struct zonelist_cache;
  *
  * If zlcache_ptr is not NULL, then it is just the address of zlcache,
  * as explained above.  If zlcache_ptr is NULL, there is no zlcache.
+ * *
+ * To speed the reading of the zonelist, the zonerefs contain the zone index
+ * of the entry being read. Helper functions to access information given
+ * a struct zoneref are
+ *
+ * zonelist_zone()	- Return the struct zone * for an entry in _zonerefs
+ * zonelist_zone_idx()	- Return the index of the zone for an entry
+ * zonelist_node_idx()	- Return the index of the node for an entry
  */
-
 struct zonelist {
 	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
-	struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];      // NULL delimited
+	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
 #ifdef CONFIG_NUMA
 	struct zonelist_cache zlcache;			     // optional ...
 #endif
@@ -714,26 +730,52 @@ extern struct zone *next_zone(struct zon
 	     zone;					\
 	     zone = next_zone(zone))
 
+static inline struct zone *zonelist_zone(struct zoneref *zoneref)
+{
+	return zoneref->zone;
+}
+
+static inline int zonelist_zone_idx(struct zoneref *zoneref)
+{
+	return zoneref->zone_idx;
+}
+
+static inline int zonelist_node_idx(struct zoneref *zoneref)
+{
+#ifdef CONFIG_NUMA
+	/* zone_to_nid not available in this context */
+	return zoneref->zone->node;
+#else
+	return 0;
+#endif /* CONFIG_NUMA */
+}
+
+static inline void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
+{
+	zoneref->zone = zone;
+	zoneref->zone_idx = zone_idx(zone);
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
-static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx)
 {
-	struct zone **z;
+	struct zoneref *z;
 
 	/* Find the first suitable zone to use for the allocation */
-	z = zonelist->zones;
-	while (*z && zone_idx(*z) > highest_zoneidx)
+	z = zonelist->_zonerefs;
+	while (zonelist_zone_idx(z) > highest_zoneidx)
 		z++;
 
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
-static inline struct zone **next_zones_zonelist(struct zone **z,
+static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx)
 {
 	/* Find the next suitable zone to use for the allocation */
-	while (*z && zone_idx(*z) > highest_zoneidx)
+	while (zonelist_zone_idx(z) > highest_zoneidx)
 		z++;
 
 	return z;
@@ -749,9 +791,11 @@ static inline struct zone **next_zones_z
  * This iterator iterates though all zones at or below a given zone index.
  */
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
+	for (z = first_zones_zonelist(zlist, highidx),			\
+					zone = zonelist_zone(z++);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx), zone = *z++)
+		z = next_zones_zonelist(z, highidx),			\
+					zone = zonelist_zone(z++))
 
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
Index: linux-2.6.25-rc2-mm1/include/linux/oom.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/oom.h	2008-02-27 16:28:02.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/oom.h	2008-02-27 16:28:17.000000000 -0500
@@ -23,8 +23,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMORY_POLICY,
 };
 
-extern int try_set_zone_oom(struct zonelist *zonelist);
-extern void clear_zonelist_oom(struct zonelist *zonelist);
+extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
+extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
 extern int register_oom_notifier(struct notifier_block *nb);
Index: linux-2.6.25-rc2-mm1/kernel/cpuset.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/kernel/cpuset.c	2008-02-27 16:28:02.000000000 -0500
+++ linux-2.6.25-rc2-mm1/kernel/cpuset.c	2008-02-27 16:28:17.000000000 -0500
@@ -1915,8 +1915,8 @@ int cpuset_zonelist_valid_mems_allowed(s
 {
 	int i;
 
-	for (i = 0; zl->zones[i]; i++) {
-		int nid = zone_to_nid(zl->zones[i]);
+	for (i = 0; zl->_zonerefs[i].zone; i++) {
+		int nid = zonelist_node_idx(&zl->_zonerefs[i]);
 
 		if (node_isset(nid, current->mems_allowed))
 			return 1;
Index: linux-2.6.25-rc2-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/hugetlb.c	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/hugetlb.c	2008-02-27 16:28:17.000000000 -0500
@@ -79,7 +79,8 @@ static struct page *dequeue_huge_page(st
 	struct mempolicy *mpol;
 	struct zonelist *zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol);
-	struct zone *zone, **z;
+	struct zone *zone;
+	struct zoneref *z;
 
 	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
 		nid = zone_to_nid(zone);
Index: linux-2.6.25-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/mempolicy.c	2008-02-27 16:28:11.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/mempolicy.c	2008-02-27 16:28:17.000000000 -0500
@@ -186,7 +186,7 @@ static struct zonelist *bind_zonelist(no
 		for_each_node_mask(nd, *nodes) { 
 			struct zone *z = &NODE_DATA(nd)->node_zones[k];
 			if (z->present_pages > 0) 
-				zl->zones[num++] = z;
+				zoneref_set_zone(z, &zl->_zonerefs[num++]);
 		}
 		if (k == 0)
 			break;
@@ -196,7 +196,8 @@ static struct zonelist *bind_zonelist(no
 		kfree(zl);
 		return ERR_PTR(-EINVAL);
 	}
-	zl->zones[num] = NULL;
+	zl->_zonerefs[num].zone = NULL;
+	zl->_zonerefs[num].zone_idx = 0;
 	return zl;
 }
 
@@ -504,9 +505,11 @@ static void get_zonemask(struct mempolic
 	nodes_clear(*nodes);
 	switch (p->policy) {
 	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->zones[i]; i++)
-			node_set(zone_to_nid(p->v.zonelist->zones[i]),
-				*nodes);
+		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
+			struct zoneref *zref;
+			zref = &p->v.zonelist->_zonerefs[i];
+			node_set(zonelist_node_idx(zref), *nodes);
+		}
 		break;
 	case MPOL_DEFAULT:
 		break;
@@ -1212,12 +1215,13 @@ unsigned slab_node(struct mempolicy *pol
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
-	case MPOL_BIND:
+	case MPOL_BIND: {
 		/*
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zone_to_nid(policy->v.zonelist->zones[0]);
+		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
+	}
 
 	case MPOL_PREFERRED:
 		if (policy->v.preferred_node >= 0)
@@ -1321,7 +1325,7 @@ static struct page *alloc_page_interleav
 
 	zl = node_zonelist(nid, gfp);
 	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zl->zones[0])
+	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
 	return page;
 }
@@ -1458,10 +1462,14 @@ int __mpol_equal(struct mempolicy *a, st
 		return a->v.preferred_node == b->v.preferred_node;
 	case MPOL_BIND: {
 		int i;
-		for (i = 0; a->v.zonelist->zones[i]; i++)
-			if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i])
+		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
+			struct zone *za, *zb;
+			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
+			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
+			if (za != zb)
 				return 0;
-		return b->v.zonelist->zones[i] == NULL;
+		}
+		return b->v.zonelist->_zonerefs[i].zone == NULL;
 	}
 	default:
 		BUG();
@@ -1780,12 +1788,12 @@ static void mpol_rebind_policy(struct me
 		break;
 	case MPOL_BIND: {
 		nodemask_t nodes;
-		struct zone **z;
+		struct zoneref *z;
 		struct zonelist *zonelist;
 
 		nodes_clear(nodes);
-		for (z = pol->v.zonelist->zones; *z; z++)
-			node_set(zone_to_nid(*z), nodes);
+		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
+			node_set(zonelist_node_idx(z), nodes);
 		nodes_remap(tmp, nodes, *mpolmask, *newmask);
 		nodes = tmp;
 
Index: linux-2.6.25-rc2-mm1/mm/oom_kill.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/oom_kill.c	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/oom_kill.c	2008-02-27 16:28:17.000000000 -0500
@@ -175,7 +175,7 @@ static inline enum oom_constraint constr
 {
 #ifdef CONFIG_NUMA
 	struct zone *zone;
-	struct zone **z;
+	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	nodemask_t nodes = node_states[N_HIGH_MEMORY];
 
@@ -458,29 +458,29 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifie
  * if a parallel OOM killing is already taking place that includes a zone in
  * the zonelist.  Otherwise, locks all zones in the zonelist and returns 1.
  */
-int try_set_zone_oom(struct zonelist *zonelist)
+int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 {
-	struct zone **z;
+	struct zoneref *z;
+	struct zone *zone;
 	int ret = 1;
 
-	z = zonelist->zones;
-
 	spin_lock(&zone_scan_mutex);
-	do {
-		if (zone_is_oom_locked(*z)) {
+	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
+		if (zone_is_oom_locked(zone)) {
 			ret = 0;
 			goto out;
 		}
-	} while (*(++z) != NULL);
+	}
+
+	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
+		/*
+		 * Lock each zone in the zonelist under zone_scan_mutex so a
+		 * parallel invocation of try_set_zone_oom() doesn't succeed
+		 * when it shouldn't.
+		 */
+		zone_set_flag(zone, ZONE_OOM_LOCKED);
+	}
 
-	/*
-	 * Lock each zone in the zonelist under zone_scan_mutex so a parallel
-	 * invocation of try_set_zone_oom() doesn't succeed when it shouldn't.
-	 */
-	z = zonelist->zones;
-	do {
-		zone_set_flag(*z, ZONE_OOM_LOCKED);
-	} while (*(++z) != NULL);
 out:
 	spin_unlock(&zone_scan_mutex);
 	return ret;
@@ -491,16 +491,15 @@ out:
  * allocation attempts with zonelists containing them may now recall the OOM
  * killer, if necessary.
  */
-void clear_zonelist_oom(struct zonelist *zonelist)
+void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 {
-	struct zone **z;
-
-	z = zonelist->zones;
+	struct zoneref *z;
+	struct zone *zone;
 
 	spin_lock(&zone_scan_mutex);
-	do {
-		zone_clear_flag(*z, ZONE_OOM_LOCKED);
-	} while (*(++z) != NULL);
+	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
+		zone_clear_flag(zone, ZONE_OOM_LOCKED);
+	}
 	spin_unlock(&zone_scan_mutex);
 }
 
Index: linux-2.6.25-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/page_alloc.c	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/page_alloc.c	2008-02-27 16:28:17.000000000 -0500
@@ -1327,7 +1327,7 @@ static nodemask_t *zlc_setup(struct zone
  * We are low on memory in the second scan, and should leave no stone
  * unturned looking for a free page.
  */
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
 						nodemask_t *allowednodes)
 {
 	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
@@ -1338,7 +1338,7 @@ static int zlc_zone_worth_trying(struct 
 	if (!zlc)
 		return 1;
 
-	i = z - zonelist->zones;
+	i = z - zonelist->_zonerefs;
 	n = zlc->z_to_n[i];
 
 	/* This zone is worth trying if it is allowed but not full */
@@ -1350,7 +1350,7 @@ static int zlc_zone_worth_trying(struct 
  * zlc->fullzones, so that subsequent attempts to allocate a page
  * from that zone don't waste time re-examining it.
  */
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 {
 	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
 	int i;				/* index of *z in zonelist zones */
@@ -1359,7 +1359,7 @@ static void zlc_mark_zone_full(struct zo
 	if (!zlc)
 		return;
 
-	i = z - zonelist->zones;
+	i = z - zonelist->_zonerefs;
 
 	set_bit(i, zlc->fullzones);
 }
@@ -1371,13 +1371,13 @@ static nodemask_t *zlc_setup(struct zone
 	return NULL;
 }
 
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
 				nodemask_t *allowednodes)
 {
 	return 1;
 }
 
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 {
 }
 #endif	/* CONFIG_NUMA */
@@ -1390,7 +1390,7 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
-	struct zone **z;
+	struct zoneref *z;
 	struct page *page = NULL;
 	int classzone_idx;
 	struct zone *zone, *preferred_zone;
@@ -1399,8 +1399,8 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
 	z = first_zones_zonelist(zonelist, high_zoneidx);
-	classzone_idx = zone_idx(*z);
-	preferred_zone = *z;
+	classzone_idx = zonelist_zone_idx(z);
+	preferred_zone = zonelist_zone(z);
 
 zonelist_scan:
 	/*
@@ -1519,7 +1519,8 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	struct zone **z;
+	struct zoneref *z;
+	struct zone *zone;
 	struct page *page;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
@@ -1533,9 +1534,9 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 		return NULL;
 
 restart:
-	z = zonelist->zones;  /* the list of zones suitable for gfp_mask */
+	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */
 
-	if (unlikely(*z == NULL)) {
+	if (unlikely(!z->zone)) {
 		/*
 		 * Happens if we have an empty zonelist as a result of
 		 * GFP_THISNODE being used on a memoryless node
@@ -1559,8 +1560,8 @@ restart:
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
-	for (z = zonelist->zones; *z; z++)
-		wakeup_kswapd(*z, order);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -1641,7 +1642,7 @@ nofail_alloc:
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		if (!try_set_zone_oom(zonelist)) {
+		if (!try_set_zone_oom(zonelist, gfp_mask)) {
 			schedule_timeout_uninterruptible(1);
 			goto restart;
 		}
@@ -1655,18 +1656,18 @@ nofail_alloc:
 		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
-			clear_zonelist_oom(zonelist);
+			clear_zonelist_oom(zonelist, gfp_mask);
 			goto got_pg;
 		}
 
 		/* The OOM killer will not help higher order allocs so fail */
 		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			clear_zonelist_oom(zonelist);
+			clear_zonelist_oom(zonelist, gfp_mask);
 			goto nopage;
 		}
 
 		out_of_memory(zonelist, gfp_mask, order);
-		clear_zonelist_oom(zonelist);
+		clear_zonelist_oom(zonelist, gfp_mask);
 		goto restart;
 	}
 
@@ -1772,7 +1773,7 @@ EXPORT_SYMBOL(free_pages);
 
 static unsigned int nr_free_zone_pages(int offset)
 {
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 
 	/* Just pick one node, since fallback list is circular */
@@ -1966,7 +1967,8 @@ static int build_zonelists_node(pg_data_
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (populated_zone(zone)) {
-			zonelist->zones[nr_zones++] = zone;
+			zoneref_set_zone(zone,
+				&zonelist->_zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
 		}
 
@@ -2142,11 +2144,12 @@ static void build_zonelists_in_node_orde
 	struct zonelist *zonelist;
 
 	zonelist = &pgdat->node_zonelists[0];
-	for (j = 0; zonelist->zones[j] != NULL; j++)
+	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
 		;
 	j = build_zonelists_node(NODE_DATA(node), zonelist, j,
 							MAX_NR_ZONES - 1);
-	zonelist->zones[j] = NULL;
+	zonelist->_zonerefs[j].zone = NULL;
+	zonelist->_zonerefs[j].zone_idx = 0;
 }
 
 /*
@@ -2159,7 +2162,8 @@ static void build_thisnode_zonelists(pg_
 
 	zonelist = &pgdat->node_zonelists[1];
 	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
-	zonelist->zones[j] = NULL;
+	zonelist->_zonerefs[j].zone = NULL;
+	zonelist->_zonerefs[j].zone_idx = 0;
 }
 
 /*
@@ -2184,12 +2188,14 @@ static void build_zonelists_in_zone_orde
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
 			if (populated_zone(z)) {
-				zonelist->zones[pos++] = z;
+				zoneref_set_zone(z,
+					&zonelist->_zonerefs[pos++]);
 				check_highest_zone(zone_type);
 			}
 		}
 	}
-	zonelist->zones[pos] = NULL;
+	zonelist->_zonerefs[pos].zone = NULL;
+	zonelist->_zonerefs[pos].zone_idx = 0;
 }
 
 static int default_zonelist_order(void)
@@ -2266,7 +2272,8 @@ static void build_zonelists(pg_data_t *p
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
 		zonelist = pgdat->node_zonelists + i;
-		zonelist->zones[0] = NULL;
+		zonelist->_zonerefs[0].zone = NULL;
+		zonelist->_zonerefs[0].zone_idx = 0;
 	}
 
 	/* NUMA-aware ordering of nodes */
@@ -2318,13 +2325,13 @@ static void build_zonelist_cache(pg_data
 {
 	struct zonelist *zonelist;
 	struct zonelist_cache *zlc;
-	struct zone **z;
+	struct zoneref *z;
 
 	zonelist = &pgdat->node_zonelists[0];
 	zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
 	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-	for (z = zonelist->zones; *z; z++)
-		zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
+	for (z = zonelist->_zonerefs; z->zone; z++)
+		zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
 }
 
 
@@ -2367,7 +2374,8 @@ static void build_zonelists(pg_data_t *p
 							MAX_NR_ZONES - 1);
 	}
 
-	zonelist->zones[j] = NULL;
+	zonelist->_zonerefs[j].zone = NULL;
+	zonelist->_zonerefs[j].zone_idx = 0;
 }
 
 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
Index: linux-2.6.25-rc2-mm1/mm/slab.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/slab.c	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/slab.c	2008-02-27 16:28:17.000000000 -0500
@@ -3244,7 +3244,7 @@ static void *fallback_alloc(struct kmem_
 {
 	struct zonelist *zonelist;
 	gfp_t local_flags;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
Index: linux-2.6.25-rc2-mm1/mm/slub.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/slub.c	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/slub.c	2008-02-27 16:28:17.000000000 -0500
@@ -1299,7 +1299,7 @@ static struct page *get_any_partial(stru
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	struct page *page;
Index: linux-2.6.25-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/vmscan.c	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/vmscan.c	2008-02-27 16:28:17.000000000 -0500
@@ -1273,7 +1273,7 @@ static unsigned long shrink_zones(int pr
 {
 	enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
 	unsigned long nr_reclaimed = 0;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 
 	sc->all_unreclaimable = 1;
@@ -1331,7 +1331,7 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 
@@ -1453,7 +1453,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
 	struct zonelist *zonelist;
-	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
+	int target_zone = gfp_zonelist(GFP_HIGHUSER_MOVABLE);
 
 	zonelist = &NODE_DATA(numa_node_id())->node_zonelists[target_zone];
 	if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))


* [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask
  2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
                   ` (4 preceding siblings ...)
  2008-02-27 21:47 ` [PATCH 5/6] Have zonelist contains structs with both a zone pointer and zone_idx Lee Schermerhorn, Mel Gorman
@ 2008-02-27 21:47 ` Lee Schermerhorn, Mel Gorman
  2008-02-29  2:59   ` KAMEZAWA Hiroyuki
  2008-02-29  8:48   ` KOSAKI Motohiro
  2008-02-27 21:53 ` [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn
  2008-02-29 14:12 ` Mel Gorman
  7 siblings, 2 replies; 37+ messages in thread
From: Lee Schermerhorn, Mel Gorman @ 2008-02-27 21:47 UTC (permalink / raw
  To: akpm; +Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

[PATCH 6/6] Filter based on a nodemask as well as a gfp_mask

V11r3 against 2.6.25-rc2-mm1

The MPOL_BIND policy creates a zonelist that is used for allocations
controlled by that mempolicy. As the per-node zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages()
that takes a nodemask for further filtering. This eliminates the need
for MPOL_BIND to create a custom zonelist.

A positive benefit of this is that allocations using MPOL_BIND now use the
local node's distance-ordered zonelist instead of a custom node-id-ordered
zonelist.  I.e., pages will be allocated from the closest allowed node with
available memory.
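
For illustration, an allocation under MPOL_BIND reduces to something like
the sketch below once this patch is applied. It is only a sketch:
node_zonelist(), __alloc_pages_nodemask() and the v.nodes field are the
interfaces added or reused by this series, while bind_alloc_sketch() is a
made-up caller, not code from any patch.

#include <linux/gfp.h>
#include <linux/mempolicy.h>

/*
 * Hypothetical caller, not part of the series: allocate a page under an
 * MPOL_BIND policy "pol".  The __GFP_THISNODE corner case handled in
 * zonelist_policy() is ignored to keep the sketch short.
 */
static struct page *bind_alloc_sketch(gfp_t gfp, unsigned int order,
				      struct mempolicy *pol)
{
	/* start from the local node's distance-ordered zonelist ... */
	struct zonelist *zl = node_zonelist(numa_node_id(), gfp);

	/* ... and let the allocator filter it with the policy's nodemask */
	return __alloc_pages_nodemask(gfp, order, zl, &pol->v.nodes);
}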

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/buffer.c               |    2 
 include/linux/cpuset.h    |    4 -
 include/linux/gfp.h       |    4 +
 include/linux/mempolicy.h |    3 
 include/linux/mmzone.h    |   62 +++++++++++++----
 kernel/cpuset.c           |   18 +----
 mm/mempolicy.c            |  165 ++++++++++++++++------------------------------
 mm/page_alloc.c           |   40 +++++++----
 8 files changed, 151 insertions(+), 147 deletions(-)

Index: linux-2.6.25-rc2-mm1/fs/buffer.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/fs/buffer.c	2008-02-27 16:28:17.000000000 -0500
+++ linux-2.6.25-rc2-mm1/fs/buffer.c	2008-02-27 16:28:20.000000000 -0500
@@ -376,7 +376,7 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS));
+						gfp_zone(GFP_NOFS), NULL);
 		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
Index: linux-2.6.25-rc2-mm1/include/linux/cpuset.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/cpuset.h	2008-02-27 16:28:01.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/cpuset.h	2008-02-27 16:28:20.000000000 -0500
@@ -26,7 +26,7 @@ extern nodemask_t cpuset_mems_allowed(st
 #define cpuset_current_mems_allowed (current->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_task_memory_state(void);
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -102,7 +102,7 @@ static inline nodemask_t cpuset_mems_all
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
Index: linux-2.6.25-rc2-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/gfp.h	2008-02-27 16:28:16.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/gfp.h	2008-02-27 16:28:20.000000000 -0500
@@ -182,6 +182,10 @@ static inline void arch_alloc_page(struc
 
 extern struct page *__alloc_pages(gfp_t, unsigned int, struct zonelist *);
 
+extern struct page *
+__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask);
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
Index: linux-2.6.25-rc2-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/mempolicy.h	2008-02-27 16:28:11.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/mempolicy.h	2008-02-27 16:28:20.000000000 -0500
@@ -64,9 +64,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
Index: linux-2.6.25-rc2-mm1/include/linux/mmzone.h
===================================================================
--- linux-2.6.25-rc2-mm1.orig/include/linux/mmzone.h	2008-02-27 16:28:17.000000000 -0500
+++ linux-2.6.25-rc2-mm1/include/linux/mmzone.h	2008-02-27 16:28:20.000000000 -0500
@@ -756,47 +756,85 @@ static inline void zoneref_set_zone(stru
 	zoneref->zone_idx = zone_idx(zone);
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
-					enum zone_type highest_zoneidx)
+					enum zone_type highest_zoneidx,
+					nodemask_t *nodes)
 {
 	struct zoneref *z;
 
 	/* Find the first suitable zone to use for the allocation */
 	z = zonelist->_zonerefs;
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
-					enum zone_type highest_zoneidx)
+					enum zone_type highest_zoneidx,
+					nodemask_t *nodes)
 {
-	/* Find the next suitable zone to use for the allocation */
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /**
- * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
  * @zlist - The zonelist being iterated
  * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
  *
- * This iterator iterates though all zones at or below a given zone index.
+ * This iterator iterates though all zones at or below a given zone index and
+ * within a given nodemask
  */
-#define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx),			\
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, highidx, nodemask),	\
 					zone = zonelist_zone(z++);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx),			\
+		z = next_zones_zonelist(z, highidx, nodemask),		\
 					zone = zonelist_zone(z++))
 
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates through all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
Index: linux-2.6.25-rc2-mm1/kernel/cpuset.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/kernel/cpuset.c	2008-02-27 16:28:17.000000000 -0500
+++ linux-2.6.25-rc2-mm1/kernel/cpuset.c	2008-02-27 16:28:20.000000000 -0500
@@ -1906,22 +1906,14 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
-
-	for (i = 0; zl->_zonerefs[i].zone; i++) {
-		int nid = zonelist_node_idx(&zl->_zonerefs[i]);
-
-		if (node_isset(nid, current->mems_allowed))
-			return 1;
-	}
-	return 0;
+	return nodes_intersects(*nodemask, current->mems_allowed);
 }
 
 /*
Index: linux-2.6.25-rc2-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/mempolicy.c	2008-02-27 16:28:17.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/mempolicy.c	2008-02-27 16:28:20.000000000 -0500
@@ -163,42 +163,25 @@ static int mpol_check_policy(int mode, n
 	return 0;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				zoneref_set_zone(z, &zl->_zonerefs[num++]);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z;
+
+		for (k = 0; k <= policy_zone; k++) {
+			z = &NODE_DATA(nd)->node_zones[k];
+			if (z->present_pages > 0)
+				return 1;
 		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
 	}
-	zl->_zonerefs[num].zone = NULL;
-	zl->_zonerefs[num].zone_idx = 0;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -229,12 +212,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -500,19 +482,12 @@ static long do_set_mempolicy(int mode, n
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
-
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zoneref *zref;
-			zref = &p->v.zonelist->_zonerefs[i];
-			node_set(zonelist_node_idx(zref), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1160,6 +1135,18 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied for MPOL_BIND */
+	if (unlikely(policy->policy == MPOL_BIND) &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1172,12 +1159,17 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
-		/*FALL THROUGH*/
+		/*
+		 * Normally, MPOL_BIND allocations are node-local within the
+		 * allowed nodemask. However, if __GFP_THISNODE is set and the
+		 * current node is not part of the mask, we use the zonelist
+		 * for the first node in the mask instead.
+		 */
+		nd = numa_node_id();
+		if (unlikely(gfp & __GFP_THISNODE) &&
+				unlikely(!node_isset(nd, policy->v.nodes)))
+			nd = first_node(policy->v.nodes);
+		break;
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
 		nd = numa_node_id();
@@ -1220,7 +1212,13 @@ unsigned slab_node(struct mempolicy *pol
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
+		struct zonelist *zonelist;
+		struct zoneref *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		z = first_zones_zonelist(zonelist, highest_zoneidx,
+							&policy->v.nodes);
+		return zonelist_node_idx(z);
 	}
 
 	case MPOL_PREFERRED:
@@ -1371,14 +1369,15 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		/*
 		 * slow path: ref counted policy -- shared or vma
 		 */
-		struct page *page =  __alloc_pages(gfp, 0, zl);
+		struct page *page =  __alloc_pages_nodemask(gfp, 0,
+						zl, nodemask_policy(gfp, pol));
 		__mpol_free(pol);
 		return page;
 	}
 	/*
 	 * fast path:  default or task policy
 	 */
-	return __alloc_pages(gfp, 0, zl);
+	return __alloc_pages_nodemask(gfp, 0, zl, nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1410,7 +1409,8 @@ struct page *alloc_pages_current(gfp_t g
 		pol = &default_policy;
 	if (pol->policy == MPOL_INTERLEAVE)
 		return alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
+	return __alloc_pages_nodemask(gfp, order,
+			zonelist_policy(gfp, pol), nodemask_policy(gfp, pol));
 }
 EXPORT_SYMBOL(alloc_pages_current);
 
@@ -1435,14 +1435,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1456,21 +1448,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
-			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zonerefs[i].zone == NULL;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1482,8 +1465,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1774,6 +1755,8 @@ static void mpol_rebind_policy(struct me
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1786,32 +1769,6 @@ static void mpol_rebind_policy(struct me
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		struct zoneref *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
-			node_set(zonelist_node_idx(z), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1874,9 +1831,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
Index: linux-2.6.25-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/mm/page_alloc.c	2008-02-27 16:28:17.000000000 -0500
+++ linux-2.6.25-rc2-mm1/mm/page_alloc.c	2008-02-27 16:28:20.000000000 -0500
@@ -1387,7 +1387,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	struct zoneref *z;
@@ -1398,7 +1398,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, high_zoneidx, nodemask);
 	classzone_idx = zonelist_zone_idx(z);
 	preferred_zone = zonelist_zone(z);
 
@@ -1407,7 +1407,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1513,9 +1514,9 @@ static void set_page_owner(struct page *
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page *
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1544,7 +1545,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1589,7 +1590,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1602,7 +1603,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1637,7 +1638,7 @@ nofail_alloc:
 		drain_all_pages();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1653,8 +1654,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
 			clear_zonelist_oom(zonelist, gfp_mask);
 			goto got_pg;
@@ -1707,6 +1709,20 @@ got_pg:
 	return page;
 }
 
+struct page *
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*


* Re: [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3
  2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
                   ` (5 preceding siblings ...)
  2008-02-27 21:47 ` [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask Lee Schermerhorn, Mel Gorman
@ 2008-02-27 21:53 ` Lee Schermerhorn
  2008-02-29 14:12 ` Mel Gorman
  7 siblings, 0 replies; 37+ messages in thread
From: Lee Schermerhorn @ 2008-02-27 21:53 UTC (permalink / raw
  To: Mel Gorman
  Cc: akpm, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

On Wed, 2008-02-27 at 16:47 -0500, Mel Gorman wrote:
> [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3
> 
> This is a rebase of the two-zonelist patchset to 2.6.25-rc2-mm1.
> 
> Mel, still on vacation last I checked,  asked me to repost these
> as I'd already rebased them and I've been testing them continually
> on each -mm tree for months, hoping to see them in -mm for wider
> testing.
> 
> I have a series of mempolicy cleanup patches, including a rework of the
> reference counting that depend on this series.  David R. has a series
> of mempolicy enhancements out for review that, IMO, will benefit from
> this series.  In both cases, the removal of the custom zonelist for
> MPOL_BIND is the important feature.
> 
> Lee

FYI:  I reposted this series, at Mel's request, as I mentioned above.  I
see that the mail system blames Mel for the mail--probably because I
placed his "From:  Mel..." in the body as original author.  Hope this
doesn't cause any confusion.

Lee

> 
> ---
> 
> Changelog since V11r2
>   o Rebase to 2.6.25-rc2-mm1
> 
> Changelog since V10
>   o Rebase to 2.6.24-rc4-mm1
>   o Clear up warnings in fs/buffer.c early in the patchset
> 
> Changelog since V9
>   o Rebase to 2.6.24-rc2-mm1
>   o Lookup the nodemask for each allocator callsite in mempolicy.c
>   o Update NUMA statistics based on preferred zone, not first zonelist entry
>   o When __GFP_THISNODE is specified with MPOL_BIND and the current node is
>     not in the allowed nodemask, the first node in the mask will be used
>   o Stick with using two zonelists instead of one because of excessive
>     complexity with corner cases
> 
> Changelog since V8
>   o Rebase to 2.6.24-rc2
>   o Added ack for the OOM changes
>   o Behave correctly when GFP_THISNODE and a node ID are specified
>   o Clear up warning over type of nodes_intersects() function
> 
> Changelog since V7
>   o Rebase to 2.6.23-rc8-mm2
> 
> Changelog since V6
>   o Fix build bug in relation to memory controller combined with one-zonelist
>   o Use while() instead of a stupid looking for()
>   o Instead of encoding zone index information in a pointer, this version
>     introduces a structure that stores a zone pointer and its index 
> 
> Changelog since V5
>   o Rebase to 2.6.23-rc4-mm1
>   o Drop patch that replaces inline functions with macros
> 
> Changelog since V4
>   o Rebase to -mm kernel. Host of memoryless patches collisions dealt with
>   o Do not call wakeup_kswapd() for every zone in a zonelist
>   o Dropped the FASTCALL removal
>   o Have cursor in iterator advance earlier
>   o Use nodes_and in cpuset_nodes_valid_mems_allowed()
>   o Use defines instead of inlines, noticably better performance on gcc-3.4
>     No difference on later compilers such as gcc 4.1
>   o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
>     currently inconclusive but it definitly consumes at least one cache
>     line
> 
> Changelog since V3
>   o Fix compile error in the parisc change
>   o Calculate gfp_zone only once in __alloc_pages
>   o Calculate classzone_idx properly in get_page_from_freelist
>   o Alter check so that zone id embedded may still be used on UP
>   o Use Kamezawa-sans suggestion for skipping zones in zonelist
>   o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
>     removes the need for MPOL_BIND to have a custom zonelist
>   o Move zonelist iterators and helpers to mm.h
>   o Change _zones from struct zone * to unsigned long
>   
> Changelog since V2
>   o shrink_zones() uses zonelist instead of zonelist->zones
>   o hugetlb uses zonelist iterator
>   o zone_idx information is embedded in zonelist pointers
>   o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)
> 
> Changelog since V1
>   o Break up the patch into 3 patches
>   o Introduce iterators for zonelists
>   o Performance regression test
> 
> The following patches replace multiple zonelists per node with two zonelists
> that are filtered based on the GFP flags. The patches as a set fix a bug
> with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
> the MPOL_BIND will apply to the two highest zones when the highest zone
> is ZONE_MOVABLE. This should be considered as an alternative fix for the
> MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
> filters only custom zonelists.
> 
> The first patch cleans up an inconsistency where direct reclaim uses
> zonelist->zones where other places use zonelist.
> 
> The second patch introduces a helper function node_zonelist() for looking
> up the appropriate zonelist for a GFP mask which simplifies patches later
> in the set.
> 
> The third patch defines/remembers the "preferred zone" for numa statistics,
> as it is no longer always the first zone in a zonelist.
> 
> The forth patch replaces multiple zonelists with two zonelists that are
> filtered. The two zonelists are due to the fact that the memoryless patchset
> introduces a second set of zonelists for __GFP_THISNODE.
> 
> The fifth patch introduces helper macros for retrieving the zone and node
> indices of entries in a zonelist.
> 
> The final patch introduces filtering of the zonelists based on a nodemask. Two
> zonelists exist per node, one for normal allocations and one for __GFP_THISNODE.
> 
> Performance results varied depending on the machine configuration. In real
> workloads the gain/loss will depend on how much the userspace portion of
> the benchmark benefits from having more cache available due to reduced
> referencing of zonelists.
> 
> These are the range of performance losses/gains when running against
> 2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and
> ppc64 both NUMA and non-NUMA.
> 
> 			     loss   to  gain
> Total CPU time on Kernbench: -0.86% to  1.13%
> Elapsed   time on Kernbench: -0.79% to  0.76%
> page_test from aim9:         -4.37% to  0.79%
> brk_test  from aim9:         -0.71% to  4.07%
> fork_test from aim9:         -1.84% to  4.60%
> exec_test from aim9:         -0.71% to  1.08%
> 


* Re: [PATCH 3/6] Remember what the preferred zone is for zone_statistics
  2008-02-27 21:47 ` [PATCH 3/6] Remember what the preferred zone is for zone_statistics Lee Schermerhorn, Mel Gorman
@ 2008-02-27 22:00   ` Christoph Lameter
  2008-02-28 17:45     ` Lee Schermerhorn
  2008-02-29 14:19     ` Mel Gorman
  2008-02-29  2:30   ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 37+ messages in thread
From: Christoph Lameter @ 2008-02-27 22:00 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: akpm, mel, ak, kamezawa.hiroyu, linux-mm, rientjes, eric.whitney

> This patch records what the preferred zone is rather than assuming the
> first zone in the zonelist is it. This simplifies the reading of later
> patches in this set.

And is needed for correctness if GFP_THISNODE is used?

Reviewed-by: Christoph Lameter <clameter@sgi.com>


* Re: [PATCH 3/6] Remember what the preferred zone is for zone_statistics
  2008-02-27 22:00   ` Christoph Lameter
@ 2008-02-28 17:45     ` Lee Schermerhorn
  2008-02-29 14:19     ` Mel Gorman
  1 sibling, 0 replies; 37+ messages in thread
From: Lee Schermerhorn @ 2008-02-28 17:45 UTC (permalink / raw
  To: Christoph Lameter
  Cc: akpm, mel, ak, kamezawa.hiroyu, linux-mm, rientjes, eric.whitney

On Wed, 2008-02-27 at 14:00 -0800, Christoph Lameter wrote:
> > This patch records what the preferred zone is rather than assuming the
> > first zone in the zonelist is it. This simplifies the reading of later
> > patches in this set.
> 
> And is needed for correctness if GFP_THISNODE is used?

Mel can correct me.  

I believe this is needed for MPOL_BIND allocations because we now use
the zonelist for the local node--i.e., the node from which the
allocation takes place--and search for the nearest node in the policy
nodemask that satisfies the allocation.  For the purpose of numa stats,
a "numa_hit" occurs if the allocation succeeds on the first node in the
zone list that is also in the policy nodemask--this is the "preferred
node".

Lee
> 
> Reviewed-by: Christoph Lameter <clameter@sgi.com>


* Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask
  2008-02-27 21:47 ` [PATCH 4/6] Use two zonelist that are filtered by GFP mask Lee Schermerhorn, Mel Gorman
@ 2008-02-28 21:32   ` Andrew Morton
  2008-02-28 21:53     ` Lee Schermerhorn
                       ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Andrew Morton @ 2008-02-28 21:32 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

On Wed, 27 Feb 2008 16:47:34 -0500
Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:

> +/* Returns the first zone at or below highest_zoneidx in a zonelist */
> +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
> +					enum zone_type highest_zoneidx)
> +{
> +	struct zone **z;
> +
> +	/* Find the first suitable zone to use for the allocation */
> +	z = zonelist->zones;
> +	while (*z && zone_idx(*z) > highest_zoneidx)
> +		z++;
> +
> +	return z;
> +}
> +
> +/* Returns the next zone at or below highest_zoneidx in a zonelist */
> +static inline struct zone **next_zones_zonelist(struct zone **z,
> +					enum zone_type highest_zoneidx)
> +{
> +	/* Find the next suitable zone to use for the allocation */
> +	while (*z && zone_idx(*z) > highest_zoneidx)
> +		z++;
> +
> +	return z;
> +}
> +
> +/**
> + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> + * @zone - The current zone in the iterator
> + * @z - The current pointer within zonelist->zones being iterated
> + * @zlist - The zonelist being iterated
> + * @highidx - The zone index of the highest zone to return
> + *
> + * This iterator iterates though all zones at or below a given zone index.
> + */
> +#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> +	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
> +		zone;							\
> +		z = next_zones_zonelist(z, highidx), zone = *z++)
> +

omygawd will that thing generate a lot of code!

It has four call sites in mm/oom_kill.c and the overall patchset increases
mm/oom_kill.o's text section (x86_64 allmodconfig) from 3268 bytes to 3845.

vmscan.o and page_alloc.o also grew a lot.  otoh total vmlinux bloat from
the patchset is only around 700 bytes, so I expect that with a little less
insanity we could actually get an aggregate improvement here.

Some of the inlining in mmzone.h is just comical.  Some of it is obvious
(first_zones_zonelist) and some of it is less obvious (pfn_present).

I applied these for testing but I really don't think we should be merging
such easily-fixed regressions into mainline.  Could someone please take a
look at de-porking core MM?
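
One shape the de-porking could take, sketched only and not part of this
series: keep the cheap accessors inline but move the walker itself out of
line, e.g. into mm/mmzone.c, so each call site stops expanding the loop.

/* mm/mmzone.c (sketch): out-of-line version of the inline quoted above */
#include <linux/mmzone.h>

struct zone **next_zones_zonelist(struct zone **z,
					enum zone_type highest_zoneidx)
{
	/* find the next suitable zone to use for the allocation */
	while (*z && zone_idx(*z) > highest_zoneidx)
		z++;

	return z;
}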


Also, I switched all your Tested-by:s to Signed-off-by:s.  You were on the
delivery path, so s-o-b is the appropriate tag.  I would like to believe
that Signed-off-by: implies Tested-by: anyway (rofl).


* Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask
  2008-02-28 21:32   ` Andrew Morton
@ 2008-02-28 21:53     ` Lee Schermerhorn
  2008-02-29  2:37       ` KAMEZAWA Hiroyuki
  2008-02-29 14:50     ` Mel Gorman
  2008-03-06 18:41     ` [PATCH 4/6] Use two zonelist that are filtered by GFP mask Mel Gorman
  2 siblings, 1 reply; 37+ messages in thread
From: Lee Schermerhorn @ 2008-02-28 21:53 UTC (permalink / raw
  To: Andrew Morton
  Cc: mel, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

On Thu, 2008-02-28 at 13:32 -0800, Andrew Morton wrote:
> On Wed, 27 Feb 2008 16:47:34 -0500
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> 
> > +/* Returns the first zone at or below highest_zoneidx in a zonelist */
> > +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
> > +					enum zone_type highest_zoneidx)
> > +{
> > +	struct zone **z;
> > +
> > +	/* Find the first suitable zone to use for the allocation */
> > +	z = zonelist->zones;
> > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > +		z++;
> > +
> > +	return z;
> > +}
> > +
> > +/* Returns the next zone at or below highest_zoneidx in a zonelist */
> > +static inline struct zone **next_zones_zonelist(struct zone **z,
> > +					enum zone_type highest_zoneidx)
> > +{
> > +	/* Find the next suitable zone to use for the allocation */
> > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > +		z++;
> > +
> > +	return z;
> > +}
> > +
> > +/**
> > + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> > + * @zone - The current zone in the iterator
> > + * @z - The current pointer within zonelist->zones being iterated
> > + * @zlist - The zonelist being iterated
> > + * @highidx - The zone index of the highest zone to return
> > + *
> > + * This iterator iterates though all zones at or below a given zone index.
> > + */
> > +#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> > +	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
> > +		zone;							\
> > +		z = next_zones_zonelist(z, highidx), zone = *z++)
> > +
> 
> omygawd will that thing generate a lot of code!
> 
> It has four call sites in mm/oom_kill.c and the overall patchset increases
> mm/oom_kill.o's text section (x86_64 allmodconfig) from 3268 bytes to 3845.
> 
> vmscan.o and page_alloc.o also grew a lot.  otoh total vmlinux bloat from
> the patchset is only around 700 bytes, so I expect that with a little less
> insanity we could actually get an aggregate improvement here.
> 
> Some of the inlining in mmzone.h is just comical.  Some of it is obvious
> (first_zones_zonelist) and some of it is less obvious (pfn_present).

Yeah, Mel said he was really reaching to avoid performance regression in
this set.   

> 
> I applied these for testing but I really don't think we should be merging
> such easily-fixed regressions into mainline.  Could someone please take a
> look at de-porking core MM?

OK, Mel should be back real soon now, and I'll take a look as well.  At
this point, we just wanted to get some more testing in -mm.

> 
> 
> Also, I switched all your Tested-by:s to Signed-off-by:s.  You were on the
> delivery path, so s-o-b is the appropriate tag.  I would like to believe
> that Signed-off-by: implies Tested-by: anyway (rofl).


Well, this is not the first time I've been tagged with 's-o-b' and
probably not the last :-).

Lee
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] Remember what the preferred zone is for zone_statistics
  2008-02-27 21:47 ` [PATCH 3/6] Remember what the preferred zone is for zone_statistics Lee Schermerhorn, Mel Gorman
  2008-02-27 22:00   ` Christoph Lameter
@ 2008-02-29  2:30   ` KAMEZAWA Hiroyuki
  2008-02-29 14:32     ` Mel Gorman
  1 sibling, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-29  2:30 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: akpm, mel, ak, clameter, linux-mm, rientjes, eric.whitney

On Wed, 27 Feb 2008 16:47:28 -0500
Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:

> From: Mel Gorman <mel@csn.ul.ie>
> [PATCH 3/6] Remember what the preferred zone is for zone_statistics
> 
> V11r3 against 2.6.25-rc2-mm1
> 
> On NUMA, zone_statistics() is used to record events like numa hit, miss
> and foreign. It assumes that the first zone in a zonelist is the preferred
> zone. When multiple zonelists are replaced by one that is filtered, this
> is no longer the case.
> 
> This patch records what the preferred zone is rather than assuming the
> first zone in the zonelist is it. This simplifies the reading of later
> patches in this set.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 

I have no objection to the direction but


> +static struct page *buffered_rmqueue(struct zone *preferred_zone,
>  			struct zone *zone, int order, gfp_t gfp_flags)
>  {

Can't this be written like this?

struct page *
buffered_rmqueue(struct zone *zone, int order, gfp_t gfp_flags, bool numa_hit)

Couldn't the caller itself set this bool value?
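
For illustration only, a rough sketch of that alternative (untested, and
whether the statistics update can really get by with just a bool is a
separate question):

static struct page *buffered_rmqueue(struct zone *zone, int order,
			gfp_t gfp_flags, bool numa_hit)
{
	/*
	 * ... allocate from this zone as before; numa_hit replaces the
	 * preferred_zone argument for the NUMA statistics update ...
	 */
}

/* and the call site in get_page_from_freelist(), roughly: */
page = buffered_rmqueue(zone, order, gfp_mask,
		zone->zone_pgdat == preferred_zone->zone_pgdat);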

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask
  2008-02-28 21:53     ` Lee Schermerhorn
@ 2008-02-29  2:37       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-29  2:37 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: Andrew Morton, mel, ak, clameter, linux-mm, rientjes,
	eric.whitney

On Thu, 28 Feb 2008 16:53:58 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> > omygawd will that thing generate a lot of code!
> > 
> > It has four call sites in mm/oom_kill.c and the overall patchset increases
> > mm/oom_kill.o's text section (x86_64 allmodconfig) from 3268 bytes to 3845.
> > 
> > vmscan.o and page_alloc.o also grew a lot.  otoh total vmlinux bloat from
> > the patchset is only around 700 bytes, so I expect that with a little less
> > insanity we could actually get an aggregate improvement here.
> > 
> > Some of the inlining in mmzone.h is just comical.  Some of it is obvious
> > (first_zones_zonelist) and some of it is less obvious (pfn_present).
> 
> Yeah, Mel said he was really reaching to avoid performance regression in
> this set.   
> 
> > 
> > I applied these for testing but I really don't think we should be merging
> > such easily-fixed regressions into mainline.  Could someone please take a
> > look at de-porking core MM?
> 
> OK, Mel should be back real soon now, and I'll take a look as well.  At
> this point, we just wanted to get some more testing in -mm.
> 
Maybe mm/mmzone.c could be a candidate to put these into.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask
  2008-02-27 21:47 ` [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask Lee Schermerhorn, Mel Gorman
@ 2008-02-29  2:59   ` KAMEZAWA Hiroyuki
  2008-03-07 11:56     ` Mel Gorman
  2008-02-29  8:48   ` KOSAKI Motohiro
  1 sibling, 1 reply; 37+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-02-29  2:59 UTC (permalink / raw
  To: Lee Schermerhorn, Mel Gorman
  Cc: akpm, ak, clameter, linux-mm, rientjes, eric.whitney

On Wed, 27 Feb 2008 16:47:47 -0500
Lee Schermerhorn <lee.schermerhorn@hp.com>, Mel Gorman <mel@csn.ul.ie> wrote:

> [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask
> 
> V11r3 against 2.6.25-rc2-mm1
> 
> The MPOL_BIND policy creates a zonelist that is used for allocations
> controlled by that mempolicy. As the per-node zonelist is already being
> filtered based on a zone id, this patch adds a version of __alloc_pages()
> that takes a nodemask for further filtering. This eliminates the need
> for MPOL_BIND to create a custom zonelist.
> 
> A positive benefit of this is that allocations using MPOL_BIND now use the
> local node's distance-ordered zonelist instead of a custom node-id-ordered
> zonelist.  I.e., pages will be allocated from the closest allowed node with
> available memory.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Christoph Lameter <clameter@sgi.com>
> Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 

Thank you! I like this very much.
The next step is maybe to pass the nodemask down to try_to_free_pages() as well.

BTW, cpuset memory limitation by nodemask provides the same kind of feature,
but it seems cpuset_zone_allowed_softwall/hardwall() have extra checks for
system sanity.

Could you import them? Maybe like this:
==
void __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	if (nodemask) {
		if (unlikely(test_thread_flag(TIF_MEMDIE)))
			nodemask = NULL;
		if ((gfp_mask & __GFP_HARDWALL) &&
		    unlikely(test_thread_flag(TIF_MEMDIE)))
			nodemask = NULL;
	}
}
==
(I don't think above is clean.)
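
A slightly tidied version of the same idea, still only a sketch (the intent
of the __GFP_HARDWALL check is a guess):

static struct page *
__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	/*
	 * An OOM-killed task (TIF_MEMDIE) should be able to allocate from
	 * any node so it can exit quickly; drop the nodemask restriction.
	 * Whether the __GFP_HARDWALL case needs separate handling, as in
	 * the cpuset checks, is left open here.
	 */
	if (nodemask && unlikely(test_thread_flag(TIF_MEMDIE)))
		nodemask = NULL;

	/* ... usual allocation path, filtering by zonelist and nodemask ... */
}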

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/6] Have zonelist contains structs with both a zone pointer and zone_idx
  2008-02-27 21:47 ` [PATCH 5/6] Have zonelist contains structs with both a zone pointer and zone_idx Lee Schermerhorn, Mel Gorman
@ 2008-02-29  7:49   ` KOSAKI Motohiro
  0 siblings, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2008-02-29  7:49 UTC (permalink / raw
  To: Lee Schermerhorn, Mel Gorman
  Cc: kosaki.motohiro, akpm, ak, clameter, kamezawa.hiroyu, linux-mm,
	rientjes, eric.whitney

Hi,

this is a nitpick.


> +static inline struct zone *zonelist_zone(struct zoneref *zoneref)
> +{
> +	return zoneref->zone;
> +}

this is a function that operates on a zoneref, not on a zonelist.
I'd prefer zoneref_zone() :)

> +static inline int zonelist_zone_idx(struct zoneref *zoneref)
> +{
> +	return zoneref->zone_idx;
> +}

ditto


> +static inline int zonelist_node_idx(struct zoneref *zoneref)
> +{
> +#ifdef CONFIG_NUMA
> +	/* zone_to_nid not available in this context */
> +	return zoneref->zone->node;
> +#else
> +	return 0;
> +#endif /* CONFIG_NUMA */
> +}

tritto
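
Spelled out, the renaming would just be (same bodies as the helpers quoted
above; zoneref_zone_idx() and zoneref_node_idx() are my guess at the obvious
counterparts):

static inline struct zone *zoneref_zone(struct zoneref *zoneref)
{
	return zoneref->zone;
}

static inline int zoneref_zone_idx(struct zoneref *zoneref)
{
	return zoneref->zone_idx;
}

static inline int zoneref_node_idx(struct zoneref *zoneref)
{
#ifdef CONFIG_NUMA
	/* zone_to_nid not available in this context */
	return zoneref->zone->node;
#else
	return 0;
#endif /* CONFIG_NUMA */
}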


> -int try_set_zone_oom(struct zonelist *zonelist)
> +int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_mask)
>  {
(snip)
> +	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {

this function has no relation to memory allocation and freeing,
and the gfp_mask argument is only used to calculate the highest zone.

I think the following signature would be more descriptive, maybe:

	int try_set_zone_oom(struct zonelist *zonelist, 
	                     enum zone_type highest_zone_idx)


> @@ -491,16 +491,15 @@ out:
>   * allocation attempts with zonelists containing them may now recall the OOM
>   * killer, if necessary.
>   */
> -void clear_zonelist_oom(struct zonelist *zonelist)
> +void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)

ditto
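
Concretely, the pair would become something like this (a signature sketch
only, not a tested patch; the bodies keep their existing checks):

int try_set_zone_oom(struct zonelist *zonelist, enum zone_type highest_zoneidx)
{
	struct zoneref *z;
	struct zone *zone;

	/* ... locking as before ... */
	for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
		/* ... existing per-zone checks ... */
	}
	/* ... */
	return 1;
}

void clear_zonelist_oom(struct zonelist *zonelist, enum zone_type highest_zoneidx)
{
	/* same change: walk zones up to highest_zoneidx */
}

/* with callers doing the gfp_zone() conversion once, e.g. */
try_set_zone_oom(zonelist, gfp_zone(gfp_mask));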


Thanks.

- kosaki


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask
  2008-02-27 21:47 ` [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask Lee Schermerhorn, Mel Gorman
  2008-02-29  2:59   ` KAMEZAWA Hiroyuki
@ 2008-02-29  8:48   ` KOSAKI Motohiro
  1 sibling, 0 replies; 37+ messages in thread
From: KOSAKI Motohiro @ 2008-02-29  8:48 UTC (permalink / raw
  To: Lee Schermerhorn, Mel Gorman
  Cc: kosaki.motohiro, akpm, ak, clameter, kamezawa.hiroyu, linux-mm,
	rientjes, eric.whitney

Hi 

> The MPOL_BIND policy creates a zonelist that is used for allocations
> controlled by that mempolicy. As the per-node zonelist is already being
> filtered based on a zone id, this patch adds a version of __alloc_pages()
> that takes a nodemask for further filtering. This eliminates the need
> for MPOL_BIND to create a custom zonelist.
> 
> A positive benefit of this is that allocations using MPOL_BIND now use the
> local node's distance-ordered zonelist instead of a custom node-id-ordered
> zonelist.  I.e., pages will be allocated from the closest allowed node with
> available memory.

Great.
This is not only a cleanup but also a real mempolicy improvement.


> -/* Generate a custom zonelist for the BIND policy. */
> -static struct zonelist *bind_zonelist(nodemask_t *nodes)
> +/* Check that the nodemask contains at least one populated zone */
> +static int is_valid_nodemask(nodemask_t *nodemask)
>  {
(snip)
> +	for_each_node_mask(nd, *nodemask) {
> +		struct zone *z;
> +
> +		for (k = 0; k <= policy_zone; k++) {
> +			z = &NODE_DATA(nd)->node_zones[k];
> +			if (z->present_pages > 0)
> +				return 1;

could we use populated_zone()?
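
For reference, a sketch of the same check written with populated_zone():

static int is_valid_nodemask(nodemask_t *nodemask)
{
	int nd, k;

	/* Check that there is at least one populated zone <= policy_zone */
	for_each_node_mask(nd, *nodemask) {
		for (k = 0; k <= policy_zone; k++) {
			struct zone *z = &NODE_DATA(nd)->node_zones[k];

			if (populated_zone(z))
				return 1;
		}
	}
	return 0;
}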


-kosaki


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3
  2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
                   ` (6 preceding siblings ...)
  2008-02-27 21:53 ` [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn
@ 2008-02-29 14:12 ` Mel Gorman
  7 siblings, 0 replies; 37+ messages in thread
From: Mel Gorman @ 2008-02-29 14:12 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: akpm, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

On (27/02/08 16:47), Lee Schermerhorn didst pronounce:
> From: Mel Gorman <mel@csn.ul.ie>
> [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3
> 
> This is a rebase of the two-zonelist patchset to 2.6.25-rc2-mm1.
> 
> Mel, still on vacation last I checked,  asked me to repost these
> as I'd already rebased them and I've been testing them continually
> on each -mm tree for months, hoping to see them in -mm for wider
> testing.
> 

Thanks a lot, Lee. I tested this patchset against a slightly patched
2.6.25-rc2-mm1 (compile-failure fix and a memoryless-related bug that is
fixed in git-x86#testing).

> These are the range of performance losses/gains when running against
> 2.6.24-rc4-mm1. The set and these machines are a mix of i386, x86_64 and
> ppc64 both NUMA and non-NUMA.
> 

Against 2.6.25-rc2-mm1, the results are
				loss	to	gain
Total CPU time on Kernbench:	-0.23%		 1.04%
Elapsed time on Kernbench:	-0.69%		 3.86%
page_test from aim9:		-3.74%		 5.72%
brk_test from aim9:		-7.37%		10.98%
fork_test from aim9:		-3.52%		 3.17%
exec_test from aim9:		-2.78%		 2.34%
TBench:				-1.93%		 2.96%

Hackbench was similarly variable but most machines showed little or no
difference. As before, whether the changes are a performance win/loss
depends on the machine but the majority of results showed little
difference.

> 			     loss   to  gain
> Total CPU time on Kernbench: -0.86% to  1.13%
> Elapsed   time on Kernbench: -0.79% to  0.76%
> page_test from aim9:         -4.37% to  0.79%
> brk_test  from aim9:         -0.71% to  4.07%
> fork_test from aim9:         -1.84% to  4.60%
> exec_test from aim9:         -0.71% to  1.08%
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] Remember what the preferred zone is for zone_statistics
  2008-02-27 22:00   ` Christoph Lameter
  2008-02-28 17:45     ` Lee Schermerhorn
@ 2008-02-29 14:19     ` Mel Gorman
  1 sibling, 0 replies; 37+ messages in thread
From: Mel Gorman @ 2008-02-29 14:19 UTC (permalink / raw
  To: Christoph Lameter
  Cc: Lee Schermerhorn, akpm, ak, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

On (27/02/08 14:00), Christoph Lameter didst pronounce:
> > This patch records what the preferred zone is rather than assuming the
> > first zone in the zonelist is it. This simplifies the reading of later
> > patches in this set.
> 
> And is needed for correctness if GFP_THISNODE is used?
> 

Yes, I should have noted that. Without the patch, statistics could be updated
in the wrong place.

> Reviewed-by: Christoph Lameter <clameter@sgi.com>
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] Remember what the preferred zone is for zone_statistics
  2008-02-29  2:30   ` KAMEZAWA Hiroyuki
@ 2008-02-29 14:32     ` Mel Gorman
  0 siblings, 0 replies; 37+ messages in thread
From: Mel Gorman @ 2008-02-29 14:32 UTC (permalink / raw
  To: KAMEZAWA Hiroyuki
  Cc: Lee Schermerhorn, akpm, ak, clameter, linux-mm, rientjes,
	eric.whitney

On (29/02/08 11:30), KAMEZAWA Hiroyuki didst pronounce:
> On Wed, 27 Feb 2008 16:47:28 -0500
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> 
> > From: Mel Gorman <mel@csn.ul.ie>
> > [PATCH 3/6] Remember what the preferred zone is for zone_statistics
> > 
> > V11r3 against 2.6.25-rc2-mm1
> > 
> > On NUMA, zone_statistics() is used to record events like numa hit, miss
> > and foreign. It assumes that the first zone in a zonelist is the preferred
> > zone. When multiple zonelists are replaced by one that is filtered, this
> > is no longer the case.
> > 
> > This patch records what the preferred zone is rather than assuming the
> > first zone in the zonelist is it. This simplifies the reading of later
> > patches in this set.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> 
> I have no objection to the direction but
> 
> 
> > +static struct page *buffered_rmqueue(struct zone *preferred_zone,
> >  			struct zone *zone, int order, gfp_t gfp_flags)
> >  {
> 
> Can't this be written like this ?
> 
> struct page *
> buffered_rmqueue(struct zone *zone, int order, gfp_t gfp_flags, bool numa_hit)
> 
> Can't caller itself  set this bool value ?
> 

Going that direction, the caller could call zone_statistics() instead of
buffered_rmqueue() and it would be one less parameter to pass. However,
as buffered_rmqueue() is probably inlined, I'm not sure a change would
be beneficial.
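
As a rough illustration of that direction (untested, and glossing over where
interrupts are disabled around the statistics update):

/* in get_page_from_freelist(), roughly: */
page = buffered_rmqueue(zone, order, gfp_flags);
if (page) {
	zone_statistics(preferred_zone, zone);
	break;
}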

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask
  2008-02-28 21:32   ` Andrew Morton
  2008-02-28 21:53     ` Lee Schermerhorn
@ 2008-02-29 14:50     ` Mel Gorman
  2008-02-29 15:48       ` Lee Schermerhorn
  2008-03-06 18:41     ` [PATCH 4/6] Use two zonelist that are filtered by GFP mask Mel Gorman
  2 siblings, 1 reply; 37+ messages in thread
From: Mel Gorman @ 2008-02-29 14:50 UTC (permalink / raw
  To: Andrew Morton
  Cc: Lee Schermerhorn, ak, clameter, kamezawa.hiroyu, linux-mm,
	rientjes, eric.whitney

On (28/02/08 13:32), Andrew Morton didst pronounce:
> On Wed, 27 Feb 2008 16:47:34 -0500
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> 
> > +/* Returns the first zone at or below highest_zoneidx in a zonelist */
> > +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
> > +					enum zone_type highest_zoneidx)
> > +{
> > +	struct zone **z;
> > +
> > +	/* Find the first suitable zone to use for the allocation */
> > +	z = zonelist->zones;
> > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > +		z++;
> > +
> > +	return z;
> > +}
> > +
> > +/* Returns the next zone at or below highest_zoneidx in a zonelist */
> > +static inline struct zone **next_zones_zonelist(struct zone **z,
> > +					enum zone_type highest_zoneidx)
> > +{
> > +	/* Find the next suitable zone to use for the allocation */
> > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > +		z++;
> > +
> > +	return z;
> > +}
> > +
> > +/**
> > + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> > + * @zone - The current zone in the iterator
> > + * @z - The current pointer within zonelist->zones being iterated
> > + * @zlist - The zonelist being iterated
> > + * @highidx - The zone index of the highest zone to return
> > + *
> > + * This iterator iterates through all zones at or below a given zone index.
> > + */
> > +#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> > +	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
> > +		zone;							\
> > +		z = next_zones_zonelist(z, highidx), zone = *z++)
> > +
> 
> omygawd will that thing generate a lot of code!
> 
> It has four call sites in mm/oom_kill.c and the overall patchset increases
> mm/oom_kill.o's text section (x86_64 allmodconfig) from 3268 bytes to 3845.
> 

Yeah... that's pretty bad. They were inlined to avoid function call overhead
when trying to avoid any additional performance overhead but the text overhead
is not helping either. I'll start looking at things to uninline and see what
can be gained text-reduction wise without mucking performance.

> vmscan.o and page_alloc.o also grew a lot.  otoh total vmlinux bloat from
> the patchset is only around 700 bytes, so I expect that with a little less
> insanity we could actually get an aggregate improvement here.
> 
> Some of the inlining in mmzone.h is just comical.  Some of it is obvious
> (first_zones_zonelist) and some of it is less obvious (pfn_present).
> 
> I applied these for testing but I really don't think we should be merging
> such easily-fixed regressions into mainline.  Could someone please take a
> look at de-porking core MM?
> 
> 
> Also, I switched all your Tested-by:s to Signed-off-by:s.  You were on the
> delivery path, so s-o-b is the appropriate tag.  I would like to believe
> that Signed-off-by: implies Tested-by: anyway (rofl).
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask
  2008-02-29 14:50     ` Mel Gorman
@ 2008-02-29 15:48       ` Lee Schermerhorn
  2008-02-29 21:07         ` Christoph Lameter
  2008-03-04 18:01         ` Mel Gorman
  0 siblings, 2 replies; 37+ messages in thread
From: Lee Schermerhorn @ 2008-02-29 15:48 UTC (permalink / raw
  To: Mel Gorman
  Cc: Andrew Morton, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

On Fri, 2008-02-29 at 14:50 +0000, Mel Gorman wrote:
> On (28/02/08 13:32), Andrew Morton didst pronounce:
> > On Wed, 27 Feb 2008 16:47:34 -0500
> > Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> > 
> > > +/* Returns the first zone at or below highest_zoneidx in a zonelist */
> > > +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
> > > +					enum zone_type highest_zoneidx)
> > > +{
> > > +	struct zone **z;
> > > +
> > > +	/* Find the first suitable zone to use for the allocation */
> > > +	z = zonelist->zones;
> > > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > > +		z++;
> > > +
> > > +	return z;
> > > +}
> > > +
> > > +/* Returns the next zone at or below highest_zoneidx in a zonelist */
> > > +static inline struct zone **next_zones_zonelist(struct zone **z,
> > > +					enum zone_type highest_zoneidx)
> > > +{
> > > +	/* Find the next suitable zone to use for the allocation */
> > > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > > +		z++;
> > > +
> > > +	return z;
> > > +}
> > > +
> > > +/**
> > > + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> > > + * @zone - The current zone in the iterator
> > > + * @z - The current pointer within zonelist->zones being iterated
> > > + * @zlist - The zonelist being iterated
> > > + * @highidx - The zone index of the highest zone to return
> > > + *
> > > + * This iterator iterates through all zones at or below a given zone index.
> > > + */
> > > +#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> > > +	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
> > > +		zone;							\
> > > +		z = next_zones_zonelist(z, highidx), zone = *z++)
> > > +
> > 
> > omygawd will that thing generate a lot of code!
> > 
> > It has four call sites in mm/oom_kill.c and the overall patchset increases
> > mm/oom_kill.o's text section (x86_64 allmodconfig) from 3268 bytes to 3845.
> > 
> 
> Yeah... that's pretty bad. They were inlined to avoid function call overhead
> when trying to avoid any additional performance overhead but the text overhead
> is not helping either. I'll start looking at things to uninline and see what
> can be gained text-reduction wise without mucking performance.
> 
> > vmscan.o and page_alloc.o also grew a lot.  otoh total vmlinux bloat from
> > the patchset is only around 700 bytes, so I expect that with a little less
> > insanity we could actually get an aggregate improvement here.

Mel:

Thinking about this:

for_each_zone_zonelist():

Seems like the call sites to this macro are not hot paths, so maybe
these can call out to a zonelist iterator func in page_alloc.c or, as
Kame-san suggested, mmzone.c.

+ oom_kill and vmscan call sites:  if these are hot, we're already in,
uh..., slow mode. 

+ usage in slab.c and slub.c appears to be the fallback/slow path.
Christoph can chime in, here, if he disagrees.

+ in page_alloc.c:  waking up of kswapd and counting free zone pages
[mostly for init code] don't appear to be fast paths.  

+ The call site in hugetlb.c is in the huge-page allocation path, which
is under a global spinlock.  So, any slowdown here could result in
longer lock hold time and higher contention.  But, I have to believe
that in the grand scheme of things, huge-page allocation is not that
hot.  [Someone faulting in terabytes of hugepages might contest that.]

That leaves the call to for_each_zone_zonelist_nodemask() in
get_page_from_freelist().  This might be deserving of inlining?

If this works out, we could end up with these macros being inlined in
only 2 places:  get_page_from_freelist() and a to-be-designed zonelist
iterator function.  [In fact, I believe that such an iterator need not
expose the details of zonelists outside of page_alloc/mmzone, but that
would require more rework of the call sites, and additional helper
functions.  Maybe someday...]

Comments?

Right now, I've got to build/test the latest reclaim scalability patches
that Rik posted, and clean up the issues already pointed out.  If you
don't get to this, I can look at it further next week.

Lee

> > 
> > Some of the inlining in mmzone.h is just comical.  Some of it is obvious
> > (first_zones_zonelist) and some of it is less obvious (pfn_present).
> > 
> > I applied these for testing but I really don't think we should be merging
> > such easily-fixed regressions into mainline.  Could someone please take a
> > look at de-porking core MM?
> > 
> > 
> > Also, I switched all your Tested-by:s to Signed-off-by:s.  You were on the
> > delivery path, so s-o-b is the appropriate tag.  I would like to believe
> > that Signed-off-by: implies Tested-by: anyway (rofl).
> > 
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask
  2008-02-29 15:48       ` Lee Schermerhorn
@ 2008-02-29 21:07         ` Christoph Lameter
  2008-03-04 18:01         ` Mel Gorman
  1 sibling, 0 replies; 37+ messages in thread
From: Christoph Lameter @ 2008-02-29 21:07 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: Mel Gorman, Andrew Morton, ak, kamezawa.hiroyu, linux-mm,
	rientjes, eric.whitney

On Fri, 29 Feb 2008, Lee Schermerhorn wrote:

> + usage in slab.c and slub.c appears to be the fallback/slow path.
> Christoph can chime in, here, if he disagrees.

Correct. And in 2.6.25 slub will start to buffer page allocator allocations in
order to avoid the current issue of 4k allocations being slower through the
page allocator than through slab, due to page allocator inefficiency.

I think we need a new fastpath for the page allocator! (No, not me, I am
already handling a gazillion patches.)



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask
  2008-02-29 15:48       ` Lee Schermerhorn
  2008-02-29 21:07         ` Christoph Lameter
@ 2008-03-04 18:01         ` Mel Gorman
  2008-03-05 16:06           ` [PATCH] 2.6.25-rc3-mm1 - Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask Lee Schermerhorn
  1 sibling, 1 reply; 37+ messages in thread
From: Mel Gorman @ 2008-03-04 18:01 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: Andrew Morton, ak, clameter, kamezawa.hiroyu, linux-mm, rientjes,
	eric.whitney

On (29/02/08 10:48), Lee Schermerhorn didst pronounce:
> On Fri, 2008-02-29 at 14:50 +0000, Mel Gorman wrote:
> > On (28/02/08 13:32), Andrew Morton didst pronounce:
> > > On Wed, 27 Feb 2008 16:47:34 -0500
> > > Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> > > 
> > > > +/* Returns the first zone at or below highest_zoneidx in a zonelist */
> > > > +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
> > > > +					enum zone_type highest_zoneidx)
> > > > +{
> > > > +	struct zone **z;
> > > > +
> > > > +	/* Find the first suitable zone to use for the allocation */
> > > > +	z = zonelist->zones;
> > > > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > > > +		z++;
> > > > +
> > > > +	return z;
> > > > +}
> > > > +
> > > > +/* Returns the next zone at or below highest_zoneidx in a zonelist */
> > > > +static inline struct zone **next_zones_zonelist(struct zone **z,
> > > > +					enum zone_type highest_zoneidx)
> > > > +{
> > > > +	/* Find the next suitable zone to use for the allocation */
> > > > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > > > +		z++;
> > > > +
> > > > +	return z;
> > > > +}
> > > > +
> > > > +/**
> > > > + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> > > > + * @zone - The current zone in the iterator
> > > > + * @z - The current pointer within zonelist->zones being iterated
> > > > + * @zlist - The zonelist being iterated
> > > > + * @highidx - The zone index of the highest zone to return
> > > > + *
> > > > + * This iterator iterates through all zones at or below a given zone index.
> > > > + */
> > > > +#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> > > > +	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
> > > > +		zone;							\
> > > > +		z = next_zones_zonelist(z, highidx), zone = *z++)
> > > > +
> > > 
> > > omygawd will that thing generate a lot of code!
> > > 
> > > It has four call sites in mm/oom_kill.c and the overall patchset increases
> > > mm/oom_kill.o's text section (x86_64 allmodconfig) from 3268 bytes to 3845.
> > > 
> > 
> > Yeah... that's pretty bad. They were inlined to avoid function call overhead
> > when trying to avoid any additional performance overhead but the text overhead
> > is not helping either. I'll start looking at things to uninline and see what
> > can be gained text-reduction wise without mucking performance.
> > 
> > > vmscan.o and page_alloc.o also grew a lot.  otoh total vmlinux bloat from
> > > the patchset is only around 700 bytes, so I expect that with a little less
> > > insanity we could actually get an aggregate improvement here.
> 
> Mel:
> 
> Thinking about this:
> 
> for_each_zone_zonelist():
> 
> Seems like the call sites to this macro are not hot paths, so maybe
> these can call out to a zonelist iterator func in page_alloc.c or, as
> Kame-san suggested, mmzone.c.
> 

I am trying an uninlined version in mmzone.c to see what it looks like. As
expected there is less text bloat, and I'll know in another day whether
it makes a performance difference or not. As you note below, the
majority of call sites are not in hot paths as such.
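
The rough shape of that is below (sketch only; it keeps the struct zone **
form quoted above for brevity, whereas the posted version would presumably
be in terms of the zonerefs introduced in patch 5):

/* mm/mmzone.c */
struct zone **next_zones_zonelist(struct zone **z, enum zone_type highest_zoneidx)
{
	/* Find the next suitable zone to use for the allocation */
	while (*z && zone_idx(*z) > highest_zoneidx)
		z++;

	return z;
}

/* include/linux/mmzone.h */
struct zone **next_zones_zonelist(struct zone **z, enum zone_type highest_zoneidx);

static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
					enum zone_type highest_zoneidx)
{
	return next_zones_zonelist(zonelist->zones, highest_zoneidx);
}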

> + oom_kill and vmscan call sites:  if these are hot, we're already in,
> uh..., slow mode. 
> 
> + usage in slab.c and slub.c appears to be the fallback/slow path.
> Christoph can chime in, here, if he disagrees.
> 
> + in page_alloc.c:  waking up of kswapd and counting free zone pages
> [mostly for init code] don't appear to be fast paths.  
> 
> + The call site in hugetlb.c is in the huge-page allocation path, which
> is under a global spinlock.  So, any slowdown here could result in
> longer lock hold time and higher contention.  But, I have to believe
> that in the grand scheme of things, huge-page allocation is not that
> hot.  [Someone faulting in terabytes of hugepages might contest that.]
> 
> That leaves the call to for_each_zone_zonelist_nodemask() in
> get_page_from_freelist().  This might be deserving of inlining?
> 
> If this works out, we could end up with these macros being inlined in
> only 2 places:  get_page_from_freelist() and a to-be-designed zonelist
> iterator function.  [In fact, I believe that such an iterator need not
> expose the details of zonelists outside of page_alloc/mmzone, but that
> would require more rework of the call sites, and additional helper
> functions.  Maybe someday...]
> 
> Comments?
> 
> Right now, I've got to build/test the latest reclaim scalability patches
> that Rik posted, and clean up the issues already pointed out.  If you
> don't get to this, I can look at it further next week.
> 
> Lee
> 
> > > 
> > > Some of the inlining in mmzone.h is just comical.  Some of it is obvious
> > > (first_zones_zonelist) and some of it is less obvious (pfn_present).
> > > 
> > > I applied these for testing but I really don't think we should be merging
> > > such easily-fixed regressions into mainline.  Could someone please take a
> > > look at de-porking core MM?
> > > 
> > > 
> > > Also, I switched all your Tested-by:s to Signed-off-by:s.  You were on the
> > > delivery path, so s-o-b is the appropriate tag.  I would like to believe
> > > that Signed-off-by: implies Tested-by: anyway (rofl).
> > > 
> > 
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask
  2008-03-04 18:01         ` Mel Gorman
@ 2008-03-05 16:06           ` Lee Schermerhorn
  2008-03-05 18:03             ` Nishanth Aravamudan
  2008-03-06  0:39             ` [PATCH] 2.6.25-rc3-mm1 - Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask Andrew Morton
  0 siblings, 2 replies; 37+ messages in thread
From: Lee Schermerhorn @ 2008-03-05 16:06 UTC (permalink / raw
  To: Mel Gorman, Andrew Morton
  Cc: Nishanth Aravamudan, agl, wli, clameter, ak, kamezawa.hiroyu,
	rientjes, linux-mm, eric.whitney

PATCH Mempolicy - make dequeue_huge_page_vma() obey MPOL_BIND nodemask

dequeue_huge_page_vma() is not obeying the MPOL_BIND nodemask
with the zonelist rework.  It needs to search only zones in 
the mempolicy nodemask for hugepages.

Use for_each_zone_zonelist_nodemask() instead of
for_each_zone_zonelist().

Note:  this will bloat mm/hugetlb.o a bit until Mel reworks the
inlining of the for_each_zone... macros and helpers.

Added mempolicy helper function mpol_bind_nodemask() to hide
the details of mempolicy from hugetlb and to avoid
#ifdef CONFIG_NUMA in dequeue_huge_page_vma().

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |   13 +++++++++++++
 mm/hugetlb.c              |    4 +++-
 2 files changed, 16 insertions(+), 1 deletion(-)

Index: linux-2.6.25-rc3-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/hugetlb.c	2008-03-05 10:35:12.000000000 -0500
+++ linux-2.6.25-rc3-mm1/mm/hugetlb.c	2008-03-05 10:37:09.000000000 -0500
@@ -99,8 +99,10 @@ static struct page *dequeue_huge_page_vm
 					htlb_alloc_mask, &mpol);
 	struct zone *zone;
 	struct zoneref *z;
+	nodemask_t *nodemask = mpol_bind_nodemask(mpol);
 
-	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						MAX_NR_ZONES - 1, nodemask) {
 		nid = zone_to_nid(zone);
 		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
 		    !list_empty(&hugepage_freelists[nid])) {
Index: linux-2.6.25-rc3-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.25-rc3-mm1.orig/include/linux/mempolicy.h	2008-03-05 10:35:12.000000000 -0500
+++ linux-2.6.25-rc3-mm1/include/linux/mempolicy.h	2008-03-05 10:59:11.000000000 -0500
@@ -163,6 +163,14 @@ static inline void check_highest_zone(en
 		policy_zone = k;
 }
 
+static inline nodemask_t *mpol_bind_nodemask(struct mempolicy *mpol)
+{
+	if (mpol->policy == MPOL_BIND)
+		return &mpol->v.nodes;
+	else
+		return NULL;
+}
+
 int do_migrate_pages(struct mm_struct *mm,
 	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
 
@@ -255,6 +263,11 @@ static inline int do_migrate_pages(struc
 static inline void check_highest_zone(int k)
 {
 }
+
+static inline nodemask_t *mpol_bind_nodemask(struct mempolicy *mpol)
+{
+	return NULL;
+}
 #endif /* CONFIG_NUMA */
 #endif /* __KERNEL__ */
 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask
  2008-03-05 16:06           ` [PATCH] 2.6.25-rc3-mm1 - Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask Lee Schermerhorn
@ 2008-03-05 18:03             ` Nishanth Aravamudan
  2008-03-05 19:02               ` Lee Schermerhorn
  2008-03-06  0:39             ` [PATCH] 2.6.25-rc3-mm1 - Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Nishanth Aravamudan @ 2008-03-05 18:03 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: Mel Gorman, Andrew Morton, agl, wli, clameter, ak,
	kamezawa.hiroyu, rientjes, linux-mm, eric.whitney

On 05.03.2008 [11:06:34 -0500], Lee Schermerhorn wrote:
> PATCH Mempolicy - make dequeue_huge_page_vma() obey MPOL_BIND nodemask
> 
> dequeue_huge_page_vma() is not obeying the MPOL_BIND nodemask
> with the zonelist rework.  It needs to search only zones in 
> the mempolicy nodemask for hugepages.
> 
> Use for_each_zone_zonelist_nodemask() instead of
> for_each_zone_zonelist().
> 
> Note:  this will bloat mm/hugetlb.o a bit until Mel reworks the
> inlining of the for_each_zone... macros and helpers.
> 
> Added mempolicy helper function mpol_bind_nodemask() to hide
> the details of mempolicy from hugetlb and to avoid
> #ifdef CONFIG_NUMA in dequeue_huge_page_vma().
> 
> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
>  include/linux/mempolicy.h |   13 +++++++++++++
>  mm/hugetlb.c              |    4 +++-
>  2 files changed, 16 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.25-rc3-mm1/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.25-rc3-mm1.orig/mm/hugetlb.c	2008-03-05 10:35:12.000000000 -0500
> +++ linux-2.6.25-rc3-mm1/mm/hugetlb.c	2008-03-05 10:37:09.000000000 -0500
> @@ -99,8 +99,10 @@ static struct page *dequeue_huge_page_vm
>  					htlb_alloc_mask, &mpol);
>  	struct zone *zone;
>  	struct zoneref *z;
> +	nodemask_t *nodemask = mpol_bind_nodemask(mpol);

We get this mpol from huge_zonelist(). Would it perhaps make sense to
pass the nodemask as a parameter, too, to huge_zonelist(), rather than
adding mpol_bind_nodemask()? This is the only user of it in-tree.
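
i.e. something along these lines (a hypothetical signature, just to
illustrate the question):

struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
				gfp_t gfp_flags, struct mempolicy **mpol,
				nodemask_t **nodemask);

/* dequeue_huge_page_vma() would then do something like */
nodemask_t *nodemask;
struct zonelist *zonelist = huge_zonelist(vma, address, htlb_alloc_mask,
					&mpol, &nodemask);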

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask
  2008-03-05 18:03             ` Nishanth Aravamudan
@ 2008-03-05 19:02               ` Lee Schermerhorn
  2008-03-06  1:04                 ` Nishanth Aravamudan
  0 siblings, 1 reply; 37+ messages in thread
From: Lee Schermerhorn @ 2008-03-05 19:02 UTC (permalink / raw
  To: Nishanth Aravamudan
  Cc: Mel Gorman, Andrew Morton, agl, wli, clameter, ak,
	kamezawa.hiroyu, rientjes, linux-mm, eric.whitney

On Wed, 2008-03-05 at 10:03 -0800, Nishanth Aravamudan wrote:
> On 05.03.2008 [11:06:34 -0500], Lee Schermerhorn wrote:
> > PATCH Mempolicy - make dequeue_huge_page_vma() obey MPOL_BIND nodemask
> > 
> > dequeue_huge_page_vma() is not obeying the MPOL_BIND nodemask
> > with the zonelist rework.  It needs to search only zones in 
> > the mempolicy nodemask for hugepages.
> > 
> > Use for_each_zone_zonelist_nodemask() instead of
> > for_each_zone_zonelist().
> > 
> > Note:  this will bloat mm/hugetlb.o a bit until Mel reworks the
> > inlining of the for_each_zone... macros and helpers.
> > 
> > Added mempolicy helper function mpol_bind_nodemask() to hide
> > the details of mempolicy from hugetlb and to avoid
> > #ifdef CONFIG_NUMA in dequeue_huge_page_vma().
> > 
> > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> >  include/linux/mempolicy.h |   13 +++++++++++++
> >  mm/hugetlb.c              |    4 +++-
> >  2 files changed, 16 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6.25-rc3-mm1/mm/hugetlb.c
> > ===================================================================
> > --- linux-2.6.25-rc3-mm1.orig/mm/hugetlb.c	2008-03-05 10:35:12.000000000 -0500
> > +++ linux-2.6.25-rc3-mm1/mm/hugetlb.c	2008-03-05 10:37:09.000000000 -0500
> > @@ -99,8 +99,10 @@ static struct page *dequeue_huge_page_vm
> >  					htlb_alloc_mask, &mpol);
> >  	struct zone *zone;
> >  	struct zoneref *z;
> > +	nodemask_t *nodemask = mpol_bind_nodemask(mpol);
> 
> We get this mpol from huge_zonelist(). Would it perhaps make sense to
> pass the nodemask as a parameter, too, to huge_zonelist(), rather than
> adding mpol_bind_nodemask()? This is the only user of it in-tree.

Nish:

I thought of that.  I didn't go that way because I'd either need to pass
a [pointer to a pointer to] a nodemask in addition to the [pointer to a
pointer to] the mpol, so that I can release the reference on the mpol
after the allocation is finished; or I'd need to copy the nodemask
[which can get pretty big] in the allocation path.  I wanted to avoid
both of those.  I suppose I could be convinced that one or the other of
those options is better than the single use helper function.  What do
you think?

Lee


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask
  2008-03-05 16:06           ` [PATCH] 2.6.25-rc3-mm1 - Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask Lee Schermerhorn
  2008-03-05 18:03             ` Nishanth Aravamudan
@ 2008-03-06  0:39             ` Andrew Morton
  2008-03-06 15:17               ` Lee Schermerhorn
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2008-03-06  0:39 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: mel, nacc, agl, wli, clameter, ak, kamezawa.hiroyu, rientjes,
	linux-mm, eric.whitney

On Wed, 05 Mar 2008 11:06:34 -0500
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> Subject: [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask

whine.

a) Please put the "2.6.25-rc3-mm1" stuff inside [].  This tells the
   receiver that this is not-to-be-included metadata pertaining only to
   this email.

b) Sometimes (quite often) people send me patches advertised as being
   for "2.6.x-mm-y" whereas they actually fix problems which are in
   mainline (or even -stable).

   So I tend to not believe it when people say that.

> Date: Wed, 05 Mar 2008 11:06:34 -0500
> Organization: HP/OSLO
> X-Mailer: Evolution 2.6.1 
> 
> PATCH Mempolicy - make dequeue_huge_page_vma() obey MPOL_BIND nodemask
> 
> dequeue_huge_page_vma() is not obeying the MPOL_BIND nodemask
> with the zonelist rework.  It needs to search only zones in 
> the mempolicy nodemask for hugepages.
> 
> Use for_each_zone_zonelist_nodemask() instead of
> for_each_zone_zonelist().
> 
> Note:  this will bloat mm/hugetlb.o a bit until Mel reworks the
> inlining of the for_each_zone... macros and helpers.
> 
> Added mempolicy helper function mpol_bind_nodemask() to hide
> the details of mempolicy from hugetlb and to avoid
> #ifdef CONFIG_NUMA in dequeue_huge_page_vma().
> 

But this patch does indeed fix an only-in-mm problem.  afaict it is a
bugfix against mm-filter-based-on-a-nodemask-as-well-as-a-gfp_mask.patch. 
At least, that's how I've staged it.

If you have already worked that information out, please let me know.

> Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

You're "Lee", not " Lee" ;)


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask
  2008-03-05 19:02               ` Lee Schermerhorn
@ 2008-03-06  1:04                 ` Nishanth Aravamudan
  2008-03-06 15:38                   ` Lee Schermerhorn
  2008-03-06 21:24                   ` [PATCH] Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework Lee Schermerhorn
  0 siblings, 2 replies; 37+ messages in thread
From: Nishanth Aravamudan @ 2008-03-06  1:04 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: Mel Gorman, Andrew Morton, agl, wli, clameter, ak,
	kamezawa.hiroyu, rientjes, linux-mm, eric.whitney

On 05.03.2008 [14:02:53 -0500], Lee Schermerhorn wrote:
> On Wed, 2008-03-05 at 10:03 -0800, Nishanth Aravamudan wrote:
> > On 05.03.2008 [11:06:34 -0500], Lee Schermerhorn wrote:
> > > PATCH Mempolicy - make dequeue_huge_page_vma() obey MPOL_BIND nodemask
> > > 
> > > dequeue_huge_page_vma() is not obeying the MPOL_BIND nodemask
> > > with the zonelist rework.  It needs to search only zones in 
> > > the mempolicy nodemask for hugepages.
> > > 
> > > Use for_each_zone_zonelist_nodemask() instead of
> > > for_each_zone_zonelist().
> > > 
> > > Note:  this will bloat mm/hugetlb.o a bit until Mel reworks the
> > > inlining of the for_each_zone... macros and helpers.
> > > 
> > > Added mempolicy helper function mpol_bind_nodemask() to hide
> > > the details of mempolicy from hugetlb and to avoid
> > > #ifdef CONFIG_NUMA in dequeue_huge_page_vma().
> > > 
> > > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > 
> > >  include/linux/mempolicy.h |   13 +++++++++++++
> > >  mm/hugetlb.c              |    4 +++-
> > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > > 
> > > Index: linux-2.6.25-rc3-mm1/mm/hugetlb.c
> > > ===================================================================
> > > --- linux-2.6.25-rc3-mm1.orig/mm/hugetlb.c	2008-03-05 10:35:12.000000000 -0500
> > > +++ linux-2.6.25-rc3-mm1/mm/hugetlb.c	2008-03-05 10:37:09.000000000 -0500
> > > @@ -99,8 +99,10 @@ static struct page *dequeue_huge_page_vm
> > >  					htlb_alloc_mask, &mpol);
> > >  	struct zone *zone;
> > >  	struct zoneref *z;
> > > +	nodemask_t *nodemask = mpol_bind_nodemask(mpol);
> > 
> > We get this mpol from huge_zonelist(). Would it perhaps make sense to
> > pass the nodemask as a parameter, too, to huge_zonelist(), rather than
> > adding mpol_bind_nodemask()? This is the only user of it in-tree.
> 
> Nish:
> 
> I thought of that.  I didn't go that way because I'd either need to
> pass a [pointer to a pointer to] a nodemask in addition to the
> [pointer to a pointer to] the mpol, so that I can release the
> reference on the mpol after the allocation is finished;

See I looked at that and thought: "We're already passing a pointer to a
pointer to mpol, so a pointer to a pointer to a nodemask shouldn't be
that big of deal. This is the one call-site, as well. The idea being,
we've pushed as much of the zonelist/nodemask knowledge into
huge_zonelist(), keeping hugetlb.c relatively clear of it. Maybe it
doesn't matter, was really just a question. Not sure what other folks
think.

> or I'd need to copy the nodemask [which can get pretty big] in the
> allocation path.  I wanted to avoid both of those.  I suppose I could
> be convinced that one or the other of those options is better than the
> single use helper function.  What do you think?

What you have is fine, I guess -- and has been picked up by Andrew.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask
  2008-03-06  0:39             ` [PATCH] 2.6.25-rc3-mm1 - Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask Andrew Morton
@ 2008-03-06 15:17               ` Lee Schermerhorn
  0 siblings, 0 replies; 37+ messages in thread
From: Lee Schermerhorn @ 2008-03-06 15:17 UTC (permalink / raw
  To: Andrew Morton
  Cc: mel, nacc, agl, wli, clameter, ak, kamezawa.hiroyu, rientjes,
	linux-mm, eric.whitney

On Wed, 2008-03-05 at 16:39 -0800, Andrew Morton wrote:
> On Wed, 05 Mar 2008 11:06:34 -0500
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > Subject: [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask
> 
> whine.
> 
> a) Please put the "2.6.25-rc3-mm1" stuff inside [].  This tells the
>    receiver that this is not-to-be-included metadata pertaining only to
>    this email.
> 
> b) Sometimes (quite often) people send me patches advertised as being
>    for "2.6.x-mm-y" whereas they actually fix problems which are in
>    mainline (or even -stable).
> 
>    So I tend to not believe it when people say that.
> 
> > Date: Wed, 05 Mar 2008 11:06:34 -0500
> > Organization: HP/OSLO
> > X-Mailer: Evolution 2.6.1 
> > 
> > PATCH Mempolicy - make dequeue_huge_page_vma() obey MPOL_BIND nodemask
> > 
> > dequeue_huge_page_vma() is not obeying the MPOL_BIND nodemask
> > with the zonelist rework.  It needs to search only zones in 
> > the mempolicy nodemask for hugepages.
> > 
> > Use for_each_zone_zonelist_nodemask() instead of
> > for_each_zone_zonelist().
> > 
> > Note:  this will bloat mm/hugetlb.o a bit until Mel reworks the
> > inlining of the for_each_zone... macros and helpers.
> > 
> > Added mempolicy helper function mpol_bind_nodemask() to hide
> > the details of mempolicy from hugetlb and to avoid
> > #ifdef CONFIG_NUMA in dequeue_huge_page_vma().
> > 
> 
> But this patch does indeed fix an only-in-mm problem.  afacit it is a
> bugfix against mm-filter-based-on-a-nodemask-as-well-as-a-gfp_mask.patch. 
> At least, that's how I've staged it.
> 
> If you have already worked that information out, please let me know.

ACK.  Sorry, didn't mean to make more work for you.  

And, I think you've told me this before--to mention when a problem was
introduced by another patch :-(.

> 
> > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> 
> You're "Lee", not " Lee" ;)

Ah... I didn't know the tools were that sensitive to whitespace.  

Lee  


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] 2.6.25-rc3-mm1 - Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask
  2008-03-06  1:04                 ` Nishanth Aravamudan
@ 2008-03-06 15:38                   ` Lee Schermerhorn
  2008-03-06 21:24                   ` [PATCH] Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework Lee Schermerhorn
  1 sibling, 0 replies; 37+ messages in thread
From: Lee Schermerhorn @ 2008-03-06 15:38 UTC (permalink / raw
  To: Nishanth Aravamudan
  Cc: Mel Gorman, Andrew Morton, agl, wli, clameter, ak,
	kamezawa.hiroyu, rientjes, linux-mm, eric.whitney

On Wed, 2008-03-05 at 17:04 -0800, Nishanth Aravamudan wrote:
> On 05.03.2008 [14:02:53 -0500], Lee Schermerhorn wrote:
> > On Wed, 2008-03-05 at 10:03 -0800, Nishanth Aravamudan wrote:
> > > On 05.03.2008 [11:06:34 -0500], Lee Schermerhorn wrote:
> > > > PATCH Mempolicy - make dequeue_huge_page_vma() obey MPOL_BIND nodemask
> > > > 
> > > > dequeue_huge_page_vma() is not obeying the MPOL_BIND nodemask
> > > > with the zonelist rework.  It needs to search only zones in 
> > > > the mempolicy nodemask for hugepages.
> > > > 
> > > > Use for_each_zone_zonelist_nodemask() instead of
> > > > for_each_zone_zonelist().
> > > > 
> > > > Note:  this will bloat mm/hugetlb.o a bit until Mel reworks the
> > > > inlining of the for_each_zone... macros and helpers.
> > > > 
> > > > Added mempolicy helper function mpol_bind_nodemask() to hide
> > > > the details of mempolicy from hugetlb and to avoid
> > > > #ifdef CONFIG_NUMA in dequeue_huge_page_vma().
> > > > 
> > > > Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > > > 
> > > >  include/linux/mempolicy.h |   13 +++++++++++++
> > > >  mm/hugetlb.c              |    4 +++-
> > > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > > > 
> > > > Index: linux-2.6.25-rc3-mm1/mm/hugetlb.c
> > > > ===================================================================
> > > > --- linux-2.6.25-rc3-mm1.orig/mm/hugetlb.c	2008-03-05 10:35:12.000000000 -0500
> > > > +++ linux-2.6.25-rc3-mm1/mm/hugetlb.c	2008-03-05 10:37:09.000000000 -0500
> > > > @@ -99,8 +99,10 @@ static struct page *dequeue_huge_page_vm
> > > >  					htlb_alloc_mask, &mpol);
> > > >  	struct zone *zone;
> > > >  	struct zoneref *z;
> > > > +	nodemask_t *nodemask = mpol_bind_nodemask(mpol);
> > > 
> > > We get this mpol from huge_zonelist(). Would it perhaps make sense to
> > > pass the nodemask as a parameter, too, to huge_zonelist(), rather than
> > > adding mpol_bind_nodemask()? This is the only user of it in-tree.
> > 
> > Nish:
> > 
> > I thought of that.  I didn't go that way because I'd either need to
> > pass a [pointer to a pointer to] a nodemask in addition to the
> > [pointer to a pointer to] the mpol, so that I can release the
> > reference on the mpol after the allocation is finished;
> 
> See I looked at that and thought: "We're already passing a pointer to a
> pointer to mpol, so a pointer to a pointer to a nodemask shouldn't be
> that big of deal. 

:-)  I looked at that and thought:  "Yuck!  we're already passing a
pointer to a pointer, ...  I don't want to add another one."  Don't know
why I have such an aversion to passing results back like that.  


> This is the one call-site, as well. The idea being,
> we've pushed as much of the zonelist/nodemask knowledge into
> huge_zonelist(), keeping hugetlb.c relatively clear of it. Maybe it
> doesn't matter, was really just a question. 

No, I do see your point.  As I say, I had the same thoughts.

> Not sure what other folks
> think.

> 
> > or I'd need to copy the nodemask [which can get pretty big] in the
> > allocation path.  I wanted to avoid both of those.  I suppose I could
> > be convinced that one or the other of those options is better than the
> > single use helper function.  What do you think?
> 
> What you have is fine, I guess -- and has been picked up by Andrew.

We'll need more work in this area, I think.  I can fix it up then.  I'll
try a patch and see how it "feels"...

Thanks for your attention,
Lee


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/6] Use two zonelist that are filtered by GFP mask
  2008-02-28 21:32   ` Andrew Morton
  2008-02-28 21:53     ` Lee Schermerhorn
  2008-02-29 14:50     ` Mel Gorman
@ 2008-03-06 18:41     ` Mel Gorman
  2 siblings, 0 replies; 37+ messages in thread
From: Mel Gorman @ 2008-03-06 18:41 UTC (permalink / raw
  To: Andrew Morton
  Cc: Lee Schermerhorn, ak, clameter, kamezawa.hiroyu, linux-mm,
	rientjes, eric.whitney

On (28/02/08 13:32), Andrew Morton didst pronounce:
> On Wed, 27 Feb 2008 16:47:34 -0500
> Lee Schermerhorn <lee.schermerhorn@hp.com> wrote:
> 
> > +/* Returns the first zone at or below highest_zoneidx in a zonelist */
> > +static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
> > +					enum zone_type highest_zoneidx)
> > +{
> > +	struct zone **z;
> > +
> > +	/* Find the first suitable zone to use for the allocation */
> > +	z = zonelist->zones;
> > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > +		z++;
> > +
> > +	return z;
> > +}
> > +
> > +/* Returns the next zone at or below highest_zoneidx in a zonelist */
> > +static inline struct zone **next_zones_zonelist(struct zone **z,
> > +					enum zone_type highest_zoneidx)
> > +{
> > +	/* Find the next suitable zone to use for the allocation */
> > +	while (*z && zone_idx(*z) > highest_zoneidx)
> > +		z++;
> > +
> > +	return z;
> > +}
> > +
> > +/**
> > + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> > + * @zone - The current zone in the iterator
> > + * @z - The current pointer within zonelist->zones being iterated
> > + * @zlist - The zonelist being iterated
> > + * @highidx - The zone index of the highest zone to return
> > + *
> > + * This iterator iterates though all zones at or below a given zone index.
> > + */
> > +#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> > +	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
> > +		zone;							\
> > +		z = next_zones_zonelist(z, highidx), zone = *z++)
> > +
> 
> omygawd will that thing generate a lot of code!
> 
> It has four call sites in mm/oom_kill.c and the overall patchset increases
> mm/oom_kill.o's text section (x86_64 allmodconfig) from 3268 bytes to 3845.
> 
> vmscan.o and page_alloc.o also grew a lot.  otoh total vmlinux bloat from
> the patchset is only around 700 bytes, so I expect that with a little less
> insanity we could actually get an aggregate improvement here.
> 
> Some of the inlining in mmzone.h is just comical.  Some of it is obvious
> (first_zones_zonelist) and some of it is less obvious (pfn_present).
> 
> I applied these for testing but I really don't think we should be merging
> such easily-fixed regressions into mainline.  Could someone please take a
> look at de-porking core MM?
> 

Here is a first attempt at uninlining the zonelist walking. It doesn't use
a callback iterator because, at first glance, that appeared likely to cause
more bloat. With the version of the patches posted, the increase in text size
of vmlinux was 946 bytes (x86_64 allmodconfig); with this patch it's 613 bytes,
so it's a bit of an improvement.

Performance is still good. Compared to the two-zonelist patches alone, it is
neither a definitive gain nor a loss, as different machines show different
results. Compared to the vanilla kernel, it is still generally a win, so for
the reduction in text size it is likely worth it overall.

In comparison to two-zonelist, the performance differences look like this:

Kernbench Elapsed    time      -0.66      to 2.03
Kernbench Total  CPU           -0.73      to 0.94
Hackbench pipes-1              -8.23      to 29.65
Hackbench pipes-4              -8.10      to 10.98
Hackbench pipes-8              -17.55     to 8.66
Hackbench pipes-16             -12.15     to 3.16
Hackbench sockets-1            -6.90      to 11.78
Hackbench sockets-4            -1.75      to 3.89
Hackbench sockets-8            -7.53      to 5.78
Hackbench sockets-16           -3.78      to 5.89
TBench    clients-1            -1.98      to 12.01
TBench    clients-2            -3.83      to 11.02
TBench    clients-4            -4.49      to 6.32
TBench    clients-8            -5.65      to 5.76
DBench    clients-1-ext2       -5.13      to 0.98
DBench    clients-2-ext2       -2.42      to 8.07
DBench    clients-4-ext2       -4.64      to 1.92
DBench    clients-8-ext2       -5.57      to 9.90

DBench is the one I am frowning at most, as another file-based microbenchmark
showed huge variances, but it's not clear why it would be particularly
sensitive.

Suggestions on how to improve the patch or alternatives?

==CUT HERE==
Subject: Uninline zonelist iterator helper functions

The order in which zones are accessed is determined by a zonelist, and the
iterator for walking it is for_each_zone_zonelist_nodemask(). That iterator
uses a number of large inline functions which bloat the vmlinux file
unnecessarily. This patch uninlines one helper function and reduces the size
of another. Some iterator logic is then pushed into the uninlined function to
reduce text duplication.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 fs/buffer.c            |    4 ++-
 include/linux/mmzone.h |   62 +++++++------------------------------------------
 mm/mempolicy.c         |    4 ++-
 mm/mmzone.c            |   31 ++++++++++++++++++++++++
 mm/page_alloc.c        |   10 ++++++-
 5 files changed, 55 insertions(+), 56 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc2-mm1-1010_update_documentation/fs/buffer.c linux-2.6.25-rc2-mm1-1020_uninline/fs/buffer.c
--- linux-2.6.25-rc2-mm1-1010_update_documentation/fs/buffer.c	2008-03-03 14:39:40.000000000 +0000
+++ linux-2.6.25-rc2-mm1-1020_uninline/fs/buffer.c	2008-03-06 11:26:07.000000000 +0000
@@ -369,6 +369,7 @@ void invalidate_bdev(struct block_device
 static void free_more_memory(void)
 {
 	struct zoneref *zrefs;
+	struct zone *dummy;
 	int nid;
 
 	wakeup_pdflush(1024);
@@ -376,7 +377,8 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS), NULL);
+						gfp_zone(GFP_NOFS), NULL,
+						&dummy);
 		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc2-mm1-1010_update_documentation/include/linux/mmzone.h linux-2.6.25-rc2-mm1-1020_uninline/include/linux/mmzone.h
--- linux-2.6.25-rc2-mm1-1010_update_documentation/include/linux/mmzone.h	2008-03-03 14:39:40.000000000 +0000
+++ linux-2.6.25-rc2-mm1-1020_uninline/include/linux/mmzone.h	2008-03-06 11:14:59.000000000 +0000
@@ -750,59 +750,19 @@ static inline int zonelist_node_idx(stru
 #endif /* CONFIG_NUMA */
 }
 
-static inline void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
-{
-	zoneref->zone = zone;
-	zoneref->zone_idx = zone_idx(zone);
-}
-
-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
-{
-#ifdef CONFIG_NUMA
-	return node_isset(zonelist_node_idx(zref), *nodes);
-#else
-	return 1;
-#endif /* CONFIG_NUMA */
-}
+struct zoneref *next_zones_zonelist(struct zoneref *z,
+					enum zone_type highest_zoneidx,
+					nodemask_t *nodes,
+					struct zone **zone);
 
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+					nodemask_t *nodes,
+					struct zone **zone)
 {
-	struct zoneref *z;
-
-	/* Find the first suitable zone to use for the allocation */
-	z = zonelist->_zonerefs;
-	if (likely(nodes == NULL))
-		while (zonelist_zone_idx(z) > highest_zoneidx)
-			z++;
-	else
-		while (zonelist_zone_idx(z) > highest_zoneidx ||
-				(z->zone && !zref_in_nodemask(z, nodes)))
-			z++;
-
-	return z;
-}
-
-/* Returns the next zone at or below highest_zoneidx in a zonelist */
-static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
-					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
-{
-	/*
-	 * Find the next suitable zone to use for the allocation.
-	 * Only filter based on nodemask if it's set
-	 */
-	if (likely(nodes == NULL))
-		while (zonelist_zone_idx(z) > highest_zoneidx)
-			z++;
-	else
-		while (zonelist_zone_idx(z) > highest_zoneidx ||
-				(z->zone && !zref_in_nodemask(z, nodes)))
-			z++;
-
-	return z;
+	return next_zones_zonelist(zonelist->_zonerefs, highest_zoneidx, nodes,
+								zone);
 }
 
 /**
@@ -817,11 +777,9 @@ static inline struct zoneref *next_zones
  * within a given nodemask
  */
 #define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
-	for (z = first_zones_zonelist(zlist, highidx, nodemask),	\
-					zone = zonelist_zone(z++);	\
+	for (z = first_zones_zonelist(zlist, highidx, nodemask, &zone);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx, nodemask),		\
-					zone = zonelist_zone(z++))
+		z = next_zones_zonelist(z, highidx, nodemask, &zone))	\
 
 /**
  * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc2-mm1-1010_update_documentation/mm/mempolicy.c linux-2.6.25-rc2-mm1-1020_uninline/mm/mempolicy.c
--- linux-2.6.25-rc2-mm1-1010_update_documentation/mm/mempolicy.c	2008-03-03 14:39:40.000000000 +0000
+++ linux-2.6.25-rc2-mm1-1020_uninline/mm/mempolicy.c	2008-03-06 12:17:30.000000000 +0000
@@ -1214,10 +1214,12 @@ unsigned slab_node(struct mempolicy *pol
 		 */
 		struct zonelist *zonelist;
 		struct zoneref *z;
+		struct zone *dummy;
 		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
 		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
 		z = first_zones_zonelist(zonelist, highest_zoneidx,
-							&policy->v.nodes);
+							&policy->v.nodes,
+							&dummy);
 		return zonelist_node_idx(z);
 	}
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc2-mm1-1010_update_documentation/mm/mmzone.c linux-2.6.25-rc2-mm1-1020_uninline/mm/mmzone.c
--- linux-2.6.25-rc2-mm1-1010_update_documentation/mm/mmzone.c	2008-02-15 20:57:20.000000000 +0000
+++ linux-2.6.25-rc2-mm1-1020_uninline/mm/mmzone.c	2008-03-06 11:14:57.000000000 +0000
@@ -42,3 +42,34 @@ struct zone *next_zone(struct zone *zone
 	return zone;
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
+
+/* Returns the next zone at or below highest_zoneidx in a zonelist */
+struct zoneref *next_zones_zonelist(struct zoneref *z,
+					enum zone_type highest_zoneidx,
+					nodemask_t *nodes,
+					struct zone **zone)
+{
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
+
+	*zone = zonelist_zone(z++);
+	return z;
+}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.25-rc2-mm1-1010_update_documentation/mm/page_alloc.c linux-2.6.25-rc2-mm1-1020_uninline/mm/page_alloc.c
--- linux-2.6.25-rc2-mm1-1010_update_documentation/mm/page_alloc.c	2008-03-03 14:39:40.000000000 +0000
+++ linux-2.6.25-rc2-mm1-1020_uninline/mm/page_alloc.c	2008-03-06 11:11:00.000000000 +0000
@@ -1398,9 +1398,9 @@ get_page_from_freelist(gfp_t gfp_mask, n
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx, nodemask);
+	z = first_zones_zonelist(zonelist, high_zoneidx, nodemask,
+							&preferred_zone);
 	classzone_idx = zonelist_zone_idx(z);
-	preferred_zone = zonelist_zone(z);
 
 zonelist_scan:
 	/*
@@ -1966,6 +1966,12 @@ void show_free_areas(void)
 	show_swap_cache_info();
 }
 
+static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
+{
+	zoneref->zone = zone;
+	zoneref->zone_idx = zone_idx(zone);
+}
+
 /*
  * Builds allocation fallback zone lists.
  *
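
For completeness, callers are unchanged by this; walking a zonelist still
looks something like the sketch below (illustrative only, not taken from
the patch):

	struct zone *zone;
	struct zoneref *z;
	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
	nodemask_t *nodemask = NULL;	/* or e.g. &pol->v.nodes for MPOL_BIND */

	/* Visit each zone at or below the allowed index, filtered by nodemask */
	for_each_zone_zonelist_nodemask(zone, z, zonelist,
					gfp_zone(GFP_KERNEL), nodemask)
		printk(KERN_DEBUG "zone %s on node %d\n",
				zone->name, zone_to_nid(zone));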


* [PATCH] Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework
  2008-03-06  1:04                 ` Nishanth Aravamudan
  2008-03-06 15:38                   ` Lee Schermerhorn
@ 2008-03-06 21:24                   ` Lee Schermerhorn
  2008-03-07 17:35                     ` Nishanth Aravamudan
  1 sibling, 1 reply; 37+ messages in thread
From: Lee Schermerhorn @ 2008-03-06 21:24 UTC (permalink / raw
  To: Nishanth Aravamudan, Andrew Morton
  Cc: Mel Gorman, agl, wli, clameter, ak, kamezawa.hiroyu, rientjes,
	linux-mm, eric.whitney

Fix for earlier patch:
"mempolicy-make-dequeue_huge_page_vma-obey-bind-policy"

Against: 2.6.25-rc3-mm1 atop the above patch.

As suggested by Nish Aravamudan, remove the mpol_bind_nodemask()
helper and return a pointer to the policy node mask from
huge_zonelist for MPOL_BIND.  This hides more of the mempolicy
quirks from hugetlb.

In making this change, I noticed that the huge_zonelist() stub
for !NUMA wasn't nulling out the mpol.  Added that as well.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |   21 ++++++---------------
 mm/hugetlb.c              |    4 ++--
 mm/mempolicy.c            |   18 ++++++++++++------
 3 files changed, 20 insertions(+), 23 deletions(-)

Index: linux-2.6.25-rc3-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.25-rc3-mm1.orig/include/linux/mempolicy.h	2008-03-06 12:01:59.000000000 -0500
+++ linux-2.6.25-rc3-mm1/include/linux/mempolicy.h	2008-03-06 12:14:24.000000000 -0500
@@ -152,7 +152,8 @@ extern void mpol_fix_fork_child_flag(str
 
 extern struct mempolicy default_policy;
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
-		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
+				unsigned long addr, gfp_t gfp_flags,
+				struct mempolicy **mpol, nodemask_t **nodemask);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -163,14 +164,6 @@ static inline void check_highest_zone(en
 		policy_zone = k;
 }
 
-static inline nodemask_t *mpol_bind_nodemask(struct mempolicy *mpol)
-{
-	if (mpol->policy == MPOL_BIND)
-		return &mpol->v.nodes;
-	else
-		return NULL;
-}
-
 int do_migrate_pages(struct mm_struct *mm,
 	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
 
@@ -248,8 +241,11 @@ static inline void mpol_fix_fork_child_f
 }
 
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- 		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
+				unsigned long addr, gfp_t gfp_flags,
+				struct mempolicy **mpol, nodemask_t **nodemask)
 {
+	*mpol = NULL;
+	*nodemask = NULL;
 	return node_zonelist(0, gfp_flags);
 }
 
@@ -263,11 +259,6 @@ static inline int do_migrate_pages(struc
 static inline void check_highest_zone(int k)
 {
 }
-
-static inline nodemask_t *mpol_bind_nodemask(struct mempolicy *mpol)
-{
-	return NULL;
-}
 #endif /* CONFIG_NUMA */
 #endif /* __KERNEL__ */
 
Index: linux-2.6.25-rc3-mm1/mm/hugetlb.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/hugetlb.c	2008-03-06 12:01:59.000000000 -0500
+++ linux-2.6.25-rc3-mm1/mm/hugetlb.c	2008-03-06 12:03:06.000000000 -0500
@@ -95,11 +95,11 @@ static struct page *dequeue_huge_page_vm
 	int nid;
 	struct page *page = NULL;
 	struct mempolicy *mpol;
+	nodemask_t *nodemask;
 	struct zonelist *zonelist = huge_zonelist(vma, address,
-					htlb_alloc_mask, &mpol);
+					htlb_alloc_mask, &mpol, &nodemask);
 	struct zone *zone;
 	struct zoneref *z;
-	nodemask_t *nodemask = mpol_bind_nodemask(mpol);
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
Index: linux-2.6.25-rc3-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/mm/mempolicy.c	2008-03-06 12:01:59.000000000 -0500
+++ linux-2.6.25-rc3-mm1/mm/mempolicy.c	2008-03-06 12:33:58.000000000 -0500
@@ -1276,25 +1276,31 @@ static inline unsigned interleave_nid(st
  * @vma = virtual memory area whose policy is sought
  * @addr = address in @vma for shared policy lookup and interleave policy
  * @gfp_flags = for requested zone
- * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
+ * @mpol = pointer to mempolicy pointer for reference counted mempolicy
+ * @nodemask = pointer to nodemask pointer for MPOL_BIND nodemask
  *
  * Returns a zonelist suitable for a huge page allocation.
- * If the effective policy is 'BIND, returns pointer to policy's zonelist.
+ * If the effective policy is 'BIND, returns pointer to local node's zonelist,
+ * and a pointer to the mempolicy's @nodemask for filtering the zonelist.
  * If it is also a policy for which get_vma_policy() returns an extra
- * reference, we must hold that reference until after allocation.
+ * reference, we must hold that reference until after the allocation.
  * In that case, return policy via @mpol so hugetlb allocation can drop
- * the reference.  For non-'BIND referenced policies, we can/do drop the
+ * the reference. For non-'BIND referenced policies, we can/do drop the
  * reference here, so the caller doesn't need to know about the special case
  * for default and current task policy.
  */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
-				gfp_t gfp_flags, struct mempolicy **mpol)
+				gfp_t gfp_flags, struct mempolicy **mpol,
+				nodemask_t **nodemask)
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);
 	struct zonelist *zl;
 
 	*mpol = NULL;		/* probably no unref needed */
-	if (pol->policy == MPOL_INTERLEAVE) {
+	*nodemask = NULL;	/* assume !MPOL_BIND */
+	if (pol->policy == MPOL_BIND) {
+			*nodemask = &pol->v.nodes;
+	} else if (pol->policy == MPOL_INTERLEAVE) {
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);



* Re: [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask
  2008-02-29  2:59   ` KAMEZAWA Hiroyuki
@ 2008-03-07 11:56     ` Mel Gorman
  0 siblings, 0 replies; 37+ messages in thread
From: Mel Gorman @ 2008-03-07 11:56 UTC (permalink / raw
  To: KAMEZAWA Hiroyuki
  Cc: Lee Schermerhorn, akpm, ak, clameter, linux-mm, rientjes,
	eric.whitney

On (29/02/08 11:59), KAMEZAWA Hiroyuki didst pronounce:
> On Wed, 27 Feb 2008 16:47:47 -0500
> Lee Schermerhorn <lee.schermerhorn@hp.com>, Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask
> > 
> > V11r3 against 2.6.25-rc2-mm1
> > 
> > The MPOL_BIND policy creates a zonelist that is used for allocations
> > controlled by that mempolicy. As the per-node zonelist is already being
> > filtered based on a zone id, this patch adds a version of __alloc_pages()
> > that takes a nodemask for further filtering. This eliminates the need
> > for MPOL_BIND to create a custom zonelist.
> > 
> > A positive benefit of this is that allocations using MPOL_BIND now use the
> > local node's distance-ordered zonelist instead of a custom node-id-ordered
> > zonelist.  I.e., pages will be allocated from the closest allowed node with
> > available memory.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Christoph Lameter <clameter@sgi.com>
> > Tested-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
> > 
> 
> Thank you! I like this very much.
> Next step is maybe to pass nodemask to try_to_free_pages().
> 

Not a bad plan. I will visit it after the text bloat is reduced a bit.
Currently each usage of for_each_zone_zonelist_nodemask() adds a bit too
much.

> BTW, cpuset memory limitation by nodemask has the same kind of feature.
> But it seems cpuset_zone_allowed_soft/hardwall() has extra checks for system
> sanity. 
> 
> Could you import them ? maybe like this
> ==
> void __alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
> +			struct zonelist *zonelist, nodemask_t *nodemask))
> {
> 	if (nodemask) {
> 		if (unlikely(test_thread_flag(TIF_MEMDIE)))
> 	                nodemask = NULL;
> 		if ((gfp_mask & __GFP_HARDWALL)
>                      && (unlikely(test_thread_flag(TIF_MEMDIE)))
> 			nodemask = NULL;
> 	}
> }
> ==
> (I don't think above is clean.)

I get the idea, I'll check it out and see.
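
Roughly, I read the suggestion as the sketch below (untested, and the
helper name is made up for illustration). It only covers the TIF_MEMDIE
exception; the __GFP_HARDWALL case would need whatever extra checks
cpuset_zone_allowed_hardwall() applies:

	#include <linux/gfp.h>
	#include <linux/nodemask.h>
	#include <linux/sched.h>

	/* Relax nodemask filtering for a task that is being OOM killed */
	static nodemask_t *relax_nodemask(gfp_t gfp_mask, nodemask_t *nodemask)
	{
		/* gfp_mask reserved for an eventual __GFP_HARDWALL check */
		if (nodemask && unlikely(test_thread_flag(TIF_MEMDIE)))
			return NULL;	/* dying task may allocate anywhere */

		return nodemask;
	}

__alloc_pages_internal() would then call it on its nodemask argument before
walking the zonelist.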

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH] Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework
  2008-03-06 21:24                   ` [PATCH] Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework Lee Schermerhorn
@ 2008-03-07 17:35                     ` Nishanth Aravamudan
  2008-03-07 18:31                       ` Lee Schermerhorn
  0 siblings, 1 reply; 37+ messages in thread
From: Nishanth Aravamudan @ 2008-03-07 17:35 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: Andrew Morton, Mel Gorman, agl, wli, clameter, ak,
	kamezawa.hiroyu, rientjes, linux-mm, eric.whitney

On 06.03.2008 [16:24:53 -0500], Lee Schermerhorn wrote:
> 
> Fix for earlier patch:
> "mempolicy-make-dequeue_huge_page_vma-obey-bind-policy"
> 
> Against: 2.6.25-rc3-mm1 atop the above patch.
> 
> As suggested by Nish Aravamudan, remove the mpol_bind_nodemask()
> helper and return a pointer to the policy node mask from
> huge_zonelist for MPOL_BIND.  This hides more of the mempolicy
> quirks from hugetlb.
> 
> In making this change, I noticed that the huge_zonelist() stub
> for !NUMA wasn't nulling out the mpol.  Added that as well.

Hrm, I was thinking more of the following (on top of this patch):

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4c5d41d..3790f5a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1298,9 +1298,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 
 	*mpol = NULL;		/* probably no unref needed */
 	*nodemask = NULL;	/* assume !MPOL_BIND */
-	if (pol->policy == MPOL_BIND) {
-			*nodemask = &pol->v.nodes;
-	} else if (pol->policy == MPOL_INTERLEAVE) {
+	if (pol->policy == MPOL_INTERLEAVE) {
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
@@ -1310,10 +1308,12 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 
 	zl = zonelist_policy(GFP_HIGHUSER, pol);
 	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
-		if (pol->policy != MPOL_BIND)
+		if (pol->policy != MPOL_BIND) {
 			__mpol_free(pol);	/* finished with pol */
-		else
+		} else {
 			*mpol = pol;	/* unref needed after allocation */
+			*nodemask = &pol->v.nodes;
+		}
 	}
 	return zl;
 }

but perhaps that won't do the right thing if pol == current->mempolicy
and pol->policy == MPOL_BIND. So something like:


diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4c5d41d..7eb77e0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1298,9 +1298,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 
 	*mpol = NULL;		/* probably no unref needed */
 	*nodemask = NULL;	/* assume !MPOL_BIND */
-	if (pol->policy == MPOL_BIND) {
-			*nodemask = &pol->v.nodes;
-	} else if (pol->policy == MPOL_INTERLEAVE) {
+	if (pol->policy == MPOL_INTERLEAVE) {
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
@@ -1309,11 +1307,12 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
 	}
 
 	zl = zonelist_policy(GFP_HIGHUSER, pol);
-	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
-		if (pol->policy != MPOL_BIND)
-			__mpol_free(pol);	/* finished with pol */
-		else
+	if (unlikely(pol != &default_policy && pol != current->mempolicy
+						&& pol->policy != MPOL_BIND))
+		__mpol_free(pol);	/* finished with pol */
+	if (pol->policy == MPOL_BIND) {
 			*mpol = pol;	/* unref needed after allocation */
+			*nodemask = &pol->v.nodes;
 	}
 	return zl;
 }

Still not quite as clean, but I think it's best to keep the *mpol and
*nodemask assignments together, as if *mpol is being assigned, that's
the only time we should need to set *nodemask, right?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


* Re: [PATCH] Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework
  2008-03-07 17:35                     ` Nishanth Aravamudan
@ 2008-03-07 18:31                       ` Lee Schermerhorn
  2008-03-08  0:27                         ` Nishanth Aravamudan
  0 siblings, 1 reply; 37+ messages in thread
From: Lee Schermerhorn @ 2008-03-07 18:31 UTC (permalink / raw
  To: Nishanth Aravamudan
  Cc: Andrew Morton, Mel Gorman, agl, wli, clameter, ak,
	kamezawa.hiroyu, rientjes, linux-mm, eric.whitney

On Fri, 2008-03-07 at 09:35 -0800, Nishanth Aravamudan wrote:
> On 06.03.2008 [16:24:53 -0500], Lee Schermerhorn wrote:
> > 
> > Fix for earlier patch:
> > "mempolicy-make-dequeue_huge_page_vma-obey-bind-policy"
> > 
> > Against: 2.6.25-rc3-mm1 atop the above patch.
> > 
> > As suggested by Nish Aravamudan, remove the mpol_bind_nodemask()
> > helper and return a pointer to the policy node mask from
> > huge_zonelist for MPOL_BIND.  This hides more of the mempolicy
> > quirks from hugetlb.
> > 
> > In making this change, I noticed that the huge_zonelist() stub
> > for !NUMA wasn't nulling out the mpol.  Added that as well.
> 
> Hrm, I was thinking more of the following (on top of this patch):
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4c5d41d..3790f5a 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1298,9 +1298,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  
>  	*mpol = NULL;		/* probably no unref needed */
>  	*nodemask = NULL;	/* assume !MPOL_BIND */
> -	if (pol->policy == MPOL_BIND) {
> -			*nodemask = &pol->v.nodes;
> -	} else if (pol->policy == MPOL_INTERLEAVE) {
> +	if (pol->policy == MPOL_INTERLEAVE) {
>  		unsigned nid;
>  
>  		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> @@ -1310,10 +1308,12 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  
>  	zl = zonelist_policy(GFP_HIGHUSER, pol);
>  	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
> -		if (pol->policy != MPOL_BIND)
> +		if (pol->policy != MPOL_BIND) {
>  			__mpol_free(pol);	/* finished with pol */
> -		else
> +		} else {
>  			*mpol = pol;	/* unref needed after allocation */
> +			*nodemask = &pol->v.nodes;
> +		}
>  	}
>  	return zl;
>  }
> 
> but perhaps that won't do the right thing if pol == current->mempolicy
> and pol->policy == MPOL_BIND. 

Right, you won't return the nodemask when the current task policy == MPOL_BIND.

> So something like:
> 
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4c5d41d..7eb77e0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1298,9 +1298,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  
>  	*mpol = NULL;		/* probably no unref needed */
>  	*nodemask = NULL;	/* assume !MPOL_BIND */
> -	if (pol->policy == MPOL_BIND) {
> -			*nodemask = &pol->v.nodes;
> -	} else if (pol->policy == MPOL_INTERLEAVE) {
> +	if (pol->policy == MPOL_INTERLEAVE) {
>  		unsigned nid;
>  
>  		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> @@ -1309,11 +1307,12 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  
>  	zl = zonelist_policy(GFP_HIGHUSER, pol);
> -	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
> -		if (pol->policy != MPOL_BIND)
> -			__mpol_free(pol);	/* finished with pol */
> -		else
> +	if (unlikely(pol != &default_policy && pol != current->mempolicy
> +						&& pol->policy != MPOL_BIND))
> +		__mpol_free(pol);	/* finished with pol */
> +	if (pol->policy == MPOL_BIND) {
>  			*mpol = pol;	/* unref needed after allocation */
> +			*nodemask = &pol->v.nodes;
>  	}
>  	return zl;
>  }
> 
> Still not quite as clean, but I think it's best to keep the *mpol and
> *nodemask assignments together, as if *mpol is being assigned, that's
> the only time we should need to set *nodemask, right?

Well, as you've noted, we do have to test MPOL_BIND twice:  once to
return the nodemask for any 'BIND policy and once to return a non-NULL
mpol ONLY if it's MPOL_BIND and we need an unref.  However, I wanted to
avoid checking the policies twice as well, or storing *nodemask a third
time.

I think that your second change above is not quite right, either.
You're unconditionally returning the policy when the 'mode' == MPOL_BIND,
even if it does not need an unref.  This could result in prematurely
freeing the task policy, causing a "use after free" error on the next
allocation; or even decrementing the reference on the system default
policy, which is probably benign, but not "nice".  [Also, check your
parentheses...]
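
In other words, the shape I want to keep is the tail of huge_zonelist()
as posted, with *nodemask already set earlier for any 'BIND policy:

	zl = zonelist_policy(GFP_HIGHUSER, pol);
	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
		if (pol->policy != MPOL_BIND)
			__mpol_free(pol);	/* finished with pol */
		else
			*mpol = pol;	/* unref needed after allocation */
	}
	return zl;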

Any way you slice it, it's pretty ugly.

So, for now, I'd like to keep it the way I have it.  I'll be sending out
a set of patches to rework the reference counting after mempolicy
settles down--i.e., Mel's and David's patches, which I'm testing now.
That will clean this area up quite a bit, IMO.  

Lee



* Re: [PATCH] Mempolicy:  make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework
  2008-03-07 18:31                       ` Lee Schermerhorn
@ 2008-03-08  0:27                         ` Nishanth Aravamudan
  0 siblings, 0 replies; 37+ messages in thread
From: Nishanth Aravamudan @ 2008-03-08  0:27 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: Andrew Morton, Mel Gorman, agl, wli, clameter, ak,
	kamezawa.hiroyu, rientjes, linux-mm, eric.whitney

On 07.03.2008 [13:31:44 -0500], Lee Schermerhorn wrote:
> On Fri, 2008-03-07 at 09:35 -0800, Nishanth Aravamudan wrote:
> > On 06.03.2008 [16:24:53 -0500], Lee Schermerhorn wrote:
> > > 
> > > Fix for earlier patch:
> > > "mempolicy-make-dequeue_huge_page_vma-obey-bind-policy"
> > > 
> > > Against: 2.6.25-rc3-mm1 atop the above patch.
> > > 
> > > As suggested by Nish Aravamudan, remove the mpol_bind_nodemask()
> > > helper and return a pointer to the policy node mask from
> > > huge_zonelist for MPOL_BIND.  This hides more of the mempolicy
> > > quirks from hugetlb.
> > > 
> > > In making this change, I noticed that the huge_zonelist() stub
> > > for !NUMA wasn't nulling out the mpol.  Added that as well.
> > 
> > Hrm, I was thinking more of the following (on top of this patch):
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 4c5d41d..3790f5a 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1298,9 +1298,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> >  
> >  	*mpol = NULL;		/* probably no unref needed */
> >  	*nodemask = NULL;	/* assume !MPOL_BIND */
> > -	if (pol->policy == MPOL_BIND) {
> > -			*nodemask = &pol->v.nodes;
> > -	} else if (pol->policy == MPOL_INTERLEAVE) {
> > +	if (pol->policy == MPOL_INTERLEAVE) {
> >  		unsigned nid;
> >  
> >  		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> > @@ -1310,10 +1308,12 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> >  
> >  	zl = zonelist_policy(GFP_HIGHUSER, pol);
> >  	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
> > -		if (pol->policy != MPOL_BIND)
> > +		if (pol->policy != MPOL_BIND) {
> >  			__mpol_free(pol);	/* finished with pol */
> > -		else
> > +		} else {
> >  			*mpol = pol;	/* unref needed after allocation */
> > +			*nodemask = &pol->v.nodes;
> > +		}
> >  	}
> >  	return zl;
> >  }
> > 
> > but perhaps that won't do the right thing if pol == current->mempolicy
> > and pol->policy == MPOL_BIND. 
> 
> Right, you won't return the nodemask when the current task policy == MPOL_BIND.
> 
> > So something like:
> > 
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 4c5d41d..7eb77e0 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1298,9 +1298,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> >  
> >  	*mpol = NULL;		/* probably no unref needed */
> >  	*nodemask = NULL;	/* assume !MPOL_BIND */
> > -	if (pol->policy == MPOL_BIND) {
> > -			*nodemask = &pol->v.nodes;
> > -	} else if (pol->policy == MPOL_INTERLEAVE) {
> > +	if (pol->policy == MPOL_INTERLEAVE) {
> >  		unsigned nid;
> >  
> >  		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> > @@ -1309,11 +1307,12 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> >  	}
> >  
> >  	zl = zonelist_policy(GFP_HIGHUSER, pol);
> > -	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
> > -		if (pol->policy != MPOL_BIND)
> > -			__mpol_free(pol);	/* finished with pol */
> > -		else
> > +	if (unlikely(pol != &default_policy && pol != current->mempolicy
> > +						&& pol->policy != MPOL_BIND))
> > +		__mpol_free(pol);	/* finished with pol */
> > +	if (pol->policy == MPOL_BIND) {
> >  			*mpol = pol;	/* unref needed after allocation */
> > +			*nodemask = &pol->v.nodes;
> >  	}
> >  	return zl;
> >  }
> > 
> > Still not quite as clean, but I think it's best to keep the *mpol and
> > *nodemask assignments together, as if *mpol is being assigned, that's
> > the only time we should need to set *nodemask, right?
> 
> Well, as you've noted, we do have to test MPOL_BIND twice:  once to
> return the nodemask for any 'BIND policy and once to return a non-NULL
> mpol ONLY if it's MPOL_BIND and we need an unref.  However, I wanted to
> avoid checking the policies twice as well, or storing *nodemask a third
> time.
> 
> I think that your second change above is not quite right, either.
> You're unconditionally returning the policy when the 'mode' == MPOL_BIND,
> even if it does not need an unref.  This could result in prematurely
> freeing the task policy, causing a "use after free" error on the next
> allocation; or even decrementing the reference on the system default
> policy, which is probably benign, but not "nice".  [Also, check your
> parentheses...]
> 
> Any way you slice it, it's pretty ugly.

Yep. You understand all this mempolicy code much better than I :) I was
just trying to explain my aesthetic goal :)

> So, for now, I'd like to keep it the way I have it.  I'll be sending
> out a set of patches to rework the reference counting after mempolicy
> settles down--i.e., Mel's and David's patches, which I'm testing now.
> That will clean this area up quite a bit, IMO.  

Sounds good. Looking forward to it.

-Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center



Thread overview: 37+ messages
2008-02-27 21:47 [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn, Mel Gorman
2008-02-27 21:47 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Lee Schermerhorn, Mel Gorman
2008-02-27 21:47 ` [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask Lee Schermerhorn, Mel Gorman
2008-02-27 21:47 ` [PATCH 3/6] Remember what the preferred zone is for zone_statistics Lee Schermerhorn, Mel Gorman
2008-02-27 22:00   ` Christoph Lameter
2008-02-28 17:45     ` Lee Schermerhorn
2008-02-29 14:19     ` Mel Gorman
2008-02-29  2:30   ` KAMEZAWA Hiroyuki
2008-02-29 14:32     ` Mel Gorman
2008-02-27 21:47 ` [PATCH 4/6] Use two zonelist that are filtered by GFP mask Lee Schermerhorn, Mel Gorman
2008-02-28 21:32   ` Andrew Morton
2008-02-28 21:53     ` Lee Schermerhorn
2008-02-29  2:37       ` KAMEZAWA Hiroyuki
2008-02-29 14:50     ` Mel Gorman
2008-02-29 15:48       ` Lee Schermerhorn
2008-02-29 21:07         ` Christoph Lameter
2008-03-04 18:01         ` Mel Gorman
2008-03-05 16:06           ` [PATCH] 2.6.25-rc3-mm1 - Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask Lee Schermerhorn
2008-03-05 18:03             ` Nishanth Aravamudan
2008-03-05 19:02               ` Lee Schermerhorn
2008-03-06  1:04                 ` Nishanth Aravamudan
2008-03-06 15:38                   ` Lee Schermerhorn
2008-03-06 21:24                   ` [PATCH] Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework Lee Schermerhorn
2008-03-07 17:35                     ` Nishanth Aravamudan
2008-03-07 18:31                       ` Lee Schermerhorn
2008-03-08  0:27                         ` Nishanth Aravamudan
2008-03-06  0:39             ` [PATCH] 2.6.25-rc3-mm1 - Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask Andrew Morton
2008-03-06 15:17               ` Lee Schermerhorn
2008-03-06 18:41     ` [PATCH 4/6] Use two zonelist that are filtered by GFP mask Mel Gorman
2008-02-27 21:47 ` [PATCH 5/6] Have zonelist contains structs with both a zone pointer and zone_idx Lee Schermerhorn, Mel Gorman
2008-02-29  7:49   ` KOSAKI Motohiro
2008-02-27 21:47 ` [PATCH 6/6] Filter based on a nodemask as well as a gfp_mask Lee Schermerhorn, Mel Gorman
2008-02-29  2:59   ` KAMEZAWA Hiroyuki
2008-03-07 11:56     ` Mel Gorman
2008-02-29  8:48   ` KOSAKI Motohiro
2008-02-27 21:53 ` [PATCH 0/6] Use two zonelists per node instead of multiple zonelists v11r3 Lee Schermerhorn
2008-02-29 14:12 ` Mel Gorman
