* [PATCH/RFC 0/14] Shared Policy Overview
@ 2010-11-11 19:11 Lee Schermerhorn
  2010-11-11 19:11 ` [PATCH/RFC 1/14] Shared Policy: Miscellaneous Cleanup Lee Schermerhorn
                   ` (14 more replies)
  0 siblings, 15 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:11 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

[RFC] Shared Policy Cleanup, Fixes and Mapped File Policy

At the Linux Plumber's conference, Andi Kleen encouraged me again
to resubmit my automatic page migration patches because he thinks
they will be useful for virtualization.  Later, in the Virtualization
mini-conf, the subject came up during a presentation about adding
NUMA awareness to qemu/kvm.  After the presentation, I discussed
these series with Andrea Arcangeli and he also encouraged me to
post them.  My position within HP has changed such that I'm not
sure how much time I'll have to spend on this area nor whether I'll
have access to the larger NUMA platforms on which to test the
patches thoroughly.  However, here is the first of 4 series that
comprise my shared policy enhancements and lazy/auto-migration
enhancement.

I have rebased the patches against a recent mmotm tree.  This
rebase built cleanly, booted and passed a few ad hoc tests on
x86_64.  I've made a pass over the patch descriptions to update
them.  If there is sufficient interest in merging this, I'll
do what I can to assist in the completion and testing of the
series.

To follow:

2)  Migrate-on-fault a.k.a Lazy Migration facility
3)  Auto [as in "self"] migration facility
4)  a Migration Cache -- originally written by Marcelo Tosatti

I'll announce this series and the automatic/lazy migration series
to follow on lkml, linux-mm, ...  However, I'll limit the actual
posting to linux-numa to avoid spamming the other lists.

I'm posting this shared policy series before the lazy and automatic
migration series because the latter series are based atop this one
in my tree.  I want to send out a buildable, runnable set of patches
and don't have the time to rebase the later series right now.  And,
although the later series do not functionally depend on this series,
I think it provides some reasonable cleanup and finishes some, IMO,
missing shared policy capabilities.  But, ultimately it's not required
for the auto/lazy-migration feature.

---

This series:

1) cleans up some shared policy variable naming,

2) reworks the shared policy internal interfaces to move them from
   a "vma, address" orientation to a "policy, offset" one, where the
   offset is used to select the node for an interleaved policy [a
   rough sketch appears after this list].  This rework allows us to
   remove the last usage of a "pseudo-vma" to look up or set mempolicy
   on a shared memory object.

3) reworks vma policy handling so that vmas are not split for
   shared memory policy because the shared policy infrastructure
   handles subrange mempolicies for the object.  Thus there is
   no need to split vmas for different mempolicies on different
   ranges of the shared object.

4) Fixes a long standing issue with /proc/<pid>/numa_maps where
   one can see different mempolicies on shared memory objects
   depending on which task applied the policies.  This occurs
   because Linux splits the vmas only for the task applying the
   shared policy.  With this series, numa_maps looks into a
   shared object mapped by a single vma and reports any mempolicies
   on subranges of the object.

5) Hooks up hugetlbfs-backed shmem regions [SHM_HUGE] to the
   shared mempolicy mechanism.  The hugetlbfs inode has contained
   the shared policy structure for quite some time, but the
   vm ops were never hooked up.  Optional behavior; default is
   current behavior -- no shared policy on SHM_HUGE regions.

6) Enhances shared, mmap()'ed regular files to support shared
   policy on the resulting memory object as long as it remains
   mmap()'ed.  Optional behavior;  default is current behavior
   -- no shared policy on shared, mmap()ed files [except tmpfs].

7) Builds cleanly [my .config, anyway] and passes ad hoc memtoy
   tests on x86_64.  Has been built/tested on ia64 in the past
   but not recently.
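
To illustrate the "policy, offset" orientation of 2), the interleave
node for a shared object can be selected purely from the page offset
into the object, roughly like this [a sketch for illustration only --
not code from the series; names and fields are approximate]:

	/*
	 * Illustrative sketch only:  pick an interleave node from a
	 * shared object's mempolicy using the page offset into the
	 * object, with no (vma, address) pair required.
	 */
	static unsigned int interleave_node_for_offset(struct mempolicy *pol,
							pgoff_t pgoff)
	{
		unsigned int nnodes = nodes_weight(pol->v.nodes);
		unsigned int target, nid;

		if (!nnodes)
			return numa_node_id();	/* fall back to local node */

		/* the (pgoff mod nnodes)'th node of the interleave mask */
		target = (unsigned int)pgoff % nnodes;
		nid = first_node(pol->v.nodes);
		while (target--)
			nid = next_node(nid, pol->v.nodes);
		return nid;
	}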

---

Lee Schermerhorn


* [PATCH/RFC 1/14] Shared Policy: Miscellaneous Cleanup
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
@ 2010-11-11 19:11 ` Lee Schermerhorn
  2010-11-11 19:12 ` [PATCH/RFC 2/14] Shared Policy: move shared policy to inode/mapping Lee Schermerhorn
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:11 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy - Miscellaneous shared policy cleanup

Some miscellaneous cleanup to use "sp" for shared policy in routines
that take one as an arg.  Prior use of "info" seemed misleading, as
it also refers to the shmem_inode_info.  And use of "p" seemed just
plain inconsistent.

Additional cleanup/reorg of the numa_memory_policy.txt doc.

This patch is in preparation for additional shared policy rework.
I wanted to break the minor "cleanup" changes out into a separate
patch.
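
For context, the doc rework below consolidates the description of the
optional mode flags and of the MPOL_F_MEMS_ALLOWED query into the
MEMORY POLICIES AND CPUSETS section.  A minimal userspace sketch of
how an application might use them [illustrative only; assumes
libnuma's <numaif.h> wrappers for get_mempolicy(2)/set_mempolicy(2)]:

	/*
	 * Illustrative only:  query the cpuset-allowed nodes, then
	 * install an interleave task policy whose nodemask is
	 * interpreted relative to the allowed set
	 * (MPOL_F_RELATIVE_NODES).
	 */
	#include <numaif.h>
	#include <stdio.h>

	#define MAX_NODES	1024
	#define NODE_LONGS	(MAX_NODES / (8 * sizeof(unsigned long)))

	int main(void)
	{
		unsigned long allowed[NODE_LONGS] = { 0 };
		unsigned long relmask[NODE_LONGS] = { 0 };

		/* which memory nodes does our cpuset currently allow? */
		if (get_mempolicy(NULL, allowed, MAX_NODES, NULL,
				  MPOL_F_MEMS_ALLOWED))
			perror("get_mempolicy(MPOL_F_MEMS_ALLOWED)");

		/* interleave over the 1st and 2nd *allowed* nodes */
		relmask[0] = 0x3;
		if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES,
				  relmask, MAX_NODES))
			perror("set_mempolicy");

		return 0;
	}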

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/numa_memory_policy.txt |  244 +++++++++++++++++---------------
 include/linux/mempolicy.h               |   10 -
 mm/mempolicy.c                          |   24 +--
 3 files changed, 156 insertions(+), 122 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -2185,7 +2185,7 @@ put_mpol:
 	}
 }
 
-int mpol_set_shared_policy(struct shared_policy *info,
+int mpol_set_shared_policy(struct shared_policy *sp,
 			struct vm_area_struct *vma, struct mempolicy *npol)
 {
 	int err;
@@ -2203,30 +2203,36 @@ int mpol_set_shared_policy(struct shared
 		if (!new)
 			return -ENOMEM;
 	}
-	err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+	err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff+sz, new);
 	if (err && new)
 		kmem_cache_free(sn_cache, new);
 	return err;
 }
 
-/* Free a backing policy store on inode delete. */
-void mpol_free_shared_policy(struct shared_policy *p)
+/**
+ * mpol_free_shared_policy() - Free a backing policy store on inode delete.
+ * @sp - shared policy structure to free
+ *
+ * Frees the shared policy red-black tree, if any, before freeing the
+ * shared policy struct itself.
+ */
+void mpol_free_shared_policy(struct shared_policy *sp)
 {
 	struct sp_node *n;
 	struct rb_node *next;
 
-	if (!p->root.rb_node)
+	if (!sp->root.rb_node)
 		return;
-	spin_lock(&p->lock);
-	next = rb_first(&p->root);
+	spin_lock(&sp->lock);
+	next = rb_first(&sp->root);
 	while (next) {
 		n = rb_entry(next, struct sp_node, nd);
 		next = rb_next(&n->nd);
-		rb_erase(&n->nd, &p->root);
+		rb_erase(&n->nd, &sp->root);
 		mpol_put(n->policy);
 		kmem_cache_free(sn_cache, n);
 	}
-	spin_unlock(&p->lock);
+	spin_unlock(&sp->lock);
 }
 
 /* assumes fs == KERNEL_DS */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -81,7 +81,7 @@ struct mm_struct;
  * the process policy is used. Interrupts ignore the memory policy
  * of the current process.
  *
- * Locking policy for interlave:
+ * Locking policy for interleave:
  * In process context there is no locking because only the process accesses
  * its own state. All vma manipulation is somewhat protected by a down_read on
  * mmap_sem.
@@ -192,10 +192,10 @@ struct shared_policy {
 };
 
 void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
-int mpol_set_shared_policy(struct shared_policy *info,
+int mpol_set_shared_policy(struct shared_policy *sp,
 				struct vm_area_struct *vma,
 				struct mempolicy *new);
-void mpol_free_shared_policy(struct shared_policy *p);
+void mpol_free_shared_policy(struct shared_policy *sp);
 struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 					    unsigned long idx);
 
@@ -284,7 +284,7 @@ static inline struct mempolicy *mpol_dup
 
 struct shared_policy {};
 
-static inline int mpol_set_shared_policy(struct shared_policy *info,
+static inline int mpol_set_shared_policy(struct shared_policy *sp,
 					struct vm_area_struct *vma,
 					struct mempolicy *new)
 {
@@ -296,7 +296,7 @@ static inline void mpol_shared_policy_in
 {
 }
 
-static inline void mpol_free_shared_policy(struct shared_policy *p)
+static inline void mpol_free_shared_policy(struct shared_policy *sp)
 {
 }
 
Index: linux-2.6.36-mmotm-101103-1217/Documentation/vm/numa_memory_policy.txt
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/Documentation/vm/numa_memory_policy.txt
+++ linux-2.6.36-mmotm-101103-1217/Documentation/vm/numa_memory_policy.txt
@@ -12,9 +12,10 @@ Memory policies should not be confused w
 (Documentation/cgroups/cpusets.txt)
 which is an administrative mechanism for restricting the nodes from which
 memory may be allocated by a set of processes. Memory policies are a
-programming interface that a NUMA-aware application can take advantage of.  When
-both cpusets and policies are applied to a task, the restrictions of the cpuset
-takes priority.  See "MEMORY POLICIES AND CPUSETS" below for more details.
+programming interface that a NUMA-aware application can take advantage of.
+When both cpusets and policies are applied to a task, the restrictions of the
+cpuset takes priority.  See "MEMORY POLICIES AND CPUSETS" below for more
+details.
 
 MEMORY POLICY CONCEPTS
 
@@ -56,7 +57,10 @@ most general to most specific:
 	A task policy applies only to pages allocated after the policy is
 	installed.  Any pages already faulted in by the task when the task
 	changes its task policy remain where they were allocated based on
-	the policy at the time they were allocated.
+	the policy at the time they were allocated.  The Memory Policy API
+	defines a flag to request that new pages be allocated to obey a newly
+	installed memory policy, and that the contents and state of the
+	original pages be migrated to the new pages.
 
     VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
     virtual address space.  A task may define a specific policy for a range
@@ -109,7 +113,7 @@ most general to most specific:
     object share the policy, and all pages allocated for the shared object,
     by any task, will obey the shared policy.
 
-	As of 2.6.22, only shared memory segments, created by shmget() or
+	As of 2.6.28, only shared memory segments, created by shmget() or
 	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
 	policy support was added to Linux, the associated data structures were
 	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
@@ -128,25 +132,35 @@ most general to most specific:
 	The shared policy infrastructure supports different policies on subset
 	ranges of the shared object.  However, Linux still splits the VMA of
 	the task that installs the policy for each range of distinct policy.
-	Thus, different tasks that attach to a shared memory segment can have
+	Thus, different tasks that attach to a shared memory object can have
 	different VMA configurations mapping that one shared object.  This
 	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
-	a shared memory region, when one task has installed shared policy on
-	one or more ranges of the region.
+	a shared memory region.  When one task has installed shared policy on
+	one or more ranges of the region, the numa_maps of that task will
+	show different policies than the numa_maps of other tasks mapping the
+	shared object.  However, the installed shared policy with be used for
+	all pages allocated for the shared object by any of the attached tasks.
+
+	When installing shared policy on a shared object, the virtual address
+	range specified can be viewed as a "direct mapped", linear window onto
+	the underlying object.  As a result, attempting to install a shared
+	memory policy on a non-linear, shared mapping WILL [probably] install
+	the policy for some range of the object, but this range will not
+	necessarily correspond to the actual pages mapped non-linearly into the
+	virtual address range.  Thus, applying a shared policy to a non-linear
+	mapping can be considered an undefined operation.
 
 Components of Memory Policies
 
     A Linux memory policy consists of a "mode", optional mode flags, and an
     optional set of nodes.  The mode determines the behavior of the policy,
-    the optional mode flags determine the behavior of the mode, and the
-    optional set of nodes can be viewed as the arguments to the policy
-    behavior.
-
-   Internally, memory policies are implemented by a reference counted
-   structure, struct mempolicy.  Details of this structure will be discussed
-   in context, below, as required to explain the behavior.
+    the optional mode flags determine the behavior of the mode, and the optional
+    set of nodes can be viewed as the arguments to the policy behavior.
 
-   Linux memory policy supports the following 4 behavioral modes:
+    Internally, memory policies are implemented by a reference counted
+    structure, struct mempolicy.  Details of this structure will be discussed
+    in context, below, as required to explain the behavior.
+
+    Linux memory policy supports the following 4 behavioral modes:
 
 	Default Mode--MPOL_DEFAULT:  This mode is only used in the memory
 	policy APIs.  Internally, MPOL_DEFAULT is converted to the NULL
@@ -174,7 +188,6 @@ Components of Memory Policies
 	allocation fails, the kernel will search other nodes, in order of
 	increasing distance from the preferred node based on information
 	provided by the platform firmware.
-	containing the cpu where the allocation takes place.
 
 	    Internally, the Preferred policy uses a single node--the
 	    preferred_node member of struct mempolicy.  When the internal
@@ -185,9 +198,11 @@ Components of Memory Policies
 
 	    It is possible for the user to specify that local allocation is
 	    always preferred by passing an empty nodemask with this mode.
+	    Note that this is the only way to specify local allocation for
+	    a VMA or Shared policy when the task policy is non-default.
 	    If an empty nodemask is passed, the policy cannot use the
 	    MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described
-	    below.
+	    in the MEMORY POLICIES AND CPUSETS section below.
 
 	MPOL_INTERLEAVED:  This mode specifies that page allocations be
 	interleaved, on a page granularity, across the nodes specified in
@@ -211,87 +226,13 @@ Components of Memory Policies
 	    on the order in which they are allocated, rather than based on any
 	    page offset into an address range or file.  During system boot up,
 	    the temporary interleaved system default policy works in this
-	    mode.
+	    mode to distribute boot-time allocations around the nodes with
+	    memory.
 
-   Linux memory policy supports the following optional mode flags:
+    Linux memory policy supports optional "mode flags" for controlling the
+    interaction of memory policies with cpuset resource constraints.  The flags
+    are described in the MEMORY POLICIES AND CPUSETS section below.
 
-	MPOL_F_STATIC_NODES:  This flag specifies that the nodemask passed by
-	the user should not be remapped if the task or VMA's set of allowed
-	nodes changes after the memory policy has been defined.
-
-	    Without this flag, anytime a mempolicy is rebound because of a
-	    change in the set of allowed nodes, the node (Preferred) or
-	    nodemask (Bind, Interleave) is remapped to the new set of
-	    allowed nodes.  This may result in nodes being used that were
-	    previously undesired.
-
-	    With this flag, if the user-specified nodes overlap with the
-	    nodes allowed by the task's cpuset, then the memory policy is
-	    applied to their intersection.  If the two sets of nodes do not
-	    overlap, the Default policy is used.
-
-	    For example, consider a task that is attached to a cpuset with
-	    mems 1-3 that sets an Interleave policy over the same set.  If
-	    the cpuset's mems change to 3-5, the Interleave will now occur
-	    over nodes 3, 4, and 5.  With this flag, however, since only node
-	    3 is allowed from the user's nodemask, the "interleave" only
-	    occurs over that node.  If no nodes from the user's nodemask are
-	    now allowed, the Default behavior is used.
-
-	    MPOL_F_STATIC_NODES cannot be combined with the
-	    MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
-	    MPOL_PREFERRED policies that were created with an empty nodemask
-	    (local allocation).
-
-	MPOL_F_RELATIVE_NODES:  This flag specifies that the nodemask passed
-	by the user will be mapped relative to the set of the task or VMA's
-	set of allowed nodes.  The kernel stores the user-passed nodemask,
-	and if the allowed nodes changes, then that original nodemask will
-	be remapped relative to the new set of allowed nodes.
-
-	    Without this flag (and without MPOL_F_STATIC_NODES), anytime a
-	    mempolicy is rebound because of a change in the set of allowed
-	    nodes, the node (Preferred) or nodemask (Bind, Interleave) is
-	    remapped to the new set of allowed nodes.  That remap may not
-	    preserve the relative nature of the user's passed nodemask to its
-	    set of allowed nodes upon successive rebinds: a nodemask of
-	    1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
-	    allowed nodes is restored to its original state.
-
-	    With this flag, the remap is done so that the node numbers from
-	    the user's passed nodemask are relative to the set of allowed
-	    nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
-	    nodemask, the policy will be effected over the first (and in the
-	    Bind or Interleave case, the third and fifth) nodes in the set of
-	    allowed nodes.  The nodemask passed by the user represents nodes
-	    relative to task or VMA's set of allowed nodes.
-
-	    If the user's nodemask includes nodes that are outside the range
-	    of the new set of allowed nodes (for example, node 5 is set in
-	    the user's nodemask when the set of allowed nodes is only 0-3),
-	    then the remap wraps around to the beginning of the nodemask and,
-	    if not already set, sets the node in the mempolicy nodemask.
-
-	    For example, consider a task that is attached to a cpuset with
-	    mems 2-5 that sets an Interleave policy over the same set with
-	    MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
-	    interleave now occurs over nodes 3,5-6.  If the cpuset's mems
-	    then change to 0,2-3,5, then the interleave occurs over nodes
-	    0,3,5.
-
-	    Thanks to the consistent remapping, applications preparing
-	    nodemasks to specify memory policies using this flag should
-	    disregard their current, actual cpuset imposed memory placement
-	    and prepare the nodemask as if they were always located on
-	    memory nodes 0 to N-1, where N is the number of memory nodes the
-	    policy is intended to manage.  Let the kernel then remap to the
-	    set of memory nodes allowed by the task's cpuset, as that may
-	    change over time.
-
-	    MPOL_F_RELATIVE_NODES cannot be combined with the
-	    MPOL_F_STATIC_NODES flag.  It also cannot be used for
-	    MPOL_PREFERRED policies that were created with an empty nodemask
-	    (local allocation).
 
 MEMORY POLICY REFERENCE COUNTING
 
@@ -435,19 +376,106 @@ MEMORY POLICIES AND CPUSETS
 Memory policies work within cpusets as described above.  For memory policies
 that require a node or set of nodes, the nodes are restricted to the set of
 nodes whose memories are allowed by the cpuset constraints.  If the nodemask
-specified for the policy contains nodes that are not allowed by the cpuset and
-MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
-specified for the policy and the set of nodes with memory is used.  If the
-result is the empty set, the policy is considered invalid and cannot be
-installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
-onto and folded into the task's set of allowed nodes as previously described.
+specified for the policy contains nodes that are not allowed by the cpuset [and
+MPOL_F_RELATIVE_NODES is not used--see below], the intersection of the set of
+nodes specified for the policy and the set of nodes with memory is used.  If
+the result is the empty set [and MPOL_F_STATIC_NODES is not used--see below],
+the policy is considered invalid and cannot be installed.
 
 The interaction of memory policies and cpusets can be problematic when tasks
 in two cpusets share access to a memory region, such as shared memory segments
 created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
 any of the tasks install shared policy on the region, only nodes whose
-memories are allowed in both cpusets may be used in the policies.  Obtaining
-this information requires "stepping outside" the memory policy APIs to use the
-cpuset information and requires that one know in what cpusets other task might
-be attaching to the shared region.  Furthermore, if the cpusets' allowed
-memory sets are disjoint, "local" allocation is the only valid policy.
+memories are allowed in both cpusets may be used in the policies.  Since
+2.6.26, applications can determine the allowed memories using the
+get_mempolicy() API with the MPOL_F_MEMS_ALLOWED flag.  However, one still
+can't easily determine in what cpusets other tasks might be attaching to the
+shared region.  Furthermore, if the cpusets' allowed memory sets are disjoint,
+"local" allocation is the only valid policy.
+
+To address some of the issues with the interaction of memory policies with
+cpusets, Linux supports two optional "mode flags".  These flags modify the
+interpretation of the set of nodes associated with a memory policy when:
+
+    1) the cpuset does not allow all of the nodes specified in the policy,
+    2) the cpuset allowed nodes changes,
+    3) the task is moved to a cpuset with a different set of allowed nodes.
+
+
+    MPOL_F_STATIC_NODES:  This flag specifies that the nodemask passed by
+    the user should not be remapped if the set of nodes allowed by the
+    task's cpuset changes after the memory policy has been defined.
+
+	Without this flag, anytime a mempolicy is rebound because of a
+	change in the set of allowed nodes, the node (Preferred) or
+	nodemask (Bind, Interleave) is remapped to the new set of
+	allowed nodes.  This may result in nodes being used that were
+	previously undesired.
+
+	With this flag, if the user-specified nodes overlap with the
+	nodes allowed by the task's cpuset, then the memory policy is
+	applied to their intersection.  If the two sets of nodes do not
+	overlap, the Default policy is used.
+
+	For example, consider a task that is attached to a cpuset with
+	mems 1-3 that sets an Interleave policy over the same set.  If
+	the cpuset's mems change to 3-5, without this flag, the allocations
+	will now be interleaved over nodes 3, 4, and 5.  With this flag,
+	however, since only node 3 is allowed from the user's nodemask, the
+	pages will only be allocated from node 3.  If no nodes from the
+	user's nodemask are now allowed, the Default behavior is used.
+
+	MPOL_F_STATIC_NODES cannot be combined with the
+	MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
+	MPOL_PREFERRED policies that were created with an empty nodemask
+	(local allocation).
+
+    MPOL_F_RELATIVE_NODES:  This flag specifies that the nodemask passed
+    by the user should be mapped relative to the  set of nodes allowed by
+    the task's cpuset.  The kernel stores the user-passed nodemask, and
+    if the allowed nodes changes, then that original nodemask will be
+    remapped relative to the new set of allowed nodes.
+
+	Without this flag (and without MPOL_F_STATIC_NODES), anytime a
+	mempolicy is rebound because of a change in the set of allowed
+	nodes, the node (Preferred) or nodemask (Bind, Interleave) is
+	remapped to the new set of allowed nodes.  That remap may not
+	preserve the relative nature of the user's passed nodemask to its
+	set of allowed nodes upon successive rebinds: a nodemask of
+	1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
+	allowed nodes is restored to its original state.
+
+	With this flag, the remap is done so that the node numbers from
+	the user's passed nodemask are relative to the set of allowed
+	nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
+	nodemask, the policy will be effected over the first (and in the
+	Bind or Interleave case, the third and fifth) nodes in the set of
+	allowed nodes.  The nodemask passed by the user represents nodes
+	relative to task or VMA's set of allowed nodes.
+
+	If the user's nodemask includes nodes that are outside the range
+	of the new set of allowed nodes (for example, node 5 is set in
+	the user's nodemask when the set of allowed nodes is only 0-3),
+	then the remap wraps around to the beginning of the nodemask and,
+	if not already set, sets the node in the mempolicy nodemask.
+
+	For example, consider a task that is attached to a cpuset with
+	mems 2-5 that sets an Interleave policy over the same set with
+	MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
+	interleave now occurs over nodes 3,5-6.  If the cpuset's mems
+	then change to 0,2-3,5, then the interleave occurs over nodes
+	0,3,5.
+
+	Thanks to the consistent remapping, applications preparing
+	nodemasks to specify memory policies using this flag should
+	disregard their current, actual cpuset imposed memory placement
+	and prepare the nodemask as if they were always located on
+	memory nodes 0 to N-1, where N is the number of memory nodes the
+	policy is intended to manage.  Let the kernel then remap to the
+	set of memory nodes allowed by the task's cpuset, as that may
+	change over time.
+
+	MPOL_F_RELATIVE_NODES cannot be combined with the
+	MPOL_F_STATIC_NODES flag.  It also cannot be used for
+	MPOL_PREFERRED policies that were created with an empty nodemask
+	(local allocation).


* [PATCH/RFC 2/14] Shared Policy: move shared policy to inode/mapping
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
  2010-11-11 19:11 ` [PATCH/RFC 1/14] Shared Policy: Miscellaneous Cleanup Lee Schermerhorn
@ 2010-11-11 19:12 ` Lee Schermerhorn
  2010-11-11 19:12 ` [PATCH/RFC 3/14] Shared Policy: allocate shared policies as needed Lee Schermerhorn
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:12 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - move shared policy to inode/mapping

This patch starts the process of cleaning up the shmem shared
[mem]policy infrastructure:
+ to eliminate the use of on-stack pseudo-vmas for shmem allocations,
  which will simplify several shmem internal APIs and functions;
+ for use with hugetlb shmem segments; and
+ eventually, I hope, for use with generic mmap()ed files.

In this patch, the shared policy struct in the shmem and hugetlbfs
extended inodes is moved to the generic address space struct, where
it will be available to any file type, and the existing code is
fixed up to accommodate this change.

Details:

1) create a shared_policy.h header and move the shared policy
   support from mempolicy.h to shared_policy.h.

2) add a struct shared_policy pointer to struct address_space.
   This effectively adds it to each inode, in i_data.  get_policy
   vma ops will locate it via vma->vm_file->f_mapping->spolicy
   [see the sketch after this list].  Modify [temporarily]
   mpol_shared_policy_init() to initialize via a shared policy
   pointer arg.

	A subsequent patch will replace this with a pointer
	to a dynamically allocated mempolicy and will make
	the pointer dependent on CONFIG_NUMA.  Then, all
	accesses to spolicy will also be made dependent on
	CONFIG_NUMA via wrappers.

3) modify mpol_shared_policy_lookup() to return NULL if the
   spolicy pointer is NULL.  get_vma_policy() will then
   substitute the process policy, if any, else the default
   policy.

4) modify shmem, the only existing user of shared policy
   infrastructure, to work with changes above.  At this
   point, just use the shared_policy embedded in the shmem
   inode info struct.  A later patch will dynamically
   allocate the struct when needed.

   Actually, hugetlbfs inodes also contain a shared policy, but
   the vma's get|set_policy ops are not hooked up.  This patch
   modifies hugetlbfs_get_inode() to initialize the shared
   policy struct embedded in its info struct via the i_mapping's
   spolicy pointer.  A later patch will "hook up" hugetlb
   mappings to the get|set_policy ops.
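
To illustrate 2) and 3) above [a sketch for illustration, not the
patch code itself]:  with the shared policy reachable through the
file's address_space, a file's get_policy vma op reduces to roughly
the following; shmem's version in a later patch in this series looks
much like this:

	/*
	 * Simplified sketch of a get_policy vma op after this change:
	 * find the shared policy through the file's address_space and
	 * look up the mempolicy by page offset into the object.
	 * Returning NULL lets get_vma_policy() fall back to the task
	 * policy, if any, else the system default policy.
	 */
	static struct mempolicy *file_get_policy(struct vm_area_struct *vma,
						 unsigned long addr)
	{
		struct shared_policy *sp = vma->vm_file->f_mapping->spolicy;
		unsigned long idx;

		if (!sp)
			return NULL;	/* == default policy */

		idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
		return mpol_shared_policy_lookup(sp, idx);
	}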


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/hugetlbfs/inode.c          |    3 +
 include/linux/fs.h            |    2 +
 include/linux/mempolicy.h     |   53 ---------------------------------
 include/linux/shared_policy.h |   66 ++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c                |    2 -
 mm/shmem.c                    |   37 ++++++++++++-----------
 6 files changed, 92 insertions(+), 71 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/fs.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/fs.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/fs.h
@@ -646,6 +646,8 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+
+	struct shared_policy	*spolicy;
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -64,10 +64,9 @@ enum mpol_rebind_step {
 
 #include <linux/mmzone.h>
 #include <linux/slab.h>
-#include <linux/rbtree.h>
-#include <linux/spinlock.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
+#include <linux/shared_policy.h>
 
 struct mm_struct;
 
@@ -172,32 +171,6 @@ static inline int mpol_equal(struct memp
 	return __mpol_equal(a, b);
 }
 
-/*
- * Tree of shared policies for a shared memory region.
- * Maintain the policies in a pseudo mm that contains vmas. The vmas
- * carry the policy. As a special twist the pseudo mm is indexed in pages, not
- * bytes, so that we can work with shared memory segments bigger than
- * unsigned long.
- */
-
-struct sp_node {
-	struct rb_node nd;
-	unsigned long start, end;
-	struct mempolicy *policy;
-};
-
-struct shared_policy {
-	struct rb_root root;
-	spinlock_t lock;
-};
-
-void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
-int mpol_set_shared_policy(struct shared_policy *sp,
-				struct vm_area_struct *vma,
-				struct mempolicy *new);
-void mpol_free_shared_policy(struct shared_policy *sp);
-struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
-					    unsigned long idx);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
@@ -281,30 +254,6 @@ static inline struct mempolicy *mpol_dup
 {
 	return NULL;
 }
-
-struct shared_policy {};
-
-static inline int mpol_set_shared_policy(struct shared_policy *sp,
-					struct vm_area_struct *vma,
-					struct mempolicy *new)
-{
-	return -EINVAL;
-}
-
-static inline void mpol_shared_policy_init(struct shared_policy *sp,
-						struct mempolicy *mpol)
-{
-}
-
-static inline void mpol_free_shared_policy(struct shared_policy *sp)
-{
-}
-
-static inline struct mempolicy *
-mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
-{
-	return NULL;
-}
 
 #define vma_policy(vma) NULL
 #define vma_set_policy(vma, pol) do {} while(0)
Index: linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
===================================================================
--- /dev/null
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
@@ -0,0 +1,66 @@
+#ifndef _LINUX_SHARED_POLICY_H
+#define _LINUX_SHARED_POLICY_H 1
+
+#include <linux/spinlock.h>
+#include <linux/rbtree.h>
+
+/*
+ * Tree of shared policies for a shared memory regions and memory
+ * mapped files.
+TODO:  wean the low level shared policies from the notion of vmas.
+       just use inode, offset, length
+ * Maintain the policies in a pseudo mm that contains vmas. The vmas
+ * carry the policy. As a special twist the pseudo mm is indexed in pages, not
+ * bytes, so that we can work with shared memory segments bigger than
+ * unsigned long.
+ */
+
+#ifdef CONFIG_NUMA
+
+struct sp_node {
+	struct rb_node nd;
+	unsigned long start, end;
+	struct mempolicy *policy;
+};
+
+struct shared_policy {
+	struct rb_root root;
+	spinlock_t lock;	/* protects rb tree */
+};
+
+void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
+int mpol_set_shared_policy(struct shared_policy *,
+				struct vm_area_struct *,
+				struct mempolicy *);
+void mpol_free_shared_policy(struct shared_policy *);
+struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
+					    unsigned long);
+
+#else /* !NUMA */
+
+struct shared_policy {};
+
+static inline int mpol_set_shared_policy(struct shared_policy *info,
+					struct vm_area_struct *vma,
+					struct mempolicy *new)
+{
+	return -EINVAL;
+}
+
+static inline void mpol_shared_policy_init(struct shared_policy *sp,
+						struct mempolicy *mpol)
+{
+}
+
+static inline void mpol_free_shared_policy(struct shared_policy *p)
+{
+}
+
+static inline struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+	return NULL;
+}
+#endif
+
+#endif /* _LINUX_SHARED_POLICY_H */
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -2053,7 +2053,7 @@ mpol_shared_policy_lookup(struct shared_
 	struct mempolicy *pol = NULL;
 	struct sp_node *sn;
 
-	if (!sp->root.rb_node)
+	if (!sp || !sp->root.rb_node)
 		return NULL;
 	spin_lock(&sp->lock);
 	sn = sp_lookup(sp, idx, idx+1);
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -1146,15 +1146,14 @@ static struct mempolicy *shmem_get_sbmpo
 }
 #endif /* CONFIG_TMPFS */
 
-static struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
+struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
+				struct shared_policy *sp, unsigned long idx)
 {
 	struct mempolicy mpol, *spol;
 	struct vm_area_struct pvma;
 	struct page *page;
 
-	spol = mpol_cond_copy(&mpol,
-				mpol_shared_policy_lookup(&info->policy, idx));
+	spol = mpol_cond_copy(&mpol, mpol_shared_policy_lookup(sp, idx));
 
 	/* Create a pseudo vma that just contains the policy */
 	pvma.vm_start = 0;
@@ -1165,8 +1164,8 @@ static struct page *shmem_swapin(swp_ent
 	return page;
 }
 
-static struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
+static struct page *shmem_alloc_page(gfp_t gfp, struct shared_policy *sp,
+				unsigned long idx)
 {
 	struct vm_area_struct pvma;
 
@@ -1174,7 +1173,7 @@ static struct page *shmem_alloc_page(gfp
 	pvma.vm_start = 0;
 	pvma.vm_pgoff = idx;
 	pvma.vm_ops = NULL;
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
+	pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
 
 	/*
 	 * alloc_page_vma() will drop the shared policy reference
@@ -1188,14 +1187,14 @@ static inline void shmem_show_mpol(struc
 }
 #endif /* CONFIG_TMPFS */
 
-static inline struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
+static inline struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp, void *sp,
+						unsigned long idx)
 {
 	return swapin_readahead(entry, gfp, NULL, 0);
 }
 
-static inline struct page *shmem_alloc_page(gfp_t gfp,
-			struct shmem_inode_info *info, unsigned long idx)
+static inline struct page *shmem_alloc_page(gfp_t gfp, void *sp,
+						unsigned long idx)
 {
 	return alloc_page(gfp);
 }
@@ -1260,7 +1259,7 @@ repeat:
 		radix_tree_preload_end();
 		if (sgp != SGP_READ && !prealloc_page) {
 			/* We don't care if this fails */
-			prealloc_page = shmem_alloc_page(gfp, info, idx);
+			prealloc_page = shmem_alloc_page(gfp, mapping->spolicy, idx);
 			if (prealloc_page) {
 				if (mem_cgroup_cache_charge(prealloc_page,
 						current->mm, GFP_KERNEL)) {
@@ -1293,7 +1292,8 @@ repeat:
 				*type |= VM_FAULT_MAJOR;
 			}
 			spin_unlock(&info->lock);
-			swappage = shmem_swapin(swap, gfp, info, idx);
+			swappage = shmem_swapin(swap, gfp, mapping->spolicy,
+									idx);
 			if (!swappage) {
 				spin_lock(&info->lock);
 				entry = shmem_swp_alloc(info, idx, sgp);
@@ -1420,7 +1420,7 @@ repeat:
 
 			if (!prealloc_page) {
 				spin_unlock(&info->lock);
-				filepage = shmem_alloc_page(gfp, info, idx);
+				filepage = shmem_alloc_page(gfp, mapping->spolicy, idx);
 				if (!filepage) {
 					shmem_unacct_blocks(info->flags, 1);
 					shmem_free_blocks(inode, 1);
@@ -1608,7 +1608,8 @@ static struct inode *shmem_get_inode(str
 			inode->i_mapping->a_ops = &shmem_aops;
 			inode->i_op = &shmem_inode_operations;
 			inode->i_fop = &shmem_file_operations;
-			mpol_shared_policy_init(&info->policy,
+			inode->i_mapping->spolicy = &info->policy;
+			mpol_shared_policy_init(inode->i_mapping->spolicy,
 						 shmem_get_sbmpol(sbinfo));
 			break;
 		case S_IFDIR:
@@ -1623,7 +1624,9 @@ static struct inode *shmem_get_inode(str
 			 * Must not load anything in the rbtree,
 			 * mpol_free_shared_policy will not be called.
 			 */
-			mpol_shared_policy_init(&info->policy, NULL);
+			inode->i_mapping->spolicy = &info->policy;
+			mpol_shared_policy_init(inode->i_mapping->spolicy,
+						NULL);
 			break;
 		}
 	} else
@@ -2419,7 +2422,7 @@ static void shmem_destroy_inode(struct i
 {
 	if ((inode->i_mode & S_IFMT) == S_IFREG) {
 		/* only struct inode is valid if it's an inline symlink */
-		mpol_free_shared_policy(&SHMEM_I(inode)->policy);
+		mpol_free_shared_policy(inode->i_mapping->spolicy);
 	}
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
Index: linux-2.6.36-mmotm-101103-1217/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/hugetlbfs/inode.c
+++ linux-2.6.36-mmotm-101103-1217/fs/hugetlbfs/inode.c
@@ -472,7 +472,8 @@ static struct inode *hugetlbfs_get_inode
 		 * call mpol_free_shared_policy() it will just return because
 		 * the rb tree will still be empty.
 		 */
-		mpol_shared_policy_init(&info->policy, NULL);
+		inode->i_mapping->spolicy = &info->policy;
+		mpol_shared_policy_init(inode->i_mapping->spolicy, NULL);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);


* [PATCH/RFC 3/14] Shared Policy: allocate shared policies as needed
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
  2010-11-11 19:11 ` [PATCH/RFC 1/14] Shared Policy: Miscellaneous Cleanup Lee Schermerhorn
  2010-11-11 19:12 ` [PATCH/RFC 2/14] Shared Policy: move shared policy to inode/mapping Lee Schermerhorn
@ 2010-11-11 19:12 ` Lee Schermerhorn
  2010-11-11 19:12 ` [PATCH/RFC 4/14] Shared Policy: let vma policy ops handle sub-vma policies Lee Schermerhorn
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:12 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - dynamically alloc shared policies

Remove shared policy structs from shmem and hugetlbfs inode
info structs and dynamically allocate them as needed.

Make shared policy pointer in address_space dependent on
CONFIG_NUMA to avoid burdening configs that don't need/want
NUMA support.  Access [get/set] the shared_policy via wrappers
that also depend on CONFIG_NUMA [to avoid excessive #ifdefs in .c files].

Initialize shmem and hugetlbfs inode/address_space spolicy
pointer to null, unless superblock [mount] specifies a
non-default policy.  Null shared policy pointer will cause
'get policy'--e.g., for page allocations--to fall back to task
policy, if any, else to system default policy.  Just like
NULL vma policies.

set_policy() ops must create the shared_policy struct from a new
kmem cache when a new policy is installed and no spolicy exists.
mpol_shared_policy_init() is replaced with mpol_shared_policy_new()
to accomplish this.
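
For reference, the user-visible path that reaches these set_policy()
ops is unchanged.  A minimal userspace sketch [illustrative only;
assumes libnuma's <numaif.h> mbind() wrapper] of installing a shared
policy on part of a SysV shared memory segment, which with this patch
allocates the shared_policy struct on first use:

	/*
	 * Illustrative only:  mbind() on a shared mapping invokes the
	 * segment's set_policy vma op; with this patch that op
	 * allocates the shared_policy struct if none exists yet.
	 */
	#include <sys/ipc.h>
	#include <sys/shm.h>
	#include <numaif.h>
	#include <stdio.h>

	#define SEG_SIZE	(16UL * 1024 * 1024)

	int main(void)
	{
		unsigned long nodes = 0x3;	/* nodes 0 and 1 */
		int shmid = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
		char *seg;

		if (shmid < 0)
			return 1;
		seg = shmat(shmid, NULL, 0);
		if (seg == (void *)-1)
			return 1;

		/* shared policy on the first half of the segment only */
		if (mbind(seg, SEG_SIZE / 2, MPOL_INTERLEAVE, &nodes, 64, 0))
			perror("mbind");

		/* pages faulted in, by any attached task, now obey the policy */
		return 0;
	}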

shmem must create/initialize a shared_policy when it allocates
an inode if the tmpfs super-block/mount point specifies a
non-default policy.

mpol_free_shared_policy() must free the spolicy, if any, when
inode is destroyed.

NOTE:  along with the previous patch in the series, this patch
adds a single pointer to the generic address_space struct and
thus to all inodes.  Last I looked, this did not decrease the
inodes/slab for x86_64 [and ia64 FWIW].  Other arches need to be
checked.  If this extra pointer is problematic, I believe we
could overload the non-linear pointer and use as_flags to detect it.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/hugetlbfs/inode.c          |   19 ++----
 fs/inode.c                    |    1 
 include/linux/fs.h            |   27 ++++++++
 include/linux/hugetlb.h       |    1 
 include/linux/shared_policy.h |   24 ++++---
 include/linux/shmem_fs.h      |    1 
 mm/mempolicy.c                |  128 ++++++++++++++++++++++++++++--------------
 mm/shmem.c                    |   49 ++++++++++------
 8 files changed, 168 insertions(+), 82 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/shared_policy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
@@ -1,6 +1,7 @@
 #ifndef _LINUX_SHARED_POLICY_H
 #define _LINUX_SHARED_POLICY_H 1
 
+#include <linux/fs.h>
 #include <linux/spinlock.h>
 #include <linux/rbtree.h>
 
@@ -28,13 +29,15 @@ struct shared_policy {
 	spinlock_t lock;	/* protects rb tree */
 };
 
-void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol);
-int mpol_set_shared_policy(struct shared_policy *,
-				struct vm_area_struct *,
-				struct mempolicy *);
-void mpol_free_shared_policy(struct shared_policy *);
-struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
-					    unsigned long);
+extern struct shared_policy *mpol_shared_policy_new(
+					struct address_space *mapping,
+					struct mempolicy *mpol);
+extern int mpol_set_shared_policy(struct shared_policy *,
+					struct vm_area_struct *,
+					struct mempolicy *);
+extern void mpol_free_shared_policy(struct address_space *);
+extern struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
+					unsigned long);
 
 #else /* !NUMA */
 
@@ -47,12 +50,12 @@ static inline int mpol_set_shared_policy
 	return -EINVAL;
 }
 
-static inline void mpol_shared_policy_init(struct shared_policy *sp,
-						struct mempolicy *mpol)
+static inline struct shared_policy *
+mpol_shared_policy_new(struct address_space *mapping, struct mempolicy *mpol)
 {
 }
 
-static inline void mpol_free_shared_policy(struct shared_policy *p)
+static inline void mpol_free_shared_policy(struct shared_policy *sp)
 {
 }
 
@@ -61,6 +64,7 @@ mpol_shared_policy_lookup(struct shared_
 {
 	return NULL;
 }
+
 #endif
 
 #endif /* _LINUX_SHARED_POLICY_H */
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -102,6 +102,7 @@
 #define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
 
 static struct kmem_cache *policy_cache;
+static struct kmem_cache *sp_cache;
 static struct kmem_cache *sn_cache;
 
 /* Highest zone. An specific allocation for a zone below that is not
@@ -2137,52 +2138,86 @@ restart:
 }
 
 /**
- * mpol_shared_policy_init - initialize shared policy for inode
- * @sp: pointer to inode shared policy
- * @mpol:  struct mempolicy to install
+ * mpol_shared_policy_new - allocate and initialize a shared policy struct
+ * @mpol:  struct mempolicy to install, if non-NULL == tmpfs mount point
+ * mempolicy.
  *
- * Install non-NULL @mpol in inode's shared policy rb-tree.
+ * Allocate a new shared policy structure and install non-NULL @mpol.
  * On entry, the current task has a reference on a non-NULL @mpol.
  * This must be released on exit.
  * This is called at get_inode() calls and we can use GFP_KERNEL.
  */
-void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
+struct shared_policy *mpol_shared_policy_new(struct address_space *mapping,
+						struct mempolicy *mpol)
 {
-	int ret;
-
-	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
-	spin_lock_init(&sp->lock);
+	struct shared_policy *sp, *spx;
+	struct mempolicy *new = NULL;
+	int err = 0;
 
 	if (mpol) {
-		struct vm_area_struct pvma;
-		struct mempolicy *new;
 		NODEMASK_SCRATCH(scratch);
 
-		if (!scratch)
-			goto put_mpol;
-		/* contextualize the tmpfs mount point mempolicy */
+		if (!scratch) {
+			sp = ERR_PTR(-ENOMEM);
+			err = !0;
+			goto put_free;
+		}
+		sp = mapping->spolicy;
+		/*
+		 * Contextualize the tmpfs mount point mempolicy.  Ensure that
+		 * we have a good mempolicy before allocating new shared policy.
+		 */
 		new = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask);
-		if (IS_ERR(new))
-			goto free_scratch; /* no valid nodemask intersection */
+		err = IS_ERR(new);
+		if (err)
+			goto put_free;
 
 		task_lock(current);
-		ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
+		err = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
 		task_unlock(current);
-		if (ret)
-			goto put_new;
+put_free:
+		mpol_put(mpol);	/* drop our ref on sb mpol */
+		NODEMASK_SCRATCH_FREE(scratch);	/* scratch may be NULL */
+		if (err) {
+			mpol_put(new);	/* free bogus new mpol */
+			return sp;
+		}
+	}
 
-		/* Create pseudo-vma that contains just the policy */
+	sp = kmem_cache_alloc(sp_cache, GFP_KERNEL);
+	if (!sp) {
+		mpol_put(new);
+		return ERR_PTR(-ENOMEM);
+	}
+	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
+	spin_lock_init(&sp->lock);
+
+	if (new) {
+		/*
+		 * Create pseudo-vma to specify policy range and
+		 * install new mempolicy
+		 */
+		struct vm_area_struct pvma;
 		memset(&pvma, 0, sizeof(struct vm_area_struct));
 		pvma.vm_end = TASK_SIZE;	/* policy covers entire file */
-		mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
-
-put_new:
+		err = mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
 		mpol_put(new);			/* drop initial ref */
-free_scratch:
-		NODEMASK_SCRATCH_FREE(scratch);
-put_mpol:
-		mpol_put(mpol);	/* drop our incoming ref on sb mpol */
 	}
+
+	/*
+	 * resolve potential set/set race; handle 'set' error
+	 */
+	spin_lock(&mapping->i_mmap_lock);
+	spx = mapping->spolicy;
+	if (!spx && !err)
+		mapping->spolicy = spx = sp;
+	else
+		err = !0;
+	spin_unlock(&mapping->i_mmap_lock);
+	if (err)
+		kmem_cache_free(sp_cache, sp);
+
+	return spx;
 }
 
 int mpol_set_shared_policy(struct shared_policy *sp,
@@ -2211,28 +2246,35 @@ int mpol_set_shared_policy(struct shared
 
 /**
  * mpol_free_shared_policy() - Free a backing policy store on inode delete.
- * @sp - shared policy structure to free
+ * @mapping - address_space struct containing pointer to shared policy to be freed.
  *
  * Frees the shared policy red-black tree, if any, before freeing the
- * shared policy struct itself.
+ * shared policy struct itself, if any.
  */
-void mpol_free_shared_policy(struct shared_policy *sp)
+void mpol_free_shared_policy(struct address_space *mapping)
 {
+	struct shared_policy *sp = mapping->spolicy;
 	struct sp_node *n;
 	struct rb_node *next;
 
-	if (!sp->root.rb_node)
-		return;
-	spin_lock(&sp->lock);
-	next = rb_first(&sp->root);
-	while (next) {
-		n = rb_entry(next, struct sp_node, nd);
-		next = rb_next(&n->nd);
-		rb_erase(&n->nd, &sp->root);
-		mpol_put(n->policy);
-		kmem_cache_free(sn_cache, n);
+	if (!sp)
+  		return;
+
+	mapping->spolicy = NULL;
+
+	if (sp->root.rb_node) {
+		spin_lock(&sp->lock);
+		next = rb_first(&sp->root);
+		while (next) {
+			n = rb_entry(next, struct sp_node, nd);
+			next = rb_next(&n->nd);
+			rb_erase(&n->nd, &sp->root);
+			mpol_put(n->policy);
+			kmem_cache_free(sn_cache, n);
+		}
+		spin_unlock(&sp->lock);
 	}
-	spin_unlock(&sp->lock);
+	kmem_cache_free(sp_cache, sp);
 }
 
 /* assumes fs == KERNEL_DS */
@@ -2246,6 +2288,10 @@ void __init numa_policy_init(void)
 					 sizeof(struct mempolicy),
 					 0, SLAB_PANIC, NULL);
 
+	sp_cache = kmem_cache_create("shared_policy",
+				     sizeof(struct shared_policy),
+				     0, SLAB_PANIC, NULL);
+
 	sn_cache = kmem_cache_create("shared_policy_node",
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -1259,7 +1259,8 @@ repeat:
 		radix_tree_preload_end();
 		if (sgp != SGP_READ && !prealloc_page) {
 			/* We don't care if this fails */
-			prealloc_page = shmem_alloc_page(gfp, mapping->spolicy, idx);
+			prealloc_page = shmem_alloc_page(gfp,
+					mapping_shared_policy(mapping), idx);
 			if (prealloc_page) {
 				if (mem_cgroup_cache_charge(prealloc_page,
 						current->mm, GFP_KERNEL)) {
@@ -1292,8 +1293,8 @@ repeat:
 				*type |= VM_FAULT_MAJOR;
 			}
 			spin_unlock(&info->lock);
-			swappage = shmem_swapin(swap, gfp, mapping->spolicy,
-									idx);
+			swappage = shmem_swapin(swap, gfp,
+					mapping_shared_policy(mapping), idx);
 			if (!swappage) {
 				spin_lock(&info->lock);
 				entry = shmem_swp_alloc(info, idx, sgp);
@@ -1420,7 +1421,8 @@ repeat:
 
 			if (!prealloc_page) {
 				spin_unlock(&info->lock);
-				filepage = shmem_alloc_page(gfp, mapping->spolicy, idx);
+				filepage = shmem_alloc_page(gfp,
+						mapping_shared_policy(mapping), idx);
 				if (!filepage) {
 					shmem_unacct_blocks(info->flags, 1);
 					shmem_free_blocks(inode, 1);
@@ -1525,18 +1527,28 @@ static int shmem_fault(struct vm_area_st
 #ifdef CONFIG_NUMA
 static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
 {
-	struct inode *i = vma->vm_file->f_path.dentry->d_inode;
-	return mpol_set_shared_policy(&SHMEM_I(i)->policy, vma, new);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct shared_policy *sp = mapping_shared_policy(mapping);
+
+	if (!sp) {
+		sp = mpol_shared_policy_new(mapping, NULL);
+		if (IS_ERR(sp))
+			return PTR_ERR(sp);
+	}
+	return mpol_set_shared_policy(sp, vma, new);
 }
 
 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 					  unsigned long addr)
 {
-	struct inode *i = vma->vm_file->f_path.dentry->d_inode;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct shared_policy *sp = mapping_shared_policy(mapping);
 	unsigned long idx;
 
+	if (!sp)
+		return NULL;	/* == default policy */
 	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);
+	return mpol_shared_policy_lookup(sp, idx);
 }
 #endif
 
@@ -1608,9 +1620,15 @@ static struct inode *shmem_get_inode(str
 			inode->i_mapping->a_ops = &shmem_aops;
 			inode->i_op = &shmem_inode_operations;
 			inode->i_fop = &shmem_file_operations;
-			inode->i_mapping->spolicy = &info->policy;
-			mpol_shared_policy_init(inode->i_mapping->spolicy,
-						 shmem_get_sbmpol(sbinfo));
+			if (sbinfo->mpol) {
+				struct address_space *mapping =
+							 inode->i_mapping;
+				struct shared_policy *sp =
+						mpol_shared_policy_new(mapping,
+						     shmem_get_sbmpol(sbinfo));
+				if (!IS_ERR(sp))
+					set_mapping_shared_policy(mapping, sp);
+			}
 			break;
 		case S_IFDIR:
 			inc_nlink(inode);
@@ -1621,12 +1639,9 @@ static struct inode *shmem_get_inode(str
 			break;
 		case S_IFLNK:
 			/*
-			 * Must not load anything in the rbtree,
-			 * mpol_free_shared_policy will not be called.
+			 * This case only exists so that we don't attempt
+			 * to call init_special_inode() for sym links.
 			 */
-			inode->i_mapping->spolicy = &info->policy;
-			mpol_shared_policy_init(inode->i_mapping->spolicy,
-						NULL);
 			break;
 		}
 	} else
@@ -2422,7 +2437,7 @@ static void shmem_destroy_inode(struct i
 {
 	if ((inode->i_mode & S_IFMT) == S_IFREG) {
 		/* only struct inode is valid if it's an inline symlink */
-		mpol_free_shared_policy(inode->i_mapping->spolicy);
+		mpol_free_shared_policy(inode->i_mapping);
 	}
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
Index: linux-2.6.36-mmotm-101103-1217/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/hugetlbfs/inode.c
+++ linux-2.6.36-mmotm-101103-1217/fs/hugetlbfs/inode.c
@@ -448,14 +448,13 @@ static int hugetlbfs_setattr(struct dent
 	return 0;
 }
 
-static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid, 
+static struct inode *hugetlbfs_get_inode(struct super_block *sb, uid_t uid,
 					gid_t gid, int mode, dev_t dev)
 {
 	struct inode *inode;
 
 	inode = new_inode(sb);
 	if (inode) {
-		struct hugetlbfs_inode_info *info;
 		inode->i_ino = get_next_ino();
 		inode->i_mode = mode;
 		inode->i_uid = uid;
@@ -464,16 +463,9 @@ static struct inode *hugetlbfs_get_inode
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		INIT_LIST_HEAD(&inode->i_mapping->private_list);
-		info = HUGETLBFS_I(inode);
 		/*
-		 * The policy is initialized here even if we are creating a
-		 * private inode because initialization simply creates an
-		 * an empty rb tree and calls spin_lock_init(), later when we
-		 * call mpol_free_shared_policy() it will just return because
-		 * the rb tree will still be empty.
+		 * leave i_mapping->spolicy NULL [default policy]
 		 */
-		inode->i_mapping->spolicy = &info->policy;
-		mpol_shared_policy_init(inode->i_mapping->spolicy, NULL);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
@@ -486,7 +478,10 @@ static struct inode *hugetlbfs_get_inode
 			inode->i_op = &hugetlbfs_dir_inode_operations;
 			inode->i_fop = &simple_dir_operations;
 
-			/* directory inodes start off with i_nlink == 2 (for "." entry) */
+			/*
+			 * directory inodes start off with i_nlink == 2
+			 * (for "." entry)
+			 */
 			inc_nlink(inode);
 			break;
 		case S_IFLNK:
@@ -667,7 +662,7 @@ static struct inode *hugetlbfs_alloc_ino
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
-	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);
+	mpol_free_shared_policy(inode->i_mapping);
 	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
 }
 
Index: linux-2.6.36-mmotm-101103-1217/include/linux/hugetlb.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/hugetlb.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/hugetlb.h
@@ -152,7 +152,6 @@ struct hugetlbfs_sb_info {
 
 
 struct hugetlbfs_inode_info {
-	struct shared_policy policy;
 	struct inode vfs_inode;
 };
 
Index: linux-2.6.36-mmotm-101103-1217/include/linux/shmem_fs.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/shmem_fs.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shmem_fs.h
@@ -15,7 +15,6 @@ struct shmem_inode_info {
 	unsigned long		alloced;	/* data pages alloced to file */
 	unsigned long		swapped;	/* subtotal assigned to swap */
 	unsigned long		next_index;	/* highest alloced index + 1 */
-	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct page		*i_indirect;	/* top indirect blocks page */
 	swp_entry_t		i_direct[SHMEM_NR_DIRECT]; /* first blocks */
 	struct list_head	swaplist;	/* chain of maybes on swap */
Index: linux-2.6.36-mmotm-101103-1217/fs/inode.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/inode.c
+++ linux-2.6.36-mmotm-101103-1217/fs/inode.c
@@ -215,6 +215,7 @@ int inode_init_always(struct super_block
 		mapping->backing_dev_info = bdi;
 	}
 	inode->i_private = NULL;
+	set_mapping_shared_policy(mapping, NULL);
 	inode->i_mapping = mapping;
 #ifdef CONFIG_FS_POSIX_ACL
 	inode->i_acl = inode->i_default_acl = ACL_NOT_CACHED;
Index: linux-2.6.36-mmotm-101103-1217/include/linux/fs.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/fs.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/fs.h
@@ -647,7 +647,9 @@ struct address_space {
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
 
+#ifdef CONFIG_NUMA
 	struct shared_policy	*spolicy;
+#endif
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
@@ -655,6 +657,31 @@ struct address_space {
 	 * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
 	 */
 
+#ifdef CONFIG_NUMA
+static inline struct shared_policy *
+mapping_shared_policy(struct address_space *mapping)
+{
+	return mapping->spolicy;
+}
+
+static inline void set_mapping_shared_policy(struct address_space *mapping,
+						struct shared_policy *sp)
+{
+	mapping->spolicy = sp;
+}
+
+#else
+static inline struct shared_policy *
+mapping_shared_policy(struct address_space *mapping)
+{
+	return NULL;
+}
+
+static inline void set_mapping_shared_policy(struct address_space *mapping,
+						struct shared_policy *sp)
+{ }
+#endif
+
 struct block_device {
 	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
 	struct inode *		bd_inode;	/* will die */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 4/14] Shared Policy: let vma policy ops handle sub-vma policies
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (2 preceding siblings ...)
  2010-11-11 19:12 ` [PATCH/RFC 3/14] Shared Policy: allocate shared policies as needed Lee Schermerhorn
@ 2010-11-11 19:12 ` Lee Schermerhorn
  2010-11-11 19:12 ` [PATCH/RFC 5/14] Shared Policy: fix show_numa_maps() Lee Schermerhorn
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:12 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure  - let vma policy ops handle sub-vma policies

Shared policies can handle subranges of an object.  No need to
split the vma for these mappings.  So, ...

Add a vm_mpol_flag member to vm_area_struct with flags to control the
vma splitting behavior.

	Note:  Perhaps there is another field where these flags
	could go?

Modify mbind_range() and policy_vma() to call the set_policy vma op,
if one exists, for vmas with VMPOL_F_NOSPLIT set, instead of splitting
the vma for the mempolicy range.  However, if the vma is VM_SHARED and
we would otherwise have to split it, don't:  the policy would just be
ignored and numa_maps would be [currently are] misleading.

Now, we don't want private mappings mucking with the shared policy of
the mapped file, if any, so use vma policy for private mappings.
We'll still split vmas for private mappings.

As a result, this patch enforces a defined semantic for set|get_policy()
ops:  they only get called for linear, shared mappings, and in that
case we don't split the vma.  Only shmem currently has set|get_policy()
ops, and this seems an appropriate semantic for shared objects, in
general.  It also matches current behavior.

Now, since the vma start and end addresses no longer specify the
range to which a new policy applies, we need to add start,end address
args to the vma policy ops.  The set_policy op/handler just calls into
mpol_set_shared_policy() to do the real work, so we could just pass
the start and end address, along with the vma, down to that function.
However, to eliminate the need for the pseudo-vma on the stack when
initializing the shared policy for an inode with non-default "superblock
policy", we change the interface to mpol_set_shared_policy() to take a
page offset and size in pages.  We compute the page offset and size in
the shmem set_policy handler from the vma and the address range.

NOTE:  Added helper function "vma_mpol_pgoff()" for computing page
offset for interleaving.  This is similar to [linear_]page_index()
but does not offset by the PAGE_CACHE_SHIFT so that it can be used
for calculating page indices for interleaving for both base pages
and huge pages [subsequent patch].  Perhaps this can be merged with
other similar functions?
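
For illustration only [hypothetical values, not from the patch]:  with
4K base pages, a vma with vm_start = 0x600000000000 and vm_pgoff = 16
gives

	/* ((addr - vm_start) >> PAGE_SHIFT) + vm_pgoff */
	pgoff_t idx = vma_mpol_pgoff(vma, 0x600000003000UL);
	/* idx = (0x3000 >> 12) + 16 = 19, in units of base pages */

which is the offset that the shared policy rbtree and the interleave
calculation work with.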

A word about non-linear mappings:

Shared policy ops have always installed and looked up shared policies
at a given (vma, address) by computing a page offset and size into the
backing file from that (vma, address), assuming a linear mapping from
virtual addresses to file page offsets.  Therefore this patch series
restricts shared policies to linearly mapped shared file mappings.  This
is nominally a change in behavior.

I don't know whether anyone is attempting to use memory policies with
non-linearly mapped shared memory areas or hugetlbfs mappings.  If so,
I don't understand what behavior they expect.  Since different tasks
could establish different mappings to the shared files, pages and mempolicies
that show up at one offset in one task can show up at a different offset
in another task.  However, if this is a required feature, and we can come up
with reasonable semantics for supporting shared policies with nonlinear
mappings, I can try to support it.

Note:  although this patch removed the splitting of the vma when a shared
policy is installed in a sub-range of the shared area, I will defer updating
the documentation to the following patch, which fixes the numa_maps display.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/sysfs/bin.c                |    5 +
 include/linux/mempolicy.h     |   16 +++++
 include/linux/mm.h            |   13 ++++
 include/linux/mm_types.h      |    1 
 include/linux/shared_policy.h |    7 +-
 ipc/shm.c                     |    5 +
 mm/mempolicy.c                |  119 ++++++++++++++++++++++++++++++++----------
 mm/shmem.c                    |   16 +++--
 8 files changed, 140 insertions(+), 42 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/mm_types.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mm_types.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mm_types.h
@@ -182,6 +182,7 @@ struct vm_area_struct {
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+	int vm_mpol_flags;		/* NOSPLIT, ... */
 #endif
 };
 
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -110,6 +110,13 @@ struct mempolicy {
 };
 
 /*
+ * vma memory policy flags
+ */
+enum vm_mpol_flags {
+	VMPOL_F_NOSPLIT  = 0x00000001,	/* don't split vma for mempolicy */
+};
+
+/*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
  */
@@ -157,6 +164,11 @@ static inline struct mempolicy *mpol_dup
 #define vma_policy(vma) ((vma)->vm_policy)
 #define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
 
+static inline void mpol_set_vma_nosplit(struct vm_area_struct *vma)
+{
+	(vma)->vm_mpol_flags |= VMPOL_F_NOSPLIT;
+}
+
 static inline void mpol_get(struct mempolicy *pol)
 {
 	if (pol)
@@ -258,6 +270,10 @@ static inline struct mempolicy *mpol_dup
 #define vma_policy(vma) NULL
 #define vma_set_policy(vma, pol) do {} while(0)
 
+static inline void mpol_set_vma_nosplit(struct vm_area_struct *vma)
+{
+}
+
 static inline void numa_policy_init(void)
 {
 }
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mm.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
@@ -212,7 +212,8 @@ struct vm_operations_struct {
 	 * install a MPOL_DEFAULT policy, nor the task or system default
 	 * mempolicy.
 	 */
-	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+	int (*set_policy)(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end, struct mempolicy *new);
 
 	/*
 	 * get_policy() op must add reference [mpol_get()] to any policy at
@@ -1232,6 +1233,16 @@ extern int after_bootmem;
 
 extern void setup_per_cpu_pageset(void);
 
+/*
+ * Address to offset for policy lookup and interleave calculation.
+ * Placed here because it needs struct vma definition.
+ */
+static inline pgoff_t vma_mpol_pgoff(struct vm_area_struct *vma,
+						unsigned long addr)
+{
+	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+}
+
 extern void zone_pcp_update(struct zone *zone);
 
 /* nommu.c */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/shared_policy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
@@ -32,9 +32,8 @@ struct shared_policy {
 extern struct shared_policy *mpol_shared_policy_new(
 					struct address_space *mapping,
 					struct mempolicy *mpol);
-extern int mpol_set_shared_policy(struct shared_policy *,
-					struct vm_area_struct *,
-					struct mempolicy *);
+extern int mpol_set_shared_policy(struct shared_policy *, pgoff_t,
+				unsigned long, struct mempolicy *);
 extern void mpol_free_shared_policy(struct address_space *);
 extern struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *,
 					unsigned long);
@@ -44,7 +43,7 @@ extern struct mempolicy *mpol_shared_pol
 struct shared_policy {};
 
 static inline int mpol_set_shared_policy(struct shared_policy *info,
-					struct vm_area_struct *vma,
+					pgoff_t pgoff, unsigned long sz,
 					struct mempolicy *new)
 {
 	return -EINVAL;
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -1158,7 +1158,7 @@ struct page *shmem_swapin(swp_entry_t en
 	/* Create a pseudo vma that just contains the policy */
 	pvma.vm_start = 0;
 	pvma.vm_pgoff = idx;
-	pvma.vm_ops = NULL;
+	pvma.vm_file = NULL;
 	pvma.vm_policy = spol;
 	page = swapin_readahead(entry, gfp, &pvma, 0);
 	return page;
@@ -1172,7 +1172,7 @@ static struct page *shmem_alloc_page(gfp
 	/* Create a pseudo vma that just contains the policy */
 	pvma.vm_start = 0;
 	pvma.vm_pgoff = idx;
-	pvma.vm_ops = NULL;
+	pvma.vm_file = NULL;
 	pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
 
 	/*
@@ -1525,7 +1525,8 @@ static int shmem_fault(struct vm_area_st
 }
 
 #ifdef CONFIG_NUMA
-static int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+static int shmem_set_policy(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct shared_policy *sp = mapping_shared_policy(mapping);
@@ -1535,7 +1536,8 @@ static int shmem_set_policy(struct vm_ar
 		if (IS_ERR(sp))
 			return PTR_ERR(sp);
 	}
-	return mpol_set_shared_policy(sp, vma, new);
+	return mpol_set_shared_policy(sp, vma_mpol_pgoff(vma, start),
+					(end - start) >> PAGE_SHIFT, new);
 }
 
 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
@@ -1543,12 +1545,10 @@ static struct mempolicy *shmem_get_polic
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	struct shared_policy *sp = mapping_shared_policy(mapping);
-	unsigned long idx;
 
 	if (!sp)
 		return NULL;	/* == default policy */
-	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	return mpol_shared_policy_lookup(sp, idx);
+	return mpol_shared_policy_lookup(sp, vma_mpol_pgoff(vma, addr));
 }
 #endif
 
@@ -1583,6 +1583,7 @@ static int shmem_mmap(struct file *file,
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
 	vma->vm_flags |= VM_CAN_NONLINEAR;
+	mpol_set_vma_nosplit(vma);
 	return 0;
 }
 
@@ -2802,5 +2803,6 @@ int shmem_zero_setup(struct vm_area_stru
 		fput(vma->vm_file);
 	vma->vm_file = file;
 	vma->vm_ops = &shmem_vm_ops;
+	mpol_set_vma_nosplit(vma);
 	return 0;
 }
Index: linux-2.6.36-mmotm-101103-1217/ipc/shm.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/ipc/shm.c
+++ linux-2.6.36-mmotm-101103-1217/ipc/shm.c
@@ -223,13 +223,14 @@ static int shm_fault(struct vm_area_stru
 }
 
 #ifdef CONFIG_NUMA
-static int shm_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+int shm_set_policy(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new)
 {
 	struct file *file = vma->vm_file;
 	struct shm_file_data *sfd = shm_file_data(file);
 	int err = 0;
 	if (sfd->vm_ops->set_policy)
-		err = sfd->vm_ops->set_policy(vma, new);
+		err = sfd->vm_ops->set_policy(vma, start, end, new);
 	return err;
 }
 
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -610,20 +610,60 @@ check_range(struct mm_struct *mm, unsign
 	return first;
 }
 
-/* Apply policy to a single VMA */
-static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
+static bool vma_is_shared_linear(struct vm_area_struct *vma)
+{
+	return ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED);
+}
+
+static bool mpol_nosplit_vma(struct vm_area_struct *vma)
+{
+	if (vma->vm_mpol_flags & VMPOL_F_NOSPLIT)
+		return true;
+
+	if (vma_is_shared_linear(vma)  &&
+	    vma->vm_ops && vma->vm_ops->set_policy) {
+		vma->vm_mpol_flags |= VMPOL_F_NOSPLIT;
+		return true;
+	}
+	return false;
+}
+
+static bool mpol_use_get_op(struct vm_area_struct *vma)
+{
+	/*
+	 * Not for anon/private mappings.
+	 * And no need to invoke get_policy op if file doesn't
+	 * already have a shared policy.
+	 */
+	if (!vma_is_shared_linear(vma) ||
+	    !vma->vm_file || !vma->vm_file->f_mapping->spolicy)
+		return false;
+
+	VM_BUG_ON(!(vma->vm_ops && vma->vm_ops->get_policy));
+	return true;
+}
+
+/*
+ * Apply policy to a single VMA, or a subrange thereof
+ */
+static int policy_vma(struct vm_area_struct *vma, unsigned long start,
+			unsigned long end, struct mempolicy *new,
+			bool use_set_op)
 {
 	int err = 0;
-	struct mempolicy *old = vma->vm_policy;
 
 	pr_debug("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
-		 vma->vm_start, vma->vm_end, vma->vm_pgoff,
+		 start, end, vma_mpol_pgoff(vma, start),
 		 vma->vm_ops, vma->vm_file,
 		 vma->vm_ops ? vma->vm_ops->set_policy : NULL);
 
-	if (vma->vm_ops && vma->vm_ops->set_policy)
-		err = vma->vm_ops->set_policy(vma, new);
-	if (!err) {
+	/*
+	 * set_policy op, if exists, is responsible for policy ref counts.
+	 */
+	if (use_set_op)
+		err = vma->vm_ops->set_policy(vma, start, end, new);
+	else {
+		struct mempolicy *old = vma->vm_policy;
 		mpol_get(new);
 		vma->vm_policy = new;
 		mpol_put(old);
@@ -652,6 +692,28 @@ static int mbind_range(struct mm_struct
 		vmstart = max(start, vma->vm_start);
 		vmend   = min(end, vma->vm_end);
 
+		if (mpol_nosplit_vma(vma)) {
+			/*
+			 * set_policy op handles policies on sub-range
+			 * of vma for linear, shared mappings
+			 */
+			err = policy_vma(vma, vmstart, vmend, new_pol, true);
+			if (err)
+				break;
+			continue;
+		} else if (vma->vm_flags & VM_SHARED) {
+			/*
+			 * mempolicy will be ignored, so don't bother to
+			 * modify vma.  numa_maps would be misleading.
+			 */
+			continue;
+		}
+
+		/*
+		 * for private mappings and shared mappings of objects whose
+		 * mempolicy vm_ops don't support sub-range policies,
+		 * merge/split the vma, as needed, and use vma policy
+		 */
 		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
 				  vma->anon_vma, vma->vm_file, pgoff, new_pol);
@@ -670,7 +732,7 @@ static int mbind_range(struct mm_struct
 			if (err)
 				goto out;
 		}
-		err = policy_vma(vma, new_pol);
+		err = policy_vma(vma, vmstart, vmend, new_pol, false);
 		if (err)
 			goto out;
 	}
@@ -1493,7 +1555,10 @@ static struct mempolicy *get_vma_policy(
 	struct mempolicy *pol = task->mempolicy;
 
 	if (vma) {
-		if (vma->vm_ops && vma->vm_ops->get_policy) {
+		/*
+		 * use get_policy op, if any, for shared mappings
+		 */
+		if (mpol_use_get_op(vma)) {
 			struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
 									addr);
 			if (vpol)
@@ -2193,14 +2258,8 @@ put_free:
 	spin_lock_init(&sp->lock);
 
 	if (new) {
-		/*
-		 * Create pseudo-vma to specify policy range and
-		 * install new mempolicy
-		 */
-		struct vm_area_struct pvma;
-		memset(&pvma, 0, sizeof(struct vm_area_struct));
-		pvma.vm_end = TASK_SIZE;	/* policy covers entire file */
-		err = mpol_set_shared_policy(sp, &pvma, new); /* adds ref */
+		err = mpol_set_shared_policy(sp, 0UL, TASK_SIZE >> PAGE_SHIFT,
+						new);
 		mpol_put(new);			/* drop initial ref */
 	}
 
@@ -2220,25 +2279,33 @@ put_free:
 	return spx;
 }
 
+/**
+ * mpol_set_shared_policy - install mempolicy in shared policy tree
+ * @sp:	 pointer to shared policy struct
+ * @pgoff:  offset in address_space where mempolicy applies
+ * @sz:  size of range [pages] to which mempolicy applies
+ * @mpol:  the mempolicy to install
+ *
+ */
 int mpol_set_shared_policy(struct shared_policy *sp,
-			struct vm_area_struct *vma, struct mempolicy *npol)
+				pgoff_t pgoff, unsigned long sz,
+				struct mempolicy *mpol)
 {
 	int err;
 	struct sp_node *new = NULL;
-	unsigned long sz = vma_pages(vma);
 
 	pr_debug("set_shared_policy %lx sz %lu %d %d %lx\n",
-		 vma->vm_pgoff,
-		 sz, npol ? npol->mode : -1,
-		 npol ? npol->flags : -1,
-		 npol ? nodes_addr(npol->v.nodes)[0] : -1);
+		 pgoff,
+		 sz, mpol ? mpol->mode : -1,
+		 mpol ? mpol->flags : -1,
+		 mpol ? nodes_addr(mpol->v.nodes)[0] : -1);
 
-	if (npol) {
-		new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+	if (mpol) {
+		new = sp_alloc(pgoff, pgoff + sz, mpol);
 		if (!new)
 			return -ENOMEM;
 	}
-	err = shared_policy_replace(sp, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+	err = shared_policy_replace(sp, pgoff, pgoff+sz, new);
 	if (err && new)
 		kmem_cache_free(sn_cache, new);
 	return err;
Index: linux-2.6.36-mmotm-101103-1217/fs/sysfs/bin.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/sysfs/bin.c
+++ linux-2.6.36-mmotm-101103-1217/fs/sysfs/bin.c
@@ -256,7 +256,8 @@ static int bin_access(struct vm_area_str
 }
 
 #ifdef CONFIG_NUMA
-static int bin_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+static int bin_set_policy(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end, struct mempolicy *new)
 {
 	struct file *file = vma->vm_file;
 	struct bin_buffer *bb = file->private_data;
@@ -271,7 +272,7 @@ static int bin_set_policy(struct vm_area
 
 	ret = 0;
 	if (bb->vm_ops->set_policy)
-		ret = bb->vm_ops->set_policy(vma, new);
+		ret = bb->vm_ops->set_policy(vma, start, end, new);
 
 	sysfs_put_active(attr_sd);
 	return ret;

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 5/14] Shared Policy: fix show_numa_maps()
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (3 preceding siblings ...)
  2010-11-11 19:12 ` [PATCH/RFC 4/14] Shared Policy: let vma policy ops handle sub-vma policies Lee Schermerhorn
@ 2010-11-11 19:12 ` Lee Schermerhorn
  2010-11-11 19:12 ` [PATCH/RFC 6/14] Shared Policy: Factor alloc_page_pol routine Lee Schermerhorn
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:12 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - fix show_numa_maps()

This patch updates the procfs numa_maps display to handle multiple
shared policy ranges on a single vma.  numa_maps() still uses the
procfs task maps infrastructure, but provides wrappers around the
maps seq_file ops to handle shared policy "submaps", if any.

This fixes a problem with numa_maps for shared mappings:

Before this [shared policy] patch series, numa_maps could show you
different results for shared mappings depending on which task you
examined.  A task which has installed shared policies on sub-ranges
of the shared region will show the policies on the sub-ranges, as the
vmas for that task were split when the policies were installed.
Another task that shares the region, but didn't install any policies,
or installed policies on a different region or set of regions will
show a different policy/range or set thereof, based on the VMAs
of that task.  By displaying the policies directly from the shared
policy structure, we now see the same info--the correct effective
mempolicy for each address range--from each task that maps the segment.
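
For example [addresses, file name and policy below are made up, and the
trailing page-state fields are omitted, purely to illustrate the
effect]:  a shared mapping with an interleave policy installed on one
sub-range would now show up in every mapper's numa_maps roughly as

	7f0000000000 default file=/dev/shm/seg
	7f0000100000 interleave:0-3 file=/dev/shm/seg
	7f0000200000 default file=/dev/shm/seg

whereas, before, only the task that installed the policy saw the split
ranges.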

The patch expands the proc_maps_private struct [#ifdef CONFIG_NUMA]
to track the existence of and progress through a submap for the
"current" vma.  For vmas with shared policy submaps, a new
function--get_numa_submap()--in mm/mempolicy.c allocates and
populates an array of the policy ranges in the shared policy.
To facilitate this, the shared policy struct tracks the number
of ranges [sp_nodes] in the tree.

The nm_* numa_map seq_file wrappers pass the range to be displayed
to show_numa_map() via the saddr and eaddr members added to the
proc_maps_private struct.  The patch modifies show_numa_map() to
use these members, where appropriate, instead of vm_start, vm_end.

As before, once the internal page size buffer is full, seq_read()
suspends the display, drops the mmap_sem and exits the read.
During this time the vma list can change.  However, even within a
single seq_read(), the shared_policy "submap" can be changed by
other mappers.  We could prevent this by holding the shared policy
spin_lock or otherwise holding off other mappers.  That would also
hold off other tasks faulting in pages, attempting to look up the
policy for that offset, unless we convert the lock to reader/writer.

It doesn't seem worth the effort, as the numa_map is only a snapshot
in any case.  So, this patch makes a best effort [at least as good as
unpatched task map code, I think] to perform a single scan over the
address space, displaying the policies and page state/location
for policy ranges "snapped" under spin lock into the "submap"
array mentioned above.

NOTE:  this patch adds a fair bit of code to the numa maps display.
If necessary, I can make numa_maps a separately configurable option.
E.g., in support of, say, embedded systems that use numa [sh?] and
want/need /proc, but don't need numa_maps nor want the overhead.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/numa_memory_policy.txt |   20 +--
 fs/proc/task_mmu.c                      |  186 +++++++++++++++++++++++++++++++-
 include/linux/mempolicy.h               |    5 
 include/linux/mm.h                      |    6 +
 include/linux/proc_fs.h                 |   12 ++
 include/linux/shared_policy.h           |    3 
 mm/mempolicy.c                          |   58 +++++++++
 7 files changed, 271 insertions(+), 19 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/proc_fs.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/proc_fs.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/proc_fs.h
@@ -286,12 +286,24 @@ static inline struct net *PDE_NET(struct
 	return pde->parent->data;
 }
 
+struct mpol_range {
+	unsigned long saddr;
+	unsigned long eaddr;
+};
+
 struct proc_maps_private {
 	struct pid *pid;
 	struct task_struct *task;
 #ifdef CONFIG_MMU
 	struct vm_area_struct *tail_vma;
 #endif
+
+#ifdef CONFIG_NUMA
+	struct vm_area_struct *vma;	/* preserved over seq_reads */
+	unsigned long saddr;
+	unsigned long eaddr;		/* preserved over seq_reads */
+	struct mpol_range *range, *ranges; /* preserved ... */
+#endif
 };
 
 #endif /* _LINUX_PROC_FS_H */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mm.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
@@ -1243,6 +1243,12 @@ static inline pgoff_t vma_mpol_pgoff(str
 	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 }
 
+static inline pgoff_t vma_mpol_addr(struct vm_area_struct *vma,
+						pgoff_t pgoff)
+{
+	return ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
+}
+
 extern void zone_pcp_update(struct zone *zone);
 
 /* nommu.c */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -235,6 +235,11 @@ static inline int vma_migratable(struct
 	return 1;
 }
 
+struct seq_file;
+extern int show_numa_map(struct seq_file *, void *);
+struct mpol_range;
+extern struct mpol_range *get_numa_submap(struct vm_area_struct *);
+
 #else
 
 struct mempolicy {};
Index: linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/shared_policy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
@@ -26,7 +26,8 @@ struct sp_node {
 
 struct shared_policy {
 	struct rb_root root;
-	spinlock_t lock;	/* protects rb tree */
+	spinlock_t     lock;		/* protects rb tree */
+	int            nr_sp_nodes;	/* for numa_maps */
 };
 
 extern struct shared_policy *mpol_shared_policy_new(
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -2108,6 +2108,8 @@ static void sp_insert(struct shared_poli
 	}
 	rb_link_node(&new->nd, parent, p);
 	rb_insert_color(&new->nd, &sp->root);
+ 	++sp->nr_sp_nodes;
+
 	pr_debug("inserting %lx-%lx: %d\n", new->start, new->end,
 		 new->policy ? new->policy->mode : 0);
 }
@@ -2137,6 +2139,7 @@ static void sp_delete(struct shared_poli
 	rb_erase(&n->nd, &sp->root);
 	mpol_put(n->policy);
 	kmem_cache_free(sn_cache, n);
+	--sp->nr_sp_nodes;
 }
 
 static struct sp_node *sp_alloc(unsigned long start, unsigned long end,
@@ -2256,6 +2259,7 @@ put_free:
 	}
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
 	spin_lock_init(&sp->lock);
+	sp->nr_sp_nodes = 0;
 
 	if (new) {
 		err = mpol_set_shared_policy(sp, 0UL, TASK_SIZE >> PAGE_SHIFT,
@@ -2740,11 +2744,11 @@ int show_numa_map(struct seq_file *m, vo
 	if (!md)
 		return 0;
 
-	pol = get_vma_policy(priv->task, vma, vma->vm_start);
+	pol = get_vma_policy(priv->task, vma, priv->saddr);
 	mpol_to_str(buffer, sizeof(buffer), pol, 0);
 	mpol_cond_put(pol);
 
-	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
+	seq_printf(m, "%08lx %s", priv->saddr, buffer);
 
 	if (file) {
 		seq_printf(m, " file=");
@@ -2757,10 +2761,10 @@ int show_numa_map(struct seq_file *m, vo
 	}
 
 	if (is_vm_hugetlb_page(vma)) {
-		check_huge_range(vma, vma->vm_start, vma->vm_end, md);
+		check_huge_range(vma, priv->saddr, priv->eaddr, md);
 		seq_printf(m, " huge");
 	} else {
-		check_pgd_range(vma, vma->vm_start, vma->vm_end,
+		check_pgd_range(vma, priv->saddr, priv->eaddr,
 			&node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
 	}
 
@@ -2799,3 +2803,49 @@ out:
 		m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
 	return 0;
 }
+
+/*
+ * alloc/populate array of shared policy ranges for show_numa_map()
+ */
+struct mpol_range *get_numa_submap(struct vm_area_struct *vma)
+{
+	struct shared_policy *sp;
+	struct mpol_range *ranges, *range;
+	struct rb_node *rbn;
+	int nranges;
+
+	BUG_ON(!vma->vm_file);
+	sp = mapping_shared_policy(vma->vm_file->f_mapping);
+	if (!sp)
+		return NULL;
+
+	nranges = sp->nr_sp_nodes;
+	if (!nranges)
+		return NULL;
+
+	ranges = kzalloc((nranges + 1) * sizeof(*ranges), GFP_KERNEL);
+	if (!ranges)
+		return NULL;	/* pretend there are none */
+
+	range = ranges;
+	spin_lock(&sp->lock);
+	/*
+	 * # of ranges could have changed since we checked, but that is
+	 * unlikely, so this is close enough [as long as it's safe].
+	 */
+	rbn = rb_first(&sp->root);
+	/*
+	 * count nodes to ensure we leave one empty range struct
+	 * in case node added between check and alloc
+	 */
+	while (rbn && nranges--) {
+		struct sp_node *spn = rb_entry(rbn, struct sp_node, nd);
+		range->saddr = vma_mpol_addr(vma, spn->start);
+		range->eaddr = vma_mpol_addr(vma, spn->end);
+		range++;
+		rbn = rb_next(rbn);
+	}
+
+	spin_unlock(&sp->lock);
+	return ranges;
+}
Index: linux-2.6.36-mmotm-101103-1217/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/proc/task_mmu.c
+++ linux-2.6.36-mmotm-101103-1217/fs/proc/task_mmu.c
@@ -818,12 +818,190 @@ const struct file_operations proc_pagema
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
-extern int show_numa_map(struct seq_file *m, void *v);
+/*
+ * numa_maps uses procfs task maps file operations, with wrappers
+ * to handle mpol submaps--policy ranges within a vma
+ */
+
+/*
+ * start processing a new vma for show_numa_maps
+ */
+static void nm_vma_start(struct proc_maps_private *priv,
+			struct vm_area_struct *vma)
+{
+	if (!vma)
+		return;
+	priv->vma = vma;	/* saved across read()s */
+
+	priv->saddr = vma->vm_start;
+	if (!(vma->vm_flags & VM_SHARED) || !vma->vm_file ||
+		!vma->vm_file->f_mapping->spolicy) {
+		/*
+		 * usual case:  no submap
+		 */
+		priv->eaddr = vma->vm_end;
+		return;
+	}
+
+	priv->range = priv->ranges = get_numa_submap(vma);
+	if (!priv->range) {
+		priv->eaddr = vma->vm_end;	/* empty shared policy */
+		return;
+	}
+
+	/*
+	 * restart suspended submap where we left off
+	 */
+	while (priv->range->eaddr && priv->range->eaddr < priv->eaddr)
+		++priv->range;
+
+	if (!priv->range->eaddr)
+		priv->eaddr = vma->vm_end;
+	else if (priv->saddr < priv->range->saddr)
+		priv->eaddr = priv->range->saddr; /* show gap [default pol] */
+	else
+		priv->eaddr = priv->range->eaddr; /* show range */
+}
+
+/*
+ * done with numa_maps vma:  reset so we start a new
+ * vma on next seq_read.
+ */
+static void nm_vma_stop(struct proc_maps_private *priv)
+{
+	kfree(priv->ranges);
+	priv->ranges = priv->range = NULL;
+	priv->vma = NULL;
+}
+
+/*
+ * Advance to next vma in mm or next subrange in vma.
+ * mmap_sem held during a single seq_read(), but shared
+ * policy ranges can be modified at any time by other
+ * mappers.  We just continue to display the ranges we
+ * found when we started the vma.
+ */
+static void *nm_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma = v;
+
+	if (!priv->range || priv->eaddr >= vma->vm_end) {
+		/*
+		 * usual case:  no submap or end of vma
+		 * re: '>=' -- in case we got here from nm_start()
+		 * and vma @ pos truncated to < priv->eaddr
+		 */
+		nm_vma_stop(priv);
+		vma = m_next(m, v, pos);
+		nm_vma_start(priv, vma);
+		return vma;
+	}
+
+	/*
+	 * Advance to next range in submap
+	 */
+	priv->saddr = priv->eaddr;
+	if (priv->eaddr == priv->range->saddr) {
+		/*
+		 * just processed a gap in the submap
+		 */
+		priv->eaddr = min(priv->range->eaddr, vma->vm_end);
+		return vma;	/* show the range */
+	}
+
+	++priv->range;
+	if (!priv->range->eaddr)
+		priv->eaddr = vma->vm_end;	/* past end of ranges */
+	else if (priv->saddr < priv->range->saddr)
+		priv->eaddr = priv->range->saddr; /* gap in submap */
+	else
+		priv->eaddr = min(priv->range->eaddr, vma->vm_end);
+
+	return vma;
+}
+
+/*
+ * [Re]start scan for new seq_read().
+ * N.B., much could have changed in mm, as we dropped the mmap_sem
+ * between read()s.  Need to call m_start() to find the vma at pos.
+ */
+static void *nm_start(struct seq_file *m, loff_t *pos)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma;
+
+	if (!priv->range) {
+		/*
+		 * usual case:  1st after open, or finished prev vma
+		 */
+		vma = m_start(m, pos);
+		nm_vma_start(priv, vma);
+		return vma;
+	}
+
+	/*
+	 * Continue with submap of "current" vma.  However, vma could have
+	 * been unmapped, split, truncated, ... between read()s.
+	 * Reset "last_addr" to simulate seek;  find vma by 'pos'.
+	 */
+	m->version = 0;
+	--(*pos);		/* seq_read() incremented it */
+	vma = m_start(m, pos);
+	if (vma != priv->vma)
+		goto new_vma;
+	/*
+	 * Same vma address as where we left off, but could have different
+	 * ranges or could be entirely different vma.
+	 */
+	if (vma->vm_start > priv->eaddr)
+		goto new_vma;	/* starts past last range displayed */
+	if (priv->eaddr < vma->vm_end) {
+		/*
+		 * vma at pos still covers eaddr--where we left off.  Submap
+		 * could have changed, but we'll keep reporting ranges we found
+		 * earlier up to vm_end.
+		 * We hope it is very unlikely that submap changed.
+		 */
+		return nm_next(m, vma, pos);
+	}
+
+	/*
+	 * Already reported past end of vma; find next vma past eaddr
+	 */
+	while (vma && vma->vm_end < priv->eaddr)
+		vma = m_next(m, vma, pos);
+
+new_vma:
+	/*
+	 * new vma at pos;  continue from ~ last eaddr
+	 */
+	nm_vma_stop(priv);
+	nm_vma_start(priv, vma);
+	return vma;
+}
+
+/*
+ * Suspend display of numa_map--e.g., buffer full?
+ */
+static void nm_stop(struct seq_file *m, void *v)
+{
+	struct proc_maps_private *priv = m->private;
+	struct vm_area_struct *vma = v;
+
+	if (!vma || priv->eaddr >= vma->vm_end)
+		nm_vma_stop(priv);
+	/*
+	 * leave state in priv for nm_start(); but drop the
+	 * mmap_sem and unref the mm
+	 */
+	m_stop(m, v);
+}
 
 static const struct seq_operations proc_pid_numa_maps_op = {
-        .start  = m_start,
-        .next   = m_next,
-        .stop   = m_stop,
+        .start  = nm_start,
+        .next   = nm_next,
+        .stop   = nm_stop,
         .show   = show_numa_map,
 };
 
Index: linux-2.6.36-mmotm-101103-1217/Documentation/vm/numa_memory_policy.txt
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/Documentation/vm/numa_memory_policy.txt
+++ linux-2.6.36-mmotm-101103-1217/Documentation/vm/numa_memory_policy.txt
@@ -130,16 +130,16 @@ most general to most specific:
 	task policy, if any, else System Default Policy.
 
 	The shared policy infrastructure supports different policies on subset
-	ranges of the shared object.  However, Linux still splits the VMA of
-	the task that installs the policy for each range of distinct policy.
-	Thus, different tasks that attach to a shared memory object can have
-	different VMA configurations mapping that one shared object.  This
-	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
-	a shared memory region.  When one task has installed shared policy on
-	one or more ranges of the region, the numa_maps of that task will
-	show different policies than the numa_maps of other tasks mapping the
-	shared object.  However, the installed shared policy with be used for
-	all pages allocated for the shared object by any of the attached tasks.
+	ranges of the shared object.  However, before Linux 2.6.XX, the kernel
+	still split the VMA of the task that installed the policy for each
+	range of distinct policy.  Thus, different tasks that attach to a
+	shared memory segment could have different VMA configurations mapping
+	that one shared object.  This was visible by examining the
+	/proc/<pid>/numa_maps of tasks sharing the shared memory region.
+	As of 2.6.XX, Linux no longer splits the VMA that maps a shared object
+	to install a memory policy on a sub-range of the object.  The
+	/proc/<pid>/numa_maps of all tasks sharing a shared memory region now
+	show the same set of memory policy ranges.
 
 	When installing shared policy on a shared object, the virtual address
 	range specified can be viewed as a "direct mapped", linear window onto

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 6/14] Shared Policy: Factor alloc_page_pol routine
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (4 preceding siblings ...)
  2010-11-11 19:12 ` [PATCH/RFC 5/14] Shared Policy: fix show_numa_maps() Lee Schermerhorn
@ 2010-11-11 19:12 ` Lee Schermerhorn
  2010-11-11 19:12 ` [PATCH/RFC 7/14] Shared Policy: use shared policy for page cache allocations Lee Schermerhorn
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:12 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - Factor alloc_page_pol routine

Implement alloc_page_pol() to allocate a page given a policy and
an offset [for interleaving].  No vma nor addr needed.  This
function will be used to allocate page_cache pages--e.g., for tmpfs
files--given the policy at a given page offset, simplifying the
shmem page allocation functions.

Revise alloc_page_vma() to simply call alloc_page_pol() after looking
up the vma policy, to eliminate duplicate code.  This change rippled
into the interleaving functions.  I was able to eliminate
interleave_nid() by computing the offset at the call sites where it
was not already available and calling [modified] offset_il_node()
directly.

	removed vma arg from offset_il_node(), as it wasn't
	used and is not available when called from
	alloc_page_pol().
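
As a rough illustration of the static interleave arithmetic [example
node set and offset are mine, not from the patch]:  for an
MPOL_INTERLEAVE policy over nodes {0,2,3} and page offset 7,

	unsigned nnodes = 3;		/* nodes_weight(pol->v.nodes) */
	unsigned target = 7 % nnodes;	/* = 1 */
	/* walk the nodemask from the lowest set node: 0, 2 -> node 2 */

so consecutive page offsets are spread round-robin over the set nodes,
with no vma required.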

Note:  re: alloc_page_vma() -- can be called w/ vma == NULL via
read_swap_cache_async() from try_to_unuse().  Can't compute a page
offset in this case.  This means that pages read by "swapoff"
don't/can't follow vma policy.  This is current behavior.  Similarly,
swapin readahead reads multiple swap pages, almost certainly associated
with different tasks, with the vma from the first swap page read--the
one that caused the fault.  Again, we can't compute the correct policy
for those pages, and this is current behavior.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/gfp.h |    3 +
 include/linux/mm.h  |   12 ++++-
 mm/hugetlb.c        |   10 ++++
 mm/mempolicy.c      |  107 ++++++++++++++++++++++++++++------------------------
 4 files changed, 81 insertions(+), 51 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/gfp.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/gfp.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/gfp.h
@@ -327,10 +327,13 @@ alloc_pages(gfp_t gfp_mask, unsigned int
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+struct mempolicy;
+extern struct page *alloc_page_pol(gfp_t, struct mempolicy *, pgoff_t);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_pol(gfp_mask, pol, off)  alloc_pages(gfp_mask, 0)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mm.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
@@ -1235,15 +1235,23 @@ extern void setup_per_cpu_pageset(void);
 
 /*
  * Address to offset for policy lookup and interleave calculation.
- * Placed here because it needs struct vma definition.
+ * Placed here because it needs struct vma definition and we
+ * can't easily include mm.h in mempolicy.h, nor can we include
+ * hugetlb.h here.  Thus, the extern below.
  */
 static inline pgoff_t vma_mpol_pgoff(struct vm_area_struct *vma,
 						unsigned long addr)
 {
+	extern pgoff_t vma_huge_mpol_offset(struct vm_area_struct *,
+						unsigned long);
+
+	if (unlikely(vma->vm_flags & VM_HUGETLB))
+		return vma_huge_mpol_offset(vma, addr);
+
 	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 }
 
-static inline pgoff_t vma_mpol_addr(struct vm_area_struct *vma,
+static inline unsigned long vma_mpol_addr(struct vm_area_struct *vma,
 						pgoff_t pgoff)
 {
 	return ((pgoff - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -35,6 +35,7 @@
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
  *
+//TODO:  following needs paragraph rewording.  haven't figured out what to say.
  * The process policy is applied for most non interrupt memory allocations
  * in that process' context. Interrupts ignore the policies and always
  * try to allocate on the local CPU. The VMA policy is only applied for memory
@@ -50,15 +51,15 @@
  * Same with GFP_DMA allocations.
  *
  * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
- * all users and remembered even when nobody has memory mapped.
+ * all users and remembered even when nobody has memory mapped. Shared
+ * policies handle mempolicies on sub-ranges of the object using a
+ * red/black tree.  These policies persist until explicitly removed or
+ * the backing file is destroyed.
  */
 
 /* Notebook:
-   fix mmap readahead to honour policy and enable policy for any page cache
-   object
    statistics for bigpages
-   global policy for page cache? currently it uses process policy. Requires
-   first item above.
+   global policy for page cache?
    handle mremap for shared memory (currently ignored for the policy)
    grows down?
    make bind policy root only? It can trigger oom much faster and the
@@ -1671,9 +1672,10 @@ unsigned slab_node(struct mempolicy *pol
 	}
 }
 
-/* Do static interleaving for a VMA with known offset. */
-static unsigned offset_il_node(struct mempolicy *pol,
-		struct vm_area_struct *vma, unsigned long off)
+/*
+ * Do static interleaving for a policy with known offset.
+ */
+static unsigned offset_il_node(struct mempolicy *pol, pgoff_t off)
 {
 	unsigned nnodes = nodes_weight(pol->v.nodes);
 	unsigned target;
@@ -1691,28 +1693,6 @@ static unsigned offset_il_node(struct me
 	return nid;
 }
 
-/* Determine a node number for interleave */
-static inline unsigned interleave_nid(struct mempolicy *pol,
-		 struct vm_area_struct *vma, unsigned long addr, int shift)
-{
-	if (vma) {
-		unsigned long off;
-
-		/*
-		 * for small pages, there is no difference between
-		 * shift and PAGE_SHIFT, so the bit-shift is safe.
-		 * for huge pages, since vm_pgoff is in units of small
-		 * pages, we need to shift off the always 0 bits to get
-		 * a useful offset.
-		 */
-		BUG_ON(shift < PAGE_SHIFT);
-		off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
-		off += (addr - vma->vm_start) >> shift;
-		return offset_il_node(pol, vma, off);
-	} else
-		return interleave_nodes(pol);
-}
-
 #ifdef CONFIG_HUGETLBFS
 /*
  * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
@@ -1739,8 +1719,9 @@ struct zonelist *huge_zonelist(struct vm
 	*nodemask = NULL;	/* assume !MPOL_BIND */
 
 	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
-		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
-				huge_page_shift(hstate_vma(vma))), gfp_flags);
+		zl = node_zonelist(
+			offset_il_node(*mpol, vma_mpol_pgoff(vma, addr)),
+			gfp_flags);
 	} else {
 		zl = policy_zonelist(gfp_flags, *mpol);
 		if ((*mpol)->mode == MPOL_BIND)
@@ -1859,31 +1840,27 @@ static struct page *alloc_page_interleav
 }
 
 /**
- * 	alloc_page_vma	- Allocate a page for a VMA.
+ * alloc_page_pol() -- allocate a page based on policy,offset.
  *
- * 	@gfp:
+ * @gfp   - gfp mask [flags + zone] for allocation
  *      %GFP_USER    user allocation.
  *      %GFP_KERNEL  kernel allocations,
  *      %GFP_HIGHMEM highmem/user allocations,
  *      %GFP_FS      allocation should not call back into a file system.
  *      %GFP_ATOMIC  don't sleep.
  *
- * 	@vma:  Pointer to VMA or NULL if not available.
- *	@addr: Virtual Address of the allocation. Must be inside the VMA.
+ * @pol   - policy to use for allocation
+ * @pgoff - page offset for interleaving -- used only if interleave policy
  *
- * 	This function allocates a page from the kernel page pool and applies
- *	a NUMA policy associated with the VMA or the current process.
- *	When VMA is not NULL caller must hold down_read on the mmap_sem of the
- *	mm_struct of the VMA to prevent it from going away. Should be used for
- *	all allocations for pages that will be mapped into
- * 	user space. Returns NULL when no page can be allocated.
+ *	This function allocates a page from the kernel page pool and applies
+ *	the NUMA memory policy @pol, possibly indexed by @pgoff.  Should be
+ *	used for all allocations for anonymous pages that will be mapped into
+ *	user space.  Returns NULL when no page can be allocated.
  *
- *	Should be called with the mm_sem of the vma hold.
+ * Note:  extra ref on shared policies dropped on return.
  */
-struct page *
-alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
+struct page *alloc_page_pol(gfp_t gfp, struct mempolicy *pol, pgoff_t pgoff)
 {
-	struct mempolicy *pol = get_vma_policy(current, vma, addr);
 	struct zonelist *zl;
 	struct page *page;
 
@@ -1891,7 +1868,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
 		unsigned nid;
 
-		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
+		nid = offset_il_node(pol, pgoff);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, 0, nid);
 		put_mems_allowed();
@@ -1902,8 +1879,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		/*
 		 * slow path: ref counted shared policy
 		 */
-		struct page *page =  __alloc_pages_nodemask(gfp, 0,
-						zl, policy_nodemask(gfp, pol));
+		struct page *page =  __alloc_pages_nodemask(gfp, 0, zl,
+						policy_nodemask(gfp, pol));
 		__mpol_put(pol);
 		put_mems_allowed();
 		return page;
@@ -1915,6 +1892,38 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	put_mems_allowed();
 	return page;
 }
+EXPORT_SYMBOL(alloc_page_pol);
+
+/**
+ *	alloc_page_vma	- Allocate a page for a VMA.
+ *
+ *	@gfp:
+ *      %GFP_USER    user allocation.
+ *      %GFP_KERNEL  kernel allocations,
+ *      %GFP_HIGHMEM highmem/user allocations,
+ *      %GFP_FS      allocation should not call back into a file system.
+ *      %GFP_ATOMIC  don't sleep.
+ *
+ *	@vma:  Pointer to VMA or NULL if not available.
+ *	@addr: Virtual Address of the allocation. Must be inside the VMA.
+ *
+ *	This function allocates a page from the kernel page pool and applies
+ *	a NUMA policy associated with the VMA or the current process.
+ *	When VMA is not NULL caller must hold down_read on the mmap_sem of the
+ *	mm_struct of the VMA to prevent it from going away. Should be used for
+ *	all allocations for anonymous pages that will be mapped into
+ *	user space. Returns NULL when no page can be allocated.
+ */
+struct page *
+alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol = get_vma_policy(current, vma, addr);
+	pgoff_t pgoff = 0;
+
+	if (likely(vma))
+		pgoff = vma_mpol_pgoff(vma, addr);
+	return alloc_page_pol(gfp, pol, pgoff);
+}
 
 /**
  * 	alloc_pages_current - Allocate pages.
Index: linux-2.6.36-mmotm-101103-1217/mm/hugetlb.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/hugetlb.c
+++ linux-2.6.36-mmotm-101103-1217/mm/hugetlb.c
@@ -230,6 +230,16 @@ pgoff_t linear_hugepage_index(struct vm_
 }
 
 /*
+ * As above, given just vma and address.
+ * For computing huge page offset for interleave mempolicy
+ */
+pgoff_t vma_huge_mpol_offset(struct vm_area_struct *vma,
+					unsigned long address)
+{
+	return vma_hugecache_offset(hstate_vma(vma), vma, address);
+}
+
+/*
  * Return the size of the pages allocated when backing a VMA. In the majority
  * cases this will be same size as used by the page table entries.
  */

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 7/14] Shared Policy: use shared policy for page cache allocations
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (5 preceding siblings ...)
  2010-11-11 19:12 ` [PATCH/RFC 6/14] Shared Policy: Factor alloc_page_pol routine Lee Schermerhorn
@ 2010-11-11 19:12 ` Lee Schermerhorn
  2010-11-11 19:12 ` [PATCH/RFC 8/14] Shared Policy: use alloc_page_pol for swap and shmempages Lee Schermerhorn
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:12 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - use shared policy for page cache allocations

This patch implements a "get_file_policy()" function, analogous
to get_vma_policy(), but for a given file [inode/mapping] at a
specified offset, using the shared_policy, if any, in the
file's address_space.  If there is no shared policy, it returns the
process policy of the argument task [to match get_vma_policy() args]
or the default policy, if there is no process policy.
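
A minimal sketch of that lookup order [my illustration of the described
behavior, not the actual mm/mempolicy.c code, and ignoring the policy
reference counting the real helper must do]:

	struct shared_policy *sp = mapping_shared_policy(mapping);
	struct mempolicy *pol = NULL;

	if (sp)
		pol = mpol_shared_policy_lookup(sp, pgoff); /* file policy */
	if (!pol)
		pol = task->mempolicy;		/* process policy, if any */
	if (!pol)
		pol = &default_policy;		/* system default policy */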

	Note that for a file policy to exist [on other than shmem
	segments] the file must currently be mmap()ed into a task's
	address space with MAP_SHARED, with the policy installed
	via mbind().

	A later patch will hook up the generic file mempolicy
	vm_ops and define a per cpuset control file to enable
	this semantic.  Default will be same as current behavior--
	no policy on shared file mapping.

Details:

Revert [__]page_cache_alloc() to take mapping argument as it once
used to.  I need that to locate the shared policy.  Add pgoff_t
argument.  Fix up page_cache_alloc() and page_cache_alloc_cold()
in pagemap.h and all direct callers of __page_cache_alloc() accordingly.

Modify __page_cache_alloc() to use get_file_policy() and
alloc_page_pol().  Again, without generic file mempolicy, this
behaves the same as alloc_pages_current().

page_cache_alloc*() now take an additional offset/index
argument, available at all call sites, to look up the appropriate
policy and to compute the interleave node for an interleave policy.
The patch fixes all in-tree users of the modified interfaces.

Re: interaction with cpusets page spread:  if the file has a
shared policy structure attached, that policy takes precedence
over spreading.

Now that we have get_file_policy() and alloc_page_pol(), we can eliminate
another case of a pseudo-vma on the stack and use the new infrastructure
to allocate shmem pages.  This will be done in subsequent patches.

Re: ceph fs calls to __page_cache_alloc():  these are the only
calls where an inode/mapping and page offset/index are not
available.  As such, they don't seem to be bona fide page cache
allocations.  So, I've replaced them with direct calls to
alloc_page() as this is what page_cache_alloc() evaluated to
before this series.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/btrfs/compression.c    |    4 +--
 fs/cachefiles/rdwr.c      |    6 +++--
 fs/ntfs/file.c            |    2 -
 fs/splice.c               |    2 -
 include/linux/mempolicy.h |    8 +++++++
 include/linux/pagemap.h   |   18 ++++++++++------
 mm/filemap.c              |   50 +++++++++++++++++++++++++++++++++++-----------
 mm/mempolicy.c            |   25 +++++++++++++++++++++--
 mm/readahead.c            |    2 -
 net/ceph/messenger.c      |    2 -
 net/ceph/pagelist.c       |    4 +--
 net/ceph/pagevec.c        |    2 -
 12 files changed, 94 insertions(+), 31 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/mm/filemap.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/filemap.c
+++ linux-2.6.36-mmotm-101103-1217/mm/filemap.c
@@ -35,6 +35,8 @@
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h> /* for page_is_file_cache() */
 #include <linux/cleancache.h>
+#include <linux/mempolicy.h>
+
 #include "internal.h"
 
 /*
@@ -472,19 +474,43 @@ int add_to_page_cache_lru(struct page *p
 EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+/**
+ * __page_cache_alloc - allocate a page cache page
+ * @mapping - address_space for which page will be allocated
+ * @pgoff   - page index in mapping -- for mem policy
+ * @gfp     - gfp flags
+ *
+ * If the mapping does not contain a shared policy, and page cache spreading
+ * is enabled for the current context's cpuset, allocate a page from the node
+ * indicated by page cache spreading.
+ *
+ * Otherwise, fetch the memory policy at the indicated pgoff and allocate
+ * a page according to that policy.  Note that if the mapping does not
+ * have a shared policy, the allocation will use the task policy, if any,
+ * else the system default policy.
+ *
+ * All allocations will use the specified gfp mask.
+ */
+struct page *__page_cache_alloc(struct address_space *mapping, pgoff_t pgoff,
+					gfp_t gfp)
 {
-	int n;
+	struct mempolicy *pol;
 	struct page *page;
+	int n;
 
-	if (cpuset_do_page_mem_spread()) {
+	/*
+	 * Consider spreading only if no shared_policy
+	 */
+	if (!mapping->spolicy && cpuset_do_page_mem_spread()) {
 		get_mems_allowed();
 		n = cpuset_mem_spread_node();
 		page = alloc_pages_exact_node(n, gfp, 0);
 		put_mems_allowed();
 		return page;
-	}
-	return alloc_pages(gfp, 0);
+	} else
+		pol = get_file_policy(mapping, pgoff);
+
+	return alloc_page_pol(gfp, pol, pgoff);
 }
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
@@ -733,7 +759,7 @@ struct page *find_or_create_page(struct
 repeat:
 	page = find_lock_page(mapping, index);
 	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+		page = __page_cache_alloc(mapping, index, gfp_mask);
 		if (!page)
 			return NULL;
 		/*
@@ -951,7 +977,8 @@ grab_cache_page_nowait(struct address_sp
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+	page = __page_cache_alloc(mapping, index,
+				  mapping_gfp_mask(mapping) & ~__GFP_FS);
 	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
@@ -1182,7 +1209,7 @@ no_cached_page:
 		 * Ok, it wasn't cached, so we need to create a new
 		 * page..
 		 */
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, index);
 		if (!page) {
 			desc->error = -ENOMEM;
 			goto out;
@@ -1440,7 +1467,7 @@ static int page_cache_read(struct file *
 	int ret;
 
 	do {
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, offset);
 		if (!page)
 			return -ENOMEM;
 
@@ -1709,7 +1736,7 @@ static struct page *__read_cache_page(st
 repeat:
 	page = find_get_page(mapping, index);
 	if (!page) {
-		page = __page_cache_alloc(gfp | __GFP_COLD);
+		page = page_cache_alloc_cold(mapping, index);
 		if (!page)
 			return ERR_PTR(-ENOMEM);
 		err = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
@@ -2234,7 +2261,8 @@ repeat:
 	if (likely(page))
 		return page;
 
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
+	page = __page_cache_alloc(mapping, index,
+				mapping_gfp_mask(mapping) & ~gfp_notmask);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
Index: linux-2.6.36-mmotm-101103-1217/include/linux/pagemap.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/pagemap.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/pagemap.h
@@ -201,22 +201,26 @@ static inline void page_unfreeze_refs(st
 }
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(struct address_space *, pgoff_t,
+							gfp_t);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(struct address_space *mapping,
+						pgoff_t off, gfp_t gfp)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp, 0);
 }
 #endif
 
-static inline struct page *page_cache_alloc(struct address_space *x)
+static inline struct page *page_cache_alloc(struct address_space *mapping,
+						pgoff_t off)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping, off, mapping_gfp_mask(mapping));
 }
 
-static inline struct page *page_cache_alloc_cold(struct address_space *x)
+static inline struct page *page_cache_alloc_cold(struct address_space *mapping,
+						pgoff_t off)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+	return __page_cache_alloc(mapping, off, mapping_gfp_mask(mapping) | __GFP_COLD);
 }
 
 typedef int filler_t(void *, struct page *);
Index: linux-2.6.36-mmotm-101103-1217/fs/splice.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/splice.c
+++ linux-2.6.36-mmotm-101103-1217/fs/splice.c
@@ -349,7 +349,7 @@ __generic_file_splice_read(struct file *
 			/*
 			 * page didn't exist, allocate one.
 			 */
-			page = page_cache_alloc_cold(mapping);
+			page = page_cache_alloc_cold(mapping, index);
 			if (!page)
 				break;
 
Index: linux-2.6.36-mmotm-101103-1217/mm/readahead.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/readahead.c
+++ linux-2.6.36-mmotm-101103-1217/mm/readahead.c
@@ -174,7 +174,7 @@ __do_page_cache_readahead(struct address
 		if (page)
 			continue;
 
-		page = page_cache_alloc_cold(mapping);
+		page = page_cache_alloc_cold(mapping, page_offset);
 		if (!page)
 			break;
 		page->index = page_offset;
Index: linux-2.6.36-mmotm-101103-1217/fs/ntfs/file.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/ntfs/file.c
+++ linux-2.6.36-mmotm-101103-1217/fs/ntfs/file.c
@@ -415,7 +415,7 @@ static inline int __ntfs_grab_cache_page
 		pages[nr] = find_lock_page(mapping, index);
 		if (!pages[nr]) {
 			if (!*cached_page) {
-				*cached_page = page_cache_alloc(mapping);
+				*cached_page = page_cache_alloc(mapping, index);
 				if (unlikely(!*cached_page)) {
 					err = -ENOMEM;
 					goto err_out;
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -109,6 +109,8 @@ struct mempolicy {
 	} w;
 };
 
+extern struct mempolicy default_policy;
+
 /*
  * vma memory policy flags
  */
@@ -191,6 +193,7 @@ extern void mpol_rebind_task(struct task
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 extern void mpol_fix_fork_child_flag(struct task_struct *p);
 
+extern struct mempolicy *get_file_policy(struct address_space *, pgoff_t);
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
@@ -321,6 +324,11 @@ static inline bool mempolicy_nodemask_in
 	return false;
 }
 
+static inline struct mempolicy *get_file_policy(struct address_space *mapping, pgoff_t pgoff)
+{
+	return NULL;
+}
+
 static inline int do_migrate_pages(struct mm_struct *mm,
 			const nodemask_t *from_nodes,
 			const nodemask_t *to_nodes, int flags)
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -1534,8 +1534,29 @@ asmlinkage long compat_sys_mbind(compat_
 
 #endif
 
-/*
- * get_vma_policy(@task, @vma, @addr)
+/**
+ * get_file_policy - Return effective policy for @mapping at @pgoff
+ * @mapping - file's address_space that might contain shared policy
+ * @pgoff - page offset into file/object
+ *
+ * Falls back to task or system default policy, as necessary.
+ */
+struct mempolicy *get_file_policy(struct address_space *mapping, pgoff_t pgoff)
+{
+	struct shared_policy *sp = mapping->spolicy;
+	struct mempolicy *pol = NULL;
+
+	if (unlikely(sp))
+		pol = mpol_shared_policy_lookup(sp, pgoff);
+	else if (likely(current))
+		pol = current->mempolicy;
+	if (likely(!pol))
+		pol = &default_policy;
+	return pol;
+}
+
+/**
+ * get_vma_policy
  * @task - task for fallback if vma policy == default
  * @vma   - virtual memory area whose policy is sought
  * @addr  - address in @vma for shared policy lookup
Index: linux-2.6.36-mmotm-101103-1217/fs/cachefiles/rdwr.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/cachefiles/rdwr.c
+++ linux-2.6.36-mmotm-101103-1217/fs/cachefiles/rdwr.c
@@ -258,7 +258,8 @@ static int cachefiles_read_backing_file_
 			goto backing_page_already_present;
 
 		if (!newpage) {
-			newpage = page_cache_alloc_cold(bmapping);
+			newpage = page_cache_alloc_cold(bmapping,
+							netpage->index);
 			if (!newpage)
 				goto nomem_monitor;
 		}
@@ -500,7 +501,8 @@ static int cachefiles_read_backing_file(
 				goto backing_page_already_present;
 
 			if (!newpage) {
-				newpage = page_cache_alloc_cold(bmapping);
+				newpage = page_cache_alloc_cold(bmapping,
+							netpage->index);
 				if (!newpage)
 					goto nomem;
 			}
Index: linux-2.6.36-mmotm-101103-1217/fs/btrfs/compression.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/btrfs/compression.c
+++ linux-2.6.36-mmotm-101103-1217/fs/btrfs/compression.c
@@ -474,8 +474,8 @@ static noinline int add_ra_bio_pages(str
 			goto next;
 		}
 
-		page = __page_cache_alloc(mapping_gfp_mask(mapping) &
-								~__GFP_FS);
+		page = __page_cache_alloc(mapping, page_index,
+					mapping_gfp_mask(mapping) & ~__GFP_FS);
 		if (!page)
 			break;
 
Index: linux-2.6.36-mmotm-101103-1217/net/ceph/messenger.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/net/ceph/messenger.c
+++ linux-2.6.36-mmotm-101103-1217/net/ceph/messenger.c
@@ -2111,7 +2111,7 @@ struct ceph_messenger *ceph_messenger_cr
 
 	/* the zero page is needed if a request is "canceled" while the message
 	 * is being written over the socket */
-	msgr->zero_page = __page_cache_alloc(GFP_KERNEL | __GFP_ZERO);
+	msgr->zero_page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!msgr->zero_page) {
 		kfree(msgr);
 		return ERR_PTR(-ENOMEM);
Index: linux-2.6.36-mmotm-101103-1217/net/ceph/pagelist.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/net/ceph/pagelist.c
+++ linux-2.6.36-mmotm-101103-1217/net/ceph/pagelist.c
@@ -33,7 +33,7 @@ static int ceph_pagelist_addpage(struct
 	struct page *page;
 
 	if (!pl->num_pages_free) {
-		page = __page_cache_alloc(GFP_NOFS);
+		page = alloc_page(GFP_NOFS);
 	} else {
 		page = list_first_entry(&pl->free_list, struct page, lru);
 		list_del(&page->lru);
@@ -85,7 +85,7 @@ int ceph_pagelist_reserve(struct ceph_pa
 	space = (space + PAGE_SIZE - 1) >> PAGE_SHIFT;   /* conv to num pages */
 
 	while (space > pl->num_pages_free) {
-		struct page *page = __page_cache_alloc(GFP_NOFS);
+		struct page *page = alloc_page(GFP_NOFS);
 		if (!page)
 			return -ENOMEM;
 		list_add_tail(&page->lru, &pl->free_list);
Index: linux-2.6.36-mmotm-101103-1217/net/ceph/pagevec.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/net/ceph/pagevec.c
+++ linux-2.6.36-mmotm-101103-1217/net/ceph/pagevec.c
@@ -69,7 +69,7 @@ struct page **ceph_alloc_page_vector(int
 	if (!pages)
 		return ERR_PTR(-ENOMEM);
 	for (i = 0; i < num_pages; i++) {
-		pages[i] = __page_cache_alloc(flags);
+		pages[i] = alloc_page(flags);
 		if (pages[i] == NULL) {
 			ceph_release_page_vector(pages, i);
 			return ERR_PTR(-ENOMEM);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 8/14] Shared Policy: use alloc_page_pol for swap and shmempages
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (6 preceding siblings ...)
  2010-11-11 19:12 ` [PATCH/RFC 7/14] Shared Policy: use shared policy for page cache allocations Lee Schermerhorn
@ 2010-11-11 19:12 ` Lee Schermerhorn
  2010-11-11 19:13 ` [PATCH/RFC 9/14] Shared Policy: per cpuset huge file policy control Lee Schermerhorn
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:12 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - use alloc_page_pol() for shmem
and swap cache allocations

Now that we have the alloc_page_pol() to allocate a page
given a policy, we can use it to "simplify" shmem and swap
cache page allocations.  This eliminates the need for
pseudo-vmas on stack for shmem page allocations and moves us
towards a "policy + offset" model for page cache allocations,
rather than a "vma + address" model.  The vma and address are not
[both] available everywhere we would like to do policy-based page
allocation, whereas the policy and pgoff usually are.  This does
mean, however, that we need to be aware of mempolicy reference
counting in swapin read-ahead.
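
For reference, the "policy + offset" allocation pattern boils down to
something like the sketch below -- mapping, pgoff and gfp stand for
whatever the caller already has in hand:

	struct mempolicy *pol;
	struct page *page;

	pol = get_file_policy(mapping, pgoff);	/* shared, task or default policy */
	page = alloc_page_pol(gfp, pol, pgoff);	/* pgoff picks the interleave node */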

read_swap_cache_async() and swapin_readahead() have been changed
to take a policy and page offset [for interleaving] instead of a
vma and address.  swapin_readahead() passes the policy and pgoff
to read_swap_cache_async() to do the read.  read_swap_cache_async()
now uses alloc_page_pol() with the policy and offset, instead of
alloc_page_vma().

  Note that the pgoff used by swapin_readahead() is essentially
  bogus for all but the first read.  This was already the case
  for the 'address' argument before this patch.  With this patch,
  swapin_readahead() holds pgoff constant to select the same
  node for each readahead page, in the case of interleave
  policy.  This preserves pre-existing behavior.

shmem_swapin() now uses get_file_policy() directly to look up
the shared policy on the shmem pseudo-file which it passes
to swapin_readahead().  swapin_readahead() can call
read_swap_cache_async() multiple times in a loop before the final
tail call.  read_swap_cache_async() itself may loop to retry [in
case of swapin races?].  To avoid multiple "frees" of the shared
policy, swapin_readahead() makes a "conditional unshared" copy
of the policy on stack via mpol_cond_copy().  This releases the
extra ref for a shared policy, and is effectively a no-op for
non-shared policy.  Because the copy is non-shared, alloc_page_pol()
will not attempt to decrement the reference count.

Note that get_vma_policy() becomes an in-kernel global for
use outside of mempolicy.c, like get_file_policy(), to look up
vma-based policy for other callers of swapin_readahead().
Again, use of get_vma_policy() balances reference counts with
mpol_cond_copy() in swapin_readahead().


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/mempolicy.h |    8 ++++++
 include/linux/swap.h      |    6 ++--
 mm/memory.c               |    5 +++
 mm/mempolicy.c            |    2 -
 mm/shmem.c                |   58 +++++++++++++++-------------------------------
 mm/swap_state.c           |   31 +++++++++++++++++-------
 mm/swapfile.c             |   14 ++++++++---
 7 files changed, 68 insertions(+), 56 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/mm/swap_state.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/swap_state.c
+++ linux-2.6.36-mmotm-101103-1217/mm/swap_state.c
@@ -17,6 +17,7 @@
 #include <linux/buffer_head.h>
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
+#include <linux/mempolicy.h>
 #include <linux/migrate.h>
 #include <linux/page_cgroup.h>
 
@@ -275,9 +276,11 @@ struct page * lookup_swap_cache(swp_entr
  * and reading the disk if it is not already cached.
  * A failure return means that either the page allocation failed or that
  * the swap entry is no longer in use.
+ *
+ * This function will drop any incoming conditional reference on @pol.
  */
 struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+			struct mempolicy *pol, pgoff_t pgoff)
 {
 	struct page *found_page, *new_page = NULL;
 	int err;
@@ -296,7 +299,8 @@ struct page *read_swap_cache_async(swp_e
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
-			new_page = alloc_page_vma(gfp_mask, vma, addr);
+			new_page = alloc_page_pol(gfp_mask,
+								pol, pgoff);
 			if (!new_page)
 				break;		/* Out of memory */
 		}
@@ -353,8 +357,8 @@ struct page *read_swap_cache_async(swp_e
  * swapin_readahead - swap in pages in hope we need them soon
  * @entry: swap entry of this memory
  * @gfp_mask: memory allocation flags
- * @vma: user vma this address belongs to
- * @addr: target address for mempolicy
+ * @pol: mempolicy that controls allocation.
+ * @pgoff: page offset for interleave policy
  *
  * Returns the struct page for entry and addr, after queueing swapin.
  *
@@ -369,29 +373,38 @@ struct page *read_swap_cache_async(swp_e
  * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
  */
 struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+				struct mempolicy *pol, pgoff_t pgoff)
 {
-	int nr_pages;
+	struct mempolicy mpol;
 	struct page *page;
 	unsigned long offset;
 	unsigned long end_offset;
+	int nr_pages;
+
+	/*
+	 * make a non-shared copy of pol and release incoming ref, if
+	 * necessary, for read ahead loop and read_swap_cache_async()
+	 * retry loop.
+	 */
+	pol = mpol_cond_copy(&mpol, pol);
 
 	/*
 	 * Get starting offset for readaround, and number of pages to read.
 	 * Adjust starting address by readbehind (for NUMA interleave case)?
 	 * No, it's very unlikely that swap layout would follow vma layout,
 	 * more likely that neighbouring swap pages came from the same node:
-	 * so use the same "addr" to choose the same node for each swap read.
+	 * so use the same "pgoff" to choose the same node for each swap read.
 	 */
 	nr_pages = valid_swaphandles(entry, &offset);
 	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
+
 		/* Ok, do the async read-ahead now */
 		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
-						gfp_mask, vma, addr);
+						gfp_mask, pol, pgoff);
 		if (!page)
 			break;
 		page_cache_release(page);
 	}
 	lru_add_drain();	/* Push any new pages onto the LRU now */
-	return read_swap_cache_async(entry, gfp_mask, vma, addr);
+	return read_swap_cache_async(entry, gfp_mask, pol, pgoff);
 }
Index: linux-2.6.36-mmotm-101103-1217/include/linux/swap.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/swap.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/swap.h
@@ -317,9 +317,9 @@ extern void free_page_and_swap_cache(str
 extern void free_pages_and_swap_cache(struct page **, int);
 extern struct page *lookup_swap_cache(swp_entry_t);
 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr);
+					struct mempolicy *, pgoff_t);
 extern struct page *swapin_readahead(swp_entry_t, gfp_t,
-			struct vm_area_struct *vma, unsigned long addr);
+					struct mempolicy *, pgoff_t);
 
 /* linux/mm/swapfile.c */
 extern long nr_swap_pages;
@@ -427,7 +427,7 @@ static inline void swapcache_free(swp_en
 }
 
 static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
-			struct vm_area_struct *vma, unsigned long addr)
+					struct mempolicy *pol, pgoff_t pgoff)
 {
 	return NULL;
 }
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -1571,7 +1571,7 @@ struct mempolicy *get_file_policy(struct
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct task_struct *task,
+struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol = task->mempolicy;
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -193,6 +193,8 @@ extern void mpol_rebind_task(struct task
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 extern void mpol_fix_fork_child_flag(struct task_struct *p);
 
+extern struct mempolicy *get_vma_policy(struct task_struct *,
+				struct vm_area_struct *, unsigned long);
 extern struct mempolicy *get_file_policy(struct address_space *, pgoff_t);
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
@@ -324,6 +326,12 @@ static inline bool mempolicy_nodemask_in
 	return false;
 }
 
+static inline struct mempolicy *get_vma_policy(struct task_struct *task,
+		struct vm_area_struct *vma, unsigned long addr)
+{
+	return NULL;
+}
+
 static inline struct mempolicy *get_file_policy(struct address_space *mapping, pgoff_t pgoff)
 {
 	return NULL;
Index: linux-2.6.36-mmotm-101103-1217/mm/swapfile.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/swapfile.c
+++ linux-2.6.36-mmotm-101103-1217/mm/swapfile.c
@@ -31,6 +31,7 @@
 #include <linux/syscalls.h>
 #include <linux/memcontrol.h>
 #include <linux/poll.h>
+#include <linux/mempolicy.h>
 
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
@@ -1096,6 +1097,7 @@ static int try_to_unuse(unsigned int typ
 	struct mm_struct *start_mm;
 	unsigned char *swap_map;
 	unsigned char swcount;
+	struct mempolicy *pol;
 	struct page *page;
 	swp_entry_t entry;
 	unsigned int i = 0;
@@ -1132,12 +1134,18 @@ static int try_to_unuse(unsigned int typ
 		/*
 		 * Get a page for the entry, using the existing swap
 		 * cache page if there is one.  Otherwise, get a clean
-		 * page and read the swap into it.
+		 * page and read the swap into it.  Use dummy policy
+		 * [current task's policy or system default] with swap
+		 * cache index for interleaving to allocate new page.
+		 * Note:  read_swap_cache_async() drops reference on policy.
+		 *        need to refetch policy for each call.  Not a
+		 *        performance concern in this loop.
 		 */
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
-		page = read_swap_cache_async(entry,
-					GFP_HIGHUSER_MOVABLE, NULL, 0);
+		pol = get_vma_policy(current, NULL, 0);
+		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
+							 pol, i);
 		if (!page) {
 			/*
 			 * Either swap_duplicate() failed because entry
Index: linux-2.6.36-mmotm-101103-1217/mm/memory.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/memory.c
+++ linux-2.6.36-mmotm-101103-1217/mm/memory.c
@@ -2651,9 +2651,12 @@ static int do_swap_page(struct mm_struct
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
 	page = lookup_swap_cache(entry);
 	if (!page) {
+		struct mempolicy *pol = get_vma_policy(current, vma, address);
+		pgoff_t pgoff = vma_mpol_pgoff(vma, address);
+
 		grab_swap_token(mm); /* Contend for token _before_ read-in */
 		page = swapin_readahead(entry,
-					GFP_HIGHUSER_MOVABLE, vma, address);
+					GFP_HIGHUSER_MOVABLE, pol, pgoff);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -1146,39 +1146,21 @@ static struct mempolicy *shmem_get_sbmpo
 }
 #endif /* CONFIG_TMPFS */
 
-struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp,
-				struct shared_policy *sp, unsigned long idx)
+struct page *shmem_swapin(swp_entry_t entry,
+			struct address_space *mapping, unsigned long idx)
 {
-	struct mempolicy mpol, *spol;
-	struct vm_area_struct pvma;
-	struct page *page;
-
-	spol = mpol_cond_copy(&mpol, mpol_shared_policy_lookup(sp, idx));
-
-	/* Create a pseudo vma that just contains the policy */
-	pvma.vm_start = 0;
-	pvma.vm_pgoff = idx;
-	pvma.vm_file = NULL;
-	pvma.vm_policy = spol;
-	page = swapin_readahead(entry, gfp, &pvma, 0);
-	return page;
+	return swapin_readahead(entry, mapping_gfp_mask(mapping),
+	                                   get_file_policy(mapping, idx), idx);
 }
 
-static struct page *shmem_alloc_page(gfp_t gfp, struct shared_policy *sp,
-				unsigned long idx)
+static inline struct page *shmem_alloc_page(struct address_space *mapping,
+					    unsigned long idx)
 {
-	struct vm_area_struct pvma;
-
-	/* Create a pseudo vma that just contains the policy */
-	pvma.vm_start = 0;
-	pvma.vm_pgoff = idx;
-	pvma.vm_file = NULL;
-	pvma.vm_policy = mpol_shared_policy_lookup(sp, idx);
-
 	/*
-	 * alloc_page_vma() will drop the shared policy reference
+	 * alloc_page_pol() will drop the shared policy reference
 	 */
-	return alloc_page_vma(gfp, &pvma, 0);
+	return alloc_page_pol(mapping_gfp_mask(mapping) | __GFP_ZERO,
+				 get_file_policy(mapping, idx), idx);
 }
 #else /* !CONFIG_NUMA */
 #ifdef CONFIG_TMPFS
@@ -1187,16 +1169,17 @@ static inline void shmem_show_mpol(struc
 }
 #endif /* CONFIG_TMPFS */
 
-static inline struct page *shmem_swapin(swp_entry_t entry, gfp_t gfp, void *sp,
-						unsigned long idx)
+static inline struct page *shmem_swapin(swp_entry_t entry,
+					struct address_space *mapping,
+					unsigned long idx)
 {
-	return swapin_readahead(entry, gfp, NULL, 0);
+	return swapin_readahead(entry, mapping_gfp_mask(mapping), NULL, 0);
 }
 
-static inline struct page *shmem_alloc_page(gfp_t gfp, void *sp,
-						unsigned long idx)
+static inline struct page *shmem_alloc_page(struct address_space *mapping,
+					    unsigned long idx)
 {
-	return alloc_page(gfp);
+	return alloc_page(mapping_gfp_mask(mapping) | __GFP_ZERO);
 }
 #endif /* CONFIG_NUMA */
 
@@ -1259,8 +1242,7 @@ repeat:
 		radix_tree_preload_end();
 		if (sgp != SGP_READ && !prealloc_page) {
 			/* We don't care if this fails */
-			prealloc_page = shmem_alloc_page(gfp,
-					mapping_shared_policy(mapping), idx);
+			prealloc_page = shmem_alloc_page(mapping, idx);
 			if (prealloc_page) {
 				if (mem_cgroup_cache_charge(prealloc_page,
 						current->mm, GFP_KERNEL)) {
@@ -1293,8 +1275,7 @@ repeat:
 				*type |= VM_FAULT_MAJOR;
 			}
 			spin_unlock(&info->lock);
-			swappage = shmem_swapin(swap, gfp,
-					mapping_shared_policy(mapping), idx);
+			swappage = shmem_swapin(swap, mapping, idx);
 			if (!swappage) {
 				spin_lock(&info->lock);
 				entry = shmem_swp_alloc(info, idx, sgp);
@@ -1421,8 +1402,7 @@ repeat:
 
 			if (!prealloc_page) {
 				spin_unlock(&info->lock);
-				filepage = shmem_alloc_page(gfp,
-						mapping_shared_policy(mapping), idx);
+				filepage = shmem_alloc_page(mapping, idx);
 				if (!filepage) {
 					shmem_unacct_blocks(info->flags, 1);
 					shmem_free_blocks(inode, 1);

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 9/14] Shared Policy: per cpuset huge file policy control
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (7 preceding siblings ...)
  2010-11-11 19:12 ` [PATCH/RFC 8/14] Shared Policy: use alloc_page_pol for swap and shmempages Lee Schermerhorn
@ 2010-11-11 19:13 ` Lee Schermerhorn
  2010-11-11 19:13 ` [PATCH/RFC 10/14] Shared Policy: Add hugepage shmem policy vm_ops Lee Schermerhorn
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:13 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - add cpuset control for huge page shared policy

Add a per cpuset "shared_huge_policy" control file to enable
shared hugetlbfs file policy for tasks in the cpuset.  Default
is disabled, resulting in the old behavior--i.e., we continue
to use any task private vma policy, falling back to task or
system default policy if none, on address ranges backed by
shared hugetlbfs file mappings.  The "shared_huge_policy" file
depends on CONFIG_NUMA.

This patch adapts and renames the cpuset_update_task_spread_flag()
function and related mechanisms to update the cpuset-controlled flags
of the tasks in the cpuset when "shared_huge_policy" is changed.

Why a "per cpuset" control?  cpusets are numa-aware task groupings
and memory policy is a numa concept.  Applications that need/want
shared hugetlbfs file policy can be grouped in a cpuset with this
feature enabled, while other applications in other cpusets need not
see this feature.  Alternatively, the behavior may be enabled for
the entire system by setting the control file in the top level cpuset.

	This use of cpusets to control NUMA-related behavior,
	vs. a separate controller, might be worth a side discussion?

The default may be overridden--e.g., changed to enabled--on the kernel
command line using the "shared_huge_policy_default" parameter.
When cpusets are configured, this policy sets the default value
of "shared_huge_policy" for the top cpuset, which is then inherited
by all subsequently created descendant cpusets.  When cpusets are
not configured, this parameter sets the "shared_huge_policy_enabled"
flag for the init process, which is then inherited by all descendant
processes.

A subsequent patch "hooks up" the shared file .{set|get}_policy
vm_ops to install or lookup a shared policy on a memory mapped
hugetlbfs file if the capability has been enabled for the caller's cpuset,
or for the system in the case of no cpusets.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/cpuset.h        |   27 +++++++++++++++++
 include/linux/sched.h         |   20 ++++++++++++
 include/linux/shared_policy.h |    4 ++
 kernel/cpuset.c               |   66 +++++++++++++++++++++++++++++++++++-------
 mm/mempolicy.c                |   26 +++++++++++++++-
 5 files changed, 131 insertions(+), 12 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/sched.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
@@ -1455,6 +1455,7 @@ struct task_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
+	short shared_huge_policy_enabled;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
@@ -1870,6 +1871,25 @@ extern void sched_exec(void);
 extern void sched_clock_idle_sleep_event(void);
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
+#ifdef CONFIG_NUMA
+static inline void set_shared_huge_policy_enabled(struct task_struct *tsk,
+							int val)
+{
+	tsk->shared_huge_policy_enabled = !!val;
+}
+static inline int shared_huge_policy_enabled(struct task_struct *tsk)
+{
+	return tsk->shared_huge_policy_enabled;
+}
+
+#else
+static inline void set_shared_huge_policy_enabled(struct task_struct *tsk, int val) { }
+static inline int shared_huge_policy_enabled(struct task_struct *tsk)
+{
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_HOTPLUG_CPU
 extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
Index: linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/cpuset.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
@@ -132,6 +132,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_SHARED_HUGE_POLICY,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -170,6 +171,11 @@ static inline int is_spread_slab(const s
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_shared_huge_policy(const struct cpuset *cs)
+{
+	return test_bit(CS_SHARED_HUGE_POLICY, &cs->flags);
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 };
@@ -306,11 +312,11 @@ static void guarantee_online_mems(const
 }
 
 /*
- * update task's spread flag if cpuset's page/slab spread flag is set
+ * update task's cpuset-controlled flags to match its cpuset.
  *
  * Called with callback_mutex/cgroup_mutex held
  */
-static void cpuset_update_task_spread_flag(struct cpuset *cs,
+static void cpuset_update_task_cpuset_flags(struct cpuset *cs,
 					struct task_struct *tsk)
 {
 	if (is_spread_page(cs))
@@ -321,6 +327,10 @@ static void cpuset_update_task_spread_fl
 		tsk->flags |= PF_SPREAD_SLAB;
 	else
 		tsk->flags &= ~PF_SPREAD_SLAB;
+	if (is_shared_huge_policy(cs))
+		set_shared_huge_policy_enabled(tsk, 1);
+	else
+		set_shared_huge_policy_enabled(tsk, 0);
 }
 
 /*
@@ -1180,7 +1190,8 @@ static int update_relax_domain_level(str
 }
 
 /*
- * cpuset_change_flag - make a task's spread flags the same as its cpuset's
+ * cpuset_change_flag - make a task's cpuset-controlled flags the same as
+ * its cpuset's
  * @tsk: task to be updated
  * @scan: struct cgroup_scanner containing the cgroup of the task
  *
@@ -1192,12 +1203,12 @@ static int update_relax_domain_level(str
 static void cpuset_change_flag(struct task_struct *tsk,
 				struct cgroup_scanner *scan)
 {
-	cpuset_update_task_spread_flag(cgroup_cs(scan->cg), tsk);
+	cpuset_update_task_cpuset_flags(cgroup_cs(scan->cg), tsk);
 }
 
 /*
- * update_tasks_flags - update the spread flags of tasks in the cpuset.
- * @cs: the cpuset in which each task's spread flags needs to be changed
+ * update_tasks_flags - update the cpuset-controlled flags of tasks in a cpuset.
+ * @cs: the cpuset in which each task's flags needs to be changed
  * @heap: if NULL, defer allocating heap memory to cgroup_scan_tasks()
  *
  * Called with cgroup_mutex held
@@ -1233,7 +1244,7 @@ static int update_flag(cpuset_flagbits_t
 {
 	struct cpuset *trialcs;
 	int balance_flag_changed;
-	int spread_flag_changed;
+	int cpuset_flags_changed;
 	struct ptr_heap heap;
 	int err;
 
@@ -1257,8 +1268,9 @@ static int update_flag(cpuset_flagbits_t
 	balance_flag_changed = (is_sched_load_balance(cs) !=
 				is_sched_load_balance(trialcs));
 
-	spread_flag_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
-			|| (is_spread_page(cs) != is_spread_page(trialcs)));
+	cpuset_flags_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
+			|| (is_spread_page(cs) != is_spread_page(trialcs))
+			|| (is_shared_huge_policy(cs) != is_shared_huge_policy(trialcs)));
 
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs->flags;
@@ -1267,7 +1279,7 @@ static int update_flag(cpuset_flagbits_t
 	if (!cpumask_empty(trialcs->cpus_allowed) && balance_flag_changed)
 		async_rebuild_sched_domains();
 
-	if (spread_flag_changed)
+	if (cpuset_flags_changed)
 		update_tasks_flags(cs, &heap);
 	heap_free(&heap);
 out:
@@ -1428,7 +1440,7 @@ static void cpuset_attach_task(struct ta
 	WARN_ON_ONCE(err);
 
 	cpuset_change_task_nodemask(tsk, to);
-	cpuset_update_task_spread_flag(cs, tsk);
+	cpuset_update_task_cpuset_flags(cs, tsk);
 
 }
 
@@ -1494,6 +1506,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_SHARED_HUGE_POLICY,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1533,6 +1546,9 @@ static int cpuset_write_u64(struct cgrou
 	case FILE_SPREAD_SLAB:
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		break;
+	case FILE_SHARED_HUGE_POLICY:
+		retval = update_flag(CS_SHARED_HUGE_POLICY, cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1697,6 +1713,8 @@ static u64 cpuset_read_u64(struct cgroup
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_SHARED_HUGE_POLICY:
+		return is_shared_huge_policy(cs);
 	default:
 		BUG();
 	}
@@ -1814,6 +1832,13 @@ static struct cftype cft_memory_pressure
 	.private = FILE_MEMORY_PRESSURE_ENABLED,
 };
 
+static struct cftype cft_shared_huge_policy = {
+	.name = "shared_huge_policy",
+	.read_u64 = cpuset_read_u64,
+	.write_u64 = cpuset_write_u64,
+	.private = FILE_SHARED_HUGE_POLICY,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -1821,6 +1846,12 @@ static int cpuset_populate(struct cgroup
 	err = cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
 	if (err)
 		return err;
+	/*
+	 * only if shared file policy configured
+	 */
+	err = add_shared_xxx_policy_file(cont, ss, &cft_shared_huge_policy);
+	if (err < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (!cont->parent)
 		err = cgroup_add_file(cont, ss,
@@ -1895,6 +1926,8 @@ static struct cgroup_subsys_state *cpuse
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
+	if (is_shared_huge_policy(parent))
+		set_bit(CS_SHARED_HUGE_POLICY, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
@@ -1968,6 +2001,17 @@ int __init cpuset_init(void)
 }
 
 /**
+ * cpuset_init_shared_huge_policy - set default value for shared_huge_policy
+ * enablement.
+ */
+
+void __init cpuset_init_shared_huge_policy(int dflt)
+{
+	if (dflt)
+		set_bit(CS_SHARED_HUGE_POLICY, &top_cpuset.flags);
+}
+
+/**
  * cpuset_do_move_task - move a given task to another cpuset
  * @tsk: pointer to task_struct the task to move
  * @scan: struct cgroup_scanner contained in its struct cpuset_hotplug_scanner
Index: linux-2.6.36-mmotm-101103-1217/include/linux/cpuset.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/cpuset.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/cpuset.h
@@ -128,6 +128,28 @@ static inline void set_mems_allowed(node
 	task_unlock(current);
 }
 
+#ifdef CONFIG_NUMA
+static inline int add_shared_xxx_policy_file(struct cgroup *cg,
+						struct cgroup_subsys *ss,
+						struct cftype *cft)
+{
+	return cgroup_add_file(cg, ss, cft);
+}
+
+#else
+/*
+ * don't expose "shared_huge_policy" file if !NUMA
+ */
+static inline int add_shared_xxx_policy_file(struct cgroup *cg,
+						struct cgroup_subsys *ss,
+						struct cftype *cft)
+{
+	return 0;
+}
+#endif
+
+extern void __init cpuset_init_shared_huge_policy(int dflt);
+
 #else /* !CONFIG_CPUSETS */
 
 static inline int cpuset_init(void) { return 0; }
@@ -242,6 +264,11 @@ static inline void put_mems_allowed(void
 {
 }
 
+static inline void cpuset_init_shared_huge_policy(int dflt)
+{
+	set_shared_huge_policy_enabled(current, dflt);
+}
+
 #endif /* !CONFIG_CPUSETS */
 
 #endif /* _LINUX_CPUSET_H */
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -2079,7 +2079,25 @@ int __mpol_equal(struct mempolicy *a, st
 
 /*
  * Shared memory backing store policy support.
- *
+ */
+
+/*
+ * default state of per cpuset shared_huge_policy enablement
+ */
+int shared_huge_policy_default;	/* default:  disabled */
+
+static int __init setup_shared_huge_policy_default(char *str)
+{
+	int ret, val;
+	ret = get_option(&str, &val);
+	if (!ret)
+		return 0;
+	shared_huge_policy_default = !!val;
+	return 1;
+}
+__setup("shared_huge_policy_default=", setup_shared_huge_policy_default);
+
+/*
  * Remember policies even when nobody has shared memory mapped.
  * The policies are kept in Red-Black tree linked from the inode.
  * They are protected by the sp->lock spinlock, which should be held
@@ -2423,6 +2441,12 @@ void __init numa_policy_init(void)
 
 	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
 		printk("numa_policy_init: interleaving failed\n");
+
+	/*
+	 * initialize per cpuset shared huge policy enablement
+	 * from default.
+	 */
+	cpuset_init_shared_huge_policy(shared_huge_policy_default);
 }
 
 /* Reset policy of current process to default */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/shared_policy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
@@ -30,6 +30,8 @@ struct shared_policy {
 	int            nr_sp_nodes;	/* for numa_maps */
 };
 
+extern int shared_huge_policy_default;
+
 extern struct shared_policy *mpol_shared_policy_new(
 					struct address_space *mapping,
 					struct mempolicy *mpol);
@@ -43,6 +45,8 @@ extern struct mempolicy *mpol_shared_pol
 
 struct shared_policy {};
 
+#define shared_huge_policy_default 0
+
 static inline int mpol_set_shared_policy(struct shared_policy *info,
 					pgoff_t pgoff, unsigned long sz,
 					struct mempolicy *new)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 10/14] Shared Policy: Add hugepage shmem policy vm_ops
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (8 preceding siblings ...)
  2010-11-11 19:13 ` [PATCH/RFC 9/14] Shared Policy: per cpuset huge file policy control Lee Schermerhorn
@ 2010-11-11 19:13 ` Lee Schermerhorn
  2010-11-11 19:13 ` [PATCH/RFC 11/14] Shared Policy: fix migration of private mappings Lee Schermerhorn
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:13 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - Add hugepage shmem policy vm_ops

Hugetlb shmem segments have always had a shared policy structure
in their 'info' struct.  With this series, like all file mappings
on a CONFIG_NUMA kernel, the segments' address_space structures
now have a pointer to a dynamically allocated shared_policy
struct.  The shared policy struct will only be allocated when
a shared policy is installed.  However, currently, the hugetlbfs
vm_operations do not support mempolicy set/get ops to install
and lookup shared policies.

This patch hooks up the hugepage shmem segment's
{set|get}_policy vm_ops so that shmem segments created with
the SHM_HUGETLB flag will install policies specified via the
mbind() syscall into the shared policy of the shared segment.
This capability is possible now that hugetlb pages are faulted
in on demand.
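
The mbind() side of this already exists:  the generic code dispatches to
the vma's set_policy op when one is present.  With the extended prototype
used by this series, that dispatch is conceptually [the actual call site
is not part of this patch]:

	if (vma->vm_ops && vma->vm_ops->set_policy)
		err = vma->vm_ops->set_policy(vma, start, end, new_pol);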

Restore the shmem_{set|get}_policy prototypes to mm.h--removed
back ~23-rc1-mm2 :-(.

The shared policy infrastructure maintains memory policies on
"base page size" ranges.  To ensure that policies installed on
a hugetlb shmem segment cover entire huge pages, this patch
enhances do_mbind() to enforce huge page alignment if the policy
range starts within a hugetlb segment.  The enforcement is down
in check_range() because we need the vma to determine whether or
not the range starts in a hugetlb segment.

	We could just silently round the start address down to
	a hugepage alignment.  This would be safe and, some might
	think, convenient for the application programmer, but it
	is inconsistent with the treatment of base page ranges
	which MUST be page aligned.
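
From userspace, the alignment requirement looks like the following
sketch [function and variable names are made up for illustration]:

	#include <stdio.h>
	#include <numaif.h>	/* mbind(), MPOL_BIND */

	/* 'seg' must be the hugepage-aligned start of a SHM_HUGETLB segment */
	static int bind_huge_range(void *seg, unsigned long len,
				   unsigned long *nodemask, unsigned long maxnode)
	{
		/* an unaligned 'seg' now fails with -EINVAL; 'len' is rounded up */
		if (mbind(seg, len, MPOL_BIND, nodemask, maxnode, 0)) {
			perror("mbind");
			return -1;
		}
		return 0;
	}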

Set VMPOL_F_NOSPLIT in hugetlbfs_file_mmap() to prevent splitting
of hugetlbfs vmas when applying mempolicy to a sub-range of the
segment.

This patch depends on the numa_maps fixes and related shared
policy infrastructure clean up earlier in the series to prevent hangs
when displaying [via cat] the numa_maps of a task that has attached a
huge page shmem segment.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/numa_memory_policy.txt |   16 +++++++++-------
 fs/hugetlbfs/inode.c                    |    1 +
 include/linux/mm.h                      |    6 ++++++
 mm/hugetlb.c                            |    4 ++++
 mm/mempolicy.c                          |   19 +++++++++++++++++--
 mm/shmem.c                              |   20 ++++++++++++++++++--
 6 files changed, 55 insertions(+), 11 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/mm/hugetlb.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/hugetlb.c
+++ linux-2.6.36-mmotm-101103-1217/mm/hugetlb.c
@@ -2143,6 +2143,10 @@ const struct vm_operations_struct hugetl
 	.fault = hugetlb_vm_op_fault,
 	.open = hugetlb_vm_op_open,
 	.close = hugetlb_vm_op_close,
+#ifdef CONFIG_NUMA
+	.set_policy	= shmem_set_policy,
+	.get_policy	= shmem_get_policy,
+#endif
 };
 
 static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -581,6 +581,15 @@ check_range(struct mm_struct *mm, unsign
 	first = find_vma(mm, start);
 	if (!first)
 		return ERR_PTR(-EFAULT);
+
+	/*
+	 * need vma for hugetlb check
+	 */
+	if (is_vm_hugetlb_page(first)) {
+		if (start & ~HPAGE_MASK)
+			return ERR_PTR(-EINVAL);
+	}
+
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
@@ -589,8 +598,14 @@ check_range(struct mm_struct *mm, unsign
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
-		if (!is_vm_hugetlb_page(vma) &&
-		    ((flags & MPOL_MF_STRICT) ||
+		if (is_vm_hugetlb_page(vma)) {
+			/*
+			 * round end up to hugepage alignment if
+			 * it falls in a hugetlb vma.
+			 */
+			if (end < vma->vm_end)
+				end = (end + ~HPAGE_MASK) & HPAGE_MASK;
+		} else if (((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
 				vma_migratable(vma)))) {
 			unsigned long endvma = vma->vm_end;
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mm.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mm.h
@@ -748,6 +748,10 @@ extern void show_free_areas(void);
 int shmem_lock(struct file *file, int lock, struct user_struct *user);
 struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags);
 int shmem_zero_setup(struct vm_area_struct *);
+int shmem_set_policy(struct vm_area_struct *vma,
+	unsigned long start, unsigned long end, struct mempolicy *new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
+					unsigned long addr);
 
 #ifndef CONFIG_MMU
 extern unsigned long shmem_get_unmapped_area(struct file *file,
@@ -1251,6 +1255,8 @@ static inline pgoff_t vma_mpol_pgoff(str
 	return ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 }
 
+// TODO:  is this OK for huge pages?  Or do I need inverse of
+// vma_huge_mpol_offset?
 static inline unsigned long vma_mpol_addr(struct vm_area_struct *vma,
 						pgoff_t pgoff)
 {
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -1505,7 +1505,7 @@ static int shmem_fault(struct vm_area_st
 }
 
 #ifdef CONFIG_NUMA
-static int shmem_set_policy(struct vm_area_struct *vma, unsigned long start,
+int shmem_set_policy(struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, struct mempolicy *new)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1520,7 +1520,7 @@ static int shmem_set_policy(struct vm_ar
 					(end - start) >> PAGE_SHIFT, new);
 }
 
-static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
 					  unsigned long addr)
 {
 	struct address_space *mapping = vma->vm_file->f_mapping;
@@ -1530,6 +1530,7 @@ static struct mempolicy *shmem_get_polic
 		return NULL;	/* == default policy */
 	return mpol_shared_policy_lookup(sp, vma_mpol_pgoff(vma, addr));
 }
+#define HAVE_SHMEM_XET_POLICY
 #endif
 
 int shmem_lock(struct file *file, int lock, struct user_struct *user)
@@ -2700,6 +2701,21 @@ out:
 
 #endif /* CONFIG_SHMEM */
 
+#ifndef HAVE_SHMEM_XET_POLICY
+int shmem_set_policy(struct vm_area_struct *vma, unsigned long start,
+				   unsigned long end, struct mempolicy *new)
+{
+	return 0;
+}
+
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
+						 unsigned long addr)
+{
+	return NULL;
+}
+#endif
+
+
 /* common code */
 
 /**
Index: linux-2.6.36-mmotm-101103-1217/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/hugetlbfs/inode.c
+++ linux-2.6.36-mmotm-101103-1217/fs/hugetlbfs/inode.c
@@ -93,6 +93,7 @@ static int hugetlbfs_file_mmap(struct fi
 	 */
 	vma->vm_flags |= VM_HUGETLB | VM_RESERVED;
 	vma->vm_ops = &hugetlb_vm_ops;
+	mpol_set_vma_nosplit(vma);
 
 	if (vma->vm_pgoff & ~(huge_page_mask(h) >> PAGE_SHIFT))
 		return -EINVAL;
Index: linux-2.6.36-mmotm-101103-1217/Documentation/vm/numa_memory_policy.txt
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/Documentation/vm/numa_memory_policy.txt
+++ linux-2.6.36-mmotm-101103-1217/Documentation/vm/numa_memory_policy.txt
@@ -114,13 +114,15 @@ most general to most specific:
     by any task, will obey the shared policy.
 
 	As of 2.6.28, only shared memory segments, created by shmget() or
-	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
-	policy support was added to Linux, the associated data structures were
-	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
-	support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
-	shmem segments were never "hooked up" to the shared policy support.
-	Although hugetlbfs segments now support lazy allocation, their support
-	for shared policy has not been completed.
+	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  Prior to
+	2.6.XX, shared segments backed by huge pages did not support shared
+	policy.  In fact, different tasks could install different policies
+	for the same ranges of a shared huge page segment.  The policy of
+	any given page was determined by which task touched it first--always
+	the case for local allocation.  As of 2.6.XX, Linux supports shared
+	policies on huge page shared segments, just as for regular sized
+	pages.  To preserve existing behavior for applications that might
+	care, this new behavior must be enabled on a per-cpuset basis.
 
 	As mentioned above [re: VMA policies], allocations of page cache
 	pages for regular files mmap()ed with MAP_SHARED ignore any VMA

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 11/14] Shared Policy: fix migration of private mappings
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (9 preceding siblings ...)
  2010-11-11 19:13 ` [PATCH/RFC 10/14] Shared Policy: Add hugepage shmem policy vm_ops Lee Schermerhorn
@ 2010-11-11 19:13 ` Lee Schermerhorn
  2010-11-11 19:13 ` [PATCH/RFC 12/14] Shared Policy: mapped file policy persistence model Lee Schermerhorn
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:13 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - fix migration of private mappings

This patch is in preparation for a subsequent patch to add shared
policy {get|set}_policy ops to generic files.  At that point, we'll
have "memory objects" that can be mapped shared in some tasks and
have shared policy applied, but mapped private in other tasks.
Unlikely, perhaps, but we need to handle it in some fashion.

Now, if we install a vma policy on the private mapping, it
will be ignored for cache pages.  If we specify MPOL_MF_MOVE_ALL
on the vma range, we don't want the private mapping's vma policy
to affect the cache pages--especially when the file has a shared policy.
Rather, we want only to migrate any private, anon copies that the
task has "COWed".  This will preserve existing behavior for private
mappings.

Define a new internal flag--MPOL_MF_MOVE_ANON_ONLY--that we
set in check_range() for private mappings of files with shared
policy.  Then, migrate_page_add() will skip cache [non-anon] pages
when this flag is set.

May also be able to use this flag to force unmapping of
anon pages that may be shared with relatives during automigrate
on internode task migration--e.g., by using:

	MPOL_MF_MOVE_ALL|MPOL_MF_MOVE_ANON_ONLY

But, that's the subject of a different patch series.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 mm/mempolicy.c |   20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -101,6 +101,7 @@
 #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
 #define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
+#define MPOL_MF_MOVE_ANON_ONLY (MPOL_MF_INTERNAL << 3)
 
 static struct kmem_cache *policy_cache;
 static struct kmem_cache *sp_cache;
@@ -609,13 +610,24 @@ check_range(struct mm_struct *mm, unsign
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
 				vma_migratable(vma)))) {
 			unsigned long endvma = vma->vm_end;
+			unsigned long anononly = 0;
 
 			if (endvma > end)
 				endvma = end;
 			if (vma->vm_start > start)
 				start = vma->vm_start;
+
+			/*
+			 * Non-SHARED file mapping with shared policy installed:
+			 * migrate only COWed anon pages as shared pages follow
+			 * the shared policy.
+			 */
+			if (vma->vm_file && !(vma->vm_flags & VM_SHARED) &&
+					vma->vm_file->f_mapping->spolicy)
+				anononly = MPOL_MF_MOVE_ANON_ONLY;
+
 			err = check_pgd_range(vma, start, endvma, nodes,
-						flags, private);
+						flags|anononly, private);
 			if (err) {
 				first = ERR_PTR(err);
 				break;
@@ -977,9 +989,11 @@ static void migrate_page_add(struct page
 				unsigned long flags)
 {
 	/*
-	 * Avoid migrating a page that is shared with others.
+	 * Avoid migrating a file backed page in a private mapping or
+	 * a page that is shared with others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
+	if ((!(flags & MPOL_MF_MOVE_ANON_ONLY) || PageAnon(page)) &&
+		((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)) {
 		if (!isolate_lru_page(page)) {
 			list_add_tail(&page->lru, pagelist);
 			inc_zone_page_state(page, NR_ISOLATED_ANON +

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 12/14] Shared Policy: mapped file policy persistence model
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (10 preceding siblings ...)
  2010-11-11 19:13 ` [PATCH/RFC 11/14] Shared Policy: fix migration of private mappings Lee Schermerhorn
@ 2010-11-11 19:13 ` Lee Schermerhorn
  2010-11-11 19:13 ` [PATCH/RFC 13/14] Shared Policy: per cpuset mapped file policy control Lee Schermerhorn
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:13 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - define mapped file policy persistence model

This patch starts the process of supporting optional shared policy on
shared memory mapped files.

Mapped file policy applies to a range of a linearly memory-mapped file
mmap()ed with the MAP_SHARED flag.  The mapping serves as a linear
window onto the mapped range.  Retain the shared policy until the last
shared mapping is removed, so that cached files do not retain policies
installed by defunct applications.

Use rcu deferred free to close possible race between last shared
mapper removing the shared policy and non-mmap page cache access.
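
Lookups already run under rcu_read_lock() [see the get_file_policy()
hunk below]; the teardown side is expected to pair with that roughly as
in this sketch -- the callback name is illustrative, and the real work
is done inside mpol_free_shared_policy():

	/* last shared unmap:  unhook the policy, then defer the actual free */
	struct shared_policy *sp = mapping->spolicy;

	rcu_assign_pointer(mapping->spolicy, NULL);
	call_rcu(&sp->sp_rcu, sp_free_rcu);	/* illustrative callback name */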

Shmem segments [including SHM_HUGETLB segments] look like shared
mapped files to the shared policy infrastructure.  The policy
persistence model for shmem segments is that once a shared policy
is applied, it remains as long as the segment exists.  To retain this
behavior, define a shared policy persistence flag--SPOL_F_PERSIST--and
set this flag when allocating a shared policy for a shmem segment.

Now, we can push the freeing of any shmem/hugetlbfs persistent shared
policy when the segment is deleted down into the fs-independent inode
cleanup path.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/hugetlbfs/inode.c          |    1 
 fs/inode.c                    |    7 ++++
 include/linux/shared_policy.h |   11 ++++--
 mm/mempolicy.c                |   70 ++++++++++++++++++++++++++++++++----------
 mm/mmap.c                     |   11 ++++++
 mm/shmem.c                    |    5 ---
 6 files changed, 81 insertions(+), 24 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/hugetlbfs/inode.c
+++ linux-2.6.36-mmotm-101103-1217/fs/hugetlbfs/inode.c
@@ -663,7 +663,6 @@ static struct inode *hugetlbfs_alloc_ino
 static void hugetlbfs_destroy_inode(struct inode *inode)
 {
 	hugetlbfs_inc_free_inodes(HUGETLBFS_SB(inode->i_sb));
-	mpol_free_shared_policy(inode->i_mapping);
 	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
 }
 
Index: linux-2.6.36-mmotm-101103-1217/fs/inode.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/inode.c
+++ linux-2.6.36-mmotm-101103-1217/fs/inode.c
@@ -25,6 +25,7 @@
 #include <linux/async.h>
 #include <linux/posix_acl.h>
 #include <linux/ima.h>
+#include <linux/shared_policy.h>
 
 /*
  * This is needed for the following functions:
@@ -305,6 +306,12 @@ void inode_init_once(struct inode *inode
 #ifdef CONFIG_FSNOTIFY
 	INIT_HLIST_HEAD(&inode->i_fsnotify_marks);
 #endif
+	/*
+	 * free any shared policy
+	 */
+	if ((inode->i_mode & S_IFMT) == S_IFREG)
+		mpol_free_shared_policy(inode->i_mapping);
+
 }
 EXPORT_SYMBOL(inode_init_once);
 
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -1516,6 +1516,7 @@ int shmem_set_policy(struct vm_area_stru
 		if (IS_ERR(sp))
 			return PTR_ERR(sp);
 	}
+	sp->sp_flags |= SPOL_F_PERSIST;
 	return mpol_set_shared_policy(sp, vma_mpol_pgoff(vma, start),
 					(end - start) >> PAGE_SHIFT, new);
 }
@@ -2417,10 +2418,6 @@ static struct inode *shmem_alloc_inode(s
 
 static void shmem_destroy_inode(struct inode *inode)
 {
-	if ((inode->i_mode & S_IFMT) == S_IFREG) {
-		/* only struct inode is valid if it's an inline symlink */
-		mpol_free_shared_policy(inode->i_mapping);
-	}
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
 
Index: linux-2.6.36-mmotm-101103-1217/mm/mmap.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mmap.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mmap.c
@@ -198,6 +198,17 @@ static void __remove_shared_vm_struct(st
 	if (vma->vm_flags & VM_SHARED)
 		mapping->i_mmap_writable--;
 
+	if (!mapping->i_mmap_writable) {
+		/*
+		 * shared mmap()ed file policy persistence model:
+		 * remove policy when removing last shared mapping,
+		 * unless marked as persistent--e.g., shmem
+		 */
+		struct shared_policy *sp = mapping_shared_policy(mapping);
+		if (sp && !(sp->sp_flags & SPOL_F_PERSIST))
+			mpol_free_shared_policy(mapping);
+	}
+
 	flush_dcache_mmap_lock(mapping);
 	if (unlikely(vma->vm_flags & VM_NONLINEAR))
 		list_del_init(&vma->shared.vm_set.list);
Index: linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/shared_policy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
@@ -4,6 +4,7 @@
 #include <linux/fs.h>
 #include <linux/spinlock.h>
 #include <linux/rbtree.h>
+#include <linux/rcupdate.h>
 
 /*
  * Tree of shared policies for a shared memory regions and memory
@@ -25,11 +26,15 @@ struct sp_node {
 };
 
 struct shared_policy {
-	struct rb_root root;
-	spinlock_t     lock;		/* protects rb tree */
-	int            nr_sp_nodes;	/* for numa_maps */
+	struct rb_root  root;
+	spinlock_t      lock;		/* protects rb tree, nr_sp_nodes */
+	int             nr_sp_nodes;	/* for numa_maps */
+	int             sp_flags;	/* persistence, ... */
+	struct rcu_head sp_rcu;		/* deferred reclaim */
 };
 
+#define SPOL_F_PERSIST	0x01		/* for shmem use */
+
 extern int shared_file_policy_default;
 
 extern struct shared_policy *mpol_shared_policy_new(
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -1572,13 +1572,17 @@ asmlinkage long compat_sys_mbind(compat_
  */
 struct mempolicy *get_file_policy(struct address_space *mapping, pgoff_t pgoff)
 {
-	struct shared_policy *sp = mapping->spolicy;
+	struct shared_policy *sp;
 	struct mempolicy *pol = NULL;
 
+	rcu_read_lock();
+	sp = rcu_dereference(mapping->spolicy);
 	if (unlikely(sp))
 		pol = mpol_shared_policy_lookup(sp, pgoff);
 	else if (likely(current))
 		pol = current->mempolicy;
+	rcu_read_unlock();
+
 	if (likely(!pol))
 		pol = &default_policy;
 	return pol;
@@ -2291,6 +2295,10 @@ restart:
  * On entry, the current task has a reference on a non-NULL @mpol.
  * This must be released on exit.
  * This is called at get_inode() calls and we can use GFP_KERNEL.
+ *
+ * Locking:  mapping->spolicy stabilized by current->mm->mmap_sem.
+ * Can't remove last shared mapping while we hold the sem; can't
+ * remove inode/shared policy while inode is mmap()ed shared.
  */
 struct shared_policy *mpol_shared_policy_new(struct address_space *mapping,
 						struct mempolicy *mpol)
@@ -2349,9 +2357,10 @@ put_free:
 	 */
 	spin_lock(&mapping->i_mmap_lock);
 	spx = mapping->spolicy;
-	if (!spx && !err)
-		mapping->spolicy = spx = sp;
-	else
+	if (!spx && !err) {
+		spx = sp;
+		rcu_assign_pointer(mapping->spolicy, sp);
+	} else
 		err = !0;
 	spin_unlock(&mapping->i_mmap_lock);
 	if (err)
@@ -2367,6 +2376,9 @@ put_free:
  * @sz:  size of range [bytes] to which mempolicy applies
  * @mpol:  the mempolicy to install
  *
+ * Locking:  mapping->spolicy stabilized by current->mm->mmap_sem.
+ * Can't remove last shared mapping while we hold the sem; can't
+ * remove inode/shared policy while inode is mmap()ed shared.
  */
 int mpol_set_shared_policy(struct shared_policy *sp,
 				pgoff_t pgoff, unsigned long sz,
@@ -2394,37 +2406,63 @@ int mpol_set_shared_policy(struct shared
 
 /**
  * mpol_free_shared_policy() - Free a backing policy store on inode delete.
- * @mapping - address_space struct containing pointer to shared policy to be freed.
+ * @mapping - address_space struct containing pointer to shared policy to be
+ * freed.
  *
  * Frees the shared policy red-black tree, if any, before freeing the
  * shared policy struct itself, if any.
+
+ * Locking:  only free shared policy on inode deletion [shmem] or
+ * removal of last shared mmap()ing.  Can only delete inode when no
+ * more references.  Removal of last shared mmap()ing protected by
+ * mmap_sem [and mapping->i_mmap_lock].  Still a potential race with
+ * shared policy lookups from page cache on behalf of file descriptor
+ * access to pages.  Use deferred RCU to protect readers [in get_file_policy()]
+ * from shared policy free on removal of last shared mmap()ing.
  */
-void mpol_free_shared_policy(struct address_space *mapping)
+static void __mpol_free_shared_policy(struct rcu_head *rhp)
 {
-	struct shared_policy *sp = mapping->spolicy;
-	struct sp_node *n;
+	struct shared_policy *sp = container_of(rhp, struct shared_policy,
+						sp_rcu);
 	struct rb_node *next;
 
-	if (!sp)
-  		return;
-
-	mapping->spolicy = NULL;
-
+	/*
+	 * Now, we can safely tear down the shared policy tree, if any
+	 */
 	if (sp->root.rb_node) {
-		spin_lock(&sp->lock);
 		next = rb_first(&sp->root);
 		while (next) {
-			n = rb_entry(next, struct sp_node, nd);
+			struct sp_node *n = rb_entry(next, struct sp_node, nd);
 			next = rb_next(&n->nd);
 			rb_erase(&n->nd, &sp->root);
 			mpol_put(n->policy);
 			kmem_cache_free(sn_cache, n);
 		}
-		spin_unlock(&sp->lock);
 	}
 	kmem_cache_free(sp_cache, sp);
 }
 
+void mpol_free_shared_policy(struct address_space *mapping)
+{
+	struct shared_policy *sp = mapping->spolicy;
+
+	if (!sp)
+		return;
+
+	rcu_assign_pointer(mapping->spolicy, NULL);
+
+	/*
+	 * Presence of 'PERSIST flag means we're freeing the
+	 * shared policy in the inode destruction path.  No
+	 * need for RCU synchronization.
+	 */
+	if (sp->sp_flags & SPOL_F_PERSIST)
+		__mpol_free_shared_policy(&sp->sp_rcu);
+	else
+		call_rcu(&sp->sp_rcu, __mpol_free_shared_policy);
+
+}
+
 /* assumes fs == KERNEL_DS */
 void __init numa_policy_init(void)
 {

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 13/14] Shared Policy: per cpuset mapped file policy control
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (11 preceding siblings ...)
  2010-11-11 19:13 ` [PATCH/RFC 12/14] Shared Policy: mapped file policy persistence model Lee Schermerhorn
@ 2010-11-11 19:13 ` Lee Schermerhorn
  2010-11-11 19:13 ` [PATCH/RFC 14/14] Shared Policy: add generic file set/get policy vm ops Lee Schermerhorn
  2010-11-11 19:54 ` [PATCH/RFC 0/14] Shared Policy Overview Andi Kleen
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:13 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - per cpuset mapped file policy control

As with shared hugetlbfs file mappings, add a per cpuset
"shared_file_policy" control file to enable shared file policy
for tasks in the cpuset.  The default is disabled, preserving the
old behavior--i.e., we continue to ignore mbind() on
address ranges backed by shared regular file mappings.  The
"shared_file_policy" file depends on CONFIG_NUMA.

Why "per cpuset"?  cpusets are numa-aware task groupings and
memory policy is a numa concept.  Applications that need/want
shared file policy can be grouped in a cpuset with this feature
enabled, while other applications in other cpusets need not see
this feature.

The default may be overridden--e.g., to enabled--on the kernel
command line using the "shared_file_policy_default" parameter.
When cpusets are configured, this parameter sets the default value
of "shared_file_policy" for the top cpuset, which is then inherited
by all subsequently created descendant cpusets.  When cpusets are
not configured, this parameter sets the "shared_file_policy_enabled"
flag for the init process, which is then inherited by all descendant
processes.
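
For illustration only [not part of this patch]:  assuming the cpuset
filesystem is mounted at /dev/cpuset and the application's tasks live
in a cpuset named "numa_apps" [both names are placeholders], the
control can be flipped from user space as sketched below; booting with
"shared_file_policy_default=1" instead changes the inherited default
for the top cpuset.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* placeholder mount point and cpuset name */
            int fd = open("/dev/cpuset/numa_apps/shared_file_policy", O_WRONLY);

            if (fd < 0 || write(fd, "1", 1) != 1) {
                    perror("shared_file_policy");
                    return 1;
            }
            close(fd);
            return 0;
    }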

A subsequent patch will "hook up" generic file .{set|get}_policy
vm_ops to install a shared policy on a memory mapped file
if the capability has been enabled for the caller's cpuset, or
for the system in the case of no cpusets.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
 
 include/linux/cpuset.h        |    6 ++++++
 include/linux/sched.h         |   19 ++++++++++++++++++-
 include/linux/shared_policy.h |    2 ++
 kernel/cpuset.c               |   42 +++++++++++++++++++++++++++++++++++++++++-
 mm/mempolicy.c                |   20 +++++++++++++++++++-
 5 files changed, 86 insertions(+), 3 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/sched.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
@@ -1455,7 +1455,8 @@ struct task_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *mempolicy;	/* Protected by alloc_lock */
 	short il_next;
-	short shared_huge_policy_enabled;
+	short shared_huge_policy_enabled:1;
+	short shared_file_policy_enabled:1;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
@@ -1882,12 +1883,28 @@ static inline int shared_huge_policy_ena
 	return tsk->shared_huge_policy_enabled;
 }
 
+static inline void set_shared_file_policy_enabled(struct task_struct *tsk,
+							int val)
+{
+	tsk->shared_file_policy_enabled = !!val;
+}
+static inline int shared_file_policy_enabled(struct task_struct *tsk)
+{
+	return tsk->shared_file_policy_enabled;
+}
+
 #else
 static void set_shared_huge_policy_enabled(struct task_struct *tsk, int val) { }
 static int shared_huge_policy_enabled(struct task_struct *tsk)
 {
 	return 0;
 }
+
+static void set_shared_file_policy_enabled(struct task_struct *tsk, int val) { }
+static int shared_file_policy_enabled(struct task_struct *tsk)
+{
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_HOTPLUG_CPU
Index: linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/cpuset.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
@@ -133,6 +133,7 @@ typedef enum {
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
 	CS_SHARED_HUGE_POLICY,
+ 	CS_SHARED_FILE_POLICY,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -176,6 +177,11 @@ static inline int is_shared_huge_policy(
 	return test_bit(CS_SHARED_HUGE_POLICY, &cs->flags);
 }
 
+static inline int is_shared_file_policy(const struct cpuset *cs)
+{
+	return test_bit(CS_SHARED_FILE_POLICY, &cs->flags);
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 };
@@ -331,6 +337,10 @@ static void cpuset_update_task_cpuset_fl
 		set_shared_huge_policy_enabled(tsk, 1);
 	else
 		set_shared_huge_policy_enabled(tsk, 0);
+	if (is_shared_file_policy(cs))
+		set_shared_file_policy_enabled(tsk, 1);
+	else
+		set_shared_file_policy_enabled(tsk, 0);
 }
 
 /*
@@ -1270,7 +1280,8 @@ static int update_flag(cpuset_flagbits_t
 
 	cpuset_flags_changed = ((is_spread_slab(cs) != is_spread_slab(trialcs))
 			|| (is_spread_page(cs) != is_spread_page(trialcs))
-			|| (is_shared_huge_policy(cs) != is_shared_huge_policy(trialcs)));
+			|| (is_shared_huge_policy(cs) != is_shared_huge_policy(trialcs))
+			|| (is_shared_file_policy(cs) != is_shared_file_policy(trialcs)));
 
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs->flags;
@@ -1507,6 +1518,7 @@ typedef enum {
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
 	FILE_SHARED_HUGE_POLICY,
+	FILE_SHARED_FILE_POLICY,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1549,6 +1561,9 @@ static int cpuset_write_u64(struct cgrou
 	case FILE_SHARED_HUGE_POLICY:
 		retval = update_flag(CS_SHARED_HUGE_POLICY, cs, val);
 		break;
+	case FILE_SHARED_FILE_POLICY:
+		retval = update_flag(CS_SHARED_FILE_POLICY, cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1715,6 +1730,8 @@ static u64 cpuset_read_u64(struct cgroup
 		return is_spread_slab(cs);
 	case FILE_SHARED_HUGE_POLICY:
 		return is_shared_huge_policy(cs);
+	case FILE_SHARED_FILE_POLICY:
+		return is_shared_file_policy(cs);
 	default:
 		BUG();
 	}
@@ -1839,6 +1856,13 @@ static struct cftype cft_shared_huge_pol
 	.private = FILE_SHARED_HUGE_POLICY,
 };
 
+static struct cftype cft_shared_file_policy = {
+	.name = "shared_file_policy",
+	.read_u64 = cpuset_read_u64,
+	.write_u64 = cpuset_write_u64,
+	.private = FILE_SHARED_FILE_POLICY,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -1852,6 +1876,9 @@ static int cpuset_populate(struct cgroup
 	err = add_shared_xxx_policy_file(cont, ss, &cft_shared_huge_policy);
 	if (err < 0)
 		return err;
+	err = add_shared_xxx_policy_file(cont, ss, &cft_shared_file_policy);
+	if (err < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (!cont->parent)
 		err = cgroup_add_file(cont, ss,
@@ -1928,6 +1955,8 @@ static struct cgroup_subsys_state *cpuse
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
 	if (is_shared_huge_policy(parent))
 		set_bit(CS_SHARED_HUGE_POLICY, &cs->flags);
+	if (is_shared_file_policy(parent))
+		set_bit(CS_SHARED_FILE_POLICY, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
@@ -2012,6 +2041,17 @@ void __init cpuset_init_shared_huge_poli
 }
 
 /**
+ * cpuset_init_shared_file_policy - set default value for shared_file_policy
+ * enablement.
+ */
+
+void __init cpuset_init_shared_file_policy(int dflt)
+{
+	if (dflt)
+		set_bit(CS_SHARED_FILE_POLICY, &top_cpuset.flags);
+}
+
+/**
  * cpuset_do_move_task - move a given task to another cpuset
  * @tsk: pointer to task_struct the task to move
  * @scan: struct cgroup_scanner contained in its struct cpuset_hotplug_scanner
Index: linux-2.6.36-mmotm-101103-1217/include/linux/cpuset.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/cpuset.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/cpuset.h
@@ -149,6 +149,7 @@ static inline int add_shared_xxx_policy_
 #endif
 
 extern void __init cpuset_init_shared_huge_policy(int dflt);
+extern void __init cpuset_init_shared_file_policy(int dflt);
 
 #else /* !CONFIG_CPUSETS */
 
@@ -268,6 +269,11 @@ static inline void cpuset_init_shared_hu
 {
 	current->shared_file_policy_enabled = dflt;
 }
+
+static inline void cpuset_init_shared_file_policy(int dflt)
+{
+	current->shared_file_policy_enabled = dflt;
+}
 
 #endif /* !CONFIG_CPUSETS */
 
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -2131,6 +2131,23 @@ static int __init setup_shared_huge_poli
 __setup("shared_huge_policy_default=", setup_shared_huge_policy_default);
 
 /*
+ * default state of per cpuset shared_file_policy_enablement
+ */
+int shared_file_policy_default;
+
+static int __init setup_shared_file_policy_default(char *str)
+{
+	int ret, val;
+	ret = get_option(&str, &val);
+	if (!ret)
+		return 0;
+	shared_file_policy_default = !!val;
+	return 1;
+}
+__setup("shared_file_policy_default=", setup_shared_file_policy_default);
+
+
+/*
  * Remember policies even when nobody has shared memory mapped.
  * The policies are kept in Red-Black tree linked from the inode.
  * They are protected by the sp->lock spinlock, which should be held
@@ -2510,10 +2527,11 @@ void __init numa_policy_init(void)
 		printk("numa_policy_init: interleaving failed\n");
 
 	/*
-	 * initialize per cpuset shared huge policy enablement
+	 * initialize per cpuset shared [huge] file policy enablement
 	 * from default.
 	 */
 	cpuset_init_shared_huge_policy(shared_huge_policy_default);
+	cpuset_init_shared_file_policy(shared_file_policy_default);
 }
 
 /* Reset policy of current process to default */
Index: linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/shared_policy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/shared_policy.h
@@ -36,6 +36,7 @@ struct shared_policy {
 #define SPOL_F_PERSIST	0x01		/* for shmem use */
 
 extern int shared_file_policy_default;
+extern int shared_file_policy_default;
 
 extern struct shared_policy *mpol_shared_policy_new(
 					struct address_space *mapping,
@@ -51,6 +52,7 @@ extern struct mempolicy *mpol_shared_pol
 struct shared_policy {};
 
 #define shared_file_policy_default 0
+#define shared_file_policy_default 0
 
 static inline int mpol_set_shared_policy(struct shared_policy *info,
 					pgoff_t pgoff, unsigned long sz,

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH/RFC 14/14] Shared Policy: add generic file set/get policy vm ops
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (12 preceding siblings ...)
  2010-11-11 19:13 ` [PATCH/RFC 13/14] Shared Policy: per cpuset mapped file policy control Lee Schermerhorn
@ 2010-11-11 19:13 ` Lee Schermerhorn
  2010-11-11 19:54 ` [PATCH/RFC 0/14] Shared Policy Overview Andi Kleen
  14 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:13 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

Shared Policy Infrastructure - add generic file set/get policy vm ops

Add set/get policy vm ops to generic_file_vm_ops in support of
mmap()ed file memory policies.  This patch effectively "hooks up"
shared file mappings to the NUMA shared policy infrastructure.
However, a task will only use shared file policy if the capability has
been enabled by the task's cpuset's "shared_file_policy" control file.
The default is disabled--i.e., existing behavior is preserved.

To ensure that applications are not surprised by unrelated
applications applying shared policy to their files, allow only the
owner of a file [or a suitably privileged task] to apply shared policy.
Note that we could make this enforcement conditional on a per-cpuset
"shared_policy_enforce_ownership" file.

NOTE:  may be able to unify with shmem_{get|set}_policy.

Updated numa_memory_policy.txt to document this behavior.
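
For illustration only [not part of this patch], a minimal user-space
sketch of the intended usage:  it assumes the caller owns the file,
that its cpuset has "shared_file_policy" enabled, and that the path,
length and node mask below are placeholders.

    #include <fcntl.h>
    #include <numaif.h>         /* mbind(), MPOL_INTERLEAVE; link with -lnuma */
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 16UL << 20;                /* first 16MB of the file */
            unsigned long nodemask = 0x3;           /* nodes 0 and 1: illustrative */
            void *addr;
            int fd = open("/data/shared.dat", O_RDWR);      /* placeholder path */

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /*
             * With this series, and the capability enabled for the caller's
             * cpuset, mbind() on the MAP_SHARED range installs a shared
             * interleave policy on the file itself rather than being ignored.
             */
            if (mbind(addr, len, MPOL_INTERLEAVE, &nodemask,
                      sizeof(nodemask) * 8, 0))
                    perror("mbind");

            munmap(addr, len);
            close(fd);
            return 0;
    }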

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/numa_memory_policy.txt |   47 ++++++++++++++++++-----------
 mm/filemap.c                            |   51 ++++++++++++++++++++++++++++++++
 mm/mempolicy.c                          |   35 +++++++++++++++++----
 mm/shmem.c                              |    4 +-
 4 files changed, 110 insertions(+), 27 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/mm/filemap.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/filemap.c
+++ linux-2.6.36-mmotm-101103-1217/mm/filemap.c
@@ -513,6 +513,47 @@ struct page *__page_cache_alloc(struct a
 	return alloc_page_pol(gfp, pol, pgoff);
 }
 EXPORT_SYMBOL(__page_cache_alloc);
+
+static int generic_file_set_policy(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end,
+			struct mempolicy *new)
+{
+	struct address_space *mapping;
+	struct shared_policy *sp;
+	int ret;
+
+	mapping = vma->vm_file->f_mapping;
+
+	/*
+	 * Only owner or privileged task can set shared policy on shared
+	 * regular file mappings.
+	 */
+	if (!is_owner_or_cap(mapping->host))
+		return -EPERM;
+
+	sp = mapping->spolicy;
+	if (!sp) {
+		sp = mpol_shared_policy_new(mapping, NULL);
+		if (IS_ERR(sp))
+			return PTR_ERR(sp);
+	}
+
+	ret = mpol_set_shared_policy(sp, vma_mpol_pgoff(vma, start),
+					 (end - start) >> PAGE_SHIFT, new);
+	if (!ret)
+		mpol_set_vma_nosplit(vma);
+	return ret;
+}
+
+static struct mempolicy *
+generic_file_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct shared_policy *sp = vma->vm_file->f_mapping->spolicy;
+	if (!sp)
+		return NULL;
+
+	return mpol_shared_policy_lookup(sp, vma_mpol_pgoff(vma, addr));
+}
 #endif
 
 static int __sleep_on_page_lock(void *word)
@@ -1686,6 +1727,10 @@ EXPORT_SYMBOL(filemap_fault);
 
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+#ifdef CONFIG_NUMA
+	.set_policy     = generic_file_set_policy,
+	.get_policy     = generic_file_get_policy,
+#endif
 };
 
 /* This is used for a general mmap of a disk file */
@@ -1699,6 +1744,12 @@ int generic_file_mmap(struct file * file
 	file_accessed(file);
 	vma->vm_ops = &generic_file_vm_ops;
 	vma->vm_flags |= VM_CAN_NONLINEAR;
+
+	/*
+	 * shared policies and non-linear mappings are mutually exclusive
+	 */
+	if ((vma->vm_flags & VM_SHARED) && mapping_shared_policy(mapping))
+		mpol_set_vma_nosplit(vma);
 	return 0;
 }
 
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -638,18 +638,38 @@ check_range(struct mm_struct *mm, unsign
 	return first;
 }
 
+/*
+ * helper functions for deciding whether to split vmas on set_policy
+ * or to use the vma policy op for set/get.  Note that we only get
+ * into these if the vma represents a shared, linear, file mapping,
+ * including shmem.
+ */
 static bool vma_is_shared_linear(struct vm_area_struct *vma)
 {
 	return ((vma->vm_flags & (VM_SHARED|VM_NONLINEAR)) == VM_SHARED);
 }
 
-static bool mpol_nosplit_vma(struct vm_area_struct *vma)
+static bool has_set_policy_op(struct vm_area_struct *vma)
+{
+	return (vma->vm_ops && vma->vm_ops->set_policy);
+}
+
+/*
+ * We don't split vmas on set_policy if VMPOL_F_NOSPLIT is set or we have
+ * a shared, linear mapping, AND a set_policy() vm_op.  VMPOL_F_NOSPLIT
+ * will be set for shmem segments and files mmap()ed SHARED if a shared
+ * policy has previously been applied to this file.
+ */
+static int mpol_nosplit_vma(struct vm_area_struct *vma)
 {
 	if (vma->vm_mpol_flags & VMPOL_F_NOSPLIT)
 		return true;
 
-	if (vma_is_shared_linear(vma)  &&
-	    vma->vm_ops && vma->vm_ops->set_policy) {
+	if (vma_is_shared_linear(vma) && has_set_policy_op(vma) &&
+			shared_file_policy_enabled(current)) {
+		/*
+		 * short circuit future queries.
+		 */
 		vma->vm_mpol_flags |= VMPOL_F_NOSPLIT;
 		return true;
 	}
@@ -701,7 +721,8 @@ static int policy_vma(struct vm_area_str
 
 /* Step 2: apply policy to a range and do splits. */
 static int mbind_range(struct mm_struct *mm, unsigned long start,
-		       unsigned long end, struct mempolicy *new_pol)
+		       unsigned long end, struct mempolicy *new_pol,
+		       unsigned long flags)
 {
 	struct vm_area_struct *next;
 	struct vm_area_struct *prev;
@@ -925,7 +946,7 @@ static long do_get_mempolicy(int *policy
 			up_read(&mm->mmap_sem);
 			return -EFAULT;
 		}
-		if (vma->vm_ops && vma->vm_ops->get_policy)
+		if (mpol_use_get_op(vma))
 			pol = vma->vm_ops->get_policy(vma, addr);
 		else
 			pol = vma->vm_policy;
@@ -1244,7 +1265,7 @@ static long do_mbind(unsigned long start
 	if (!IS_ERR(vma)) {
 		int nr_failed = 0;
 
-		err = mbind_range(mm, start, end, new);
+		err = mbind_range(mm, start, end, new, flags);
 
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
@@ -1611,7 +1632,7 @@ struct mempolicy *get_vma_policy(struct
 
 	if (vma) {
 		/*
-		 * use get_policy op, if any, for shared mappings
+		 * use get_policy op, if applicable, for shared mappings
 		 */
 		if (mpol_use_get_op(vma)) {
 			struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
Index: linux-2.6.36-mmotm-101103-1217/mm/shmem.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/shmem.c
+++ linux-2.6.36-mmotm-101103-1217/mm/shmem.c
@@ -219,7 +219,7 @@ static const struct file_operations shme
 static const struct inode_operations shmem_inode_operations;
 static const struct inode_operations shmem_dir_inode_operations;
 static const struct inode_operations shmem_special_inode_operations;
-static const struct vm_operations_struct shmem_vm_ops;
+const struct vm_operations_struct shmem_vm_ops;
 
 static struct backing_dev_info shmem_backing_dev_info  __read_mostly = {
 	.ra_pages	= 0,	/* No readahead */
@@ -2526,7 +2526,7 @@ static const struct super_operations shm
 	.put_super	= shmem_put_super,
 };
 
-static const struct vm_operations_struct shmem_vm_ops = {
+const struct vm_operations_struct shmem_vm_ops = {
 	.fault		= shmem_fault,
 #ifdef CONFIG_NUMA
 	.set_policy     = shmem_set_policy,
Index: linux-2.6.36-mmotm-101103-1217/Documentation/vm/numa_memory_policy.txt
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/Documentation/vm/numa_memory_policy.txt
+++ linux-2.6.36-mmotm-101103-1217/Documentation/vm/numa_memory_policy.txt
@@ -78,11 +78,10 @@ most general to most specific:
 	VMA policy applies ONLY to anonymous pages.  These include pages
 	allocated for anonymous segments, such as the task stack and heap, and
 	any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
-	If a VMA policy is applied to a file mapping, it will be ignored if
-	the mapping used the MAP_SHARED flag.  If the file mapping used the
-	MAP_PRIVATE flag, the VMA policy will only be applied when an
-	anonymous page is allocated on an attempt to write to the mapping--
-	i.e., at Copy-On-Write.
+	If a VMA policy is applied to a file mapping mapped with the
+	MAP_PRIVATE flag, the VMA policy will only be applied when an anonymous
+	page is allocated on an attempt to write to the mapping--i.e., at
+	Copy-On-Write.
 
 	VMA policies are shared between all tasks that share a virtual address
 	space--a.k.a. threads--independent of when the policy is installed; and
@@ -107,11 +106,16 @@ most general to most specific:
     mapped shared into one or more tasks' distinct address spaces.  An
     application installs a shared policies the same way as VMA policies--using
     the mbind() system call specifying a range of virtual addresses that map
-    the shared object.  However, unlike VMA policies, which can be considered
-    to be an attribute of a range of a task's address space, shared policies
-    apply directly to the shared object.  Thus, all tasks that attach to the
-    object share the policy, and all pages allocated for the shared object,
-    by any task, will obey the shared policy.
+    some range of the shared object.  However, unlike VMA policies, which can
+    be considered to be an attribute of a range of a task's address space,
+    shared policies apply directly to [a range of] the shared object.  Thus,
+    all tasks that attach to the object share the policy, and all pages
+    allocated for the shared object, by any task, after the policy is installed,
+    will obey the shared policy.
+
+	If no shared policy exists for a given page offset in a shared object,
+	allocation will fall back to task policy, and then to system default
+	policy, like VMA policies.
 
 	As of 2.6.28, only shared memory segments, created by shmget() or
 	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  Prior to
@@ -124,12 +128,19 @@ most general to most specific:
 	pages.  To preserve existing behavior for applications that might
 	care, this new behavior must be enabled on a per-cpuset basis.
 
-	As mentioned above [re: VMA policies], allocations of page cache
-	pages for regular files mmap()ed with MAP_SHARED ignore any VMA
-	policy installed on the virtual address range backed by the shared
-	file mapping.  Rather, shared page cache pages, including pages backing
-	private mappings that have not yet been written by the task, follow
-	task policy, if any, else System Default Policy.
+	Prior to 2.6.XX, shared memory policies were not supported on regular
+	files.  Allocations of page cache pages for regular files mmap()ed with
+	MAP_SHARED ignored any VMA policy installed on the virtual address
+	range backed by the shared file mapping.  Rather, shared page cache
+	pages, including pages backing private mappings that have not yet been
+	written by the task, followed task policy, if any, else System Default
+	Policy.  As of 2.6.XX, mbind() will install shared policies on [a range
+	of] a regular file mmap()ed with MAP_SHARED.   To minimize unpleasant
+	surprises for existing applications, only the owner or appropriately
+	privileged task may apply a shared policy to a regular file, and the
+	policy will persist only as long as the file remains mapped in one or
+	more task's virtual address space.  Further, this new behavior must be
+	enabled on a per-cpuset basis.
 
 	The shared policy infrastructure supports different policies on subset
 	ranges of the shared object.  However, before Linux 2.6.XX, the kernel
@@ -386,8 +397,8 @@ the policy is considered invalid and can
 
 The interaction of memory policies and cpusets can be problematic when tasks
 in two cpusets share access to a memory region, such as shared memory segments
-created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
-any of the tasks install shared policy on the region, only nodes whose
+created by shmget() or mmap() with the MAP_ANONYMOUS and/or MAP_SHARED flags,
+and any of the tasks install shared policy on the region, only nodes whose
 memories are allowed in both cpusets may be used in the policies.  Since
 2.6.26, applications can determine the allowed memories using the
 get_mempolicy() API with the MPOL_F_MEMS_ALLOWED flag.  However, one still

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC 0/14] Shared Policy Overview
  2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
                   ` (13 preceding siblings ...)
  2010-11-11 19:13 ` [PATCH/RFC 14/14] Shared Policy: add generic file set/get policy vm ops Lee Schermerhorn
@ 2010-11-11 19:54 ` Andi Kleen
  2010-11-11 19:59   ` Lee Schermerhorn
  14 siblings, 1 reply; 17+ messages in thread
From: Andi Kleen @ 2010-11-11 19:54 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: linux-numa, akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, andi, David Rientjes, Avi Kivity,
	Andrea Arcangeli

> I'll announce this series and the automatic/lazy migration series
> to follow on lkml, linux-mm, ...  However, I'll limit the actual
> posting to linux-numa to avoid spamming the other lists.

Thanks for posting Lee. Yes I think the patchkit is 
very interesting. 

Quick comment on the list: I think linux-mm is better as the main 
list where all the MM hackers are; linux-numa is really mostly about 
the user space code.

-Andi

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH/RFC 0/14] Shared Policy Overview
  2010-11-11 19:54 ` [PATCH/RFC 0/14] Shared Policy Overview Andi Kleen
@ 2010-11-11 19:59   ` Lee Schermerhorn
  0 siblings, 0 replies; 17+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 19:59 UTC (permalink / raw
  To: Andi Kleen
  Cc: linux-numa, akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins,
	KOSAKI Motohiro, David Rientjes, Avi Kivity, Andrea Arcangeli

On Thu, 2010-11-11 at 20:54 +0100, Andi Kleen wrote:
> > I'll announce this series and the automatic/lazy migration series
> > to follow on lkml, linux-mm, ...  However, I'll limit the actual
> > posting to linux-numa to avoid spamming the other lists.
> 
> Thanks for posting Lee. Yes I think the patchkit is 
> very interesting. 
> 
> Quick comment on the list: I think linux-mm is better as the main 
> list where all the MM hackers are; linux-numa is really mostly about 
> the user space code.

Andi:

I will announce the patches there with links to the threads.  If this
moves beyond RFC we can move it to -mm if you like.  I used linux-numa
because I thought that was most appropriate, thinking it more general
than just user space code.

Lee

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2010-11-11 19:59 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-11 19:11 [PATCH/RFC 0/14] Shared Policy Overview Lee Schermerhorn
2010-11-11 19:11 ` [PATCH/RFC 1/14] Shared Policy: Miscellaneous Cleanup Lee Schermerhorn
2010-11-11 19:12 ` [PATCH/RFC 2/14] Shared Policy: move shared policy to inode/mapping Lee Schermerhorn
2010-11-11 19:12 ` [PATCH/RFC 3/14] Shared Policy: allocate shared policies as needed Lee Schermerhorn
2010-11-11 19:12 ` [PATCH/RFC 4/14] Shared Policy: let vma policy ops handle sub-vma policies Lee Schermerhorn
2010-11-11 19:12 ` [PATCH/RFC 5/14] Shared Policy: fix show_numa_maps() Lee Schermerhorn
2010-11-11 19:12 ` [PATCH/RFC 6/14] Shared Policy: Factor alloc_page_pol routine Lee Schermerhorn
2010-11-11 19:12 ` [PATCH/RFC 7/14] Shared Policy: use shared policy for page cache allocations Lee Schermerhorn
2010-11-11 19:12 ` [PATCH/RFC 8/14] Shared Policy: use alloc_page_pol for swap and shmempages Lee Schermerhorn
2010-11-11 19:13 ` [PATCH/RFC 9/14] Shared Policy: per cpuset huge file policy control Lee Schermerhorn
2010-11-11 19:13 ` [PATCH/RFC 10/14] Shared Policy: Add hugepage shmem policy vm_ops Lee Schermerhorn
2010-11-11 19:13 ` [PATCH/RFC 11/14] Shared Policy: fix migration of private mappings Lee Schermerhorn
2010-11-11 19:13 ` [PATCH/RFC 12/14] Shared Policy: mapped file policy persistence model Lee Schermerhorn
2010-11-11 19:13 ` [PATCH/RFC 13/14] Shared Policy: per cpuset mapped file policy control Lee Schermerhorn
2010-11-11 19:13 ` [PATCH/RFC 14/14] Shared Policy: add generic file set/get policy vm ops Lee Schermerhorn
2010-11-11 19:54 ` [PATCH/RFC 0/14] Shared Policy Overview Andi Kleen
2010-11-11 19:59   ` Lee Schermerhorn
