linux-numa.vger.kernel.org archive mirror
* [PATCH/RFC 0/11] numa - Automatic-migration
@ 2010-11-11 20:01 Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 1/11] numa - Automatic-migration - preparation, cleanup Lee Schermerhorn
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:01 UTC
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

RFC Automatic Page Migration

At the Linux Plumbers Conference, Andi Kleen encouraged me again
to resubmit my automatic page migration patches because he thinks
they will be useful for virtualization.  Later, in the Virtualization
mini-conf, the subject came up during a presentation about adding
NUMA awareness to qemu/kvm.  After the presentation, I discussed
these series with Andrea Arcangeli and he also encouraged me to
post them.  My position within HP has changed such that I'm not
sure how much time I'll have to spend on this area nor whether I'll
have access to the larger NUMA platforms on which to test the
patches thoroughly.  However, here is the third of 4 series that
comprise my shared policy enhancements and lazy/auto-migration
enhancement.

I have rebased the patches against a recent mmotm tree.  This
rebase built cleanly, booted and passed a few ad hoc tests on
x86_64.  I've made a pass over the patch descriptions to update
them.  If there is sufficient interest in merging this, I'll
do what I can to assist in the completion and testing of the
series.

Based atop the previously posted:

1) Shared policy cleanup, fixes, mapped file policy
2) Migrate-on-fault a.k.a. Lazy Page Migration facility

To follow:

4)  a Migration Cache -- originally written by Marcelo Tosatti

I'll announce this series and the automatic/lazy migration series
to follow on lkml, linux-mm, ...  However, I'll limit the actual
posting to linux-numa to avoid spamming the other lists.

---

This series of patches hooks up linux page migration to the task load
balancing mechanism.  The effect is such that, when load balancing moves
a task to a cpu on a different node from where the task last executed,
the task is notified of this change using a variant of the mechanism used
to notify a task of pending signals.  When the task returns to user state,
it attempts to migrate, to the new node, any pages not already on that
node in those of the task's vm areas under control of default policy.

By default, the task will use lazy migration to migrate "misplaced"
pages.  When notified of an inter-node migration, the task will
walk its address space, attempting to unmap [remove all ptes] any
anonymous pages in the task's page table.  When the task subsequently
touches any of these unmapped pages, it will incur a swap page
fault.  The swap fault handler will restore the pte if the
cached page's location matches its mempolicy; otherwise the
"migrate-on-fault" mechanism will attempt to migrate the page to
the correct node.
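
For illustration only, the lazy path described above can be modeled
in user space roughly as follows.  This is a sketch, not code from
the series:  struct model_page, model_unmap_all() and model_touch()
are hypothetical names, the "correct node" is assumed to be the
task's current node [default/local policy], and the migration is
assumed to always succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of one anonymous page; not a kernel type. */
struct model_page {
	int node;	/* node where the page currently resides */
	bool mapped;	/* true while a pte maps the page */
};

/*
 * Step 1:  the task was moved to another node.  Unmap every mapped
 * page so that the next touch faults.  Returns the number unmapped.
 */
static size_t model_unmap_all(struct model_page *pages, size_t n)
{
	size_t unmapped = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (pages[i].mapped) {
			pages[i].mapped = false;
			unmapped++;
		}
	}
	return unmapped;
}

/*
 * Step 2:  the task touches a page.  A mapped page does not fault.
 * An unmapped page is checked for placement:  if it is misplaced it
 * is "migrated" [assumed to succeed here], then the mapping is
 * restored.  Returns true if the page was migrated.
 */
static bool model_touch(struct model_page *page, int task_node)
{
	bool migrated = false;

	assert(page != NULL);
	if (page->mapped)
		return false;
	if (page->node != task_node) {
		page->node = task_node;	/* stands in for migrate-on-fault */
		migrated = true;
	}
	page->mapped = true;
	return migrated;
}
```

Only pages that are actually touched after the move pay the
migration cost in this model; untouched pages stay where they are.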

Lazy migration may be disabled by writing zero to the per cpuset
auto_migrate_lazy file.  In that case, automigration will use
direct, synchronous migration to pull all anonymous pages mapped
by the task to the new node.

	Why lazy migration by default?  Think of the effect
	of direct, synchronous migration, in this context,
	on large multi-threaded programs.

Automatic page migration is disabled by default, but can be enabled by
writing non-zero to the per cpuset auto_migrate_enable file.
Furthermore, to prevent thrashing, this series provides a second,
experimental per cpuset control, auto_migrate_interval.  The load
balancer will not move a task to a different node if it has moved to a
new node in the last auto_migrate_interval seconds.  [User interface
is in seconds; internally it's in HZ.]  The idea is to give the task
time to amortize the cost of the migration by giving it time to
benefit from local references to its pages.  Some experimenting and
tuning will be necessary to determine the appropriate default value
for this parameter on various platforms.
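
As a rough user-space sketch of the interval check [not code from
the series]:  MODEL_HZ, model_secs_to_ticks() and
model_may_change_node() are hypothetical names, and the tick rate of
250 is an assumed value chosen only for the example.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MODEL_HZ 250UL	/* assumed ticks per second, for illustration */

/* Convert the user-visible interval [seconds] to internal ticks. */
static unsigned long model_secs_to_ticks(unsigned long secs)
{
	assert(secs <= ULONG_MAX / MODEL_HZ);	/* guard against overflow */
	return secs * MODEL_HZ;
}

/*
 * May the load balancer move the task to a different node now?
 * 'now' and 'last_move' are tick counts.  The unsigned subtraction
 * gives the elapsed ticks even if the counter wrapped in between.
 */
static bool model_may_change_node(unsigned long now,
				  unsigned long last_move,
				  unsigned long interval_secs)
{
	unsigned long elapsed = now - last_move;

	return elapsed >= model_secs_to_ticks(interval_secs);
}
```

With an interval of 30 seconds and the assumed tick rate, a task
that changed nodes at tick 0 would not be eligible for another
inter-node move before tick 7500.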

An additional per cpuset control -- migrate_max_mapcount -- adjusts
the threshold page mapcount at which non-privileged users can migrate
shared pages.  This control allows experimentation with more aggressive
auto-migration.

Why "per cpuset controls"?  Originally, cpusets was the only convenient
"soft partitioning" or "task grouping" mechanism available.  Now that
"containers" or "control groups" are available, one might consider
a "NUMA behavior" control group, orthogonal to cpusets, to control this
sort of behavior.  However, because cpusets are closely tied to NUMA resource
partitioning and locality management, it still seems like a good place to
contain the migration and mempolicy behavior controls.

Finally, the series adds a per process control file -- /proc/<pid>/migrate.
Writing to this file causes the task to simulate an internode migration
by walking its address space and unmapping anonymous pages so that they
will be checked for [mis]placement on next touch; or by directly migrating
them if lazy migration is disabled for the task's cpuset.  This can be
used to test the automigration facility or to force a task to reestablish
its anonymous page NUMA footprint at any time.

---

Lee Schermerhorn


* [PATCH/RFC 1/11] numa - Automatic-migration - preparation, cleanup
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
@ 2010-11-11 20:01 ` Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 2/11] numa - Automatic-migration - per cpuset automigration control Lee Schermerhorn
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:01 UTC
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration -  Preparatory patch

Added AUTO_MIGRATION Kconfig option that depends on MIGRATION.
Conditionally compiled auto-migration features now controlled
by this option.

Define mempolicy.c internal flag for auto-migration.  This flag
will select auto-migration specific behavior in the existing
page migration functions.  Test this flag via helper function
is_auto_migration().  Can't be static inline in header because
flag is private to mempolicy.c.

Add auto_migrate_task_memory() to mempolicy.c.  This function sets up
to call migrate_to_node() with internal flags for auto-migration.

Modify vma_migratable() to skip VMAs that don't have local policy
when auto-migrating.  vma_migratable() now called from check_range()
in mempolicy.c and do_move_pages() in migrate.c.

Subsequent patches will arrange for auto_migrate_task_memory() to be
called when a task returns to user space after the scheduler migrates
it to a cpu on a node different from the node where it last executed.
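
For readers unfamiliar with the INVERT convention, the selection
logic can be modeled in user space roughly as below.  This is an
illustrative sketch only:  the MODEL_* flag values are placeholders
rather than the kernel's, the function names are hypothetical, and
the other vma_migratable() checks [VM_IO and friends] are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag bits for the model; not the kernel's values. */
#define MODEL_MF_MOVE		(1u << 0)
#define MODEL_MF_INVERT		(1u << 1)
#define MODEL_MF_AUTOMIGRATE	(1u << 2)

static bool model_is_auto_migration(unsigned int flags)
{
	return (flags & MODEL_MF_AUTOMIGRATE) != 0;
}

/*
 * Is a page on 'page_node' selected for migration, given the
 * "source" node argument?  Without INVERT, pages on the source node
 * are selected; with INVERT, pages NOT on that node are selected.
 */
static bool model_page_selected(int page_node, int source_node,
				unsigned int flags)
{
	bool on_source = (page_node == source_node);

	assert(page_node >= 0 && source_node >= 0);
	return (flags & MODEL_MF_INVERT) ? !on_source : on_source;
}

/* When auto-migrating, only vmas with local policy are considered. */
static bool model_vma_eligible(bool vma_has_local_policy,
			       unsigned int flags)
{
	if (model_is_auto_migration(flags))
		return vma_has_local_policy;
	return true;
}
```

Passing the destination node as the "source" together with the
INVERT flag therefore selects every page that is not already on the
destination node.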

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/auto-migrate.h |   25 +++++++++++++++
 include/linux/mempolicy.h    |   13 +++++++
 mm/Kconfig                   |    7 ++++
 mm/mempolicy.c               |   71 ++++++++++++++++++++++++++++++++++++++-----
 mm/migrate.c                 |    3 +
 5 files changed, 110 insertions(+), 9 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/mm/Kconfig
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/Kconfig
+++ linux-2.6.36-mmotm-101103-1217/mm/Kconfig
@@ -211,6 +211,13 @@ config MIGRATE_ON_FAULT
 	  page is not currently mapped by any tasks.  This allows a task to
 	  pull unmapped pages closer to itself when enabled for that task.
 
+config AUTO_MIGRATION
+	bool "Auto-migrate task memory"
+	depends on MIGRATION
+	help
+	  Allows tasks' private memory to follow that task itself across
+	  inter-node migrations.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
Index: linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
===================================================================
--- /dev/null
+++ linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
@@ -0,0 +1,25 @@
+#ifndef _LINUX_AUTO_MIGRATE_H
+#define _LINUX_AUTO_MIGRATE_H
+
+/*
+ * minimal memory migration definitions needed by scheduler,
+ * sysctl, ..., so that they don't need to drag in the entire
+ * migrate.h and all that it depends on.
+ */
+
+#ifdef CONFIG_AUTO_MIGRATION
+
+extern int is_auto_migration(int flags);
+
+extern void auto_migrate_task_memory(void);
+
+#else	/* !CONFIG_AUTO_MIGRATION */
+
+static inline int is_auto_migration(int flags)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_AUTO_MIGRATION */
+
+#endif
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -91,6 +91,7 @@
 #include <linux/syscalls.h>
 #include <linux/ctype.h>
 #include <linux/mm_inline.h>
+#include <linux/auto-migrate.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -98,10 +99,16 @@
 #include "internal.h"
 
 /* Internal flags */
-#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
-#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
-#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
-#define MPOL_MF_MOVE_ANON_ONLY (MPOL_MF_INTERNAL << 3)
+#define MPOL_MF_DISCONTIG_OK \
+	(MPOL_MF_INTERNAL << 0)		/* Skip checks for continuous vmas */
+#define MPOL_MF_INVERT \
+	(MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
+#define MPOL_MF_STATS \
+	(MPOL_MF_INTERNAL << 2)		/* Gather statistics */
+#define MPOL_MF_MOVE_ANON_ONLY \
+	(MPOL_MF_INTERNAL << 3)		/* migrate private, anon pages only */
+#define MPOL_MF_AUTOMIGRATE \
+	(MPOL_MF_INTERNAL << 4)		/* auto-migrating task memory */
 
 static struct kmem_cache *policy_cache;
 static struct kmem_cache *sp_cache;
@@ -467,8 +474,10 @@ static void migrate_page_add(struct page
 /*
  * Check whether a vma is migratable
  */
-int vma_migratable(struct vm_area_struct *vma)
+int vma_migratable(struct vm_area_struct *vma, int flags)
 {
+	int ret = 1;
+
 	if (vma->vm_flags & (VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED))
 		return 0;
 	/*
@@ -480,7 +489,20 @@ int vma_migratable(struct vm_area_struct
 		gfp_zone(mapping_gfp_mask(vma->vm_file->f_mapping))
 								< policy_zone)
 			return 0;
-	return 1;
+
+	/*
+	 * Auto-migration:  only consider vmas with local allocation policy
+	 * NOTE:  we only query the start address of the vma.  For shared
+	 * segments with multiple policy ranges, this might lie, but we'll
+	 * live with that.
+	 */
+	if (is_auto_migration(flags)) {
+		struct mempolicy *pol =
+			get_vma_policy(current, vma, vma->vm_start);
+		ret = is_local_allocation(pol);
+		mpol_cond_put(pol);
+	}
+	return ret;
 }
 
 /* Scan through pages checking if pages follow certain conditions. */
@@ -627,7 +649,7 @@ check_range(struct mm_struct *mm, unsign
 				end = (end + HPAGE_MASK) & HPAGE_MASK;
 		} else if (((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-				vma_migratable(vma)))) {
+				vma_migratable(vma, flags)))) {
 			unsigned long endvma = vma->vm_end;
 			unsigned long anononly = 0;
 
@@ -1190,6 +1212,41 @@ static struct page *new_vma_page(struct
 	 */
 	return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
 }
+
+#ifdef CONFIG_AUTO_MIGRATION
+
+int is_auto_migration(int flags)
+{
+	return !!(flags & MPOL_MF_AUTOMIGRATE);
+}
+
+/**
+ * auto_migrate_task_memory()
+ *
+ * Called just before returning to user state when a task has been
+ * migrated to a new node by the scheduler and sched_migrate_memory
+ * is enabled.
+ */
+void auto_migrate_task_memory(void)
+{
+	struct mm_struct *mm = current->mm;
+	int dest = cpu_to_node(task_cpu(current));
+	int flags = MPOL_MF_MOVE | MPOL_MF_INVERT | MPOL_MF_AUTOMIGRATE;
+
+	/*
+	 * we're returning to user space, so mm must be non-NULL
+	 */
+	BUG_ON(!mm);
+
+	/*
+	 * Pass destination node as source node plus 'INVERT flag:
+	 *    Migrate all pages NOT on destination node.
+	 * 'AUTOMIGRATE flag selects only VMAs with default policy
+	 */
+	migrate_to_node(mm, dest, dest, flags);
+}
+#endif	/* CONFIG_AUTO_MIGRATION */
+
 #else
 
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
Index: linux-2.6.36-mmotm-101103-1217/mm/migrate.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/migrate.c
+++ linux-2.6.36-mmotm-101103-1217/mm/migrate.c
@@ -35,6 +35,7 @@
 #include <linux/hugetlb.h>
 #include <linux/gfp.h>
 #include <linux/vmstat.h>
+#include <linux/auto-migrate.h>
 
 #include "internal.h"
 
@@ -1156,7 +1157,7 @@ static int do_move_page_to_node_array(st
 
 		err = -EFAULT;
 		vma = find_vma(mm, pp->addr);
-		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
+		if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma, 0))
 			goto set_status;
 
 		page = follow_page(vma, pp->addr, FOLL_GET);
Index: linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/mempolicy.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/mempolicy.h
@@ -234,7 +234,7 @@ extern int mpol_to_str(char *buffer, int
 			int no_context);
 #endif
 
-extern int vma_migratable(struct vm_area_struct *);
+extern int vma_migratable(struct vm_area_struct *, int);
 
 struct seq_file;
 extern int show_numa_map(struct seq_file *, void *);
@@ -249,6 +249,14 @@ extern struct mpol_range *get_numa_subma
 extern int mpol_misplaced(struct page *, struct vm_area_struct *,
 		unsigned long, int *);
 
+/*
+ * Does the argument mempolicy specify local allocation?
+ */
+static inline int is_local_allocation(struct mempolicy *mpol)
+{
+	return mpol->flags & MPOL_F_LOCAL;
+}
+
 #endif /* CONFIG_MIGRATE_ON_FAULT */
 
 #else
@@ -368,6 +376,9 @@ static inline int mpol_to_str(char *buff
 }
 #endif
 
+static inline int vma_migratable(struct vm_area_struct *vma, int flags)
+					{ return 0; }
+
 #endif /* CONFIG_NUMA */
 #endif /* __KERNEL__ */
 


* [PATCH/RFC 2/11] numa - Automatic-migration - per cpuset automigration control
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 1/11] numa - Automatic-migration - preparation, cleanup Lee Schermerhorn
@ 2010-11-11 20:01 ` Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 3/11] numa - Automatic-migration - check notify migrate pending Lee Schermerhorn
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:01 UTC
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - add auto_migration enable per cpuset control

This patch implements a per cpuset "auto_migration" control.
Default is disabled.

Earlier versions of this patch used a task flag for this purpose
to avoid extra cache misses in the fault path and such.  There
are currently no task flags available, so this version uses
a flag in another variable in another cache line :(.

Because AUTO_MIGRATION requires CPUSETS, unconditionally select
same in mm/Kconfig.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/cpuset.h |   16 ++++++++++++++++
 include/linux/sched.h  |   18 ++++++++++++++++++
 kernel/cpuset.c        |   31 ++++++++++++++++++++++++++++++-
 mm/Kconfig             |    1 +
 4 files changed, 65 insertions(+), 1 deletion(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/sched.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
@@ -1458,6 +1458,7 @@ struct task_struct {
 	short shared_huge_policy_enabled:1;
 	short shared_file_policy_enabled:1;
 	short migrate_on_fault_enabled:1;
+	short auto_migrate_enabled:1;
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
@@ -1926,6 +1927,23 @@ static inline int migrate_on_fault_enabl
 }
 #endif
 
+#ifdef CONFIG_AUTO_MIGRATION
+static inline void set_auto_migrate_enabled(struct task_struct *tsk,
+							int val)
+{
+	tsk->auto_migrate_enabled = !!val;
+}
+static inline int auto_migrate_enabled(struct task_struct *tsk)
+{
+	return tsk->auto_migrate_enabled;
+}
+#else
+static inline void set_auto_migrate_enabled(struct task_struct *tsk,
+							int val)
+{
+}
+#endif
+
 #ifdef CONFIG_HOTPLUG_CPU
 extern void move_task_off_dead_cpu(int dead_cpu, struct task_struct *p);
 extern void idle_task_exit(void);
Index: linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/cpuset.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
@@ -135,6 +135,7 @@ typedef enum {
 	CS_SHARED_HUGE_POLICY,
  	CS_SHARED_FILE_POLICY,
 	CS_MIGRATE_ON_FAULT,
+	CS_AUTO_MIGRATE,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -188,6 +189,11 @@ static inline int is_migrate_on_fault(co
 	return test_bit(CS_MIGRATE_ON_FAULT,  &cs->flags);
 }
 
+static inline int is_auto_migrate(const struct cpuset *cs)
+{
+	return test_bit(CS_AUTO_MIGRATE,  &cs->flags);
+}
+
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 };
@@ -352,6 +358,10 @@ static void cpuset_update_task_cpuset_fl
 		set_migrate_on_fault_enabled(tsk, 1);
 	else
 		set_migrate_on_fault_enabled(tsk, 0);
+	if (is_auto_migrate(cs))
+		set_auto_migrate_enabled(tsk, 1);
+	else
+		set_auto_migrate_enabled(tsk, 0);
 
 }
 
@@ -1294,7 +1304,8 @@ static int update_flag(cpuset_flagbits_t
 			|| (is_spread_page(cs) != is_spread_page(trialcs))
 			|| (is_shared_huge_policy(cs) != is_shared_huge_policy(trialcs))
 			|| (is_shared_file_policy(cs) != is_shared_file_policy(trialcs))
-			|| (is_migrate_on_fault(cs) != is_migrate_on_fault(trialcs)));
+			|| (is_migrate_on_fault(cs) != is_migrate_on_fault(trialcs))
+			|| (is_auto_migrate(cs) != is_auto_migrate(trialcs)));
 
 	mutex_lock(&callback_mutex);
 	cs->flags = trialcs->flags;
@@ -1533,6 +1544,7 @@ typedef enum {
 	FILE_SHARED_HUGE_POLICY,
 	FILE_SHARED_FILE_POLICY,
 	FILE_MIGRATE_ON_FAULT,
+	FILE_AUTO_MIGRATE,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1580,6 +1592,9 @@ static int cpuset_write_u64(struct cgrou
 		break;
 	case FILE_MIGRATE_ON_FAULT:
 		retval = update_flag(CS_MIGRATE_ON_FAULT, cs, val);
+		break;
+	case FILE_AUTO_MIGRATE:
+		retval = update_flag(CS_AUTO_MIGRATE, cs, val);
 		break;
 	default:
 		retval = -EINVAL;
@@ -1751,6 +1766,8 @@ static u64 cpuset_read_u64(struct cgroup
 		return is_shared_file_policy(cs);
 	case FILE_MIGRATE_ON_FAULT:
 		return is_migrate_on_fault(cs);
+	case FILE_AUTO_MIGRATE:
+		return is_auto_migrate(cs);
 	default:
 		BUG();
 	}
@@ -1889,6 +1906,13 @@ static struct cftype cft_migrate_on_faul
 	.private = FILE_MIGRATE_ON_FAULT,
 };
 
+static struct cftype cft_auto_migration = {
+	.name = "auto_migration",
+	.read_u64 = cpuset_read_u64,
+	.write_u64 = cpuset_write_u64,
+	.private = FILE_AUTO_MIGRATE,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -1909,6 +1933,9 @@ static int cpuset_populate(struct cgroup
 						&cft_migrate_on_fault);
 	if (err < 0)
 		return err;
+	err = add_auto_migration_file(cont, ss, &cft_auto_migration);
+	if (err < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (!cont->parent)
 		err = cgroup_add_file(cont, ss,
@@ -1989,6 +2016,8 @@ static struct cgroup_subsys_state *cpuse
 		set_bit(CS_SHARED_FILE_POLICY, &cs->flags);
 	if (is_migrate_on_fault(parent))
 		set_bit(CS_MIGRATE_ON_FAULT, &cs->flags);
+	if (is_auto_migrate(parent))
+		set_bit(CS_AUTO_MIGRATE, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
Index: linux-2.6.36-mmotm-101103-1217/mm/Kconfig
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/Kconfig
+++ linux-2.6.36-mmotm-101103-1217/mm/Kconfig
@@ -214,6 +214,7 @@ config MIGRATE_ON_FAULT
 config AUTO_MIGRATION
 	bool "Auto-migrate task memory"
 	depends on MIGRATION
+	select CPUSETS
 	help
 	  Allows tasks' private memory to follow that task itself across
 	  inter-node migrations.
Index: linux-2.6.36-mmotm-101103-1217/include/linux/cpuset.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/cpuset.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/cpuset.h
@@ -164,6 +164,22 @@ static inline int add_migrate_on_fault_f
 }
 #endif
 
+#ifdef CONFIG_AUTO_MIGRATION
+static inline int add_auto_migration_file(struct cgroup *cg,
+						struct cgroup_subsys *ss,
+						struct cftype *cft)
+{
+	return cgroup_add_file(cg, ss, cft);
+}
+#else
+static inline int add_auto_migration_file(struct cgroup *cg,
+						struct cgroup_subsys *ss,
+						struct cftype *cft)
+{
+	return 0;
+}
+#endif
+
 extern void __init cpuset_init_shared_huge_policy(int dflt);
 extern void __init cpuset_init_shared_file_policy(int dflt);
 


* [PATCH/RFC 3/11] numa - Automatic-migration - check notify migrate pending
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 1/11] numa - Automatic-migration - preparation, cleanup Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 2/11] numa - Automatic-migration - per cpuset automigration control Lee Schermerhorn
@ 2010-11-11 20:01 ` Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 4/11] numa - Automatic-migration - ia64 " Lee Schermerhorn
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:01 UTC
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - generic check/notify internode migration

This patch adds the check for internode migration to be called
from scheduler load balancing functions, and the check for migration
pending to be called when a task returning to user space notices
NOTIFY_RESUME pending.

Check for internode migration:  if automatic memory migration
is enabled [auto_migrate_enabled(task)] and this is a user task and
the destination cpu is on a different node from the task's current cpu,
the task will be marked as migration pending via a member added to the
task struct.  The TIF_NOTIFY_RESUME thread_info flag is set to cause the
task to enter do_notify_resume[_user]() to check for migration pending.

When a task is rescheduled to user space with TIF_NOTIFY_RESUME,
it will check for migration pending, unless SIGKILL is pending.
If the task notices migration pending, it will call
auto_migrate_task_memory() to migrate pages in vma's with default
policy.  Only default policy is affected by migration to a new node.

Note that we can't call auto_migrate_task_memory() with interrupts
disabled.  Temporarily enable interrupts around the call.
The check is last in line before returning to user space so this
should be safe here.

These checks become static inline stubs when 'AUTO_MIGRATION' is not
configured.
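
The check/notify handshake can be pictured with a small user-space
model.  This is a sketch under stated assumptions, not the patch
itself:  struct model_task and the model_* functions are
hypothetical, and the topology of 4 cpus per node is invented for
the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few task fields involved. */
struct model_task {
	bool has_mm;			/* false for kernel threads */
	bool auto_migrate_enabled;	/* per cpuset control */
	bool migrate_pending;		/* internode mem migration pending */
	bool notify_resume;		/* models TIF_NOTIFY_RESUME */
	int cpu;			/* cpu the task last ran on */
};

/* Assumed topology for the example:  4 cpus per node. */
static int model_cpu_to_node(int cpu)
{
	return cpu / 4;
}

/* Scheduler side:  called when the task is assigned to dest_cpu. */
static void model_check_internode_migration(struct model_task *t,
					    int dest_cpu)
{
	assert(t != NULL);
	if (!t->auto_migrate_enabled || !t->has_mm)
		return;
	if (model_cpu_to_node(t->cpu) != model_cpu_to_node(dest_cpu)) {
		t->migrate_pending = true;
		t->notify_resume = true;
	}
}

/*
 * Return-to-user side:  returns true if task memory migration would
 * be started, and clears the pending mark either way.
 */
static bool model_check_migrate_pending(struct model_task *t)
{
	bool pending;

	assert(t != NULL);
	pending = t->migrate_pending;
	t->migrate_pending = false;
	return pending;
}
```

A move between cpus on the same node leaves the task unmarked; only
a move that crosses a node boundary sets the pending mark, and the
mark is consumed once on the way back to user space.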

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/auto-migrate.h |   53 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/sched.h        |    4 +++
 mm/mempolicy.c               |   20 ++++++++++++++++
 3 files changed, 76 insertions(+), 1 deletion(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/sched.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
@@ -1459,6 +1459,9 @@ struct task_struct {
 	short shared_file_policy_enabled:1;
 	short migrate_on_fault_enabled:1;
 	short auto_migrate_enabled:1;
+#ifdef CONFIG_AUTO_MIGRATION
+	short migrate_pending:1;	/* internode mem migration pending */
+#endif
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
@@ -1928,6 +1931,7 @@ static inline int migrate_on_fault_enabl
 #endif
 
 #ifdef CONFIG_AUTO_MIGRATION
+#define SCHED_AUTO_MIGRATION 1
 static inline void set_auto_migrate_enabled(struct task_struct *tsk,
 							int val)
 {
Index: linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/auto-migrate.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
@@ -10,9 +10,49 @@
 #ifdef CONFIG_AUTO_MIGRATION
 
 extern int is_auto_migration(int flags);
-
 extern void auto_migrate_task_memory(void);
 
+#ifdef SCHED_AUTO_MIGRATION
+/* these need sched.h definition.  They're only used where sched.h is
+ * already included.  Note we depend on sched.h being included
+ * first to see these functions.
+ */
+extern void __check_internode_migration(struct task_struct *, int);
+
+static inline void check_internode_migration(struct task_struct *task,
+			int dest_cpu)
+{
+	if (auto_migrate_enabled(task))
+		__check_internode_migration(task, dest_cpu);
+}
+
+/*
+ * called only by arch dependent code for architectures that
+ * support "migration work"
+ */
+static inline void check_migrate_pending(void)
+{
+	if (unlikely(current->migrate_pending)) {
+		int disable_irqs = 0;
+
+		if (irqs_disabled()) {
+			disable_irqs = 1;
+			local_irq_enable();
+		}
+
+		/*
+		 * can't be called in atomic context.
+		 */
+		auto_migrate_task_memory();
+
+		if (disable_irqs)
+			local_irq_disable();
+	}
+	current->migrate_pending = 0;
+	return;
+}
+#endif /* SCHED_AUTO_MIGRATION */
+
 #else	/* !CONFIG_AUTO_MIGRATION */
 
 static inline int is_auto_migration(int flags)
@@ -20,6 +60,17 @@ static inline int is_auto_migration(int
 	return 0;
 }
 
+static int is_auto_migration(int flags) { return 0; }
+
+static inline void check_internode_migration(struct task_struct *tsk, int cpu)
+{
+}
+
+static inline void check_migrate_pending(void)
+{
+	clear_thread_flag(TIF_NOTIFY_RESUME);
+}
+
 #endif	/* CONFIG_AUTO_MIGRATION */
 
 #endif
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -1215,6 +1215,26 @@ static struct page *new_vma_page(struct
 
 #ifdef CONFIG_AUTO_MIGRATION
 
+/*
+ * Check for task migration to a new node and flag to auto-migrate task memory.
+ * Only called if auto-migration is enabled for this task.
+ */
+void __check_internode_migration(struct task_struct *task,
+			int dest_cpu)
+{
+	if (task->mm) {
+		int node = cpu_to_node(task_cpu(task));
+		if ((node != cpu_to_node(dest_cpu))) {
+			/*
+			 * migrating a user task to a new node.
+			 * mark for memory migration on return to user space.
+			 */
+			task->migrate_pending = 1;
+			set_tsk_thread_flag(task, TIF_NOTIFY_RESUME);
+		}
+	}
+}
+
 int is_auto_migration(int flags)
 {
 	return !!(flags & MPOL_MF_AUTOMIGRATE);


* [PATCH/RFC 4/11] numa - Automatic-migration - ia64 check notify migrate pending
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (2 preceding siblings ...)
  2010-11-11 20:01 ` [PATCH/RFC 3/11] numa - Automatic-migration - check notify migrate pending Lee Schermerhorn
@ 2010-11-11 20:01 ` Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 5/11] numa - Automatic-migration - x86_64 " Lee Schermerhorn
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:01 UTC
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - ia64 check/notify internode migration

NOTE:  not recently tested on ia64

This patch hooks the check for task memory migration pending
into the ia64 do_notify_resume() function.  Call to
check_migrate_pending() is a no-op if automigration is not
configured.

TODO:  reconcile automigration and trace use of 'NOTIFY_RESUME

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 arch/ia64/kernel/process.c |    9 +++++++++
 1 file changed, 9 insertions(+)

Index: linux-2.6.36-mmotm-101103-1217/arch/ia64/kernel/process.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/arch/ia64/kernel/process.c
+++ linux-2.6.36-mmotm-101103-1217/arch/ia64/kernel/process.c
@@ -29,6 +29,7 @@
 #include <linux/kdebug.h>
 #include <linux/utsname.h>
 #include <linux/tracehook.h>
+#include <linux/auto-migrate.h>
 
 #include <asm/cpu.h>
 #include <asm/delay.h>
@@ -193,12 +194,20 @@ do_notify_resume_user(sigset_t *unused,
 		pfm_handle_work();
 #endif
 
+	/*
+	 * check for task memory migration before delivering
+	 * signals so that handler[s] use memory in new node.
+	 */
+	check_migrate_pending();
+
 	/* deal with pending signal delivery */
 	if (test_thread_flag(TIF_SIGPENDING)) {
 		local_irq_enable();	/* force interrupt enable */
 		ia64_do_signal(scr, in_syscall);
 	}
 
+//TODO:  make sure tracehook_notify_resume() is NO-OP if not enabled
+//       by something other than TIF_NOTIFY_RESUME which is general flag.
 	if (test_thread_flag(TIF_NOTIFY_RESUME)) {
 		clear_thread_flag(TIF_NOTIFY_RESUME);
 		tracehook_notify_resume(&scr->pt);


* [PATCH/RFC 5/11] numa - Automatic-migration - x86_64 check notify migrate pending
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (3 preceding siblings ...)
  2010-11-11 20:01 ` [PATCH/RFC 4/11] numa - Automatic-migration - ia64 " Lee Schermerhorn
@ 2010-11-11 20:01 ` Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 6/11] numa - Automatic-migration - hook to scheduler inter-node migration Lee Schermerhorn
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:01 UTC
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - x86_64 check/notify internode migration

Hook check for task memory migration for x86_64.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 arch/x86/kernel/signal.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6.36-mmotm-101103-1217/arch/x86/kernel/signal.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/arch/x86/kernel/signal.c
+++ linux-2.6.36-mmotm-101103-1217/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/personality.h>
 #include <linux/uaccess.h>
 #include <linux/user-return-notifier.h>
+#include <linux/auto-migrate.h>
 
 #include <asm/processor.h>
 #include <asm/ucontext.h>
@@ -857,6 +858,8 @@ do_notify_resume(struct pt_regs *regs, v
 	if (thread_info_flags & _TIF_USER_RETURN_NOTIFY)
 		fire_user_return_notifiers();
 
+	check_migrate_pending();	/* auto-migration hook */
+
 #ifdef CONFIG_X86_32
 	clear_thread_flag(TIF_IRET);
 #endif /* CONFIG_X86_32 */


* [PATCH/RFC 6/11] numa - Automatic-migration - hook to scheduler inter-node migration
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (4 preceding siblings ...)
  2010-11-11 20:01 ` [PATCH/RFC 5/11] numa - Automatic-migration - x86_64 " Lee Schermerhorn
@ 2010-11-11 20:01 ` Lee Schermerhorn
  2010-11-11 20:01 ` [PATCH/RFC 7/11] numa - Automatic-migration - add internode migration delay Lee Schermerhorn
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:01 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - hook sched migrate to memory migration

Add check for internode migration to scheduler -- in most places
where a new cpu is assigned via set_task_cpu().  If MIGRATION is
configured, and auto-migration is enabled [and this is a
user space task], the check will set "migration pending" for the
task IFF the destination cpu is on a different node from the last
cpu to which the task was assigned.  Migration of affected pages
[those with default policy] will occur when the task returns to
user space.
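
For readers skimming the series, the decision described above can be
sketched in plain user-space C.  This is only an illustration under
stated assumptions: struct sketch_task, sketch_cpu_to_node() and the
"4 cpus per node" topology are hypothetical stand-ins, not the types
or helpers used by the patch, and the real hook may do more than set
a flag.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few task fields the check looks at. */
struct sketch_task {
	int  cpu;              /* last cpu the task was assigned to */
	bool has_mm;           /* true for a user space task */
	bool auto_migrate;     /* auto-migration enabled for this task */
	bool migrate_pending;  /* memory should follow the task */
};

/* Toy topology, assumed for illustration: 4 cpus per node. */
static int sketch_cpu_to_node(int cpu)
{
	return cpu / 4;
}

/*
 * Mark migration pending only when the destination cpu is on a
 * different node from the cpu the task was last assigned to.
 */
static void sketch_check_internode_migration(struct sketch_task *t,
					     int dest_cpu)
{
	if (!t->auto_migrate || !t->has_mm)
		return;
	if (sketch_cpu_to_node(t->cpu) != sketch_cpu_to_node(dest_cpu))
		t->migrate_pending = true;
}
```

In this sketch a move from cpu 1 to cpu 2 stays on node 0 and leaves
the flag clear, while a move from cpu 1 to cpu 5 crosses to node 1
and sets it.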

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 kernel/sched.c      |    7 ++++++-
 kernel/sched_fair.c |    2 ++
 kernel/sched_rt.c   |    2 ++
 3 files changed, 10 insertions(+), 1 deletion(-)

Index: linux-2.6.36-mmotm-101103-1217/kernel/sched.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/sched.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/sched.c
@@ -72,6 +72,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/auto-migrate.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -2519,8 +2520,10 @@ static int try_to_wake_up(struct task_st
 	}
 
 	cpu = select_task_rq(rq, p, SD_BALANCE_WAKE, wake_flags);
-	if (cpu != orig_cpu)
+	if (cpu != orig_cpu) {
+		check_internode_migration(p, cpu);
 		set_task_cpu(p, cpu);
+	}
 	__task_rq_unlock(rq);
 
 	rq = cpu_rq(cpu);
@@ -2699,6 +2702,7 @@ void sched_fork(struct task_struct *p, i
 	 * Silence PROVE_RCU.
 	 */
 	rcu_read_lock();
+	check_internode_migration(p, cpu);	/* TODO:  here?  */
 	set_task_cpu(p, cpu);
 	rcu_read_unlock();
 
@@ -5681,6 +5685,7 @@ static int __migrate_task(struct task_st
 	 */
 	if (p->se.on_rq) {
 		deactivate_task(rq_src, p, 0);
+		check_internode_migration(p, dest_cpu);
 		set_task_cpu(p, dest_cpu);
 		activate_task(rq_dest, p, 0);
 		check_preempt_curr(rq_dest, p, 0);
Index: linux-2.6.36-mmotm-101103-1217/kernel/sched_rt.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/sched_rt.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/sched_rt.c
@@ -1372,6 +1372,7 @@ retry:
 	}
 
 	deactivate_task(rq, next_task, 0);
+	check_internode_migration(next_task, lowest_rq->cpu);
 	set_task_cpu(next_task, lowest_rq->cpu);
 	activate_task(lowest_rq, next_task, 0);
 
@@ -1455,6 +1456,7 @@ static int pull_rt_task(struct rq *this_
 			ret = 1;
 
 			deactivate_task(src_rq, p, 0);
+			check_internode_migration(p, this_cpu);
 			set_task_cpu(p, this_cpu);
 			activate_task(this_rq, p, 0);
 			/*
Index: linux-2.6.36-mmotm-101103-1217/kernel/sched_fair.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/sched_fair.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/sched_fair.c
@@ -1761,6 +1761,7 @@ static void pull_task(struct rq *src_rq,
 		      struct rq *this_rq, int this_cpu)
 {
 	deactivate_task(src_rq, p, 0);
+	check_internode_migration(p, this_cpu);
 	set_task_cpu(p, this_cpu);
 	activate_task(this_rq, p, 0);
 	check_preempt_curr(this_rq, p, 0);
@@ -3794,6 +3795,7 @@ static void task_fork_fair(struct task_s
 	update_rq_clock(rq);
 
 	if (unlikely(task_cpu(p) != this_cpu)) {
+		check_internode_migration(p, this_cpu);	/* TODO:  here?  */
 		rcu_read_lock();
 		__set_task_cpu(p, this_cpu);
 		rcu_read_unlock();


* [PATCH/RFC 7/11] numa - Automatic-migration - add internode migration delay
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (5 preceding siblings ...)
  2010-11-11 20:01 ` [PATCH/RFC 6/11] numa - Automatic-migration - hook to scheduler inter-node migration Lee Schermerhorn
@ 2010-11-11 20:01 ` Lee Schermerhorn
  2010-11-11 20:02 ` [PATCH/RFC 8/11] numa - Automatic-migration - per cpuset max mapcount control Lee Schermerhorn
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:01 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - add hysteresis to internode migration

This patch adds hysteresis to the internode migration to prevent
page migration thrashing when automatic, scheduler-driven page
migration is enabled.

Add static in-line function "too_soon_for_internode_migration"
[macro => 0 if !CONFIG_AUTO_MIGRATION] to check for attempts to move
task to a new node sooner than auto_migrate_interval jiffies
after previous migration.  Note:  fetches interval from task struct
to avoid callout to cpuset func with rcu_lock/unlock round trip on
each migration check.  The task's auto_migrate_interval is updated
from cpuset_update_task_memory_state().

Modify the wakeup path [select_task_rq_fair(), reached from
try_to_wake_up()] to leave task on its current cpu if too soon
to move it to a different node.

Modify can_migrate_task() to "just say no!" if the load balancer
proposes an internode migration too soon after previous internode
migration.
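
The hysteresis test can be sketched outside the kernel as follows.
This is a minimal illustration, not the patch's code:  struct
sketch_hyst and the sketch_* helpers are hypothetical names, and
sketch_time_before() only mimics the wraparound-safe comparison that
the kernel's time_before() provides for jiffies.

```c
#include <stdbool.h>

/*
 * True if tick count a is earlier than b, even when the counter has
 * wrapped.  Same idea as the kernel's time_before().
 */
static bool sketch_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical stand-in for the per-task hysteresis state. */
struct sketch_hyst {
	unsigned long next_migrate;  /* earliest tick for next move */
	unsigned long interval;      /* hysteresis interval in ticks */
	bool migrate_pending;        /* previous move not yet handled */
};

/* Refuse an internode move that comes too soon after the last one. */
static bool sketch_too_soon(const struct sketch_hyst *t, unsigned long now,
			    int cur_node, int dest_node)
{
	if (cur_node == dest_node)
		return false;	/* same node:  hysteresis does not apply */
	return t->migrate_pending || sketch_time_before(now, t->next_migrate);
}

/* Re-arm the timer when a pending migration is handled. */
static void sketch_arm(struct sketch_hyst *t, unsigned long now)
{
	t->next_migrate = now + t->interval;
}
```

The unsigned subtraction followed by a signed comparison is what
keeps the check correct when the tick counter wraps between arming
the timer and testing it.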

?	Fix comment block on can_migrate_task() to reflect
	order of tests in current code.

Added a control file--auto_migrate_interval--to cpusets to
query/set the per cpuset interval.  Provide some fairly arbitrary
min, max and default values.
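
The seconds-to-ticks conversion and the silent clipping can be
sketched like this.  SKETCH_HZ is an assumed tick rate chosen for
illustration; the 5 and 300 second bounds are the min and max used
in this patch.  Unlike update_auto_migrate_interval() below, the
sketch clips in seconds before scaling, so a very large value
written to the control file cannot overflow the multiply.

```c
/* Assumed tick rate, for illustration only. */
#define SKETCH_HZ		250UL
#define SKETCH_INTERVAL_MIN_SEC	5ULL
#define SKETCH_INTERVAL_MAX_SEC	300ULL

/* Convert a requested interval in seconds to ticks, clipped to range. */
static unsigned long sketch_interval_ticks(unsigned long long seconds)
{
	if (seconds < SKETCH_INTERVAL_MIN_SEC)
		seconds = SKETCH_INTERVAL_MIN_SEC;
	else if (seconds > SKETCH_INTERVAL_MAX_SEC)
		seconds = SKETCH_INTERVAL_MAX_SEC;

	return (unsigned long)seconds * SKETCH_HZ;
}
```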

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/auto-migrate.h |   30 ++++++++++++++++++++++++++
 include/linux/sched.h        |    2 +
 kernel/cpuset.c              |   49 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched_fair.c          |   18 +++++++++++++--
 4 files changed, 94 insertions(+), 5 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/sched.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
@@ -1462,6 +1462,8 @@ struct task_struct {
 #ifdef CONFIG_AUTO_MIGRATION
 	short migrate_pending:1;	/* internode mem migration pending */
 #endif
+	unsigned long next_migrate;	/* internode migration hysteresis */
+	unsigned long auto_migrate_interval;	/* from cpuset */
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
Index: linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/auto-migrate.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
@@ -12,6 +12,10 @@
 extern int is_auto_migration(int flags);
 extern void auto_migrate_task_memory(void);
 
+#define AUTO_MIGRATE_INTERVAL_DFLT (30*HZ)
+#define AUTO_MIGRATE_INTERVAL_MIN (5*HZ)
+#define AUTO_MIGRATE_INTERVAL_MAX (300*HZ)
+
 #ifdef SCHED_AUTO_MIGRATION
 /* these need sched.h definition.  They're only where sched.h is
  * already included.  Note we depend on sched.h being included
@@ -27,6 +31,24 @@ static inline void check_internode_migra
 }
 
 /*
+ * To avoid page migration thrashing when auto memory migration is enabled,
+ * check user task for too recent internode migration.
+ */
+static inline int too_soon_for_internode_migration(struct task_struct *task,
+								int this_cpu)
+{
+	if (auto_migrate_enabled(task) && task->mm &&
+		cpu_to_node(task_cpu(task)) != cpu_to_node(this_cpu)) {
+
+		if (task->migrate_pending ||
+			time_before(jiffies, task->next_migrate))
+			return 1;
+	}
+
+	return 0;
+}
+
+/*
  * called only by arch dependent code for architectures that
  * support "migration work"
  */
@@ -40,6 +62,8 @@ static inline void check_migrate_pending
 			local_irq_enable();
 		}
 
+		current->next_migrate = jiffies
+			 + current->auto_migrate_interval;
 		/*
 		 * can't be called in atomic context.
 		 */
@@ -71,6 +95,12 @@ static inline void check_migrate_pending
 	clear_thread_flag(TIF_NOTIFY_RESUME);
 }
 
+static inline int too_soon_for_internode_migration(struct task_struct *tsk,
+								int cpu)
+{
+	return 0;
+}
+
 #endif	/* CONFIG_AUTO_MIGRATION */
 
 #endif
Index: linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/cpuset.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
@@ -53,6 +53,7 @@
 #include <linux/time.h>
 #include <linux/backing-dev.h>
 #include <linux/sort.h>
+#include <linux/auto-migrate.h>
 
 #include <asm/uaccess.h>
 #include <asm/atomic.h>
@@ -99,6 +100,8 @@ struct cpuset {
 
 	struct fmeter fmeter;		/* memory_pressure filter */
 
+	unsigned long auto_migrate_interval;
+
 	/* partition number for rebuild_sched_domains() */
 	int pn;
 
@@ -196,6 +199,7 @@ static inline int is_auto_migrate(const
 
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
+	.auto_migrate_interval = AUTO_MIGRATE_INTERVAL_DFLT,
 };
 
 /*
@@ -358,9 +362,10 @@ static void cpuset_update_task_cpuset_fl
 		set_migrate_on_fault_enabled(tsk, 1);
 	else
 		set_migrate_on_fault_enabled(tsk, 0);
-	if (is_auto_migrate(cs))
+	if (is_auto_migrate(cs)) {
 		set_auto_migrate_enabled(tsk, 1);
-	else
+		tsk->auto_migrate_interval = cs->auto_migrate_interval;
+	} else
 		set_auto_migrate_enabled(tsk, 0);
 
 }
@@ -1526,6 +1531,28 @@ alloc_fail:
 	NODEMASK_FREE(to);
 }
 
+/*
+ * Call with manage_mutex held.
+ */
+static int update_auto_migrate_interval(struct cpuset *cs, u64 val)
+{
+	unsigned long n = val * HZ;	/* scale seconds to ticks */
+
+	if (n == cs->auto_migrate_interval)
+		return 0;
+
+	/*
+	 * silently clip to min/max
+	 */
+	if (n < AUTO_MIGRATE_INTERVAL_MIN)
+		cs->auto_migrate_interval = AUTO_MIGRATE_INTERVAL_MIN;
+	else if (n > AUTO_MIGRATE_INTERVAL_MAX)
+		cs->auto_migrate_interval = AUTO_MIGRATE_INTERVAL_MAX;
+	else
+		cs->auto_migrate_interval = n;
+	return 0;
+}
+
 /* The various types of files and directories in a cpuset file system */
 
 typedef enum {
@@ -1545,6 +1572,7 @@ typedef enum {
 	FILE_SHARED_FILE_POLICY,
 	FILE_MIGRATE_ON_FAULT,
 	FILE_AUTO_MIGRATE,
+	FILE_AUTO_MIGRATE_INTERVAL,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1596,6 +1624,9 @@ static int cpuset_write_u64(struct cgrou
 	case FILE_AUTO_MIGRATE:
 		retval = update_flag(CS_AUTO_MIGRATE, cs, val);
 		break;
+	case FILE_AUTO_MIGRATE_INTERVAL:
+		retval = update_auto_migrate_interval(cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1725,6 +1756,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_MEMLIST:
 		s += cpuset_sprintf_memlist(s, cs);
 		break;
+	case FILE_AUTO_MIGRATE_INTERVAL:
+		s += sprintf(s, "%ld", cs->auto_migrate_interval / HZ);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1913,6 +1947,13 @@ static struct cftype cft_auto_migration
 	.private = FILE_AUTO_MIGRATE,
 };
 
+static struct cftype cft_auto_migrate_interval = {
+	.name = "auto_migrate_interval",
+	.read = cpuset_common_file_read,
+	.write_u64 = cpuset_write_u64,
+	.private = FILE_AUTO_MIGRATE_INTERVAL,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -1936,6 +1977,9 @@ static int cpuset_populate(struct cgroup
 	err = add_auto_migration_file(cont, ss, &cft_auto_migration);
 	if (err < 0)
 		return err;
+	err = add_auto_migration_file(cont, ss, &cft_auto_migrate_interval);
+	if (err < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (!cont->parent)
 		err = cgroup_add_file(cont, ss,
@@ -2019,6 +2063,7 @@ static struct cgroup_subsys_state *cpuse
 	if (is_auto_migrate(parent))
 		set_bit(CS_AUTO_MIGRATE, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+	cs->auto_migrate_interval = parent->auto_migrate_interval;
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
 	fmeter_init(&cs->fmeter);
Index: linux-2.6.36-mmotm-101103-1217/kernel/sched_fair.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/sched_fair.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/sched_fair.c
@@ -1454,6 +1454,14 @@ select_task_rq_fair(struct rq *rq, struc
 	int want_sd = 1;
 	int sync = wake_flags & WF_SYNC;
 
+
+	/*
+	 * short circuit balancing if this task was recently
+	 * migrated to this cpu's node.
+	 */
+	if (too_soon_for_internode_migration(p, prev_cpu))
+		return prev_cpu;
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		if (cpumask_test_cpu(cpu, &p->cpus_allowed))
 			want_affine = 1;
@@ -1782,9 +1790,10 @@ int can_migrate_task(struct task_struct
 	int tsk_cache_hot = 0;
 	/*
 	 * We do not migrate tasks that are:
-	 * 1) running (obviously), or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) are cache-hot on their current CPU.
+	 * 1) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 2) running (obviously), or
+	 * 3) too soon since last internode migration
+	 * 4) are cache-hot on their current CPU.
 	 */
 	if (!cpumask_test_cpu(this_cpu, &p->cpus_allowed)) {
 		schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
@@ -1797,6 +1806,9 @@ int can_migrate_task(struct task_struct
 		return 0;
 	}
 
+	if (too_soon_for_internode_migration(p, this_cpu))
+		return 0;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or


* [PATCH/RFC 8/11] numa - Automatic-migration - per cpuset max mapcount control
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (6 preceding siblings ...)
  2010-11-11 20:01 ` [PATCH/RFC 7/11] numa - Automatic-migration - add internode migration delay Lee Schermerhorn
@ 2010-11-11 20:02 ` Lee Schermerhorn
  2010-11-11 20:02 ` [PATCH/RFC 9/11] numa - Automatic-migration - hook to migrate on fault Lee Schermerhorn
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:02 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - add max mapcount migration threshold

This patch adds an additional per cpuset migration control that
allows one to vary the page mapcount threshold above which pages
will not be migrated by MPOL_MF_MOVE.  The default value is 1,
which yields the same behavior as before this patch.

This is useful because anon pages can be shared between ancestors
and descendants until sharing is broken by a write.  We want to
be able to unmap these pages for lazy, automigration so that the
next touch will migrate the page local to the task that touches
it.  However, we still want a threshold above which we don't
attempt to migrate the page because unmap is very expensive when
a page has a large mapcount.

We add the threshold to the task structure so that we can fetch
it using a static inline function that is redefined to return
the default value of 1 when AUTO_MIGRATION is not configured.
The max mapcount is accessed for each page proposed for migration
and we don't want to call a cpuset function and take an
rcu_lock/unlock round trip for each page.

Note:  This threshold could be configured under MIGRATE_ON_FAULT
instead of AUTO_MIGRATION or independently of either, as it is
useful for mbind() with MPOL_MF_MOVE as well.
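
The eligibility test that results can be sketched as a stand-alone
predicate.  The SKETCH_MF_* flag values and the function name are
hypothetical; only the shape of the condition follows the
migrate_page_add() change in this patch.

```c
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define SKETCH_MF_MOVE_ALL	 (1u << 0)
#define SKETCH_MF_MOVE_ANON_ONLY (1u << 1)

/*
 * A page is queued for migration when its type is acceptable and it
 * is either not shared beyond the threshold or MOVE_ALL was requested.
 */
static bool sketch_page_eligible(unsigned int flags, bool page_is_anon,
				 unsigned int mapcount,
				 unsigned int max_mapcount)
{
	bool type_ok = !(flags & SKETCH_MF_MOVE_ANON_ONLY) || page_is_anon;
	bool sharing_ok = (flags & SKETCH_MF_MOVE_ALL) ||
			  mapcount <= max_mapcount;

	return type_ok && sharing_ok;
}
```

With max_mapcount left at 1 the predicate reduces to the original
"mapcount == 1" test for any mapped page, which is why the default
preserves the old behavior.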

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/auto-migrate.h |    4 ++++
 include/linux/sched.h        |    1 +
 kernel/cpuset.c              |   42 +++++++++++++++++++++++++++++++++++++++++-
 mm/mempolicy.c               |    8 +++++---
 4 files changed, 51 insertions(+), 4 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/auto-migrate.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
@@ -77,6 +77,10 @@ static inline void check_migrate_pending
 }
 #endif /* SCHED_AUTO_MIGRATION */
 
+static inline unsigned int migrate_max_mapcount(struct task_struct *task)
+{
+	return task->migrate_max_mapcount;
+}
 #else	/* !CONFIG_AUTO_MIGRATION */
 
 static inline int is_auto_migration(int flags)
Index: linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/cpuset.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
@@ -101,6 +101,7 @@ struct cpuset {
 	struct fmeter fmeter;		/* memory_pressure filter */
 
 	unsigned long auto_migrate_interval;
+	unsigned int migrate_max_mapcount;
 
 	/* partition number for rebuild_sched_domains() */
 	int pn;
@@ -200,6 +201,7 @@ static inline int is_auto_migrate(const
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
 	.auto_migrate_interval = AUTO_MIGRATE_INTERVAL_DFLT,
+	.migrate_max_mapcount = 1,
 };
 
 /*
@@ -365,8 +367,11 @@ static void cpuset_update_task_cpuset_fl
 	if (is_auto_migrate(cs)) {
 		set_auto_migrate_enabled(tsk, 1);
 		tsk->auto_migrate_interval = cs->auto_migrate_interval;
-	} else
+		tsk->migrate_max_mapcount  = cs->migrate_max_mapcount;
+	} else {
 		set_auto_migrate_enabled(tsk, 0);
+		tsk->migrate_max_mapcount  = 1;
+	}
 
 }
 
@@ -1553,6 +1558,23 @@ static int update_auto_migrate_interval(
 	return 0;
 }
 
+/*
+ * Call with manage_mutex held.
+ */
+static int update_migrate_max_mapcount(struct cpuset *cs, u64 val)
+{
+	unsigned int n = val;
+
+	if (n == cs->migrate_max_mapcount)
+		return 0;
+
+	if (n < 1)
+		cs->migrate_max_mapcount = 1;
+	else
+		cs->migrate_max_mapcount = n;
+	return 0;
+}
+
 /* The various types of files and directories in a cpuset file system */
 
 typedef enum {
@@ -1573,6 +1595,7 @@ typedef enum {
 	FILE_MIGRATE_ON_FAULT,
 	FILE_AUTO_MIGRATE,
 	FILE_AUTO_MIGRATE_INTERVAL,
+	FILE_MIGRATE_MAX_MAPCOUNT,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1627,6 +1650,9 @@ static int cpuset_write_u64(struct cgrou
 	case FILE_AUTO_MIGRATE_INTERVAL:
 		retval = update_auto_migrate_interval(cs, val);
 		break;
+	case FILE_MIGRATE_MAX_MAPCOUNT:
+		retval = update_migrate_max_mapcount(cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1759,6 +1785,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_AUTO_MIGRATE_INTERVAL:
 		s += sprintf(s, "%ld", cs->auto_migrate_interval / HZ);
 		break;
+	case FILE_MIGRATE_MAX_MAPCOUNT:
+		s += sprintf(s, "%d", cs->migrate_max_mapcount);
+		break;
 	default:
 		retval = -EINVAL;
 		goto out;
@@ -1954,6 +1983,13 @@ static struct cftype cft_auto_migrate_in
 	.private = FILE_AUTO_MIGRATE_INTERVAL,
 };
 
+static struct cftype cft_migrate_max_mapcount = {
+	.name = "migrate_max_mapcount",
+	.read = cpuset_common_file_read,
+	.write_u64 = cpuset_write_u64,
+	.private = FILE_MIGRATE_MAX_MAPCOUNT,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -1980,6 +2016,9 @@ static int cpuset_populate(struct cgroup
 	err = add_auto_migration_file(cont, ss, &cft_auto_migrate_interval);
 	if (err < 0)
 		return err;
+	err = add_auto_migration_file(cont, ss, &cft_migrate_max_mapcount);
+	if (err < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (!cont->parent)
 		err = cgroup_add_file(cont, ss,
@@ -2064,6 +2103,7 @@ static struct cgroup_subsys_state *cpuse
 		set_bit(CS_AUTO_MIGRATE, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cs->auto_migrate_interval = parent->auto_migrate_interval;
+	cs->migrate_max_mapcount  = parent->migrate_max_mapcount;
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
 	fmeter_init(&cs->fmeter);
Index: linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/sched.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/sched.h
@@ -1464,6 +1464,7 @@ struct task_struct {
 #endif
 	unsigned long next_migrate;	/* internode migration hysteresis */
 	unsigned long auto_migrate_interval;	/* from cpuset */
+	unsigned int migrate_max_mapcount;	/* for !MPOL_MF_MOVE_ALL */
 #endif
 	atomic_t fs_excl;	/* holding fs exclusive resources */
 	struct rcu_head rcu;
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -1051,11 +1051,13 @@ static void migrate_page_add(struct page
 				unsigned long flags)
 {
 	/*
-	 * Avoid migrating a file backed page in a private mapping or
-	 * a page that is shared with others.
+	 * Avoid migrating a file backed page in a private mapping, or
+	 * a page that is shared with > 'migrate_max_mapcount' others
+	 * unless MPOL_MF_MOVE_ALL specified.
 	 */
 	if ((!(flags & MPOL_MF_MOVE_ANON_ONLY) || PageAnon(page)) &&
-		((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)) {
+		((flags & MPOL_MF_MOVE_ALL) ||
+			page_mapcount(page) <= migrate_max_mapcount(current))) {
 		if (!isolate_lru_page(page)) {
 			list_add_tail(&page->lru, pagelist);
 			inc_zone_page_state(page, NR_ISOLATED_ANON +


* [PATCH/RFC 9/11] numa - Automatic-migration - hook to migrate on fault
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (7 preceding siblings ...)
  2010-11-11 20:02 ` [PATCH/RFC 8/11] numa - Automatic-migration - per cpuset max mapcount control Lee Schermerhorn
@ 2010-11-11 20:02 ` Lee Schermerhorn
  2010-11-11 20:02 ` [PATCH/RFC 10/11] numa - Automatic-migration - per proc automigrate kick control Lee Schermerhorn
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:02 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - hook automigration to migrate-on-fault

Add a per cpuset control--auto_migrate_lazy--to use migrate-on-fault
for auto-migration, if configured.

Modify migrate_to_node() to just unmap the eligible pages
via migrate_pages_unmap_only() when MPOL_MF_LAZY flag is set.

Set auto_migrate_lazy by default in the top cpuset.  Lazy
automigration is preferred.  Why?  Think of the effect of
direct, auto-migration on a multithreaded process.  [Perhaps
I should change this flag to 'auto_migrate_direct' and default
that to disabled?]

This patch depends on the "migrate-on-fault" patch series that
defines the MPOL_MF_LAZY flag and the migrate_pages_unmap_only()
function.
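
The difference between direct and lazy migration can be shown with a
toy model.  Everything here is hypothetical:  struct sketch_page and
the sketch_* functions only imitate the behavior described above and
are not the interfaces from the migrate-on-fault series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy page:  which node holds it, and whether the task has it mapped. */
struct sketch_page {
	int  node;
	bool mapped;
};

/*
 * Direct mode moves every off-node page immediately.  Lazy mode only
 * unmaps those pages, deferring the move to the next touch.
 */
static void sketch_migrate_to_node(struct sketch_page *pages, size_t n,
				   int dest, bool lazy)
{
	for (size_t i = 0; i < n; i++) {
		if (pages[i].node == dest)
			continue;
		if (lazy)
			pages[i].mapped = false;
		else
			pages[i].node = dest;
	}
}

/* A touch of an unmapped page faults and pulls it to the toucher's node. */
static void sketch_touch(struct sketch_page *p, int faulting_node)
{
	if (!p->mapped) {
		p->node = faulting_node;
		p->mapped = true;
	}
}
```

In lazy mode a page that is never touched again is never copied, and
in a multithreaded process each page ends up local to whichever
thread touches it next rather than to the thread that happened to be
rescheduled.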

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/auto-migrate.h |   19 ++++++++++++++++++
 kernel/cpuset.c              |   44 ++++++++++++++++++++++++++++++++++++++++++-
 mm/mempolicy.c               |    8 ++++++-
 3 files changed, 69 insertions(+), 2 deletions(-)

Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -1092,7 +1092,10 @@ static int migrate_to_node(struct mm_str
 		return PTR_ERR(vma);
 
 	if (!list_empty(&pagelist)) {
-		err = migrate_pages(&pagelist, new_node_page, dest, 0);
+		if (is_lazy_migration(flags))
+			err = migrate_pages_unmap_only(&pagelist);
+		else
+			err = migrate_pages(&pagelist, new_node_page, dest, 0);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1260,6 +1263,9 @@ void auto_migrate_task_memory(void)
 	 */
 	BUG_ON(!mm);
 
+	if (auto_migrate_lazy(current))
+		set_lazy_migration(flags);
+
 	/*
 	 * Pass destination node as source node plus 'INVERT flag:
 	 *    Migrate all pages NOT on destination node.
Index: linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/auto-migrate.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/auto-migrate.h
@@ -81,6 +81,17 @@ static inline unsigned int migrate_max_m
 {
 	return task->migrate_max_mapcount;
 }
+
+extern unsigned int auto_migrate_lazy(struct task_struct *);
+
+#ifdef MPOL_MF_LAZY
+#define is_lazy_migration(F) ((F) & MPOL_MF_LAZY)
+#define set_lazy_migration(F) (F) |= MPOL_MF_LAZY
+#else
+#define is_lazy_migration(F) (0)
+#define set_lazy_migration(F)
+#endif
+
 #else	/* !CONFIG_AUTO_MIGRATION */
 
 static inline int is_auto_migration(int flags)
@@ -105,6 +116,14 @@ static inline int too_soon_for_internode
 	return 0;
 }
 
+static inline unsigned int auto_migrate_lazy(struct task_struct *task)
+{
+	return 0;
+}
+
+#define is_lazy_migration(F) (0)
+#define set_lazy_migration(F)
+
 #endif	/* CONFIG_AUTO_MIGRATION */
 
 #endif
Index: linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/kernel/cpuset.c
+++ linux-2.6.36-mmotm-101103-1217/kernel/cpuset.c
@@ -140,6 +140,7 @@ typedef enum {
  	CS_SHARED_FILE_POLICY,
 	CS_MIGRATE_ON_FAULT,
 	CS_AUTO_MIGRATE,
+	CS_LAZY_MIGRATE,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -198,8 +199,14 @@ static inline int is_auto_migrate(const
 	return test_bit(CS_AUTO_MIGRATE,  &cs->flags);
 }
 
+static inline int is_auto_migrate_lazy(const struct cpuset *cs)
+{
+	return test_bit(CS_LAZY_MIGRATE,  &cs->flags);
+}
+
 static struct cpuset top_cpuset = {
-	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
+	.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE) |
+		  (1 << CS_LAZY_MIGRATE)),
 	.auto_migrate_interval = AUTO_MIGRATE_INTERVAL_DFLT,
 	.migrate_max_mapcount = 1,
 };
@@ -1596,6 +1603,7 @@ typedef enum {
 	FILE_AUTO_MIGRATE,
 	FILE_AUTO_MIGRATE_INTERVAL,
 	FILE_MIGRATE_MAX_MAPCOUNT,
+	FILE_AUTO_MIGRATE_LAZY,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1653,6 +1661,9 @@ static int cpuset_write_u64(struct cgrou
 	case FILE_MIGRATE_MAX_MAPCOUNT:
 		retval = update_migrate_max_mapcount(cs, val);
 		break;
+	case FILE_AUTO_MIGRATE_LAZY:
+		retval = update_flag(CS_LAZY_MIGRATE, cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1831,6 +1842,8 @@ static u64 cpuset_read_u64(struct cgroup
 		return is_migrate_on_fault(cs);
 	case FILE_AUTO_MIGRATE:
 		return is_auto_migrate(cs);
+	case FILE_AUTO_MIGRATE_LAZY:
+		return is_auto_migrate_lazy(cs);
 	default:
 		BUG();
 	}
@@ -1990,6 +2003,13 @@ static struct cftype cft_migrate_max_map
 	.private = FILE_MIGRATE_MAX_MAPCOUNT,
 };
 
+static struct cftype cft_auto_migrate_lazy = {
+	.name = "auto_migrate_lazy",
+	.read_u64 = cpuset_read_u64,
+	.write_u64 = cpuset_write_u64,
+	.private = FILE_AUTO_MIGRATE_LAZY,
+};
+
 static int cpuset_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
 	int err;
@@ -2019,6 +2039,9 @@ static int cpuset_populate(struct cgroup
 	err = add_auto_migration_file(cont, ss, &cft_migrate_max_mapcount);
 	if (err < 0)
 		return err;
+	err = add_auto_migration_file(cont, ss, &cft_auto_migrate_lazy);
+	if (err < 0)
+		return err;
 	/* memory_pressure_enabled is in root cpuset only */
 	if (!cont->parent)
 		err = cgroup_add_file(cont, ss,
@@ -2101,6 +2124,8 @@ static struct cgroup_subsys_state *cpuse
 		set_bit(CS_MIGRATE_ON_FAULT, &cs->flags);
 	if (is_auto_migrate(parent))
 		set_bit(CS_AUTO_MIGRATE, &cs->flags);
+	if (is_auto_migrate_lazy(parent))
+		set_bit(CS_LAZY_MIGRATE, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cs->auto_migrate_interval = parent->auto_migrate_interval;
 	cs->migrate_max_mapcount  = parent->migrate_max_mapcount;
@@ -2874,3 +2899,20 @@ void cpuset_task_status_allowed(struct s
 	seq_nodemask_list(m, &task->mems_allowed);
 	seq_printf(m, "\n");
 }
+
+#ifdef CONFIG_AUTO_MIGRATION
+unsigned int auto_migrate_lazy(struct task_struct *task)
+{
+	unsigned int lazy;
+
+	if (task_cs(current) == &top_cpuset) {
+		/* Don't need rcu for top_cpuset.  It's never freed. */
+		lazy = is_auto_migrate_lazy(&top_cpuset);
+	} else {
+		rcu_read_lock();
+		lazy = is_auto_migrate_lazy(task_cs(current));
+		rcu_read_unlock();
+	}
+	return lazy;
+}
+#endif


* [PATCH/RFC 10/11] numa - Automatic-migration - per proc automigrate kick control
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (8 preceding siblings ...)
  2010-11-11 20:02 ` [PATCH/RFC 9/11] numa - Automatic-migration - hook to migrate on fault Lee Schermerhorn
@ 2010-11-11 20:02 ` Lee Schermerhorn
  2010-11-11 20:02 ` [PATCH/RFC 11/11] numa - Automatic-migration - add statistics Lee Schermerhorn
  2010-11-14  6:42 ` [PATCH/RFC 0/11] numa - Automatic-migration KOSAKI Motohiro
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:02 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - add /proc/<tgid>/migrate

Add a "migrate" control file to per process /proc subdir to allow
external trigger of task auto [==self] migration.  When lazy
auto-migration is enabled, this effectively resets all unmappable
pages so that the next touch will cause a migrate on fault, if
the page is remote from the faulting task.  This allows one to
"poke" a task externally to force "re-affinitization on next
touch" independent of inter-node migration.

On read, show current value of task's "migrate_pending".  Nearly
useless, but I wanted to avoid a "write-only" file.

On write, if value is non-zero, set migrate_pending to 1 and set
the TIF_NOTIFY_RESUME thread info flag to cause task to handle the
pending migration.  If value is zero, clear migrate_pending--also not
very useful, but falls out of the code.  Don't bother to reset the
thread info flag when clearing migrate_pending -- just being lazy.
It's a no-op in this case as far as auto-migration, but might have
been set by something else.
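
From user space, poking a task might look like the sketch below.  It
assumes a kernel built with this series and CONFIG_AUTO_MIGRATION so
that /proc/<tgid>/migrate exists; the helper names are hypothetical.
On any other kernel the open simply fails.

```c
#include <stdio.h>

/* Build "/proc/<pid>/migrate"; returns 0 on success, -1 if buf is too small. */
static int sketch_migrate_path(char *buf, size_t len, long pid)
{
	int n = snprintf(buf, len, "/proc/%ld/migrate", pid);

	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Write a non-zero value to request migration; 0 on success, -1 on error. */
static int sketch_kick_migration(long pid)
{
	char path[64];
	FILE *f;
	int ok;

	if (sketch_migrate_path(path, sizeof(path), pid) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such task, no permission, or no support */

	ok = fputs("1\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

The file is created with S_IRUGO|S_IWUSR, so the write is expected to
succeed only for the task's owner or a suitably privileged user.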

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 fs/proc/base.c     |    6 ++++
 fs/proc/internal.h |    2 +
 fs/proc/task_mmu.c |   70 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 78 insertions(+)

Index: linux-2.6.36-mmotm-101103-1217/fs/proc/base.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/proc/base.c
+++ linux-2.6.36-mmotm-101103-1217/fs/proc/base.c
@@ -2743,6 +2743,9 @@ static const struct pid_entry tgid_base_
 	REG("maps",       S_IRUGO, proc_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps",  S_IRUGO, proc_numa_maps_operations),
+#ifdef CONFIG_AUTO_MIGRATION
+	REG("migrate",    S_IRUGO|S_IWUSR, proc_migrate_operations),
+#endif
 #endif
 	REG("mem",        S_IRUSR|S_IWUSR, proc_mem_operations),
 	LNK("cwd",        proc_cwd_link),
@@ -3080,6 +3083,9 @@ static const struct pid_entry tid_base_s
 	REG("maps",      S_IRUGO, proc_maps_operations),
 #ifdef CONFIG_NUMA
 	REG("numa_maps", S_IRUGO, proc_numa_maps_operations),
+#ifdef CONFIG_AUTO_MIGRATION
+	REG("migrate",   S_IRUGO|S_IWUSR, proc_migrate_operations),
+#endif
 #endif
 	REG("mem",       S_IRUSR|S_IWUSR, proc_mem_operations),
 	LNK("cwd",       proc_cwd_link),
Index: linux-2.6.36-mmotm-101103-1217/fs/proc/internal.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/proc/internal.h
+++ linux-2.6.36-mmotm-101103-1217/fs/proc/internal.h
@@ -59,6 +59,8 @@ extern const struct file_operations proc
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
 extern const struct file_operations proc_net_operations;
+extern const struct file_operations proc_migrate_operations;
+
 extern const struct inode_operations proc_net_inode_operations;
 
 void proc_init_inodecache(void);
Index: linux-2.6.36-mmotm-101103-1217/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/fs/proc/task_mmu.c
+++ linux-2.6.36-mmotm-101103-1217/fs/proc/task_mmu.c
@@ -1016,4 +1016,74 @@ const struct file_operations proc_numa_m
 	.llseek		= seq_lseek,
 	.release	= seq_release_private,
 };
+
+#ifdef CONFIG_AUTO_MIGRATION
+/*
+ * read/write task's "migrate_pending" flag.
+ * on write, set TIF_NOTIFY_RESUME thread info flag so that task
+ * will handle "migrate_pending" on next return to user space --
+ * no later than next clock tick.
+ */
+static ssize_t proc_migrate_read(struct file *file, char __user *buf,
+				  size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF];
+	size_t len;
+	int migpend;
+	loff_t __ppos = *ppos;
+
+	task = get_proc_task(file->f_dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	migpend = task->migrate_pending;
+	put_task_struct(task);
+
+	len = snprintf(buffer, sizeof(buffer), "%i\n", migpend);
+	if (__ppos >= len)
+		return 0;
+	if (count > len-__ppos)
+		count = len-__ppos;
+	if (copy_to_user(buf, buffer + __ppos, count))
+		return -EFAULT;
+	*ppos = __ppos + count;
+	return count;
+}
+
+static ssize_t proc_migrate_write(struct file *file, const char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char buffer[PROC_NUMBUF], *end;
+	int migpend;
+
+	if (!capable(CAP_SYS_RESOURCE))
+		return -EPERM;
+	memset(buffer, 0, sizeof(buffer));
+	if (count > sizeof(buffer) - 1)
+		count = sizeof(buffer) - 1;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+	migpend = simple_strtol(buffer, &end, 0);
+	if (*end == '\n')
+		end++;
+	if (end - buffer == 0)
+		return -EIO;
+
+	task = get_proc_task(file->f_dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+	task->migrate_pending = !!migpend;
+	if (migpend)
+		set_tsk_thread_flag(task, TIF_NOTIFY_RESUME);
+	put_task_struct(task);
+
+	return end - buffer;
+}
+
+const struct file_operations proc_migrate_operations = {
+	.read 		= proc_migrate_read,
+	.write		= proc_migrate_write,
+};
+#endif
 #endif


* [PATCH/RFC 11/11] numa - Automatic-migration - add statistics
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (9 preceding siblings ...)
  2010-11-11 20:02 ` [PATCH/RFC 10/11] numa - Automatic-migration - per proc automigrate kick control Lee Schermerhorn
@ 2010-11-11 20:02 ` Lee Schermerhorn
  2010-11-14  6:42 ` [PATCH/RFC 0/11] numa - Automatic-migration KOSAKI Motohiro
  11 siblings, 0 replies; 13+ messages in thread
From: Lee Schermerhorn @ 2010-11-11 20:02 UTC (permalink / raw
  To: linux-numa
  Cc: akpm, Mel Gorman, cl, Nick Piggin, Hugh Dickins, KOSAKI Motohiro,
	andi, David Rientjes, Avi Kivity, Andrea Arcangeli

AutoPage Migration - add vmstats

Add vmstats for auto-migration:

+ automig_tasks_migrated - number of times we scan a task's
  address space to unmap pages controlled by local allocation
  policy.
+ automig_pgs_scanned - the total number of pages scanned
  [by check_range()] for automigration.
+ automig_pgs_selected - the total number of pages selected
  [by migrate_page_add()] for auto-migration.
+ automig_pgs_failed - the number of [selected] pages that
  we were not able to unmap/migrate.

We can compute the number of pages successfully unmapped or
migrated from the last two stats above.
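
As a sketch of that computation, the following user-space helper derives
the count from /proc/vmstat-style text ("name value" per line).  The
function names are hypothetical, and it assumes automig_pgs_failed never
exceeds automig_pgs_selected.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up "name value" in /proc/vmstat-style text.
 * Returns 0 and stores the value, or -1 if the counter is absent.
 * (Hypothetical helper, not part of the patch.)
 */
static int vmstat_lookup(const char *text, const char *name,
			 unsigned long *val)
{
	size_t nlen = strlen(name);
	const char *p = text;

	while (p && *p) {
		/* p always points at the start of a line */
		if (!strncmp(p, name, nlen) && p[nlen] == ' ')
			return sscanf(p + nlen, "%lu", val) == 1 ? 0 : -1;
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/* Pages successfully unmapped or migrated = selected - failed. */
static int automig_pgs_ok(const char *text, unsigned long *ok)
{
	unsigned long selected, failed;

	if (vmstat_lookup(text, "automig_pgs_selected", &selected) ||
	    vmstat_lookup(text, "automig_pgs_failed", &failed))
		return -1;
	*ok = selected - failed;
	return 0;
}
```

The caller is expected to read /proc/vmstat into a NUL-terminated buffer
first; -1 means the counters are missing, for example because
CONFIG_AUTO_MIGRATION is not set.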

Also, see the migrate-on-fault vmstats for information on
subsequent lazy page migrations.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/vmstat.h |    4 ++++
 mm/mempolicy.c         |   12 +++++++++++-
 mm/vmstat.c            |    6 ++++++
 3 files changed, 21 insertions(+), 1 deletion(-)

Index: linux-2.6.36-mmotm-101103-1217/include/linux/vmstat.h
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/include/linux/vmstat.h
+++ linux-2.6.36-mmotm-101103-1217/include/linux/vmstat.h
@@ -61,6 +61,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 #ifdef CONFIG_MIGRATE_ON_FAULT
 		PGLOCCHECK, PGMISPLACED, PGMIGRATED,
 #endif
+#ifdef CONFIG_AUTO_MIGRATION
+		AUTOMIG_TASKS_MIGRATED,
+		AUTOMIG_PGSCANNED, AUTOMIG_PGSELECTED, AUTOMIG_PGFAILED,
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
Index: linux-2.6.36-mmotm-101103-1217/mm/vmstat.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/vmstat.c
+++ linux-2.6.36-mmotm-101103-1217/mm/vmstat.c
@@ -869,6 +869,12 @@ static const char * const vmstat_text[]
 	"pgmisplaced",
 	"pgmigrated",
 #endif
+#ifdef CONFIG_AUTO_MIGRATION
+	"automig_tasks_migrated",
+	"automig_pgs_scanned",
+	"automig_pgs_selected",
+	"automig_pgs_failed",
+#endif
 #endif
 };
 
Index: linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
===================================================================
--- linux-2.6.36-mmotm-101103-1217.orig/mm/mempolicy.c
+++ linux-2.6.36-mmotm-101103-1217/mm/mempolicy.c
@@ -1050,6 +1050,9 @@ static long do_get_mempolicy(int *policy
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
 				unsigned long flags)
 {
+	if (is_auto_migration(flags))
+		count_vm_event(AUTOMIG_PGSCANNED);
+
 	/*
 	 * Avoid migrating a file backed page in a private mapping, or
 	 * a page that is shared with > 'migrate_max_mapcount' others
@@ -1062,6 +1065,8 @@ static void migrate_page_add(struct page
 			list_add_tail(&page->lru, pagelist);
 			inc_zone_page_state(page, NR_ISOLATED_ANON +
 					    page_is_file_cache(page));
+			if (is_auto_migration(flags))
+				count_vm_event(AUTOMIG_PGSELECTED);
 		}
 	}
 }
@@ -1096,8 +1101,12 @@ static int migrate_to_node(struct mm_str
 			err = migrate_pages_unmap_only(&pagelist);
 		else
 			err = migrate_pages(&pagelist, new_node_page, dest, 0);
-		if (err)
+		if (err) {
 			putback_lru_pages(&pagelist);
+
+			if (is_auto_migration(flags))
+				count_vm_events(AUTOMIG_PGFAILED, err);
+		}
 	}
 
 	return err;
@@ -1262,6 +1271,7 @@ void auto_migrate_task_memory(void)
 	 * we're returning to user space, so mm must be non-NULL
 	 */
 	BUG_ON(!mm);
+	count_vm_event(AUTOMIG_TASKS_MIGRATED);
 
 	if (auto_migrate_lazy(current))
 		set_lazy_migration(flags);


* Re: [PATCH/RFC 0/11] numa - Automatic-migration
  2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
                   ` (10 preceding siblings ...)
  2010-11-11 20:02 ` [PATCH/RFC 11/11] numa - Automatic-migration - add statistics Lee Schermerhorn
@ 2010-11-14  6:42 ` KOSAKI Motohiro
  11 siblings, 0 replies; 13+ messages in thread
From: KOSAKI Motohiro @ 2010-11-14  6:42 UTC (permalink / raw
  To: Lee Schermerhorn
  Cc: kosaki.motohiro, linux-numa, akpm, Mel Gorman, cl, Nick Piggin,
	Hugh Dickins, andi, David Rientjes, Avi Kivity, Andrea Arcangeli

> This series of patches hooks up linux page migration to the task load
> balancing mechanism.  The effect is such that, when load balancing moves
> a task to a cpu on a different node from where the task last executed,
> the task is notified of this change using a variant of the mechanism used
> to notify a task of pending signals.  When the task returns to user state,
> it attempts to migrate, to the new node, any pages not already on that
> node in those of the task's vm areas under control of default policy.
> 
> By default, the task will use lazy migration to migrate "misplaced"
> pages.  When notified of an inter-node migration, the task will
> walk its address space, attempting to unmap [remove all ptes] any
> anonymous pages in the task's page table.  When the task subsequently
> touches any of these unmapped pages, it will incur a swap page
> fault.  The swap fault handler will restore the pte if the
> cached page's location matches its mempolicy; otherwise the
> "migrate-on-fault" mechanism will attempt to migrate the page to
> the correct node.
> 
> Lazy migration may be disabled by writing zero to the per cpuset
> auto_migrate_lazy file.  In that case, automigration will use
> direct, synchronous migration to pull all anonymous pages mapped
> by the task to the new node.
> 
> 	Why lazy migration by default?  Think of the effect
> 	of direct, synchronous migration, in this context,
> 	on large multi-threaded programs.
> 
> Automatic page migration is disabled by default, but can be enabled by
> writing non-zero to the per cpuset auto_migrate_enable file.
> Furthermore, to prevent thrashing, this series provides a second,
> experimental per cpuset control, auto_migrate_interval.  The load
> balancer will not move a task to a different node if it has moved to a
> new node in the last auto_migrate_interval seconds.  [User interface
> is in seconds; internally it's in HZ.]  The idea is to give the task
> time to amortize the cost of the migration by giving it time to
> benefit from local references to the page.  Some experimenting and
> tuning will be necessary to determine the appropriate default value
> for this parameter on various platforms.
> 
> An additional per cpuset control -- migrate_max_mapcount -- adjusts
> the threshold page mapcount at which non-privileged users can migrate
> shared pages.  This control allows experimentation with more aggressive
> auto-migration.
> 
> Why "per cpuset controls"?  Originally, cpusets was the only convenient
> "soft partitioning" or "task grouping" mechanism available.  Now that
> "containers" or "control groups" are available, one might consider
> a "NUMA behavior" control group, orthogonal to cpusets, to control this
> sort of behavior.  However, because cpusets are closely tied to NUMA resource
> partitioning and locality management, it still seems like a good place to
> contain the migration and mempolicy behavior controls.
> 
> Finally, the series adds a per process control file -- /proc/<pid>/migrate.
> Writing to this file causes the task to simulate an internode migration
> by walking its address space and unmapping anonymous pages so that they
> will be checked for [mis]placement on next touch; or by directly migrating
> them if lazy migration is disabled for the task's cpuset.  This can be
> used to test the automigration facility or to force a task to reestablish
> its anonymous page NUMA footprint at any time.

If I remember correctly, you presented this feature at a conference and
you have presentation slides, right?
If so, can you please tell me the URL of the presentation?  I'd like to
understand the background of the patch.

Thanks.




Thread overview: 13+ messages
2010-11-11 20:01 [PATCH/RFC 0/11] numa - Automatic-migration Lee Schermerhorn
2010-11-11 20:01 ` [PATCH/RFC 1/11] numa - Automatic-migration - preparation, cleanup Lee Schermerhorn
2010-11-11 20:01 ` [PATCH/RFC 2/11] numa - Automatic-migration - per cpuset automigration control Lee Schermerhorn
2010-11-11 20:01 ` [PATCH/RFC 3/11] numa - Automatic-migration - check notify migrate pending Lee Schermerhorn
2010-11-11 20:01 ` [PATCH/RFC 4/11] numa - Automatic-migration - ia64 " Lee Schermerhorn
2010-11-11 20:01 ` [PATCH/RFC 5/11] numa - Automatic-migration - x86_64 " Lee Schermerhorn
2010-11-11 20:01 ` [PATCH/RFC 6/11] numa - Automatic-migration - hook to scheduler inter-node migration Lee Schermerhorn
2010-11-11 20:01 ` [PATCH/RFC 7/11] numa - Automatic-migration - add internode migration delay Lee Schermerhorn
2010-11-11 20:02 ` [PATCH/RFC 8/11] numa - Automatic-migration - per cpuset max mapcount control Lee Schermerhorn
2010-11-11 20:02 ` [PATCH/RFC 9/11] numa - Automatic-migration - hook to migrate on fault Lee Schermerhorn
2010-11-11 20:02 ` [PATCH/RFC 10/11] numa - Automatic-migration - per proc automigrate kick control Lee Schermerhorn
2010-11-11 20:02 ` [PATCH/RFC 11/11] numa - Automatic-migration - add statistics Lee Schermerhorn
2010-11-14  6:42 ` [PATCH/RFC 0/11] numa - Automatic-migration KOSAKI Motohiro
