* NUMA scheduler BK tree
@ 2002-11-06 16:34 Erich Focht
  2002-11-06 18:10 ` Michael Hohnbaum
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-06 16:34 UTC (permalink / raw)
  To: Michael Hohnbaum, Martin J. Bligh
  Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
	linux-kernel

Michael, Martin,

in order to make it easier to keep up with the main Linux tree I've
set up a bitkeeper repository with our NUMA scheduler at
       bk://numa-ef.bkbits.net/numa-sched
(Web view:  http://numa-ef.bkbits.net/)
This used to contain my node affine NUMA scheduler; I'll add extra
trees when the additional patches for that are tested on top of our
NUMA scheduler.

Is it ok for you to have it this way or would you prefer having the
core and the initial load balancer separate?

The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
I'll try to keep it that way.

Regards,
Erich


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: NUMA scheduler BK tree
  2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
@ 2002-11-06 18:10 ` Michael Hohnbaum
  2002-11-07 23:05   ` Erich Focht
  2002-11-07 23:46 ` Michael Hohnbaum
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2002-11-06 18:10 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
	Stephen Hemminger, linux-kernel

On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> Michael, Martin,
> 
> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
>        bk://numa-ef.bkbits.net/numa-sched
> (Web view:  http://numa-ef.bkbits.net/)
> This used to contain my node affine NUMA scheduler; I'll add extra
> trees when the additional patches for that are tested on top of our
> NUMA scheduler.
> 
> Is it ok for you to have it this way or would you prefer having the
> core and the initial load balancer separate?
> 
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep it that way.

Erich,

This is fine with me.  Can't the core changes and load
balancer be maintained as separate changesets within the bk
tree?  

            Michael

> Regards,
> Erich
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: NUMA scheduler BK tree
  2002-11-06 18:10 ` Michael Hohnbaum
@ 2002-11-07 23:05   ` Erich Focht
  0 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-07 23:05 UTC (permalink / raw)
  To: Michael Hohnbaum
  Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
	Stephen Hemminger, linux-kernel

On Wednesday 06 November 2002 19:10, Michael Hohnbaum wrote:
> > Is it ok for you to have it this way or would you prefer having the
> > core and the initial load balancer separate?
>
> This is fine with me.  Can't the core changes and load
> balancer be maintained as separate changesets within the bk
> tree?

OK, I'll do that. Any idea how I can apply a changeset which has
someone else's name attached to it?

Regards,
Erich


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: NUMA scheduler BK tree
  2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
  2002-11-06 18:10 ` Michael Hohnbaum
@ 2002-11-07 23:46 ` Michael Hohnbaum
  2002-11-08 16:57   ` Erich Focht
  2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
  2002-11-18 19:40 ` NUMA scheduler BK tree Martin J. Bligh
  3 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2002-11-07 23:46 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
	Stephen Hemminger, linux-kernel

On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> Michael, Martin,
> 
> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
>        bk://numa-ef.bkbits.net/numa-sched
> (Web view:  http://numa-ef.bkbits.net/)
> 
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep it that way.
> 
Tested this on a 4 node NUMAQ.  Worked fine.  Results:

$reportbench stock46 sched46 
Kernbench:
                             Elapsed        User      System         CPU
                 stock46      20.66s    194.062s      53.39s     1197.4%
                 sched46     19.988s    191.302s     50.692s     1210.4%

Schedbench 4:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 stock46       27.27       40.64      109.13        0.85
                 sched46       23.10       41.32       92.42        0.76

Schedbench 8:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 stock46       39.18       55.12      313.56        1.68
                 sched46       34.45       51.24      275.63        2.28

Schedbench 16:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 stock46       56.39       72.44      902.45        5.12
                 sched46       56.73       71.31      907.88        4.19

Schedbench 32:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 stock46       90.47      203.28     2895.41       10.39
                 sched46       60.95      143.21     1950.72       10.31

Schedbench 64:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 stock46      105.00      439.04     6720.90       25.02
                 sched46       59.07      262.98     3781.06       19.59

The schedbench runs were run once each.  Kernbench is the average of
5 runs.

          Michael

> Regards,
> Erich
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: NUMA scheduler BK tree
  2002-11-07 23:46 ` Michael Hohnbaum
@ 2002-11-08 16:57   ` Erich Focht
  0 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-08 16:57 UTC (permalink / raw)
  To: Michael Hohnbaum
  Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
	Stephen Hemminger, linux-kernel

Michael,

thanks for the results, they look really good!

There's a problem with your script for Schedbench: the column labels
are permuted! Instead of
>                              Elapsed   TotalUser    TotalSys     AvgUser
you should have
>                              AvgUser   Elapsed      TotalUser    TotalSys

The numbers make more sense that way ;-) For example, the Schedbench 4
stock46 row then reads AvgUser=27.27, Elapsed=40.64, TotalUser=109.13,
TotalSys=0.85, and 4 x 27.27 lines up with the 109.13 total user time.
Could you please send me your script for schedbench?

BTW: the bk repository now has two distinct changesets, one for the
"core" and the other for the initial balancing. Larry McVoy showed me
how to do things right with the names assigned to the changesets.

Thanks!

Regards,
Erich


On Friday 08 November 2002 00:46, Michael Hohnbaum wrote:
> On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> > in order to make it easier to keep up with the main Linux tree I've
> > set up a bitkeeper repository with our NUMA scheduler at
> >        bk://numa-ef.bkbits.net/numa-sched
> > (Web view:  http://numa-ef.bkbits.net/)
...
> Tested this on a 4 node NUMAQ.  Worked fine.  Results:
>
> $reportbench stock46 sched46
> Kernbench:
>                              Elapsed        User      System         CPU
>                  stock46      20.66s    194.062s      53.39s     1197.4%
>                  sched46     19.988s    191.302s     50.692s     1210.4%
>
> Schedbench 4:
>                              Elapsed   TotalUser    TotalSys     AvgUser
>                  stock46       27.27       40.64      109.13        0.85
>                  sched46       23.10       41.32       92.42        0.76
>
> Schedbench 8:
>                              Elapsed   TotalUser    TotalSys     AvgUser
>                  stock46       39.18       55.12      313.56        1.68
>                  sched46       34.45       51.24      275.63        2.28
>
> Schedbench 16:
>                              Elapsed   TotalUser    TotalSys     AvgUser
>                  stock46       56.39       72.44      902.45        5.12
>                  sched46       56.73       71.31      907.88        4.19
>
> Schedbench 32:
>                              Elapsed   TotalUser    TotalSys     AvgUser
>                  stock46       90.47      203.28     2895.41       10.39
>                  sched46       60.95      143.21     1950.72       10.31
>
> Schedbench 64:
>                              Elapsed   TotalUser    TotalSys     AvgUser
>                  stock46      105.00      439.04     6720.90       25.02
>                  sched46       59.07      262.98     3781.06       19.59
>
> The schedbench runs were run once each.  Kernbench is the average of
> 5 runs.
>
>           Michael


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.47] NUMA scheduler (1/2)
  2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
  2002-11-06 18:10 ` Michael Hohnbaum
  2002-11-07 23:46 ` Michael Hohnbaum
@ 2002-11-11 15:13 ` Erich Focht
  2002-11-11 15:14   ` [PATCH 2.5.47] NUMA scheduler (2/2) Erich Focht
  2002-11-12  0:24   ` [PATCH 2.5.47] NUMA scheduler (1/2) Michael Hohnbaum
  2002-11-18 19:40 ` NUMA scheduler BK tree Martin J. Bligh
  3 siblings, 2 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-11 15:13 UTC (permalink / raw)
  To: Michael Hohnbaum, Martin J. Bligh
  Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 496 bytes --]

Hi,

here come the NUMA scheduler patches for 2.5.47 which emerged from
Michael's and my work on NUMA schedulers. As usual in two parts:

01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
  providing a node aware load_balancer

02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
  least loaded node & CPU at exec()

An up-to-date BK tree containing the whole NUMA scheduler can be found
at bk://numa-ef.bkbits.net/numa-sched .
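
Roughly, the exec()-time balancing in the second patch picks the least
loaded node first and then the least loaded CPU inside it. The standalone
sketch below (plain C with made-up runqueue lengths, not part of the
patch) only illustrates that selection; the real sched_best_cpu() /
sched_balance_exec() read the node load from the node_nr_running[]
atomics, return early if the current runqueue holds at most two tasks,
and move the task with the set_cpus_allowed() trick.

#include <stdio.h>

#define NODES          4
#define CPUS_PER_NODE  4

/* made-up per-CPU runqueue lengths (rq->nr_running in the patch) */
static int rq_len[NODES][CPUS_PER_NODE] = {
	{ 3, 2, 4, 3 },		/* node 0: total 12 */
	{ 1, 0, 2, 1 },		/* node 1: total  4, least loaded */
	{ 2, 2, 3, 2 },		/* node 2: total  9 */
	{ 4, 3, 3, 2 },		/* node 3: total 12 */
};

/* simplified sched_best_cpu(): least loaded node, then least loaded CPU */
static void best_cpu(int *node_out, int *cpu_out)
{
	int n, c, load, minload, node = 0, cpu = 0;

	minload = 10000000;
	for (n = 0; n < NODES; n++) {
		load = 0;
		for (c = 0; c < CPUS_PER_NODE; c++)
			load += rq_len[n][c];
		if (load < minload) {
			minload = load;
			node = n;
		}
	}
	minload = 10000000;
	for (c = 0; c < CPUS_PER_NODE; c++) {
		if (rq_len[node][c] < minload) {
			minload = rq_len[node][c];
			cpu = c;
		}
	}
	*node_out = node;
	*cpu_out = cpu;
}

int main(void)
{
	int node, cpu;

	best_cpu(&node, &cpu);
	/* prints: exec'd task goes to node 1, cpu 1 */
	printf("exec'd task goes to node %d, cpu %d\n", node, cpu);
	return 0;
}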

Regards,
Erich

[-- Attachment #2: 01-numa-sched-core-2.5.47-21.patch --]
[-- Type: text/x-diff, Size: 12567 bytes --]

diff -Nru a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	Mon Nov 11 13:33:22 2002
+++ b/arch/i386/kernel/smpboot.c	Mon Nov 11 13:33:22 2002
@@ -1196,6 +1196,9 @@
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
 
 void __init smp_intr_init()
diff -Nru a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c	Mon Nov 11 13:33:22 2002
+++ b/arch/ia64/kernel/smpboot.c	Mon Nov 11 13:33:22 2002
@@ -397,7 +397,7 @@
 static void
 smp_tune_scheduling (void)
 {
-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
 
 	printk("task migration cache decay timeout: %ld msecs.\n",
 	       (cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,9 @@
 
 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
 
 int __devinit
diff -Nru a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c	Mon Nov 11 13:33:22 2002
+++ b/arch/ppc64/kernel/smp.c	Mon Nov 11 13:33:22 2002
@@ -679,4 +679,7 @@
 
 	/* XXX fix this, xics currently relies on it - Anton */
 	smp_threads_ready = 1;
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Mon Nov 11 13:33:22 2002
+++ b/include/linux/sched.h	Mon Nov 11 13:33:22 2002
@@ -20,6 +20,7 @@
 #include <asm/mmu.h>
 
 #include <linux/smp.h>
+#include <asm/topology.h>
 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -449,6 +462,9 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void build_pools(void);
+#endif
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Mon Nov 11 13:33:22 2002
+++ b/kernel/sched.c	Mon Nov 11 13:33:22 2002
@@ -153,6 +153,10 @@
 	struct list_head migration_queue;
 
 	atomic_t nr_iowait;
+
+	unsigned long wait_time;
+	int wait_node;
+
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -177,6 +182,54 @@
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif
 
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node)  _node_nr_cpus[node]
+
+#define NODE_DELAY_IDLE  (1*HZ/1000)
+#define NODE_DELAY_BUSY  (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+	for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way.     [EF]
+ */
+void build_node_data(void)
+{
+	int n, cpu, ptr;
+	unsigned long mask;
+
+	ptr=0;
+	for (n=0; n<numnodes; n++) {
+		mask = __node_to_cpu_mask(n) & cpu_online_map;
+		node_ptr[n] = ptr;
+		for (cpu=0; cpu<NR_CPUS; cpu++)
+			if (mask  & (1UL << cpu))
+				ptr++;
+		node_ncpus(n) = ptr - node_ptr[n];;
+	}
+	printk("CPU nodes : %d\n",numnodes);
+	for (n=0; n<numnodes; n++)
+		printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_ncpus(node)  num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#endif
+
+
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
@@ -652,81 +705,134 @@
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ * 
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more if the own node is (within margins) averagely loaded. The
+ * most loaded node is remembered as well as the time (jiffies). In the
+ * following calls to the load_balancer the time is compared with
+ * NODE_DELAY_BUSY (if load is around the average) or NODE_DELAY_IDLE (if own
+ * node is unloaded) if the most loaded node didn't change. This gives less 
+ * loaded nodes the chance to approach the average load but doesn't exclude
+ * busy nodes from stealing (just in case the cpus_allowed mask isn't good
+ * for the idle nodes).
+ * This concept can be extended easilly to more than two levels (multi-level
+ * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance
+ * dependent steal delays.
+ *
+ *                                                         <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+	int total_load;
+	int average_load;
+	int busiest_cpu;
+	int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. The check is more involved than the O(1) original
+ * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
+ */
+#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when it's load is not smaller than "comp"'s load - LOADSCALE/2.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
 {
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
+	runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+	int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+	int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu); 
+	struct node_queue_data nd[MAX_NUMNODES], *node;
 
-	/*
-	 * We search all runqueues to find the most busy one.
-	 * We do this lockless to reduce cache-bouncing overhead,
-	 * we re-check the 'best' source CPU later on again, with
-	 * the lock held.
-	 *
-	 * We fend off statistical fluctuations in runqueue lengths by
-	 * saving the runqueue length during the previous load-balancing
-	 * operation and using the smaller one the current and saved lengths.
-	 * If a runqueue is long enough for a longer amount of time then
-	 * we recognize it and pull tasks from it.
-	 *
-	 * The 'current runqueue length' is a statistical maximum variable,
-	 * for that one we take the longer one - to avoid fluctuations in
-	 * the other direction. So for a load-balance to happen it needs
-	 * stable long runqueue on the target CPU and stable short runqueue
-	 * on the local runqueue.
-	 *
-	 * We make an exception if this CPU is about to become idle - in
-	 * that case we are less picky about moving a task across CPUs and
-	 * take what can be taken.
-	 */
 	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
-		nr_running = this_rq->nr_running;
+		*nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		*nr_running = this_rq->prev_nr_running[this_cpu];
 
-	busiest = NULL;
-	max_load = 1;
-	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
-			continue;
+	for (n = 0; n < numnodes; n++) {
+		nd[n].busiest_cpu_load = -1;
+		nd[n].total_load = 0;
+	}
 
-		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
-			load = rq_src->nr_running;
+	/* compute all node loads and save their max cpu loads */
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_online(cpu)) continue;
+		node = &nd[cpu_to_node(cpu)];
+		src_rq = cpu_rq(cpu);
+		if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+			load = src_rq->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
-
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+			load = this_rq->prev_nr_running[cpu];
+		this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+		node->total_load += load;
+		if (load > node->busiest_cpu_load) {
+			node->busiest_cpu_load = load;
+			node->busiest_cpu = cpu;
 		}
 	}
 
-	if (likely(!busiest))
-		goto out;
+	busiest_cpu = nd[this_node].busiest_cpu;
+	if (busiest_cpu != this_cpu) {
+		if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+			busiest_rq = cpu_rq(busiest_cpu);
+			this_rq->wait_node = -1;
+			goto out;
+		}
+	}
 
-	*imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+	max_avg_load = -1;
+	for (n = 0; n < numnodes; n++) {
+		node = &nd[n];
+		system_load += node->total_load;
+		node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+		if (node->average_load > max_avg_load) {
+			max_avg_load = node->average_load;
+			busiest_node = n;
+		}
+	}
 
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
+	/* Exit if not enough imbalance on any remote node. */
+	if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+	    NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+		this_rq->wait_node = -1;
 		goto out;
 	}
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
+	busiest_cpu = nd[busiest_node].busiest_cpu;
+	avg_load = system_load*LOADSCALE/num_online_cpus();
+	/* Wait longer before stealing if own node's load is average. */
+	if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+		steal_delay = NODE_DELAY_BUSY;
+	else
+		steal_delay = NODE_DELAY_IDLE;
+	/* if we have a new most loaded node: just mark it */
+	if (this_rq->wait_node != busiest_node) {
+		this_rq->wait_node = busiest_node;
+		this_rq->wait_time = jiffies;
+		goto out;
+	} else
+	/* old most loaded node: check if waited enough */
+		if (jiffies - this_rq->wait_time < steal_delay)
+			goto out;
+
+	if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+		busiest_rq = cpu_rq(busiest_cpu);
+		this_rq->wait_node = -1;
 	}
-out:
-	return busiest;
+#endif
+ out:
+	return busiest_rq;
 }
 
 /*
@@ -758,16 +864,21 @@
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, nr_running, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	busiest = find_busiest_queue(this_cpu, idle, &nr_running);
 	if (!busiest)
 		goto out;
 
+	imbalance = (busiest->nr_running - nr_running)/2;
+
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	if (busiest->nr_running <= nr_running + 1)
+		goto out_unlock;
 	/*
 	 * We first consider expired tasks. Those will likely not be
 	 * executed in the near future, and they are most likely to
@@ -839,10 +950,10 @@
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 
 static inline void idle_tick(runqueue_t *rq)
@@ -2080,7 +2191,8 @@
 		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = task_cpu(p);

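A side note on the CPUS_BALANCED() macro above: it compares half the
difference of the two runqueue lengths against roughly a quarter of their
average (rounded up), whereas the old non-idle test compared the same
imbalance against roughly a quarter of the busiest queue alone. A tiny
standalone example (plain C, not part of the patch, loads picked just for
illustration) of one pair of loads where the two checks disagree:

#include <stdio.h>

/* the macro from the patch above */
#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))

/* the old non-idle test: imbalance (max-local)/2 is compared against
 * ~25% of the busiest queue, (max+3)/4; smaller means "balanced" */
static int old_balanced(int max_load, int local)
{
	return (max_load - local) / 2 < (max_load + 3) / 4;
}

int main(void)
{
	/* busiest queue: 5 tasks, local queue: 2 tasks */
	printf("old: %d   new: %d\n", old_balanced(5, 2), !!CPUS_BALANCED(5, 2));
	/* prints "old: 1   new: 0": the old check treats 5 vs 2 as balanced
	 * and skips the steal, CPUS_BALANCED() does not */
	return 0;
}
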
^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.47] NUMA scheduler (2/2)
  2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
@ 2002-11-11 15:14   ` Erich Focht
  2002-11-12  0:24   ` [PATCH 2.5.47] NUMA scheduler (1/2) Michael Hohnbaum
  1 sibling, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-11 15:14 UTC (permalink / raw)
  To: Michael Hohnbaum, Martin J. Bligh
  Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 463 bytes --]

And here's the second part.

Erich

On Monday 11 November 2002 16:13, Erich Focht wrote:
> here come the NUMA scheduler patches for 2.5.47 which emerged from
> Michael's and my work on NUMA schedulers. As usual in two parts:
>
> 01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
>   providing a node aware load_balancer
>
> 02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
>   least loaded node & CPU at exec()

[-- Attachment #2: 02-numa-sched-ilb-2.5.47-21.patch --]
[-- Type: text/x-diff, Size: 4660 bytes --]

diff -urN a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c	Mon Nov 11 04:28:10 2002
+++ b/fs/exec.c	Mon Nov 11 14:49:13 2002
@@ -1022,6 +1022,8 @@
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Mon Nov 11 14:44:04 2002
+++ b/include/linux/sched.h	Mon Nov 11 14:49:13 2002
@@ -159,7 +159,19 @@
 			       unsigned long system, int cpu);
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
-
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -urN a/init/main.c b/init/main.c
--- a/init/main.c	Mon Nov 11 04:28:05 2002
+++ b/init/main.c	Mon Nov 11 14:49:13 2002
@@ -499,6 +499,7 @@
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Mon Nov 11 14:44:04 2002
+++ b/kernel/sched.c	Mon Nov 11 14:49:13 2002
@@ -153,6 +153,7 @@
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
+	atomic_t * node_ptr;
 
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -346,7 +347,7 @@
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -354,7 +355,7 @@
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -841,9 +842,9 @@
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2253,6 +2254,83 @@
 
 #endif
 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+	}
+	return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+	minload = 10000000;
+	loop_over_node(i,node) {
+		if (!cpu_online(i))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP || CONFIG_PREEMPT
 /*
  * The 'big kernel lock'
@@ -2314,6 +2392,10 @@
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.47] NUMA scheduler (1/2)
  2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
  2002-11-11 15:14   ` [PATCH 2.5.47] NUMA scheduler (2/2) Erich Focht
@ 2002-11-12  0:24   ` Michael Hohnbaum
  1 sibling, 0 replies; 29+ messages in thread
From: Michael Hohnbaum @ 2002-11-12  0:24 UTC (permalink / raw)
  To: Erich Focht
  Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
	Stephen Hemminger, linux-kernel

On Mon, 2002-11-11 at 07:13, Erich Focht wrote:
> Hi,
> 
> here come the NUMA scheduler patches for 2.5.47 which emerged from
> Michael's and my work on NUMA schedulers. As usual in two parts:
> 
> 01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
>   providing a node aware load_balancer
> 
> 02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
>   least loaded node & CPU at exec()

Built, booted, tested on NUMAQ.  Results:

$ reportbench sched47 stock47
Kernbench:
                             Elapsed        User      System         CPU
                 sched47      20.39s    191.824s     50.358s     1187.6%
                 stock47     20.608s    194.536s         53s     1201.4%

Schedbench 4:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 sched47       22.45       34.79       89.84        0.69
                 stock47       29.05       42.58      116.25        0.94

Schedbench 8:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 sched47       39.58       60.98      316.72        1.78
                 stock47       43.30       56.84      346.49        2.44

Schedbench 16:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 sched47       58.83       72.58      941.56        5.30
                 stock47       60.76       82.05      972.31        4.82

Schedbench 32:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 sched47       55.87      122.75     1788.32       11.02
                 stock47       77.93      179.68     2494.06       11.35

Schedbench 64:
                             Elapsed   TotalUser    TotalSys     AvgUser
                 sched47       85.77      360.49     5490.28       21.13
                 stock47       94.31      398.85     6036.95       23.47


            Michael

> An up-to-date BK tree containing the whole NUMA scheduler can be found
> at bk://numa-ef.bkbits.net/numa-sched .
> 
> Regards,
> Erich

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: NUMA scheduler BK tree
  2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
                   ` (2 preceding siblings ...)
  2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
@ 2002-11-18 19:40 ` Martin J. Bligh
  2002-11-19 16:26   ` [PATCH 2.5.48] NUMA scheduler (1/2) Erich Focht
  3 siblings, 1 reply; 29+ messages in thread
From: Martin J. Bligh @ 2002-11-18 19:40 UTC (permalink / raw)
  To: Erich Focht, Michael Hohnbaum
  Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
	linux-kernel

> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
>        bk://numa-ef.bkbits.net/numa-sched
> (Web view:  http://numa-ef.bkbits.net/)
> This used to contain my node affine NUMA scheduler; I'll add extra
> trees when the additional patches for that are tested on top of our
> NUMA scheduler.
> 
> Is it ok for you to have it this way or would you prefer having the
> core and the initial load balancer separate?
> 
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep it that way.

BTW, can you keep producing normal patches too, when you do an update? 
I don't use bitkeeper ...

Thanks,

Martin.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.48] NUMA scheduler (1/2)
  2002-11-18 19:40 ` NUMA scheduler BK tree Martin J. Bligh
@ 2002-11-19 16:26   ` Erich Focht
  2002-11-19 16:27     ` [PATCH 2.5.48] NUMA scheduler (2/2) Erich Focht
  2002-12-02 15:29     ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
  0 siblings, 2 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-19 16:26 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 719 bytes --]

As requested in another email, I attach the 2.5.48 patches for the
NUMA scheduler which emerged from Michael's and my work:

01-numa-sched-core-2.5.48-22.patch: core NUMA scheduler infrastructure
  providing a node aware load_balancer. Rediffed and fixed one
  external declaration.

02-numa-sched-ilb-2.5.48-21.patch: initial load balancing, selects
  least loaded node & CPU at exec(). Rediffed.

We are curious about any benchmark results and, of course, about running
on platforms other than NUMAQ & Azusa/TX7, too.

Regards,
Erich

On Monday 18 November 2002 20:40, Martin J. Bligh wrote:
>
> BTW, can you keep producing normal patches too, when you do an update?
> I don't use bitkeeper ...


[-- Attachment #2: 01-numa-sched-core-2.5.48-22.patch --]
[-- Type: text/x-diff, Size: 12579 bytes --]

diff -Nru a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	Tue Nov 19 17:08:40 2002
+++ b/arch/i386/kernel/smpboot.c	Tue Nov 19 17:08:40 2002
@@ -1196,6 +1196,9 @@
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
 
 void __init smp_intr_init()
diff -Nru a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c	Tue Nov 19 17:08:40 2002
+++ b/arch/ia64/kernel/smpboot.c	Tue Nov 19 17:08:40 2002
@@ -397,7 +397,7 @@
 static void
 smp_tune_scheduling (void)
 {
-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
 
 	printk("task migration cache decay timeout: %ld msecs.\n",
 	       (cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,9 @@
 
 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
 
 int __devinit
diff -Nru a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c	Tue Nov 19 17:08:40 2002
+++ b/arch/ppc64/kernel/smp.c	Tue Nov 19 17:08:40 2002
@@ -679,4 +679,7 @@
 
 	/* XXX fix this, xics currently relies on it - Anton */
 	smp_threads_ready = 1;
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Tue Nov 19 17:08:40 2002
+++ b/include/linux/sched.h	Tue Nov 19 17:08:40 2002
@@ -20,6 +20,7 @@
 #include <asm/mmu.h>
 
 #include <linux/smp.h>
+#include <asm/topology.h>
 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -449,6 +450,9 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Tue Nov 19 17:08:40 2002
+++ b/kernel/sched.c	Tue Nov 19 17:08:40 2002
@@ -158,6 +158,10 @@
 	struct list_head migration_queue;
 
 	atomic_t nr_iowait;
+
+	unsigned long wait_time;
+	int wait_node;
+
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -177,6 +181,54 @@
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif
 
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node)  _node_nr_cpus[node]
+
+#define NODE_DELAY_IDLE  (1*HZ/1000)
+#define NODE_DELAY_BUSY  (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+	for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way.     [EF]
+ */
+void build_node_data(void)
+{
+	int n, cpu, ptr;
+	unsigned long mask;
+
+	ptr=0;
+	for (n=0; n<numnodes; n++) {
+		mask = __node_to_cpu_mask(n) & cpu_online_map;
+		node_ptr[n] = ptr;
+		for (cpu=0; cpu<NR_CPUS; cpu++)
+			if (mask  & (1UL << cpu))
+				ptr++;
+		node_ncpus(n) = ptr - node_ptr[n];;
+	}
+	printk("CPU nodes : %d\n",numnodes);
+	for (n=0; n<numnodes; n++)
+		printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_ncpus(node)  num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#endif
+
+
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
@@ -652,81 +704,134 @@
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
- */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
-{
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ * 
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more if the own node is (within margins) averagely loaded. The
+ * most loaded node is remembered as well as the time (jiffies). In the
+ * following calls to the load_balancer the time is compared with
+ * NODE_DELAY_BUSY (if load is around the average) or NODE_DELAY_IDLE (if own
+ * node is unloaded) if the most loaded node didn't change. This gives less 
+ * loaded nodes the chance to approach the average load but doesn't exclude
+ * busy nodes from stealing (just in case the cpus_allowed mask isn't good
+ * for the idle nodes).
+ * This concept can be extended easilly to more than two levels (multi-level
+ * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance
+ * dependent steal delays.
+ *
+ *                                                         <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+	int total_load;
+	int average_load;
+	int busiest_cpu;
+	int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. The check is more involved than the O(1) original
+ * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
+ */
+#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when it's load is not smaller than "comp"'s load - LOADSCALE/2.
+ */
+#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
+{
+	runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+	int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+	int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu); 
+	struct node_queue_data nd[MAX_NUMNODES], *node;
 
-	/*
-	 * We search all runqueues to find the most busy one.
-	 * We do this lockless to reduce cache-bouncing overhead,
-	 * we re-check the 'best' source CPU later on again, with
-	 * the lock held.
-	 *
-	 * We fend off statistical fluctuations in runqueue lengths by
-	 * saving the runqueue length during the previous load-balancing
-	 * operation and using the smaller one the current and saved lengths.
-	 * If a runqueue is long enough for a longer amount of time then
-	 * we recognize it and pull tasks from it.
-	 *
-	 * The 'current runqueue length' is a statistical maximum variable,
-	 * for that one we take the longer one - to avoid fluctuations in
-	 * the other direction. So for a load-balance to happen it needs
-	 * stable long runqueue on the target CPU and stable short runqueue
-	 * on the local runqueue.
-	 *
-	 * We make an exception if this CPU is about to become idle - in
-	 * that case we are less picky about moving a task across CPUs and
-	 * take what can be taken.
-	 */
 	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
-		nr_running = this_rq->nr_running;
+		*nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		*nr_running = this_rq->prev_nr_running[this_cpu];
 
-	busiest = NULL;
-	max_load = 1;
-	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
-			continue;
+	for (n = 0; n < numnodes; n++) {
+		nd[n].busiest_cpu_load = -1;
+		nd[n].total_load = 0;
+	}
 
-		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
-			load = rq_src->nr_running;
+	/* compute all node loads and save their max cpu loads */
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_online(cpu)) continue;
+		node = &nd[cpu_to_node(cpu)];
+		src_rq = cpu_rq(cpu);
+		if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+			load = src_rq->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
-
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+			load = this_rq->prev_nr_running[cpu];
+		this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+		node->total_load += load;
+		if (load > node->busiest_cpu_load) {
+			node->busiest_cpu_load = load;
+			node->busiest_cpu = cpu;
 		}
 	}
 
-	if (likely(!busiest))
-		goto out;
+	busiest_cpu = nd[this_node].busiest_cpu;
+	if (busiest_cpu != this_cpu) {
+		if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+			busiest_rq = cpu_rq(busiest_cpu);
+			this_rq->wait_node = -1;
+			goto out;
+		}
+	}
 
-	*imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+	max_avg_load = -1;
+	for (n = 0; n < numnodes; n++) {
+		node = &nd[n];
+		system_load += node->total_load;
+		node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+		if (node->average_load > max_avg_load) {
+			max_avg_load = node->average_load;
+			busiest_node = n;
+		}
+	}
 
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
+	/* Exit if not enough imbalance on any remote node. */
+	if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+	    NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+		this_rq->wait_node = -1;
 		goto out;
 	}
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
+	busiest_cpu = nd[busiest_node].busiest_cpu;
+	avg_load = system_load*LOADSCALE/num_online_cpus();
+	/* Wait longer before stealing if own node's load is average. */
+	if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+		steal_delay = NODE_DELAY_BUSY;
+	else
+		steal_delay = NODE_DELAY_IDLE;
+	/* if we have a new most loaded node: just mark it */
+	if (this_rq->wait_node != busiest_node) {
+		this_rq->wait_node = busiest_node;
+		this_rq->wait_time = jiffies;
+		goto out;
+	} else
+	/* old most loaded node: check if waited enough */
+		if (jiffies - this_rq->wait_time < steal_delay)
+			goto out;
+
+	if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+		busiest_rq = cpu_rq(busiest_cpu);
+		this_rq->wait_node = -1;
 	}
-out:
-	return busiest;
+#endif
+ out:
+	return busiest_rq;
 }
 
 /*
@@ -758,16 +863,21 @@
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, nr_running, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	busiest = find_busiest_queue(this_cpu, idle, &nr_running);
 	if (!busiest)
 		goto out;
 
+	imbalance = (busiest->nr_running - nr_running)/2;
+
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	if (busiest->nr_running <= nr_running + 1)
+		goto out_unlock;
 	/*
 	 * We first consider expired tasks. Those will likely not be
 	 * executed in the near future, and they are most likely to
@@ -839,10 +949,10 @@
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 
 static inline void idle_tick(runqueue_t *rq)
@@ -2080,7 +2190,8 @@
 		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = task_cpu(p);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.48] NUMA scheduler (2/2)
  2002-11-19 16:26   ` [PATCH 2.5.48] NUMA scheduler (1/2) Erich Focht
@ 2002-11-19 16:27     ` Erich Focht
  2002-12-02 15:29     ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
  1 sibling, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-19 16:27 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 850 bytes --]

And here is the second patch...

Erich

On Tuesday 19 November 2002 17:26, Erich Focht wrote:
> As requested in another email, I attach the 2.5.48 patches for the
> NUMA scheduler which emerged from Michael's and my work:
>
> 01-numa-sched-core-2.5.48-22.patch: core NUMA scheduler infrastructure
>   providing a node aware load_balancer. Rediffed and fixed one
>   external declaration.
>
> 02-numa-sched-ilb-2.5.48-21.patch: initial load balancing, selects
>   least loaded node & CPU at exec(). Rediffed.
>
> We are curious about any benchmark results and, of course, about running
> on platforms other than NUMAQ & Azusa/TX7, too.
>
> Regards,
> Erich
>
> On Monday 18 November 2002 20:40, Martin J. Bligh wrote:
> > BTW, can you keep producing normal patches too, when you do an update?
> > I don't use bitkeeper ...

[-- Attachment #2: 02-numa-sched-ilb-2.5.48-21.patch --]
[-- Type: text/x-diff, Size: 4660 bytes --]

diff -Nru a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c	Tue Nov 19 17:09:02 2002
+++ b/fs/exec.c	Tue Nov 19 17:09:02 2002
@@ -1023,6 +1023,8 @@
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	Tue Nov 19 17:09:02 2002
+++ b/include/linux/sched.h	Tue Nov 19 17:09:02 2002
@@ -159,7 +159,19 @@
 			       unsigned long system, int cpu);
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
-
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -Nru a/init/main.c b/init/main.c
--- a/init/main.c	Tue Nov 19 17:09:02 2002
+++ b/init/main.c	Tue Nov 19 17:09:02 2002
@@ -500,6 +500,7 @@
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	Tue Nov 19 17:09:02 2002
+++ b/kernel/sched.c	Tue Nov 19 17:09:02 2002
@@ -153,6 +153,7 @@
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
+	atomic_t * node_ptr;
 
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -346,7 +347,7 @@
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -354,7 +355,7 @@
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -841,9 +842,9 @@
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2253,6 +2254,83 @@
 
 #endif
 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+	}
+	return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+	minload = 10000000;
+	loop_over_node(i,node) {
+		if (!cpu_online(i))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP || CONFIG_PREEMPT
 /*
  * The 'big kernel lock'
@@ -2314,6 +2392,10 @@
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.50] NUMA scheduler (1/2)
  2002-11-19 16:26   ` [PATCH 2.5.48] NUMA scheduler (1/2) Erich Focht
  2002-11-19 16:27     ` [PATCH 2.5.48] NUMA scheduler (2/2) Erich Focht
@ 2002-12-02 15:29     ` Erich Focht
  2002-12-02 15:30       ` [PATCH 2.5.50] NUMA scheduler (2/2) Erich Focht
                         ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-02 15:29 UTC (permalink / raw)
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 193 bytes --]

Here come the NUMA scheduler patches rediffed for 2.5.50. No
functional changes since the last version
(http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2).

Regards,
Erich

[-- Attachment #2: 01-numa-sched-core-2.5.50-22.patch --]
[-- Type: text/x-diff, Size: 12681 bytes --]

diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	2002-11-27 23:36:00.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c	2002-11-29 12:51:12.000000000 +0100
@@ -1202,6 +1202,9 @@
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
 
 void __init smp_intr_init()
diff -urN a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c	2002-11-27 23:35:53.000000000 +0100
+++ b/arch/ia64/kernel/smpboot.c	2002-11-29 12:51:12.000000000 +0100
@@ -397,7 +397,7 @@
 static void
 smp_tune_scheduling (void)
 {
-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
 
 	printk("task migration cache decay timeout: %ld msecs.\n",
 	       (cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,9 @@
 
 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
 
 int __devinit
diff -urN a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c	2002-11-27 23:36:16.000000000 +0100
+++ b/arch/ppc64/kernel/smp.c	2002-11-29 12:51:12.000000000 +0100
@@ -679,4 +679,7 @@
 
 	/* XXX fix this, xics currently relies on it - Anton */
 	smp_threads_ready = 1;
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	2002-11-27 23:35:49.000000000 +0100
+++ b/include/linux/sched.h	2002-11-29 12:51:12.000000000 +0100
@@ -20,6 +20,7 @@
 #include <asm/mmu.h>
 
 #include <linux/smp.h>
+#include <asm/topology.h>
 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -450,6 +451,9 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2002-11-27 23:36:17.000000000 +0100
+++ b/kernel/sched.c	2002-11-29 12:51:12.000000000 +0100
@@ -158,6 +158,10 @@
 	struct list_head migration_queue;
 
 	atomic_t nr_iowait;
+
+	unsigned long wait_time;
+	int wait_node;
+
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -177,6 +181,54 @@
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif
 
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node)  _node_nr_cpus[node]
+
+#define NODE_DELAY_IDLE  (1*HZ/1000)
+#define NODE_DELAY_BUSY  (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+	for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way.     [EF]
+ */
+void build_node_data(void)
+{
+	int n, cpu, ptr;
+	unsigned long mask;
+
+	ptr=0;
+	for (n=0; n<numnodes; n++) {
+		mask = __node_to_cpu_mask(n) & cpu_online_map;
+		node_ptr[n] = ptr;
+		for (cpu=0; cpu<NR_CPUS; cpu++)
+			if (mask  & (1UL << cpu))
+				ptr++;
+		node_ncpus(n) = ptr - node_ptr[n];;
+	}
+	printk("CPU nodes : %d\n",numnodes);
+	for (n=0; n<numnodes; n++)
+		printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_ncpus(node)  num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#endif
+
+
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
@@ -652,81 +704,134 @@
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ * 
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more if the own node is (within margins) averagely loaded. The
+ * most loaded node is remembered as well as the time (jiffies). In the
+ * following calls to the load_balancer the time is compared with
+ * NODE_DELAY_BUSY (if load is around the average) or NODE_DELAY_IDLE (if own
+ * node is unloaded) if the most loaded node didn't change. This gives less 
+ * loaded nodes the chance to approach the average load but doesn't exclude
+ * busy nodes from stealing (just in case the cpus_allowed mask isn't good
+ * for the idle nodes).
+ * This concept can be extended easilly to more than two levels (multi-level
+ * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance
+ * dependent steal delays.
+ *
+ *                                                         <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+	int total_load;
+	int average_load;
+	int busiest_cpu;
+	int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. The check is more involved than the O(1) original
+ * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
+ */
+#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when it's load is not smaller than "comp"'s load - LOADSCALE/2.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
 {
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
+	runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+	int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+	int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu); 
+	struct node_queue_data nd[MAX_NUMNODES], *node;
 
-	/*
-	 * We search all runqueues to find the most busy one.
-	 * We do this lockless to reduce cache-bouncing overhead,
-	 * we re-check the 'best' source CPU later on again, with
-	 * the lock held.
-	 *
-	 * We fend off statistical fluctuations in runqueue lengths by
-	 * saving the runqueue length during the previous load-balancing
-	 * operation and using the smaller one the current and saved lengths.
-	 * If a runqueue is long enough for a longer amount of time then
-	 * we recognize it and pull tasks from it.
-	 *
-	 * The 'current runqueue length' is a statistical maximum variable,
-	 * for that one we take the longer one - to avoid fluctuations in
-	 * the other direction. So for a load-balance to happen it needs
-	 * stable long runqueue on the target CPU and stable short runqueue
-	 * on the local runqueue.
-	 *
-	 * We make an exception if this CPU is about to become idle - in
-	 * that case we are less picky about moving a task across CPUs and
-	 * take what can be taken.
-	 */
 	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
-		nr_running = this_rq->nr_running;
+		*nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		*nr_running = this_rq->prev_nr_running[this_cpu];
 
-	busiest = NULL;
-	max_load = 1;
-	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
-			continue;
+	for (n = 0; n < numnodes; n++) {
+		nd[n].busiest_cpu_load = -1;
+		nd[n].total_load = 0;
+	}
 
-		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
-			load = rq_src->nr_running;
+	/* compute all node loads and save their max cpu loads */
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_online(cpu)) continue;
+		node = &nd[cpu_to_node(cpu)];
+		src_rq = cpu_rq(cpu);
+		if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+			load = src_rq->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
-
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+			load = this_rq->prev_nr_running[cpu];
+		this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+		node->total_load += load;
+		if (load > node->busiest_cpu_load) {
+			node->busiest_cpu_load = load;
+			node->busiest_cpu = cpu;
 		}
 	}
 
-	if (likely(!busiest))
-		goto out;
+	busiest_cpu = nd[this_node].busiest_cpu;
+	if (busiest_cpu != this_cpu) {
+		if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+			busiest_rq = cpu_rq(busiest_cpu);
+			this_rq->wait_node = -1;
+			goto out;
+		}
+	}
 
-	*imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+	max_avg_load = -1;
+	for (n = 0; n < numnodes; n++) {
+		node = &nd[n];
+		system_load += node->total_load;
+		node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+		if (node->average_load > max_avg_load) {
+			max_avg_load = node->average_load;
+			busiest_node = n;
+		}
+	}
 
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
+	/* Exit if not enough imbalance on any remote node. */
+	if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+	    NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+		this_rq->wait_node = -1;
 		goto out;
 	}
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
+	busiest_cpu = nd[busiest_node].busiest_cpu;
+	avg_load = system_load*LOADSCALE/num_online_cpus();
+	/* Wait longer before stealing if own node's load is average. */
+	if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+		steal_delay = NODE_DELAY_BUSY;
+	else
+		steal_delay = NODE_DELAY_IDLE;
+	/* if we have a new most loaded node: just mark it */
+	if (this_rq->wait_node != busiest_node) {
+		this_rq->wait_node = busiest_node;
+		this_rq->wait_time = jiffies;
+		goto out;
+	} else
+	/* old most loaded node: check if waited enough */
+		if (jiffies - this_rq->wait_time < steal_delay)
+			goto out;
+
+	if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+		busiest_rq = cpu_rq(busiest_cpu);
+		this_rq->wait_node = -1;
 	}
-out:
-	return busiest;
+#endif
+ out:
+	return busiest_rq;
 }
 
 /*
@@ -758,16 +863,21 @@
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, nr_running, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	busiest = find_busiest_queue(this_cpu, idle, &nr_running);
 	if (!busiest)
 		goto out;
 
+	imbalance = (busiest->nr_running - nr_running)/2;
+
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	if (busiest->nr_running <= nr_running + 1)
+		goto out_unlock;
 	/*
 	 * We first consider expired tasks. Those will likely not be
 	 * executed in the near future, and they are most likely to
@@ -839,10 +949,10 @@
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 
 static inline void idle_tick(runqueue_t *rq)
@@ -2095,7 +2205,8 @@
 		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = task_cpu(p);
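
To illustrate the integer arithmetic remark in the CPUS_BALANCED comment
above (the numbers are just an example): with max_load = 5 on the busiest
queue and nr_running = 2 locally, the old code computes

	imbalance = (5 - 2)/2 = 1  <  (5 + 3)/4 = 2   ->  no balancing,

although the busiest queue holds 2.5 times the load and pulling one task
would leave a 4 vs. 3 split. The new check

	(5 - 2)/2 = 1  <  ((5 + 2)/2 + 3)/4 = 1       ->  false,

so the queues count as imbalanced and a steal is allowed.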

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.50] NUMA scheduler (2/2)
  2002-12-02 15:29     ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
@ 2002-12-02 15:30       ` Erich Focht
  2002-12-06 17:39       ` [PATCH 2.5.50] NUMA scheduler (1/2) Michael Hohnbaum
  2002-12-18 16:21       ` [PATCH 2.5.52] " Erich Focht
  2 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-02 15:30 UTC (permalink / raw
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 289 bytes --]

... and the second part ...

On Monday 02 December 2002 16:29, Erich Focht wrote:
> Here come the NUMA scheduler patches rediffed for 2.5.50. No
> functional changes since last version (
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2 ).
>
> Regards,
> Erich

[-- Attachment #2: 02-numa-sched-ilb-2.5.50-21.patch --]
[-- Type: text/x-diff, Size: 4748 bytes --]

diff -urN a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c	2002-11-27 23:35:57.000000000 +0100
+++ b/fs/exec.c	2002-11-29 17:44:19.000000000 +0100
@@ -1022,6 +1022,8 @@
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	2002-11-29 12:51:12.000000000 +0100
+++ b/include/linux/sched.h	2002-11-29 17:44:19.000000000 +0100
@@ -160,7 +160,19 @@
 			       unsigned long system, int cpu);
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
-
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -urN a/init/main.c b/init/main.c
--- a/init/main.c	2002-11-27 23:35:51.000000000 +0100
+++ b/init/main.c	2002-11-29 17:44:19.000000000 +0100
@@ -500,6 +500,7 @@
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2002-11-29 12:51:12.000000000 +0100
+++ b/kernel/sched.c	2002-11-29 17:44:19.000000000 +0100
@@ -153,6 +153,7 @@
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
+	atomic_t * node_ptr;
 
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -346,7 +347,7 @@
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -354,7 +355,7 @@
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -841,9 +842,9 @@
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2268,6 +2269,83 @@
 
 #endif
 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+	}
+	return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+	minload = 10000000;
+	loop_over_node(i,node) {
+		if (!cpu_online(i))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP || CONFIG_PREEMPT
 /*
  * The 'big kernel lock'
@@ -2329,6 +2407,10 @@
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.50] NUMA scheduler (1/2)
  2002-12-02 15:29     ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
  2002-12-02 15:30       ` [PATCH 2.5.50] NUMA scheduler (2/2) Erich Focht
@ 2002-12-06 17:39       ` Michael Hohnbaum
  2002-12-18 16:21       ` [PATCH 2.5.52] " Erich Focht
  2 siblings, 0 replies; 29+ messages in thread
From: Michael Hohnbaum @ 2002-12-06 17:39 UTC (permalink / raw
  To: Erich Focht
  Cc: Martin J. Bligh, Robert Love, Ingo Molnar, Stephen Hemminger,
	linux-kernel

On Mon, 2002-12-02 at 07:29, Erich Focht wrote:
> Here come the NUMA scheduler patches rediffed for 2.5.50. No
> functional changes since last version (
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2 ).

Tested on NUMAQ.  Ran into the problem of the missing CPU stats from
/proc/self/cpu, so initial test results were worthless.  Applied Erich's
patch that restored this information and ran on NUMAQ.  As with previous
versions, the result was a modest performance gain on kernbench and a more
significant performance gain on Erich's memory-intensive test (fondly
referred to here as schedbench).

Kernbench:
                             Elapsed        User      System         CPU
                stock50a      20.92s     194.12s      53.15s       1182%
                 sched50     20.002s    191.976s      51.17s       1215%

Schedbench 4:
                             AvgUser     Elapsed   TotalUser    TotalSys
                stock50a       29.50       43.11      118.02        0.83
                 sched50       33.93       47.17      135.76        0.74

Schedbench 8:
                             AvgUser     Elapsed   TotalUser    TotalSys
                stock50a       46.88       62.51      375.12        1.78
                 sched50       33.90       47.31      271.30        1.98

Schedbench 16:
                             AvgUser     Elapsed   TotalUser    TotalSys
                stock50a       56.26       71.17      900.32        6.23
                 sched50       57.22       73.34      915.76        4.53

Schedbench 32:
                             AvgUser     Elapsed   TotalUser    TotalSys
                stock50a       84.86      191.48     2715.76       10.93
                 sched50       57.36      130.88     1835.76       10.46

Schedbench 64:
                             AvgUser     Elapsed   TotalUser    TotalSys
                stock50a      114.91      474.55     7355.45       24.95
                 sched50       80.20      338.04     5133.56       22.70

In other testing we are seeing unexpectedly high idle times.  Andrea has
patches against the 2.4 port of the O(1) scheduler that are supposed to help
with this.  I plan to try those out to see if they reduce our idle time.

                Michael

> 
> Regards,
> Erich
> ----

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.52] NUMA scheduler (1/2)
  2002-12-02 15:29     ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
  2002-12-02 15:30       ` [PATCH 2.5.50] NUMA scheduler (2/2) Erich Focht
  2002-12-06 17:39       ` [PATCH 2.5.50] NUMA scheduler (1/2) Michael Hohnbaum
@ 2002-12-18 16:21       ` Erich Focht
  2002-12-18 16:23         ` [PATCH 2.5.52] NUMA scheduler (2/2) Erich Focht
                           ` (3 more replies)
  2 siblings, 4 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-18 16:21 UTC (permalink / raw
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 438 bytes --]

This is the first part of the minimal NUMA scheduler built on top of
the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
bugfix in build_node_data().

The two patches are:

01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
  providing a node aware load_balancer.

02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
  least loaded node & CPU at exec().

Regards,
Erich

[-- Attachment #2: 01-numa-sched-core-2.5.52-23.patch --]
[-- Type: text/x-diff, Size: 12708 bytes --]

diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	2002-12-16 03:07:56.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c	2002-12-18 16:53:12.000000000 +0100
@@ -1202,6 +1202,9 @@
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
 
 void __init smp_intr_init()
diff -urN a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c	2002-12-16 03:07:50.000000000 +0100
+++ b/arch/ia64/kernel/smpboot.c	2002-12-18 16:53:13.000000000 +0100
@@ -397,7 +397,7 @@
 static void
 smp_tune_scheduling (void)
 {
-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
 
 	printk("task migration cache decay timeout: %ld msecs.\n",
 	       (cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,9 @@
 
 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
 
 int __devinit
diff -urN a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c	2002-12-16 03:08:13.000000000 +0100
+++ b/arch/ppc64/kernel/smp.c	2002-12-18 16:53:13.000000000 +0100
@@ -679,4 +679,7 @@
 
 	/* XXX fix this, xics currently relies on it - Anton */
 	smp_threads_ready = 1;
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif
 }
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	2002-12-16 03:07:44.000000000 +0100
+++ b/include/linux/sched.h	2002-12-18 16:53:13.000000000 +0100
@@ -20,6 +20,7 @@
 #include <asm/mmu.h>
 
 #include <linux/smp.h>
+#include <asm/topology.h>
 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -446,6 +447,9 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2002-12-16 03:08:14.000000000 +0100
+++ b/kernel/sched.c	2002-12-18 16:53:13.000000000 +0100
@@ -158,6 +158,10 @@
 	struct list_head migration_queue;
 
 	atomic_t nr_iowait;
+
+	unsigned long wait_time;
+	int wait_node;
+
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -177,6 +181,55 @@
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif
 
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node)  _node_nr_cpus[node]
+
+#define NODE_DELAY_IDLE  (1*HZ/1000)
+#define NODE_DELAY_BUSY  (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+	for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way.     [EF]
+ */
+void build_node_data(void)
+{
+	int n, cpu, ptr;
+	unsigned long mask;
+
+	ptr=0;
+	for (n=0; n<numnodes; n++) {
+		mask = __node_to_cpu_mask(n) & cpu_online_map;
+		node_ptr[n] = ptr;
+		for (cpu=0; cpu<NR_CPUS; cpu++)
+			if (mask  & (1UL << cpu))
+				ptr++;
+		node_ncpus(n) = ptr - node_ptr[n];
+	}
+	node_ptr[numnodes] = ptr;
+	printk("CPU nodes : %d\n",numnodes);
+	for (n=0; n<numnodes; n++)
+		printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_ncpus(node)  num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#endif
+
+
 /*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
@@ -652,81 +705,134 @@
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ * 
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more if the own node is (within margins) averagely loaded. The
+ * most loaded node is remembered as well as the time (jiffies). In the
+ * following calls to the load_balancer the time is compared with
+ * NODE_DELAY_BUSY (if load is around the average) or NODE_DELAY_IDLE (if own
+ * node is unloaded) if the most loaded node didn't change. This gives less 
+ * loaded nodes the chance to approach the average load but doesn't exclude
+ * busy nodes from stealing (just in case the cpus_allowed mask isn't good
+ * for the idle nodes).
+ * This concept can be extended easilly to more than two levels (multi-level
+ * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance
+ * dependent steal delays.
+ *
+ *                                                         <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+	int total_load;
+	int average_load;
+	int busiest_cpu;
+	int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. The check is more involved than the O(1) original
+ * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
+ */
+#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when it's load is not smaller than "comp"'s load - LOADSCALE/2.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
 {
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
+	runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+	int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+	int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu); 
+	struct node_queue_data nd[MAX_NUMNODES], *node;
 
-	/*
-	 * We search all runqueues to find the most busy one.
-	 * We do this lockless to reduce cache-bouncing overhead,
-	 * we re-check the 'best' source CPU later on again, with
-	 * the lock held.
-	 *
-	 * We fend off statistical fluctuations in runqueue lengths by
-	 * saving the runqueue length during the previous load-balancing
-	 * operation and using the smaller one the current and saved lengths.
-	 * If a runqueue is long enough for a longer amount of time then
-	 * we recognize it and pull tasks from it.
-	 *
-	 * The 'current runqueue length' is a statistical maximum variable,
-	 * for that one we take the longer one - to avoid fluctuations in
-	 * the other direction. So for a load-balance to happen it needs
-	 * stable long runqueue on the target CPU and stable short runqueue
-	 * on the local runqueue.
-	 *
-	 * We make an exception if this CPU is about to become idle - in
-	 * that case we are less picky about moving a task across CPUs and
-	 * take what can be taken.
-	 */
 	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
-		nr_running = this_rq->nr_running;
+		*nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		*nr_running = this_rq->prev_nr_running[this_cpu];
 
-	busiest = NULL;
-	max_load = 1;
-	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
-			continue;
+	for (n = 0; n < numnodes; n++) {
+		nd[n].busiest_cpu_load = -1;
+		nd[n].total_load = 0;
+	}
 
-		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
-			load = rq_src->nr_running;
+	/* compute all node loads and save their max cpu loads */
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_online(cpu)) continue;
+		node = &nd[cpu_to_node(cpu)];
+		src_rq = cpu_rq(cpu);
+		if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+			load = src_rq->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
-
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+			load = this_rq->prev_nr_running[cpu];
+		this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+		node->total_load += load;
+		if (load > node->busiest_cpu_load) {
+			node->busiest_cpu_load = load;
+			node->busiest_cpu = cpu;
 		}
 	}
 
-	if (likely(!busiest))
-		goto out;
+	busiest_cpu = nd[this_node].busiest_cpu;
+	if (busiest_cpu != this_cpu) {
+		if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+			busiest_rq = cpu_rq(busiest_cpu);
+			this_rq->wait_node = -1;
+			goto out;
+		}
+	}
 
-	*imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+	max_avg_load = -1;
+	for (n = 0; n < numnodes; n++) {
+		node = &nd[n];
+		system_load += node->total_load;
+		node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+		if (node->average_load > max_avg_load) {
+			max_avg_load = node->average_load;
+			busiest_node = n;
+		}
+	}
 
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
+	/* Exit if not enough imbalance on any remote node. */
+	if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+	    NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+		this_rq->wait_node = -1;
 		goto out;
 	}
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
+	busiest_cpu = nd[busiest_node].busiest_cpu;
+	avg_load = system_load*LOADSCALE/num_online_cpus();
+	/* Wait longer before stealing if own node's load is average. */
+	if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+		steal_delay = NODE_DELAY_BUSY;
+	else
+		steal_delay = NODE_DELAY_IDLE;
+	/* if we have a new most loaded node: just mark it */
+	if (this_rq->wait_node != busiest_node) {
+		this_rq->wait_node = busiest_node;
+		this_rq->wait_time = jiffies;
+		goto out;
+	} else
+	/* old most loaded node: check if waited enough */
+		if (jiffies - this_rq->wait_time < steal_delay)
+			goto out;
+
+	if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+		busiest_rq = cpu_rq(busiest_cpu);
+		this_rq->wait_node = -1;
 	}
-out:
-	return busiest;
+#endif
+ out:
+	return busiest_rq;
 }
 
 /*
@@ -758,16 +864,21 @@
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, nr_running, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	busiest = find_busiest_queue(this_cpu, idle, &nr_running);
 	if (!busiest)
 		goto out;
 
+	imbalance = (busiest->nr_running - nr_running)/2;
+
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	if (busiest->nr_running <= nr_running + 1)
+		goto out_unlock;
 	/*
 	 * We first consider expired tasks. Those will likely not be
 	 * executed in the near future, and they are most likely to
@@ -839,10 +950,10 @@
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
 
 static inline void idle_tick(runqueue_t *rq)
@@ -2110,7 +2221,8 @@
 		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = task_cpu(p);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.52] NUMA scheduler (2/2)
  2002-12-18 16:21       ` [PATCH 2.5.52] " Erich Focht
@ 2002-12-18 16:23         ` Erich Focht
  2002-12-20 14:49         ` [PATCH 2.5.52] NUMA scheduler: cputimes stats Erich Focht
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-18 16:23 UTC (permalink / raw
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 578 bytes --]

And here comes patch #2, work done by Michael Hohnbaum.

On Wednesday 18 December 2002 17:21, Erich Focht wrote:
> This is the first part of the minimal NUMA scheduler built on top of
> the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
> bugfix in build_node_data().
>
> The two patches are:
>
> 01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
>   providing a node aware load_balancer.
>
> 02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
>   least loaded node & CPU at exec().
>
> Regards,
> Erich

[-- Attachment #2: 02-numa-sched-ilb-2.5.52-21.patch --]
[-- Type: text/x-diff, Size: 4748 bytes --]

diff -urN a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c	2002-12-16 03:07:53.000000000 +0100
+++ b/fs/exec.c	2002-12-18 16:54:40.000000000 +0100
@@ -1024,6 +1024,8 @@
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	2002-12-18 16:53:13.000000000 +0100
+++ b/include/linux/sched.h	2002-12-18 16:54:40.000000000 +0100
@@ -160,7 +160,19 @@
 			       unsigned long system, int cpu);
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
-
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
 
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -urN a/init/main.c b/init/main.c
--- a/init/main.c	2002-12-16 03:07:48.000000000 +0100
+++ b/init/main.c	2002-12-18 16:54:40.000000000 +0100
@@ -473,6 +473,7 @@
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2002-12-18 16:53:13.000000000 +0100
+++ b/kernel/sched.c	2002-12-18 16:54:40.000000000 +0100
@@ -153,6 +153,7 @@
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
+	atomic_t * node_ptr;
 
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -347,7 +348,7 @@
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -355,7 +356,7 @@
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -842,9 +843,9 @@
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2284,6 +2285,83 @@
 
 #endif
 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+	}
+	return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+	minload = 10000000;
+	loop_over_node(i,node) {
+		if (!cpu_online(i))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP || CONFIG_PREEMPT
 /*
  * The 'big kernel lock'
@@ -2345,6 +2423,10 @@
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.52] NUMA scheduler: cputimes stats
  2002-12-18 16:21       ` [PATCH 2.5.52] " Erich Focht
  2002-12-18 16:23         ` [PATCH 2.5.52] NUMA scheduler (2/2) Erich Focht
@ 2002-12-20 14:49         ` Erich Focht
  2002-12-20 15:17         ` [PATCH 2.5.52] NUMA scheduler (1/2) Christoph Hellwig
  2002-12-31 13:29         ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
  3 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-20 14:49 UTC (permalink / raw
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 807 bytes --]

The attached patch is needed for evaluating and understanding
scheduler and load balancer activity. It adds back to the kernel the
per-CPU user and system time statistics for each process in /proc/PID/cpu,
a feature which was around for ages and got lost in 2.5.50.
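
With the patch applied, /proc/PID/cpu contains one aggregate line, one
line per online CPU and the CPU the task currently runs on, in clock_t
units. For example (made-up numbers, format as printed by proc_pid_cpu()
in the patch below):

	cpu  1234 567
	cpu0 1000 400
	cpu1 234 167
	current_cpu 1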

Regards,
Erich

On Wednesday 18 December 2002 17:21, Erich Focht wrote:
> This is the first part of the minimal NUMA scheduler built on top of
> the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
> bugfix in build_node_data().
>
> The two patches are:
>
> 01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
>   providing a node aware load_balancer.
>
> 02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
>   least loaded node & CPU at exec().
>
> Regards,
> Erich

[-- Attachment #2: 03-cputimes_stat-2.5.52.patch --]
[-- Type: text/x-diff, Size: 3316 bytes --]

diff -urN a/fs/proc/array.c b/fs/proc/array.c
--- a/fs/proc/array.c	2002-12-16 03:08:11.000000000 +0100
+++ b/fs/proc/array.c	2002-12-20 15:40:41.000000000 +0100
@@ -597,3 +597,26 @@
 out:
 	return retval;
 }
+
+#ifdef CONFIG_SMP
+int proc_pid_cpu(struct task_struct *task, char * buffer)
+{
+	int i, len;
+
+	len = sprintf(buffer,
+		"cpu  %lu %lu\n",
+		jiffies_to_clock_t(task->utime),
+		jiffies_to_clock_t(task->stime));
+		
+	for (i = 0 ; i < NR_CPUS; i++) {
+		if (cpu_online(i))
+		len += sprintf(buffer + len, "cpu%d %lu %lu\n",
+			i,
+			jiffies_to_clock_t(task->per_cpu_utime[i]),
+			jiffies_to_clock_t(task->per_cpu_stime[i]));
+
+	}
+	len += sprintf(buffer + len, "current_cpu %d\n",task_cpu(task));
+	return len;
+}
+#endif
diff -urN a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c	2002-12-16 03:08:11.000000000 +0100
+++ b/fs/proc/base.c	2002-12-20 15:40:42.000000000 +0100
@@ -55,6 +55,7 @@
 	PROC_PID_STAT,
 	PROC_PID_STATM,
 	PROC_PID_MAPS,
+	PROC_PID_CPU,
 	PROC_PID_MOUNTS,
 	PROC_PID_WCHAN,
 	PROC_PID_FD_DIR = 0x8000,	/* 0x8000-0xffff */
@@ -75,6 +76,9 @@
   E(PROC_PID_CMDLINE,	"cmdline",	S_IFREG|S_IRUGO),
   E(PROC_PID_STAT,	"stat",		S_IFREG|S_IRUGO),
   E(PROC_PID_STATM,	"statm",	S_IFREG|S_IRUGO),
+#ifdef CONFIG_SMP
+  E(PROC_PID_CPU,	"cpu",		S_IFREG|S_IRUGO),
+#endif
   E(PROC_PID_MAPS,	"maps",		S_IFREG|S_IRUGO),
   E(PROC_PID_MEM,	"mem",		S_IFREG|S_IRUSR|S_IWUSR),
   E(PROC_PID_CWD,	"cwd",		S_IFLNK|S_IRWXUGO),
@@ -1026,7 +1030,12 @@
 		case PROC_PID_MAPS:
 			inode->i_fop = &proc_maps_operations;
 			break;
-
+#ifdef CONFIG_SMP
+		case PROC_PID_CPU:
+			inode->i_fop = &proc_info_file_operations;
+			ei->op.proc_read = proc_pid_cpu;
+			break;
+#endif
 		case PROC_PID_MEM:
 			inode->i_op = &proc_mem_inode_operations;
 			inode->i_fop = &proc_mem_operations;
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	2002-12-18 16:54:40.000000000 +0100
+++ b/include/linux/sched.h	2002-12-20 15:40:42.000000000 +0100
@@ -354,6 +354,9 @@
 	struct timer_list real_timer;
 	unsigned long utime, stime, cutime, cstime;
 	unsigned long start_time;
+#ifdef CONFIG_SMP
+	long per_cpu_utime[NR_CPUS], per_cpu_stime[NR_CPUS];
+#endif
 /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
 	unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
 	int swappable:1;
diff -urN a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c	2002-12-16 03:07:44.000000000 +0100
+++ b/kernel/fork.c	2002-12-20 15:40:42.000000000 +0100
@@ -795,6 +795,14 @@
 	p->tty_old_pgrp = 0;
 	p->utime = p->stime = 0;
 	p->cutime = p->cstime = 0;
+#ifdef CONFIG_SMP
+	{
+		int i;
+
+		for(i = 0; i < NR_CPUS; i++)
+			p->per_cpu_utime[i] = p->per_cpu_stime[i] = 0;
+	}
+#endif
 	p->array = NULL;
 	p->lock_depth = -1;		/* -1 = no lock */
 	p->start_time = jiffies;
diff -urN a/kernel/timer.c b/kernel/timer.c
--- a/kernel/timer.c	2002-12-16 03:07:52.000000000 +0100
+++ b/kernel/timer.c	2002-12-20 15:40:42.000000000 +0100
@@ -695,6 +695,10 @@
 void update_one_process(struct task_struct *p, unsigned long user,
 			unsigned long system, int cpu)
 {
+#ifdef CONFIG_SMP
+	p->per_cpu_utime[cpu] += user;
+	p->per_cpu_stime[cpu] += system;
+#endif
 	do_process_times(p, user, system);
 	do_it_virt(p, user);
 	do_it_prof(p);

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.52] NUMA scheduler (1/2)
  2002-12-18 16:21       ` [PATCH 2.5.52] " Erich Focht
  2002-12-18 16:23         ` [PATCH 2.5.52] NUMA scheduler (2/2) Erich Focht
  2002-12-20 14:49         ` [PATCH 2.5.52] NUMA scheduler: cputimes stats Erich Focht
@ 2002-12-20 15:17         ` Christoph Hellwig
  2002-12-20 17:44           ` Erich Focht
  2002-12-31 13:29         ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
  3 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2002-12-20 15:17 UTC (permalink / raw
  To: Erich Focht
  Cc: Martin J. Bligh, Michael Hohnbaum, Robert Love, Ingo Molnar,
	Stephen Hemminger, linux-kernel


diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	2002-12-16 03:07:56.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c	2002-12-18 16:53:12.000000000 +0100
@@ -1202,6 +1202,9 @@
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+#ifdef CONFIG_NUMA
+	build_node_data();
+#endif

I think it would be much nicer if you had a proper stub for !CONFIG_NUMA..
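
E.g. something like (sketch only):

	#ifdef CONFIG_NUMA
	extern void build_node_data(void);
	#else
	# define build_node_data() {}	/* no-op stub for !CONFIG_NUMA */
	#endif

so the callers can lose their #ifdefs.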

@@ -20,6 +20,7 @@
 #include <asm/mmu.h>
 
 #include <linux/smp.h>
+#include <asm/topology.h>

Another header in sched.h??  And it doesn't look like it's used at all.

 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -446,6 +447,9 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif

The ifdef is superfluous.

diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2002-12-16 03:08:14.000000000 +0100
+++ b/kernel/sched.c	2002-12-18 16:53:13.000000000 +0100
@@ -158,6 +158,10 @@
 	struct list_head migration_queue;
 
 	atomic_t nr_iowait;
+
+	unsigned long wait_time;
+	int wait_node;
+

Here OTOH a #if CONFIG_NUMA could help to avoid a little bit of bloat.

 } ____cacheline_aligned;
 
+#define cpu_to_node(cpu) __cpu_to_node(cpu)

I wonder why we don't have a proper cpu_to_node() yet, but as long
as it doesn't exist please use __cpu_to_node() directly.

+#define LOADSCALE 128

Any explanation?

+#ifdef CONFIG_NUMA

sched.c uses #if CONFIG_FOO, not #ifdef CONFIG_FOO, it would be cool
if you could follow the style of existing source files.

+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node)  _node_nr_cpus[node]

Parametrized macros and variables aren't in the same namespace; what about
just node_nr_cpus for the macro, too?  And should these be static?

+
+#define NODE_DELAY_IDLE  (1*HZ/1000)
+#define NODE_DELAY_BUSY  (20*HZ/1000)

Comments, please..

+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+	for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)

Move to asm/topology.h?

+	ptr=0;
+	for (n=0; n<numnodes; n++) {

You need to add lots of space to match Documentation/CodingStyle.. :)

+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_online(cpu)) continue;

And linebreaks..

Btw, what about a for_each_cpu that has the cpu_online check in topology.h?
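
Rough sketch of the idea (illustration only):

	/* walk the online CPUs; the trailing else keeps the caller's
	 * statement attached to the loop body */
	#define for_each_cpu(cpu) \
		for ((cpu) = 0; (cpu) < NR_CPUS; (cpu)++) \
			if (!cpu_online(cpu)) \
				continue; \
			else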

+	/* Wait longer before stealing if own node's load is average. */
+	if (NODES_BALANCED(avg_load,nd[this_node].average_load))

Shouldn't NODES_BALANCED shout less and be an inline called nodes_balanced?
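
Say (just a sketch, same arithmetic as the macro):

	static inline int nodes_balanced(int comp_load, int this_load)
	{
		/* "this" node counts as balanced if its load is within
		 * LOADSCALE/2 of the node it is compared against */
		return (comp_load - this_load) < LOADSCALE / 2;
	}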

+		this_rq->wait_node = busiest_node;
+		this_rq->wait_time = jiffies;
+		goto out;
+	} else
+	/* old most loaded node: check if waited enough */
+		if (jiffies - this_rq->wait_time < steal_delay)
+			goto out;

That indentation looks really strange, why not just

	/* old most loaded node: check if waited enough */
	} else if (jiffies - this_rq->wait_time < steal_delay)
		goto out;

+
+	if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {

Ditto, the name shouts a bit too much.

+#endif

#endif /* CONFIG_NUMA */

-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)

Any explanation why you changed that constant?

 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+

This looks like a bugfix valid without the rest of the patch.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.52] NUMA scheduler (1/2)
  2002-12-20 15:17         ` [PATCH 2.5.52] NUMA scheduler (1/2) Christoph Hellwig
@ 2002-12-20 17:44           ` Erich Focht
  0 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-20 17:44 UTC (permalink / raw
  To: Christoph Hellwig
  Cc: Martin J. Bligh, Michael Hohnbaum, Robert Love, Ingo Molnar,
	Stephen Hemminger, linux-kernel

Hi Christoph,

thanks for the comments, I'll try to fix the things you mentioned when
preparing the next patch and add some more comments.

Regards,
Erich

On Friday 20 December 2002 16:17, Christoph Hellwig wrote:
> diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
> --- a/arch/i386/kernel/smpboot.c	2002-12-16 03:07:56.000000000 +0100
> +++ b/arch/i386/kernel/smpboot.c	2002-12-18 16:53:12.000000000 +0100
> @@ -1202,6 +1202,9 @@
>  void __init smp_cpus_done(unsigned int max_cpus)
>  {
>  	zap_low_mappings();
> +#ifdef CONFIG_NUMA
> +	build_node_data();
> +#endif
>
> I think it would be much nicer if you had a proper stub for !CONFIG_NUMA..
>
> @@ -20,6 +20,7 @@
>  #include <asm/mmu.h>
>
>  #include <linux/smp.h>
> +#include <asm/topology.h>
>
> Another header in sched.h??  And it doesn't look like it's used at all.
>
>  #include <linux/sem.h>
>  #include <linux/signal.h>
>  #include <linux/securebits.h>
> @@ -446,6 +447,9 @@
>  # define set_cpus_allowed(p, new_mask) do { } while (0)
>  #endif
>
> +#ifdef CONFIG_NUMA
> +extern void build_node_data(void);
> +#endif
>
> The ifdef is superfluous.
>
> diff -urN a/kernel/sched.c b/kernel/sched.c
> --- a/kernel/sched.c	2002-12-16 03:08:14.000000000 +0100
> +++ b/kernel/sched.c	2002-12-18 16:53:13.000000000 +0100
> @@ -158,6 +158,10 @@
>  	struct list_head migration_queue;
>
>  	atomic_t nr_iowait;
> +
> +	unsigned long wait_time;
> +	int wait_node;
> +
>
> Here OTOH a #if CONFIG_NUMA could help to avoid a little bit of bloat.
>
>  } ____cacheline_aligned;
>
> +#define cpu_to_node(cpu) __cpu_to_node(cpu)
>
> I wonder why we don't have a proper cpu_to_node() yet, but as long
> as it doesn't exist please use __cpu_to_node() directly.
>
> +#define LOADSCALE 128
>
> Any explanation?
>
> +#ifdef CONFIG_NUMA
>
> sched.c uses #if CONFIG_FOO, not #ifdef CONFIG_FOO, it would be cool
> if you could follow the style of existing source files.
>
> +/* Number of CPUs per node: sane values until all CPUs are up */
> +int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
> +int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are
> sorted!)*/ +#define node_ncpus(node)  _node_nr_cpus[node]
>
> Parametrized macros and variables aren't in the same namespace; what about
> just node_nr_cpus for the macro, too?  And should these be static?
>
> +
> +#define NODE_DELAY_IDLE  (1*HZ/1000)
> +#define NODE_DELAY_BUSY  (20*HZ/1000)
>
> Comments, please..
>
> +/* the following macro implies that logical CPUs are sorted by node number
> */ +#define loop_over_node(cpu,node) \
> +	for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
>
> Move to asm/topology.h?
>
> +	ptr=0;
> +	for (n=0; n<numnodes; n++) {
>
> You need to add lots of space to match Documentation/CodingStyle.. :)
>
> +	for (cpu = 0; cpu < NR_CPUS; cpu++) {
> +		if (!cpu_online(cpu)) continue;
>
> And linebreaks..
>
> Btw, what about a for_each_cpu that has the cpu_online check in topology.h?
>
> +	/* Wait longer before stealing if own node's load is average. */
> +	if (NODES_BALANCED(avg_load,nd[this_node].average_load))
>
> Shouldn't NODES_BALANCED shout less and be an inline called nodes_balanced?
>
> +		this_rq->wait_node = busiest_node;
> +		this_rq->wait_time = jiffies;
> +		goto out;
> +	} else
> +	/* old most loaded node: check if waited enough */
> +		if (jiffies - this_rq->wait_time < steal_delay)
> +			goto out;
>
> That indentation looks really strange, why not just
>
> 	/* old most loaded node: check if waited enough */
> 	} else if (jiffies - this_rq->wait_time < steal_delay)
> 		goto out;
>
> +
> +	if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
>
> Ditto, the name shouts a bit too much.
>
> +#endif
>
> #endif /* CONFIG_NUMA */
>
> -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
> +#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
>
> Any explanation why you changed that constant?
>
>  		p = req->task;
> -		cpu_dest = __ffs(p->cpus_allowed);
> +		cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
> +
>
> This looks like a bugfix valid without the rest of the patch.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.53] NUMA scheduler (1/3)
  2002-12-18 16:21       ` [PATCH 2.5.52] " Erich Focht
                           ` (2 preceding siblings ...)
  2002-12-20 15:17         ` [PATCH 2.5.52] NUMA scheduler (1/2) Christoph Hellwig
@ 2002-12-31 13:29         ` Erich Focht
  2002-12-31 13:29           ` [PATCH 2.5.53] NUMA scheduler (2/3) Erich Focht
                             ` (2 more replies)
  3 siblings, 3 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-31 13:29 UTC (permalink / raw
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 801 bytes --]

Here comes the minimal NUMA scheduler built on top of the O(1) load
balancer rediffed for 2.5.53 with some changes in the core part. As
suggested by Michael, I added the cputimes_stat patch, as it is
absolutely needed for understanding the scheduler behavior.

The three patches:
01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
  providing a node aware load_balancer. Cosmetic changes + more comments.

02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
  least loaded node & CPU at exec().

03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
  and system time statistics for each process in /proc/PID/cpu. Needed
  for evaluating scheduler behavior and performance of tasks running
  on SMP and NUMA systems.

Regards,
Erich

[-- Attachment #2: 01-numa-sched-core-2.5.53-24.patch --]
[-- Type: text/x-diff, Size: 12460 bytes --]

diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c	2002-12-24 06:20:16.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c	2002-12-31 13:02:47.000000000 +0100
@@ -1191,6 +1191,7 @@
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
+	build_node_data();
 }
 
 void __init smp_intr_init()
diff -urN a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c	2002-12-24 06:19:49.000000000 +0100
+++ b/arch/ia64/kernel/smpboot.c	2002-12-31 13:03:02.000000000 +0100
@@ -397,7 +397,7 @@
 static void
 smp_tune_scheduling (void)
 {
-	cache_decay_ticks = 10;	/* XXX base this on PAL info and cache-bandwidth estimate */
+	cache_decay_ticks = 8;	/* XXX base this on PAL info and cache-bandwidth estimate */
 
 	printk("task migration cache decay timeout: %ld msecs.\n",
 	       (cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,7 @@
 
 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+	build_node_data();
 }
 
 int __devinit
diff -urN a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c	2002-12-24 06:21:01.000000000 +0100
+++ b/arch/ppc64/kernel/smp.c	2002-12-31 13:03:21.000000000 +0100
@@ -679,4 +679,5 @@
 
 	/* XXX fix this, xics currently relies on it - Anton */
 	smp_threads_ready = 1;
+	build_node_data();
 }
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	2002-12-24 06:19:35.000000000 +0100
+++ b/include/linux/sched.h	2002-12-31 13:02:18.000000000 +0100
@@ -445,6 +445,12 @@
 # define set_cpus_allowed(p, new_mask) do { } while (0)
 #endif
 
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#else
+#define build_node_data() {}
+#endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2002-12-24 06:21:05.000000000 +0100
+++ b/kernel/sched.c	2002-12-31 13:46:03.000000000 +0100
@@ -158,6 +158,10 @@
 	struct list_head migration_queue;
 
 	atomic_t nr_iowait;
+#ifdef CONFIG_NUMA
+	unsigned long wait_time;
+	int wait_node;
+#endif
 } ____cacheline_aligned;
 
 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -178,6 +182,64 @@
 #endif
 
 /*
+ * Node loads are scaled with LOADSCALE. This way:
+ * - we avoid zeros in integer divisions (dividing by node CPU number),
+ * - loads of nodes with different numbers of CPUs are comparable.
+ */
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+static int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+static int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_nr_cpus(node)  _node_nr_cpus[node]
+
+/*
+ * Delay stealing from remote node when own queue is idle/busy. These delays
+ * tend to equalize the node loads. NODE_DELAY_IDLE is nonzero because we
+ * want to give idle CPUs in the busiest node a chance to steal first.
+ */
+#define NODE_DELAY_IDLE  (1*HZ/1000)
+#define NODE_DELAY_BUSY  (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+	for(cpu = node_ptr[node]; cpu < node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_nr_cpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way.     [EF]
+ */
+void build_node_data(void)
+{
+	int n, cpu, ptr;
+	unsigned long mask;
+
+	ptr=0;
+	for (n = 0; n < numnodes; n++) {
+		mask = __node_to_cpu_mask(n) & cpu_online_map;
+		node_ptr[n] = ptr;
+		for (cpu = 0; cpu < NR_CPUS; cpu++)
+			if (mask  & (1UL << cpu))
+				ptr++;
+		node_nr_cpus(n) = ptr - node_ptr[n];
+	}
+	node_ptr[numnodes] = ptr;
+	printk("CPU nodes : %d\n",numnodes);
+	for (n = 0; n < numnodes; n++)
+		printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_nr_cpus(node)  num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu = 0; cpu < NR_CPUS; cpu++)
+#endif
+
+
+/*
  * task_rq_lock - lock the runqueue a given task resides on and disable
  * interrupts.  Note the ordering: we can safely lookup the task_rq without
  * explicitly disabling preemption.
@@ -652,81 +714,134 @@
 }
 
 /*
- * find_busiest_queue - find the busiest runqueue.
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ * 
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more if the own node is (within margins) averagely loaded. The
+ * most loaded node is remembered as well as the time (jiffies). In the
+ * following calls to the load_balancer the time is compared with
+ * NODE_DELAY_BUSY (if load is around the average) or NODE_DELAY_IDLE (if own
+ * node is unloaded) if the most loaded node didn't change. This gives less 
+ * loaded nodes the chance to approach the average load but doesn't exclude
+ * busy nodes from stealing (just in case the cpus_allowed mask isn't good
+ * for the idle nodes).
+ * This concept can be extended easily to more than two levels (multi-level
+ * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance
+ * dependent steal delays.
+ *
+ *                                                         <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+	int total_load;
+	int average_load;
+	int busiest_cpu;
+	int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. The check is more involved than the O(1) original
+ * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
+ */
+#define cpus_balanced(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when its load is not smaller than "comp"'s load - LOADSCALE/2.
  */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+#define nodes_balanced(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
 {
-	int nr_running, load, max_load, i;
-	runqueue_t *busiest, *rq_src;
+	runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+	int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+	int n, steal_delay, system_load = 0, this_node=__cpu_to_node(this_cpu); 
+	struct node_queue_data nd[MAX_NUMNODES], *node;
 
-	/*
-	 * We search all runqueues to find the most busy one.
-	 * We do this lockless to reduce cache-bouncing overhead,
-	 * we re-check the 'best' source CPU later on again, with
-	 * the lock held.
-	 *
-	 * We fend off statistical fluctuations in runqueue lengths by
-	 * saving the runqueue length during the previous load-balancing
-	 * operation and using the smaller one the current and saved lengths.
-	 * If a runqueue is long enough for a longer amount of time then
-	 * we recognize it and pull tasks from it.
-	 *
-	 * The 'current runqueue length' is a statistical maximum variable,
-	 * for that one we take the longer one - to avoid fluctuations in
-	 * the other direction. So for a load-balance to happen it needs
-	 * stable long runqueue on the target CPU and stable short runqueue
-	 * on the local runqueue.
-	 *
-	 * We make an exception if this CPU is about to become idle - in
-	 * that case we are less picky about moving a task across CPUs and
-	 * take what can be taken.
-	 */
 	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
-		nr_running = this_rq->nr_running;
+		*nr_running = this_rq->nr_running;
 	else
-		nr_running = this_rq->prev_nr_running[this_cpu];
+		*nr_running = this_rq->prev_nr_running[this_cpu];
 
-	busiest = NULL;
-	max_load = 1;
-	for (i = 0; i < NR_CPUS; i++) {
-		if (!cpu_online(i))
-			continue;
+	for (n = 0; n < numnodes; n++) {
+		nd[n].busiest_cpu_load = -1;
+		nd[n].total_load = 0;
+	}
 
-		rq_src = cpu_rq(i);
-		if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
-			load = rq_src->nr_running;
+	/* compute all node loads and save their max cpu loads */
+	for (cpu = 0; cpu < NR_CPUS; cpu++) {
+		if (!cpu_online(cpu))
+			continue;
+		node = &nd[__cpu_to_node(cpu)];
+		src_rq = cpu_rq(cpu);
+		if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+			load = src_rq->nr_running;
 		else
-			load = this_rq->prev_nr_running[i];
-		this_rq->prev_nr_running[i] = rq_src->nr_running;
-
-		if ((load > max_load) && (rq_src != this_rq)) {
-			busiest = rq_src;
-			max_load = load;
+			load = this_rq->prev_nr_running[cpu];
+		this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+		node->total_load += load;
+		if (load > node->busiest_cpu_load) {
+			node->busiest_cpu_load = load;
+			node->busiest_cpu = cpu;
 		}
 	}
 
-	if (likely(!busiest))
-		goto out;
+	busiest_cpu = nd[this_node].busiest_cpu;
+	if (busiest_cpu != this_cpu) {
+		if (!cpus_balanced(nd[this_node].busiest_cpu_load,*nr_running)) {
+			busiest_rq = cpu_rq(busiest_cpu);
+			this_rq->wait_node = -1;
+			goto out;
+		}
+	}
 
-	*imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+	max_avg_load = -1;
+	for (n = 0; n < numnodes; n++) {
+		node = &nd[n];
+		system_load += node->total_load;
+		node->average_load = node->total_load*LOADSCALE/node_nr_cpus(n);
+		if (node->average_load > max_avg_load) {
+			max_avg_load = node->average_load;
+			busiest_node = n;
+		}
+	}
 
-	/* It needs an at least ~25% imbalance to trigger balancing. */
-	if (!idle && (*imbalance < (max_load + 3)/4)) {
-		busiest = NULL;
+	/* Exit if not enough imbalance on any remote node. */
+	if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+	    nodes_balanced(max_avg_load,nd[this_node].average_load)) {
+		this_rq->wait_node = -1;
 		goto out;
 	}
 
-	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
-	/*
-	 * Make sure nothing changed since we checked the
-	 * runqueue length.
-	 */
-	if (busiest->nr_running <= nr_running + 1) {
-		spin_unlock(&busiest->lock);
-		busiest = NULL;
+	busiest_cpu = nd[busiest_node].busiest_cpu;
+	avg_load = system_load*LOADSCALE/num_online_cpus();
+	/* Wait longer before stealing if own node's load is average. */
+	if (nodes_balanced(avg_load,nd[this_node].average_load))
+		steal_delay = NODE_DELAY_BUSY;
+	else
+		steal_delay = NODE_DELAY_IDLE;
+	/* if we have a new most loaded node: just mark it */
+	if (this_rq->wait_node != busiest_node) {
+		this_rq->wait_node = busiest_node;
+		this_rq->wait_time = jiffies;
+		goto out;
+	/* old most loaded node: check if waited enough */
+	} else if (jiffies - this_rq->wait_time < steal_delay)
+		goto out;
+
+	if ((!cpus_balanced(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+		busiest_rq = cpu_rq(busiest_cpu);
+		this_rq->wait_node = -1;
 	}
-out:
-	return busiest;
+#endif
+ out:
+	return busiest_rq;
 }
 
 /*
@@ -758,16 +873,21 @@
  */
 static void load_balance(runqueue_t *this_rq, int idle)
 {
-	int imbalance, idx, this_cpu = smp_processor_id();
+	int imbalance, nr_running, idx, this_cpu = smp_processor_id();
 	runqueue_t *busiest;
 	prio_array_t *array;
 	struct list_head *head, *curr;
 	task_t *tmp;
 
-	busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+	busiest = find_busiest_queue(this_cpu, idle, &nr_running);
 	if (!busiest)
 		goto out;
 
+	imbalance = (busiest->nr_running - nr_running)/2;
+
+	nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+	if (busiest->nr_running <= nr_running + 1)
+		goto out_unlock;
 	/*
 	 * We first consider expired tasks. Those will likely not be
 	 * executed in the near future, and they are most likely to
@@ -2110,7 +2230,8 @@
 		spin_unlock_irqrestore(&rq->lock, flags);
 
 		p = req->task;
-		cpu_dest = __ffs(p->cpus_allowed);
+		cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
 		rq_dest = cpu_rq(cpu_dest);
 repeat:
 		cpu_src = task_cpu(p);
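
The balancing decisions in the patch above come down to a couple of small integer checks: node loads are scaled by LOADSCALE before being divided by the node's CPU count, and the cpus_balanced()/nodes_balanced() macros compare the results. The following stand-alone C program only replays that arithmetic with made-up node sizes and task counts; it is a sketch of the macros above, not kernel code.

#include <stdio.h>

#define LOADSCALE 128

/* the imbalance check behind the patch's cpus_balanced() macro */
static int cpus_balanced(int max, int cur)
{
    return (max <= 1) || ((max - cur) / 2 < ((max + cur) / 2 + 3) / 4);
}

/* "this" node counts as balanced if it is within LOADSCALE/2 of "comp" */
static int nodes_balanced(int comp, int cur)
{
    return (comp - cur) < LOADSCALE / 2;
}

int main(void)
{
    /* hypothetical 2-node machine: 4 CPUs running 6 tasks vs. 4 CPUs running 2 */
    int nr_cpus[2]  = { 4, 4 };
    int nr_tasks[2] = { 6, 2 };
    int avg[2], n;

    for (n = 0; n < 2; n++) {
        /* the LOADSCALE factor keeps the integer division from rounding to zero */
        avg[n] = nr_tasks[n] * LOADSCALE / nr_cpus[n];
        printf("node %d: scaled average load %d\n", n, avg[n]);
    }
    printf("nodes balanced?          %s\n",
           nodes_balanced(avg[0], avg[1]) ? "yes" : "no");
    printf("CPUs balanced (3 vs 1)?  %s\n",
           cpus_balanced(3, 1) ? "yes" : "no");
    return 0;
}

With the sample numbers node 0 ends up at a scaled load of 192 against 64 for node 1, so nodes_balanced() reports an imbalance and a steal from node 0's busiest CPU would be considered, subject to the NODE_DELAY_* waiting described in the comments above.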

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.53] NUMA scheduler (2/3)
  2002-12-31 13:29         ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
@ 2002-12-31 13:29           ` Erich Focht
  2002-12-31 13:30           ` [PATCH 2.5.53] NUMA scheduler (3/3) Erich Focht
  2003-01-04  1:58           ` [PATCH 2.5.53] NUMA scheduler (1/3) Michael Hohnbaum
  2 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-31 13:29 UTC (permalink / raw
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 918 bytes --]

... the second patch ...

On Tuesday 31 December 2002 14:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.
>
> The three patches:
> 01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
>   providing a node aware load_balancer. Cosmetic changes + more comments.
>
> 02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
>   least loaded node & CPU at exec().
>
> 03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
>   and system time statistics for each process in /proc/PID/cpu. Needed
>   for evaluating scheduler behavior and performance of tasks running
>   on SMP and NUMA systems.
>
> Regards,
> Erich

[-- Attachment #2: 02-numa-sched-ilb-2.5.53-21.patch --]
[-- Type: text/x-diff, Size: 4878 bytes --]

diff -urN a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c	2002-12-24 06:20:03.000000000 +0100
+++ b/fs/exec.c	2002-12-31 14:15:45.000000000 +0100
@@ -1024,6 +1024,8 @@
 	int retval;
 	int i;
 
+	sched_balance_exec();
+
 	file = open_exec(filename);
 
 	retval = PTR_ERR(file);
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	2002-12-31 13:02:18.000000000 +0100
+++ b/include/linux/sched.h	2002-12-31 14:17:13.000000000 +0100
@@ -160,7 +160,6 @@
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
 
-
 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
 asmlinkage void schedule(void);
@@ -447,8 +446,18 @@
 
 #ifdef CONFIG_NUMA
 extern void build_node_data(void);
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--
 #else
 #define build_node_data() {}
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
 #endif
 
 extern void set_user_nice(task_t *p, long nice);
diff -urN a/init/main.c b/init/main.c
--- a/init/main.c	2002-12-24 06:19:46.000000000 +0100
+++ b/init/main.c	2002-12-31 14:15:45.000000000 +0100
@@ -489,6 +489,7 @@
 
 	migration_init();
 #endif
+	node_nr_running_init();
 	spawn_ksoftirqd();
 }
 
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c	2002-12-31 13:46:03.000000000 +0100
+++ b/kernel/sched.c	2002-12-31 14:15:45.000000000 +0100
@@ -153,6 +153,7 @@
 	task_t *curr, *idle;
 	prio_array_t *active, *expired, arrays[2];
 	int prev_nr_running[NR_CPUS];
+	atomic_t * node_ptr;
 
 	task_t *migration_thread;
 	struct list_head migration_queue;
@@ -356,7 +357,7 @@
 		p->prio = effective_prio(p);
 	}
 	enqueue_task(p, array);
-	rq->nr_running++;
+	nr_running_inc(rq);
 }
 
 /*
@@ -364,7 +365,7 @@
  */
 static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
 {
-	rq->nr_running--;
+	nr_running_dec(rq);
 	if (p->state == TASK_UNINTERRUPTIBLE)
 		rq->nr_uninterruptible++;
 	dequeue_task(p, p->array);
@@ -851,9 +852,9 @@
 static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
 {
 	dequeue_task(p, src_array);
-	src_rq->nr_running--;
+	nr_running_dec(src_rq);
 	set_task_cpu(p, this_cpu);
-	this_rq->nr_running++;
+	nr_running_inc(this_rq);
 	enqueue_task(p, this_rq->active);
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2293,6 +2294,83 @@
 
 #endif
 
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+	int i;
+
+	for (i = 0; i < NR_CPUS; i++) {
+		cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+	}
+	return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu.  Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+	unsigned long old_mask;
+
+	old_mask = p->cpus_allowed;
+	if (!(old_mask & (1UL << dest_cpu)))
+		return;
+	/* force the process onto the specified CPU */
+	set_cpus_allowed(p, 1UL << dest_cpu);
+
+	/* restore the cpus allowed mask */
+	set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU.  Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+	int i, minload, load, best_cpu, node = 0;
+
+	best_cpu = task_cpu(p);
+	if (cpu_rq(best_cpu)->nr_running <= 2)
+		return best_cpu;
+
+	minload = 10000000;
+	for (i = 0; i < numnodes; i++) {
+		load = atomic_read(&node_nr_running[i]);
+		if (load < minload) {
+			minload = load;
+			node = i;
+		}
+	}
+	minload = 10000000;
+	loop_over_node(i,node) {
+		if (!cpu_online(i))
+			continue;
+		if (cpu_rq(i)->nr_running < minload) {
+			best_cpu = i;
+			minload = cpu_rq(i)->nr_running;
+		}
+	}
+	return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+	int new_cpu;
+
+	if (numnodes > 1) {
+		new_cpu = sched_best_cpu(current);
+		if (new_cpu != smp_processor_id())
+			sched_migrate_task(current, new_cpu);
+	}
+}
+#endif /* CONFIG_NUMA */
+
 #if CONFIG_SMP || CONFIG_PREEMPT
 /*
  * The 'big kernel lock'
@@ -2354,6 +2432,10 @@
 		spin_lock_init(&rq->lock);
 		INIT_LIST_HEAD(&rq->migration_queue);
 		atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+		rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
 
 		for (j = 0; j < 2; j++) {
 			array = rq->arrays + j;

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 2.5.53] NUMA scheduler (3/3)
  2002-12-31 13:29         ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
  2002-12-31 13:29           ` [PATCH 2.5.53] NUMA scheduler (2/3) Erich Focht
@ 2002-12-31 13:30           ` Erich Focht
  2003-01-04  1:58           ` [PATCH 2.5.53] NUMA scheduler (1/3) Michael Hohnbaum
  2 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-31 13:30 UTC (permalink / raw
  To: Martin J. Bligh, Michael Hohnbaum
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 921 bytes --]

... and the third patch ...

On Tuesday 31 December 2002 14:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.
>
> The three patches:
> 01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
>   providing a node aware load_balancer. Cosmetic changes + more comments.
>
> 02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
>   least loaded node & CPU at exec().
>
> 03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
>   and system time statistics for each process in /proc/PID/cpu. Needed
>   for evaluating scheduler behavior and performance of tasks running
>   on SMP and NUMA systems.
>
> Regards,
> Erich

[-- Attachment #2: 03-cputimes_stat-2.5.53.patch --]
[-- Type: text/x-diff, Size: 3316 bytes --]

diff -urN a/fs/proc/array.c b/fs/proc/array.c
--- a/fs/proc/array.c	2002-12-24 06:20:36.000000000 +0100
+++ b/fs/proc/array.c	2002-12-31 14:19:15.000000000 +0100
@@ -597,3 +597,26 @@
 out:
 	return retval;
 }
+
+#ifdef CONFIG_SMP
+int proc_pid_cpu(struct task_struct *task, char * buffer)
+{
+	int i, len;
+
+	len = sprintf(buffer,
+		"cpu  %lu %lu\n",
+		jiffies_to_clock_t(task->utime),
+		jiffies_to_clock_t(task->stime));
+		
+	for (i = 0 ; i < NR_CPUS; i++) {
+		if (cpu_online(i))
+		len += sprintf(buffer + len, "cpu%d %lu %lu\n",
+			i,
+			jiffies_to_clock_t(task->per_cpu_utime[i]),
+			jiffies_to_clock_t(task->per_cpu_stime[i]));
+
+	}
+	len += sprintf(buffer + len, "current_cpu %d\n",task_cpu(task));
+	return len;
+}
+#endif
diff -urN a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c	2002-12-24 06:20:39.000000000 +0100
+++ b/fs/proc/base.c	2002-12-31 14:19:15.000000000 +0100
@@ -55,6 +55,7 @@
 	PROC_PID_STAT,
 	PROC_PID_STATM,
 	PROC_PID_MAPS,
+	PROC_PID_CPU,
 	PROC_PID_MOUNTS,
 	PROC_PID_WCHAN,
 	PROC_PID_FD_DIR = 0x8000,	/* 0x8000-0xffff */
@@ -75,6 +76,9 @@
   E(PROC_PID_CMDLINE,	"cmdline",	S_IFREG|S_IRUGO),
   E(PROC_PID_STAT,	"stat",		S_IFREG|S_IRUGO),
   E(PROC_PID_STATM,	"statm",	S_IFREG|S_IRUGO),
+#ifdef CONFIG_SMP
+  E(PROC_PID_CPU,	"cpu",		S_IFREG|S_IRUGO),
+#endif
   E(PROC_PID_MAPS,	"maps",		S_IFREG|S_IRUGO),
   E(PROC_PID_MEM,	"mem",		S_IFREG|S_IRUSR|S_IWUSR),
   E(PROC_PID_CWD,	"cwd",		S_IFLNK|S_IRWXUGO),
@@ -1026,7 +1030,12 @@
 		case PROC_PID_MAPS:
 			inode->i_fop = &proc_maps_operations;
 			break;
-
+#ifdef CONFIG_SMP
+		case PROC_PID_CPU:
+			inode->i_fop = &proc_info_file_operations;
+			ei->op.proc_read = proc_pid_cpu;
+			break;
+#endif
 		case PROC_PID_MEM:
 			inode->i_op = &proc_mem_inode_operations;
 			inode->i_fop = &proc_mem_operations;
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h	2002-12-31 14:17:13.000000000 +0100
+++ b/include/linux/sched.h	2002-12-31 14:19:15.000000000 +0100
@@ -340,6 +340,9 @@
 	struct timer_list real_timer;
 	unsigned long utime, stime, cutime, cstime;
 	unsigned long start_time;
+#ifdef CONFIG_SMP
+	long per_cpu_utime[NR_CPUS], per_cpu_stime[NR_CPUS];
+#endif
 /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
 	unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
 	int swappable:1;
diff -urN a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c	2002-12-24 06:19:35.000000000 +0100
+++ b/kernel/fork.c	2002-12-31 14:19:15.000000000 +0100
@@ -795,6 +795,14 @@
 	p->tty_old_pgrp = 0;
 	p->utime = p->stime = 0;
 	p->cutime = p->cstime = 0;
+#ifdef CONFIG_SMP
+	{
+		int i;
+
+		for(i = 0; i < NR_CPUS; i++)
+			p->per_cpu_utime[i] = p->per_cpu_stime[i] = 0;
+	}
+#endif
 	p->array = NULL;
 	p->lock_depth = -1;		/* -1 = no lock */
 	p->start_time = jiffies;
diff -urN a/kernel/timer.c b/kernel/timer.c
--- a/kernel/timer.c	2002-12-24 06:19:51.000000000 +0100
+++ b/kernel/timer.c	2002-12-31 14:19:15.000000000 +0100
@@ -695,6 +695,10 @@
 void update_one_process(struct task_struct *p, unsigned long user,
 			unsigned long system, int cpu)
 {
+#ifdef CONFIG_SMP
+	p->per_cpu_utime[cpu] += user;
+	p->per_cpu_stime[cpu] += system;
+#endif
 	do_process_times(p, user, system);
 	do_it_virt(p, user);
 	do_it_prof(p);
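
The format written by proc_pid_cpu() above is one "cpu <utime> <stime>" summary line, then "cpuN <utime> <stime>" for every online CPU, and finally "current_cpu <n>", all times in clock ticks. A small reader for that layout, sketched purely from the sprintf() calls in the patch, could look like this:

#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64], line[128], name[16];
    unsigned long user, sys, cur;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/cpu", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "current_cpu %lu", &cur) == 1)
            printf("last ran on CPU %lu\n", cur);
        else if (sscanf(line, "%15s %lu %lu", name, &user, &sys) == 3)
            /* per-CPU (or total) times in clock ticks */
            printf("%-8s user=%lu system=%lu\n", name, user, sys);
    }
    fclose(f);
    return 0;
}

On a kernel without this patch /proc/<pid>/cpu simply does not exist and the open fails.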

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
  2002-12-31 13:29         ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
  2002-12-31 13:29           ` [PATCH 2.5.53] NUMA scheduler (2/3) Erich Focht
  2002-12-31 13:30           ` [PATCH 2.5.53] NUMA scheduler (3/3) Erich Focht
@ 2003-01-04  1:58           ` Michael Hohnbaum
  2003-01-05  5:35             ` Martin J. Bligh
  2 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2003-01-04  1:58 UTC (permalink / raw
  To: Erich Focht
  Cc: Martin J. Bligh, Robert Love, Ingo Molnar, Stephen Hemminger,
	linux-kernel

On Tue, 2002-12-31 at 05:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.

Thanks for this latest patch.  I've managed to cobble together
a 4 node NUMAQ system (16 700 MHZ PIII procs, 16GB memory) and
ran kernbench and schedbench on this, along with the 2.5.50 and
2.5.52 versions.  Results remain fairly consistent with what
we have been obtaining on earlier versions.

Kernbench:
                        Elapsed       User     System        CPU
             sched50     29.96s   288.308s    83.606s    1240.8%
             sched52    29.836s   285.832s    84.464s    1240.4%
             sched53    29.364s   284.808s    83.174s    1252.6%
             stock50    31.074s   303.664s    89.194s    1264.2%
             stock53    31.204s   306.224s    87.776s    1263.2%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
             sched50      22.00      35.50      88.04       0.86
             sched52      22.04      34.77      88.18       0.75
             sched53      22.16      34.39      88.66       0.74
             stock50      27.18      45.99     108.75       0.87
             stock53       0.00      41.96     133.85       1.07

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
             sched50      32.05      46.81     256.45       1.81
             sched52      32.75      50.33     262.07       1.63
             sched53      30.56      46.58     244.59       1.67
             stock50      44.75      63.68     358.09       2.55
             stock53       0.00      59.64     318.63       2.40

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
             sched50      55.34      70.74     885.61       4.09
             sched52      50.64      62.30     810.50       4.31
             sched53      54.04      70.01     864.84       3.68
             stock50      65.36      80.72    1045.94       5.92
             stock53       0.00      78.01     947.06       6.70

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
             sched50      59.47     139.77    1903.37      10.24
             sched52      58.37     132.90    1868.30       7.45
             sched53      57.35     132.30    1835.49       9.29
             stock50      84.51     194.89    2704.83      12.44
             stock53       0.00     182.95    2612.83      12.46

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
             sched50      63.55     276.81    4067.65      21.58
             sched52      66.75     293.31    4272.72      21.06
             sched53      62.38     276.99    3992.97      19.90
             stock50      99.68     422.06    6380.04      25.92
             stock53       0.00     441.42    6723.01      26.83

Note that the 0.00 in the AvgUser column for stock53 was due to
me not applying the cputime patch (03-cputimes_stat-2.5.53.patch).
Not sure how I managed to forget that one.  I'll rerun this on
a kernel with that patch on Monday.

-- 
Michael Hohnbaum            503-578-5486
hohnbaum@us.ibm.com         T/L 775-5486


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
  2003-01-04  1:58           ` [PATCH 2.5.53] NUMA scheduler (1/3) Michael Hohnbaum
@ 2003-01-05  5:35             ` Martin J. Bligh
  2003-01-06  3:58               ` Michael Hohnbaum
  0 siblings, 1 reply; 29+ messages in thread
From: Martin J. Bligh @ 2003-01-05  5:35 UTC (permalink / raw
  To: Michael Hohnbaum, Erich Focht
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

>> Here comes the minimal NUMA scheduler built on top of the O(1) load
>> balancer rediffed for 2.5.53 with some changes in the core part. As
>> suggested by Michael, I added the cputimes_stat patch, as it is
>> absolutely needed for understanding the scheduler behavior.
>
> Thanks for this latest patch.  I've managed to cobble together
> a 4 node NUMAQ system (16 700 MHZ PIII procs, 16GB memory) and
> ran kernbench and schedbench on this, along with the 2.5.50 and
> 2.5.52 versions.  Results remain fairly consistent with what
> we have been obtaining on earlier versions.
>
> Kernbench:
>                         Elapsed       User     System        CPU
>              sched50     29.96s   288.308s    83.606s    1240.8%
>              sched52    29.836s   285.832s    84.464s    1240.4%
>              sched53    29.364s   284.808s    83.174s    1252.6%
>              stock50    31.074s   303.664s    89.194s    1264.2%
>              stock53    31.204s   306.224s    87.776s    1263.2%

Not sure what you're correlating here because your rows are all named
the same thing. However, the new version seems to be much slower
on systime (about 7-8% for me), which roughly correlates with your
last two rows above. Me no like.

Erich, can you separate out the cleanups from the functional changes?
The cleanups look cool. The only thing that I can see easily was
BUSY_REBALANCE_TICK - I tried changing that back to the old version
but it made no difference. Here's a diff going from your old scheduler
(that I've been keeping update in my tree) to your new one on 2.5.54.

M.

diff -urpN -X /home/fletch/.diff.exclude 
oldsched/arch/i386/kernel/smpboot.c newsched/arch/i386/kernel/smpboot.c
--- oldsched/arch/i386/kernel/smpboot.c	Sat Jan  4 14:57:43 2003
+++ newsched/arch/i386/kernel/smpboot.c	Sat Jan  4 14:58:25 2003
@@ -1198,9 +1198,7 @@ int __devinit __cpu_up(unsigned int cpu)
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	zap_low_mappings();
-#ifdef CONFIG_NUMA
 	build_node_data();
-#endif
 }

 void __init smp_intr_init()
diff -urpN -X /home/fletch/.diff.exclude 
oldsched/arch/ia64/kernel/smpboot.c newsched/arch/ia64/kernel/smpboot.c
--- oldsched/arch/ia64/kernel/smpboot.c	Sat Jan  4 14:57:43 2003
+++ newsched/arch/ia64/kernel/smpboot.c	Sat Jan  4 14:58:25 2003
@@ -528,9 +528,7 @@ smp_cpus_done (unsigned int dummy)

 	printk(KERN_INFO"Total of %d processors activated (%lu.%02lu 
BogoMIPS).\n",
 	       num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
-#ifdef CONFIG_NUMA
 	build_node_data();
-#endif
 }

 int __devinit
diff -urpN -X /home/fletch/.diff.exclude oldsched/arch/ppc64/kernel/smp.c 
newsched/arch/ppc64/kernel/smp.c
--- oldsched/arch/ppc64/kernel/smp.c	Sat Jan  4 14:57:43 2003
+++ newsched/arch/ppc64/kernel/smp.c	Sat Jan  4 14:58:25 2003
@@ -685,7 +685,5 @@ void __init smp_cpus_done(unsigned int m

 	/* XXX fix this, xics currently relies on it - Anton */
 	smp_threads_ready = 1;
-#ifdef CONFIG_NUMA
 	build_node_data();
-#endif
 }
diff -urpN -X /home/fletch/.diff.exclude oldsched/include/linux/sched.h 
newsched/include/linux/sched.h
--- oldsched/include/linux/sched.h	Sat Jan  4 14:57:50 2003
+++ newsched/include/linux/sched.h	Sat Jan  4 14:58:28 2003
@@ -20,7 +20,6 @@
 #include <asm/mmu.h>

 #include <linux/smp.h>
-#include <asm/topology.h>
 #include <linux/sem.h>
 #include <linux/signal.h>
 #include <linux/securebits.h>
@@ -160,19 +159,6 @@ extern void update_one_process(struct ta
 			       unsigned long system, int cpu);
 extern void scheduler_tick(int user_tick, int system);
 extern unsigned long cache_decay_ticks;
-#ifdef CONFIG_NUMA
-extern void sched_balance_exec(void);
-extern void node_nr_running_init(void);
-#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
-	rq->nr_running++
-#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
-	rq->nr_running--
-#else
-#define sched_balance_exec() {}
-#define node_nr_running_init() {}
-#define nr_running_inc(rq) rq->nr_running++
-#define nr_running_dec(rq) rq->nr_running--
-#endif

 #define	MAX_SCHEDULE_TIMEOUT	LONG_MAX
 extern signed long FASTCALL(schedule_timeout(signed long timeout));
@@ -458,8 +444,21 @@ extern void set_cpus_allowed(task_t *p,
 #endif

 #ifdef CONFIG_NUMA
-extern void build_pools(void);
+extern void build_node_data(void);
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+	rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+	rq->nr_running--
+#else
+#define build_node_data() {}
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
 #endif
+
 extern void set_user_nice(task_t *p, long nice);
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
diff -urpN -X /home/fletch/.diff.exclude oldsched/kernel/sched.c 
newsched/kernel/sched.c
--- oldsched/kernel/sched.c	Sat Jan  4 14:57:50 2003
+++ newsched/kernel/sched.c	Sat Jan  4 14:58:28 2003
@@ -159,10 +159,10 @@ struct runqueue {
 	struct list_head migration_queue;

 	atomic_t nr_iowait;
-
+#ifdef CONFIG_NUMA
 	unsigned long wait_time;
 	int wait_node;
-
+#endif
 } ____cacheline_aligned;

 static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -182,24 +182,33 @@ static struct runqueue runqueues[NR_CPUS
 # define task_running(rq, p)		((rq)->curr == (p))
 #endif

-#define cpu_to_node(cpu) __cpu_to_node(cpu)
+/*
+ * Node loads are scaled with LOADSCALE. This way:
+ * - we avoid zeros in integer divisions (dividing by node CPU number),
+ * - loads of nodes with different numbers of CPUs are comparable.
+ */
 #define LOADSCALE 128

 #ifdef CONFIG_NUMA
 /* Number of CPUs per node: sane values until all CPUs are up */
-int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
-int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are 
sorted!)*/
-#define node_ncpus(node)  _node_nr_cpus[node]
+static int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = 
NR_CPUS };
+static int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus 
are sorted!)*/
+#define node_nr_cpus(node)  _node_nr_cpus[node]

+/*
+ * Delay stealing from remote node when own queue is idle/busy. These 
delays
+ * tend to equalize the node loads. NODE_DELAY_IDLE is nonzero because we
+ * want to give idle CPUs in the busiest node a chance to steal first.
+ */
 #define NODE_DELAY_IDLE  (1*HZ/1000)
 #define NODE_DELAY_BUSY  (20*HZ/1000)

 /* the following macro implies that logical CPUs are sorted by node number 
*/
 #define loop_over_node(cpu,node) \
-	for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+	for(cpu = node_ptr[node]; cpu < node_ptr[node+1]; cpu++)

 /*
- * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * Build node_ptr and node_nr_cpus data after all CPUs have come up. This
  * expects that the logical CPUs are sorted by their node numbers! Check
  * out the NUMA API for details on why this should be that way.     [EF]
  */
@@ -209,24 +218,25 @@ void build_node_data(void)
 	unsigned long mask;

 	ptr=0;
-	for (n=0; n<numnodes; n++) {
+	for (n = 0; n < numnodes; n++) {
 		mask = __node_to_cpu_mask(n) & cpu_online_map;
 		node_ptr[n] = ptr;
-		for (cpu=0; cpu<NR_CPUS; cpu++)
+		for (cpu = 0; cpu < NR_CPUS; cpu++)
 			if (mask  & (1UL << cpu))
 				ptr++;
-		node_ncpus(n) = ptr - node_ptr[n];;
+		node_nr_cpus(n) = ptr - node_ptr[n];
 	}
+	node_ptr[numnodes] = ptr;
 	printk("CPU nodes : %d\n",numnodes);
-	for (n=0; n<numnodes; n++)
+	for (n = 0; n < numnodes; n++)
 		printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
 }

 #else
-#define node_ncpus(node)  num_online_cpus()
+#define node_nr_cpus(node)  num_online_cpus()
 #define NODE_DELAY_IDLE 0
 #define NODE_DELAY_BUSY 0
-#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#define loop_over_node(cpu,n) for(cpu = 0; cpu < NR_CPUS; cpu++)
 #endif


@@ -740,18 +750,18 @@ struct node_queue_data {
  * Check if CPUs are balanced. The check is more involved than the O(1) 
original
  * because that one is simply wrong in certain cases (integer arithmetic 
!!!) EF
  */
-#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 
3)/4))
+#define cpus_balanced(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 
3)/4))
 /*
  * Check if nodes are imbalanced. "this" node is balanced (compared to the 
"comp"
  * node) when it's load is not smaller than "comp"'s load - LOADSCALE/2.
  */
-#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+#define nodes_balanced(comp,this) (((comp)-(this)) < LOADSCALE/2)

 static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int 
*nr_running)
 {
 	runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
 	int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
-	int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu);
+	int n, steal_delay, system_load = 0, this_node=__cpu_to_node(this_cpu);
 	struct node_queue_data nd[MAX_NUMNODES], *node;

 	if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
@@ -766,8 +776,9 @@ static inline runqueue_t *find_busiest_q

 	/* compute all node loads and save their max cpu loads */
 	for (cpu = 0; cpu < NR_CPUS; cpu++) {
-		if (!cpu_online(cpu)) continue;
-		node = &nd[cpu_to_node(cpu)];
+		if (!cpu_online(cpu))
+			continue;
+		node = &nd[__cpu_to_node(cpu)];
 		src_rq = cpu_rq(cpu);
 		if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
 			load = src_rq->nr_running;
@@ -783,7 +794,7 @@ static inline runqueue_t *find_busiest_q

 	busiest_cpu = nd[this_node].busiest_cpu;
 	if (busiest_cpu != this_cpu) {
-		if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+		if (!cpus_balanced(nd[this_node].busiest_cpu_load,*nr_running)) {
 			busiest_rq = cpu_rq(busiest_cpu);
 			this_rq->wait_node = -1;
 			goto out;
@@ -795,7 +806,7 @@ static inline runqueue_t *find_busiest_q
 	for (n = 0; n < numnodes; n++) {
 		node = &nd[n];
 		system_load += node->total_load;
-		node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+		node->average_load = node->total_load*LOADSCALE/node_nr_cpus(n);
 		if (node->average_load > max_avg_load) {
 			max_avg_load = node->average_load;
 			busiest_node = n;
@@ -804,7 +815,7 @@ static inline runqueue_t *find_busiest_q

 	/* Exit if not enough imbalance on any remote node. */
 	if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
-	    NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+	    nodes_balanced(max_avg_load,nd[this_node].average_load)) {
 		this_rq->wait_node = -1;
 		goto out;
 	}
@@ -812,7 +823,7 @@ static inline runqueue_t *find_busiest_q
 	busiest_cpu = nd[busiest_node].busiest_cpu;
 	avg_load = system_load*LOADSCALE/num_online_cpus();
 	/* Wait longer before stealing if own node's load is average. */
-	if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+	if (nodes_balanced(avg_load,nd[this_node].average_load))
 		steal_delay = NODE_DELAY_BUSY;
 	else
 		steal_delay = NODE_DELAY_IDLE;
@@ -821,12 +832,11 @@ static inline runqueue_t *find_busiest_q
 		this_rq->wait_node = busiest_node;
 		this_rq->wait_time = jiffies;
 		goto out;
-	} else
 	/* old most loaded node: check if waited enough */
-		if (jiffies - this_rq->wait_time < steal_delay)
-			goto out;
+	} else if (jiffies - this_rq->wait_time < steal_delay)
+		goto out;

-	if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+	if ((!cpus_balanced(nd[busiest_node].busiest_cpu_load,*nr_running))) {
 		busiest_rq = cpu_rq(busiest_cpu);
 		this_rq->wait_node = -1;
 	}
@@ -950,10 +960,10 @@ out:
  * frequency and balancing agressivity depends on whether the CPU is
  * idle or not.
  *
- * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
  * systems with HZ=100, every 10 msecs.)
  */
-#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
 #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)

 static inline void idle_tick(runqueue_t *rq)


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
  2003-01-05  5:35             ` Martin J. Bligh
@ 2003-01-06  3:58               ` Michael Hohnbaum
  2003-01-06  6:07                 ` Martin J. Bligh
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2003-01-06  3:58 UTC (permalink / raw
  To: Martin J. Bligh
  Cc: Erich Focht, Robert Love, Ingo Molnar, Stephen Hemminger,
	linux-kernel

On Sat, 2003-01-04 at 21:35, Martin J. Bligh wrote:
> >> Here comes the minimal NUMA scheduler built on top of the O(1) load
> >> balancer rediffed for 2.5.53 with some changes in the core part. As
> >> suggested by Michael, I added the cputimes_stat patch, as it is
> >> absolutely needed for understanding the scheduler behavior.
> >
> > Thanks for this latest patch.  I've managed to cobble together
> > a 4 node NUMAQ system (16 700 MHZ PIII procs, 16GB memory) and
> > ran kernbench and schedbench on this, along with the 2.5.50 and
> > 2.5.52 versions.  Results remain fairly consistent with what
> > we have been obtaining on earlier versions.
> >
> > Kernbench:
> >                         Elapsed       User     System        CPU
> >              sched50     29.96s   288.308s    83.606s    1240.8%
> >              sched52    29.836s   285.832s    84.464s    1240.4%
> >              sched53    29.364s   284.808s    83.174s    1252.6%
> >              stock50    31.074s   303.664s    89.194s    1264.2%
> >              stock53    31.204s   306.224s    87.776s    1263.2%
> 
> Not sure what you're correlating here because your rows are all named
> the same thing. However, the new version seems to be much slower
> on systime (about 7-8% for me), which roughly correlates with your
> last two rows above. Me no like.

Sorry, I forgot to include a bit better description of what the
row labels mean.  

sched50 = linux 2.5.50 with the NUMA scheduler
sched52 = linux 2.5.52 with the NUMA scheduler
sched53 = linux 2.5.53 with the NUMA scheduler
stock50 = linux 2.5.50 without the NUMA scheduler
stock53 = linux 2.5.53 without the NUMA scheduler

Thus, this shows that the NUMA scheduler drops systime by ~5.5 secs,
or roughly 8%.  So, my testing is not showing an increase in systime
like you apparently are seeing.  

               Michael
-- 
Michael Hohnbaum            503-578-5486
hohnbaum@us.ibm.com         T/L 775-5486


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
  2003-01-06  3:58               ` Michael Hohnbaum
@ 2003-01-06  6:07                 ` Martin J. Bligh
  2003-01-07  2:23                   ` Michael Hohnbaum
  0 siblings, 1 reply; 29+ messages in thread
From: Martin J. Bligh @ 2003-01-06  6:07 UTC (permalink / raw
  To: Michael Hohnbaum
  Cc: Erich Focht, Robert Love, Ingo Molnar, Stephen Hemminger,
	linux-kernel

>> > Kernbench:
>> >                         Elapsed       User     System        CPU
>> >              sched50     29.96s   288.308s    83.606s    1240.8%
>> >              sched52    29.836s   285.832s    84.464s    1240.4%
>> >              sched53    29.364s   284.808s    83.174s    1252.6%
>> >              stock50    31.074s   303.664s    89.194s    1264.2%
>> >              stock53    31.204s   306.224s    87.776s    1263.2%
>>
>> Not sure what you're correlating here because your rows are all named
>> the same thing. However, the new version seems to be much slower
>> on systime (about 7-8% for me), which roughly correlates with your
>> last two rows above. Me no like.
>
> Sorry, I forgot to include a bit better description of what the
> row labels mean.
>
> sched50 = linux 2.5.50 with the NUMA scheduler
> sched52 = linux 2.5.52 with the NUMA scheduler
> sched53 = linux 2.5.53 with the NUMA scheduler
> stock50 = linux 2.5.50 without the NUMA scheduler
> stock53 = linux 2.5.53 without the NUMA scheduler
>
> Thus, this shows that the NUMA scheduler drops systime by ~5.5 secs,
> or roughly 8%.  So, my testing is not showing an increase in systime
> like you apparently are seeing.

Sorry, the row names weren't that bad if I actually read them carefully ;-)

I was doing a slightly different test - Erich's old sched code vs the new
both on 2.5.54, and seem to have a degradation.

M.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
  2003-01-06  6:07                 ` Martin J. Bligh
@ 2003-01-07  2:23                   ` Michael Hohnbaum
  2003-01-07 11:27                     ` Erich Focht
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2003-01-07  2:23 UTC (permalink / raw
  To: Martin J. Bligh
  Cc: Erich Focht, Robert Love, Ingo Molnar, Stephen Hemminger,
	linux-kernel

On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> >> > Kernbench:
> >> >                         Elapsed       User     System        CPU
> >> >              sched50     29.96s   288.308s    83.606s    1240.8%
> >> >              sched52    29.836s   285.832s    84.464s    1240.4%
> >> >              sched53    29.364s   284.808s    83.174s    1252.6%
> >> >              stock50    31.074s   303.664s    89.194s    1264.2%
> >> >              stock53    31.204s   306.224s    87.776s    1263.2%
> >
> > sched50 = linux 2.5.50 with the NUMA scheduler
> > sched52 = linux 2.5.52 with the NUMA scheduler
> > sched53 = linux 2.5.53 with the NUMA scheduler
> > stock50 = linux 2.5.50 without the NUMA scheduler
> > stock53 = linux 2.5.53 without the NUMA scheduler
> 
> I was doing a slightly different test - Erich's old sched code vs the new
> both on 2.5.54, and seem to have a degradation.
> 
> M.

Martin,

I ran 2.5.54 with an older version of Erich's NUMA scheduler and
with the version sent out for 2.5.53.  Results were similar:

Kernbench:
                        Elapsed       User     System        CPU
             sched54    29.112s   283.888s     82.84s    1259.4%
          oldsched54    29.436s   286.942s    82.722s    1256.2%

sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler

The numbers for the new version are actually a touch better, but
close enough to be within a reasonable margin of error. 

I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.

               Michael

-- 
Michael Hohnbaum            503-578-5486
hohnbaum@us.ibm.com         T/L 775-5486


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
  2003-01-07  2:23                   ` Michael Hohnbaum
@ 2003-01-07 11:27                     ` Erich Focht
  2003-01-07 23:35                       ` Michael Hohnbaum
  0 siblings, 1 reply; 29+ messages in thread
From: Erich Focht @ 2003-01-07 11:27 UTC (permalink / raw
  To: Michael Hohnbaum, Martin J. Bligh
  Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel

Hi Michael and Martin,

thanks a lot for the testing!

I rechecked the changes and really don't see any reason for a
slowdown. Michael's measurements seem to confirm that this is just a
statistical effect. I suggest: when in doubt, do 10 kernel compiles
instead of 5. A simple statistical error estimation as I did for
schedbench might help, too. Guess I've sent you the script a while
ago.

I understand from your emails that the 2.5.53 patches apply and work
for 2.5.54, therefore I'll wait for 2.5.55 with a rediff.

Regards,
Erich


On Tuesday 07 January 2003 03:23, Michael Hohnbaum wrote:
> On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> > >> > Kernbench:
> > >> >                         Elapsed       User     System        CPU
> > >> >              sched50     29.96s   288.308s    83.606s    1240.8%
> > >> >              sched52    29.836s   285.832s    84.464s    1240.4%
> > >> >              sched53    29.364s   284.808s    83.174s    1252.6%
> > >> >              stock50    31.074s   303.664s    89.194s    1264.2%
> > >> >              stock53    31.204s   306.224s    87.776s    1263.2%
> > >
> > > sched50 = linux 2.5.50 with the NUMA scheduler
> > > sched52 = linux 2.5.52 with the NUMA scheduler
> > > sched53 = linux 2.5.53 with the NUMA scheduler
> > > stock50 = linux 2.5.50 without the NUMA scheduler
> > > stock53 = linux 2.5.53 without the NUMA scheduler
> >
> > I was doing a slightly different test - Erich's old sched code vs the new
> > both on 2.5.54, and seem to have a degradation.
> >
> > M.
>
> Martin,
>
> I ran 2.5.54 with an older version of Erich's NUMA scheduler and
> with the version sent out for 2.5.53.  Results were similar:
>
> Kernbench:
>                         Elapsed       User     System        CPU
>              sched54    29.112s   283.888s     82.84s    1259.4%
>           oldsched54    29.436s   286.942s    82.722s    1256.2%
>
> sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
> oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler
>
> The numbers for the new version are actually a touch better, but
> close enough to be within a reasonable margin of error.
>
> I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.
>
>                Michael
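
The suggestion above (ten compiles plus a simple error estimate) comes down to computing a mean and a standard error over the runs. A rough sketch of that arithmetic, with invented sample times, is:

/* build with: cc stats.c -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* elapsed seconds from ten hypothetical kernbench runs */
    double t[] = { 29.4, 29.1, 29.6, 29.3, 29.5,
                   29.2, 29.8, 29.0, 29.4, 29.3 };
    int i, n = sizeof(t) / sizeof(t[0]);
    double mean = 0.0, var = 0.0;

    for (i = 0; i < n; i++)
        mean += t[i];
    mean /= n;
    for (i = 0; i < n; i++)
        var += (t[i] - mean) * (t[i] - mean);
    var /= (n - 1);                     /* sample variance */

    printf("mean %.3f s, std dev %.3f s, std error %.3f s\n",
           mean, sqrt(var), sqrt(var / n));
    return 0;
}

If two kernels' means differ by clearly more than a couple of standard errors the difference is probably real; if not, more runs are needed before drawing conclusions.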


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
  2003-01-07 11:27                     ` Erich Focht
@ 2003-01-07 23:35                       ` Michael Hohnbaum
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Hohnbaum @ 2003-01-07 23:35 UTC (permalink / raw
  To: Erich Focht
  Cc: Martin J. Bligh, Robert Love, Ingo Molnar, Stephen Hemminger,
	linux-kernel

On Tue, 2003-01-07 at 03:27, Erich Focht wrote:
> Hi Michael and Martin,
> 
> thanks a lot for the testing!
> 
> I rechecked the changes and really don't see any reason for a
> slowdown. Michael's measurements seem to confirm that this is just a
> statistical effect. I suggest: when in doubt, do 10 kernel compiles
> instead of 5. A simple statistical error estimation as I did for
> schedbench might help, too. Guess I've sent you the script a while
> ago.
> 
One more set of test results, this time including schedbench.  Previous
runs did not have the cputimes_stat patch, so the schedbench numbers
were not particularly useful.

Kernbench:
                        Elapsed       User     System        CPU
          oldsched54     29.75s    287.02s    82.876s    1242.8%
             sched54    29.112s   283.888s     82.84s    1259.4%
             stock54    31.348s   303.134s    87.824s    1247.2%
             sched50     29.96s   288.308s    83.606s    1240.8%
             stock50    31.074s   303.664s    89.194s    1264.2%

Schedbench 4:
                        AvgUser    Elapsed  TotalUser   TotalSys
          oldsched54      20.44      37.41      81.81       0.93
             sched54      22.03      34.90      88.15       0.75
             stock54      49.35      57.53     197.45       0.86
             sched50      22.00      35.50      88.04       0.86
             stock50      27.18      45.99     108.75       0.87

Schedbench 8:
                        AvgUser    Elapsed  TotalUser   TotalSys
          oldsched54      31.98      51.59     255.88       1.93
             sched54      27.95      37.12     223.66       1.50
             stock54      43.14      62.97     345.18       2.12
             sched50      32.05      46.81     256.45       1.81
             stock50      44.75      63.68     358.09       2.55

Schedbench 16:
                        AvgUser    Elapsed  TotalUser   TotalSys
          oldsched54      58.39      80.29     934.38       4.30
             sched54      55.37      69.58     886.10       3.79
             stock54      66.00      81.25    1056.25       7.12
             sched50      55.34      70.74     885.61       4.09
             stock50      65.36      80.72    1045.94       5.92

Schedbench 32:
                        AvgUser    Elapsed  TotalUser   TotalSys
          oldsched54      54.53     119.71    1745.37      11.70
             sched54      57.93     132.11    1854.01      10.74
             stock54      77.81     173.26    2490.31      12.37
             sched50      59.47     139.77    1903.37      10.24
             stock50      84.51     194.89    2704.83      12.44

Schedbench 64:
                        AvgUser    Elapsed  TotalUser   TotalSys
          oldsched54      79.57     339.88    5093.31      20.74
             sched54      72.91     308.87    4667.03      21.06
             stock54      86.68     368.55    5548.57      25.73
             sched50      63.55     276.81    4067.65      21.58
             stock50      99.68     422.06    6380.04      25.92

Row names are the same as before (see below).  Numbers seem to be
fairly consistent.
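
As a quick cross-check of the headline numbers: relative to stock54, sched54 cuts the kernbench elapsed time by (31.348 - 29.112) / 31.348, roughly 7%, and the system time by (87.824 - 82.84) / 87.824, just under 6%.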

> I understand from your emails that the 2.5.53 patches apply and work
> for 2.5.54, therefore I'll wait for 2.5.55 with a rediff.

It is fine with me to wait for 2.5.55 to rediff.  I'm getting patch
errors now with the cputimes_stat patch, so a rediff of that with
2.5.55 would be handy (although I suppose I should quit being lazy
and just do that myself).
> 
> Regards,
> Erich

             Michael
> 
> 
> On Tuesday 07 January 2003 03:23, Michael Hohnbaum wrote:
> > On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> > > >> > Kernbench:
> > > >> >                         Elapsed       User     System        CPU
> > > >> >              sched50     29.96s   288.308s    83.606s    1240.8%
> > > >> >              sched52    29.836s   285.832s    84.464s    1240.4%
> > > >> >              sched53    29.364s   284.808s    83.174s    1252.6%
> > > >> >              stock50    31.074s   303.664s    89.194s    1264.2%
> > > >> >              stock53    31.204s   306.224s    87.776s    1263.2%
> > > >
> > > > sched50 = linux 2.5.50 with the NUMA scheduler
> > > > sched52 = linux 2.5.52 with the NUMA scheduler
> > > > sched53 = linux 2.5.53 with the NUMA scheduler
> > > > stock50 = linux 2.5.50 without the NUMA scheduler
> > > > stock53 = linux 2.5.53 without the NUMA scheduler
> > >
> > > I was doing a slightly different test - Erich's old sched code vs the new
> > > both on 2.5.54, and seem to have a degradation.
> > >
> > > M.
> >
> > Martin,
> >
> > I ran 2.5.54 with an older version of Erich's NUMA scheduler and
> > with the version sent out for 2.5.53.  Results were similar:
> >
> > Kernbench:
> >                         Elapsed       User     System        CPU
> >              sched54    29.112s   283.888s     82.84s    1259.4%
> >           oldsched54    29.436s   286.942s    82.722s    1256.2%
> >
> > sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
> > oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler
> >
> > The numbers for the new version are actually a touch better, but
> > close enough to be within a reasonable margin of error.
> >
> > I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.
> >
> >                Michael
> 
> 
-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486


^ permalink raw reply	[flat|nested] 29+ messages in thread

Thread overview: 29+ messages
2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
2002-11-06 18:10 ` Michael Hohnbaum
2002-11-07 23:05   ` Erich Focht
2002-11-07 23:46 ` Michael Hohnbaum
2002-11-08 16:57   ` Erich Focht
2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
2002-11-11 15:14   ` [PATCH 2.5.47] NUMA scheduler (2/2) Erich Focht
2002-11-12  0:24   ` [PATCH 2.5.47] NUMA scheduler (1/2) Michael Hohnbaum
2002-11-18 19:40 ` NUMA scheduler BK tree Martin J. Bligh
2002-11-19 16:26   ` [PATCH 2.5.48] NUMA scheduler (1/2) Erich Focht
2002-11-19 16:27     ` [PATCH 2.5.48] NUMA scheduler (2/2) Erich Focht
2002-12-02 15:29     ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
2002-12-02 15:30       ` [PATCH 2.5.50] NUMA scheduler (2/2) Erich Focht
2002-12-06 17:39       ` [PATCH 2.5.50] NUMA scheduler (1/2) Michael Hohnbaum
2002-12-18 16:21       ` [PATCH 2.5.52] " Erich Focht
2002-12-18 16:23         ` [PATCH 2.5.52] NUMA scheduler (2/2) Erich Focht
2002-12-20 14:49         ` [PATCH 2.5.52] NUMA scheduler: cputimes stats Erich Focht
2002-12-20 15:17         ` [PATCH 2.5.52] NUMA scheduler (1/2) Christoph Hellwig
2002-12-20 17:44           ` Erich Focht
2002-12-31 13:29         ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
2002-12-31 13:29           ` [PATCH 2.5.53] NUMA scheduler (2/3) Erich Focht
2002-12-31 13:30           ` [PATCH 2.5.53] NUMA scheduler (3/3) Erich Focht
2003-01-04  1:58           ` [PATCH 2.5.53] NUMA scheduler (1/3) Michael Hohnbaum
2003-01-05  5:35             ` Martin J. Bligh
2003-01-06  3:58               ` Michael Hohnbaum
2003-01-06  6:07                 ` Martin J. Bligh
2003-01-07  2:23                   ` Michael Hohnbaum
2003-01-07 11:27                     ` Erich Focht
2003-01-07 23:35                       ` Michael Hohnbaum
