* NUMA scheduler BK tree
@ 2002-11-06 16:34 Erich Focht
2002-11-06 18:10 ` Michael Hohnbaum
` (3 more replies)
0 siblings, 4 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-06 16:34 UTC
To: Michael Hohnbaum, Martin J. Bligh
Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
linux-kernel
Michael, Martin,
in order to make it easier to keep up with the main Linux tree I've
set up a BitKeeper repository with our NUMA scheduler at
bk://numa-ef.bkbits.net/numa-sched
(Web view: http://numa-ef.bkbits.net/)
This used to contain my node-affine NUMA scheduler; I'll add extra
trees once the additional patches for it have been tested on top of our
NUMA scheduler.
Is it OK for you to have it this way, or would you prefer to keep the
core and the initial load balancer separate?
The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
I'll try to keep it that way.
Regards,
Erich
* Re: NUMA scheduler BK tree
2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
@ 2002-11-06 18:10 ` Michael Hohnbaum
2002-11-07 23:05 ` Erich Focht
2002-11-07 23:46 ` Michael Hohnbaum
` (2 subsequent siblings)
3 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2002-11-06 18:10 UTC
To: Erich Focht
Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
Stephen Hemminger, linux-kernel
On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> Michael, Martin,
>
> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
> bk://numa-ef.bkbits.net/numa-sched
> (Web view: http://numa-ef.bkbits.net/)
> This used to contain my node affine NUMA scheduler, I'll add extra
> trees when the additional patches for that are tested on top of our
> NUMA scheduler.
>
> Is it ok for you to have it this way or would you prefer having the
> core and the initial load balancer separate?
>
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep it that way.
Erich,
This is fine with me. Can't the core changes and load
balancer be maintained as separate changesets within the bk
tree?
Michael
> Regards,
> Erich
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* Re: NUMA scheduler BK tree
2002-11-06 18:10 ` Michael Hohnbaum
@ 2002-11-07 23:05 ` Erich Focht
0 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-07 23:05 UTC
To: Michael Hohnbaum
Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
Stephen Hemminger, linux-kernel
On Wednesday 06 November 2002 19:10, Michael Hohnbaum wrote:
> > Is it ok for you to have it this way or would you prefer having the
> > core and the initial load balancer separate?
>
> This is fine with me. Can't the core changes and load
> balancer be maintained as separate changesets within the bk
> tree?
OK, I'll do that. Any idea how I can apply a changeset that has
someone else's name attached to it rather than mine?
Regards,
Erich
* Re: NUMA scheduler BK tree
2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
2002-11-06 18:10 ` Michael Hohnbaum
@ 2002-11-07 23:46 ` Michael Hohnbaum
2002-11-08 16:57 ` Erich Focht
2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
2002-11-18 19:40 ` NUMA scheduler BK tree Martin J. Bligh
3 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2002-11-07 23:46 UTC
To: Erich Focht
Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
Stephen Hemminger, linux-kernel
On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> Michael, Martin,
>
> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
> bk://numa-ef.bkbits.net/numa-sched
> (Web view: http://numa-ef.bkbits.net/)
>
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep it that way.
>
Tested this on a 4-node NUMAQ. Worked fine. Results:

$ reportbench stock46 sched46
Kernbench:
            Elapsed      User     System       CPU
stock46      20.66s  194.062s     53.39s   1197.4%
sched46     19.988s  191.302s    50.692s   1210.4%

Schedbench 4:
            Elapsed  TotalUser   TotalSys   AvgUser
stock46       27.27      40.64     109.13      0.85
sched46       23.10      41.32      92.42      0.76

Schedbench 8:
            Elapsed  TotalUser   TotalSys   AvgUser
stock46       39.18      55.12     313.56      1.68
sched46       34.45      51.24     275.63      2.28

Schedbench 16:
            Elapsed  TotalUser   TotalSys   AvgUser
stock46       56.39      72.44     902.45      5.12
sched46       56.73      71.31     907.88      4.19

Schedbench 32:
            Elapsed  TotalUser   TotalSys   AvgUser
stock46       90.47     203.28    2895.41     10.39
sched46       60.95     143.21    1950.72     10.31

Schedbench 64:
            Elapsed  TotalUser   TotalSys   AvgUser
stock46      105.00     439.04    6720.90     25.02
sched46       59.07     262.98    3781.06     19.59

The schedbench runs were run once each. Kernbench is the average of
5 runs.
Michael
> Regards,
> Erich
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* Re: NUMA scheduler BK tree
2002-11-07 23:46 ` Michael Hohnbaum
@ 2002-11-08 16:57 ` Erich Focht
0 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-08 16:57 UTC
To: Michael Hohnbaum
Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
Stephen Hemminger, linux-kernel
Michael,
thanks for the results, they look really good!
There's a problem with your script for Schedbench: the column labels
are permuted! Instead of
> Elapsed TotalUser TotalSys AvgUser
you should have
> AvgUser Elapsed TotalUser TotalSys
The numbers make more sense like that ;-)
Could you please send me your script for schedbench?
BTW: the BK repository now has two distinct changesets, one for the
"core", the other for the initial balancing. Larry McVoy showed me
how to do things right with the names assigned to the changesets.
Thanks!
Regards,
Erich
On Friday 08 November 2002 00:46, Michael Hohnbaum wrote:
> On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> > in order to make it easier to keep up with the main Linux tree I've
> > set up a bitkeeper repository with our NUMA scheduler at
> > bk://numa-ef.bkbits.net/numa-sched
> > (Web view: http://numa-ef.bkbits.net/)
...
> Tested this on a 4 node NUMAQ. Worked fine. Results:
>
> $reportbench stock46 sched46
> Kernbench:
> Elapsed User System CPU
> stock46 20.66s 194.062s 53.39s 1197.4%
> sched46 19.988s 191.302s 50.692s 1210.4%
>
> Schedbench 4:
> Elapsed TotalUser TotalSys AvgUser
> stock46 27.27 40.64 109.13 0.85
> sched46 23.10 41.32 92.42 0.76
>
> Schedbench 8:
> Elapsed TotalUser TotalSys AvgUser
> stock46 39.18 55.12 313.56 1.68
> sched46 34.45 51.24 275.63 2.28
>
> Schedbench 16:
> Elapsed TotalUser TotalSys AvgUser
> stock46 56.39 72.44 902.45 5.12
> sched46 56.73 71.31 907.88 4.19
>
> Schedbench 32:
> Elapsed TotalUser TotalSys AvgUser
> stock46 90.47 203.28 2895.41 10.39
> sched46 60.95 143.21 1950.72 10.31
>
> Schedbench 64:
> Elapsed TotalUser TotalSys AvgUser
> stock46 105.00 439.04 6720.90 25.02
> sched46 59.07 262.98 3781.06 19.59
>
> The schedbench runs were run once each. Kernbench is the average of
> 5 runs.
>
> Michael
* [PATCH 2.5.47] NUMA scheduler (1/2)
2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
2002-11-06 18:10 ` Michael Hohnbaum
2002-11-07 23:46 ` Michael Hohnbaum
@ 2002-11-11 15:13 ` Erich Focht
2002-11-11 15:14 ` [PATCH 2.5.47] NUMA scheduler (2/2) Erich Focht
2002-11-12 0:24 ` [PATCH 2.5.47] NUMA scheduler (1/2) Michael Hohnbaum
2002-11-18 19:40 ` NUMA scheduler BK tree Martin J. Bligh
3 siblings, 2 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-11 15:13 UTC
To: Michael Hohnbaum, Martin J. Bligh
Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 496 bytes --]
Hi,
here come the NUMA scheduler patches for 2.5.47 which emerged from
Michael's and my work on NUMA schedulers. As usual in two parts:
01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
providing a node aware load_balancer
02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
least loaded node & CPU at exec()
An up-to-date BK tree containing the whole NUMA scheduler can be found
at bk://numa-ef.bkbits.net/numa-sched .
Regards,
Erich
[-- Attachment #2: 01-numa-sched-core-2.5.47-21.patch --]
[-- Type: text/x-diff, Size: 12567 bytes --]
diff -Nru a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c Mon Nov 11 13:33:22 2002
+++ b/arch/i386/kernel/smpboot.c Mon Nov 11 13:33:22 2002
@@ -1196,6 +1196,9 @@
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
void __init smp_intr_init()
diff -Nru a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c Mon Nov 11 13:33:22 2002
+++ b/arch/ia64/kernel/smpboot.c Mon Nov 11 13:33:22 2002
@@ -397,7 +397,7 @@
static void
smp_tune_scheduling (void)
{
- cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */
+ cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */
printk("task migration cache decay timeout: %ld msecs.\n",
(cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,9 @@
printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
int __devinit
diff -Nru a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c Mon Nov 11 13:33:22 2002
+++ b/arch/ppc64/kernel/smp.c Mon Nov 11 13:33:22 2002
@@ -679,4 +679,7 @@
/* XXX fix this, xics currently relies on it - Anton */
smp_threads_ready = 1;
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Mon Nov 11 13:33:22 2002
+++ b/include/linux/sched.h Mon Nov 11 13:33:22 2002
@@ -20,6 +20,7 @@
#include <asm/mmu.h>
#include <linux/smp.h>
+#include <asm/topology.h>
#include <linux/sem.h>
#include <linux/signal.h>
#include <linux/securebits.h>
@@ -449,6 +462,9 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif
+#ifdef CONFIG_NUMA
+extern void build_pools(void);
+#endif
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Mon Nov 11 13:33:22 2002
+++ b/kernel/sched.c Mon Nov 11 13:33:22 2002
@@ -153,6 +153,10 @@
struct list_head migration_queue;
atomic_t nr_iowait;
+
+ unsigned long wait_time;
+ int wait_node;
+
} ____cacheline_aligned;
static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -177,6 +182,54 @@
# define task_running(rq, p) ((rq)->curr == (p))
#endif
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node) _node_nr_cpus[node]
+
+#define NODE_DELAY_IDLE (1*HZ/1000)
+#define NODE_DELAY_BUSY (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+ for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way. [EF]
+ */
+void build_node_data(void)
+{
+ int n, cpu, ptr;
+ unsigned long mask;
+
+ ptr=0;
+ for (n=0; n<numnodes; n++) {
+ mask = __node_to_cpu_mask(n) & cpu_online_map;
+ node_ptr[n] = ptr;
+ for (cpu=0; cpu<NR_CPUS; cpu++)
+ if (mask & (1UL << cpu))
+ ptr++;
+ node_ncpus(n) = ptr - node_ptr[n];
+ }
+ printk("CPU nodes : %d\n",numnodes);
+ for (n=0; n<numnodes; n++)
+ printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_ncpus(node) num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#endif
+
+
/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
@@ -652,81 +705,134 @@
}
/*
- * find_busiest_queue - find the busiest runqueue.
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ *
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more if the own node is (within margins) averagely loaded. The
+ * most loaded node is remembered as well as the time (jiffies). In the
+ * following calls to the load_balancer the time is compared with
+ * NODE_DELAY_BUSY (if load is around the average) or NODE_DELAY_IDLE (if own
+ * node is unloaded) if the most loaded node didn't change. This gives less
+ * loaded nodes the chance to approach the average load but doesn't exclude
+ * busy nodes from stealing (just in case the cpus_allowed mask isn't good
+ * for the idle nodes).
+ * This concept can be extended easily to more than two levels (multi-level
+ * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance
+ * dependent steal delays.
+ *
+ * <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+ int total_load;
+ int average_load;
+ int busiest_cpu;
+ int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. The check is more involved than the O(1) original
+ * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
+ */
+#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when it's load is not smaller than "comp"'s load - LOADSCALE/2.
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
{
- int nr_running, load, max_load, i;
- runqueue_t *busiest, *rq_src;
+ runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+ int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+ int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu);
+ struct node_queue_data nd[MAX_NUMNODES], *node;
- /*
- * We search all runqueues to find the most busy one.
- * We do this lockless to reduce cache-bouncing overhead,
- * we re-check the 'best' source CPU later on again, with
- * the lock held.
- *
- * We fend off statistical fluctuations in runqueue lengths by
- * saving the runqueue length during the previous load-balancing
- * operation and using the smaller one the current and saved lengths.
- * If a runqueue is long enough for a longer amount of time then
- * we recognize it and pull tasks from it.
- *
- * The 'current runqueue length' is a statistical maximum variable,
- * for that one we take the longer one - to avoid fluctuations in
- * the other direction. So for a load-balance to happen it needs
- * stable long runqueue on the target CPU and stable short runqueue
- * on the local runqueue.
- *
- * We make an exception if this CPU is about to become idle - in
- * that case we are less picky about moving a task across CPUs and
- * take what can be taken.
- */
if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
- nr_running = this_rq->nr_running;
+ *nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ *nr_running = this_rq->prev_nr_running[this_cpu];
- busiest = NULL;
- max_load = 1;
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
- continue;
+ for (n = 0; n < numnodes; n++) {
+ nd[n].busiest_cpu_load = -1;
+ nd[n].total_load = 0;
+ }
- rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
- load = rq_src->nr_running;
+ /* compute all node loads and save their max cpu loads */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ if (!cpu_online(cpu)) continue;
+ node = &nd[cpu_to_node(cpu)];
+ src_rq = cpu_rq(cpu);
+ if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+ load = src_rq->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
-
- if ((load > max_load) && (rq_src != this_rq)) {
- busiest = rq_src;
- max_load = load;
+ load = this_rq->prev_nr_running[cpu];
+ this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+ node->total_load += load;
+ if (load > node->busiest_cpu_load) {
+ node->busiest_cpu_load = load;
+ node->busiest_cpu = cpu;
}
}
- if (likely(!busiest))
- goto out;
+ busiest_cpu = nd[this_node].busiest_cpu;
+ if (busiest_cpu != this_cpu) {
+ if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
+ goto out;
+ }
+ }
- *imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+ max_avg_load = -1;
+ for (n = 0; n < numnodes; n++) {
+ node = &nd[n];
+ system_load += node->total_load;
+ node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+ if (node->average_load > max_avg_load) {
+ max_avg_load = node->average_load;
+ busiest_node = n;
+ }
+ }
- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && (*imbalance < (max_load + 3)/4)) {
- busiest = NULL;
+ /* Exit if not enough imbalance on any remote node. */
+ if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+ NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+ this_rq->wait_node = -1;
goto out;
}
- nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
- /*
- * Make sure nothing changed since we checked the
- * runqueue length.
- */
- if (busiest->nr_running <= nr_running + 1) {
- spin_unlock(&busiest->lock);
- busiest = NULL;
+ busiest_cpu = nd[busiest_node].busiest_cpu;
+ avg_load = system_load*LOADSCALE/num_online_cpus();
+ /* Wait longer before stealing if own node's load is average. */
+ if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+ steal_delay = NODE_DELAY_BUSY;
+ else
+ steal_delay = NODE_DELAY_IDLE;
+ /* if we have a new most loaded node: just mark it */
+ if (this_rq->wait_node != busiest_node) {
+ this_rq->wait_node = busiest_node;
+ this_rq->wait_time = jiffies;
+ goto out;
+ } else
+ /* old most loaded node: check if waited enough */
+ if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;
+
+ if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
}
-out:
- return busiest;
+#endif
+ out:
+ return busiest_rq;
}
/*
@@ -758,16 +864,21 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, nr_running, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ busiest = find_busiest_queue(this_cpu, idle, &nr_running);
if (!busiest)
goto out;
+ imbalance = (busiest->nr_running - nr_running)/2;
+
+ nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+ if (busiest->nr_running <= nr_running + 1)
+ goto out_unlock;
/*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
@@ -839,10 +950,10 @@
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
static inline void idle_tick(runqueue_t *rq)
@@ -2080,7 +2191,8 @@
spin_unlock_irqrestore(&rq->lock, flags);
p = req->task;
- cpu_dest = __ffs(p->cpus_allowed);
+ cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
rq_dest = cpu_rq(cpu_dest);
repeat:
cpu_src = task_cpu(p);
* [PATCH 2.5.47] NUMA scheduler (2/2)
2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
@ 2002-11-11 15:14 ` Erich Focht
2002-11-12 0:24 ` [PATCH 2.5.47] NUMA scheduler (1/2) Michael Hohnbaum
1 sibling, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-11 15:14 UTC
To: Michael Hohnbaum, Martin J. Bligh
Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 463 bytes --]
And here's the second part.
Erich
On Monday 11 November 2002 16:13, Erich Focht wrote:
> here come the NUMA scheduler patches for 2.5.47 which emerged from
> Michael's and my work on NUMA schedulers. As usual in two parts:
>
> 01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer
>
> 02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
> least loaded node & CPU at exec()
[-- Attachment #2: 02-numa-sched-ilb-2.5.47-21.patch --]
[-- Type: text/x-diff, Size: 4660 bytes --]
diff -urN a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c Mon Nov 11 04:28:10 2002
+++ b/fs/exec.c Mon Nov 11 14:49:13 2002
@@ -1022,6 +1022,8 @@
int retval;
int i;
+ sched_balance_exec();
+
file = open_exec(filename);
retval = PTR_ERR(file);
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Mon Nov 11 14:44:04 2002
+++ b/include/linux/sched.h Mon Nov 11 14:49:13 2002
@@ -159,7 +159,19 @@
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
-
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -urN a/init/main.c b/init/main.c
--- a/init/main.c Mon Nov 11 04:28:05 2002
+++ b/init/main.c Mon Nov 11 14:49:13 2002
@@ -499,6 +499,7 @@
migration_init();
#endif
+ node_nr_running_init();
spawn_ksoftirqd();
}
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Mon Nov 11 14:44:04 2002
+++ b/kernel/sched.c Mon Nov 11 14:49:13 2002
@@ -153,6 +153,7 @@
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
+ atomic_t * node_ptr;
task_t *migration_thread;
struct list_head migration_queue;
@@ -346,7 +347,7 @@
p->prio = effective_prio(p);
}
enqueue_task(p, array);
- rq->nr_running++;
+ nr_running_inc(rq);
}
/*
@@ -354,7 +355,7 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -841,9 +842,9 @@
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2253,6 +2254,83 @@
#endif
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+ }
+ return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu. Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, load, best_cpu, node = 0;
+
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ minload = 10000000;
+ for (i = 0; i < numnodes; i++) {
+ load = atomic_read(&node_nr_running[i]);
+ if (load < minload) {
+ minload = load;
+ node = i;
+ }
+ }
+ minload = 10000000;
+ loop_over_node(i,node) {
+ if (!cpu_online(i))
+ continue;
+ if (cpu_rq(i)->nr_running < minload) {
+ best_cpu = i;
+ minload = cpu_rq(i)->nr_running;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+#endif /* CONFIG_NUMA */
+
#if CONFIG_SMP || CONFIG_PREEMPT
/*
* The 'big kernel lock'
@@ -2314,6 +2392,10 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+ rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
for (j = 0; j < 2; j++) {
array = rq->arrays + j;
* Re: [PATCH 2.5.47] NUMA scheduler (1/2)
2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
2002-11-11 15:14 ` [PATCH 2.5.47] NUMA scheduler (2/2) Erich Focht
@ 2002-11-12 0:24 ` Michael Hohnbaum
1 sibling, 0 replies; 29+ messages in thread
From: Michael Hohnbaum @ 2002-11-12 0:24 UTC
To: Erich Focht
Cc: Martin J. Bligh, Robert Love, Anton Blanchard, Ingo Molnar,
Stephen Hemminger, linux-kernel
On Mon, 2002-11-11 at 07:13, Erich Focht wrote:
> Hi,
>
> here come the NUMA scheduler patches for 2.5.47 which emerged from
> Michael's and my work on NUMA schedulers. As usual in two parts:
>
> 01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer
>
> 02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
> least loaded node & CPU at exec()
Built, booted, tested on NUMAQ. Results:

$ reportbench sched47 stock47
Kernbench:
            Elapsed      User     System       CPU
sched47      20.39s  191.824s    50.358s   1187.6%
stock47     20.608s  194.536s        53s   1201.4%

Schedbench 4:
            Elapsed  TotalUser   TotalSys   AvgUser
sched47       22.45      34.79      89.84      0.69
stock47       29.05      42.58     116.25      0.94

Schedbench 8:
            Elapsed  TotalUser   TotalSys   AvgUser
sched47       39.58      60.98     316.72      1.78
stock47       43.30      56.84     346.49      2.44

Schedbench 16:
            Elapsed  TotalUser   TotalSys   AvgUser
sched47       58.83      72.58     941.56      5.30
stock47       60.76      82.05     972.31      4.82

Schedbench 32:
            Elapsed  TotalUser   TotalSys   AvgUser
sched47       55.87     122.75    1788.32     11.02
stock47       77.93     179.68    2494.06     11.35

Schedbench 64:
            Elapsed  TotalUser   TotalSys   AvgUser
sched47       85.77     360.49    5490.28     21.13
stock47       94.31     398.85    6036.95     23.47
Michael
> An up-to-date BK tree containing the whole NUMA scheduler can be found
> at bk://numa-ef.bkbits.net/numa-sched .
>
> Regards,
> Erich
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* Re: NUMA scheduler BK tree
2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
` (2 preceding siblings ...)
2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
@ 2002-11-18 19:40 ` Martin J. Bligh
2002-11-19 16:26 ` [PATCH 2.5.48] NUMA scheduler (1/2) Erich Focht
3 siblings, 1 reply; 29+ messages in thread
From: Martin J. Bligh @ 2002-11-18 19:40 UTC
To: Erich Focht, Michael Hohnbaum
Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
linux-kernel
> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
> bk://numa-ef.bkbits.net/numa-sched
> (Web view: http://numa-ef.bkbits.net/)
> This used to contain my node affine NUMA scheduler, I'll add extra
> trees when the additional patches for that are tested on top of our
> NUMA scheduler.
>
> Is it ok for you to have it this way or would you prefer having the
> core and the initial load balancer separate?
>
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep it that way.
BTW, can you keep producing normal patches too, when you do an update?
I don't use bitkeeper ...
Thanks,
Martin.
* [PATCH 2.5.48] NUMA scheduler (1/2)
2002-11-18 19:40 ` NUMA scheduler BK tree Martin J. Bligh
@ 2002-11-19 16:26 ` Erich Focht
2002-11-19 16:27 ` [PATCH 2.5.48] NUMA scheduler (2/2) Erich Focht
2002-12-02 15:29 ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
0 siblings, 2 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-19 16:26 UTC
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 719 bytes --]
As requested in another email, I attach the 2.5.48 patches for the
NUMA scheduler which emerged from Michael's and my work:
01-numa-sched-core-2.5.48-22.patch: core NUMA scheduler infrastructure
providing a node aware load_balancer. Rediffed and fixed one
external declaration.
02-numa-sched-ilb-2.5.48-21.patch: initial load balancing, selects
least loaded node & CPU at exec(). Rediffed.
We are curious about any benchmark results and, of course, about
reports from running on platforms other than NUMAQ and Azusa/TX7.
Regards,
Erich
On Monday 18 November 2002 20:40, Martin J. Bligh wrote:
>
> BTW, can you keep producing normal patches too, when you do an update?
> I don't use bitkeeper ...
[-- Attachment #2: 01-numa-sched-core-2.5.48-22.patch --]
[-- Type: text/x-diff, Size: 12579 bytes --]
diff -Nru a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c Tue Nov 19 17:08:40 2002
+++ b/arch/i386/kernel/smpboot.c Tue Nov 19 17:08:40 2002
@@ -1196,6 +1196,9 @@
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
void __init smp_intr_init()
diff -Nru a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c Tue Nov 19 17:08:40 2002
+++ b/arch/ia64/kernel/smpboot.c Tue Nov 19 17:08:40 2002
@@ -397,7 +397,7 @@
static void
smp_tune_scheduling (void)
{
- cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */
+ cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */
printk("task migration cache decay timeout: %ld msecs.\n",
(cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,9 @@
printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
int __devinit
diff -Nru a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c Tue Nov 19 17:08:40 2002
+++ b/arch/ppc64/kernel/smp.c Tue Nov 19 17:08:40 2002
@@ -679,4 +679,7 @@
/* XXX fix this, xics currently relies on it - Anton */
smp_threads_ready = 1;
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Tue Nov 19 17:08:40 2002
+++ b/include/linux/sched.h Tue Nov 19 17:08:40 2002
@@ -20,6 +20,7 @@
#include <asm/mmu.h>
#include <linux/smp.h>
+#include <asm/topology.h>
#include <linux/sem.h>
#include <linux/signal.h>
#include <linux/securebits.h>
@@ -449,6 +450,9 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Nov 19 17:08:40 2002
+++ b/kernel/sched.c Tue Nov 19 17:08:40 2002
@@ -158,6 +158,10 @@
struct list_head migration_queue;
atomic_t nr_iowait;
+
+ unsigned long wait_time;
+ int wait_node;
+
} ____cacheline_aligned;
static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -177,6 +181,54 @@
# define task_running(rq, p) ((rq)->curr == (p))
#endif
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node) _node_nr_cpus[node]
+
+#define NODE_DELAY_IDLE (1*HZ/1000)
+#define NODE_DELAY_BUSY (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+ for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way. [EF]
+ */
+void build_node_data(void)
+{
+ int n, cpu, ptr;
+ unsigned long mask;
+
+ ptr=0;
+ for (n=0; n<numnodes; n++) {
+ mask = __node_to_cpu_mask(n) & cpu_online_map;
+ node_ptr[n] = ptr;
+ for (cpu=0; cpu<NR_CPUS; cpu++)
+ if (mask & (1UL << cpu))
+ ptr++;
+ node_ncpus(n) = ptr - node_ptr[n];
+ }
+ printk("CPU nodes : %d\n",numnodes);
+ for (n=0; n<numnodes; n++)
+ printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_ncpus(node) num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#endif
+
+
/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
@@ -652,81 +704,134 @@
}
/*
- * find_busiest_queue - find the busiest runqueue.
- */
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
-{
- int nr_running, load, max_load, i;
- runqueue_t *busiest, *rq_src;
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ *
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more when the stealing node's own load is close to the average.
+ * The most loaded node is remembered together with the time (in jiffies).
+ * On subsequent calls to the load balancer, if the most loaded node hasn't
+ * changed, the elapsed time is compared against NODE_DELAY_BUSY (own load
+ * around the average) or NODE_DELAY_IDLE (own node underloaded). This gives
+ * less loaded nodes a chance to approach the average load first, but doesn't
+ * exclude busy nodes from stealing altogether (in case the cpus_allowed
+ * masks leave the idle nodes nothing to steal).
+ * The concept extends easily to more than two levels (multi-level
+ * scheduler), e.g. CPU -> node -> supernode, by implementing node-distance
+ * dependent steal delays.
+ *
+ * <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+ int total_load;
+ int average_load;
+ int busiest_cpu;
+ int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. This check is more involved than the O(1)
+ * original, which gets certain cases wrong due to integer arithmetic. [EF]
+ */
+#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when its load is not smaller than "comp"'s load - LOADSCALE/2.
+ */
+#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
+{
+ runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+ int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+ int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu);
+ struct node_queue_data nd[MAX_NUMNODES], *node;
- /*
- * We search all runqueues to find the most busy one.
- * We do this lockless to reduce cache-bouncing overhead,
- * we re-check the 'best' source CPU later on again, with
- * the lock held.
- *
- * We fend off statistical fluctuations in runqueue lengths by
- * saving the runqueue length during the previous load-balancing
- * operation and using the smaller one the current and saved lengths.
- * If a runqueue is long enough for a longer amount of time then
- * we recognize it and pull tasks from it.
- *
- * The 'current runqueue length' is a statistical maximum variable,
- * for that one we take the longer one - to avoid fluctuations in
- * the other direction. So for a load-balance to happen it needs
- * stable long runqueue on the target CPU and stable short runqueue
- * on the local runqueue.
- *
- * We make an exception if this CPU is about to become idle - in
- * that case we are less picky about moving a task across CPUs and
- * take what can be taken.
- */
if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
- nr_running = this_rq->nr_running;
+ *nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ *nr_running = this_rq->prev_nr_running[this_cpu];
- busiest = NULL;
- max_load = 1;
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
- continue;
+ for (n = 0; n < numnodes; n++) {
+ nd[n].busiest_cpu_load = -1;
+ nd[n].total_load = 0;
+ }
- rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
- load = rq_src->nr_running;
+ /* compute all node loads and save their max cpu loads */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ if (!cpu_online(cpu)) continue;
+ node = &nd[cpu_to_node(cpu)];
+ src_rq = cpu_rq(cpu);
+ if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+ load = src_rq->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
-
- if ((load > max_load) && (rq_src != this_rq)) {
- busiest = rq_src;
- max_load = load;
+ load = this_rq->prev_nr_running[cpu];
+ this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+ node->total_load += load;
+ if (load > node->busiest_cpu_load) {
+ node->busiest_cpu_load = load;
+ node->busiest_cpu = cpu;
}
}
- if (likely(!busiest))
- goto out;
+ busiest_cpu = nd[this_node].busiest_cpu;
+ if (busiest_cpu != this_cpu) {
+ if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
+ goto out;
+ }
+ }
- *imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+ max_avg_load = -1;
+ for (n = 0; n < numnodes; n++) {
+ node = &nd[n];
+ system_load += node->total_load;
+ node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+ if (node->average_load > max_avg_load) {
+ max_avg_load = node->average_load;
+ busiest_node = n;
+ }
+ }
- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && (*imbalance < (max_load + 3)/4)) {
- busiest = NULL;
+ /* Exit if not enough imbalance on any remote node. */
+ if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+ NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+ this_rq->wait_node = -1;
goto out;
}
- nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
- /*
- * Make sure nothing changed since we checked the
- * runqueue length.
- */
- if (busiest->nr_running <= nr_running + 1) {
- spin_unlock(&busiest->lock);
- busiest = NULL;
+ busiest_cpu = nd[busiest_node].busiest_cpu;
+ avg_load = system_load*LOADSCALE/num_online_cpus();
+ /* Wait longer before stealing if own node's load is average. */
+ if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+ steal_delay = NODE_DELAY_BUSY;
+ else
+ steal_delay = NODE_DELAY_IDLE;
+ /* if we have a new most loaded node: just mark it */
+ if (this_rq->wait_node != busiest_node) {
+ this_rq->wait_node = busiest_node;
+ this_rq->wait_time = jiffies;
+ goto out;
+ } else
+ /* old most loaded node: check if waited enough */
+ if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;
+
+ if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
}
-out:
- return busiest;
+#endif
+ out:
+ return busiest_rq;
}
/*
@@ -758,16 +863,21 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, nr_running, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ busiest = find_busiest_queue(this_cpu, idle, &nr_running);
if (!busiest)
goto out;
+ imbalance = (busiest->nr_running - nr_running)/2;
+
+ nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+ if (busiest->nr_running <= nr_running + 1)
+ goto out_unlock;
/*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
@@ -839,10 +949,10 @@
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
static inline void idle_tick(runqueue_t *rq)
@@ -2080,7 +2190,8 @@
spin_unlock_irqrestore(&rq->lock, flags);
p = req->task;
- cpu_dest = __ffs(p->cpus_allowed);
+ cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
rq_dest = cpu_rq(cpu_dest);
repeat:
cpu_src = task_cpu(p);
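[Editor's note: the two balance checks above are easy to misread as macros. Rewritten as plain functions (a direct transcription of CPUS_BALANCED and NODES_BALANCED, usable in user space for experimentation):]

```c
#define LOADSCALE 128

/* Transcription of CPUS_BALANCED(m,t): m is the busiest CPU's queue
 * length, t is this CPU's. Balanced when stealing half the difference
 * would not reach the ~25% imbalance threshold. */
static int cpus_balanced(int m, int t)
{
	return (m <= 1) || ((m - t) / 2 < ((m + t) / 2 + 3) / 4);
}

/* Transcription of NODES_BALANCED(comp,this): "this" node counts as
 * balanced while its scaled load is within LOADSCALE/2 (half a task
 * per CPU) of the "comp" node's scaled load. */
static int nodes_balanced(int comp, int this_load)
{
	return (comp - this_load) < LOADSCALE / 2;
}
```

For example, a 4-CPU node running 6 tasks has scaled load 6*128/4 = 192; against a node at 128 (one task per CPU) the difference is 64, exactly LOADSCALE/2, so the nodes count as imbalanced and a steal becomes eligible.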
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2.5.48] NUMA scheduler (2/2)
2002-11-19 16:26 ` [PATCH 2.5.48] NUMA scheduler (1/2) Erich Focht
@ 2002-11-19 16:27 ` Erich Focht
2002-12-02 15:29 ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
1 sibling, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-11-19 16:27 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Anton Blanchard, Ingo Molnar, Stephen Hemminger,
linux-kernel
[-- Attachment #1: Type: text/plain, Size: 850 bytes --]
And here is the second patch...
Erich
On Tuesday 19 November 2002 17:26, Erich Focht wrote:
> As requested in another email, I attach the 2.5.48 patches for the
> NUMA scheduler which emerged from Michael's and my work:
>
> 01-numa-sched-core-2.5.48-22.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer. Rediffed and fixed one
> external declaration.
>
> 02-numa-sched-ilb-2.5.48-21.patch: initial load balancing, selects
> least loaded node & CPU at exec(). Rediffed.
>
> We are curious about any benchmark results and, of course, about running
> on other platforms than NUMAQ & Azusa/TX7, too.
>
> Regards,
> Erich
>
> On Monday 18 November 2002 20:40, Martin J. Bligh wrote:
> > BTW, can you keep producing normal patches too, when you do an update?
> > I don't use bitkeeper ...
[-- Attachment #2: 02-numa-sched-ilb-2.5.48-21.patch --]
[-- Type: text/x-diff, Size: 4660 bytes --]
diff -Nru a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c Tue Nov 19 17:09:02 2002
+++ b/fs/exec.c Tue Nov 19 17:09:02 2002
@@ -1023,6 +1023,8 @@
int retval;
int i;
+ sched_balance_exec();
+
file = open_exec(filename);
retval = PTR_ERR(file);
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Tue Nov 19 17:09:02 2002
+++ b/include/linux/sched.h Tue Nov 19 17:09:02 2002
@@ -159,7 +159,19 @@
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
-
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -Nru a/init/main.c b/init/main.c
--- a/init/main.c Tue Nov 19 17:09:02 2002
+++ b/init/main.c Tue Nov 19 17:09:02 2002
@@ -500,6 +500,7 @@
migration_init();
#endif
+ node_nr_running_init();
spawn_ksoftirqd();
}
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Nov 19 17:09:02 2002
+++ b/kernel/sched.c Tue Nov 19 17:09:02 2002
@@ -153,6 +153,7 @@
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
+ atomic_t * node_ptr;
task_t *migration_thread;
struct list_head migration_queue;
@@ -346,7 +347,7 @@
p->prio = effective_prio(p);
}
enqueue_task(p, array);
- rq->nr_running++;
+ nr_running_inc(rq);
}
/*
@@ -354,7 +355,7 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -841,9 +842,9 @@
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2253,6 +2254,83 @@
#endif
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+ }
+ return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu. Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, load, best_cpu, node = 0;
+
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ minload = 10000000;
+ for (i = 0; i < numnodes; i++) {
+ load = atomic_read(&node_nr_running[i]);
+ if (load < minload) {
+ minload = load;
+ node = i;
+ }
+ }
+ minload = 10000000;
+ loop_over_node(i,node) {
+ if (!cpu_online(i))
+ continue;
+ if (cpu_rq(i)->nr_running < minload) {
+ best_cpu = i;
+ minload = cpu_rq(i)->nr_running;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+#endif /* CONFIG_NUMA */
+
#if CONFIG_SMP || CONFIG_PREEMPT
/*
* The 'big kernel lock'
@@ -2314,6 +2392,10 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+ rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
for (j = 0; j < 2; j++) {
array = rq->arrays + j;
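[Editor's note: the initial load balancer above picks a CPU at exec() time in two stages. A user-space sketch of that selection (the load arrays are hypothetical; it omits the patch's early exit when the current runqueue holds at most two tasks):]

```c
#define MAX_NUMNODES 4
#define NR_CPUS      8

/* Hypothetical snapshot: per-node runnable-task counters, per-CPU
 * runqueue lengths; node n owns CPUs [node_first[n], node_first[n+1]). */
static int node_load[MAX_NUMNODES]     = { 5, 2 };
static int cpu_nr_running[NR_CPUS]     = { 2, 1, 1, 1, 1, 0, 2, 1 };
static int node_first[MAX_NUMNODES + 1] = { 0, 4, 8 };
static int numnodes = 2;

/* Two-stage selection as in sched_best_cpu(): first the least loaded
 * node by its node counter, then the least loaded CPU inside it. */
static int best_cpu(void)
{
	int i, minload = 10000000, node = 0, best = 0;

	for (i = 0; i < numnodes; i++)
		if (node_load[i] < minload) {
			minload = node_load[i];
			node = i;
		}
	minload = 10000000;
	for (i = node_first[node]; i < node_first[node + 1]; i++)
		if (cpu_nr_running[i] < minload) {
			best = i;
			minload = cpu_nr_running[i];
		}
	return best;
}
```

With the sample snapshot, node 1 (2 tasks) beats node 0 (5 tasks), and its idle CPU 5 wins the second stage.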
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2.5.50] NUMA scheduler (1/2)
2002-11-19 16:26 ` [PATCH 2.5.48] NUMA scheduler (1/2) Erich Focht
2002-11-19 16:27 ` [PATCH 2.5.48] NUMA scheduler (2/2) Erich Focht
@ 2002-12-02 15:29 ` Erich Focht
2002-12-02 15:30 ` [PATCH 2.5.50] NUMA scheduler (2/2) Erich Focht
` (2 more replies)
1 sibling, 3 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-02 15:29 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 193 bytes --]
Here come the NUMA scheduler patches rediffed for 2.5.50. No
functional changes since last version (
http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2 ).
Regards,
Erich
[-- Attachment #2: 01-numa-sched-core-2.5.50-22.patch --]
[-- Type: text/x-diff, Size: 12681 bytes --]
diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c 2002-11-27 23:36:00.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c 2002-11-29 12:51:12.000000000 +0100
@@ -1202,6 +1202,9 @@
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
void __init smp_intr_init()
diff -urN a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c 2002-11-27 23:35:53.000000000 +0100
+++ b/arch/ia64/kernel/smpboot.c 2002-11-29 12:51:12.000000000 +0100
@@ -397,7 +397,7 @@
static void
smp_tune_scheduling (void)
{
- cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */
+ cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */
printk("task migration cache decay timeout: %ld msecs.\n",
(cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,9 @@
printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
int __devinit
diff -urN a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c 2002-11-27 23:36:16.000000000 +0100
+++ b/arch/ppc64/kernel/smp.c 2002-11-29 12:51:12.000000000 +0100
@@ -679,4 +679,7 @@
/* XXX fix this, xics currently relies on it - Anton */
smp_threads_ready = 1;
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2002-11-27 23:35:49.000000000 +0100
+++ b/include/linux/sched.h 2002-11-29 12:51:12.000000000 +0100
@@ -20,6 +20,7 @@
#include <asm/mmu.h>
#include <linux/smp.h>
+#include <asm/topology.h>
#include <linux/sem.h>
#include <linux/signal.h>
#include <linux/securebits.h>
@@ -450,6 +451,9 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2002-11-27 23:36:17.000000000 +0100
+++ b/kernel/sched.c 2002-11-29 12:51:12.000000000 +0100
@@ -158,6 +158,10 @@
struct list_head migration_queue;
atomic_t nr_iowait;
+
+ unsigned long wait_time;
+ int wait_node;
+
} ____cacheline_aligned;
static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -177,6 +181,54 @@
# define task_running(rq, p) ((rq)->curr == (p))
#endif
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node) _node_nr_cpus[node]
+
+#define NODE_DELAY_IDLE (1*HZ/1000)
+#define NODE_DELAY_BUSY (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+ for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way. [EF]
+ */
+void build_node_data(void)
+{
+ int n, cpu, ptr;
+ unsigned long mask;
+
+ ptr=0;
+ for (n=0; n<numnodes; n++) {
+ mask = __node_to_cpu_mask(n) & cpu_online_map;
+ node_ptr[n] = ptr;
+ for (cpu=0; cpu<NR_CPUS; cpu++)
+ if (mask & (1UL << cpu))
+ ptr++;
+ node_ncpus(n) = ptr - node_ptr[n];
+ }
+ printk("CPU nodes : %d\n",numnodes);
+ for (n=0; n<numnodes; n++)
+ printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_ncpus(node) num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#endif
+
+
/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
@@ -652,81 +704,134 @@
}
/*
- * find_busiest_queue - find the busiest runqueue.
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ *
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more when the stealing node's own load is close to the average.
+ * The most loaded node is remembered together with the time (in jiffies).
+ * On subsequent calls to the load balancer, if the most loaded node hasn't
+ * changed, the elapsed time is compared against NODE_DELAY_BUSY (own load
+ * around the average) or NODE_DELAY_IDLE (own node underloaded). This gives
+ * less loaded nodes a chance to approach the average load first, but doesn't
+ * exclude busy nodes from stealing altogether (in case the cpus_allowed
+ * masks leave the idle nodes nothing to steal).
+ * The concept extends easily to more than two levels (multi-level
+ * scheduler), e.g. CPU -> node -> supernode, by implementing node-distance
+ * dependent steal delays.
+ *
+ * <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+ int total_load;
+ int average_load;
+ int busiest_cpu;
+ int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. This check is more involved than the O(1)
+ * original, which gets certain cases wrong due to integer arithmetic. [EF]
+ */
+#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when its load is not smaller than "comp"'s load - LOADSCALE/2.
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
{
- int nr_running, load, max_load, i;
- runqueue_t *busiest, *rq_src;
+ runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+ int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+ int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu);
+ struct node_queue_data nd[MAX_NUMNODES], *node;
- /*
- * We search all runqueues to find the most busy one.
- * We do this lockless to reduce cache-bouncing overhead,
- * we re-check the 'best' source CPU later on again, with
- * the lock held.
- *
- * We fend off statistical fluctuations in runqueue lengths by
- * saving the runqueue length during the previous load-balancing
- * operation and using the smaller one the current and saved lengths.
- * If a runqueue is long enough for a longer amount of time then
- * we recognize it and pull tasks from it.
- *
- * The 'current runqueue length' is a statistical maximum variable,
- * for that one we take the longer one - to avoid fluctuations in
- * the other direction. So for a load-balance to happen it needs
- * stable long runqueue on the target CPU and stable short runqueue
- * on the local runqueue.
- *
- * We make an exception if this CPU is about to become idle - in
- * that case we are less picky about moving a task across CPUs and
- * take what can be taken.
- */
if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
- nr_running = this_rq->nr_running;
+ *nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ *nr_running = this_rq->prev_nr_running[this_cpu];
- busiest = NULL;
- max_load = 1;
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
- continue;
+ for (n = 0; n < numnodes; n++) {
+ nd[n].busiest_cpu_load = -1;
+ nd[n].total_load = 0;
+ }
- rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
- load = rq_src->nr_running;
+ /* compute all node loads and save their max cpu loads */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ if (!cpu_online(cpu)) continue;
+ node = &nd[cpu_to_node(cpu)];
+ src_rq = cpu_rq(cpu);
+ if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+ load = src_rq->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
-
- if ((load > max_load) && (rq_src != this_rq)) {
- busiest = rq_src;
- max_load = load;
+ load = this_rq->prev_nr_running[cpu];
+ this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+ node->total_load += load;
+ if (load > node->busiest_cpu_load) {
+ node->busiest_cpu_load = load;
+ node->busiest_cpu = cpu;
}
}
- if (likely(!busiest))
- goto out;
+ busiest_cpu = nd[this_node].busiest_cpu;
+ if (busiest_cpu != this_cpu) {
+ if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
+ goto out;
+ }
+ }
- *imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+ max_avg_load = -1;
+ for (n = 0; n < numnodes; n++) {
+ node = &nd[n];
+ system_load += node->total_load;
+ node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+ if (node->average_load > max_avg_load) {
+ max_avg_load = node->average_load;
+ busiest_node = n;
+ }
+ }
- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && (*imbalance < (max_load + 3)/4)) {
- busiest = NULL;
+ /* Exit if not enough imbalance on any remote node. */
+ if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+ NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+ this_rq->wait_node = -1;
goto out;
}
- nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
- /*
- * Make sure nothing changed since we checked the
- * runqueue length.
- */
- if (busiest->nr_running <= nr_running + 1) {
- spin_unlock(&busiest->lock);
- busiest = NULL;
+ busiest_cpu = nd[busiest_node].busiest_cpu;
+ avg_load = system_load*LOADSCALE/num_online_cpus();
+ /* Wait longer before stealing if own node's load is average. */
+ if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+ steal_delay = NODE_DELAY_BUSY;
+ else
+ steal_delay = NODE_DELAY_IDLE;
+ /* if we have a new most loaded node: just mark it */
+ if (this_rq->wait_node != busiest_node) {
+ this_rq->wait_node = busiest_node;
+ this_rq->wait_time = jiffies;
+ goto out;
+ } else
+ /* old most loaded node: check if waited enough */
+ if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;
+
+ if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
}
-out:
- return busiest;
+#endif
+ out:
+ return busiest_rq;
}
/*
@@ -758,16 +863,21 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, nr_running, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ busiest = find_busiest_queue(this_cpu, idle, &nr_running);
if (!busiest)
goto out;
+ imbalance = (busiest->nr_running - nr_running)/2;
+
+ nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+ if (busiest->nr_running <= nr_running + 1)
+ goto out_unlock;
/*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
@@ -839,10 +949,10 @@
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
static inline void idle_tick(runqueue_t *rq)
@@ -2095,7 +2205,8 @@
spin_unlock_irqrestore(&rq->lock, flags);
p = req->task;
- cpu_dest = __ffs(p->cpus_allowed);
+ cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
rq_dest = cpu_rq(cpu_dest);
repeat:
cpu_src = task_cpu(p);
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2.5.50] NUMA scheduler (2/2)
2002-12-02 15:29 ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
@ 2002-12-02 15:30 ` Erich Focht
2002-12-06 17:39 ` [PATCH 2.5.50] NUMA scheduler (1/2) Michael Hohnbaum
2002-12-18 16:21 ` [PATCH 2.5.52] " Erich Focht
2 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-02 15:30 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 289 bytes --]
... and the second part ...
On Monday 02 December 2002 16:29, Erich Focht wrote:
> Here come the NUMA scheduler patches rediffed for 2.5.50. No
> functional changes since last version (
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2 ).
>
> Regards,
> Erich
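[Editor's note: the ilb patch below keeps a per-node runnable-task counter in step with each runqueue's nr_running, via the nr_running_inc()/nr_running_dec() wrappers and a cached node_ptr per runqueue. A minimal user-space sketch of that bookkeeping using C11 atomics instead of the kernel's atomic_t (the struct shape is reduced to the two relevant fields):]

```c
#include <stdatomic.h>

#define MAX_NUMNODES 4

/* Per-node count of runnable tasks, as in the patch's node_nr_running[];
 * each runqueue caches a pointer to its own node's counter. */
static atomic_int node_nr_running[MAX_NUMNODES];

struct runqueue {
	int nr_running;       /* this CPU's queue length */
	atomic_int *node_ptr; /* this CPU's node counter */
};

/* Sketches of nr_running_inc()/nr_running_dec(): update the per-node
 * counter and the per-runqueue count together, so sched_best_cpu() can
 * read node loads without summing all runqueues. */
static void nr_running_inc(struct runqueue *rq)
{
	atomic_fetch_add(rq->node_ptr, 1);
	rq->nr_running++;
}

static void nr_running_dec(struct runqueue *rq)
{
	atomic_fetch_sub(rq->node_ptr, 1);
	rq->nr_running--;
}
```

The atomic counter is what makes the cross-CPU reads in the exec-time balancer cheap: a single load per node instead of a walk over that node's runqueues.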
[-- Attachment #2: 02-numa-sched-ilb-2.5.50-21.patch --]
[-- Type: text/x-diff, Size: 4748 bytes --]
diff -urN a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c 2002-11-27 23:35:57.000000000 +0100
+++ b/fs/exec.c 2002-11-29 17:44:19.000000000 +0100
@@ -1022,6 +1022,8 @@
int retval;
int i;
+ sched_balance_exec();
+
file = open_exec(filename);
retval = PTR_ERR(file);
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2002-11-29 12:51:12.000000000 +0100
+++ b/include/linux/sched.h 2002-11-29 17:44:19.000000000 +0100
@@ -160,7 +160,19 @@
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
-
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -urN a/init/main.c b/init/main.c
--- a/init/main.c 2002-11-27 23:35:51.000000000 +0100
+++ b/init/main.c 2002-11-29 17:44:19.000000000 +0100
@@ -500,6 +500,7 @@
migration_init();
#endif
+ node_nr_running_init();
spawn_ksoftirqd();
}
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2002-11-29 12:51:12.000000000 +0100
+++ b/kernel/sched.c 2002-11-29 17:44:19.000000000 +0100
@@ -153,6 +153,7 @@
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
+ atomic_t * node_ptr;
task_t *migration_thread;
struct list_head migration_queue;
@@ -346,7 +347,7 @@
p->prio = effective_prio(p);
}
enqueue_task(p, array);
- rq->nr_running++;
+ nr_running_inc(rq);
}
/*
@@ -354,7 +355,7 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -841,9 +842,9 @@
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2268,6 +2269,83 @@
#endif
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+ }
+ return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu. Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, load, best_cpu, node = 0;
+
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ minload = 10000000;
+ for (i = 0; i < numnodes; i++) {
+ load = atomic_read(&node_nr_running[i]);
+ if (load < minload) {
+ minload = load;
+ node = i;
+ }
+ }
+ minload = 10000000;
+ loop_over_node(i,node) {
+ if (!cpu_online(i))
+ continue;
+ if (cpu_rq(i)->nr_running < minload) {
+ best_cpu = i;
+ minload = cpu_rq(i)->nr_running;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+#endif /* CONFIG_NUMA */
+
#if CONFIG_SMP || CONFIG_PREEMPT
/*
* The 'big kernel lock'
@@ -2329,6 +2407,10 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+ rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
for (j = 0; j < 2; j++) {
array = rq->arrays + j;
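The selection policy in sched_best_cpu() above is easy to exercise in userspace. The sketch below mirrors its two passes (least loaded node first, then least loaded CPU within that node) against made-up load arrays standing in for node_nr_running[] and the per-CPU runqueue lengths; the "stay put if nr_running <= 2" shortcut and the cpu_online() check are omitted for brevity:

```c
#include <assert.h>

#define NODES         2
#define CPUS_PER_NODE 2

/* Stand-ins for node_nr_running[] and rq->nr_running (invented data). */
static int node_load[NODES] = { 5, 2 };                 /* tasks per node */
static int cpu_load[NODES][CPUS_PER_NODE] = { {3, 2},   /* tasks per CPU  */
                                              {2, 0} };

/* Mirrors sched_best_cpu(): least loaded node, then least loaded CPU. */
static int best_cpu(void)
{
    int n, c, node = 0, best = 0, minload = 10000000;

    for (n = 0; n < NODES; n++)
        if (node_load[n] < minload) {
            minload = node_load[n];
            node = n;
        }

    minload = 10000000;
    for (c = 0; c < CPUS_PER_NODE; c++)
        if (cpu_load[node][c] < minload) {
            minload = cpu_load[node][c];
            best = node * CPUS_PER_NODE + c;    /* logical CPUs sorted by node */
        }
    return best;
}
```

With the loads above, node 1 (total 2) wins the first pass and its idle second CPU (global CPU 3) wins the second; sched_balance_exec() would then migrate the exec'ing task there via sched_migrate_task().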
* Re: [PATCH 2.5.50] NUMA scheduler (1/2)
2002-12-02 15:29 ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
2002-12-02 15:30 ` [PATCH 2.5.50] NUMA scheduler (2/2) Erich Focht
@ 2002-12-06 17:39 ` Michael Hohnbaum
2002-12-18 16:21 ` [PATCH 2.5.52] " Erich Focht
2 siblings, 0 replies; 29+ messages in thread
From: Michael Hohnbaum @ 2002-12-06 17:39 UTC (permalink / raw
To: Erich Focht
Cc: Martin J. Bligh, Robert Love, Ingo Molnar, Stephen Hemminger,
linux-kernel
On Mon, 2002-12-02 at 07:29, Erich Focht wrote:
> Here come the NUMA scheduler patches rediffed for 2.5.50. No
> functional changes since last version (
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2 ).
Tested on NUMAQ. Ran into the problem of the missing CPU stats from
/proc/self/cpu, so initial test results were worthless. Applied Erich's
patch that restored this information and ran on NUMAQ again. As with
previous versions, the result was a modest performance gain on kernbench
and a more significant gain on Erich's memory-intensive test (fondly
referred to here as schedbench).
Kernbench:
Elapsed User System CPU
stock50a 20.92s 194.12s 53.15s 1182%
sched50 20.002s 191.976s 51.17s 1215%
Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
stock50a 29.50 43.11 118.02 0.83
sched50 33.93 47.17 135.76 0.74
Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
stock50a 46.88 62.51 375.12 1.78
sched50 33.90 47.31 271.30 1.98
Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
stock50a 56.26 71.17 900.32 6.23
sched50 57.22 73.34 915.76 4.53
Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
stock50a 84.86 191.48 2715.76 10.93
sched50 57.36 130.88 1835.76 10.46
Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
stock50a 114.91 474.55 7355.45 24.95
sched50 80.20 338.04 5133.56 22.70
In other testing we are seeing unexpectedly high idle times. Andrea has
patches against the 2.4 port of the O(1) scheduler that are supposed to
help with this. I plan to try those out to see if they reduce our idle time.
Michael
>
> Regards,
> Erich
> ----
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* [PATCH 2.5.52] NUMA scheduler (1/2)
2002-12-02 15:29 ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
2002-12-02 15:30 ` [PATCH 2.5.50] NUMA scheduler (2/2) Erich Focht
2002-12-06 17:39 ` [PATCH 2.5.50] NUMA scheduler (1/2) Michael Hohnbaum
@ 2002-12-18 16:21 ` Erich Focht
2002-12-18 16:23 ` [PATCH 2.5.52] NUMA scheduler (2/2) Erich Focht
` (3 more replies)
2 siblings, 4 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-18 16:21 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 438 bytes --]
This is the first part of the minimal NUMA scheduler built on top of
the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
bugfix in build_node_data().
The two patches are:
01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
providing a node aware load_balancer.
02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
least loaded node & CPU at exec().
Regards,
Erich
[-- Attachment #2: 01-numa-sched-core-2.5.52-23.patch --]
[-- Type: text/x-diff, Size: 12708 bytes --]
diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c 2002-12-16 03:07:56.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c 2002-12-18 16:53:12.000000000 +0100
@@ -1202,6 +1202,9 @@
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
void __init smp_intr_init()
diff -urN a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c 2002-12-16 03:07:50.000000000 +0100
+++ b/arch/ia64/kernel/smpboot.c 2002-12-18 16:53:13.000000000 +0100
@@ -397,7 +397,7 @@
static void
smp_tune_scheduling (void)
{
- cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */
+ cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */
printk("task migration cache decay timeout: %ld msecs.\n",
(cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,9 @@
printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
int __devinit
diff -urN a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c 2002-12-16 03:08:13.000000000 +0100
+++ b/arch/ppc64/kernel/smp.c 2002-12-18 16:53:13.000000000 +0100
@@ -679,4 +679,7 @@
/* XXX fix this, xics currently relies on it - Anton */
smp_threads_ready = 1;
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
}
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2002-12-16 03:07:44.000000000 +0100
+++ b/include/linux/sched.h 2002-12-18 16:53:13.000000000 +0100
@@ -20,6 +20,7 @@
#include <asm/mmu.h>
#include <linux/smp.h>
+#include <asm/topology.h>
#include <linux/sem.h>
#include <linux/signal.h>
#include <linux/securebits.h>
@@ -446,6 +447,9 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2002-12-16 03:08:14.000000000 +0100
+++ b/kernel/sched.c 2002-12-18 16:53:13.000000000 +0100
@@ -158,6 +158,10 @@
struct list_head migration_queue;
atomic_t nr_iowait;
+
+ unsigned long wait_time;
+ int wait_node;
+
} ____cacheline_aligned;
static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -177,6 +181,55 @@
# define task_running(rq, p) ((rq)->curr == (p))
#endif
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node) _node_nr_cpus[node]
+
+#define NODE_DELAY_IDLE (1*HZ/1000)
+#define NODE_DELAY_BUSY (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+ for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way. [EF]
+ */
+void build_node_data(void)
+{
+ int n, cpu, ptr;
+ unsigned long mask;
+
+ ptr=0;
+ for (n=0; n<numnodes; n++) {
+ mask = __node_to_cpu_mask(n) & cpu_online_map;
+ node_ptr[n] = ptr;
+ for (cpu=0; cpu<NR_CPUS; cpu++)
+ if (mask & (1UL << cpu))
+ ptr++;
+ node_ncpus(n) = ptr - node_ptr[n];
+ }
+ node_ptr[numnodes] = ptr;
+ printk("CPU nodes : %d\n",numnodes);
+ for (n=0; n<numnodes; n++)
+ printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_ncpus(node) num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#endif
+
+
/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
@@ -652,81 +705,134 @@
}
/*
- * find_busiest_queue - find the busiest runqueue.
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ *
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more if the own node is (within margins) averagely loaded. The
+ * most loaded node is remembered as well as the time (jiffies). In the
+ * following calls to the load_balancer the time is compared with
+ * NODE_DELAY_BUSY (if load is around the average) or NODE_DELAY_IDLE (if own
+ * node is unloaded) if the most loaded node didn't change. This gives less
+ * loaded nodes the chance to approach the average load but doesn't exclude
+ * busy nodes from stealing (just in case the cpus_allowed mask isn't good
+ * for the idle nodes).
+ * This concept can be extended easily to more than two levels (multi-level
+ * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance
+ * dependent steal delays.
+ *
+ * <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+ int total_load;
+ int average_load;
+ int busiest_cpu;
+ int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. The check is more involved than the O(1) original
+ * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
+ */
+#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when its load is not smaller than "comp"'s load - LOADSCALE/2.
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
{
- int nr_running, load, max_load, i;
- runqueue_t *busiest, *rq_src;
+ runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+ int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+ int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu);
+ struct node_queue_data nd[MAX_NUMNODES], *node;
- /*
- * We search all runqueues to find the most busy one.
- * We do this lockless to reduce cache-bouncing overhead,
- * we re-check the 'best' source CPU later on again, with
- * the lock held.
- *
- * We fend off statistical fluctuations in runqueue lengths by
- * saving the runqueue length during the previous load-balancing
- * operation and using the smaller one the current and saved lengths.
- * If a runqueue is long enough for a longer amount of time then
- * we recognize it and pull tasks from it.
- *
- * The 'current runqueue length' is a statistical maximum variable,
- * for that one we take the longer one - to avoid fluctuations in
- * the other direction. So for a load-balance to happen it needs
- * stable long runqueue on the target CPU and stable short runqueue
- * on the local runqueue.
- *
- * We make an exception if this CPU is about to become idle - in
- * that case we are less picky about moving a task across CPUs and
- * take what can be taken.
- */
if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
- nr_running = this_rq->nr_running;
+ *nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ *nr_running = this_rq->prev_nr_running[this_cpu];
- busiest = NULL;
- max_load = 1;
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
- continue;
+ for (n = 0; n < numnodes; n++) {
+ nd[n].busiest_cpu_load = -1;
+ nd[n].total_load = 0;
+ }
- rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
- load = rq_src->nr_running;
+ /* compute all node loads and save their max cpu loads */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ if (!cpu_online(cpu)) continue;
+ node = &nd[cpu_to_node(cpu)];
+ src_rq = cpu_rq(cpu);
+ if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+ load = src_rq->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
-
- if ((load > max_load) && (rq_src != this_rq)) {
- busiest = rq_src;
- max_load = load;
+ load = this_rq->prev_nr_running[cpu];
+ this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+ node->total_load += load;
+ if (load > node->busiest_cpu_load) {
+ node->busiest_cpu_load = load;
+ node->busiest_cpu = cpu;
}
}
- if (likely(!busiest))
- goto out;
+ busiest_cpu = nd[this_node].busiest_cpu;
+ if (busiest_cpu != this_cpu) {
+ if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
+ goto out;
+ }
+ }
- *imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+ max_avg_load = -1;
+ for (n = 0; n < numnodes; n++) {
+ node = &nd[n];
+ system_load += node->total_load;
+ node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+ if (node->average_load > max_avg_load) {
+ max_avg_load = node->average_load;
+ busiest_node = n;
+ }
+ }
- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && (*imbalance < (max_load + 3)/4)) {
- busiest = NULL;
+ /* Exit if not enough imbalance on any remote node. */
+ if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+ NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+ this_rq->wait_node = -1;
goto out;
}
- nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
- /*
- * Make sure nothing changed since we checked the
- * runqueue length.
- */
- if (busiest->nr_running <= nr_running + 1) {
- spin_unlock(&busiest->lock);
- busiest = NULL;
+ busiest_cpu = nd[busiest_node].busiest_cpu;
+ avg_load = system_load*LOADSCALE/num_online_cpus();
+ /* Wait longer before stealing if own node's load is average. */
+ if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+ steal_delay = NODE_DELAY_BUSY;
+ else
+ steal_delay = NODE_DELAY_IDLE;
+ /* if we have a new most loaded node: just mark it */
+ if (this_rq->wait_node != busiest_node) {
+ this_rq->wait_node = busiest_node;
+ this_rq->wait_time = jiffies;
+ goto out;
+ } else
+ /* old most loaded node: check if waited enough */
+ if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;
+
+ if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
}
-out:
- return busiest;
+#endif
+ out:
+ return busiest_rq;
}
/*
@@ -758,16 +864,21 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, nr_running, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ busiest = find_busiest_queue(this_cpu, idle, &nr_running);
if (!busiest)
goto out;
+ imbalance = (busiest->nr_running - nr_running)/2;
+
+ nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+ if (busiest->nr_running <= nr_running + 1)
+ goto out_unlock;
/*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
@@ -839,10 +950,10 @@
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
static inline void idle_tick(runqueue_t *rq)
@@ -2110,7 +2221,8 @@
spin_unlock_irqrestore(&rq->lock, flags);
p = req->task;
- cpu_dest = __ffs(p->cpus_allowed);
+ cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
rq_dest = cpu_rq(cpu_dest);
repeat:
cpu_src = task_cpu(p);
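The integer balance checks and the steal-delay bookkeeping introduced in this patch can be tried out in plain C. The two macros and LOADSCALE are copied from the patch; node_avg_load(), may_steal() and the explicit steal_state struct are my own stand-ins for logic that lives inside find_busiest_queue() and the runqueue (wait_node/wait_time):

```c
#include <assert.h>

#define LOADSCALE 128   /* fixed-point scale for per-node average loads */

/* Copied from 01-numa-sched-core: CPU and node imbalance checks. */
#define CPUS_BALANCED(m,t)  (((m) <= 1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
#define NODES_BALANCED(comp,this)  (((comp)-(this)) < LOADSCALE/2)

/* Node load in fixed point: total runqueue length scaled by LOADSCALE
 * and divided by the number of CPUs in the node. */
static int node_avg_load(int total_load, int ncpus)
{
    return total_load * LOADSCALE / ncpus;
}

/* Sketch of the steal delay: only steal from a remote node once it has
 * stayed the busiest for `delay` jiffies (NODE_DELAY_BUSY or _IDLE). */
struct steal_state { int wait_node; unsigned long wait_time; };

static int may_steal(struct steal_state *s, int busiest_node,
                     unsigned long now, unsigned long delay)
{
    if (s->wait_node != busiest_node) {   /* new busiest node: just mark it */
        s->wait_node = busiest_node;
        s->wait_time = now;
        return 0;
    }
    return now - s->wait_time >= delay;   /* same node: waited long enough? */
}

/* Three calls with a 20-jiffy delay: mark, too early, then allowed. */
static int steal_demo(void)
{
    struct steal_state s = { -1, 0 };
    int allowed = 0;

    allowed += may_steal(&s, 2, 100, 20);
    allowed += may_steal(&s, 2, 110, 20);
    allowed += may_steal(&s, 2, 125, 20);
    return allowed;
}
```

Note the asymmetry CPUS_BALANCED encodes: a runqueue of length 2 next to an idle CPU counts as imbalanced, while 2 vs 1 does not, which is exactly the fluctuation-damping the comment in the patch describes.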
* [PATCH 2.5.52] NUMA scheduler (2/2)
2002-12-18 16:21 ` [PATCH 2.5.52] " Erich Focht
@ 2002-12-18 16:23 ` Erich Focht
2002-12-20 14:49 ` [PATCH 2.5.52] NUMA scheduler: cputimes stats Erich Focht
` (2 subsequent siblings)
3 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-18 16:23 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 578 bytes --]
And here comes patch #2, work done by Michael Hohnbaum.
On Wednesday 18 December 2002 17:21, Erich Focht wrote:
> This is the first part of the minimal NUMA scheduler built on top of
> the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
> bugfix in build_node_data().
>
> The two patches are:
>
> 01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer.
>
> 02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
> least loaded node & CPU at exec().
>
> Regards,
> Erich
[-- Attachment #2: 02-numa-sched-ilb-2.5.52-21.patch --]
[-- Type: text/x-diff, Size: 4748 bytes --]
diff -urN a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c 2002-12-16 03:07:53.000000000 +0100
+++ b/fs/exec.c 2002-12-18 16:54:40.000000000 +0100
@@ -1024,6 +1024,8 @@
int retval;
int i;
+ sched_balance_exec();
+
file = open_exec(filename);
retval = PTR_ERR(file);
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2002-12-18 16:53:13.000000000 +0100
+++ b/include/linux/sched.h 2002-12-18 16:54:40.000000000 +0100
@@ -160,7 +160,19 @@
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
-
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--
+#else
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
+#endif
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -urN a/init/main.c b/init/main.c
--- a/init/main.c 2002-12-16 03:07:48.000000000 +0100
+++ b/init/main.c 2002-12-18 16:54:40.000000000 +0100
@@ -473,6 +473,7 @@
migration_init();
#endif
+ node_nr_running_init();
spawn_ksoftirqd();
}
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2002-12-18 16:53:13.000000000 +0100
+++ b/kernel/sched.c 2002-12-18 16:54:40.000000000 +0100
@@ -153,6 +153,7 @@
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
+ atomic_t * node_ptr;
task_t *migration_thread;
struct list_head migration_queue;
@@ -347,7 +348,7 @@
p->prio = effective_prio(p);
}
enqueue_task(p, array);
- rq->nr_running++;
+ nr_running_inc(rq);
}
/*
@@ -355,7 +356,7 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -842,9 +843,9 @@
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2284,6 +2285,83 @@
#endif
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+ }
+ return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu. Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, load, best_cpu, node = 0;
+
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ minload = 10000000;
+ for (i = 0; i < numnodes; i++) {
+ load = atomic_read(&node_nr_running[i]);
+ if (load < minload) {
+ minload = load;
+ node = i;
+ }
+ }
+ minload = 10000000;
+ loop_over_node(i,node) {
+ if (!cpu_online(i))
+ continue;
+ if (cpu_rq(i)->nr_running < minload) {
+ best_cpu = i;
+ minload = cpu_rq(i)->nr_running;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+#endif /* CONFIG_NUMA */
+
#if CONFIG_SMP || CONFIG_PREEMPT
/*
* The 'big kernel lock'
@@ -2345,6 +2423,10 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+ rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
for (j = 0; j < 2; j++) {
array = rq->arrays + j;
* [PATCH 2.5.52] NUMA scheduler: cputimes stats
2002-12-18 16:21 ` [PATCH 2.5.52] " Erich Focht
2002-12-18 16:23 ` [PATCH 2.5.52] NUMA scheduler (2/2) Erich Focht
@ 2002-12-20 14:49 ` Erich Focht
2002-12-20 15:17 ` [PATCH 2.5.52] NUMA scheduler (1/2) Christoph Hellwig
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
3 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-20 14:49 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 807 bytes --]
The attached patch is needed for evaluating and understanding
scheduler and load balancer activity. It adds back to the kernel per-CPU
user and system time statistics for each process in /proc/PID/cpu,
a feature which had been around for ages and was lost in 2.5.50.
Regards,
Erich
On Wednesday 18 December 2002 17:21, Erich Focht wrote:
> This is the first part of the minimal NUMA scheduler built on top of
> the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
> bugfix in build_node_data().
>
> The two patches are:
>
> 01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer.
>
> 02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
> least loaded node & CPU at exec().
>
> Regards,
> Erich
[-- Attachment #2: 03-cputimes_stat-2.5.52.patch --]
[-- Type: text/x-diff, Size: 3316 bytes --]
diff -urN a/fs/proc/array.c b/fs/proc/array.c
--- a/fs/proc/array.c 2002-12-16 03:08:11.000000000 +0100
+++ b/fs/proc/array.c 2002-12-20 15:40:41.000000000 +0100
@@ -597,3 +597,26 @@
out:
return retval;
}
+
+#ifdef CONFIG_SMP
+int proc_pid_cpu(struct task_struct *task, char * buffer)
+{
+ int i, len;
+
+ len = sprintf(buffer,
+ "cpu %lu %lu\n",
+ jiffies_to_clock_t(task->utime),
+ jiffies_to_clock_t(task->stime));
+
+ for (i = 0 ; i < NR_CPUS; i++) {
+ if (cpu_online(i))
+ len += sprintf(buffer + len, "cpu%d %lu %lu\n",
+ i,
+ jiffies_to_clock_t(task->per_cpu_utime[i]),
+ jiffies_to_clock_t(task->per_cpu_stime[i]));
+
+ }
+ len += sprintf(buffer + len, "current_cpu %d\n",task_cpu(task));
+ return len;
+}
+#endif
diff -urN a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c 2002-12-16 03:08:11.000000000 +0100
+++ b/fs/proc/base.c 2002-12-20 15:40:42.000000000 +0100
@@ -55,6 +55,7 @@
PROC_PID_STAT,
PROC_PID_STATM,
PROC_PID_MAPS,
+ PROC_PID_CPU,
PROC_PID_MOUNTS,
PROC_PID_WCHAN,
PROC_PID_FD_DIR = 0x8000, /* 0x8000-0xffff */
@@ -75,6 +76,9 @@
E(PROC_PID_CMDLINE, "cmdline", S_IFREG|S_IRUGO),
E(PROC_PID_STAT, "stat", S_IFREG|S_IRUGO),
E(PROC_PID_STATM, "statm", S_IFREG|S_IRUGO),
+#ifdef CONFIG_SMP
+ E(PROC_PID_CPU, "cpu", S_IFREG|S_IRUGO),
+#endif
E(PROC_PID_MAPS, "maps", S_IFREG|S_IRUGO),
E(PROC_PID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
E(PROC_PID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
@@ -1026,7 +1030,12 @@
case PROC_PID_MAPS:
inode->i_fop = &proc_maps_operations;
break;
-
+#ifdef CONFIG_SMP
+ case PROC_PID_CPU:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_pid_cpu;
+ break;
+#endif
case PROC_PID_MEM:
inode->i_op = &proc_mem_inode_operations;
inode->i_fop = &proc_mem_operations;
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2002-12-18 16:54:40.000000000 +0100
+++ b/include/linux/sched.h 2002-12-20 15:40:42.000000000 +0100
@@ -354,6 +354,9 @@
struct timer_list real_timer;
unsigned long utime, stime, cutime, cstime;
unsigned long start_time;
+#ifdef CONFIG_SMP
+ long per_cpu_utime[NR_CPUS], per_cpu_stime[NR_CPUS];
+#endif
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
int swappable:1;
diff -urN a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c 2002-12-16 03:07:44.000000000 +0100
+++ b/kernel/fork.c 2002-12-20 15:40:42.000000000 +0100
@@ -795,6 +795,14 @@
p->tty_old_pgrp = 0;
p->utime = p->stime = 0;
p->cutime = p->cstime = 0;
+#ifdef CONFIG_SMP
+ {
+ int i;
+
+ for(i = 0; i < NR_CPUS; i++)
+ p->per_cpu_utime[i] = p->per_cpu_stime[i] = 0;
+ }
+#endif
p->array = NULL;
p->lock_depth = -1; /* -1 = no lock */
p->start_time = jiffies;
diff -urN a/kernel/timer.c b/kernel/timer.c
--- a/kernel/timer.c 2002-12-16 03:07:52.000000000 +0100
+++ b/kernel/timer.c 2002-12-20 15:40:42.000000000 +0100
@@ -695,6 +695,10 @@
void update_one_process(struct task_struct *p, unsigned long user,
unsigned long system, int cpu)
{
+#ifdef CONFIG_SMP
+ p->per_cpu_utime[cpu] += user;
+ p->per_cpu_stime[cpu] += system;
+#endif
do_process_times(p, user, system);
do_it_virt(p, user);
do_it_prof(p);
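The accounting this patch restores is small enough to model in userspace. The struct and helper below are illustrative stand-ins for the task_struct fields and the update_one_process() hunk above (NCPU replaces NR_CPUS, and the do_process_times() call is folded into plain additions):

```c
#include <assert.h>
#include <string.h>

#define NCPU 4

/* Minimal stand-in for the task fields the patch adds back. */
struct task {
    long utime, stime;           /* existing totals                  */
    long per_cpu_utime[NCPU];    /* reintroduced per-CPU user time   */
    long per_cpu_stime[NCPU];    /* reintroduced per-CPU system time */
};

/* Mirrors update_one_process(): charge a tick to totals and per-CPU. */
static void account(struct task *p, long user, long sys, int cpu)
{
    p->per_cpu_utime[cpu] += user;
    p->per_cpu_stime[cpu] += sys;
    p->utime += user;
    p->stime += sys;
}

/* Charge ticks on two CPUs and fold the totals into one checkable value:
 * utime * 100 + stime. */
static long demo_times(void)
{
    struct task t;

    memset(&t, 0, sizeof t);
    account(&t, 3, 1, 0);
    account(&t, 2, 2, 1);
    return t.utime * 100 + t.stime;
}
```

proc_pid_cpu() then just pretty-prints these arrays, one "cpuN user system" line per online CPU, which is what lets the NUMA tests attribute run time to nodes.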
* Re: [PATCH 2.5.52] NUMA scheduler (1/2)
2002-12-18 16:21 ` [PATCH 2.5.52] " Erich Focht
2002-12-18 16:23 ` [PATCH 2.5.52] NUMA scheduler (2/2) Erich Focht
2002-12-20 14:49 ` [PATCH 2.5.52] NUMA scheduler: cputimes stats Erich Focht
@ 2002-12-20 15:17 ` Christoph Hellwig
2002-12-20 17:44 ` Erich Focht
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
3 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2002-12-20 15:17 UTC (permalink / raw
To: Erich Focht
Cc: Martin J. Bligh, Michael Hohnbaum, Robert Love, Ingo Molnar,
Stephen Hemminger, linux-kernel
diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c 2002-12-16 03:07:56.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c 2002-12-18 16:53:12.000000000 +0100
@@ -1202,6 +1202,9 @@
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif
I think it would be much nicer if you had a proper stub for !CONFIG_NUMA..
@@ -20,6 +20,7 @@
#include <asm/mmu.h>
#include <linux/smp.h>
+#include <asm/topology.h>
Another header in sched.h?? And it doesn't look like it's used at all.
#include <linux/sem.h>
#include <linux/signal.h>
#include <linux/securebits.h>
@@ -446,6 +447,9 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif
The ifdef is superfluous.
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2002-12-16 03:08:14.000000000 +0100
+++ b/kernel/sched.c 2002-12-18 16:53:13.000000000 +0100
@@ -158,6 +158,10 @@
struct list_head migration_queue;
atomic_t nr_iowait;
+
+ unsigned long wait_time;
+ int wait_node;
+
Here OTOH a #if CONFIG_NUMA could help to avoid a little bit of bloat.
} ____cacheline_aligned;
+#define cpu_to_node(cpu) __cpu_to_node(cpu)
I wonder why we don't have a proper cpu_to_node() yet, but as long
as it doesn't exist please use __cpu_to_node() directly.
+#define LOADSCALE 128
Any explanation?
+#ifdef CONFIG_NUMA
sched.c uses #if CONFIG_FOO, not #ifdef CONFIG_FOO; it would be cool
if you could follow the style of existing source files.
+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node) _node_nr_cpus[node]
Parametrized macros and variables aren't in the same namespace, so what about
just node_nr_cpus for the macro, too? And should these be static?
+
+#define NODE_DELAY_IDLE (1*HZ/1000)
+#define NODE_DELAY_BUSY (20*HZ/1000)
Comments, please..
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+ for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
Move to asm/topology.h?
+ ptr=0;
+ for (n=0; n<numnodes; n++) {
You need to add lots of spaces to match Documentation/CodingStyle.. :)
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ if (!cpu_online(cpu)) continue;
And linebreaks..
Btw, what about a for_each_cpu that has the cpu_online check in topology.h?
+ /* Wait longer before stealing if own node's load is average. */
+ if (NODES_BALANCED(avg_load,nd[this_node].average_load))
Shouldn't NODES_BALANCED shout less and be an inline called nodes_balanced?
+ this_rq->wait_node = busiest_node;
+ this_rq->wait_time = jiffies;
+ goto out;
+ } else
+ /* old most loaded node: check if waited enough */
+ if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;
That indentation looks really strange, why not just
/* old most loaded node: check if waited enough */
} else if (jiffies - this_rq->wait_time < steal_delay)
goto out;
+
+ if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
Ditto, the name shouts a bit too much
+#endif
#endif /* CONFIG_NUMA */
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
Any explanation why you changed that constant?
p = req->task;
- cpu_dest = __ffs(p->cpus_allowed);
+ cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
This looks like a bugfix valid without the rest of the patch.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2.5.52] NUMA scheduler (1/2)
2002-12-20 15:17 ` [PATCH 2.5.52] NUMA scheduler (1/2) Christoph Hellwig
@ 2002-12-20 17:44 ` Erich Focht
0 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-20 17:44 UTC (permalink / raw
To: Christoph Hellwig
Cc: Martin J. Bligh, Michael Hohnbaum, Robert Love, Ingo Molnar,
Stephen Hemminger, linux-kernel
Hi Christoph,
thanks for the comments, I'll try to fix the things you mentioned when
preparing the next patch and add some more comments.
Regards,
Erich
On Friday 20 December 2002 16:17, Christoph Hellwig wrote:
> diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
> --- a/arch/i386/kernel/smpboot.c 2002-12-16 03:07:56.000000000 +0100
> +++ b/arch/i386/kernel/smpboot.c 2002-12-18 16:53:12.000000000 +0100
> @@ -1202,6 +1202,9 @@
> void __init smp_cpus_done(unsigned int max_cpus)
> {
> zap_low_mappings();
> +#ifdef CONFIG_NUMA
> + build_node_data();
> +#endif
>
> I think it would be much nicer if you had a proper stub for !CONFIG_NUMA..
>
> @@ -20,6 +20,7 @@
> #include <asm/mmu.h>
>
> #include <linux/smp.h>
> +#include <asm/topology.h>
>
> Another header in sched.h?? And it doesn't look like it's used at all.
>
> #include <linux/sem.h>
> #include <linux/signal.h>
> #include <linux/securebits.h>
> @@ -446,6 +447,9 @@
> # define set_cpus_allowed(p, new_mask) do { } while (0)
> #endif
>
> +#ifdef CONFIG_NUMA
> +extern void build_node_data(void);
> +#endif
>
> The ifdef is superfluous.
>
> diff -urN a/kernel/sched.c b/kernel/sched.c
> --- a/kernel/sched.c 2002-12-16 03:08:14.000000000 +0100
> +++ b/kernel/sched.c 2002-12-18 16:53:13.000000000 +0100
> @@ -158,6 +158,10 @@
> struct list_head migration_queue;
>
> atomic_t nr_iowait;
> +
> + unsigned long wait_time;
> + int wait_node;
> +
>
> Here OTOH a #if CONFIG_NUMA could help to avoid a little bit of bloat.
>
> } ____cacheline_aligned;
>
> +#define cpu_to_node(cpu) __cpu_to_node(cpu)
>
> I wonder why we don't have a proper cpu_to_node() yet, but as long
> as it doesn't exist please use __cpu_to_node() directly.
>
> +#define LOADSCALE 128
>
> Any explanation?
>
> +#ifdef CONFIG_NUMA
>
> sched.c uses #if CONFIG_FOO, not #ifdef CONFIG_FOO; it would be cool
> if you could follow the style of existing source files.
>
> +/* Number of CPUs per node: sane values until all CPUs are up */
> +int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
> +int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
> +#define node_ncpus(node) _node_nr_cpus[node]
>
> Parametrized macros and variables aren't in the same namespace, so what about
> just node_nr_cpus for the macro, too? And should these be static?
>
> +
> +#define NODE_DELAY_IDLE (1*HZ/1000)
> +#define NODE_DELAY_BUSY (20*HZ/1000)
>
> Comments, please..
>
> +/* the following macro implies that logical CPUs are sorted by node number */
> +#define loop_over_node(cpu,node) \
> + for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
>
> Move to asm/topology.h?
>
> + ptr=0;
> + for (n=0; n<numnodes; n++) {
>
> You need to add lots of spaces to match Documentation/CodingStyle.. :)
>
> + for (cpu = 0; cpu < NR_CPUS; cpu++) {
> + if (!cpu_online(cpu)) continue;
>
> And linebreaks..
>
> Btw, what about a for_each_cpu that has the cpu_online check in topology.h?
>
> + /* Wait longer before stealing if own node's load is average. */
> + if (NODES_BALANCED(avg_load,nd[this_node].average_load))
>
> Shouldn't NODES_BALANCED shout less and be an inline called nodes_balanced?
>
> + this_rq->wait_node = busiest_node;
> + this_rq->wait_time = jiffies;
> + goto out;
> + } else
> + /* old most loaded node: check if waited enough */
> + if (jiffies - this_rq->wait_time < steal_delay)
> + goto out;
>
> That indentation looks really strange, why not just
>
> /* old most loaded node: check if waited enough */
> } else if (jiffies - this_rq->wait_time < steal_delay)
> goto out;
>
> +
> + if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
>
> Ditto, the name shouts a bit too much
>
> +#endif
>
> #endif /* CONFIG_NUMA */
>
> -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
> +#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
>
> Any explanation why you changed that constant?
>
> p = req->task;
> - cpu_dest = __ffs(p->cpus_allowed);
> + cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
> +
>
> This looks like a bugfix valid without the rest of the patch.
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2.5.53] NUMA scheduler (1/3)
2002-12-18 16:21 ` [PATCH 2.5.52] " Erich Focht
` (2 preceding siblings ...)
2002-12-20 15:17 ` [PATCH 2.5.52] NUMA scheduler (1/2) Christoph Hellwig
@ 2002-12-31 13:29 ` Erich Focht
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (2/3) Erich Focht
` (2 more replies)
3 siblings, 3 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-31 13:29 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 801 bytes --]
Here comes the minimal NUMA scheduler built on top of the O(1) load
balancer rediffed for 2.5.53 with some changes in the core part. As
suggested by Michael, I added the cputimes_stat patch, as it is
absolutely needed for understanding the scheduler behavior.
The three patches:
01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
providing a node aware load_balancer. Cosmetic changes + more comments.
02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
least loaded node & CPU at exec().
03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
and system time statistics for each process in /proc/PID/cpu. Needed
for evaluating scheduler behavior and performance of tasks running
on SMP and NUMA systems.
Regards,
Erich
[-- Attachment #2: 01-numa-sched-core-2.5.53-24.patch --]
[-- Type: text/x-diff, Size: 12460 bytes --]
diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c 2002-12-24 06:20:16.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c 2002-12-31 13:02:47.000000000 +0100
@@ -1191,6 +1191,7 @@
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
+ build_node_data();
}
void __init smp_intr_init()
diff -urN a/arch/ia64/kernel/smpboot.c b/arch/ia64/kernel/smpboot.c
--- a/arch/ia64/kernel/smpboot.c 2002-12-24 06:19:49.000000000 +0100
+++ b/arch/ia64/kernel/smpboot.c 2002-12-31 13:03:02.000000000 +0100
@@ -397,7 +397,7 @@
static void
smp_tune_scheduling (void)
{
- cache_decay_ticks = 10; /* XXX base this on PAL info and cache-bandwidth estimate */
+ cache_decay_ticks = 8; /* XXX base this on PAL info and cache-bandwidth estimate */
printk("task migration cache decay timeout: %ld msecs.\n",
(cache_decay_ticks + 1) * 1000 / HZ);
@@ -522,6 +522,7 @@
printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
+ build_node_data();
}
int __devinit
diff -urN a/arch/ppc64/kernel/smp.c b/arch/ppc64/kernel/smp.c
--- a/arch/ppc64/kernel/smp.c 2002-12-24 06:21:01.000000000 +0100
+++ b/arch/ppc64/kernel/smp.c 2002-12-31 13:03:21.000000000 +0100
@@ -679,4 +679,5 @@
/* XXX fix this, xics currently relies on it - Anton */
smp_threads_ready = 1;
+ build_node_data();
}
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2002-12-24 06:19:35.000000000 +0100
+++ b/include/linux/sched.h 2002-12-31 13:02:18.000000000 +0100
@@ -445,6 +445,12 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif
+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#else
+#define build_node_data() {}
+#endif
+
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2002-12-24 06:21:05.000000000 +0100
+++ b/kernel/sched.c 2002-12-31 13:46:03.000000000 +0100
@@ -158,6 +158,10 @@
struct list_head migration_queue;
atomic_t nr_iowait;
+#ifdef CONFIG_NUMA
+ unsigned long wait_time;
+ int wait_node;
+#endif
} ____cacheline_aligned;
static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -178,6 +182,64 @@
#endif
/*
+ * Node loads are scaled with LOADSCALE. This way:
+ * - we avoid zeros in integer divisions (dividing by node CPU number),
+ * - loads of nodes with different numbers of CPUs are comparable.
+ */
+#define LOADSCALE 128
+
+#ifdef CONFIG_NUMA
+/* Number of CPUs per node: sane values until all CPUs are up */
+static int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+static int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_nr_cpus(node) _node_nr_cpus[node]
+
+/*
+ * Delay stealing from remote node when own queue is idle/busy. These delays
+ * tend to equalize the node loads. NODE_DELAY_IDLE is nonzero because we
+ * want to give idle CPUs in the busiest node a chance to steal first.
+ */
+#define NODE_DELAY_IDLE (1*HZ/1000)
+#define NODE_DELAY_BUSY (20*HZ/1000)
+
+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+ for(cpu = node_ptr[node]; cpu < node_ptr[node+1]; cpu++)
+
+/*
+ * Build node_ptr and node_nr_cpus data after all CPUs have come up. This
+ * expects that the logical CPUs are sorted by their node numbers! Check
+ * out the NUMA API for details on why this should be that way. [EF]
+ */
+void build_node_data(void)
+{
+ int n, cpu, ptr;
+ unsigned long mask;
+
+ ptr=0;
+ for (n = 0; n < numnodes; n++) {
+ mask = __node_to_cpu_mask(n) & cpu_online_map;
+ node_ptr[n] = ptr;
+ for (cpu = 0; cpu < NR_CPUS; cpu++)
+ if (mask & (1UL << cpu))
+ ptr++;
+ node_nr_cpus(n) = ptr - node_ptr[n];
+ }
+ node_ptr[numnodes] = ptr;
+ printk("CPU nodes : %d\n",numnodes);
+ for (n = 0; n < numnodes; n++)
+ printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
+}
+
+#else
+#define node_nr_cpus(node) num_online_cpus()
+#define NODE_DELAY_IDLE 0
+#define NODE_DELAY_BUSY 0
+#define loop_over_node(cpu,n) for(cpu = 0; cpu < NR_CPUS; cpu++)
+#endif
+
+
+/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
* explicitly disabling preemption.
@@ -652,81 +714,134 @@
}
/*
- * find_busiest_queue - find the busiest runqueue.
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their node.
+ *
+ * 1. First try to find a runqueue within the own node with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU.
+ *
+ * This routine implements node balancing by delaying steals from remote
+ * nodes more if the own node is (within margins) averagely loaded. The
+ * most loaded node is remembered as well as the time (jiffies). In the
+ * following calls to the load_balancer the time is compared with
+ * NODE_DELAY_BUSY (if load is around the average) or NODE_DELAY_IDLE (if own
+ * node is unloaded) if the most loaded node didn't change. This gives less
+ * loaded nodes the chance to approach the average load but doesn't exclude
+ * busy nodes from stealing (just in case the cpus_allowed mask isn't good
+ * for the idle nodes).
+ * This concept can be extended easily to more than two levels (multi-level
+ * scheduler), e.g.: CPU -> node -> supernode... by implementing node-distance
+ * dependent steal delays.
+ *
+ * <efocht@ess.nec.de>
+ */
+
+struct node_queue_data {
+ int total_load;
+ int average_load;
+ int busiest_cpu;
+ int busiest_cpu_load;
+};
+
+/*
+ * Check if CPUs are balanced. The check is more involved than the O(1) original
+ * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
+ */
+#define cpus_balanced(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+/*
+ * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
+ * node) when its load is not smaller than "comp"'s load - LOADSCALE/2.
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+#define nodes_balanced(comp,this) (((comp)-(this)) < LOADSCALE/2)
+
+static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
{
- int nr_running, load, max_load, i;
- runqueue_t *busiest, *rq_src;
+ runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
+ int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
+ int n, steal_delay, system_load = 0, this_node=__cpu_to_node(this_cpu);
+ struct node_queue_data nd[MAX_NUMNODES], *node;
- /*
- * We search all runqueues to find the most busy one.
- * We do this lockless to reduce cache-bouncing overhead,
- * we re-check the 'best' source CPU later on again, with
- * the lock held.
- *
- * We fend off statistical fluctuations in runqueue lengths by
- * saving the runqueue length during the previous load-balancing
- * operation and using the smaller one the current and saved lengths.
- * If a runqueue is long enough for a longer amount of time then
- * we recognize it and pull tasks from it.
- *
- * The 'current runqueue length' is a statistical maximum variable,
- * for that one we take the longer one - to avoid fluctuations in
- * the other direction. So for a load-balance to happen it needs
- * stable long runqueue on the target CPU and stable short runqueue
- * on the local runqueue.
- *
- * We make an exception if this CPU is about to become idle - in
- * that case we are less picky about moving a task across CPUs and
- * take what can be taken.
- */
if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
- nr_running = this_rq->nr_running;
+ *nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ *nr_running = this_rq->prev_nr_running[this_cpu];
- busiest = NULL;
- max_load = 1;
- for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
- continue;
+ for (n = 0; n < numnodes; n++) {
+ nd[n].busiest_cpu_load = -1;
+ nd[n].total_load = 0;
+ }
- rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
- load = rq_src->nr_running;
+ /* compute all node loads and save their max cpu loads */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ if (!cpu_online(cpu))
+ continue;
+ node = &nd[__cpu_to_node(cpu)];
+ src_rq = cpu_rq(cpu);
+ if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
+ load = src_rq->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
-
- if ((load > max_load) && (rq_src != this_rq)) {
- busiest = rq_src;
- max_load = load;
+ load = this_rq->prev_nr_running[cpu];
+ this_rq->prev_nr_running[cpu] = src_rq->nr_running;
+ node->total_load += load;
+ if (load > node->busiest_cpu_load) {
+ node->busiest_cpu_load = load;
+ node->busiest_cpu = cpu;
}
}
- if (likely(!busiest))
- goto out;
+ busiest_cpu = nd[this_node].busiest_cpu;
+ if (busiest_cpu != this_cpu) {
+ if (!cpus_balanced(nd[this_node].busiest_cpu_load,*nr_running)) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
+ goto out;
+ }
+ }
- *imbalance = (max_load - nr_running) / 2;
+#ifdef CONFIG_NUMA
+ max_avg_load = -1;
+ for (n = 0; n < numnodes; n++) {
+ node = &nd[n];
+ system_load += node->total_load;
+ node->average_load = node->total_load*LOADSCALE/node_nr_cpus(n);
+ if (node->average_load > max_avg_load) {
+ max_avg_load = node->average_load;
+ busiest_node = n;
+ }
+ }
- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && (*imbalance < (max_load + 3)/4)) {
- busiest = NULL;
+ /* Exit if not enough imbalance on any remote node. */
+ if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
+ nodes_balanced(max_avg_load,nd[this_node].average_load)) {
+ this_rq->wait_node = -1;
goto out;
}
- nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
- /*
- * Make sure nothing changed since we checked the
- * runqueue length.
- */
- if (busiest->nr_running <= nr_running + 1) {
- spin_unlock(&busiest->lock);
- busiest = NULL;
+ busiest_cpu = nd[busiest_node].busiest_cpu;
+ avg_load = system_load*LOADSCALE/num_online_cpus();
+ /* Wait longer before stealing if own node's load is average. */
+ if (nodes_balanced(avg_load,nd[this_node].average_load))
+ steal_delay = NODE_DELAY_BUSY;
+ else
+ steal_delay = NODE_DELAY_IDLE;
+ /* if we have a new most loaded node: just mark it */
+ if (this_rq->wait_node != busiest_node) {
+ this_rq->wait_node = busiest_node;
+ this_rq->wait_time = jiffies;
+ goto out;
+ /* old most loaded node: check if waited enough */
+ } else if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;
+
+ if ((!cpus_balanced(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+ busiest_rq = cpu_rq(busiest_cpu);
+ this_rq->wait_node = -1;
}
-out:
- return busiest;
+#endif
+ out:
+ return busiest_rq;
}
/*
@@ -758,16 +873,21 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, nr_running, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ busiest = find_busiest_queue(this_cpu, idle, &nr_running);
if (!busiest)
goto out;
+ imbalance = (busiest->nr_running - nr_running)/2;
+
+ nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+ if (busiest->nr_running <= nr_running + 1)
+ goto out_unlock;
/*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
@@ -2110,7 +2230,8 @@
spin_unlock_irqrestore(&rq->lock, flags);
p = req->task;
- cpu_dest = __ffs(p->cpus_allowed);
+ cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+
rq_dest = cpu_rq(cpu_dest);
repeat:
cpu_src = task_cpu(p);
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2.5.53] NUMA scheduler (2/3)
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
@ 2002-12-31 13:29 ` Erich Focht
2002-12-31 13:30 ` [PATCH 2.5.53] NUMA scheduler (3/3) Erich Focht
2003-01-04 1:58 ` [PATCH 2.5.53] NUMA scheduler (1/3) Michael Hohnbaum
2 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-31 13:29 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 918 bytes --]
... the second patch ...
On Tuesday 31 December 2002 14:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.
>
> The three patches:
> 01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer. Cosmetic changes + more comments.
>
> 02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
> least loaded node & CPU at exec().
>
> 03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
> and system time statistics for each process in /proc/PID/cpu. Needed
> for evaluating scheduler behavior and performance of tasks running
> on SMP and NUMA systems.
>
> Regards,
> Erich
[-- Attachment #2: 02-numa-sched-ilb-2.5.53-21.patch --]
[-- Type: text/x-diff, Size: 4878 bytes --]
diff -urN a/fs/exec.c b/fs/exec.c
--- a/fs/exec.c 2002-12-24 06:20:03.000000000 +0100
+++ b/fs/exec.c 2002-12-31 14:15:45.000000000 +0100
@@ -1024,6 +1024,8 @@
int retval;
int i;
+ sched_balance_exec();
+
file = open_exec(filename);
retval = PTR_ERR(file);
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2002-12-31 13:02:18.000000000 +0100
+++ b/include/linux/sched.h 2002-12-31 14:17:13.000000000 +0100
@@ -160,7 +160,6 @@
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
-
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
asmlinkage void schedule(void);
@@ -447,8 +446,18 @@
#ifdef CONFIG_NUMA
extern void build_node_data(void);
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--
#else
#define build_node_data() {}
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
#endif
extern void set_user_nice(task_t *p, long nice);
diff -urN a/init/main.c b/init/main.c
--- a/init/main.c 2002-12-24 06:19:46.000000000 +0100
+++ b/init/main.c 2002-12-31 14:15:45.000000000 +0100
@@ -489,6 +489,7 @@
migration_init();
#endif
+ node_nr_running_init();
spawn_ksoftirqd();
}
diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2002-12-31 13:46:03.000000000 +0100
+++ b/kernel/sched.c 2002-12-31 14:15:45.000000000 +0100
@@ -153,6 +153,7 @@
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
+ atomic_t * node_ptr;
task_t *migration_thread;
struct list_head migration_queue;
@@ -356,7 +357,7 @@
p->prio = effective_prio(p);
}
enqueue_task(p, array);
- rq->nr_running++;
+ nr_running_inc(rq);
}
/*
@@ -364,7 +365,7 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -851,9 +852,9 @@
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -2293,6 +2294,83 @@
#endif
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+ }
+ return;
+}
+
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu. Then
+ * the cpu_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, load, best_cpu, node = 0;
+
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ minload = 10000000;
+ for (i = 0; i < numnodes; i++) {
+ load = atomic_read(&node_nr_running[i]);
+ if (load < minload) {
+ minload = load;
+ node = i;
+ }
+ }
+ minload = 10000000;
+ loop_over_node(i,node) {
+ if (!cpu_online(i))
+ continue;
+ if (cpu_rq(i)->nr_running < minload) {
+ best_cpu = i;
+ minload = cpu_rq(i)->nr_running;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+#endif /* CONFIG_NUMA */
+
#if CONFIG_SMP || CONFIG_PREEMPT
/*
* The 'big kernel lock'
@@ -2354,6 +2432,10 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
+#if CONFIG_NUMA
+ rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
+
for (j = 0; j < 2; j++) {
array = rq->arrays + j;
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 2.5.53] NUMA scheduler (3/3)
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (2/3) Erich Focht
@ 2002-12-31 13:30 ` Erich Focht
2003-01-04 1:58 ` [PATCH 2.5.53] NUMA scheduler (1/3) Michael Hohnbaum
2 siblings, 0 replies; 29+ messages in thread
From: Erich Focht @ 2002-12-31 13:30 UTC (permalink / raw
To: Martin J. Bligh, Michael Hohnbaum
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 921 bytes --]
... and the third patch ...
On Tuesday 31 December 2002 14:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.
>
> The three patches:
> 01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer. Cosmetic changes + more comments.
>
> 02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
> least loaded node & CPU at exec().
>
> 03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
> and system time statistics for each process in /proc/PID/cpu. Needed
> for evaluating scheduler behavior and performance of tasks running
> on SMP and NUMA systems.
>
> Regards,
> Erich
[-- Attachment #2: 03-cputimes_stat-2.5.53.patch --]
[-- Type: text/x-diff, Size: 3316 bytes --]
diff -urN a/fs/proc/array.c b/fs/proc/array.c
--- a/fs/proc/array.c 2002-12-24 06:20:36.000000000 +0100
+++ b/fs/proc/array.c 2002-12-31 14:19:15.000000000 +0100
@@ -597,3 +597,26 @@
out:
return retval;
}
+
+#ifdef CONFIG_SMP
+int proc_pid_cpu(struct task_struct *task, char * buffer)
+{
+ int i, len;
+
+ len = sprintf(buffer,
+ "cpu %lu %lu\n",
+ jiffies_to_clock_t(task->utime),
+ jiffies_to_clock_t(task->stime));
+
+ for (i = 0 ; i < NR_CPUS; i++) {
+ if (cpu_online(i))
+ len += sprintf(buffer + len, "cpu%d %lu %lu\n",
+ i,
+ jiffies_to_clock_t(task->per_cpu_utime[i]),
+ jiffies_to_clock_t(task->per_cpu_stime[i]));
+
+ }
+ len += sprintf(buffer + len, "current_cpu %d\n",task_cpu(task));
+ return len;
+}
+#endif
diff -urN a/fs/proc/base.c b/fs/proc/base.c
--- a/fs/proc/base.c 2002-12-24 06:20:39.000000000 +0100
+++ b/fs/proc/base.c 2002-12-31 14:19:15.000000000 +0100
@@ -55,6 +55,7 @@
PROC_PID_STAT,
PROC_PID_STATM,
PROC_PID_MAPS,
+ PROC_PID_CPU,
PROC_PID_MOUNTS,
PROC_PID_WCHAN,
PROC_PID_FD_DIR = 0x8000, /* 0x8000-0xffff */
@@ -75,6 +76,9 @@
E(PROC_PID_CMDLINE, "cmdline", S_IFREG|S_IRUGO),
E(PROC_PID_STAT, "stat", S_IFREG|S_IRUGO),
E(PROC_PID_STATM, "statm", S_IFREG|S_IRUGO),
+#ifdef CONFIG_SMP
+ E(PROC_PID_CPU, "cpu", S_IFREG|S_IRUGO),
+#endif
E(PROC_PID_MAPS, "maps", S_IFREG|S_IRUGO),
E(PROC_PID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
E(PROC_PID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
@@ -1026,7 +1030,12 @@
case PROC_PID_MAPS:
inode->i_fop = &proc_maps_operations;
break;
-
+#ifdef CONFIG_SMP
+ case PROC_PID_CPU:
+ inode->i_fop = &proc_info_file_operations;
+ ei->op.proc_read = proc_pid_cpu;
+ break;
+#endif
case PROC_PID_MEM:
inode->i_op = &proc_mem_inode_operations;
inode->i_fop = &proc_mem_operations;
diff -urN a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h 2002-12-31 14:17:13.000000000 +0100
+++ b/include/linux/sched.h 2002-12-31 14:19:15.000000000 +0100
@@ -340,6 +340,9 @@
struct timer_list real_timer;
unsigned long utime, stime, cutime, cstime;
unsigned long start_time;
+#ifdef CONFIG_SMP
+ long per_cpu_utime[NR_CPUS], per_cpu_stime[NR_CPUS];
+#endif
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
int swappable:1;
diff -urN a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c 2002-12-24 06:19:35.000000000 +0100
+++ b/kernel/fork.c 2002-12-31 14:19:15.000000000 +0100
@@ -795,6 +795,14 @@
p->tty_old_pgrp = 0;
p->utime = p->stime = 0;
p->cutime = p->cstime = 0;
+#ifdef CONFIG_SMP
+ {
+ int i;
+
+ for(i = 0; i < NR_CPUS; i++)
+ p->per_cpu_utime[i] = p->per_cpu_stime[i] = 0;
+ }
+#endif
p->array = NULL;
p->lock_depth = -1; /* -1 = no lock */
p->start_time = jiffies;
diff -urN a/kernel/timer.c b/kernel/timer.c
--- a/kernel/timer.c 2002-12-24 06:19:51.000000000 +0100
+++ b/kernel/timer.c 2002-12-31 14:19:15.000000000 +0100
@@ -695,6 +695,10 @@
void update_one_process(struct task_struct *p, unsigned long user,
unsigned long system, int cpu)
{
+#ifdef CONFIG_SMP
+ p->per_cpu_utime[cpu] += user;
+ p->per_cpu_stime[cpu] += system;
+#endif
do_process_times(p, user, system);
do_it_virt(p, user);
do_it_prof(p);
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (2/3) Erich Focht
2002-12-31 13:30 ` [PATCH 2.5.53] NUMA scheduler (3/3) Erich Focht
@ 2003-01-04 1:58 ` Michael Hohnbaum
2003-01-05 5:35 ` Martin J. Bligh
2 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2003-01-04 1:58 UTC (permalink / raw
To: Erich Focht
Cc: Martin J. Bligh, Robert Love, Ingo Molnar, Stephen Hemminger,
linux-kernel
On Tue, 2002-12-31 at 05:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.
Thanks for this latest patch. I've managed to cobble together
a 4 node NUMAQ system (16 700 MHz PIII procs, 16GB memory) and
ran kernbench and schedbench on this, along with the 2.5.50 and
2.5.52 versions. Results remain fairly consistent with what
we have been obtaining on earlier versions.
Kernbench:
Elapsed User System CPU
sched50 29.96s 288.308s 83.606s 1240.8%
sched52 29.836s 285.832s 84.464s 1240.4%
sched53 29.364s 284.808s 83.174s 1252.6%
stock50 31.074s 303.664s 89.194s 1264.2%
stock53 31.204s 306.224s 87.776s 1263.2%
Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
sched50 22.00 35.50 88.04 0.86
sched52 22.04 34.77 88.18 0.75
sched53 22.16 34.39 88.66 0.74
stock50 27.18 45.99 108.75 0.87
stock53 0.00 41.96 133.85 1.07
Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
sched50 32.05 46.81 256.45 1.81
sched52 32.75 50.33 262.07 1.63
sched53 30.56 46.58 244.59 1.67
stock50 44.75 63.68 358.09 2.55
stock53 0.00 59.64 318.63 2.40
Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
sched50 55.34 70.74 885.61 4.09
sched52 50.64 62.30 810.50 4.31
sched53 54.04 70.01 864.84 3.68
stock50 65.36 80.72 1045.94 5.92
stock53 0.00 78.01 947.06 6.70
Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
sched50 59.47 139.77 1903.37 10.24
sched52 58.37 132.90 1868.30 7.45
sched53 57.35 132.30 1835.49 9.29
stock50 84.51 194.89 2704.83 12.44
stock53 0.00 182.95 2612.83 12.46
Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
sched50 63.55 276.81 4067.65 21.58
sched52 66.75 293.31 4272.72 21.06
sched53 62.38 276.99 3992.97 19.90
stock50 99.68 422.06 6380.04 25.92
stock53 0.00 441.42 6723.01 26.83
Note that the 0.00 in the AvgUser column for stock53 was due to
me not applying the cputime patch (03--cputimes_stat-2.5.53.patch).
Not sure how I managed to forget that one. I'll rerun this on
a kernel with that patch on Monday.
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
2003-01-04 1:58 ` [PATCH 2.5.53] NUMA scheduler (1/3) Michael Hohnbaum
@ 2003-01-05 5:35 ` Martin J. Bligh
2003-01-06 3:58 ` Michael Hohnbaum
0 siblings, 1 reply; 29+ messages in thread
From: Martin J. Bligh @ 2003-01-05 5:35 UTC (permalink / raw
To: Michael Hohnbaum, Erich Focht
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
>> Here comes the minimal NUMA scheduler built on top of the O(1) load
>> balancer rediffed for 2.5.53 with some changes in the core part. As
>> suggested by Michael, I added the cputimes_stat patch, as it is
>> absolutely needed for understanding the scheduler behavior.
>
> Thanks for this latest patch. I've managed to cobble together
> a 4 node NUMAQ system (16 700 MHZ PIII procs, 16GB memory) and
> ran kernbench and schedbench on this, along with the 2.5.50 and
> 2.5.52 versions. Results remain fairly consistent with what
> we have been obtaining on earlier versions.
>
> Kernbench:
> Elapsed User System CPU
> sched50 29.96s 288.308s 83.606s 1240.8%
> sched52 29.836s 285.832s 84.464s 1240.4%
> sched53 29.364s 284.808s 83.174s 1252.6%
> stock50 31.074s 303.664s 89.194s 1264.2%
> stock53 31.204s 306.224s 87.776s 1263.2%
Not sure what you're correlating here because your rows are all named
the same thing. However, the new version seems to be much slower
on systime (about 7-8% for me), which roughly correlates with your
last two rows above. Me no like.
Erich, can you separate out the cleanups from the functional changes?
The cleanups look cool. The only thing that I can see easily was
BUSY_REBALANCE_TICK - I tried changing that back to the old version
but it made no difference. Here's a diff going from your old scheduler
(that I've been keeping up to date in my tree) to your new one on 2.5.54.
M.
diff -urpN -X /home/fletch/.diff.exclude oldsched/arch/i386/kernel/smpboot.c newsched/arch/i386/kernel/smpboot.c
--- oldsched/arch/i386/kernel/smpboot.c Sat Jan 4 14:57:43 2003
+++ newsched/arch/i386/kernel/smpboot.c Sat Jan 4 14:58:25 2003
@@ -1198,9 +1198,7 @@ int __devinit __cpu_up(unsigned int cpu)
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
-#ifdef CONFIG_NUMA
build_node_data();
-#endif
}
void __init smp_intr_init()
diff -urpN -X /home/fletch/.diff.exclude oldsched/arch/ia64/kernel/smpboot.c newsched/arch/ia64/kernel/smpboot.c
--- oldsched/arch/ia64/kernel/smpboot.c Sat Jan 4 14:57:43 2003
+++ newsched/arch/ia64/kernel/smpboot.c Sat Jan 4 14:58:25 2003
@@ -528,9 +528,7 @@ smp_cpus_done (unsigned int dummy)
printk(KERN_INFO"Total of %d processors activated (%lu.%02lu BogoMIPS).\n",
num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
-#ifdef CONFIG_NUMA
build_node_data();
-#endif
}
int __devinit
diff -urpN -X /home/fletch/.diff.exclude oldsched/arch/ppc64/kernel/smp.c newsched/arch/ppc64/kernel/smp.c
--- oldsched/arch/ppc64/kernel/smp.c Sat Jan 4 14:57:43 2003
+++ newsched/arch/ppc64/kernel/smp.c Sat Jan 4 14:58:25 2003
@@ -685,7 +685,5 @@ void __init smp_cpus_done(unsigned int m
/* XXX fix this, xics currently relies on it - Anton */
smp_threads_ready = 1;
-#ifdef CONFIG_NUMA
build_node_data();
-#endif
}
diff -urpN -X /home/fletch/.diff.exclude oldsched/include/linux/sched.h newsched/include/linux/sched.h
--- oldsched/include/linux/sched.h Sat Jan 4 14:57:50 2003
+++ newsched/include/linux/sched.h Sat Jan 4 14:58:28 2003
@@ -20,7 +20,6 @@
#include <asm/mmu.h>
#include <linux/smp.h>
-#include <asm/topology.h>
#include <linux/sem.h>
#include <linux/signal.h>
#include <linux/securebits.h>
@@ -160,19 +159,6 @@ extern void update_one_process(struct ta
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
-#ifdef CONFIG_NUMA
-extern void sched_balance_exec(void);
-extern void node_nr_running_init(void);
-#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
- rq->nr_running++
-#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
- rq->nr_running--
-#else
-#define sched_balance_exec() {}
-#define node_nr_running_init() {}
-#define nr_running_inc(rq) rq->nr_running++
-#define nr_running_dec(rq) rq->nr_running--
-#endif
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
@@ -458,8 +444,21 @@ extern void set_cpus_allowed(task_t *p,
#endif
#ifdef CONFIG_NUMA
-extern void build_pools(void);
+extern void build_node_data(void);
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--
+#else
+#define build_node_data() {}
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
#endif
+
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
diff -urpN -X /home/fletch/.diff.exclude oldsched/kernel/sched.c newsched/kernel/sched.c
--- oldsched/kernel/sched.c Sat Jan 4 14:57:50 2003
+++ newsched/kernel/sched.c Sat Jan 4 14:58:28 2003
@@ -159,10 +159,10 @@ struct runqueue {
struct list_head migration_queue;
atomic_t nr_iowait;
-
+#ifdef CONFIG_NUMA
unsigned long wait_time;
int wait_node;
-
+#endif
} ____cacheline_aligned;
static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -182,24 +182,33 @@ static struct runqueue runqueues[NR_CPUS
# define task_running(rq, p) ((rq)->curr == (p))
#endif
-#define cpu_to_node(cpu) __cpu_to_node(cpu)
+/*
+ * Node loads are scaled with LOADSCALE. This way:
+ * - we avoid zeros in integer divisions (dividing by node CPU number),
+ * - loads of nodes with different numbers of CPUs are comparable.
+ */
#define LOADSCALE 128
#ifdef CONFIG_NUMA
/* Number of CPUs per node: sane values until all CPUs are up */
-int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
-int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
-#define node_ncpus(node) _node_nr_cpus[node]
+static int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+static int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_nr_cpus(node) _node_nr_cpus[node]
+/*
+ * Delay stealing from remote node when own queue is idle/busy. These delays
+ * tend to equalize the node loads. NODE_DELAY_IDLE is nonzero because we
+ * want to give idle CPUs in the busiest node a chance to steal first.
+ */
#define NODE_DELAY_IDLE (1*HZ/1000)
#define NODE_DELAY_BUSY (20*HZ/1000)
/* the following macro implies that logical CPUs are sorted by node number */
#define loop_over_node(cpu,node) \
- for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+ for(cpu = node_ptr[node]; cpu < node_ptr[node+1]; cpu++)
/*
- * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * Build node_ptr and node_nr_cpus data after all CPUs have come up. This
* expects that the logical CPUs are sorted by their node numbers! Check
* out the NUMA API for details on why this should be that way. [EF]
*/
@@ -209,24 +218,25 @@ void build_node_data(void)
unsigned long mask;
ptr=0;
- for (n=0; n<numnodes; n++) {
+ for (n = 0; n < numnodes; n++) {
mask = __node_to_cpu_mask(n) & cpu_online_map;
node_ptr[n] = ptr;
- for (cpu=0; cpu<NR_CPUS; cpu++)
+ for (cpu = 0; cpu < NR_CPUS; cpu++)
if (mask & (1UL << cpu))
ptr++;
- node_ncpus(n) = ptr - node_ptr[n];;
+ node_nr_cpus(n) = ptr - node_ptr[n];
}
+ node_ptr[numnodes] = ptr;
printk("CPU nodes : %d\n",numnodes);
- for (n=0; n<numnodes; n++)
+ for (n = 0; n < numnodes; n++)
printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
}
#else
-#define node_ncpus(node) num_online_cpus()
+#define node_nr_cpus(node) num_online_cpus()
#define NODE_DELAY_IDLE 0
#define NODE_DELAY_BUSY 0
-#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#define loop_over_node(cpu,n) for(cpu = 0; cpu < NR_CPUS; cpu++)
#endif
@@ -740,18 +750,18 @@ struct node_queue_data {
 * Check if CPUs are balanced. The check is more involved than the O(1) original
 * because that one is simply wrong in certain cases (integer arithmetic !!!) EF
 */
-#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+#define cpus_balanced(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
/*
 * Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
* node) when it's load is not smaller than "comp"'s load - LOADSCALE/2.
*/
-#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+#define nodes_balanced(comp,this) (((comp)-(this)) < LOADSCALE/2)
static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
{
runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
- int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu);
+ int n, steal_delay, system_load = 0, this_node=__cpu_to_node(this_cpu);
struct node_queue_data nd[MAX_NUMNODES], *node;
if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
@@ -766,8 +776,9 @@ static inline runqueue_t *find_busiest_q
/* compute all node loads and save their max cpu loads */
for (cpu = 0; cpu < NR_CPUS; cpu++) {
- if (!cpu_online(cpu)) continue;
- node = &nd[cpu_to_node(cpu)];
+ if (!cpu_online(cpu))
+ continue;
+ node = &nd[__cpu_to_node(cpu)];
src_rq = cpu_rq(cpu);
if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
load = src_rq->nr_running;
@@ -783,7 +794,7 @@ static inline runqueue_t *find_busiest_q
busiest_cpu = nd[this_node].busiest_cpu;
if (busiest_cpu != this_cpu) {
- if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+ if (!cpus_balanced(nd[this_node].busiest_cpu_load,*nr_running)) {
busiest_rq = cpu_rq(busiest_cpu);
this_rq->wait_node = -1;
goto out;
@@ -795,7 +806,7 @@ static inline runqueue_t *find_busiest_q
for (n = 0; n < numnodes; n++) {
node = &nd[n];
system_load += node->total_load;
- node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+ node->average_load = node->total_load*LOADSCALE/node_nr_cpus(n);
if (node->average_load > max_avg_load) {
max_avg_load = node->average_load;
busiest_node = n;
@@ -804,7 +815,7 @@ static inline runqueue_t *find_busiest_q
/* Exit if not enough imbalance on any remote node. */
if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
- NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+ nodes_balanced(max_avg_load,nd[this_node].average_load)) {
this_rq->wait_node = -1;
goto out;
}
@@ -812,7 +823,7 @@ static inline runqueue_t *find_busiest_q
busiest_cpu = nd[busiest_node].busiest_cpu;
avg_load = system_load*LOADSCALE/num_online_cpus();
/* Wait longer before stealing if own node's load is average. */
- if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+ if (nodes_balanced(avg_load,nd[this_node].average_load))
steal_delay = NODE_DELAY_BUSY;
else
steal_delay = NODE_DELAY_IDLE;
@@ -821,12 +832,11 @@ static inline runqueue_t *find_busiest_q
this_rq->wait_node = busiest_node;
this_rq->wait_time = jiffies;
goto out;
- } else
/* old most loaded node: check if waited enough */
- if (jiffies - this_rq->wait_time < steal_delay)
- goto out;
+ } else if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;
- if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+ if ((!cpus_balanced(nd[busiest_node].busiest_cpu_load,*nr_running))) {
busiest_rq = cpu_rq(busiest_cpu);
this_rq->wait_node = -1;
}
@@ -950,10 +960,10 @@ out:
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
*/
-#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
static inline void idle_tick(runqueue_t *rq)
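The LOADSCALE arithmetic and the two balance checks in the diff above are easy to exercise outside the kernel. Below is a user-space restatement: the constants and the macro bodies are copied from the patch, but the function wrappers and the standalone setting are mine, so treat it as a sketch rather than kernel code.

```c
#include <assert.h>

#define LOADSCALE 128

/* Scaled average node load, as in find_busiest_queue(): multiplying by
 * LOADSCALE before dividing by the node's CPU count avoids truncating
 * small loads to zero and makes nodes with different CPU counts
 * directly comparable. */
static int node_avg_load(int total_load, int nr_cpus)
{
	return total_load * LOADSCALE / nr_cpus;
}

/* nodes_balanced() from the patch: "this" node counts as balanced
 * against the busiest node when its scaled load is within LOADSCALE/2
 * of the busiest node's scaled load. */
static int nodes_balanced(int busiest_load, int this_load)
{
	return (busiest_load - this_load) < LOADSCALE / 2;
}

/* cpus_balanced() from the patch: the intra-node check on raw
 * nr_running counts, careful about integer-division rounding. */
static int cpus_balanced(int max_load, int this_nr_running)
{
	return (max_load <= 1) ||
	       ((max_load - this_nr_running) / 2 <
	        ((max_load + this_nr_running) / 2 + 3) / 4);
}
```

For example, a 4-CPU node with 8 running tasks and a 2-CPU node with 4 running tasks both scale to the same average load (256), so no cross-node steal happens even though the raw task counts differ.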
* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
2003-01-05 5:35 ` Martin J. Bligh
@ 2003-01-06 3:58 ` Michael Hohnbaum
2003-01-06 6:07 ` Martin J. Bligh
0 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2003-01-06 3:58 UTC (permalink / raw
To: Martin J. Bligh
Cc: Erich Focht, Robert Love, Ingo Molnar, Stephen Hemminger,
linux-kernel
On Sat, 2003-01-04 at 21:35, Martin J. Bligh wrote:
> >> Here comes the minimal NUMA scheduler built on top of the O(1) load
> >> balancer rediffed for 2.5.53 with some changes in the core part. As
> >> suggested by Michael, I added the cputimes_stat patch, as it is
> >> absolutely needed for understanding the scheduler behavior.
> >
> > Thanks for this latest patch. I've managed to cobble together
> > a 4 node NUMAQ system (16 700 MHZ PIII procs, 16GB memory) and
> > ran kernbench and schedbench on this, along with the 2.5.50 and
> > 2.5.52 versions. Results remain fairly consistent with what
> > we have been obtaining on earlier versions.
> >
> > Kernbench:
> > Elapsed User System CPU
> > sched50 29.96s 288.308s 83.606s 1240.8%
> > sched52 29.836s 285.832s 84.464s 1240.4%
> > sched53 29.364s 284.808s 83.174s 1252.6%
> > stock50 31.074s 303.664s 89.194s 1264.2%
> > stock53 31.204s 306.224s 87.776s 1263.2%
>
> Not sure what you're correllating here because your rows are all named
> the same thing. However, the new version seems to be much slower
> on systime (about 7-8% for me), which roughly correllates with your
> last two rows above. Me no like.
Sorry, I forgot to include a better description of what the
row labels mean.
sched50 = linux 2.5.50 with the NUMA scheduler
sched52 = linux 2.5.52 with the NUMA scheduler
sched53 = linux 2.5.53 with the NUMA scheduler
stock50 = linux 2.5.50 without the NUMA scheduler
stock53 = linux 2.5.53 without the NUMA scheduler
Thus, this shows that the NUMA scheduler drops systime by ~5.5 secs,
or roughly 8%. So, my testing is not showing an increase in systime
like you apparently are seeing.
Michael
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
2003-01-06 3:58 ` Michael Hohnbaum
@ 2003-01-06 6:07 ` Martin J. Bligh
2003-01-07 2:23 ` Michael Hohnbaum
0 siblings, 1 reply; 29+ messages in thread
From: Martin J. Bligh @ 2003-01-06 6:07 UTC (permalink / raw
To: Michael Hohnbaum
Cc: Erich Focht, Robert Love, Ingo Molnar, Stephen Hemminger,
linux-kernel
>> > Kernbench:
>> > Elapsed User System CPU
>> > sched50 29.96s 288.308s 83.606s 1240.8%
>> > sched52 29.836s 285.832s 84.464s 1240.4%
>> > sched53 29.364s 284.808s 83.174s 1252.6%
>> > stock50 31.074s 303.664s 89.194s 1264.2%
>> > stock53 31.204s 306.224s 87.776s 1263.2%
>>
>> Not sure what you're correllating here because your rows are all named
>> the same thing. However, the new version seems to be much slower
>> on systime (about 7-8% for me), which roughly correllates with your
>> last two rows above. Me no like.
>
> Sorry, I forgot to include a bit better description of what the
> row labels mean.
>
> sched50 = linux 2.5.50 with the NUMA scheduler
> sched52 = linux 2.5.52 with the NUMA scheduler
> sched53 = linux 2.5.53 with the NUMA scheduler
> stock50 = linux 2.5.50 without the NUMA scheduler
> stock53 = linux 2.5.53 without the NUMA scheduler
>
> Thus, this shows that the NUMA scheduler drops systime by ~5.5 secs,
> or roughly 8%. So, my testing is not showing an increase in systime
> like you apparently are seeing.
Sorry, the row names weren't that bad if I actually read them carefully ;-)
I was doing a slightly different test - Erich's old sched code vs the new,
both on 2.5.54, and seem to have a degradation.
M.
* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
2003-01-06 6:07 ` Martin J. Bligh
@ 2003-01-07 2:23 ` Michael Hohnbaum
2003-01-07 11:27 ` Erich Focht
0 siblings, 1 reply; 29+ messages in thread
From: Michael Hohnbaum @ 2003-01-07 2:23 UTC (permalink / raw
To: Martin J. Bligh
Cc: Erich Focht, Robert Love, Ingo Molnar, Stephen Hemminger,
linux-kernel
On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> >> > Kernbench:
> >> > Elapsed User System CPU
> >> > sched50 29.96s 288.308s 83.606s 1240.8%
> >> > sched52 29.836s 285.832s 84.464s 1240.4%
> >> > sched53 29.364s 284.808s 83.174s 1252.6%
> >> > stock50 31.074s 303.664s 89.194s 1264.2%
> >> > stock53 31.204s 306.224s 87.776s 1263.2%
> >
> > sched50 = linux 2.5.50 with the NUMA scheduler
> > sched52 = linux 2.5.52 with the NUMA scheduler
> > sched53 = linux 2.5.53 with the NUMA scheduler
> > stock50 = linux 2.5.50 without the NUMA scheduler
> > stock53 = linux 2.5.53 without the NUMA scheduler
>
> I was doing a slightly different test - Erich's old sched code vs the new
> both on 2.5.54, and seem to have a degredation.
>
> M.
Martin,
I ran 2.5.54 with an older version of Erich's NUMA scheduler and
with the version sent out for 2.5.53. Results were similar:
Kernbench:
Elapsed User System CPU
sched54 29.112s 283.888s 82.84s 1259.4%
oldsched54 29.436s 286.942s 82.722s 1256.2%
sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler
The numbers for the new version are actually a touch better, but
close enough to be within a reasonable margin of error.
I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.
Michael
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
2003-01-07 2:23 ` Michael Hohnbaum
@ 2003-01-07 11:27 ` Erich Focht
2003-01-07 23:35 ` Michael Hohnbaum
0 siblings, 1 reply; 29+ messages in thread
From: Erich Focht @ 2003-01-07 11:27 UTC (permalink / raw
To: Michael Hohnbaum, Martin J. Bligh
Cc: Robert Love, Ingo Molnar, Stephen Hemminger, linux-kernel
Hi Michael and Martin,
thanks a lot for the testing!
I rechecked the changes and really don't see any reason for a
slowdown. Michael's measurements seem to confirm that this is just a
statistical effect. I suggest: when in doubt, do 10 kernel compiles
instead of 5. A simple statistical error estimation, as I did for
schedbench, might help too. I think I sent you the script a while
ago.
I understand from your emails that the 2.5.53 patches apply and work
for 2.5.54, therefore I'll wait for 2.5.55 with a rediff.
Regards,
Erich
On Tuesday 07 January 2003 03:23, Michael Hohnbaum wrote:
> On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> > >> > Kernbench:
> > >> > Elapsed User System CPU
> > >> > sched50 29.96s 288.308s 83.606s 1240.8%
> > >> > sched52 29.836s 285.832s 84.464s 1240.4%
> > >> > sched53 29.364s 284.808s 83.174s 1252.6%
> > >> > stock50 31.074s 303.664s 89.194s 1264.2%
> > >> > stock53 31.204s 306.224s 87.776s 1263.2%
> > >
> > > sched50 = linux 2.5.50 with the NUMA scheduler
> > > sched52 = linux 2.5.52 with the NUMA scheduler
> > > sched53 = linux 2.5.53 with the NUMA scheduler
> > > stock50 = linux 2.5.50 without the NUMA scheduler
> > > stock53 = linux 2.5.53 without the NUMA scheduler
> >
> > I was doing a slightly different test - Erich's old sched code vs the new
> > both on 2.5.54, and seem to have a degredation.
> >
> > M.
>
> Martin,
>
> I ran 2.5.54 with an older version of Erich's NUMA scheduler and
> with the version sent out for 2.5.53. Results were similar:
>
> Kernbench:
> Elapsed User System CPU
> sched54 29.112s 283.888s 82.84s 1259.4%
> oldsched54 29.436s 286.942s 82.722s 1256.2%
>
> sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
> oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler
>
> The numbers for the new version are actually a touch better, but
> close enough to be within a reasonable margin of error.
>
> I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.
>
> Michael
* Re: [PATCH 2.5.53] NUMA scheduler (1/3)
2003-01-07 11:27 ` Erich Focht
@ 2003-01-07 23:35 ` Michael Hohnbaum
0 siblings, 0 replies; 29+ messages in thread
From: Michael Hohnbaum @ 2003-01-07 23:35 UTC (permalink / raw
To: Erich Focht
Cc: Martin J. Bligh, Robert Love, Ingo Molnar, Stephen Hemminger,
linux-kernel
On Tue, 2003-01-07 at 03:27, Erich Focht wrote:
> Hi Michael and Martin,
>
> thanks a lot for the testing!
>
> I rechecked the changes and really don't see any reason for a
> slowdown. Michael's measurements seem to confirm that this is just a
> statistical effect. I suggest: when in doubt, do 10 kernel compiles
> instead of 5. A simple statistical error estimation as I did for
> schedbench might help, too. Guess I've sent you the script a while
> ago.
>
One more set of test results, this time including schedbench. Previous
runs did not have the cputimes_stat patch, so the schedbench numbers
were not particularly useful.
Kernbench:
Elapsed User System CPU
oldsched54 29.75s 287.02s 82.876s 1242.8%
sched54 29.112s 283.888s 82.84s 1259.4%
stock54 31.348s 303.134s 87.824s 1247.2%
sched50 29.96s 288.308s 83.606s 1240.8%
stock50 31.074s 303.664s 89.194s 1264.2%
Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
oldsched54 20.44 37.41 81.81 0.93
sched54 22.03 34.90 88.15 0.75
stock54 49.35 57.53 197.45 0.86
sched50 22.00 35.50 88.04 0.86
stock50 27.18 45.99 108.75 0.87
Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
oldsched54 31.98 51.59 255.88 1.93
sched54 27.95 37.12 223.66 1.50
stock54 43.14 62.97 345.18 2.12
sched50 32.05 46.81 256.45 1.81
stock50 44.75 63.68 358.09 2.55
Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
oldsched54 58.39 80.29 934.38 4.30
sched54 55.37 69.58 886.10 3.79
stock54 66.00 81.25 1056.25 7.12
sched50 55.34 70.74 885.61 4.09
stock50 65.36 80.72 1045.94 5.92
Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
oldsched54 54.53 119.71 1745.37 11.70
sched54 57.93 132.11 1854.01 10.74
stock54 77.81 173.26 2490.31 12.37
sched50 59.47 139.77 1903.37 10.24
stock50 84.51 194.89 2704.83 12.44
Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
oldsched54 79.57 339.88 5093.31 20.74
sched54 72.91 308.87 4667.03 21.06
stock54 86.68 368.55 5548.57 25.73
sched50 63.55 276.81 4067.65 21.58
stock50 99.68 422.06 6380.04 25.92
Row names are the same as before (see below). Numbers seem to be
fairly consistent.
> I understand from your emails that the 2.5.53 patches apply and work
> for 2.5.54, therefore I'll wait for 2.5.55 with a rediff.
It is fine with me to wait for 2.5.55 to rediff. I'm getting patch
errors now with the cputimes_stat patch, so a rediff of that with
2.5.55 would be handy (although I suppose I should quit being lazy
and just do that myself).
>
> Regards,
> Erich
Michael
>
>
> On Tuesday 07 January 2003 03:23, Michael Hohnbaum wrote:
> > On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> > > >> > Kernbench:
> > > >> > Elapsed User System CPU
> > > >> > sched50 29.96s 288.308s 83.606s 1240.8%
> > > >> > sched52 29.836s 285.832s 84.464s 1240.4%
> > > >> > sched53 29.364s 284.808s 83.174s 1252.6%
> > > >> > stock50 31.074s 303.664s 89.194s 1264.2%
> > > >> > stock53 31.204s 306.224s 87.776s 1263.2%
> > > >
> > > > sched50 = linux 2.5.50 with the NUMA scheduler
> > > > sched52 = linux 2.5.52 with the NUMA scheduler
> > > > sched53 = linux 2.5.53 with the NUMA scheduler
> > > > stock50 = linux 2.5.50 without the NUMA scheduler
> > > > stock53 = linux 2.5.53 without the NUMA scheduler
> > >
> > > I was doing a slightly different test - Erich's old sched code vs the new
> > > both on 2.5.54, and seem to have a degredation.
> > >
> > > M.
> >
> > Martin,
> >
> > I ran 2.5.54 with an older version of Erich's NUMA scheduler and
> > with the version sent out for 2.5.53. Results were similar:
> >
> > Kernbench:
> > Elapsed User System CPU
> > sched54 29.112s 283.888s 82.84s 1259.4%
> > oldsched54 29.436s 286.942s 82.722s 1256.2%
> >
> > sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
> > oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler
> >
> > The numbers for the new version are actually a touch better, but
> > close enough to be within a reasonable margin of error.
> >
> > I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.
> >
> > Michael
>
>
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
end of thread, other threads:[~2003-01-07 23:25 UTC | newest]
Thread overview: 29+ messages
2002-11-06 16:34 NUMA scheduler BK tree Erich Focht
2002-11-06 18:10 ` Michael Hohnbaum
2002-11-07 23:05 ` Erich Focht
2002-11-07 23:46 ` Michael Hohnbaum
2002-11-08 16:57 ` Erich Focht
2002-11-11 15:13 ` [PATCH 2.5.47] NUMA scheduler (1/2) Erich Focht
2002-11-11 15:14 ` [PATCH 2.5.47] NUMA scheduler (2/2) Erich Focht
2002-11-12 0:24 ` [PATCH 2.5.47] NUMA scheduler (1/2) Michael Hohnbaum
2002-11-18 19:40 ` NUMA scheduler BK tree Martin J. Bligh
2002-11-19 16:26 ` [PATCH 2.5.48] NUMA scheduler (1/2) Erich Focht
2002-11-19 16:27 ` [PATCH 2.5.48] NUMA scheduler (2/2) Erich Focht
2002-12-02 15:29 ` [PATCH 2.5.50] NUMA scheduler (1/2) Erich Focht
2002-12-02 15:30 ` [PATCH 2.5.50] NUMA scheduler (2/2) Erich Focht
2002-12-06 17:39 ` [PATCH 2.5.50] NUMA scheduler (1/2) Michael Hohnbaum
2002-12-18 16:21 ` [PATCH 2.5.52] " Erich Focht
2002-12-18 16:23 ` [PATCH 2.5.52] NUMA scheduler (2/2) Erich Focht
2002-12-20 14:49 ` [PATCH 2.5.52] NUMA scheduler: cputimes stats Erich Focht
2002-12-20 15:17 ` [PATCH 2.5.52] NUMA scheduler (1/2) Christoph Hellwig
2002-12-20 17:44 ` Erich Focht
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (1/3) Erich Focht
2002-12-31 13:29 ` [PATCH 2.5.53] NUMA scheduler (2/3) Erich Focht
2002-12-31 13:30 ` [PATCH 2.5.53] NUMA scheduler (3/3) Erich Focht
2003-01-04 1:58 ` [PATCH 2.5.53] NUMA scheduler (1/3) Michael Hohnbaum
2003-01-05 5:35 ` Martin J. Bligh
2003-01-06 3:58 ` Michael Hohnbaum
2003-01-06 6:07 ` Martin J. Bligh
2003-01-07 2:23 ` Michael Hohnbaum
2003-01-07 11:27 ` Erich Focht
2003-01-07 23:35 ` Michael Hohnbaum