* [PATCH 1/2] Customize sched domain via cpuset
@ 2008-04-01 11:26 Hidetoshi Seto
  2008-04-01 11:40 ` Andi Kleen
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Hidetoshi Seto @ 2008-04-01 11:26 UTC (permalink / raw
  To: linux-kernel

Hi all,

Using cpusets, we can now partition the system into multiple sched domains.
Then, how about providing different characteristics for each domain?

This patch introduces a new cpuset feature - sched domain customization.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 Documentation/cpusets.txt |   89 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 87 insertions(+), 2 deletions(-)

Index: GIT-torvalds/Documentation/cpusets.txt
===================================================================
--- GIT-torvalds.orig/Documentation/cpusets.txt
+++ GIT-torvalds/Documentation/cpusets.txt
@@ -8,6 +8,7 @@ Portions Copyright (c) 2004-2006 Silicon
 Modified by Paul Jackson <pj@sgi.com>
 Modified by Christoph Lameter <clameter@sgi.com>
 Modified by Paul Menage <menage@google.com>
+Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

 CONTENTS:
 =========
@@ -20,7 +21,8 @@ CONTENTS:
   1.5 What is memory_pressure ?
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
-  1.8 How do I use cpusets ?
+  1.8 What are other sched_* files ?
+  1.9 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -497,7 +499,90 @@ the cpuset code to update these sched do
 partition requested with the current, and updates its sched domains,
 removing the old and adding the new, for each change.

-1.8 How do I use cpusets ?
+1.8 What are other sched_* files ?
+----------------------------------
+
+As described in 1.7, cpusets allow you to partition the system's CPUs
+into a number of sched domains.  Each sched domain is load balanced
+independently, in a traditional way designed to work well for typical
+systems.
+
+But you may want to customize the load balancing behavior for your
+particular system.  For this purpose, cpusets provide files named
+sched_* that customize the sched domain of the cpuset for special
+situations, e.g. a specific application on specialized hardware.
+
+These files are per-cpuset and affect the sched domain to which the
+cpuset belongs.  If multiple cpusets overlap and hence form a single
+sched domain, changes in one of them affect the others.
+If the "sched_load_balance" flag of a cpuset is disabled, the sched_*
+files have no effect since no sched domain belongs to the cpuset.
+
+Note that modifying the sched_* files will have both good and bad
+effects, and whether they are acceptable or not depends on your
+situation.  Don't modify these files if you are not sure of the effect.
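+
+For example, assuming the cpuset filesystem is mounted at /dev/cpuset
+(see section 2.1), these files appear in each cpuset directory and
+can be read and written like any other cpuset file.  In this sketch
+both flags are assumed to be at their disabled default:
+
+  # cat /dev/cpuset/sched_wake_idle_far
+  0
+  # cat /dev/cpuset/sched_balance_newidle_far
+  0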
+
+1.8.1 What is sched_wake_idle_far ?
+-----------------------------------
+
+When a task is woken up, the scheduler tries to place it on an idle CPU.
+
+For example, if task A running on CPU X activates another task B
+on the same CPU X, and if CPU Y is X's sibling and is idle, then the
+scheduler migrates task B to CPU Y so that task B can start on CPU Y
+without waiting for task A on CPU X.
+
+However, the scheduler doesn't search the whole system; by default it
+searches only nearby siblings.  Assume CPU Z is relatively far from
+CPU X.  Even if CPU Z is idle while CPU X and its siblings are busy,
+the scheduler can't migrate the woken task B from X to Z.  As a
+result, task B on CPU X has to wait for task A, or for the load
+balance on the next tick.  For some special applications, waiting
+one tick is too long.
+
+The main reason the scheduler limits the search for an idle CPU to a
+small range such as "siblings in the socket" is that this saves both
+search cost and migration cost.  Nowadays siblings share resources -
+CPU caches and so on - so this limit can save some migration cost,
+assuming those resources still hold enough unexpired data for the
+migrating task.  Usually this assumption holds, but it is not
+guaranteed.
+
+When the flag 'sched_wake_idle_far' is enabled, this search range is
+expanded to all CPUs in the sched domain of the cpuset.
+
+If this flag were enabled in the CPU Z example above, the scheduler
+could find CPU Z at some extra search cost, and migrate task B to
+CPU Z at some extra migration cost.  In exchange for these costs,
+task B starts relatively quickly.
+
+If, in your situation:
+ - the migration cost between CPUs can be assumed to be considerably
+   small (for you), due to your application's behavior or special
+   hardware support for CPU caches etc.,
+ - the search cost has no impact (for you), or you can keep it small
+   enough, e.g. by keeping your cpusets compact, and
+ - low latency is required even if it sacrifices cache hit rate etc.,
+then turning on 'sched_wake_idle_far' would benefit you.
+
+1.8.2 What is sched_balance_newidle_far ?
+-----------------------------------------
+
+When a CPU runs out of tasks in its runqueue, it tries to pull extra
+tasks from other, busy CPUs to help them before going idle.
+
+Of course it takes some search cost to find movable tasks, so the
+scheduler might not search all CPUs in the system.  For example, the
+range may be limited to the socket or node where the CPU is located.
+
+When the flag 'sched_balance_newidle_far' is enabled, this range
+is expanded to all CPUs in the sched domain of the cpuset.
+
+The situations in which this flag is worth considering are almost the
+same as for 'sched_wake_idle_far'.  If you are willing to trade away
+some other benefits (such as cache hit rate) for better latency and
+higher CPU utilization, then enable this flag.
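+
+As a sketch of usage, both flags can be enabled for a cpuset by
+writing '1' to them, and disabled again by writing '0':
+
+  # echo 1 > /dev/cpuset/sched_wake_idle_far
+  # echo 1 > /dev/cpuset/sched_balance_newidle_far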
+
+1.9 How do I use cpusets ?
 --------------------------

 In order to minimize the impact of cpusets on critical kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-01 11:26 [PATCH 1/2] Customize sched domain via cpuset Hidetoshi Seto
@ 2008-04-01 11:40 ` Andi Kleen
  2008-04-01 11:56   ` Peter Zijlstra
  2008-04-01 11:48 ` Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Andi Kleen @ 2008-04-01 11:40 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: linux-kernel

Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> writes:

> Using cpusets, we can now partition the system into multiple sched domains.
> Then, how about providing different characteristics for each domain?

Did you actually see much improvement in any relevant workload
from tweaking these parameters?  If yes what did you change?
And how much did it gain?

Ideally the kernel should perform well without much tweaking
out of the box, simply because most users won't tweak. Adding a 
lot of such parameters would imply giving up on good defaults which 
is not a good thing.

-Andi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-01 11:26 [PATCH 1/2] Customize sched domain via cpuset Hidetoshi Seto
  2008-04-01 11:40 ` Andi Kleen
@ 2008-04-01 11:48 ` Peter Zijlstra
  2008-04-01 11:55 ` Paul Jackson
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2008-04-01 11:48 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: linux-kernel, Ingo Molnar, Paul Jackson

Adding CCs (highly recommended to CC at least the subsystem maintainers
of the stuff you touch :-)

On Tue, 2008-04-01 at 20:26 +0900, Hidetoshi Seto wrote:
> Hi all,
> 
> Using cpusets, we can now partition the system into multiple sched domains.
> Then, how about providing different characteristics for each domain?
> 
> This patch introduces a new cpuset feature - sched domain customization.
> 
> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-01 11:26 [PATCH 1/2] Customize sched domain via cpuset Hidetoshi Seto
  2008-04-01 11:40 ` Andi Kleen
  2008-04-01 11:48 ` Peter Zijlstra
@ 2008-04-01 11:55 ` Paul Jackson
  2008-04-01 11:59   ` Peter Zijlstra
  2008-04-02  8:39   ` Hidetoshi Seto
  2008-04-04  9:10 ` [PATCH 1/2] Customize sched domain via cpuset (v2) Hidetoshi Seto
  2008-04-04  9:11 ` [PATCH 2/2] " Hidetoshi Seto
  4 siblings, 2 replies; 19+ messages in thread
From: Paul Jackson @ 2008-04-01 11:55 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: linux-kernel

Interesting ...

So, we have two flags here.  One flag "sched_wake_idle_far" that will
cause the current task to search farther for an idle CPU when it wakes
up another task that needs a CPU on which to run, and the other flag
"sched_balance_newidle_far" that will cause a soon-to-idle CPU to search
farther for a task it might pull over and run, instead of going idle.

I am tempted to ask if we should not elaborate this in one dimension,
and simplify it in another dimension.

First the simplification side: do we need both flags?  Yes, they are
two distinct cases in the code, but perhaps practical uses will always
end up setting both flags the same way.  If that's the case, then we
are just burdening the user of these flags with understanding a detail
that didn't matter to them: did a waking task or an idle CPU provoke
the search?  Do you have or know of a situation where you actually
desire to enable one flag while disabling the other?

For the elaboration side: your proposal has just two levels of
distance, near and far.  Perhaps, as architectures become more
elaborate and hierarchies deeper, we would want N levels of distance,
and the ability to request such load balancing for all levels "n"
for our choice of "n" <= N.

If we did both the above, then we might have a single per-cpuset file
that took an integer value ... this "n".  If (n == 0), that might mean
no such balancing at all.  If (n == 1), that might mean just the
nearest balancing, for example, to the hyperthread within the same core,
on some current Intel architectures.  If (n == 2), then that might mean,
on the same architectures, that balancing could occur across cores
within the same package.  If (n == 3) then that might mean, again on
that architecture, that balancing could occur across packages on the
same node board.  As architectures evolve over time, the exact details
of what each value of "n" mean would evolve, but always higher "n"
would enable balancing across a wider portion of the system.
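
To sketch the hypothetical interface (the file name below is invented
for illustration only; no such file exists in this patch set), asking
for balancing up to level 2, across cores in a package on the above
architecture, might look like:

  echo 2 > /dev/cpuset/myset/sched_balance_level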

Please understand I am just brain storming here.  I don't know that
the alternatives I considered above are preferable or not to what
your patch presents.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-01 11:40 ` Andi Kleen
@ 2008-04-01 11:56   ` Peter Zijlstra
  2008-04-01 13:29     ` Andi Kleen
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2008-04-01 11:56 UTC (permalink / raw
  To: Andi Kleen; +Cc: Hidetoshi Seto, linux-kernel, Ingo Molnar, Paul Jackson

On Tue, 2008-04-01 at 13:40 +0200, Andi Kleen wrote:
> Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> writes:
> 
> > Using cpusets, we can now partition the system into multiple sched domains.
> > Then, how about providing different characteristics for each domain?
> 
> Did you actually see much improvement in any relevant workload
> from tweaking these parameters?  If yes what did you change?
> And how much did it gain?
> 
> Ideally the kernel should perform well without much tweaking
> out of the box, simply because most users won't tweak. Adding a 
> lot of such parameters would imply giving up on good defaults which 
> is not a good thing.

From what I understand, they need very aggressive idle balancing; much
more so than what is normally healthy.

I can see how something like that can be useful when you have a lot of
very short running tasks. These could pile up on a few cpus and leave
others idle.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-01 11:55 ` Paul Jackson
@ 2008-04-01 11:59   ` Peter Zijlstra
  2008-04-02  8:39   ` Hidetoshi Seto
  1 sibling, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2008-04-01 11:59 UTC (permalink / raw
  To: Paul Jackson; +Cc: Hidetoshi Seto, linux-kernel, Ingo Molnar

On Tue, 2008-04-01 at 06:55 -0500, Paul Jackson wrote:
> Interesting ...
> 
> So, we have two flags here.  One flag "sched_wake_idle_far" that will
> cause the current task to search farther for an idle CPU when it wakes
> up another task that needs a CPU on which to run, and the other flag
> "sched_balance_newidle_far" that will cause a soon-to-idle CPU to search
> farther for a task it might pull over and run, instead of going idle.
> 
> I am tempted to ask if we should not elaborate this in one dimension,
> and simplify it in another dimension.

FWIW I like your suggestions.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-01 11:56   ` Peter Zijlstra
@ 2008-04-01 13:29     ` Andi Kleen
  2008-04-01 13:38       ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Andi Kleen @ 2008-04-01 13:29 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: Andi Kleen, Hidetoshi Seto, linux-kernel, Ingo Molnar,
	Paul Jackson

On Tue, Apr 01, 2008 at 01:56:08PM +0200, Peter Zijlstra wrote:
> On Tue, 2008-04-01 at 13:40 +0200, Andi Kleen wrote:
> > Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> writes:
> > 
> > > Using cpusets, we can now partition the system into multiple sched domains.
> > > Then, how about providing different characteristics for each domain?
> > 
> > Did you actually see much improvement in any relevant workload
> > from tweaking these parameters?  If yes what did you change?
> > And how much did it gain?
> > 
> > Ideally the kernel should perform well without much tweaking
> > out of the box, simply because most users won't tweak. Adding a 
> > lot of such parameters would imply giving up on good defaults which 
> > is not a good thing.
> 
> From what I understand, they need very aggressive idle balancing; much
> more so than what is normally healthy.
> 
> I can see how something like that can be useful when you have a lot of
> very short running tasks. These could pile up on a few cpus and leave
> others idle.

Could the scheduler auto tune itself to this situation?

e.g. when it sees a series of very high runqueue imbalances, increase
the frequency of the idle balancer?

-Andi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-01 13:29     ` Andi Kleen
@ 2008-04-01 13:38       ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2008-04-01 13:38 UTC (permalink / raw
  To: Andi Kleen; +Cc: Hidetoshi Seto, linux-kernel, Ingo Molnar, Paul Jackson

On Tue, 2008-04-01 at 15:29 +0200, Andi Kleen wrote:
> On Tue, Apr 01, 2008 at 01:56:08PM +0200, Peter Zijlstra wrote:
> > On Tue, 2008-04-01 at 13:40 +0200, Andi Kleen wrote:
> > > Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> writes:
> > > 
> > > > Using cpusets, we can now partition the system into multiple sched domains.
> > > > Then, how about providing different characteristics for each domain?
> > > 
> > > Did you actually see much improvement in any relevant workload
> > > from tweaking these parameters?  If yes what did you change?
> > > And how much did it gain?
> > > 
> > > Ideally the kernel should perform well without much tweaking
> > > out of the box, simply because most users won't tweak. Adding a 
> > > lot of such parameters would imply giving up on good defaults which 
> > > is not a good thing.
> > 
> > From what I understand, they need very aggressive idle balancing; much
> > more so than what is normally healthy.
> > 
> > I can see how something like that can be useful when you have a lot of
> > very short running tasks. These could pile up on a few cpus and leave
> > others idle.
> 
> Could the scheduler auto tune itself to this situation?
> 
> e.g. when it sees a series of very high runqueue imbalances, increase
> the frequency of the idle balancer?

It's not actually the idle balancer that's addressed here; besides,
that runs at 1/HZ, so no, we can't do it faster unless you tie it to
a hrtimer.

What it does do is look more aggressively for idle cpus on newidle and
fork.  Normally we only consider the socket for these lookups; they
want a wider view.

Auto-tune, perhaps, although I'm a bit skeptical of heuristics.  We'd
need data on the avg 'atom' length of the tasks, the idleness of
remote cpus, and so on.

The thing is, even then it depends on the data footprint of these tasks
and the cost/benefit for your application.

By migrating tasks more aggressively you penalize throughput but get
a better worst-case response time.

I'm just not sure we can make that decision for the user.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-01 11:55 ` Paul Jackson
  2008-04-01 11:59   ` Peter Zijlstra
@ 2008-04-02  8:39   ` Hidetoshi Seto
  2008-04-02 11:14     ` Paul Jackson
  1 sibling, 1 reply; 19+ messages in thread
From: Hidetoshi Seto @ 2008-04-02  8:39 UTC (permalink / raw
  To: Paul Jackson; +Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Andi Kleen

Paul Jackson wrote:
> Interesting ...

Thank you for saying that ;-)

> So, we have two flags here.  One flag "sched_wake_idle_far" that will
> cause the current task to search farther for an idle CPU when it wakes
> up another task that needs a CPU on which to run, and the other flag
> "sched_balance_newidle_far" that will cause a soon-to-idle CPU to search
> farther for a task it might pull over and run, instead of going idle.
> 
> I am tempted to ask if we should not elaborate this in one dimension,
> and simplify it in another dimension.
> 
> First the simplification side: do we need both flags?  Yes, they are
> two distinct cases in the code, but perhaps practical uses will always
> end up setting both flags the same way.  If that's the case, then we
> are just burdening the user of these flags with understanding a detail
> that didn't matter to them: did a waking task or an idle CPU provoke
> the search?  Do you have or know of a situation where you actually
> desire to enable one flag while disabling the other?

Yes, we need both flags.

At least in the case of hackbench (results attached at the bottom),
I couldn't find any positive effect from enabling "sched_wake_idle_far",
but "sched_balance_newidle_far" shows significant gains.

That doesn't mean "sched_wake_idle_far" is useless everywhere.
As Peter pointed out, when we have a lot of very short running tasks,
"sched_wake_idle_far" accelerates task propagation and improves
throughput.  There are definitely such situations (and in fact that's
where I am now).

Put simply, if the system tends to be idle, then the "push to idle"
strategy works well.  OTOH, if the system tends to be busy, then the
"pull by idle" strategy works well.  Otherwise, both strategies will
work, but above all there is a question: how much search cost can
you pay?

So, it is case by case, depending on the situation.

> For the elaboration side: your proposal has just two levels of
> distance, near and far.  Perhaps, as architectures become more
> elaborate and hierarchies deeper, we would want N levels of distance,
> and the ability to request such load balancing for all levels "n"
> for our choice of "n" <= N.
> 
> If we did both the above, then we might have a single per-cpuset file
> that took an integer value ... this "n".  If (n == 0), that might mean
> no such balancing at all.  If (n == 1), that might mean just the
> nearest balancing, for example, to the hyperthread within the same core,
> on some current Intel architectures.  If (n == 2), then that might mean,
> on the same architectures, that balancing could occur across cores
> within the same package.  If (n == 3) then that might mean, again on
> that architecture, that balancing could occur across packages on the
> same node board.  As architectures evolve over time, the exact details
> of what each value of "n" mean would evolve, but always higher "n"
> would enable balancing across a wider portion of the system.
> 
> Please understand I am just brain storming here.  I don't know that
> the alternatives I considered above are preferable or not to what
> your patch presents.

Now, we already have such levels in the sched domain, so if "n" is
given, I can choose:
   0: (none)
   1: cpu_domain - balance to hyperthreads in a core
   2: core_domain - balance to cores in a package
   3: phys_domain - balance to packages in a node
( 4: node_domain - balance to nodes in a chunk of nodes )
( 5: allnodes_domain - global balance )

It looks easy... but how do you handle it when cpusets overlap?

Thanks,
H.Seto

-----
(@ CPUx8 ((Dual-Core Itanium2 x 2 sockets) x 2 nodes), 8GB mem)

[root@HACKBENCH]# echo 0 > /dev/cpuset/sched_balance_newidle_far
[root@HACKBENCH]# echo 0 > /dev/cpuset/sched_wake_idle_far
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.956
Time: 4.008
Time: 5.918
Time: 8.269
Time: 10.216
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.918
Time: 3.964
Time: 5.732
Time: 8.013
Time: 10.028
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.925
Time: 3.824
Time: 5.893
Time: 7.975
Time: 10.373
[root@HACKBENCH]# echo 0 > /dev/cpuset/sched_balance_newidle_far
[root@HACKBENCH]# echo 1 > /dev/cpuset/sched_wake_idle_far
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 2.153
Time: 3.749
Time: 5.846
Time: 8.088
Time: 9.996
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.845
Time: 3.932
Time: 6.137
Time: 8.062
Time: 10.282
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.963
Time: 4.040
Time: 5.837
Time: 8.017
Time: 9.718
[root@HACKBENCH]# echo 1 > /dev/cpuset/sched_balance_newidle_far
[root@HACKBENCH]# echo 0 > /dev/cpuset/sched_wake_idle_far
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.725
Time: 3.412
Time: 5.275
Time: 7.441
Time: 8.974
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.674
Time: 3.334
Time: 5.374
Time: 7.204
Time: 8.903
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.689
Time: 3.281
Time: 5.002
Time: 7.245
Time: 9.039
[root@HACKBENCH]# echo 1 > /dev/cpuset/sched_balance_newidle_far
[root@HACKBENCH]# echo 1 > /dev/cpuset/sched_wake_idle_far
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.923
Time: 3.697
Time: 5.632
Time: 7.379
Time: 9.223
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.809
Time: 3.656
Time: 5.746
Time: 7.386
Time: 9.399
[root@HACKBENCH]# for i in `seq 50 50 250` ; do ./hackbench $i ; done
Time: 1.832
Time: 3.743
Time: 5.580
Time: 7.477
Time: 9.163


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-02  8:39   ` Hidetoshi Seto
@ 2008-04-02 11:14     ` Paul Jackson
  2008-04-03  3:21       ` Hidetoshi Seto
  0 siblings, 1 reply; 19+ messages in thread
From: Paul Jackson @ 2008-04-02 11:14 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: linux-kernel, mingo, peterz, andi

Hidetoshi wrote:
> Put simply, if the system tends to be idle, then the "push to idle"
> strategy works well.  OTOH, if the system tends to be busy, then the
> "pull by idle" strategy works well.  Otherwise, both strategies will
> work, but above all there is a question: how much search cost can
> you pay?

So each flag has value in some cases ... that much seems reasonable to me.

But you're saying that you'd like to avoid having to turn on both, just to
get the benefit of one of them, in order to avoid the searching costs of
the other flag that was not valuable on that load, right?

But is this necessarily so?  If "pull by idle" is attempted on a system
which tends to be idle, then while it is true that the search for something
to pull will usually find nothing, what does it matter that we wasted some
otherwise idle cycles, looking for pullable, runnable tasks that cannot be
found, on a system that is mostly idle?

If "push to idle" is attempted on a system that is quite busy, then
couldn't that be coded to notice rather quickly if any nearby CPUs are
idle, and not search if there are no idle neighbors.  One could imagine
a word of memory for each smaller domain ("neighborhood") of CPUs (say
all the logical CPUs in a package), with one bit per logical CPU, that
was set if-and-only-if that CPU was in idle.  Then it would be very
quick for all the CPUs in that domain to see if there are (or just
were ... close enough) any idle CPUs, and skip trying to "push to idle"
if that word was all zero bits.  That is, there would be no sense
trying to push to idle if there were no idle CPUs to push to.  The only
writing and the only locking of that word would be from idle loop code,
and only from nearby CPUs in the same small domain, so it would not be
an impediment to large system scaling or a waste of many CPU cycles on
busy systems.

With a little work such as this, we could make it so that anytime you
needed either flag, you could turn on both, and the other one would be
harmless enough ... just a minor consumer of otherwise idle cycles.

Then with that, we could have one flag, that did both.

> It looks easy... but how do you handle it when cpusets overlap?

Yeah - that part might be challenging.  Would it work to always take
the largest domain balancing requested?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-02 11:14     ` Paul Jackson
@ 2008-04-03  3:21       ` Hidetoshi Seto
  2008-04-03 10:46         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Hidetoshi Seto @ 2008-04-03  3:21 UTC (permalink / raw
  To: Paul Jackson; +Cc: linux-kernel, mingo, peterz, andi

Paul Jackson wrote:
> Hidetoshi wrote:
>> Put simply, if the system tends to be idle, then the "push to idle"
>> strategy works well.  OTOH, if the system tends to be busy, then the
>> "pull by idle" strategy works well.  Otherwise, both strategies will
>> work, but above all there is a question: how much search cost can
>> you pay?
> 
> So each flag has value in some cases ... that much seems reasonable to me.
> 
> But you're saying that you'd like to avoid having to turn on both, just to
> get the benefit of one of them, in order to avoid the searching costs of
> the other flag that was not valuable on that load, right?
> 
> But is this necessarily so?

I'd like to turn on both (since I know that is best for my
application/system), but it can't be denied that there are other
situations that favor only one of them...  At least there is a small
potential conflict:
   "Are you idle?" - "No, I'm too busy searching for a busy CPU!"

To be honest, I don't have a strong reason to keep them divided.
I just thought that they could work independently and that it might
be a useful interface for other people.
(... well, I would be a little happy if I didn't need to rewrite
  almost all of the additional piece of Documentation/cpuset.txt,
  but I don't care :-D)

So, if no one can find a use for the two flags, I'll change it to one.
Comments from any others?

> If "pull by idle" is attempted on a system
> which tends to be idle, then while it is true that the search for something
> to pull will usually find nothing, what does it matter that we wasted some
> otherwise idle cycles, looking for pullable, runnable tasks that cannot be
> found, on a system that is mostly idle?
> 
> If "push to idle" is attempted on a system that is quite busy, then
> couldn't that be coded to notice rather quickly if any nearby CPUs are
> idle, and not search if there are no idle neighbors.  One could imagine
> a word of memory for each smaller domain ("neighborhood") of CPUs (say
> all the logical CPUs in a package), with one bit per logical CPU, that
> was set if-and-only-if that CPU was in idle.  Then it would be very
> quick for all the CPUs in that domain to see if there are (or just
> were ... close enough) any idle CPUs, and skip trying to "push to idle"
> if that word was all zero bits.  That is, there would be no sense
> trying to push to idle if there were no idle CPUs to push to.  The only
> writing and the only locking of that word would be from idle loop code,
> and only from nearby CPUs in the same small domain, so it would not be
> an impediment to large system scaling or a waste of many CPU cycles on
> busy systems.
> 
> With a little work such as this, we could make it so that anytime you
> needed either flag, you could turn on both, and the other one would be
> harmless enough ... just a minor consumer of otherwise idle cycles.
> 
> Then with that, we could have one flag, that did both.

I believe there are some quite technical reasons why we have no
"idle_map".  Excellent answers would come from the scheduler folks...

>> It looks easy... but how do you handle it when cpusets overlap?
> 
> Yeah - that part might be challenging.  Would it work to always take
> the largest domain balancing requested?

Hmm... if one requests "smaller" and another is "don't care =
default", we always take the "default" range.

Anyway, I'd like to give a lot of care to well-defined cpusets, and
I know that balancing on overlapping cpusets is easily confused,
so I'll update my patch to take levels, incorporating your suggestion.

Thanks,
H.Seto

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-03  3:21       ` Hidetoshi Seto
@ 2008-04-03 10:46         ` Peter Zijlstra
  2008-04-03 12:56         ` Paul Jackson
  2008-04-03 13:14         ` Paul Jackson
  2 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2008-04-03 10:46 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: Paul Jackson, linux-kernel, mingo, andi

On Thu, 2008-04-03 at 12:21 +0900, Hidetoshi Seto wrote:

> I believe there are some quite technical reasons why we have no
> "idle_map".  Excellent answers would come from the scheduler folks...

Can you say: 'cacheline bouncing fest'? :-)




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-03  3:21       ` Hidetoshi Seto
  2008-04-03 10:46         ` Peter Zijlstra
@ 2008-04-03 12:56         ` Paul Jackson
  2008-04-03 13:14         ` Paul Jackson
  2 siblings, 0 replies; 19+ messages in thread
From: Paul Jackson @ 2008-04-03 12:56 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: linux-kernel, mingo, peterz, andi

H.Seto wrote:
> So, if no one can find a use for the two flags, I'll change it to one.
> Comments from any others?

I too don't have a strong preference either way ... just a bias toward
keeping the exposed per-cpuset flags as simple and generic as practical.

Exposing internal details that don't need to be exposed has two downsides:
 1) it makes using the flags slightly more difficult, as the user has to
    figure out more details, and
 2) it exposes some internal details that we didn't need to expose, thereby
    possibly making future changes more difficult.  We can change internal
    hidden details anytime we want, but exposed interfaces are locked in
    with a high barrier to incompatible change.

I too would welcome comments from others.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] Customize sched domain via cpuset
  2008-04-03  3:21       ` Hidetoshi Seto
  2008-04-03 10:46         ` Peter Zijlstra
  2008-04-03 12:56         ` Paul Jackson
@ 2008-04-03 13:14         ` Paul Jackson
  2 siblings, 0 replies; 19+ messages in thread
From: Paul Jackson @ 2008-04-03 13:14 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: linux-kernel, mingo, peterz, andi

H.Seto wrote:
> I believe there are some quite technical reasons why we have no "idle_map".

I would not advocate a single system wide cpumask of idle CPUs.  As
Peter Zijlstra notes in a follow up post, that's too hot a cache line
and clearly doesn't scale.

But I would think it would be ok to have a separate cpumask per node
that marks just the node-local CPUs.  We have other per-node data
already.  If we only support this optional load balancing level across
the other CPUs on the same node (or smaller domains, such as the cores
in a package), that should work, shouldn't it?

> so I'll update my patch to take levels, incorporating your suggestion.

If you see a good solution here that you can provide, good.  But if my
brain storming ideas have problems, don't hesitate to object to them.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/2] Customize sched domain via cpuset (v2)
  2008-04-01 11:26 [PATCH 1/2] Customize sched domain via cpuset Hidetoshi Seto
                   ` (2 preceding siblings ...)
  2008-04-01 11:55 ` Paul Jackson
@ 2008-04-04  9:10 ` Hidetoshi Seto
  2008-04-04  9:11 ` [PATCH 2/2] " Hidetoshi Seto
  4 siblings, 0 replies; 19+ messages in thread
From: Hidetoshi Seto @ 2008-04-04  9:10 UTC (permalink / raw
  To: linux-kernel; +Cc: Paul Jackson, Peter Zijlstra, Ingo Molnar, Andi Kleen

Here comes v2!

This patch introduces a new cpuset feature - sched domain customization.

This version provides a per-cpuset file 'sched_relax_domain_level'
that enables us to change the search range of the scheduler, which
limits how many CPUs the scheduler searches at certain schedule
events, such as waking a task or running out of tasks in a runqueue.

Thanks,
H.Seto

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 Documentation/cpusets.txt |   72 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 70 insertions(+), 2 deletions(-)

Index: GIT-torvalds/Documentation/cpusets.txt
===================================================================
--- GIT-torvalds.orig/Documentation/cpusets.txt
+++ GIT-torvalds/Documentation/cpusets.txt
@@ -8,6 +8,7 @@ Portions Copyright (c) 2004-2006 Silicon
 Modified by Paul Jackson <pj@sgi.com>
 Modified by Christoph Lameter <clameter@sgi.com>
 Modified by Paul Menage <menage@google.com>
+Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

 CONTENTS:
 =========
@@ -20,7 +21,8 @@ CONTENTS:
   1.5 What is memory_pressure ?
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
-  1.8 How do I use cpusets ?
+  1.8 What is sched_relax_domain_level ?
+  1.9 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -497,7 +499,73 @@ the cpuset code to update these sched do
 partition requested with the current, and updates its sched domains,
 removing the old and adding the new, for each change.

-1.8 How do I use cpusets ?
+
+1.8 What is sched_relax_domain_level ?
+--------------------------------------
+
+Within a sched domain, the scheduler migrates tasks in two ways:
+periodic load balance on the tick, and at the time of certain
+schedule events.
+
+When a task is woken up, the scheduler tries to place it on an idle
+CPU.  For example, if task A running on CPU X activates another task
+B on the same CPU X, and if CPU Y is X's sibling and is idle, then
+the scheduler migrates task B to CPU Y so that task B can start on
+CPU Y without waiting for task A on CPU X.
+
+And when a CPU runs out of tasks in its runqueue, it tries to pull
+extra tasks from other, busy CPUs to help them before going idle.
+
+Of course it takes some search cost to find movable tasks and/or
+idle CPUs, so the scheduler might not search all CPUs in the domain
+every time.  In fact, on some architectures, the search range for
+these events is limited to the socket or node where the CPU is
+located, while the load balance on the tick searches all of them.
+
+For example, assume CPU Z is relatively far from CPU X.  Even if
+CPU Z is idle while CPU X and its siblings are busy, the scheduler
+can't migrate the woken task B from X to Z since it is out of the
+search range.  As a result, task B on CPU X has to wait for task A,
+or for the load balance on the next tick.  For some applications in
+special situations, waiting one tick may be too long.
+
+The 'sched_relax_domain_level' file allows you to request a change
+of this search range as you like.  This file takes an int value
+which, ideally, indicates the size of the search range in levels as
+follows; otherwise it holds the initial value -1, which indicates
+that the cpuset has no request.
+
+  -1  : no request.  use system default or follow the request of others.
+   0  : no search.
+   1  : search siblings (hyperthreads in a core).
+   2  : search cores in a package.
+   3  : search cpus in a node [= system wide on a non-NUMA system].
+ ( 4  : search nodes in a chunk of nodes [on a NUMA system]. )
+ ( 5~ : search system wide [on a NUMA system]. )
+
+This file is per-cpuset and affects the sched domain to which the
+cpuset belongs.  Therefore, if the 'sched_load_balance' flag of a
+cpuset is disabled, 'sched_relax_domain_level' has no effect since
+no sched domain belongs to the cpuset.
+
+If multiple cpusets overlap and hence form a single sched domain,
+the largest value among them is used.  Be careful: if one requests 0
+and the others are -1, then 0 is used.
+
+Note that modifying this file will have both good and bad effects,
+and whether they are acceptable or not depends on your situation.
+Don't modify this file if you are not sure of the effect.
+
+If, in your situation:
+ - the migration cost between CPUs can be assumed to be considerably
+   small (for you), due to your application's behavior or special
+   hardware support for CPU caches etc.,
+ - the search cost has no impact (for you), or you can keep it small
+   enough, e.g. by keeping your cpusets compact, and
+ - low latency is required even if it sacrifices cache hit rate etc.,
+then increasing 'sched_relax_domain_level' would benefit you.
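+
+As a sketch of usage (assuming the cpuset filesystem is mounted at
+/dev/cpuset as in section 2.1; the cpuset name 'myset' is used here
+purely for illustration):
+
+  # echo 2 > /dev/cpuset/myset/sched_relax_domain_level
+
+requests that, at the schedule events described above, the search
+cover all CPUs up to the cores-in-a-package level within the sched
+domain to which 'myset' belongs.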
+
+
+1.9 How do I use cpusets ?
 --------------------------

 In order to minimize the impact of cpusets on critical kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 2/2] Customize sched domain via cpuset (v2)
  2008-04-01 11:26 [PATCH 1/2] Customize sched domain via cpuset Hidetoshi Seto
                   ` (3 preceding siblings ...)
  2008-04-04  9:10 ` [PATCH 1/2] Customize sched domain via cpuset (v2) Hidetoshi Seto
@ 2008-04-04  9:11 ` Hidetoshi Seto
  2008-04-10 14:53   ` Peter Zijlstra
  4 siblings, 1 reply; 19+ messages in thread
From: Hidetoshi Seto @ 2008-04-04  9:11 UTC (permalink / raw
  To: linux-kernel; +Cc: Paul Jackson, Peter Zijlstra, Ingo Molnar, Andi Kleen

The implementation has some updates...

>  - Add 2 new cpuset files:
>      sched_wake_idle_far
>      sched_balance_newidle_far
    -> Merged into 1 file, having levels:
         sched_relax_domain_level

>  - Modify partition_sched_domains() and build_sched_domains()
>    to take flags parameter passed from cpuset.
    -> Changed to "attributes" rather than "flags."

>  - Fill newidle_idx for node domains which currently unused but
>    might be required for sched_balance_newidle_far.

   + We can change the "default" level by boot option 'relax_domain_level='.
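
As a sketch, booting with the kernel parameter:

    relax_domain_level=2

makes level 2 (cores in a package) the default for sched domains whose
cpusets make no request of their own, i.e. leave 'sched_relax_domain_level'
at -1.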

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

---
 include/asm-ia64/topology.h |    2 -
 include/asm-sh/topology.h   |    2 -
 include/asm-x86/topology.h  |    2 -
 include/linux/sched.h       |   23 +++++++++++-
 kernel/cpuset.c             |   61 ++++++++++++++++++++++++++++++++
 kernel/sched.c              |   82 +++++++++++++++++++++++++++++++++++++++++---
 kernel/sched_fair.c         |    4 +-
 7 files changed, 165 insertions(+), 11 deletions(-)

Index: GIT-torvalds/include/linux/sched.h
===================================================================
--- GIT-torvalds.orig/include/linux/sched.h
+++ GIT-torvalds/include/linux/sched.h
@@ -704,6 +704,7 @@ enum cpu_idle_type {
 #define SD_POWERSAVINGS_BALANCE	256	/* Balance for power savings */
 #define SD_SHARE_PKG_RESOURCES	512	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		1024	/* Only a single load balancing instance */
+#define SD_WAKE_IDLE_FAR	2048	/* Gain latency sacrificing cache hit */

 #define BALANCE_FOR_MC_POWER	\
 	(sched_smt_power_savings ? SD_POWERSAVINGS_BALANCE : 0)
@@ -733,6 +734,24 @@ struct sched_group {
 	u32 reciprocal_cpu_power;
 };

+enum sched_domain_level {
+	SD_LV_NONE = 0,
+	SD_LV_SIBLING,
+	SD_LV_MC,
+	SD_LV_CPU,
+	SD_LV_NODE,
+	SD_LV_ALLNODES,
+	SD_LV_MAX
+};
+
+struct sched_domain_attr {
+	int relax_domain_level;
+};
+
+#define SD_ATTR_INIT	(struct sched_domain_attr) {	\
+	.relax_domain_level = -1,			\
+}
+
 struct sched_domain {
 	/* These fields must be setup */
 	struct sched_domain *parent;	/* top domain must be null terminated */
@@ -750,6 +769,7 @@ struct sched_domain {
 	unsigned int wake_idx;
 	unsigned int forkexec_idx;
 	int flags;			/* See SD_* */
+	enum sched_domain_level level;

 	/* Runtime fields. */
 	unsigned long last_balance;	/* init to jiffies. units in jiffies */
@@ -789,7 +809,8 @@ struct sched_domain {
 #endif
 };

-extern void partition_sched_domains(int ndoms_new, cpumask_t *doms_new);
+extern void partition_sched_domains(int ndoms_new, cpumask_t *doms_new,
+				    struct sched_domain_attr *dattr_new);
 extern int arch_reinit_sched_domains(void);

 #endif	/* CONFIG_SMP */
Index: GIT-torvalds/kernel/sched.c
===================================================================
--- GIT-torvalds.orig/kernel/sched.c
+++ GIT-torvalds/kernel/sched.c
@@ -6582,11 +6582,42 @@ static void init_sched_groups_power(int
 	} while (group != child->groups);
 }

+static int default_relax_domain_level = -1;
+
+static int __init setup_relax_domain_level(char *str)
+{
+	default_relax_domain_level = simple_strtoul(str, NULL, 0);
+	return 1;
+}
+__setup("relax_domain_level=", setup_relax_domain_level);
+
+static void set_domain_attribute(struct sched_domain *sd,
+				 struct sched_domain_attr *attr)
+{
+	int request;
+
+	if (!attr || attr->relax_domain_level < 0) {
+		if (default_relax_domain_level < 0)
+			return;
+		else
+			request = default_relax_domain_level;
+	} else
+		request = attr->relax_domain_level;
+	if (request < sd->level) {
+		/* turn off idle balance on this domain */
+		sd->flags &= ~(SD_WAKE_IDLE|SD_BALANCE_NEWIDLE);
+	} else {
+		/* turn on idle balance on this domain */
+		sd->flags |= (SD_WAKE_IDLE_FAR|SD_BALANCE_NEWIDLE);
+	}
+}
+
 /*
  * Build sched domains for a given set of cpus and attach the sched domains
  * to the individual cpus
  */
-static int build_sched_domains(const cpumask_t *cpu_map)
+static int __build_sched_domains(const cpumask_t *cpu_map,
+				 struct sched_domain_attr *attr)
 {
 	int i;
 	struct root_domain *rd;
@@ -6626,7 +6657,9 @@ static int build_sched_domains(const cpu
 				SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
 			sd = &per_cpu(allnodes_domains, i);
 			*sd = SD_ALLNODES_INIT;
+			sd->level = SD_LV_ALLNODES;
 			sd->span = *cpu_map;
+			set_domain_attribute(sd, attr);
 			cpu_to_allnodes_group(i, cpu_map, &sd->groups);
 			p = sd;
 			sd_allnodes = 1;
@@ -6635,7 +6668,9 @@ static int build_sched_domains(const cpu

 		sd = &per_cpu(node_domains, i);
 		*sd = SD_NODE_INIT;
+		sd->level = SD_LV_NODE;
 		sd->span = sched_domain_node_span(cpu_to_node(i));
+		set_domain_attribute(sd, attr);
 		sd->parent = p;
 		if (p)
 			p->child = sd;
@@ -6645,7 +6680,9 @@ static int build_sched_domains(const cpu
 		p = sd;
 		sd = &per_cpu(phys_domains, i);
 		*sd = SD_CPU_INIT;
+		sd->level = SD_LV_CPU;
 		sd->span = nodemask;
+		set_domain_attribute(sd, attr);
 		sd->parent = p;
 		if (p)
 			p->child = sd;
@@ -6655,8 +6692,10 @@ static int build_sched_domains(const cpu
 		p = sd;
 		sd = &per_cpu(core_domains, i);
 		*sd = SD_MC_INIT;
+		sd->level = SD_LV_MC;
 		sd->span = cpu_coregroup_map(i);
 		cpus_and(sd->span, sd->span, *cpu_map);
+		set_domain_attribute(sd, attr);
 		sd->parent = p;
 		p->child = sd;
 		cpu_to_core_group(i, cpu_map, &sd->groups);
@@ -6666,8 +6705,10 @@ static int build_sched_domains(const cpu
 		p = sd;
 		sd = &per_cpu(cpu_domains, i);
 		*sd = SD_SIBLING_INIT;
+		sd->level = SD_LV_SIBLING;
 		sd->span = per_cpu(cpu_sibling_map, i);
 		cpus_and(sd->span, sd->span, *cpu_map);
+		set_domain_attribute(sd, attr);
 		sd->parent = p;
 		p->child = sd;
 		cpu_to_cpu_group(i, cpu_map, &sd->groups);
@@ -6840,8 +6881,15 @@ error:
 #endif
 }

+static int build_sched_domains(const cpumask_t *cpu_map)
+{
+	return __build_sched_domains(cpu_map, NULL);
+}
+
 static cpumask_t *doms_cur;	/* current sched domains */
 static int ndoms_cur;		/* number of sched domains in 'doms_cur' */
+static struct sched_domain_attr *dattr_cur;	/* attributes of custom domains
+						   in 'doms_cur' */

 /*
  * Special case: If a kmalloc of a doms_cur partition (array of
@@ -6868,6 +6916,7 @@ static int arch_init_sched_domains(const
 	doms_cur = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
 	if (!doms_cur)
 		doms_cur = &fallback_doms;
+	dattr_cur = NULL;
 	cpus_andnot(*doms_cur, *cpu_map, cpu_isolated_map);
 	err = build_sched_domains(doms_cur);
 	register_sched_domain_sysctl();
@@ -6896,6 +6945,22 @@ static void detach_destroy_domains(const
 	arch_destroy_sched_domains(cpu_map);
 }

+/* handle null as "default" */
+static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
+			struct sched_domain_attr *new, int idx_new)
+{
+	struct sched_domain_attr tmp;
+
+	/* fast path */
+	if (!new && !cur)
+		return 1;
+
+	tmp = SD_ATTR_INIT;
+	return !memcmp(cur ? (cur + idx_cur) : &tmp,
+			new ? (new + idx_new) : &tmp,
+			sizeof(struct sched_domain_attr));
+}
+
 /*
  * Partition sched domains as specified by the 'ndoms_new'
  * cpumasks in the array doms_new[] of cpumasks. This compares
@@ -6917,7 +6982,8 @@ static void detach_destroy_domains(const
  *
  * Call with hotplug lock held
  */
-void partition_sched_domains(int ndoms_new, cpumask_t *doms_new)
+void partition_sched_domains(int ndoms_new, cpumask_t *doms_new,
+			     struct sched_domain_attr *dattr_new)
 {
 	int i, j;

@@ -6929,13 +6995,15 @@ void partition_sched_domains(int ndoms_n
 	if (doms_new == NULL) {
 		ndoms_new = 1;
 		doms_new = &fallback_doms;
+		dattr_new = NULL;
 		cpus_andnot(doms_new[0], cpu_online_map, cpu_isolated_map);
 	}

 	/* Destroy deleted domains */
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < ndoms_new; j++) {
-			if (cpus_equal(doms_cur[i], doms_new[j]))
+			if (cpus_equal(doms_cur[i], doms_new[j])
+			    && dattrs_equal(dattr_cur, i, dattr_new, j))
 				goto match1;
 		}
 		/* no match - a current sched domain not in new doms_new[] */
@@ -6947,11 +7015,13 @@ match1:
 	/* Build new domains */
 	for (i = 0; i < ndoms_new; i++) {
 		for (j = 0; j < ndoms_cur; j++) {
-			if (cpus_equal(doms_new[i], doms_cur[j]))
+			if (cpus_equal(doms_new[i], doms_cur[j])
+			    && dattrs_equal(dattr_new, i, dattr_cur, j))
 				goto match2;
 		}
 		/* no match - add a new doms_new */
-		build_sched_domains(doms_new + i);
+		__build_sched_domains(doms_new + i,
+					dattr_new ? dattr_new + i : NULL);
 match2:
 		;
 	}
@@ -6959,7 +7029,9 @@ match2:
 	/* Remember the new sched domains */
 	if (doms_cur != &fallback_doms)
 		kfree(doms_cur);
+	kfree(dattr_cur);	/* kfree(NULL) is safe */
 	doms_cur = doms_new;
+	dattr_cur = dattr_new;
 	ndoms_cur = ndoms_new;

 	register_sched_domain_sysctl();
Index: GIT-torvalds/kernel/sched_fair.c
===================================================================
--- GIT-torvalds.orig/kernel/sched_fair.c
+++ GIT-torvalds/kernel/sched_fair.c
@@ -957,7 +957,9 @@ static int wake_idle(int cpu, struct tas
 		return cpu;

 	for_each_domain(cpu, sd) {
-		if (sd->flags & SD_WAKE_IDLE) {
+		if ((sd->flags & SD_WAKE_IDLE)
+		    || ((sd->flags & SD_WAKE_IDLE_FAR)
+			&& !task_hot(p, task_rq(p)->clock, sd))) {
 			cpus_and(tmp, sd->span, p->cpus_allowed);
 			for_each_cpu_mask(i, tmp) {
 				if (idle_cpu(i)) {
Index: GIT-torvalds/kernel/cpuset.c
===================================================================
--- GIT-torvalds.orig/kernel/cpuset.c
+++ GIT-torvalds/kernel/cpuset.c
@@ -98,6 +98,9 @@ struct cpuset {
 	/* partition number for rebuild_sched_domains() */
 	int pn;

+	/* for custom sched domain */
+	int relax_domain_level;
+
 	/* used for walking a cpuset heirarchy */
 	struct list_head stack_list;
 };
@@ -478,6 +481,16 @@ static int cpusets_overlap(struct cpuset
 	return cpus_intersects(a->cpus_allowed, b->cpus_allowed);
 }

+static void
+update_domain_attr(struct sched_domain_attr *dattr, struct cpuset *c)
+{
+	if (!dattr)
+		return;
+	if (dattr->relax_domain_level < c->relax_domain_level)
+		dattr->relax_domain_level = c->relax_domain_level;
+	return;
+}
+
 /*
  * rebuild_sched_domains()
  *
@@ -553,12 +566,14 @@ static void rebuild_sched_domains(void)
 	int csn;		/* how many cpuset ptrs in csa so far */
 	int i, j, k;		/* indices for partition finding loops */
 	cpumask_t *doms;	/* resulting partition; i.e. sched domains */
+	struct sched_domain_attr *dattr;  /* attributes for custom domains */
 	int ndoms;		/* number of sched domains in result */
 	int nslot;		/* next empty doms[] cpumask_t slot */

 	q = NULL;
 	csa = NULL;
 	doms = NULL;
+	dattr = NULL;

 	/* Special case for the 99% of systems with one, full, sched domain */
 	if (is_sched_load_balance(&top_cpuset)) {
@@ -566,6 +581,11 @@ static void rebuild_sched_domains(void)
 		doms = kmalloc(sizeof(cpumask_t), GFP_KERNEL);
 		if (!doms)
 			goto rebuild;
+		dattr = kmalloc(sizeof(struct sched_domain_attr), GFP_KERNEL);
+		if (dattr) {
+			*dattr = SD_ATTR_INIT;
+			update_domain_attr(dattr, &top_cpuset);
+		}
 		*doms = top_cpuset.cpus_allowed;
 		goto rebuild;
 	}
@@ -622,6 +642,7 @@ restart:
 	doms = kmalloc(ndoms * sizeof(cpumask_t), GFP_KERNEL);
 	if (!doms)
 		goto rebuild;
+	dattr = kmalloc(ndoms * sizeof(struct sched_domain_attr), GFP_KERNEL);

 	for (nslot = 0, i = 0; i < csn; i++) {
 		struct cpuset *a = csa[i];
@@ -644,12 +665,15 @@ restart:
 			}

 			cpus_clear(*dp);
+			if (dattr)
+				*(dattr + nslot) = SD_ATTR_INIT;
 			for (j = i; j < csn; j++) {
 				struct cpuset *b = csa[j];

 				if (apn == b->pn) {
 					cpus_or(*dp, *dp, b->cpus_allowed);
 					b->pn = -1;
+					update_domain_attr(dattr, b);
 				}
 			}
 			nslot++;
@@ -660,7 +684,7 @@ restart:
 rebuild:
 	/* Have scheduler rebuild sched domains */
 	get_online_cpus();
-	partition_sched_domains(ndoms, doms);
+	partition_sched_domains(ndoms, doms, dattr);
 	put_online_cpus();

 done:
@@ -668,6 +692,7 @@ done:
 		kfifo_free(q);
 	kfree(csa);
 	/* Don't kfree(doms) -- partition_sched_domains() does that. */
+	/* Don't kfree(dattr) -- partition_sched_domains() does that. */
 }

 static inline int started_after_time(struct task_struct *t1,
@@ -1011,6 +1036,21 @@ static int update_memory_pressure_enable
 	return 0;
 }

+static int update_relax_domain_level(struct cpuset *cs, char *buf)
+{
+	int val = simple_strtol(buf, NULL, 10);
+
+	if (val < 0)
+		val = -1;
+
+	if (val != cs->relax_domain_level) {
+		cs->relax_domain_level = val;
+		rebuild_sched_domains();
+	}
+
+	return 0;
+}
+
 /*
  * update_flag - read a 0 or a 1 in a file and update associated flag
  * bit:	the bit to update (CS_CPU_EXCLUSIVE, CS_MEM_EXCLUSIVE,
@@ -1202,6 +1242,7 @@ typedef enum {
 	FILE_CPU_EXCLUSIVE,
 	FILE_MEM_EXCLUSIVE,
 	FILE_SCHED_LOAD_BALANCE,
+	FILE_SCHED_RELAX_DOMAIN_LEVEL,
 	FILE_MEMORY_PRESSURE_ENABLED,
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
@@ -1256,6 +1297,9 @@ static ssize_t cpuset_common_file_write(
 	case FILE_SCHED_LOAD_BALANCE:
 		retval = update_flag(CS_SCHED_LOAD_BALANCE, cs, buffer);
 		break;
+	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
+		retval = update_relax_domain_level(cs, buffer);
+		break;
 	case FILE_MEMORY_MIGRATE:
 		retval = update_flag(CS_MEMORY_MIGRATE, cs, buffer);
 		break;
@@ -1354,6 +1398,9 @@ static ssize_t cpuset_common_file_read(s
 	case FILE_SCHED_LOAD_BALANCE:
 		*s++ = is_sched_load_balance(cs) ? '1' : '0';
 		break;
+	case FILE_SCHED_RELAX_DOMAIN_LEVEL:
+		s += sprintf(s, "%d", cs->relax_domain_level);
+		break;
 	case FILE_MEMORY_MIGRATE:
 		*s++ = is_memory_migrate(cs) ? '1' : '0';
 		break;
@@ -1424,6 +1471,13 @@ static struct cftype cft_sched_load_bala
 	.private = FILE_SCHED_LOAD_BALANCE,
 };

+static struct cftype cft_sched_relax_domain_level = {
+	.name = "sched_relax_domain_level",
+	.read = cpuset_common_file_read,
+	.write = cpuset_common_file_write,
+	.private = FILE_SCHED_RELAX_DOMAIN_LEVEL,
+};
+
 static struct cftype cft_memory_migrate = {
 	.name = "memory_migrate",
 	.read = cpuset_common_file_read,
@@ -1475,6 +1529,9 @@ static int cpuset_populate(struct cgroup
 		return err;
 	if ((err = cgroup_add_file(cont, ss, &cft_sched_load_balance)) < 0)
 		return err;
+	if ((err = cgroup_add_file(cont, ss,
+					&cft_sched_relax_domain_level)) < 0)
+		return err;
 	if ((err = cgroup_add_file(cont, ss, &cft_memory_pressure)) < 0)
 		return err;
 	if ((err = cgroup_add_file(cont, ss, &cft_spread_page)) < 0)
@@ -1559,6 +1616,7 @@ static struct cgroup_subsys_state *cpuse
 	cs->mems_allowed = NODE_MASK_NONE;
 	cs->mems_generation = cpuset_mems_generation++;
 	fmeter_init(&cs->fmeter);
+	cs->relax_domain_level = -1;

 	cs->parent = parent;
 	number_of_cpusets++;
@@ -1631,6 +1689,7 @@ int __init cpuset_init(void)
 	fmeter_init(&top_cpuset.fmeter);
 	top_cpuset.mems_generation = cpuset_mems_generation++;
 	set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
+	top_cpuset.relax_domain_level = -1;

 	err = register_filesystem(&cpuset_fs_type);
 	if (err < 0)
Index: GIT-torvalds/include/asm-ia64/topology.h
===================================================================
--- GIT-torvalds.orig/include/asm-ia64/topology.h
+++ GIT-torvalds/include/asm-ia64/topology.h
@@ -93,7 +93,7 @@ void build_cpu_to_node_map(void);
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 2,			\
-	.newidle_idx		= 0, /* unused */	\
+	.newidle_idx		= 2,			\
 	.wake_idx		= 1,			\
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
Index: GIT-torvalds/include/asm-sh/topology.h
===================================================================
--- GIT-torvalds.orig/include/asm-sh/topology.h
+++ GIT-torvalds/include/asm-sh/topology.h
@@ -16,7 +16,7 @@
 	.cache_nice_tries	= 2,			\
 	.busy_idx		= 3,			\
 	.idle_idx		= 2,			\
-	.newidle_idx		= 0,			\
+	.newidle_idx		= 2,			\
 	.wake_idx		= 1,			\
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
Index: GIT-torvalds/include/asm-x86/topology.h
===================================================================
--- GIT-torvalds.orig/include/asm-x86/topology.h
+++ GIT-torvalds/include/asm-x86/topology.h
@@ -129,7 +129,7 @@ extern unsigned long node_remap_size[];

 # define SD_CACHE_NICE_TRIES	2
 # define SD_IDLE_IDX		2
-# define SD_NEWIDLE_IDX		0
+# define SD_NEWIDLE_IDX		2
 # define SD_FORKEXEC_IDX	1

 #endif
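
For illustration, a minimal usage sketch of the new per-cpuset file.
The mount point and cpuset name below are hypothetical, and the level
values follow the sched_domain_level enum introduced by this patch
(1 = sibling, 2 = multi-core, 3 = physical package, 4 = node,
5 = all nodes; -1 = follow the system default):

  # mount -t cgroup -o cpuset cpuset /dev/cpuset
  # mkdir /dev/cpuset/myservice
  # echo 4 > /dev/cpuset/myservice/sched_relax_domain_level
  # cat /dev/cpuset/myservice/sched_relax_domain_level
  4
  # echo -1 > /dev/cpuset/myservice/sched_relax_domain_level

A write of level N sets SD_WAKE_IDLE_FAR and SD_BALANCE_NEWIDLE on
every domain whose level is <= N, and clears SD_WAKE_IDLE and
SD_BALANCE_NEWIDLE on the domains above it; any negative value is
stored as -1.  A write that differs from the current value triggers
rebuild_sched_domains().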



* Re: [PATCH 2/2] Customize sched domain via cpuset (v2)
  2008-04-04  9:11 ` [PATCH 2/2] " Hidetoshi Seto
@ 2008-04-10 14:53   ` Peter Zijlstra
  2008-04-14  1:45     ` Hidetoshi Seto
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2008-04-10 14:53 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: linux-kernel, Paul Jackson, Ingo Molnar, Andi Kleen

On Fri, 2008-04-04 at 18:11 +0900, Hidetoshi Seto wrote:
> The implementation has some updates...
> 
> >  - Add 2 new cpuset files:
> >      sched_wake_idle_far
> >      sched_balance_newidle_far
>     -> Merged into a single file with multiple levels:
>          sched_relax_domain_level
> 
> >  - Modify partition_sched_domains() and build_sched_domains()
> >    to take a flags parameter passed from cpuset.
>     -> Changed to "attributes" rather than "flags."
> 
> >  - Fill newidle_idx for node domains, which is currently unused but
> >    might be required for sched_balance_newidle_far.
> 
>    + The "default" level can now be changed by the boot option
>      'relax_domain_level='.
> 
> Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

This seems like a sufficiently flexible interface. Paul, have you got
any outstanding objections?

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
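
As the quoted changelog notes, the default used when a cpuset requests
-1 can also be set at boot.  A sketch of the kernel command line (the
value shown is only an example):

  relax_domain_level=4

setup_relax_domain_level() parses the value with simple_strtoul(), so
only values >= 0 are meaningful here; when the option is absent the
default stays -1 and domains keep their per-arch flags untouched.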



* Re: [PATCH 2/2] Customize sched domain via cpuset (v2)
  2008-04-10 14:53   ` Peter Zijlstra
@ 2008-04-14  1:45     ` Hidetoshi Seto
  2008-04-14 15:38       ` Paul Jackson
  0 siblings, 1 reply; 19+ messages in thread
From: Hidetoshi Seto @ 2008-04-14  1:45 UTC (permalink / raw
  To: Paul Jackson; +Cc: Peter Zijlstra, linux-kernel, Ingo Molnar, Andi Kleen

Peter Zijlstra wrote:
> This seems like a sufficiently flexible interface. Paul, have you got
> any outstanding objections?
> 
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

Paul, could you give me your response?
I think it would be better to push these patches into Ingo's scheduler
tree if you can also add your Acked-by to them.

Thanks,
H.Seto


* Re: [PATCH 2/2] Customize sched domain via cpuset (v2)
  2008-04-14  1:45     ` Hidetoshi Seto
@ 2008-04-14 15:38       ` Paul Jackson
  0 siblings, 0 replies; 19+ messages in thread
From: Paul Jackson @ 2008-04-14 15:38 UTC (permalink / raw
  To: Hidetoshi Seto; +Cc: peterz, linux-kernel, mingo, andi

H.Seto wrote:
> Paul, could you give me your response?

Ah - sorry.  I was tied up last week and failed to review this
patch in a timely manner.

Yes - this looks good.  Thanks.

Acked-by: Paul Jackson <pj@sgi.com>

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214

