* [CFS Bandwidth Control v4 0/7] Introduction
@ 2011-02-16  3:18 Paul Turner
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
                   ` (8 more replies)
  0 siblings, 9 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-16  3:18 UTC
  To: linux-kernel
  Cc: Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

Hi all,

Please find attached v4 of CFS bandwidth control; while this rebase against
some of the latest SCHED_NORMAL code is new, the features and methodology are
fairly mature at this point and have proved both effective and stable for
several workloads.

As always, all comments/feedback welcome.

Changes since v3:
- Rebased to current tip, updated to work with the new group scheduling
  accounting
- (Bug fix) Fixed a race with unthrottling (due to changing the global limit)
- (Bug fix) Fixed buddy interactions -- in particular, prevent buddy
  nominations from re-picking throttled entities

The skeleton of our approach is as follows:
- We maintain a global (per-tg) pool of unassigned quota.  Within it we
  track the bandwidth period, quota per period, and runtime remaining in
  the current period.  As bandwidth is used within a period it is decremented
  from runtime.  Runtime is currently synchronized using a spinlock; in the
  current implementation there's no reason this couldn't be done using
  atomic ops instead, however the spinlock allows for a little more
  flexibility in experimentation with other schemes.
- When a cfs_rq participating in a bandwidth-constrained task_group executes,
  it acquires time in sysctl_sched_cfs_bandwidth_slice (default currently
  10ms) sized chunks from the global pool; this synchronizes under rq->lock
  and is part of the update_curr path.
- Throttled entities are dequeued; we protect against their re-introduction
  to the scheduling hierarchy by checking a per-cfs_rq throttled bit.  (A
  simplified sketch of this accounting follows below.)
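
As a quick illustration of this flow, here is a minimal user-space model of
the global/local split.  The names below are invented for the example; the
corresponding kernel pieces are account_cfs_rq_quota() and
tg_request_cfs_quota(), introduced in patch 2 of this series:

/* Illustrative model only -- not kernel code. */
#include <stdio.h>

#define SLICE_NS	(10ULL * 1000 * 1000)		/* 10ms slice */

struct global_pool {
	unsigned long long quota;	/* quota per period */
	unsigned long long runtime;	/* unassigned runtime remaining */
};

struct local_pool {
	unsigned long long assigned;	/* runtime granted to this cfs_rq */
	unsigned long long used;	/* runtime consumed by this cfs_rq */
	int throttled;
};

/* grab up to one slice from the global pool */
static unsigned long long request_slice(struct global_pool *g)
{
	unsigned long long delta = g->runtime < SLICE_NS ? g->runtime : SLICE_NS;

	g->runtime -= delta;
	return delta;
}

/* charge execution time against the local pool, refilling as needed */
static void local_charge(struct global_pool *g, struct local_pool *l,
			 unsigned long long delta_exec)
{
	l->used += delta_exec;
	if (l->used < l->assigned)
		return;

	l->assigned += request_slice(g);
	if (l->used >= l->assigned)
		l->throttled = 1;	/* dequeued until the next refresh */
}

int main(void)
{
	struct global_pool g = { .quota = 50ULL * 1000 * 1000 };  /* 50ms */
	struct local_pool l = { 0 };
	int ms;

	g.runtime = g.quota;		/* a period refresh resets the pool */
	for (ms = 0; ms < 500 && !l.throttled; ms++)
		local_charge(&g, &l, 1000 * 1000);	/* 1ms of execution */

	printf("throttled after %dms of a 500ms period\n", ms);
	return 0;
}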

Interface:
----------
Three new cgroupfs files are exported by the cpu subsystem:
  cpu.cfs_period_us : period over which bandwidth is to be regulated
  cpu.cfs_quota_us  : bandwidth available for consumption per period
  cpu.stat          : statistics (such as number of throttled periods and
                      total throttled time)
One important interface change that this introduces (versus the rate limits
proposal) is that the defined bandwidth becomes an absolute quantifier.
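
As a usage sketch (the mount point and group name below are assumptions for
the example), pinning a group to half a CPU's worth of time per period looks
like:

/* Example only: 250ms of quota per 500ms period for group "limited". */
#include <stdio.h>
#include <stdlib.h>

static void write_val(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF) {
		perror(path);
		exit(1);
	}
	fclose(f);
}

int main(void)
{
	write_val("/sys/fs/cgroup/cpu/limited/cpu.cfs_period_us", "500000");
	write_val("/sys/fs/cgroup/cpu/limited/cpu.cfs_quota_us", "250000");
	return 0;
}

Because the quota is absolute, the group above is limited to 250ms of CPU time
in each 500ms window in total, regardless of how many CPUs the machine has or
how many tasks the group runs.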

Previous postings:
-----------------
v3:
https://lkml.org/lkml/2010/10/12/44
v2:
http://lkml.org/lkml/2010/4/28/88
Original posting:
http://lkml.org/lkml/2010/2/12/393

Prior approaches:
http://lkml.org/lkml/2010/1/5/44 ("CFS Hard limits v5")

Thanks,

- Paul




* [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking
  2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
@ 2011-02-16  3:18 ` Paul Turner
  2011-02-16 16:52   ` Balbir Singh
  2011-02-23 13:32   ` Peter Zijlstra
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage Paul Turner
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-16  3:18 UTC
  To: linux-kernel
  Cc: Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

[-- Attachment #1: sched-bwc-add_cfs_tg_bandwidth.patch --]
[-- Type: text/plain, Size: 12772 bytes --]

In this patch we introduce the notion of CFS bandwidth.  To account for the
realities of SMP this is partitioned into globally unassigned bandwidth and
locally claimed bandwidth:
- The global bandwidth is per task_group; it represents a pool of unclaimed
  bandwidth that cfs_rq's can allocate from.  It is tracked using the new
  cfs_bandwidth structure.
- The local bandwidth is tracked per-cfs_rq; this represents allotments from
  the global pool to the individual cfs_rq.

Bandwidth is managed via cgroupfs, through two new files in the cpu subsystem:
- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
  to consume over the period above.

A per-cfs_bandwidth timer is also introduced to handle future refresh at
period expiration.  There's some minor refactoring here so that
start_bandwidth_timer() functionality can be shared.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 init/Kconfig        |    9 +
 kernel/sched.c      |  264 +++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched_fair.c |   19 +++
 3 files changed, 269 insertions(+), 23 deletions(-)

Index: tip/init/Kconfig
===================================================================
--- tip.orig/init/Kconfig
+++ tip/init/Kconfig
@@ -698,6 +698,15 @@ config FAIR_GROUP_SCHED
 	depends on CGROUP_SCHED
 	default CGROUP_SCHED
 
+config CFS_BANDWIDTH
+	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
+	depends on EXPERIMENTAL
+	depends on FAIR_GROUP_SCHED
+	default n
+	help
+	  This option allows users to define quota and period for cpu
+	  bandwidth provisioning on a per-cgroup basis.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on EXPERIMENTAL
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -194,10 +194,28 @@ static inline int rt_bandwidth_enabled(v
 	return sysctl_sched_rt_runtime >= 0;
 }
 
-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
+
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
 
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
+
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
+}
+
+static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+{
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return;
 
@@ -205,22 +223,7 @@ static void start_rt_bandwidth(struct rt
 		return;
 
 	raw_spin_lock(&rt_b->rt_runtime_lock);
-	for (;;) {
-		unsigned long delta;
-		ktime_t soft, hard;
-
-		if (hrtimer_active(&rt_b->rt_period_timer))
-			break;
-
-		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
-		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
-
-		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
-		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
-		delta = ktime_to_ns(ktime_sub(hard, soft));
-		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
-				HRTIMER_MODE_ABS_PINNED, 0);
-	}
+	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
 	raw_spin_unlock(&rt_b->rt_runtime_lock);
 }
 
@@ -245,6 +248,15 @@ struct cfs_rq;
 
 static LIST_HEAD(task_groups);
 
+#ifdef CONFIG_CFS_BANDWIDTH
+struct cfs_bandwidth {
+	raw_spinlock_t		lock;
+	ktime_t			period;
+	u64			runtime, quota;
+	struct hrtimer		period_timer;
+};
+#endif
+
 /* task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -276,6 +288,10 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct cfs_bandwidth cfs_bandwidth;
+#endif
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -370,9 +386,76 @@ struct cfs_rq {
 
 	unsigned long load_contribution;
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	u64 quota_assigned, quota_used;
+#endif
 #endif
 };
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
+
+static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
+{
+	struct cfs_bandwidth *cfs_b =
+		container_of(timer, struct cfs_bandwidth, period_timer);
+	ktime_t now;
+	int overrun;
+	int idle = 0;
+
+	for (;;) {
+		now = hrtimer_cb_get_time(timer);
+		overrun = hrtimer_forward(timer, now, cfs_b->period);
+
+		if (!overrun)
+			break;
+
+		idle = do_sched_cfs_period_timer(cfs_b, overrun);
+	}
+
+	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+}
+
+static
+void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
+{
+	raw_spin_lock_init(&cfs_b->lock);
+	cfs_b->quota = cfs_b->runtime = quota;
+	cfs_b->period = ns_to_ktime(period);
+
+	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	cfs_b->period_timer.function = sched_cfs_period_timer;
+}
+
+static
+void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
+{
+	cfs_rq->quota_used = 0;
+	if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
+		cfs_rq->quota_assigned = RUNTIME_INF;
+	else
+		cfs_rq->quota_assigned = 0;
+}
+
+static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	if (cfs_b->quota == RUNTIME_INF)
+		return;
+
+	if (hrtimer_active(&cfs_b->period_timer))
+		return;
+
+	raw_spin_lock(&cfs_b->lock);
+	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
+	raw_spin_unlock(&cfs_b->lock);
+}
+
+static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
+{
+	hrtimer_cancel(&cfs_b->period_timer);
+}
+#endif
+
 /* Real-Time classes' related field in a runqueue: */
 struct rt_rq {
 	struct rt_prio_array active;
@@ -8038,6 +8121,9 @@ static void init_tg_cfs_entry(struct tas
 	tg->cfs_rq[cpu] = cfs_rq;
 	init_cfs_rq(cfs_rq, rq);
 	cfs_rq->tg = tg;
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_rq_quota(cfs_rq);
+#endif
 
 	tg->se[cpu] = se;
 	/* se could be NULL for root_task_group */
@@ -8173,6 +8259,10 @@ void __init sched_init(void)
 		 * We achieve this by letting root_task_group's tasks sit
 		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
 		 */
+#ifdef CONFIG_CFS_BANDWIDTH
+		init_cfs_bandwidth(&root_task_group.cfs_bandwidth,
+				RUNTIME_INF, sched_cfs_bandwidth_period);
+#endif
 		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -8415,6 +8505,10 @@ static void free_fair_sched_group(struct
 {
 	int i;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+	destroy_cfs_bandwidth(&tg->cfs_bandwidth);
+#endif
+
 	for_each_possible_cpu(i) {
 		if (tg->cfs_rq)
 			kfree(tg->cfs_rq[i]);
@@ -8442,7 +8536,10 @@ int alloc_fair_sched_group(struct task_g
 		goto err;
 
 	tg->shares = NICE_0_LOAD;
-
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
+			sched_cfs_bandwidth_period);
+#endif
 	for_each_possible_cpu(i) {
 		rq = cpu_rq(i);
 
@@ -8822,7 +8919,7 @@ static int __rt_schedulable(struct task_
 	return walk_tg_tree(tg_schedulable, tg_nop, &data);
 }
 
-static int tg_set_bandwidth(struct task_group *tg,
+static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
 	int i, err = 0;
@@ -8861,7 +8958,7 @@ int sched_group_set_rt_runtime(struct ta
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_runtime(struct task_group *tg)
@@ -8886,7 +8983,7 @@ int sched_group_set_rt_period(struct tas
 	if (rt_period == 0)
 		return -EINVAL;
 
-	return tg_set_bandwidth(tg, rt_period, rt_runtime);
+	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
 
 long sched_group_rt_period(struct task_group *tg)
@@ -9107,6 +9204,116 @@ static u64 cpu_shares_read_u64(struct cg
 
 	return (u64) tg->shares;
 }
+
+#ifdef CONFIG_CFS_BANDWIDTH
+static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
+{
+	int i;
+	static DEFINE_MUTEX(mutex);
+
+	if (tg == &root_task_group)
+		return -EINVAL;
+
+	if (!period)
+		return -EINVAL;
+
+	/*
+	 * Ensure we have at least one tick of bandwidth every period.  This is
+	 * to prevent reaching a state of large arrears when throttled via
+	 * entity_tick() resulting in prolonged exit starvation.
+	 */
+	if (NS_TO_JIFFIES(quota) < 1)
+		return -EINVAL;
+
+	mutex_lock(&mutex);
+	raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
+	tg->cfs_bandwidth.period = ns_to_ktime(period);
+	tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
+	raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
+
+	for_each_possible_cpu(i) {
+		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+		struct rq *rq = rq_of(cfs_rq);
+
+		raw_spin_lock_irq(&rq->lock);
+		init_cfs_rq_quota(cfs_rq);
+		raw_spin_unlock_irq(&rq->lock);
+	}
+	mutex_unlock(&mutex);
+
+	return 0;
+}
+
+int tg_set_cfs_quota(struct task_group *tg, long cfs_runtime_us)
+{
+	u64 quota, period;
+
+	period = ktime_to_ns(tg->cfs_bandwidth.period);
+	if (cfs_runtime_us < 0)
+		quota = RUNTIME_INF;
+	else
+		quota = (u64)cfs_runtime_us * NSEC_PER_USEC;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_quota(struct task_group *tg)
+{
+	u64 quota_us;
+
+	if (tg->cfs_bandwidth.quota == RUNTIME_INF)
+		return -1;
+
+	quota_us = tg->cfs_bandwidth.quota;
+	do_div(quota_us, NSEC_PER_USEC);
+	return quota_us;
+}
+
+int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
+{
+	u64 quota, period;
+
+	period = (u64)cfs_period_us * NSEC_PER_USEC;
+	quota = tg->cfs_bandwidth.quota;
+
+	if (period <= 0)
+		return -EINVAL;
+
+	return tg_set_cfs_bandwidth(tg, period, quota);
+}
+
+long tg_get_cfs_period(struct task_group *tg)
+{
+	u64 cfs_period_us;
+
+	cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
+	do_div(cfs_period_us, NSEC_PER_USEC);
+	return cfs_period_us;
+}
+
+static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_quota(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
+				s64 cfs_quota_us)
+{
+	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
+}
+
+static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
+{
+	return tg_get_cfs_period(cgroup_tg(cgrp));
+}
+
+static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
+				u64 cfs_period_us)
+{
+	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
+}
+
+#endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 #ifdef CONFIG_RT_GROUP_SCHED
@@ -9141,6 +9348,18 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_shares_write_u64,
 	},
 #endif
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.name = "cfs_quota_us",
+		.read_s64 = cpu_cfs_quota_read_s64,
+		.write_s64 = cpu_cfs_quota_write_s64,
+	},
+	{
+		.name = "cfs_period_us",
+		.read_u64 = cpu_cfs_period_read_u64,
+		.write_u64 = cpu_cfs_period_write_u64,
+	},
+#endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
 		.name = "rt_runtime_us",
@@ -9450,4 +9669,3 @@ struct cgroup_subsys cpuacct_subsys = {
 	.subsys_id = cpuacct_subsys_id,
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -88,6 +88,15 @@ const_debug unsigned int sysctl_sched_mi
  */
 unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 
+
+#ifdef CONFIG_CFS_BANDWIDTH
+/*
+ * default period for cfs group bandwidth.
+ * default: 0.5s, units: nanoseconds
+ */
+static u64 sched_cfs_bandwidth_period = 500000000ULL;
+#endif
+
 static const struct sched_class fair_sched_class;
 
 /**************************************************************
@@ -397,6 +406,9 @@ static void __enqueue_entity(struct cfs_
 
 	rb_link_node(&se->run_node, parent, link);
 	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
+#ifdef CONFIG_CFS_BANDWIDTH
+	start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
+#endif
 }
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -1369,6 +1381,13 @@ static void dequeue_task_fair(struct rq 
 	hrtick_update(rq);
 }
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
+{
+	return 1;
+}
+#endif
+
 #ifdef CONFIG_SMP
 
 static void task_waking_fair(struct rq *rq, struct task_struct *p)




* [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage
  2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-02-16  3:18 ` Paul Turner
  2011-02-16 17:45   ` Balbir Singh
  2011-02-23 13:32   ` Peter Zijlstra
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-16  3:18 UTC
  To: linux-kernel
  Cc: Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

[-- Attachment #1: sched-bwc-accumulate_cfs_rq_usage.patch --]
[-- Type: text/plain, Size: 4595 bytes --]

Introduce account_cfs_rq_quota() to account bandwidth usage at the cfs_rq
level against the task_group to which bandwidth has been assigned.  Whether a
cfs_rq is constrained is tracked by whether its local cfs_rq->quota_assigned
is finite or infinite (RUNTIME_INF).

For cfs_rq's that belong to a bandwidth-constrained task_group we introduce
tg_request_cfs_quota(), which attempts to allocate quota from the global pool
for use locally.  Updates involving the global pool are currently protected
under cfs_bandwidth->lock; local pools are protected by rq->lock.

This patch only attempts to assign and track quota; no action is taken in the
case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.
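
To give a feel for the numbers involved (example values only, not taken from
the patches): a group consuming its full allowance does so in slice-sized
grants, so the number of trips to the global pool per period is bounded by
quota/slice.

/* Example arithmetic only -- not kernel code. */
#include <stdio.h>

int main(void)
{
	unsigned long long quota_us  = 50000;	/* cpu.cfs_quota_us */
	unsigned long long period_us = 500000;	/* cpu.cfs_period_us */
	unsigned long long slice_us  = 10000;	/* sched_cfs_bandwidth_slice_us */

	printf("cpu share           : %.0f%% of one cpu\n",
	       100.0 * quota_us / period_us);
	printf("slice grants/period : at most %llu\n",
	       (quota_us + slice_us - 1) / slice_us);
	return 0;
}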

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 include/linux/sched.h |    4 +++
 kernel/sched_fair.c   |   62 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c       |   10 ++++++++
 3 files changed, 76 insertions(+)

Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -95,6 +95,13 @@ unsigned int __read_mostly sysctl_sched_
  * default: 0.5s, units: nanoseconds
  */
 static u64 sched_cfs_bandwidth_period = 500000000ULL;
+
+/*
+ * default slice of quota to allocate from global tg to local cfs_rq pool on
+ * each refresh
+ * default: 10ms, units: microseconds
+ */
+unsigned int sysctl_sched_cfs_bandwidth_slice = 10000UL;
 #endif
 
 static const struct sched_class fair_sched_class;
@@ -313,6 +320,21 @@ find_matching_se(struct sched_entity **s
 
 #endif	/* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_CFS_BANDWIDTH
+static inline u64 sched_cfs_bandwidth_slice(void)
+{
+	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
+}
+
+static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
+{
+	return &tg->cfs_bandwidth;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec);
+#endif
+
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -609,6 +631,9 @@ static void update_curr(struct cfs_rq *c
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
+#ifdef CONFIG_CFS_BANDWIDTH
+	account_cfs_rq_quota(cfs_rq, delta_exec);
+#endif
 }
 
 static inline void
@@ -1382,6 +1407,43 @@ static void dequeue_task_fair(struct rq 
 }
 
 #ifdef CONFIG_CFS_BANDWIDTH
+static u64 tg_request_cfs_quota(struct task_group *tg)
+{
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+	u64 delta = 0;
+
+	if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
+		raw_spin_lock(&cfs_b->lock);
+		/*
+		 * it's possible a bandwidth update has changed the global
+		 * pool.
+		 */
+		if (cfs_b->quota == RUNTIME_INF)
+			delta = sched_cfs_bandwidth_slice();
+		else {
+			delta = min(cfs_b->runtime,
+					sched_cfs_bandwidth_slice());
+			cfs_b->runtime -= delta;
+		}
+		raw_spin_unlock(&cfs_b->lock);
+	}
+	return delta;
+}
+
+static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
+		unsigned long delta_exec)
+{
+	if (cfs_rq->quota_assigned == RUNTIME_INF)
+		return;
+
+	cfs_rq->quota_used += delta_exec;
+
+	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+		return;
+
+	cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+}
+
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
 	return 1;
Index: tip/kernel/sysctl.c
===================================================================
--- tip.orig/kernel/sysctl.c
+++ tip/kernel/sysctl.c
@@ -361,6 +361,16 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= sched_rt_handler,
 	},
+#ifdef CONFIG_CFS_BANDWIDTH
+	{
+		.procname	= "sched_cfs_bandwidth_slice_us",
+		.data		= &sysctl_sched_cfs_bandwidth_slice,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+#endif
 #ifdef CONFIG_SCHED_AUTOGROUP
 	{
 		.procname	= "sched_autogroup_enabled",
Index: tip/include/linux/sched.h
===================================================================
--- tip.orig/include/linux/sched.h
+++ tip/include/linux/sched.h
@@ -1943,6 +1943,10 @@ int sched_rt_handler(struct ctl_table *t
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
 
+#ifdef CONFIG_CFS_BANDWIDTH
+extern unsigned int sysctl_sched_cfs_bandwidth_slice;
+#endif
+
 #ifdef CONFIG_SCHED_AUTOGROUP
 extern unsigned int sysctl_sched_autogroup_enabled;
 




* [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage Paul Turner
@ 2011-02-16  3:18 ` Paul Turner
  2011-02-18  6:52   ` Balbir Singh
                     ` (2 more replies)
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-16  3:18 UTC
  To: linux-kernel
  Cc: Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_entities.patch --]
[-- Type: text/plain, Size: 7813 bytes --]

In account_cfs_rq_quota() (via update_curr()) we track consumption against a
cfs_rq's local quota, and whether there is global quota available to continue
enabling it in the event we run out.

This patch adds the required support for the latter case, throttling entities
until quota is available to run.  Throttling dequeues the entity in question
and sends a reschedule to the owning cpu so that it can be evicted.

The following restrictions apply to a throttled cfs_rq:
- It is dequeued from the sched_entity hierarchy and restricted from being
  re-enqueued.  This means that new/waking children of this entity will be
  queued up to it, but not past it.
- It does not contribute to weight calculations in tg_shares_up.
- In the case that the cfs_rq of the cpu we are trying to pull from is
  throttled, it is ignored by the load balancer in __load_balance_fair() and
  move_one_task_fair().
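
One way to observe the throttling behaviour from user space (an illustrative
test program, not part of this series): run a CPU-bound loop inside a
bandwidth-constrained group and compare consumed CPU time against elapsed wall
time.  A group limited to 20% of one CPU should settle near a 0.20 ratio.

/* Illustrative only; link with -lrt on older glibc. */
#include <stdio.h>
#include <time.h>

static double ts_to_s(struct timespec ts)
{
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	struct timespec wall0, wall1, cpu0, cpu1;

	clock_gettime(CLOCK_MONOTONIC, &wall0);
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);

	do {	/* busy loop for ~10s of wall time */
		clock_gettime(CLOCK_MONOTONIC, &wall1);
	} while (ts_to_s(wall1) - ts_to_s(wall0) < 10.0);

	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);
	printf("cpu/wall ratio: %.2f\n",
	       (ts_to_s(cpu1) - ts_to_s(cpu0)) /
	       (ts_to_s(wall1) - ts_to_s(wall0)));
	return 0;
}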

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |    3 +
 kernel/sched_fair.c |  121 +++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 114 insertions(+), 10 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -388,6 +388,7 @@ struct cfs_rq {
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
 	u64 quota_assigned, quota_used;
+	int throttled;
 #endif
 #endif
 };
@@ -1656,6 +1657,8 @@ static void update_h_load(long cpu)
 
 static void double_rq_lock(struct rq *rq1, struct rq *rq2);
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
+
 /*
  * fair double_lock_balance: Safely acquires both rq->locks in a fair
  * way at the expense of forcing extra atomic operations in all
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -331,8 +331,34 @@ static inline struct cfs_bandwidth *tg_c
 	return &tg->cfs_bandwidth;
 }
 
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return cfs_rq->throttled;
+}
+
+/* it's possible to be 'on_rq' in a dequeued (e.g. throttled) hierarchy */
+static inline int entity_on_rq(struct sched_entity *se)
+{
+	for_each_sched_entity(se)
+		if (!se->on_rq)
+			return 0;
+
+	return 1;
+}
+
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec);
+#else
+static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
+{
+	return 0;
+}
+
+static inline int entity_on_rq(struct sched_entity *se)
+{
+	return se->on_rq;
+}
+
 #endif
 
 
@@ -744,9 +770,10 @@ static void update_cfs_rq_load_contribut
 					    int global_update)
 {
 	struct task_group *tg = cfs_rq->tg;
-	long load_avg;
+	long load_avg = 0;
 
-	load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
+	if (!cfs_rq_throttled(cfs_rq))
+		load_avg = div64_u64(cfs_rq->load_avg, cfs_rq->load_period+1);
 	load_avg -= cfs_rq->load_contribution;
 
 	if (global_update || abs(load_avg) > cfs_rq->load_contribution / 8) {
@@ -761,7 +788,11 @@ static void update_cfs_load(struct cfs_r
 	u64 now, delta;
 	unsigned long load = cfs_rq->load.weight;
 
-	if (cfs_rq->tg == &root_task_group)
+	/*
+	 * Don't maintain averages for the root task group, or while we are
+	 * throttled.
+	 */
+	if (cfs_rq->tg == &root_task_group || cfs_rq_throttled(cfs_rq))
 		return;
 
 	now = rq_of(cfs_rq)->clock_task;
@@ -1015,6 +1046,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+
+
+#ifdef CONFIG_CFS_BANDWIDTH
+	if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
+	     !group_cfs_rq(se)->nr_running))
+		return;
+#endif
+
 	update_cfs_load(cfs_rq, 0);
 	account_entity_enqueue(cfs_rq, se);
 	update_cfs_shares(cfs_rq);
@@ -1087,6 +1126,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
 	 */
 	update_curr(cfs_rq);
 
+#ifdef CONFIG_CFS_BANDWIDTH
+	if (!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)))
+		return;
+#endif
+
 	update_stats_dequeue(cfs_rq, se);
 	if (flags & DEQUEUE_SLEEP) {
 #ifdef CONFIG_SCHEDSTATS
@@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct 
 			break;
 		cfs_rq = cfs_rq_of(se);
 		enqueue_entity(cfs_rq, se, flags);
+		/* don't continue to enqueue if our parent is throttled */
+		if (cfs_rq_throttled(cfs_rq))
+			break;
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq 
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
 
-		/* Don't dequeue parent if it has other entities besides us */
-		if (cfs_rq->load.weight)
+		/*
+		 * Don't dequeue parent if it has other entities besides us,
+		 * or if it is throttled
+		 */
+		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
 			break;
 		flags |= DEQUEUE_SLEEP;
 	}
@@ -1430,6 +1480,42 @@ static u64 tg_request_cfs_quota(struct t
 	return delta;
 }
 
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	/* account load preceding throttle */
+	update_cfs_load(cfs_rq, 0);
+
+	/* prevent previous buddy nominations from re-picking this se */
+	clear_buddies(cfs_rq_of(se), se);
+
+	/*
+	 * It's possible for the current task to block and re-wake before task
+	 * switch, leading to a throttle within enqueue_task->update_curr()
+	 * versus an entity that has not technically been enqueued yet.
+	 *
+	 * In this case, since we haven't actually done the enqueue yet, cut
+	 * out and allow enqueue_entity() to short-circuit
+	 */
+	if (!se->on_rq)
+		goto out_throttled;
+
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+
+		dequeue_entity(cfs_rq, se, 1);
+		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+out_throttled:
+	cfs_rq->throttled = 1;
+	update_cfs_rq_load_contribution(cfs_rq, 1);
+}
+
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec)
 {
@@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct 
 
 	cfs_rq->quota_used += delta_exec;
 
-	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
+	if (cfs_rq_throttled(cfs_rq) ||
+		cfs_rq->quota_used < cfs_rq->quota_assigned)
 		return;
 
 	cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
+
+	if (cfs_rq->quota_used >= cfs_rq->quota_assigned) {
+		throttle_cfs_rq(cfs_rq);
+		resched_task(cfs_rq->rq->curr);
+	}
 }
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
@@ -1941,6 +2033,12 @@ static void check_preempt_wakeup(struct 
 	if (unlikely(se == pse))
 		return;
 
+#ifdef CONFIG_CFS_BANDWIDTH
+	/* avoid pre-emption check/buddy nomination for throttled tasks */
+	if (!entity_on_rq(pse))
+		return;
+#endif
+
 	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
 		set_next_buddy(pse);
 
@@ -2060,7 +2158,8 @@ static bool yield_to_task_fair(struct rq
 {
 	struct sched_entity *se = &p->se;
 
-	if (!se->on_rq)
+	/* ensure entire hierarchy is on rq (e.g. running & not throttled) */
+	if (!entity_on_rq(se))
 		return false;
 
 	/* Tell the scheduler that we'd really like pse to run next. */
@@ -2280,7 +2379,8 @@ static void update_shares(int cpu)
 
 	rcu_read_lock();
 	for_each_leaf_cfs_rq(rq, cfs_rq)
-		update_shares_cpu(cfs_rq->tg, cpu);
+		if (!cfs_rq_throttled(cfs_rq))
+			update_shares_cpu(cfs_rq->tg, cpu);
 	rcu_read_unlock();
 }
 
@@ -2304,9 +2404,10 @@ load_balance_fair(struct rq *this_rq, in
 		u64 rem_load, moved_load;
 
 		/*
-		 * empty group
+		 * empty group or throttled cfs_rq
 		 */
-		if (!busiest_cfs_rq->task_weight)
+		if (!busiest_cfs_rq->task_weight ||
+				cfs_rq_throttled(busiest_cfs_rq))
 			continue;
 
 		rem_load = (u64)rem_load_move * busiest_weight;




* [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
                   ` (2 preceding siblings ...)
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
@ 2011-02-16  3:18 ` Paul Turner
  2011-02-18  7:19   ` Balbir Singh
                     ` (2 more replies)
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics Paul Turner
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-16  3:18 UTC
  To: linux-kernel
  Cc: Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

[-- Attachment #1: sched-bwc-unthrottle_entities.patch --]
[-- Type: text/plain, Size: 5774 bytes --]

At the start of a new period there are several actions we must take:
- Refresh the global bandwidth pool
- Unthrottle entities which ran out of quota, as the refreshed bandwidth
  permits

Entities throttled in a previous period have the cfs_rq->throttled flag set;
unthrottling clears it and re-enqueues them into the cfs entity hierarchy.

sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
since it is now shared by both cfs and rt bandwidth period timers.

The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
rd->span instead of cpu_online_mask since I think that was incorrect before
(we don't want to hit cpus outside of the root_domain for RT bandwidth).

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |   16 +++++++++++
 kernel/sched_fair.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched_rt.c   |   19 -------------
 3 files changed, 90 insertions(+), 19 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -1561,6 +1561,8 @@ static int tg_nop(struct task_group *tg,
 }
 #endif
 
+static inline const struct cpumask *sched_bw_period_mask(void);
+
 #ifdef CONFIG_SMP
 /* Used instead of source_load when we know the type == 0 */
 static unsigned long weighted_cpuload(const int cpu)
@@ -8503,6 +8505,18 @@ void set_curr_task(int cpu, struct task_
 
 #endif
 
+#ifdef CONFIG_SMP
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_rq(smp_processor_id())->rd->span;
+}
+#else
+static inline const struct cpumask *sched_bw_period_mask(void)
+{
+	return cpu_online_mask;
+}
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void free_fair_sched_group(struct task_group *tg)
 {
@@ -9240,6 +9254,8 @@ static int tg_set_cfs_bandwidth(struct t
 
 		raw_spin_lock_irq(&rq->lock);
 		init_cfs_rq_quota(cfs_rq);
+		if (cfs_rq_throttled(cfs_rq))
+			unthrottle_cfs_rq(cfs_rq);
 		raw_spin_unlock_irq(&rq->lock);
 	}
 	mutex_unlock(&mutex);
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -327,6 +327,13 @@ static inline u64 sched_cfs_bandwidth_sl
 	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
 }
 
+static inline
+struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
+{
+	return container_of(cfs_b, struct task_group,
+			cfs_bandwidth)->cfs_rq[cpu];
+}
+
 static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
 {
 	return &tg->cfs_bandwidth;
@@ -1513,6 +1520,33 @@ out_throttled:
 	update_cfs_rq_load_contribution(cfs_rq, 1);
 }
 
+static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+	struct sched_entity *se;
+
+	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+
+	update_rq_clock(rq);
+	/* (Try to) avoid maintaining share statistics for idle time */
+	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
+
+	cfs_rq->throttled = 0;
+	for_each_sched_entity(se) {
+		if (se->on_rq)
+			break;
+
+		cfs_rq = cfs_rq_of(se);
+		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
+		if (cfs_rq_throttled(cfs_rq))
+			break;
+	}
+
+	/* determine whether we need to wake up potentially idle cpu */
+	if (rq->curr == rq->idle && rq->cfs.nr_running)
+		resched_task(rq->curr);
+}
+
 static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
 		unsigned long delta_exec)
 {
@@ -1535,8 +1569,46 @@ static void account_cfs_rq_quota(struct 
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	return 1;
+	int i, idle = 1;
+	u64 delta;
+	const struct cpumask *span;
+
+	if (cfs_b->quota == RUNTIME_INF)
+		return 1;
+
+	/* reset group quota */
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->runtime = cfs_b->quota;
+	raw_spin_unlock(&cfs_b->lock);
+
+	span = sched_bw_period_mask();
+	for_each_cpu(i, span) {
+		struct rq *rq = cpu_rq(i);
+		struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
+
+		if (cfs_rq->nr_running)
+			idle = 0;
+
+		if (!cfs_rq_throttled(cfs_rq))
+			continue;
+
+		delta = tg_request_cfs_quota(cfs_rq->tg);
+
+		if (delta) {
+			raw_spin_lock(&rq->lock);
+			cfs_rq->quota_assigned += delta;
+
+			/* avoid race with tg_set_cfs_bandwidth */
+			if (cfs_rq_throttled(cfs_rq) &&
+			     cfs_rq->quota_used < cfs_rq->quota_assigned)
+				unthrottle_cfs_rq(cfs_rq);
+			raw_spin_unlock(&rq->lock);
+		}
+	}
+
+	return idle;
 }
+
 #endif
 
 #ifdef CONFIG_SMP
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -252,18 +252,6 @@ static int rt_se_boosted(struct sched_rt
 	return p->prio != p->normal_prio;
 }
 
-#ifdef CONFIG_SMP
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_rq(smp_processor_id())->rd->span;
-}
-#else
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-#endif
-
 static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
@@ -321,11 +309,6 @@ static inline int rt_rq_throttled(struct
 	return rt_rq->rt_throttled;
 }
 
-static inline const struct cpumask *sched_rt_period_mask(void)
-{
-	return cpu_online_mask;
-}
-
 static inline
 struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
 {
@@ -543,7 +526,7 @@ static int do_sched_rt_period_timer(stru
 	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
 		return 1;
 
-	span = sched_rt_period_mask();
+	span = sched_bw_period_mask();
 	for_each_cpu(i, span) {
 		int enqueue = 0;
 		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);




* [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
                   ` (3 preceding siblings ...)
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2011-02-16  3:18 ` Paul Turner
  2011-02-22  3:14   ` Balbir Singh
  2011-02-23 13:32   ` Peter Zijlstra
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-16  3:18 UTC
  To: linux-kernel
  Cc: Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

[-- Attachment #1: sched-bwc-throttle_stats.patch --]
[-- Type: text/plain, Size: 3986 bytes --]

From: Nikhil Rao <ncrao@google.com>

This change introduces statistics exports for the cpu sub-system; these are
added through a stat file similar to those exported by other subsystems.

The following exports are included:

nr_periods:	number of periods in which execution occurred
nr_throttled:	the number of periods above in which execution was throttled
throttled_time:	cumulative wall-time that any cpus have been throttled for
this group
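
For illustration (the cgroup path below is an assumption for the example),
the exported stats can be read back and summarised like so:

/* Example only: dump cpu.stat and summarise throttling. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[32];
	unsigned long long val, nr_periods = 0, nr_throttled = 0;
	FILE *f = fopen("/sys/fs/cgroup/cpu/limited/cpu.stat", "r");

	if (!f) {
		perror("cpu.stat");
		return 1;
	}

	while (fscanf(f, "%31s %llu", name, &val) == 2) {
		printf("%-16s %llu\n", name, val);
		if (!strcmp(name, "nr_periods"))
			nr_periods = val;
		else if (!strcmp(name, "nr_throttled"))
			nr_throttled = val;
	}
	fclose(f);

	if (nr_periods)
		printf("throttled in %.1f%% of periods\n",
		       100.0 * nr_throttled / nr_periods);
	return 0;
}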

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |   26 ++++++++++++++++++++++++++
 kernel/sched_fair.c |   16 +++++++++++++++-
 2 files changed, 41 insertions(+), 1 deletion(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -254,6 +254,11 @@ struct cfs_bandwidth {
 	ktime_t			period;
 	u64			runtime, quota;
 	struct hrtimer		period_timer;
+
+	/* throttle statistics */
+	u64			nr_periods;
+	u64			nr_throttled;
+	u64			throttled_time;
 };
 #endif
 
@@ -389,6 +394,7 @@ struct cfs_rq {
 #ifdef CONFIG_CFS_BANDWIDTH
 	u64 quota_assigned, quota_used;
 	int throttled;
+	u64 throttled_timestamp;
 #endif
 #endif
 };
@@ -426,6 +432,10 @@ void init_cfs_bandwidth(struct cfs_bandw
 
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
+
+	cfs_b->nr_periods = 0;
+	cfs_b->nr_throttled = 0;
+	cfs_b->throttled_time = 0;
 }
 
 static
@@ -9332,6 +9342,18 @@ static int cpu_cfs_period_write_u64(stru
 	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
 }
 
+static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
+
+	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
+	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
+	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
+
+	return 0;
+}
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
@@ -9378,6 +9400,10 @@ static struct cftype cpu_files[] = {
 		.read_u64 = cpu_cfs_period_read_u64,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_stats_show,
+	},
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	{
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1519,17 +1519,25 @@ static void throttle_cfs_rq(struct cfs_r
 
 out_throttled:
 	cfs_rq->throttled = 1;
+	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
 	update_cfs_rq_load_contribution(cfs_rq, 1);
 }
 
 static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
+	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	struct sched_entity *se;
 
 	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
 
 	update_rq_clock(rq);
+	/* update stats */
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
+	raw_spin_unlock(&cfs_b->lock);
+	cfs_rq->throttled_timestamp = 0;
+
 	/* (Try to) avoid maintaining share statistics for idle time */
 	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
 
@@ -1571,7 +1579,7 @@ static void account_cfs_rq_quota(struct 
 
 static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 {
-	int i, idle = 1;
+	int i, idle = 1, num_throttled = 0;
 	u64 delta;
 	const struct cpumask *span;
 
@@ -1593,6 +1601,7 @@ static int do_sched_cfs_period_timer(str
 
 		if (!cfs_rq_throttled(cfs_rq))
 			continue;
+		num_throttled++;
 
 		delta = tg_request_cfs_quota(cfs_rq->tg);
 
@@ -1608,6 +1617,11 @@ static int do_sched_cfs_period_timer(str
 		}
 	}
 
+	/* update throttled stats */
+	cfs_b->nr_periods++;
+	if (num_throttled)
+		cfs_b->nr_throttled++;
+
 	return idle;
 }
 




* [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
                   ` (4 preceding siblings ...)
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-02-16  3:18 ` Paul Turner
  2011-02-22  3:17   ` Balbir Singh
                     ` (2 more replies)
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 7/7] sched: add documentation for bandwidth control Paul Turner
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-16  3:18 UTC
  To: linux-kernel
  Cc: Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

[-- Attachment #1: sched-bwc-account_cfs_tasks.patch --]
[-- Type: text/plain, Size: 5254 bytes --]

With task entities participating in throttled sub-trees it is possible for
task activation/de-activation to not lead to root-visible changes in
rq->nr_running.  This in turn leads to incorrect idle and weight-per-task load
balance decisions.

To allow correct accounting we move responsibility for updating rq->nr_running
to the respective scheduling classes.  In the fair-group case this update is
hierarchical, tracking the number of active tasks rooted at each group entity.

Note: technically this issue also exists with the existing sched_rt
throttling; however due to the nearly complete provisioning of system
resources for rt scheduling this is much less common by default.

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
---
 kernel/sched.c      |    9 ++++++---
 kernel/sched_fair.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_rt.c   |    5 ++++-
 3 files changed, 52 insertions(+), 4 deletions(-)

Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -330,7 +330,7 @@ struct task_group root_task_group;
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight load;
-	unsigned long nr_running;
+	unsigned long nr_running, h_nr_tasks;
 
 	u64 exec_clock;
 	u64 min_vruntime;
@@ -1846,6 +1846,11 @@ static const struct sched_class rt_sched
 
 #include "sched_stats.h"
 
+static void mod_nr_running(struct rq *rq, long delta)
+{
+	rq->nr_running += delta;
+}
+
 static void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
@@ -1896,7 +1901,6 @@ static void activate_task(struct rq *rq,
 		rq->nr_uninterruptible--;
 
 	enqueue_task(rq, p, flags);
-	inc_nr_running(rq);
 }
 
 /*
@@ -1908,7 +1912,6 @@ static void deactivate_task(struct rq *r
 		rq->nr_uninterruptible++;
 
 	dequeue_task(rq, p, flags);
-	dec_nr_running(rq);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -81,6 +81,8 @@ unsigned int normalized_sysctl_sched_wak
 
 const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
 
+static void account_hier_tasks(struct sched_entity *se, int delta);
+
 /*
  * The exponential sliding  window over which load is averaged for shares
  * distribution.
@@ -933,6 +935,40 @@ static inline void update_entity_shares_
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
+#ifdef CONFIG_CFS_BANDWIDTH
+/* maintain hierarchical task counts on group entities */
+static void account_hier_tasks(struct sched_entity *se, int delta)
+{
+	struct rq *rq = rq_of(cfs_rq_of(se));
+	struct cfs_rq *cfs_rq;
+
+	for_each_sched_entity(se) {
+		/* a throttled entity cannot affect its parent hierarchy */
+		if (group_cfs_rq(se) && cfs_rq_throttled(group_cfs_rq(se)))
+			break;
+
+		/* we affect our queuing entity */
+		cfs_rq = cfs_rq_of(se);
+		cfs_rq->h_nr_tasks += delta;
+	}
+
+	/* account for global nr_running delta to hierarchy change */
+	if (!se)
+		mod_nr_running(rq, delta);
+}
+#else
+/*
+ * In the absence of group throttling, all operations are guaranteed to be
+ * globally visible at the root rq level.
+ */
+static void account_hier_tasks(struct sched_entity *se, int delta)
+{
+	struct rq *rq = rq_of(cfs_rq_of(se));
+
+	mod_nr_running(rq, delta);
+}
+#endif
+
 static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 #ifdef CONFIG_SCHEDSTATS
@@ -1428,6 +1464,7 @@ enqueue_task_fair(struct rq *rq, struct 
 		update_cfs_shares(cfs_rq);
 	}
 
+	account_hier_tasks(&p->se, 1);
 	hrtick_update(rq);
 }
 
@@ -1461,6 +1498,7 @@ static void dequeue_task_fair(struct rq 
 		update_cfs_shares(cfs_rq);
 	}
 
+	account_hier_tasks(&p->se, -1);
 	hrtick_update(rq);
 }
 
@@ -1488,6 +1526,8 @@ static u64 tg_request_cfs_quota(struct t
 	return delta;
 }
 
+static void account_hier_tasks(struct sched_entity *se, int delta);
+
 static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se;
@@ -1507,6 +1547,7 @@ static void throttle_cfs_rq(struct cfs_r
 	if (!se->on_rq)
 		goto out_throttled;
 
+	account_hier_tasks(se, -cfs_rq->h_nr_tasks);
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
@@ -1541,6 +1582,7 @@ static void unthrottle_cfs_rq(struct cfs
 	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
 
 	cfs_rq->throttled = 0;
+	account_hier_tasks(se, cfs_rq->h_nr_tasks);
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -906,6 +906,8 @@ enqueue_task_rt(struct rq *rq, struct ta
 
 	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
 		enqueue_pushable_task(rq, p);
+
+	inc_nr_running(rq);
 }
 
 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
@@ -916,6 +918,8 @@ static void dequeue_task_rt(struct rq *r
 	dequeue_rt_entity(rt_se);
 
 	dequeue_pushable_task(rq, p);
+
+	dec_nr_running(rq);
 }
 
 /*
@@ -1783,4 +1787,3 @@ static void print_rt_stats(struct seq_fi
 	rcu_read_unlock();
 }
 #endif /* CONFIG_SCHED_DEBUG */
-




* [CFS Bandwidth Control v4 7/7] sched: add documentation for bandwidth control
  2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
                   ` (5 preceding siblings ...)
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-02-16  3:18 ` Paul Turner
  2011-02-21  2:47 ` [CFS Bandwidth Control v4 0/7] Introduction Xiao Guangrong
       [not found] ` <20110224161111.7d83a884@jacob-laptop>
  8 siblings, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-16  3:18 UTC
  To: linux-kernel
  Cc: Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

[-- Attachment #1: sched-bwc-documentation.patch --]
[-- Type: text/plain, Size: 4351 bytes --]

From: Bharata B Rao <bharata@linux.vnet.ibm.com>

Basic description of usage and effect for CFS Bandwidth Control.

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Paul Turner <pjt@google.com>
---
 Documentation/scheduler/sched-bwc.txt |   98 ++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

--- /dev/null
+++ b/Documentation/scheduler/sched-bwc.txt
@@ -0,0 +1,98 @@
+CFS Bandwidth Control (aka CPU hard limits)
+===========================================
+
+[ This document talks about CPU bandwidth control of CFS groups only.
+  The bandwidth control of RT groups is explained in
+  Documentation/scheduler/sched-rt-group.txt ]
+
+CFS bandwidth control is a group scheduler extension that can be used to
+control the maximum CPU bandwidth obtained by a CPU cgroup.
+
+Bandwidth allowed for a group is specified using quota and period. Within
+a given "period" (microseconds), a group is allowed to consume up to "quota"
+microseconds of CPU time, which is the upper limit or the hard limit. When the
+CPU bandwidth consumption of a group exceeds the hard limit, the tasks in the
+group are throttled and are not allowed to run until the end of the period at
+which time the group's quota is replenished.
+
+Runtime available to the group is tracked globally. At the beginning of
+every period, group's global runtime pool is replenished with "quota"
+microseconds worth of runtime. The runtime consumption happens locally at each
+CPU by fetching runtimes in "slices" from the global pool.
+
+Interface
+---------
+Quota and period can be set via cgroup files.
+
+cpu.cfs_quota_us: the maximum allowed bandwidth (microseconds)
+cpu.cfs_period_us: the enforcement interval (microseconds)
+
+Within a period of cpu.cfs_period_us, the group as a whole will not be allowed
+to consume more than cpu.cfs_quota_us worth of runtime.
+
+The default value of cpu.cfs_period_us is 500ms and the default value
+for cpu.cfs_quota_us is -1.
+
+A group with cpu.cfs_quota_us as -1 indicates that the group has infinite
+bandwidth, which means that it is not bandwidth controlled.
+
+Writing any negative value to cpu.cfs_quota_us will turn the group into
+an infinite bandwidth group. Reading cpu.cfs_quota_us for an infinite
+bandwidth group will always return -1.
+
+System wide settings
+--------------------
+The amount of runtime obtained from global pool every time a CPU wants the
+group quota locally is controlled by a sysctl parameter
+sched_cfs_bandwidth_slice_us. The current default is 10ms. This can be changed
+by writing to /proc/sys/kernel/sched_cfs_bandwidth_slice_us.
+
+Statistics
+----------
+cpu.stat file lists three different stats related to CPU bandwidth control.
+
+nr_periods: Number of enforcement intervals that have elapsed.
+nr_throttled: Number of times the group has been throttled/limited.
+throttled_time: The total time duration (in nanoseconds) for which the group
+remained throttled.
+
+These files are read-only.
+
+Hierarchy considerations
+------------------------
+Each group's bandwidth (quota and period) can be set independent of its
+parent or child groups. There are two ways in which a group can get
+throttled:
+
+- it consumed its quota within the period
+- it has quota left but the parent's quota is exhausted.
+
+In the 2nd case, even though the child has quota left, it will not be
+able to run since the parent itself is throttled. Similarly groups that are
+not bandwidth constrained might end up being throttled if any parent
+in their hierarchy is throttled.
+
+Examples
+--------
+1. Limit a group to 1 CPU worth of runtime.
+
+	If period is 500ms and quota is also 500ms, the group will get
+	1 CPU worth of runtime every 500ms.
+
+	# echo 500000 > cpu.cfs_quota_us /* quota = 500ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
+
+	With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
+	runtime every 500ms.
+
+	# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */
+
+3. Limit a group to 20% of 1 CPU.
+
+	With 500ms period, 100ms quota will be equivalent to 20% of 1 CPU.
+
+	# echo 100000 > cpu.cfs_quota_us /* quota = 100ms */
+	# echo 500000 > cpu.cfs_period_us /* period = 500ms */




* Re: [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
@ 2011-02-16 16:52   ` Balbir Singh
  2011-02-17  2:54     ` Bharata B Rao
  2011-02-23 13:32   ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2011-02-16 16:52 UTC
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

* Paul Turner <pjt@google.com> [2011-02-15 19:18:32]:

> In this patch we introduce the notion of CFS bandwidth.  To account for the
> realities of SMP this is partitioned into globally unassigned bandwidth and
> locally claimed bandwidth:
> - The global bandwidth is per task_group; it represents a pool of unclaimed
>   bandwidth that cfs_rq's can allocate from.  It is tracked using the new
>   cfs_bandwidth structure.
> - The local bandwidth is tracked per-cfs_rq; this represents allotments from
>   the global pool to the individual cfs_rq.
> 
> Bandwidth is managed via cgroupfs, through two new files in the cpu subsystem:
> - cpu.cfs_period_us : the bandwidth period in usecs
> - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
>   to consume over the period above.
> 
> A per-cfs_bandwidth timer is also introduced to handle future refresh at
> period expiration.  There's some minor refactoring here so that
> start_bandwidth_timer() functionality can be shared.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---

Looks good, minor nits below


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

>  init/Kconfig        |    9 +
>  kernel/sched.c      |  264 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  kernel/sched_fair.c |   19 +++
>  3 files changed, 269 insertions(+), 23 deletions(-)
> 
> Index: tip/init/Kconfig
> ===================================================================
> --- tip.orig/init/Kconfig
> +++ tip/init/Kconfig
> @@ -698,6 +698,15 @@ config FAIR_GROUP_SCHED
>  	depends on CGROUP_SCHED
>  	default CGROUP_SCHED
> 
> +config CFS_BANDWIDTH
> +	bool "CPU bandwidth provisioning for FAIR_GROUP_SCHED"
> +	depends on EXPERIMENTAL
> +	depends on FAIR_GROUP_SCHED
> +	default n
> +	help
> +	  This option allows users to define quota and period for cpu
> +	  bandwidth provisioning on a per-cgroup basis.
> +
>  config RT_GROUP_SCHED
>  	bool "Group scheduling for SCHED_RR/FIFO"
>  	depends on EXPERIMENTAL
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -194,10 +194,28 @@ static inline int rt_bandwidth_enabled(v
>  	return sysctl_sched_rt_runtime >= 0;
>  }
> 
> -static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> +static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
>  {
> -	ktime_t now;
> +	unsigned long delta;
> +	ktime_t soft, hard, now;
> +
> +	for (;;) {
> +		if (hrtimer_active(period_timer))
> +			break;
> 
> +		now = hrtimer_cb_get_time(period_timer);
> +		hrtimer_forward(period_timer, now, period);
> +
> +		soft = hrtimer_get_softexpires(period_timer);
> +		hard = hrtimer_get_expires(period_timer);
> +		delta = ktime_to_ns(ktime_sub(hard, soft));
> +		__hrtimer_start_range_ns(period_timer, soft, delta,
> +					 HRTIMER_MODE_ABS_PINNED, 0);
> +	}
> +}
> +
> +static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
> +{
>  	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
>  		return;
> 
> @@ -205,22 +223,7 @@ static void start_rt_bandwidth(struct rt
>  		return;
> 
>  	raw_spin_lock(&rt_b->rt_runtime_lock);
> -	for (;;) {
> -		unsigned long delta;
> -		ktime_t soft, hard;
> -
> -		if (hrtimer_active(&rt_b->rt_period_timer))
> -			break;
> -
> -		now = hrtimer_cb_get_time(&rt_b->rt_period_timer);
> -		hrtimer_forward(&rt_b->rt_period_timer, now, rt_b->rt_period);
> -
> -		soft = hrtimer_get_softexpires(&rt_b->rt_period_timer);
> -		hard = hrtimer_get_expires(&rt_b->rt_period_timer);
> -		delta = ktime_to_ns(ktime_sub(hard, soft));
> -		__hrtimer_start_range_ns(&rt_b->rt_period_timer, soft, delta,
> -				HRTIMER_MODE_ABS_PINNED, 0);
> -	}
> +	start_bandwidth_timer(&rt_b->rt_period_timer, rt_b->rt_period);
>  	raw_spin_unlock(&rt_b->rt_runtime_lock);
>  }
> 
> @@ -245,6 +248,15 @@ struct cfs_rq;
> 
>  static LIST_HEAD(task_groups);
> 
> +#ifdef CONFIG_CFS_BANDWIDTH
> +struct cfs_bandwidth {
> +	raw_spinlock_t		lock;
> +	ktime_t			period;
> +	u64			runtime, quota;
> +	struct hrtimer		period_timer;
> +};
> +#endif
> +
>  /* task group related information */
>  struct task_group {
>  	struct cgroup_subsys_state css;
> @@ -276,6 +288,10 @@ struct task_group {
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  	struct autogroup *autogroup;
>  #endif
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	struct cfs_bandwidth cfs_bandwidth;
> +#endif
>  };
> 
>  /* task_group_lock serializes the addition/removal of task groups */
> @@ -370,9 +386,76 @@ struct cfs_rq {
> 
>  	unsigned long load_contribution;
>  #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	u64 quota_assigned, quota_used;
> +#endif
>  #endif
>  };
> 
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
> +
> +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> +{
> +	struct cfs_bandwidth *cfs_b =
> +		container_of(timer, struct cfs_bandwidth, period_timer);
> +	ktime_t now;
> +	int overrun;
> +	int idle = 0;
> +
> +	for (;;) {
> +		now = hrtimer_cb_get_time(timer);
> +		overrun = hrtimer_forward(timer, now, cfs_b->period);
> +
> +		if (!overrun)
> +			break;
> +
> +		idle = do_sched_cfs_period_timer(cfs_b, overrun);

This patch just sets up do_sched_cfs_period_timer to return 1. I am
afraid I don't understand why this function is introduced here.

> +	}
> +
> +	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> +}
> +
> +static
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
> +{
> +	raw_spin_lock_init(&cfs_b->lock);
> +	cfs_b->quota = cfs_b->runtime = quota;
> +	cfs_b->period = ns_to_ktime(period);
> +
> +	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	cfs_b->period_timer.function = sched_cfs_period_timer;
> +}
> +
> +static
> +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
> +{
> +	cfs_rq->quota_used = 0;
> +	if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
> +		cfs_rq->quota_assigned = RUNTIME_INF;
> +	else
> +		cfs_rq->quota_assigned = 0;
> +}
> +
> +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +	if (cfs_b->quota == RUNTIME_INF)
> +		return;
> +
> +	if (hrtimer_active(&cfs_b->period_timer))
> +		return;
> +
> +	raw_spin_lock(&cfs_b->lock);
> +	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> +	raw_spin_unlock(&cfs_b->lock);
> +}
> +
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +	hrtimer_cancel(&cfs_b->period_timer);
> +}
> +#endif
> +
>  /* Real-Time classes' related field in a runqueue: */
>  struct rt_rq {
>  	struct rt_prio_array active;
> @@ -8038,6 +8121,9 @@ static void init_tg_cfs_entry(struct tas
>  	tg->cfs_rq[cpu] = cfs_rq;
>  	init_cfs_rq(cfs_rq, rq);
>  	cfs_rq->tg = tg;
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	init_cfs_rq_quota(cfs_rq);
> +#endif
> 
>  	tg->se[cpu] = se;
>  	/* se could be NULL for root_task_group */
> @@ -8173,6 +8259,10 @@ void __init sched_init(void)
>  		 * We achieve this by letting root_task_group's tasks sit
>  		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
>  		 */
> +#ifdef CONFIG_CFS_BANDWIDTH
> +		init_cfs_bandwidth(&root_task_group.cfs_bandwidth,
> +				RUNTIME_INF, sched_cfs_bandwidth_period);
> +#endif
>  		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> 
> @@ -8415,6 +8505,10 @@ static void free_fair_sched_group(struct
>  {
>  	int i;
> 
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	destroy_cfs_bandwidth(&tg->cfs_bandwidth);
> +#endif
> +
>  	for_each_possible_cpu(i) {
>  		if (tg->cfs_rq)
>  			kfree(tg->cfs_rq[i]);
> @@ -8442,7 +8536,10 @@ int alloc_fair_sched_group(struct task_g
>  		goto err;
> 
>  	tg->shares = NICE_0_LOAD;
> -
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
> +			sched_cfs_bandwidth_period);
> +#endif
>  	for_each_possible_cpu(i) {
>  		rq = cpu_rq(i);
> 
> @@ -8822,7 +8919,7 @@ static int __rt_schedulable(struct task_
>  	return walk_tg_tree(tg_schedulable, tg_nop, &data);
>  }
> 
> -static int tg_set_bandwidth(struct task_group *tg,
> +static int tg_set_rt_bandwidth(struct task_group *tg,
>  		u64 rt_period, u64 rt_runtime)
>  {
>  	int i, err = 0;
> @@ -8861,7 +8958,7 @@ int sched_group_set_rt_runtime(struct ta
>  	if (rt_runtime_us < 0)
>  		rt_runtime = RUNTIME_INF;
> 
> -	return tg_set_bandwidth(tg, rt_period, rt_runtime);
> +	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
>  }
> 
>  long sched_group_rt_runtime(struct task_group *tg)
> @@ -8886,7 +8983,7 @@ int sched_group_set_rt_period(struct tas
>  	if (rt_period == 0)
>  		return -EINVAL;
> 
> -	return tg_set_bandwidth(tg, rt_period, rt_runtime);
> +	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
>  }
> 
>  long sched_group_rt_period(struct task_group *tg)
> @@ -9107,6 +9204,116 @@ static u64 cpu_shares_read_u64(struct cg
> 
>  	return (u64) tg->shares;
>  }
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +{
> +	int i;
> +	static DEFINE_MUTEX(mutex);
> +
> +	if (tg == &root_task_group)
> +		return -EINVAL;
> +
> +	if (!period)
> +		return -EINVAL;
> +
> +	/*
> +	 * Ensure we have at least one tick of bandwidth every period.  This is
> +	 * to prevent reaching a state of large arrears when throttled via
> +	 * entity_tick() resulting in prolonged exit starvation.
> +	 */
> +	if (NS_TO_JIFFIES(quota) < 1)
> +		return -EINVAL;
> +
> +	mutex_lock(&mutex);
> +	raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> +	tg->cfs_bandwidth.period = ns_to_ktime(period);
> +	tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> +	raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> +
> +	for_each_possible_cpu(i) {

Why for each possible cpu - to avoid hotplug handling?

> +		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> +		struct rq *rq = rq_of(cfs_rq);
> +
> +		raw_spin_lock_irq(&rq->lock);
> +		init_cfs_rq_quota(cfs_rq);
> +		raw_spin_unlock_irq(&rq->lock);
> +	}
> +	mutex_unlock(&mutex);
> +
> +	return 0;
> +}
> +
> +int tg_set_cfs_quota(struct task_group *tg, long cfs_runtime_us)
> +{
> +	u64 quota, period;
> +
> +	period = ktime_to_ns(tg->cfs_bandwidth.period);
> +	if (cfs_runtime_us < 0)
> +		quota = RUNTIME_INF;
> +	else
> +		quota = (u64)cfs_runtime_us * NSEC_PER_USEC;
> +
> +	return tg_set_cfs_bandwidth(tg, period, quota);
> +}
> +
> +long tg_get_cfs_quota(struct task_group *tg)
> +{
> +	u64 quota_us;
> +
> +	if (tg->cfs_bandwidth.quota == RUNTIME_INF)
> +		return -1;
> +
> +	quota_us = tg->cfs_bandwidth.quota;
> +	do_div(quota_us, NSEC_PER_USEC);
> +	return quota_us;
> +}
> +
> +int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
> +{
> +	u64 quota, period;
> +
> +	period = (u64)cfs_period_us * NSEC_PER_USEC;
> +	quota = tg->cfs_bandwidth.quota;
> +
> +	if (period <= 0)
> +		return -EINVAL;
> +
> +	return tg_set_cfs_bandwidth(tg, period, quota);
> +}
> +
> +long tg_get_cfs_period(struct task_group *tg)
> +{
> +	u64 cfs_period_us;
> +
> +	cfs_period_us = ktime_to_ns(tg->cfs_bandwidth.period);
> +	do_div(cfs_period_us, NSEC_PER_USEC);
> +	return cfs_period_us;
> +}
> +
> +static s64 cpu_cfs_quota_read_s64(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	return tg_get_cfs_quota(cgroup_tg(cgrp));
> +}
> +
> +static int cpu_cfs_quota_write_s64(struct cgroup *cgrp, struct cftype *cftype,
> +				s64 cfs_quota_us)
> +{
> +	return tg_set_cfs_quota(cgroup_tg(cgrp), cfs_quota_us);
> +}
> +
> +static u64 cpu_cfs_period_read_u64(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	return tg_get_cfs_period(cgroup_tg(cgrp));
> +}
> +
> +static int cpu_cfs_period_write_u64(struct cgroup *cgrp, struct cftype *cftype,
> +				u64 cfs_period_us)
> +{
> +	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
> +}
> +
> +#endif /* CONFIG_CFS_BANDWIDTH */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> 
>  #ifdef CONFIG_RT_GROUP_SCHED
> @@ -9141,6 +9348,18 @@ static struct cftype cpu_files[] = {
>  		.write_u64 = cpu_shares_write_u64,
>  	},
>  #endif
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	{
> +		.name = "cfs_quota_us",
> +		.read_s64 = cpu_cfs_quota_read_s64,
> +		.write_s64 = cpu_cfs_quota_write_s64,
> +	},
> +	{
> +		.name = "cfs_period_us",
> +		.read_u64 = cpu_cfs_period_read_u64,
> +		.write_u64 = cpu_cfs_period_write_u64,
> +	},
> +#endif
>  #ifdef CONFIG_RT_GROUP_SCHED
>  	{
>  		.name = "rt_runtime_us",
> @@ -9450,4 +9669,3 @@ struct cgroup_subsys cpuacct_subsys = {
>  	.subsys_id = cpuacct_subsys_id,
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
> -
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -88,6 +88,15 @@ const_debug unsigned int sysctl_sched_mi
>   */
>  unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> 
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/*
> + * default period for cfs group bandwidth.
> + * default: 0.5s, units: nanoseconds
> + */
> +static u64 sched_cfs_bandwidth_period = 500000000ULL;
> +#endif
> +
>  static const struct sched_class fair_sched_class;
> 
>  /**************************************************************
> @@ -397,6 +406,9 @@ static void __enqueue_entity(struct cfs_
> 
>  	rb_link_node(&se->run_node, parent, link);
>  	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
> +#endif
>  }
> 
>  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -1369,6 +1381,13 @@ static void dequeue_task_fair(struct rq 
>  	hrtick_update(rq);
>  }
> 
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> +{
> +	return 1;
> +}
> +#endif
> +
>  #ifdef CONFIG_SMP
> 
>  static void task_waking_fair(struct rq *rq, struct task_struct *p)
> 
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage Paul Turner
@ 2011-02-16 17:45   ` Balbir Singh
  2011-02-23 13:32   ` Peter Zijlstra
  1 sibling, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2011-02-16 17:45 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

* Paul Turner <pjt@google.com> [2011-02-15 19:18:33]:

> Introduce account_cfs_rq_quota() to account bandwidth usage on the cfs_rq
> level versus task_groups for which bandwidth has been assigned.  This is
> tracked by whether the local cfs_rq->quota_assigned is finite or infinite
> (RUNTIME_INF).
> 
> For cfs_rq's that belong to a bandwidth constrained task_group we introduce
> tg_request_cfs_quota() which attempts to allocate quota from the global pool
> for use locally.  Updates involving the global pool are currently protected
> under cfs_bandwidth->lock, local pools are protected by rq->lock.
> 
> This patch only attempts to assign and track quota, no action is taken in the
> case that cfs_rq->quota_used exceeds cfs_rq->quota_assigned.
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking
  2011-02-16 16:52   ` Balbir Singh
@ 2011-02-17  2:54     ` Bharata B Rao
  0 siblings, 0 replies; 71+ messages in thread
From: Bharata B Rao @ 2011-02-17  2:54 UTC (permalink / raw
  To: Balbir Singh
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Wed, Feb 16, 2011 at 10:22:16PM +0530, Balbir Singh wrote:
> * Paul Turner <pjt@google.com> [2011-02-15 19:18:32]:
> 
> > In this patch we introduce the notion of CFS bandwidth, to account for the
> > realities of SMP this is partitioned into globally unassigned bandwidth, and
> > locally claimed bandwidth:
> > - The global bandwidth is per task_group, it represents a pool of unclaimed
> >   bandwidth that cfs_rq's can allocate from.  It uses the new cfs_bandwidth
> >   structure.
> > - The local bandwidth is tracked per-cfs_rq, this represents allotments from
> >   the global pool
> >   bandwidth assigned to a task_group, this is tracked using the
> >   new cfs_bandwidth structure.
> > 
> > Bandwidth is managed via cgroupfs via two new files in the cpu subsystem:
> > - cpu.cfs_period_us : the bandwidth period in usecs
> > - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
> >   to consume over period above.
> > 
> > A per-cfs_bandwidth timer is also introduced to handle future refresh at
> > period expiration.  There's some minor refactoring here so that
> > start_bandwidth_timer() functionality can be shared
> > 
> > Signed-off-by: Paul Turner <pjt@google.com>
> > Signed-off-by: Nikhil Rao <ncrao@google.com>
> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > ---
> 
> Looks good, minor nits below
> 
> 
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>

Thanks Balbir.

> > +
> > +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> > +{
> > +	struct cfs_bandwidth *cfs_b =
> > +		container_of(timer, struct cfs_bandwidth, period_timer);
> > +	ktime_t now;
> > +	int overrun;
> > +	int idle = 0;
> > +
> > +	for (;;) {
> > +		now = hrtimer_cb_get_time(timer);
> > +		overrun = hrtimer_forward(timer, now, cfs_b->period);
> > +
> > +		if (!overrun)
> > +			break;
> > +
> > +		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> 
> This patch just sets up do_sched_cfs_period_timer to return 1. I am
> afraid I don't understand why this function is introduced here.

Answered this during last post: http://lkml.org/lkml/2010/10/14/31

> > +
> > +	mutex_lock(&mutex);
> > +	raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> > +	tg->cfs_bandwidth.period = ns_to_ktime(period);
> > +	tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> > +	raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> > +
> > +	for_each_possible_cpu(i) {
> 
> Why for each possible cpu - to avoid hotplug handling?

Touched upon this during last post: https://lkml.org/lkml/2010/12/6/49

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
@ 2011-02-18  6:52   ` Balbir Singh
  2011-02-23 13:32   ` Peter Zijlstra
  2011-03-02  7:23   ` Bharata B Rao
  2 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2011-02-18  6:52 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

* Paul Turner <pjt@google.com> [2011-02-15 19:18:34]:

> In account_cfs_rq_quota() (via update_curr()) we track consumption versus a
> cfs_rq's local quota and whether there is global quota available to continue
> enabling it in the event we run out.
> 
> This patch adds the required support for the latter case, throttling entities
> until quota is available to run.  Throttling dequeues the entity in question
> and sends a reschedule to the owning cpu so that it can be evicted.
> 
> The following restrictions apply to a throttled cfs_rq:
> - It is dequeued from sched_entity hierarchy and restricted from being
>   re-enqueued.  This means that new/waking children of this entity will be
>   queued up to it, but not past it.
> - It does not contribute to weight calculations in tg_shares_up
> - In the case that the cfs_rq of the cpu we are trying to pull from is throttled
>   it is ignored by the loadbalancer in __load_balance_fair() and
>   move_one_task_fair().
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 
-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
@ 2011-02-18  7:19   ` Balbir Singh
  2011-02-18  8:10     ` Bharata B Rao
  2011-02-23 12:23   ` Peter Zijlstra
  2011-02-23 13:32   ` Peter Zijlstra
  2 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2011-02-18  7:19 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

* Paul Turner <pjt@google.com> [2011-02-15 19:18:35]:

> At the start of a new period there are several actions we must take:
> - Refresh global bandwidth pool
> - Unthrottle entities who ran out of quota as refreshed bandwidth permits
> 
> Unthrottled entities have the cfs_rq->throttled flag set and are re-enqueued
> into the cfs entity hierarchy.
> 
> sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
> since it is now shared by both cfs and rt bandwidth period timers.
> 
> The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
> rd->span instead of cpu_online_mask since I think that was incorrect before
> (don't want to hit cpu's outside of your root_domain for RT bandwidth).
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
>  kernel/sched.c      |   16 +++++++++++
>  kernel/sched_fair.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched_rt.c   |   19 -------------
>  3 files changed, 90 insertions(+), 19 deletions(-)
> 
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -1561,6 +1561,8 @@ static int tg_nop(struct task_group *tg,
>  }
>  #endif
> 
> +static inline const struct cpumask *sched_bw_period_mask(void);
> +
>  #ifdef CONFIG_SMP
>  /* Used instead of source_load when we know the type == 0 */
>  static unsigned long weighted_cpuload(const int cpu)
> @@ -8503,6 +8505,18 @@ void set_curr_task(int cpu, struct task_
> 
>  #endif
> 
> +#ifdef CONFIG_SMP
> +static inline const struct cpumask *sched_bw_period_mask(void)
> +{
> +	return cpu_rq(smp_processor_id())->rd->span;
> +}
> +#else
> +static inline const struct cpumask *sched_bw_period_mask(void)
> +{
> +	return cpu_online_mask;
> +}
> +#endif
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static void free_fair_sched_group(struct task_group *tg)
>  {
> @@ -9240,6 +9254,8 @@ static int tg_set_cfs_bandwidth(struct t
> 
>  		raw_spin_lock_irq(&rq->lock);
>  		init_cfs_rq_quota(cfs_rq);
> +		if (cfs_rq_throttled(cfs_rq))
> +			unthrottle_cfs_rq(cfs_rq);
>  		raw_spin_unlock_irq(&rq->lock);
>  	}
>  	mutex_unlock(&mutex);
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -327,6 +327,13 @@ static inline u64 sched_cfs_bandwidth_sl
>  	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
>  }
> 
> +static inline
> +struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
> +{
> +	return container_of(cfs_b, struct task_group,
> +			cfs_bandwidth)->cfs_rq[cpu];
> +}
> +
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
>  {
>  	return &tg->cfs_bandwidth;
> @@ -1513,6 +1520,33 @@ out_throttled:
>  	update_cfs_rq_load_contribution(cfs_rq, 1);
>  }
> 
> +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +	struct rq *rq = rq_of(cfs_rq);
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +
> +	update_rq_clock(rq);
> +	/* (Try to) avoid maintaining share statistics for idle time */
> +	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
> +
> +	cfs_rq->throttled = 0;
> +	for_each_sched_entity(se) {
> +		if (se->on_rq)
> +			break;
> +
> +		cfs_rq = cfs_rq_of(se);
> +		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> +		if (cfs_rq_throttled(cfs_rq))
> +			break;
> +	}
> +
> +	/* determine whether we need to wake up potentially idle cpu */
> +	if (rq->curr == rq->idle && rq->cfs.nr_running)
> +		resched_task(rq->curr);
> +}
> +
>  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>  		unsigned long delta_exec)
>  {
> @@ -1535,8 +1569,46 @@ static void account_cfs_rq_quota(struct 
> 
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
> -	return 1;
> +	int i, idle = 1;
> +	u64 delta;
> +	const struct cpumask *span;
> +
> +	if (cfs_b->quota == RUNTIME_INF)
> +		return 1;
> +
> +	/* reset group quota */
> +	raw_spin_lock(&cfs_b->lock);
> +	cfs_b->runtime = cfs_b->quota;
> +	raw_spin_unlock(&cfs_b->lock);
> +
> +	span = sched_bw_period_mask();
> +	for_each_cpu(i, span) {
> +		struct rq *rq = cpu_rq(i);
> +		struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
> +
> +		if (cfs_rq->nr_running)
> +			idle = 0;
> +
> +		if (!cfs_rq_throttled(cfs_rq))
> +			continue;

This is an interesting situation: does this mean we got scheduled in
between periods with enough quota left to last us across the period?
Should we be donating quota back to the global pool?

> +
> +		delta = tg_request_cfs_quota(cfs_rq->tg);
> +
> +		if (delta) {
> +			raw_spin_lock(&rq->lock);
> +			cfs_rq->quota_assigned += delta;
> +
> +			/* avoid race with tg_set_cfs_bandwidth */
> +			if (cfs_rq_throttled(cfs_rq) &&
> +			     cfs_rq->quota_used < cfs_rq->quota_assigned)
> +				unthrottle_cfs_rq(cfs_rq);
> +			raw_spin_unlock(&rq->lock);
> +		}
> +	}
> +
> +	return idle;
>  }
> +
>  #endif
> 
>  #ifdef CONFIG_SMP
> Index: tip/kernel/sched_rt.c
> ===================================================================
> --- tip.orig/kernel/sched_rt.c
> +++ tip/kernel/sched_rt.c
> @@ -252,18 +252,6 @@ static int rt_se_boosted(struct sched_rt
>  	return p->prio != p->normal_prio;
>  }
> 
> -#ifdef CONFIG_SMP
> -static inline const struct cpumask *sched_rt_period_mask(void)
> -{
> -	return cpu_rq(smp_processor_id())->rd->span;
> -}
> -#else
> -static inline const struct cpumask *sched_rt_period_mask(void)
> -{
> -	return cpu_online_mask;
> -}
> -#endif
> -
>  static inline
>  struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
>  {
> @@ -321,11 +309,6 @@ static inline int rt_rq_throttled(struct
>  	return rt_rq->rt_throttled;
>  }
> 
> -static inline const struct cpumask *sched_rt_period_mask(void)
> -{
> -	return cpu_online_mask;
> -}
> -
>  static inline
>  struct rt_rq *sched_rt_period_rt_rq(struct rt_bandwidth *rt_b, int cpu)
>  {
> @@ -543,7 +526,7 @@ static int do_sched_rt_period_timer(stru
>  	if (!rt_bandwidth_enabled() || rt_b->rt_runtime == RUNTIME_INF)
>  		return 1;
> 
> -	span = sched_rt_period_mask();
> +	span = sched_bw_period_mask();
>  	for_each_cpu(i, span) {
>  		int enqueue = 0;
>  		struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
> 
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-02-18  7:19   ` Balbir Singh
@ 2011-02-18  8:10     ` Bharata B Rao
  0 siblings, 0 replies; 71+ messages in thread
From: Bharata B Rao @ 2011-02-18  8:10 UTC (permalink / raw
  To: Balbir Singh
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Fri, Feb 18, 2011 at 12:49:04PM +0530, Balbir Singh wrote:
> * Paul Turner <pjt@google.com> [2011-02-15 19:18:35]:
> 
> > At the start of a new period there are several actions we must take:
> > - Refresh global bandwidth pool
> > - Unthrottle entities who ran out of quota as refreshed bandwidth permits
> > 
> > Unthrottled entities have the cfs_rq->throttled flag set and are re-enqueued
> > into the cfs entity hierarchy.
> > 
> > sched_rt_period_mask() is refactored slightly into sched_bw_period_mask()
> > since it is now shared by both cfs and rt bandwidth period timers.
> > 
> > The !CONFIG_RT_GROUP_SCHED && CONFIG_SMP case has been collapsed to use
> > rd->span instead of cpu_online_mask since I think that was incorrect before
> > (don't want to hit cpu's outside of your root_domain for RT bandwidth).
> > 
> > Signed-off-by: Paul Turner <pjt@google.com>
> > Signed-off-by: Nikhil Rao <ncrao@google.com>
> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > ---
> >  kernel/sched.c      |   16 +++++++++++
> >  kernel/sched_fair.c |   74 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  kernel/sched_rt.c   |   19 -------------
> >  3 files changed, 90 insertions(+), 19 deletions(-)
> > 
> > Index: tip/kernel/sched.c
> > ===================================================================
> > --- tip.orig/kernel/sched.c
> > +++ tip/kernel/sched.c
> > @@ -1561,6 +1561,8 @@ static int tg_nop(struct task_group *tg,
> >  }
> >  #endif
> > 
> > +static inline const struct cpumask *sched_bw_period_mask(void);
> > +
> >  #ifdef CONFIG_SMP
> >  /* Used instead of source_load when we know the type == 0 */
> >  static unsigned long weighted_cpuload(const int cpu)
> > @@ -8503,6 +8505,18 @@ void set_curr_task(int cpu, struct task_
> > 
> >  #endif
> > 
> > +#ifdef CONFIG_SMP
> > +static inline const struct cpumask *sched_bw_period_mask(void)
> > +{
> > +	return cpu_rq(smp_processor_id())->rd->span;
> > +}
> > +#else
> > +static inline const struct cpumask *sched_bw_period_mask(void)
> > +{
> > +	return cpu_online_mask;
> > +}
> > +#endif
> > +
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >  static void free_fair_sched_group(struct task_group *tg)
> >  {
> > @@ -9240,6 +9254,8 @@ static int tg_set_cfs_bandwidth(struct t
> > 
> >  		raw_spin_lock_irq(&rq->lock);
> >  		init_cfs_rq_quota(cfs_rq);
> > +		if (cfs_rq_throttled(cfs_rq))
> > +			unthrottle_cfs_rq(cfs_rq);
> >  		raw_spin_unlock_irq(&rq->lock);
> >  	}
> >  	mutex_unlock(&mutex);
> > Index: tip/kernel/sched_fair.c
> > ===================================================================
> > --- tip.orig/kernel/sched_fair.c
> > +++ tip/kernel/sched_fair.c
> > @@ -327,6 +327,13 @@ static inline u64 sched_cfs_bandwidth_sl
> >  	return (u64)sysctl_sched_cfs_bandwidth_slice * NSEC_PER_USEC;
> >  }
> > 
> > +static inline
> > +struct cfs_rq *cfs_bandwidth_cfs_rq(struct cfs_bandwidth *cfs_b, int cpu)
> > +{
> > +	return container_of(cfs_b, struct task_group,
> > +			cfs_bandwidth)->cfs_rq[cpu];
> > +}
> > +
> >  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> >  {
> >  	return &tg->cfs_bandwidth;
> > @@ -1513,6 +1520,33 @@ out_throttled:
> >  	update_cfs_rq_load_contribution(cfs_rq, 1);
> >  }
> > 
> > +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > +{
> > +	struct rq *rq = rq_of(cfs_rq);
> > +	struct sched_entity *se;
> > +
> > +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > +
> > +	update_rq_clock(rq);
> > +	/* (Try to) avoid maintaining share statistics for idle time */
> > +	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
> > +
> > +	cfs_rq->throttled = 0;
> > +	for_each_sched_entity(se) {
> > +		if (se->on_rq)
> > +			break;
> > +
> > +		cfs_rq = cfs_rq_of(se);
> > +		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> > +		if (cfs_rq_throttled(cfs_rq))
> > +			break;
> > +	}
> > +
> > +	/* determine whether we need to wake up potentially idle cpu */
> > +	if (rq->curr == rq->idle && rq->cfs.nr_running)
> > +		resched_task(rq->curr);
> > +}
> > +
> >  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> >  		unsigned long delta_exec)
> >  {
> > @@ -1535,8 +1569,46 @@ static void account_cfs_rq_quota(struct 
> > 
> >  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> >  {
> > -	return 1;
> > +	int i, idle = 1;
> > +	u64 delta;
> > +	const struct cpumask *span;
> > +
> > +	if (cfs_b->quota == RUNTIME_INF)
> > +		return 1;
> > +
> > +	/* reset group quota */
> > +	raw_spin_lock(&cfs_b->lock);
> > +	cfs_b->runtime = cfs_b->quota;
> > +	raw_spin_unlock(&cfs_b->lock);
> > +
> > +	span = sched_bw_period_mask();
> > +	for_each_cpu(i, span) {
> > +		struct rq *rq = cpu_rq(i);
> > +		struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
> > +
> > +		if (cfs_rq->nr_running)
> > +			idle = 0;
> > +
> > +		if (!cfs_rq_throttled(cfs_rq))
> > +			continue;
> 
> This is an interesting situation: does this mean we got scheduled in
> between periods with enough quota left to last us across the period?
> Should we be donating quota back to the global pool?

I see two possibilities for how this can happen:

1. The tasks in the group were dequeued (due to sleep) before utilizing
the full quota within the period.

2. The tasks in the group couldn't fully utilize the available quota before
the period timer fired.

Note that the period timer fires every period as long as the group has
active tasks.

In either case, we reset the global pool.

We could potentially return the local slice back to the global pool. I had a
patch in v3 to cover case 1, but we decided to get the base functionality
accepted before adding such optimizations.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
                   ` (6 preceding siblings ...)
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 7/7] sched: add documentation for bandwidth control Paul Turner
@ 2011-02-21  2:47 ` Xiao Guangrong
  2011-02-22 10:28   ` Bharata B Rao
  2011-02-23  7:42   ` Paul Turner
       [not found] ` <20110224161111.7d83a884@jacob-laptop>
  8 siblings, 2 replies; 71+ messages in thread
From: Xiao Guangrong @ 2011-02-21  2:47 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

[-- Attachment #1: Type: text/plain, Size: 4386 bytes --]

On 02/16/2011 11:18 AM, Paul Turner wrote:
> Hi all,
> 
> Please find attached v4 of CFS bandwidth control; while this rebase against
> some of the latest SCHED_NORMAL code is new, the features and methodology are
> fairly mature at this point and have proved both effective and stable for
> several workloads.
> 
> As always, all comments/feedback welcome.
> 

Hi Paul,

Thanks for the great features!

I applied the patchset to the kvm tree, then tested with a kvm guest;
unfortunately, it does not seem to work normally.

The steps are as follows:

# mount -t cgroup -o cpu none /mnt/
# qemu-system-x86_64 -enable-kvm  -smp 4 -m 512M -drive file=fc64.img,index=0,media=disk

Without doing any configuration in the cgroup, I ran the kvm guest directly (not
through libvirt). The guest booted very slowly and I saw some "soft lockup" bugs
reported in the guest; I also noticed that one CPU's usage was 100% for more than
60s while the other CPUs were at 10%~30% on the host while the guest was booting.

And if cgroup is not mounted, the guest runs well.

The kernel config file is attached and my system cpu info is:

# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 30
model name	: Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
stepping	: 5
cpu MHz		: 1197.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5584.73
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 30
model name	: Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
stepping	: 5
cpu MHz		: 1197.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5585.03
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 30
model name	: Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
stepping	: 5
cpu MHz		: 1197.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5585.03
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 30
model name	: Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
stepping	: 5
cpu MHz		: 1197.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5585.03
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:



[-- Attachment #2: .config --]
[-- Type: text/plain, Size: 92992 bytes --]

#
# Automatically generated make config: don't edit
# Linux/x86_64 2.6.38-rc4 Kernel Configuration
# Mon Feb 21 09:42:01 2011
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
# CONFIG_KTIME_SCALAR is not set
CONFIG_ARCH_CPU_PROBE_RELEASE=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_CONSTRUCTORS=y
CONFIG_HAVE_IRQ_WORK=y
CONFIG_IRQ_WORK=y

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y
CONFIG_HAVE_GENERIC_HARDIRQS=y

#
# IRQ subsystem
#
CONFIG_GENERIC_HARDIRQS=y
# CONFIG_GENERIC_HARDIRQS_NO_DEPRECATED is not set
CONFIG_HAVE_SPARSE_IRQ=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
# CONFIG_AUTO_IRQ_AFFINITY is not set
# CONFIG_IRQ_PER_CPU is not set
# CONFIG_HARDIRQS_SW_RESEND is not set
CONFIG_SPARSE_IRQ=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=64
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_RCU_FAST_NO_HZ=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_RESOURCE_COUNTERS=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_MM_OWNER=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
# CONFIG_EXPERT is not set
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
CONFIG_PERF_COUNTERS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=m
CONFIG_OPROFILE_EVENT_MULTIPLEX=y
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
# CONFIG_JUMP_LABEL is not set
CONFIG_OPTPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_THROTTLING=y
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_INLINE_SPIN_TRYLOCK is not set
# CONFIG_INLINE_SPIN_TRYLOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK is not set
# CONFIG_INLINE_SPIN_LOCK_BH is not set
# CONFIG_INLINE_SPIN_LOCK_IRQ is not set
# CONFIG_INLINE_SPIN_LOCK_IRQSAVE is not set
# CONFIG_INLINE_SPIN_UNLOCK is not set
# CONFIG_INLINE_SPIN_UNLOCK_BH is not set
# CONFIG_INLINE_SPIN_UNLOCK_IRQ is not set
# CONFIG_INLINE_SPIN_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_READ_TRYLOCK is not set
# CONFIG_INLINE_READ_LOCK is not set
# CONFIG_INLINE_READ_LOCK_BH is not set
# CONFIG_INLINE_READ_LOCK_IRQ is not set
# CONFIG_INLINE_READ_LOCK_IRQSAVE is not set
# CONFIG_INLINE_READ_UNLOCK is not set
# CONFIG_INLINE_READ_UNLOCK_BH is not set
# CONFIG_INLINE_READ_UNLOCK_IRQ is not set
# CONFIG_INLINE_READ_UNLOCK_IRQRESTORE is not set
# CONFIG_INLINE_WRITE_TRYLOCK is not set
# CONFIG_INLINE_WRITE_LOCK is not set
# CONFIG_INLINE_WRITE_LOCK_BH is not set
# CONFIG_INLINE_WRITE_LOCK_IRQ is not set
# CONFIG_INLINE_WRITE_LOCK_IRQSAVE is not set
# CONFIG_INLINE_WRITE_UNLOCK is not set
# CONFIG_INLINE_WRITE_UNLOCK_BH is not set
# CONFIG_INLINE_WRITE_UNLOCK_IRQ is not set
# CONFIG_INLINE_WRITE_UNLOCK_IRQRESTORE is not set
# CONFIG_MUTEX_SPIN_ON_OWNER is not set
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_XEN=y
CONFIG_XEN_DOM0=y
CONFIG_XEN_PRIVILEGED_GUEST=y
CONFIG_XEN_PVHVM=y
CONFIG_XEN_MAX_DOMAIN_MEMORY=128
CONFIG_XEN_SAVE_RESTORE=y
CONFIG_XEN_DEBUG_FS=y
CONFIG_KVM_CLOCK=y
CONFIG_KVM_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_PARAVIRT_DEBUG is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=7
CONFIG_X86_CMPXCHG=y
CONFIG_CMPXCHG_LOCAL=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_IOMMU_API=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=256
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_I8K=m
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=9
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_MEMORY_PROBE=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=999999
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_HWPOISON_INJECT=m
# CONFIG_TRANSPARENT_HUGEPAGE is not set
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_EFI=y
CONFIG_SECCOMP=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_ADVANCED_DEBUG=y
# CONFIG_PM_VERBOSE is not set
CONFIG_CAN_PM_TRACE=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_PM_SLEEP_SMP=y
CONFIG_PM_SLEEP=y
# CONFIG_PM_SLEEP_ADVANCED_DEBUG is not set
CONFIG_SUSPEND=y
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_RUNTIME=y
CONFIG_PM_OPS=y
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_PROCFS=y
CONFIG_ACPI_PROCFS_POWER=y
CONFIG_ACPI_POWER_METER=m
# CONFIG_ACPI_EC_DEBUGFS is not set
# CONFIG_ACPI_PROC_EVENT is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
# CONFIG_ACPI_IPMI is not set
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_PROCESSOR_AGGREGATOR=m
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_MEMORY=m
CONFIG_ACPI_SBS=m
CONFIG_ACPI_HED=m
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=m
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
CONFIG_SFI=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=m
CONFIG_CPU_FREQ_DEBUG=y
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_POWERNOW_K8=m
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_P4_CLOCKMOD=m

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=m
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
CONFIG_INTEL_IDLE=y

#
# Memory power savings
#
CONFIG_I7300_IDLE_IOAT_CHANNEL=y
CONFIG_I7300_IDLE=m

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_XEN=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
CONFIG_DMAR=y
CONFIG_DMAR_DEFAULT_ON=y
CONFIG_DMAR_FLOPPY_WA=y
CONFIG_INTR_REMAP=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
CONFIG_PCIE_ECRC=y
CONFIG_PCIEAER_INJECT=m
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIE_PME=y
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_DEBUG is not set
CONFIG_PCI_STUB=y
CONFIG_XEN_PCIDEV_FRONTEND=y
CONFIG_HT_IRQ=y
CONFIG_PCI_IOV=y
CONFIG_PCI_IOAPIC=y
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
CONFIG_YENTA=m
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
CONFIG_PD6729=m
CONFIG_I82092=m
CONFIG_PCCARD_NONSTATIC=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_FAKE=m
CONFIG_HOTPLUG_PCI_ACPI=y
CONFIG_HOTPLUG_PCI_ACPI_IBM=m
# CONFIG_HOTPLUG_PCI_CPCI is not set
CONFIG_HOTPLUG_PCI_SHPC=m

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_HAVE_TEXT_POKE_SMP=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_IPCOMP=m
CONFIG_NET_KEY=m
CONFIG_NET_KEY_MIGRATE=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_ARPD=y
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=m
CONFIG_INET_XFRM_MODE_TUNNEL=m
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
CONFIG_TCP_CONG_YEAH=m
CONFIG_TCP_CONG_ILLINOIS=m
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
# CONFIG_IPV6 is not set
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=m
CONFIG_NF_CONNTRACK_MARK=y
# CONFIG_NF_CONNTRACK_SECMARK is not set
# CONFIG_NF_CONNTRACK_EVENTS is not set
# CONFIG_NF_CT_PROTO_DCCP is not set
# CONFIG_NF_CT_PROTO_SCTP is not set
# CONFIG_NF_CT_PROTO_UDPLITE is not set
# CONFIG_NF_CONNTRACK_AMANDA is not set
# CONFIG_NF_CONNTRACK_FTP is not set
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
# CONFIG_NF_CT_NETLINK is not set
# CONFIG_NETFILTER_TPROXY is not set
CONFIG_NETFILTER_XTABLES=m

#
# Xtables combined modules
#
CONFIG_NETFILTER_XT_MARK=m
CONFIG_NETFILTER_XT_CONNMARK=m

#
# Xtables targets
#
CONFIG_NETFILTER_XT_TARGET_CHECKSUM=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
# CONFIG_NETFILTER_XT_TARGET_CT is not set
# CONFIG_NETFILTER_XT_TARGET_DSCP is not set
CONFIG_NETFILTER_XT_TARGET_HL=m
CONFIG_NETFILTER_XT_TARGET_IDLETIMER=m
CONFIG_NETFILTER_XT_TARGET_LED=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
CONFIG_NETFILTER_XT_TARGET_RATEEST=m
CONFIG_NETFILTER_XT_TARGET_TEE=m
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m

#
# Xtables matches
#
CONFIG_NETFILTER_XT_MATCH_CLUSTER=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_CPU=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_HL=m
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_OSF=m
CONFIG_NETFILTER_XT_MATCH_OWNER=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_RATEEST=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
CONFIG_NETFILTER_XT_MATCH_RECENT=m
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_NF_NAT_SNMP_BASIC=m
# CONFIG_NF_NAT_FTP is not set
# CONFIG_NF_NAT_IRC is not set
# CONFIG_NF_NAT_TFTP is not set
# CONFIG_NF_NAT_AMANDA is not set
# CONFIG_NF_NAT_PPTP is not set
# CONFIG_NF_NAT_H323 is not set
# CONFIG_NF_NAT_SIP is not set
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
CONFIG_NET_SCTPPROBE=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
CONFIG_SCTP_HMAC_SHA1=y
# CONFIG_SCTP_HMAC_MD5 is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=m
CONFIG_GARP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_NET_DSA=y
CONFIG_NET_DSA_TAG_DSA=y
CONFIG_NET_DSA_TAG_EDSA=y
CONFIG_NET_DSA_TAG_TRAILER=y
CONFIG_NET_DSA_MV88E6XXX=y
CONFIG_NET_DSA_MV88E6060=y
CONFIG_NET_DSA_MV88E6XXX_NEED_PPU=y
CONFIG_NET_DSA_MV88E6131=y
CONFIG_NET_DSA_MV88E6123_61_65=y
CONFIG_VLAN_8021Q=m
CONFIG_VLAN_8021Q_GVRP=y
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
CONFIG_WAN_ROUTER=m
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
# CONFIG_NET_SCHED is not set
CONFIG_NET_CLS_ROUTE=y
CONFIG_DCB=y
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
CONFIG_RPS=y
CONFIG_XPS=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_NET_DROP_MONITOR is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
# CONFIG_WIMAX is not set
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_SYS_HYPERVISOR is not set
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_SERIAL=m
# CONFIG_PARPORT_PC_FIFO is not set
# CONFIG_PARPORT_PC_SUPERIO is not set
CONFIG_PARPORT_PC_PCMCIA=m
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PARPORT_NOT_PC=y
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_FD is not set
CONFIG_PARIDE=m

#
# Parallel IDE high-level drivers
#
CONFIG_PARIDE_PD=m
CONFIG_PARIDE_PCD=m
CONFIG_PARIDE_PF=m
CONFIG_PARIDE_PT=m
CONFIG_PARIDE_PG=m

#
# Parallel IDE protocol modules
#
# CONFIG_PARIDE_ATEN is not set
# CONFIG_PARIDE_BPCK is not set
# CONFIG_PARIDE_COMM is not set
# CONFIG_PARIDE_DSTR is not set
# CONFIG_PARIDE_FIT2 is not set
# CONFIG_PARIDE_FIT3 is not set
# CONFIG_PARIDE_EPAT is not set
# CONFIG_PARIDE_EPIA is not set
# CONFIG_PARIDE_FRIQ is not set
# CONFIG_PARIDE_FRPW is not set
# CONFIG_PARIDE_KBIC is not set
# CONFIG_PARIDE_KTTI is not set
# CONFIG_PARIDE_ON20 is not set
# CONFIG_PARIDE_ON26 is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_OSD is not set
CONFIG_BLK_DEV_SX8=m
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=16384
# CONFIG_BLK_DEV_XIP is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
# CONFIG_XEN_BLKDEV_FRONTEND is not set
# CONFIG_VIRTIO_BLK is not set
# CONFIG_BLK_DEV_HD is not set
# CONFIG_BLK_DEV_RBD is not set
CONFIG_MISC_DEVICES=y
# CONFIG_AD525X_DPOT is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
CONFIG_TIFM_CORE=m
# CONFIG_TIFM_7XX1 is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_CS5535_MFGPT is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1780 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_VMWARE_BALLOON is not set
# CONFIG_BMP085 is not set
# CONFIG_PCH_PHUB is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
CONFIG_EEPROM_AT24=m
CONFIG_EEPROM_LEGACY=m
CONFIG_EEPROM_MAX6875=m
CONFIG_EEPROM_93CX6=m
CONFIG_CB710_CORE=m
# CONFIG_CB710_DEBUG is not set
CONFIG_CB710_DEBUG_ASSUMPTIONS=y
# CONFIG_IWMC3200TOP is not set

#
# Texas Instruments shared transport line discipline
#
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_CHR_DEV_OSST=m
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_FC_TGT_ATTRS=y
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
# CONFIG_SCSI_SAS_LIBSAS_DEBUG is not set
CONFIG_SCSI_SRP_ATTRS=m
CONFIG_SCSI_SRP_TGT_ATTRS=y
CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_ISCSI_BOOT_SYSFS=m
CONFIG_SCSI_BNX2_ISCSI=m
CONFIG_BE2ISCSI=m
CONFIG_BLK_DEV_3W_XXXX_RAID=m
CONFIG_SCSI_HPSA=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_3W_SAS=m
CONFIG_SCSI_ACARD=m
CONFIG_SCSI_AACRAID=m
CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=4
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC7XXX_OLD=m
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
CONFIG_SCSI_AIC94XX=m
# CONFIG_AIC94XX_DEBUG is not set
CONFIG_SCSI_MVSAS=m
# CONFIG_SCSI_MVSAS_DEBUG is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
CONFIG_SCSI_ARCMSR=m
CONFIG_SCSI_ARCMSR_AER=y
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=m
CONFIG_MEGARAID_MAILBOX=m
CONFIG_MEGARAID_LEGACY=m
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_MPT2SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS_LOGGING=y
CONFIG_SCSI_HPTIOP=m
CONFIG_SCSI_BUSLOGIC=m
CONFIG_VMWARE_PVSCSI=m
CONFIG_LIBFC=m
CONFIG_LIBFCOE=m
CONFIG_FCOE=m
CONFIG_FCOE_FNIC=m
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
CONFIG_SCSI_GDTH=m
CONFIG_SCSI_IPS=m
CONFIG_SCSI_INITIO=m
CONFIG_SCSI_INIA100=m
CONFIG_SCSI_PPA=m
CONFIG_SCSI_IMM=m
# CONFIG_SCSI_IZIP_EPP16 is not set
# CONFIG_SCSI_IZIP_SLOW_CTR is not set
CONFIG_SCSI_STEX=m
CONFIG_SCSI_SYM53C8XX_2=m
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
# CONFIG_SCSI_IPR is not set
CONFIG_SCSI_QLOGIC_1280=m
CONFIG_SCSI_QLA_FC=m
CONFIG_SCSI_QLA_ISCSI=m
CONFIG_SCSI_LPFC=m
# CONFIG_SCSI_LPFC_DEBUG_FS is not set
CONFIG_SCSI_DC395x=m
CONFIG_SCSI_DC390T=m
CONFIG_SCSI_DEBUG=m
CONFIG_SCSI_PMCRAID=m
CONFIG_SCSI_PM8001=m
CONFIG_SCSI_SRP=m
CONFIG_SCSI_BFA_FC=m
CONFIG_SCSI_LOWLEVEL_PCMCIA=y
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
CONFIG_PCMCIA_QLOGIC=m
CONFIG_PCMCIA_SYM53C500=m
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=m
CONFIG_SCSI_DH_HP_SW=m
CONFIG_SCSI_DH_EMC=m
CONFIG_SCSI_DH_ALUA=m
CONFIG_SCSI_OSD_INITIATOR=m
CONFIG_SCSI_OSD_ULD=m
CONFIG_SCSI_OSD_DPRINT_SENSE=1
# CONFIG_SCSI_OSD_DEBUG is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_AHCI_PLATFORM=m
CONFIG_SATA_INIC162X=m
# CONFIG_SATA_ACARD_AHCI is not set
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_SX4=m
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5536 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OLDPIIX=m
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
CONFIG_PATA_PDC_OLD=m
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SC1200 is not set
CONFIG_PATA_SCH=m
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
CONFIG_PATA_SIS=m
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
CONFIG_PATA_MPIIX=m
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=m
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
# CONFIG_MULTICORE_RAID456 is not set
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
CONFIG_BLK_DEV_DM=y
CONFIG_DM_DEBUG=y
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
# CONFIG_DM_RAID is not set
CONFIG_DM_LOG_USERSPACE=m
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_QL=m
CONFIG_DM_MULTIPATH_ST=m
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_BONDING=m
CONFIG_MACVLAN=m
CONFIG_MACVTAP=m
CONFIG_EQUALIZER=m
CONFIG_TUN=m
CONFIG_VETH=m
CONFIG_NET_SB1000=m
# CONFIG_ARCNET is not set
CONFIG_MII=m
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
CONFIG_MARVELL_PHY=m
CONFIG_DAVICOM_PHY=m
CONFIG_QSEMI_PHY=m
CONFIG_LXT_PHY=m
CONFIG_CICADA_PHY=m
CONFIG_VITESSE_PHY=m
CONFIG_SMSC_PHY=m
CONFIG_BROADCOM_PHY=m
# CONFIG_BCM63XX_PHY is not set
CONFIG_ICPLUS_PHY=m
CONFIG_REALTEK_PHY=m
CONFIG_NATIONAL_PHY=m
CONFIG_STE10XP=m
CONFIG_LSI_ET1011C_PHY=m
CONFIG_MICREL_PHY=m
CONFIG_FIXED_PHY=y
CONFIG_MDIO_BITBANG=m
# CONFIG_MDIO_GPIO is not set
CONFIG_NET_ETHERNET=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_ETHOC is not set
# CONFIG_DNET is not set
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_KSZ884X_PCI is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
CONFIG_E100=m
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R6040 is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC9420 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_KS8842 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_SC92031 is not set
# CONFIG_NET_POCKET is not set
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
CONFIG_E1000=m
CONFIG_E1000E=m
# CONFIG_IP1000 is not set
CONFIG_IGB=m
CONFIG_IGB_DCA=y
CONFIG_IGBVF=m
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
CONFIG_BNX2=m
CONFIG_CNIC=m
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_JME is not set
# CONFIG_STMMAC_ETH is not set
# CONFIG_PCH_GBE is not set
# CONFIG_NETDEV_10000 is not set
# CONFIG_TR is not set
# CONFIG_WLAN is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_IPHETH is not set
# CONFIG_NET_PCMCIA is not set
# CONFIG_WAN is not set

#
# CAIF transport drivers
#
CONFIG_XEN_NETDEV_FRONTEND=m
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PLIP is not set
CONFIG_PPP=m
# CONFIG_PPP_MULTILINK is not set
# CONFIG_PPP_FILTER is not set
# CONFIG_PPP_ASYNC is not set
# CONFIG_PPP_SYNC_TTY is not set
# CONFIG_PPP_DEFLATE is not set
# CONFIG_PPP_BSDCOMP is not set
# CONFIG_PPP_MPPE is not set
# CONFIG_PPPOE is not set
# CONFIG_SLIP is not set
CONFIG_SLHC=m
CONFIG_NET_FC=y
CONFIG_NETCONSOLE=m
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NETPOLL_TRAP=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_VIRTIO_NET=m
CONFIG_VMXNET3=m
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=m
CONFIG_INPUT_SPARSEKMAP=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set
# CONFIG_XEN_KBDDEV_FRONTEND is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ADP5588=m
CONFIG_KEYBOARD_ATKBD=y
CONFIG_KEYBOARD_QT2160=m
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_GPIO_POLLED is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8323 is not set
CONFIG_KEYBOARD_MAX7359=m
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_NEWTON is not set
CONFIG_KEYBOARD_OPENCORES=m
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_INPUT_MOUSE is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
CONFIG_INPUT_PCSPKR=m
CONFIG_INPUT_APANEL=m
CONFIG_INPUT_ATLAS_BTNS=m
CONFIG_INPUT_ATI_REMOTE=m
CONFIG_INPUT_ATI_REMOTE2=m
CONFIG_INPUT_KEYSPAN_REMOTE=m
CONFIG_INPUT_POWERMATE=m
CONFIG_INPUT_YEALINK=m
CONFIG_INPUT_CM109=m
CONFIG_INPUT_UINPUT=m
# CONFIG_INPUT_PCF8574 is not set
CONFIG_INPUT_GPIO_ROTARY_ENCODER=m
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_CMA3000 is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
CONFIG_SERIO_ALTERA_PS2=m
# CONFIG_SERIO_PS2MULT is not set
CONFIG_GAMEPORT=m
CONFIG_GAMEPORT_NS558=m
CONFIG_GAMEPORT_L4=m
CONFIG_GAMEPORT_EMU10K1=m
CONFIG_GAMEPORT_FM801=m

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
# CONFIG_DEVKMEM is not set
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_COMPUTONE is not set
CONFIG_ROCKETPORT=m
CONFIG_CYCLADES=m
# CONFIG_CYZ_INTR is not set
# CONFIG_DIGIEPCA is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_ISI is not set
CONFIG_SYNCLINK=m
CONFIG_SYNCLINKMP=m
CONFIG_SYNCLINK_GT=m
CONFIG_N_HDLC=m
CONFIG_N_GSM=m
# CONFIG_RISCOM8 is not set
# CONFIG_SPECIALIX is not set
# CONFIG_STALDRV is not set
CONFIG_NOZOMI=m

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CS=m
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_MFD_HSU is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_JSM=m
# CONFIG_SERIAL_TIMBERDALE is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_PCH_UART is not set
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
CONFIG_LP_CONSOLE=y
CONFIG_PPDEV=m
CONFIG_HVC_DRIVER=y
CONFIG_HVC_IRQ=y
CONFIG_HVC_XEN=y
CONFIG_VIRTIO_CONSOLE=m
CONFIG_IPMI_HANDLER=m
# CONFIG_IPMI_PANIC_EVENT is not set
CONFIG_IPMI_DEVICE_INTERFACE=m
CONFIG_IPMI_SI=m
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_TIMERIOMEM=m
CONFIG_HW_RANDOM_INTEL=m
CONFIG_HW_RANDOM_AMD=m
CONFIG_HW_RANDOM_VIA=m
CONFIG_HW_RANDOM_VIRTIO=m
CONFIG_NVRAM=y
CONFIG_R3964=m
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
CONFIG_CARDMAN_4000=m
CONFIG_CARDMAN_4040=m
CONFIG_IPWIRELESS=m
CONFIG_MWAVE=m
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=8192
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
CONFIG_HANGCHECK_TIMER=m
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y
CONFIG_TCG_NSC=m
CONFIG_TCG_ATMEL=m
CONFIG_TCG_INFINEON=m
CONFIG_TELCLOCK=m
CONFIG_DEVPORT=y
# CONFIG_RAMOOPS is not set
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
CONFIG_I2C_CHARDEV=m
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_SMBUS=m
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD756_S4882=m
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
CONFIG_I2C_ISCH=m
CONFIG_I2C_PIIX4=m
CONFIG_I2C_NFORCE2=m
CONFIG_I2C_NFORCE2_S4985=m
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
CONFIG_I2C_SIS96X=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# ACPI drivers
#
CONFIG_I2C_SCMI=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_INTEL_MID is not set
# CONFIG_I2C_OCORES is not set
CONFIG_I2C_PCA_PLATFORM=m
CONFIG_I2C_SIMTEC=m
# CONFIG_I2C_XILINX is not set
# CONFIG_I2C_EG20T is not set

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT=m
CONFIG_I2C_PARPORT_LIGHT=m
# CONFIG_I2C_TAOS_EVM is not set
CONFIG_I2C_TINY_USB=m

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_STUB=m
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set

#
# PPS support
#
# CONFIG_PPS is not set

#
# PPS generators support
#
# CONFIG_PPS_GENERATOR_PARPORT is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
# CONFIG_DEBUG_GPIO is not set
CONFIG_GPIO_SYSFS=y

#
# Memory mapped GPIO expanders:
#
# CONFIG_GPIO_BASIC_MMIO is not set
# CONFIG_GPIO_IT8761E is not set
CONFIG_GPIO_SCH=m
# CONFIG_GPIO_VX855 is not set

#
# I2C GPIO expanders:
#
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCF857X is not set
# CONFIG_GPIO_ADP5588 is not set

#
# PCI GPIO expanders:
#
# CONFIG_GPIO_CS5535 is not set
CONFIG_GPIO_LANGWELL=y
# CONFIG_GPIO_PCH is not set
# CONFIG_GPIO_ML_IOH is not set
# CONFIG_GPIO_RDC321X is not set

#
# SPI GPIO expanders:
#

#
# AC97 GPIO expanders:
#

#
# MODULbus GPIO expanders:
#
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_BQ20Z75 is not set
CONFIG_BATTERY_BQ27x00=m
CONFIG_BATTERY_MAX17040=m
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_ISP1704 is not set
# CONFIG_CHARGER_GPIO is not set
CONFIG_HWMON=y
CONFIG_HWMON_VID=y
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
CONFIG_SENSORS_F71882FG=y
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_GPIO_FAN is not set
CONFIG_SENSORS_CORETEMP=m
CONFIG_SENSORS_PKGTEMP=m
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
CONFIG_SENSORS_IT87=y
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_SENSORS_APPLESMC is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_LIS3LV02D is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_MFD_SUPPORT=y
CONFIG_MFD_CORE=m
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_UCB1400_CORE is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_CS5535 is not set
# CONFIG_MFD_TIMBERDALE is not set
CONFIG_LPC_SCH=m
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_REGULATOR is not set
CONFIG_MEDIA_SUPPORT=m

#
# Multimedia core support
#
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L2_COMMON=m
CONFIG_DVB_CORE=m
CONFIG_VIDEO_MEDIA=m

#
# Multimedia drivers
#
CONFIG_VIDEO_SAA7146=m
CONFIG_VIDEO_SAA7146_VV=m
CONFIG_RC_CORE=m
CONFIG_LIRC=m
CONFIG_RC_MAP=m
CONFIG_IR_NEC_DECODER=m
CONFIG_IR_RC5_DECODER=m
CONFIG_IR_RC6_DECODER=m
CONFIG_IR_JVC_DECODER=m
CONFIG_IR_SONY_DECODER=m
CONFIG_IR_RC5_SZ_DECODER=m
CONFIG_IR_LIRC_CODEC=m
CONFIG_IR_ENE=m
CONFIG_IR_IMON=m
CONFIG_IR_MCEUSB=m
CONFIG_IR_NUVOTON=m
CONFIG_IR_STREAMZAP=m
# CONFIG_IR_WINBOND_CIR is not set
# CONFIG_RC_LOOPBACK is not set
CONFIG_MEDIA_ATTACH=y
CONFIG_MEDIA_TUNER=m
CONFIG_MEDIA_TUNER_CUSTOMISE=y

#
# Customize TV tuners
#
CONFIG_MEDIA_TUNER_SIMPLE=m
CONFIG_MEDIA_TUNER_TDA8290=m
CONFIG_MEDIA_TUNER_TDA827X=m
CONFIG_MEDIA_TUNER_TDA18271=m
CONFIG_MEDIA_TUNER_TDA9887=m
CONFIG_MEDIA_TUNER_TEA5761=m
CONFIG_MEDIA_TUNER_TEA5767=m
CONFIG_MEDIA_TUNER_MT20XX=m
CONFIG_MEDIA_TUNER_MT2060=m
CONFIG_MEDIA_TUNER_MT2266=m
CONFIG_MEDIA_TUNER_MT2131=m
CONFIG_MEDIA_TUNER_QT1010=m
CONFIG_MEDIA_TUNER_XC2028=m
CONFIG_MEDIA_TUNER_XC5000=m
CONFIG_MEDIA_TUNER_MXL5005S=m
CONFIG_MEDIA_TUNER_MXL5007T=m
CONFIG_MEDIA_TUNER_MC44S803=m
CONFIG_MEDIA_TUNER_MAX2165=m
CONFIG_MEDIA_TUNER_TDA18218=m
CONFIG_VIDEO_V4L2=m
CONFIG_VIDEOBUF_GEN=m
CONFIG_VIDEOBUF_DMA_SG=m
CONFIG_VIDEOBUF_VMALLOC=m
CONFIG_VIDEOBUF_DVB=m
CONFIG_VIDEO_BTCX=m
CONFIG_VIDEO_TVEEPROM=m
CONFIG_VIDEO_TUNER=m
CONFIG_VIDEO_CAPTURE_DRIVERS=y
# CONFIG_VIDEO_ADV_DEBUG is not set
# CONFIG_VIDEO_FIXED_MINOR_RANGES is not set
CONFIG_VIDEO_HELPER_CHIPS_AUTO=y
CONFIG_VIDEO_IR_I2C=m

#
# Audio decoders
#
CONFIG_VIDEO_TVAUDIO=m
CONFIG_VIDEO_TDA7432=m
CONFIG_VIDEO_TDA9840=m
CONFIG_VIDEO_TEA6415C=m
CONFIG_VIDEO_TEA6420=m
CONFIG_VIDEO_MSP3400=m
CONFIG_VIDEO_CS5345=m
CONFIG_VIDEO_CS53L32A=m
CONFIG_VIDEO_M52790=m
CONFIG_VIDEO_WM8775=m
CONFIG_VIDEO_WM8739=m
CONFIG_VIDEO_VP27SMPX=m

#
# RDS decoders
#
CONFIG_VIDEO_SAA6588=m

#
# Video decoders
#
CONFIG_VIDEO_BT819=m
CONFIG_VIDEO_BT856=m
CONFIG_VIDEO_BT866=m
CONFIG_VIDEO_KS0127=m
CONFIG_VIDEO_MT9V011=m
CONFIG_VIDEO_SAA7110=m
CONFIG_VIDEO_SAA711X=m
CONFIG_VIDEO_SAA717X=m
CONFIG_VIDEO_TVP5150=m
CONFIG_VIDEO_VPX3220=m

#
# Video and audio decoders
#
CONFIG_VIDEO_CX25840=m

#
# MPEG video encoders
#
CONFIG_VIDEO_CX2341X=m

#
# Video encoders
#
CONFIG_VIDEO_SAA7127=m
CONFIG_VIDEO_SAA7185=m
CONFIG_VIDEO_ADV7170=m
CONFIG_VIDEO_ADV7175=m

#
# Video improvement chips
#
CONFIG_VIDEO_UPD64031A=m
CONFIG_VIDEO_UPD64083=m
# CONFIG_VIDEO_VIVI is not set
CONFIG_VIDEO_BT848=m
CONFIG_VIDEO_BT848_DVB=y
CONFIG_VIDEO_BWQCAM=m
CONFIG_VIDEO_CQCAM=m
CONFIG_VIDEO_W9966=m
CONFIG_VIDEO_CPIA2=m
CONFIG_VIDEO_ZORAN=m
CONFIG_VIDEO_ZORAN_DC30=m
CONFIG_VIDEO_ZORAN_ZR36060=m
CONFIG_VIDEO_ZORAN_BUZ=m
CONFIG_VIDEO_ZORAN_DC10=m
CONFIG_VIDEO_ZORAN_LML33=m
CONFIG_VIDEO_ZORAN_LML33R10=m
CONFIG_VIDEO_ZORAN_AVS6EYES=m
CONFIG_VIDEO_SAA7134=m
CONFIG_VIDEO_SAA7134_ALSA=m
CONFIG_VIDEO_SAA7134_RC=y
CONFIG_VIDEO_SAA7134_DVB=m
CONFIG_VIDEO_MXB=m
CONFIG_VIDEO_HEXIUM_ORION=m
CONFIG_VIDEO_HEXIUM_GEMINI=m
# CONFIG_VIDEO_TIMBERDALE is not set
CONFIG_VIDEO_CX88=m
CONFIG_VIDEO_CX88_ALSA=m
CONFIG_VIDEO_CX88_BLACKBIRD=m
CONFIG_VIDEO_CX88_DVB=m
CONFIG_VIDEO_CX88_MPEG=m
CONFIG_VIDEO_CX88_VP3054=m
CONFIG_VIDEO_CX23885=m
CONFIG_VIDEO_AU0828=m
CONFIG_VIDEO_IVTV=m
CONFIG_VIDEO_FB_IVTV=m
CONFIG_VIDEO_CX18=m
CONFIG_VIDEO_CX18_ALSA=m
CONFIG_VIDEO_SAA7164=m
# CONFIG_VIDEO_CAFE_CCIC is not set
# CONFIG_VIDEO_SR030PC30 is not set
# CONFIG_VIDEO_VIA_CAMERA is not set
CONFIG_SOC_CAMERA=m
# CONFIG_SOC_CAMERA_IMX074 is not set
CONFIG_SOC_CAMERA_MT9M001=m
CONFIG_SOC_CAMERA_MT9M111=m
CONFIG_SOC_CAMERA_MT9T031=m
CONFIG_SOC_CAMERA_MT9T112=m
CONFIG_SOC_CAMERA_MT9V022=m
CONFIG_SOC_CAMERA_RJ54N1=m
CONFIG_SOC_CAMERA_TW9910=m
CONFIG_SOC_CAMERA_PLATFORM=m
# CONFIG_SOC_CAMERA_OV2640 is not set
# CONFIG_SOC_CAMERA_OV6650 is not set
CONFIG_SOC_CAMERA_OV772X=m
CONFIG_SOC_CAMERA_OV9640=m
CONFIG_V4L_USB_DRIVERS=y
CONFIG_USB_VIDEO_CLASS=m
CONFIG_USB_VIDEO_CLASS_INPUT_EVDEV=y
CONFIG_USB_GSPCA=m
CONFIG_USB_M5602=m
CONFIG_USB_STV06XX=m
CONFIG_USB_GL860=m
CONFIG_USB_GSPCA_BENQ=m
CONFIG_USB_GSPCA_CONEX=m
CONFIG_USB_GSPCA_CPIA1=m
CONFIG_USB_GSPCA_ETOMS=m
CONFIG_USB_GSPCA_FINEPIX=m
CONFIG_USB_GSPCA_JEILINJ=m
# CONFIG_USB_GSPCA_KONICA is not set
CONFIG_USB_GSPCA_MARS=m
CONFIG_USB_GSPCA_MR97310A=m
CONFIG_USB_GSPCA_OV519=m
CONFIG_USB_GSPCA_OV534=m
CONFIG_USB_GSPCA_OV534_9=m
CONFIG_USB_GSPCA_PAC207=m
CONFIG_USB_GSPCA_PAC7302=m
CONFIG_USB_GSPCA_PAC7311=m
CONFIG_USB_GSPCA_SN9C2028=m
CONFIG_USB_GSPCA_SN9C20X=m
CONFIG_USB_GSPCA_SONIXB=m
CONFIG_USB_GSPCA_SONIXJ=m
CONFIG_USB_GSPCA_SPCA500=m
CONFIG_USB_GSPCA_SPCA501=m
CONFIG_USB_GSPCA_SPCA505=m
CONFIG_USB_GSPCA_SPCA506=m
CONFIG_USB_GSPCA_SPCA508=m
CONFIG_USB_GSPCA_SPCA561=m
# CONFIG_USB_GSPCA_SPCA1528 is not set
CONFIG_USB_GSPCA_SQ905=m
CONFIG_USB_GSPCA_SQ905C=m
# CONFIG_USB_GSPCA_SQ930X is not set
CONFIG_USB_GSPCA_STK014=m
CONFIG_USB_GSPCA_STV0680=m
CONFIG_USB_GSPCA_SUNPLUS=m
CONFIG_USB_GSPCA_T613=m
CONFIG_USB_GSPCA_TV8532=m
CONFIG_USB_GSPCA_VC032X=m
# CONFIG_USB_GSPCA_XIRLINK_CIT is not set
CONFIG_USB_GSPCA_ZC3XX=m
CONFIG_VIDEO_PVRUSB2=m
CONFIG_VIDEO_PVRUSB2_SYSFS=y
CONFIG_VIDEO_PVRUSB2_DVB=y
# CONFIG_VIDEO_PVRUSB2_DEBUGIFC is not set
CONFIG_VIDEO_HDPVR=m
CONFIG_VIDEO_EM28XX=m
CONFIG_VIDEO_EM28XX_ALSA=m
CONFIG_VIDEO_EM28XX_DVB=m
CONFIG_VIDEO_TLG2300=m
CONFIG_VIDEO_CX231XX=m
CONFIG_VIDEO_CX231XX_RC=y
CONFIG_VIDEO_CX231XX_ALSA=m
CONFIG_VIDEO_CX231XX_DVB=m
CONFIG_VIDEO_USBVISION=m
# CONFIG_USB_ET61X251 is not set
# CONFIG_USB_SN9C102 is not set
CONFIG_USB_PWC=m
# CONFIG_USB_PWC_DEBUG is not set
CONFIG_USB_PWC_INPUT_EVDEV=y
CONFIG_USB_ZR364XX=m
CONFIG_USB_STKWEBCAM=m
CONFIG_USB_S2255=m
CONFIG_V4L_MEM2MEM_DRIVERS=y
# CONFIG_VIDEO_MEM2MEM_TESTDEV is not set
CONFIG_RADIO_ADAPTERS=y
CONFIG_RADIO_MAXIRADIO=m
CONFIG_RADIO_MAESTRO=m
CONFIG_I2C_SI4713=m
CONFIG_RADIO_SI4713=m
CONFIG_USB_DSBR=m
CONFIG_RADIO_SI470X=y
CONFIG_USB_SI470X=m
CONFIG_I2C_SI470X=m
CONFIG_USB_MR800=m
# CONFIG_RADIO_TEA5764 is not set
# CONFIG_RADIO_SAA7706H is not set
# CONFIG_RADIO_TEF6862 is not set
# CONFIG_RADIO_WL1273 is not set
CONFIG_DVB_MAX_ADAPTERS=8
CONFIG_DVB_DYNAMIC_MINORS=y
CONFIG_DVB_CAPTURE_DRIVERS=y

#
# Supported SAA7146 based PCI Adapters
#
CONFIG_TTPCI_EEPROM=m
CONFIG_DVB_AV7110=m
CONFIG_DVB_AV7110_OSD=y
CONFIG_DVB_BUDGET_CORE=m
CONFIG_DVB_BUDGET=m
CONFIG_DVB_BUDGET_CI=m
CONFIG_DVB_BUDGET_AV=m
CONFIG_DVB_BUDGET_PATCH=m

#
# Supported USB Adapters
#
CONFIG_DVB_USB=m
# CONFIG_DVB_USB_DEBUG is not set
CONFIG_DVB_USB_A800=m
CONFIG_DVB_USB_DIBUSB_MB=m
# CONFIG_DVB_USB_DIBUSB_MB_FAULTY is not set
CONFIG_DVB_USB_DIBUSB_MC=m
CONFIG_DVB_USB_DIB0700=m
CONFIG_DVB_USB_UMT_010=m
CONFIG_DVB_USB_CXUSB=m
CONFIG_DVB_USB_M920X=m
CONFIG_DVB_USB_GL861=m
CONFIG_DVB_USB_AU6610=m
CONFIG_DVB_USB_DIGITV=m
CONFIG_DVB_USB_VP7045=m
CONFIG_DVB_USB_VP702X=m
CONFIG_DVB_USB_GP8PSK=m
CONFIG_DVB_USB_NOVA_T_USB2=m
CONFIG_DVB_USB_TTUSB2=m
CONFIG_DVB_USB_DTT200U=m
CONFIG_DVB_USB_OPERA1=m
CONFIG_DVB_USB_AF9005=m
CONFIG_DVB_USB_AF9005_REMOTE=m
CONFIG_DVB_USB_DW2102=m
CONFIG_DVB_USB_CINERGY_T2=m
CONFIG_DVB_USB_ANYSEE=m
CONFIG_DVB_USB_DTV5100=m
CONFIG_DVB_USB_AF9015=m
CONFIG_DVB_USB_CE6230=m
CONFIG_DVB_USB_FRIIO=m
CONFIG_DVB_USB_EC168=m
CONFIG_DVB_USB_AZ6027=m
# CONFIG_DVB_USB_LME2510 is not set
CONFIG_DVB_TTUSB_BUDGET=m
CONFIG_DVB_TTUSB_DEC=m
CONFIG_SMS_SIANO_MDTV=m

#
# Siano module components
#
CONFIG_SMS_USB_DRV=m
CONFIG_SMS_SDIO_DRV=m

#
# Supported FlexCopII (B2C2) Adapters
#
CONFIG_DVB_B2C2_FLEXCOP=m
CONFIG_DVB_B2C2_FLEXCOP_PCI=m
CONFIG_DVB_B2C2_FLEXCOP_USB=m
# CONFIG_DVB_B2C2_FLEXCOP_DEBUG is not set

#
# Supported BT878 Adapters
#
CONFIG_DVB_BT8XX=m

#
# Supported Pluto2 Adapters
#
CONFIG_DVB_PLUTO2=m

#
# Supported SDMC DM1105 Adapters
#
CONFIG_DVB_DM1105=m

#
# Supported Earthsoft PT1 Adapters
#
CONFIG_DVB_PT1=m

#
# Supported Mantis Adapters
#
CONFIG_MANTIS_CORE=m
CONFIG_DVB_MANTIS=m
CONFIG_DVB_HOPPER=m

#
# Supported nGene Adapters
#
CONFIG_DVB_NGENE=m

#
# Supported DVB Frontends
#
CONFIG_DVB_FE_CUSTOMISE=y

#
# Customise DVB Frontends
#

#
# Multistandard (satellite) frontends
#
CONFIG_DVB_STB0899=m
CONFIG_DVB_STB6100=m
CONFIG_DVB_STV090x=m
CONFIG_DVB_STV6110x=m

#
# DVB-S (satellite) frontends
#
CONFIG_DVB_CX24110=m
CONFIG_DVB_CX24123=m
CONFIG_DVB_MT312=m
CONFIG_DVB_ZL10036=m
CONFIG_DVB_ZL10039=m
CONFIG_DVB_S5H1420=m
CONFIG_DVB_STV0288=m
CONFIG_DVB_STB6000=m
CONFIG_DVB_STV0299=m
CONFIG_DVB_STV6110=m
CONFIG_DVB_STV0900=m
CONFIG_DVB_TDA8083=m
CONFIG_DVB_TDA10086=m
CONFIG_DVB_TDA8261=m
CONFIG_DVB_VES1X93=m
CONFIG_DVB_TUNER_ITD1000=m
CONFIG_DVB_TUNER_CX24113=m
CONFIG_DVB_TDA826X=m
CONFIG_DVB_TUA6100=m
CONFIG_DVB_CX24116=m
CONFIG_DVB_SI21XX=m
CONFIG_DVB_DS3000=m
CONFIG_DVB_MB86A16=m

#
# DVB-T (terrestrial) frontends
#
CONFIG_DVB_SP8870=m
CONFIG_DVB_SP887X=m
CONFIG_DVB_CX22700=m
CONFIG_DVB_CX22702=m
CONFIG_DVB_S5H1432=m
CONFIG_DVB_DRX397XD=m
CONFIG_DVB_L64781=m
CONFIG_DVB_TDA1004X=m
CONFIG_DVB_NXT6000=m
CONFIG_DVB_MT352=m
CONFIG_DVB_ZL10353=m
CONFIG_DVB_DIB3000MB=m
CONFIG_DVB_DIB3000MC=m
CONFIG_DVB_DIB7000M=m
CONFIG_DVB_DIB7000P=m
CONFIG_DVB_TDA10048=m
CONFIG_DVB_AF9013=m
CONFIG_DVB_EC100=m

#
# DVB-C (cable) frontends
#
CONFIG_DVB_VES1820=m
CONFIG_DVB_TDA10021=m
CONFIG_DVB_TDA10023=m
CONFIG_DVB_STV0297=m

#
# ATSC (North American/Korean Terrestrial/Cable DTV) frontends
#
CONFIG_DVB_NXT200X=m
CONFIG_DVB_OR51211=m
CONFIG_DVB_OR51132=m
CONFIG_DVB_BCM3510=m
CONFIG_DVB_LGDT330X=m
CONFIG_DVB_LGDT3305=m
CONFIG_DVB_S5H1409=m
CONFIG_DVB_AU8522=m
CONFIG_DVB_S5H1411=m

#
# ISDB-T (terrestrial) frontends
#
CONFIG_DVB_S921=m
CONFIG_DVB_DIB8000=m
CONFIG_DVB_MB86A20S=m

#
# Digital terrestrial only tuners/PLL
#
CONFIG_DVB_PLL=m
CONFIG_DVB_TUNER_DIB0070=m
CONFIG_DVB_TUNER_DIB0090=m

#
# SEC control devices for DVB-S
#
CONFIG_DVB_LNBP21=m
CONFIG_DVB_ISL6405=m
CONFIG_DVB_ISL6421=m
CONFIG_DVB_ISL6423=m
CONFIG_DVB_LGS8GL5=m
CONFIG_DVB_LGS8GXX=m
CONFIG_DVB_ATBM8830=m
CONFIG_DVB_TDA665x=m
CONFIG_DVB_IX2505V=m

#
# Tools to develop new frontends
#
CONFIG_DVB_DUMMY_FE=m

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_VGA_SWITCHEROO=y
CONFIG_DRM=m
CONFIG_DRM_KMS_HELPER=m
CONFIG_DRM_TTM=m
CONFIG_DRM_TDFX=m
CONFIG_DRM_R128=m
CONFIG_DRM_RADEON=m
CONFIG_DRM_RADEON_KMS=y
CONFIG_DRM_I810=m
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=m
CONFIG_DRM_I915_KMS=y
CONFIG_DRM_MGA=m
CONFIG_DRM_SIS=m
CONFIG_DRM_VIA=m
CONFIG_DRM_SAVAGE=m
# CONFIG_STUB_POULSBO is not set
CONFIG_VGASTATE=m
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_DDC=m
CONFIG_FB_BOOT_VESA_SUPPORT=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
CONFIG_FB_SYS_FILLRECT=y
CONFIG_FB_SYS_COPYAREA=y
CONFIG_FB_SYS_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
CONFIG_FB_SYS_FOPS=y
# CONFIG_FB_WMT_GE_ROPS is not set
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_SVGALIB=m
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
CONFIG_FB_CIRRUS=m
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
CONFIG_FB_VGA16=m
# CONFIG_FB_UVESA is not set
CONFIG_FB_VESA=y
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
CONFIG_FB_NVIDIA=m
CONFIG_FB_NVIDIA_I2C=y
# CONFIG_FB_NVIDIA_DEBUG is not set
CONFIG_FB_NVIDIA_BACKLIGHT=y
CONFIG_FB_RIVA=m
# CONFIG_FB_RIVA_I2C is not set
# CONFIG_FB_RIVA_DEBUG is not set
CONFIG_FB_RIVA_BACKLIGHT=y
# CONFIG_FB_LE80578 is not set
CONFIG_FB_MATROX=m
CONFIG_FB_MATROX_MILLENIUM=y
CONFIG_FB_MATROX_MYSTIQUE=y
CONFIG_FB_MATROX_G=y
CONFIG_FB_MATROX_I2C=m
CONFIG_FB_MATROX_MAVEN=m
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_BACKLIGHT=y
# CONFIG_FB_RADEON_DEBUG is not set
CONFIG_FB_ATY128=m
CONFIG_FB_ATY128_BACKLIGHT=y
CONFIG_FB_ATY=m
CONFIG_FB_ATY_CT=y
CONFIG_FB_ATY_GENERIC_LCD=y
CONFIG_FB_ATY_GX=y
CONFIG_FB_ATY_BACKLIGHT=y
CONFIG_FB_S3=m
CONFIG_FB_SAVAGE=m
CONFIG_FB_SAVAGE_I2C=y
CONFIG_FB_SAVAGE_ACCEL=y
# CONFIG_FB_SIS is not set
CONFIG_FB_VIA=m
# CONFIG_FB_VIA_DIRECT_PROCFS is not set
CONFIG_FB_NEOMAGIC=m
CONFIG_FB_KYRO=m
CONFIG_FB_3DFX=m
CONFIG_FB_3DFX_ACCEL=y
CONFIG_FB_3DFX_I2C=y
CONFIG_FB_VOODOO1=m
# CONFIG_FB_VT8623 is not set
CONFIG_FB_TRIDENT=m
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_TMIO is not set
# CONFIG_FB_UDL is not set
CONFIG_FB_VIRTUAL=m
CONFIG_XEN_FBDEV_FRONTEND=y
CONFIG_FB_METRONOME=m
CONFIG_FB_MB862XX=m
CONFIG_FB_MB862XX_PCI_GDC=y
# CONFIG_FB_BROADSHEET is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_LCD_CLASS_DEVICE=m
CONFIG_LCD_PLATFORM=m
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
CONFIG_BACKLIGHT_PROGEAR=m
CONFIG_BACKLIGHT_MBP_NVIDIA=m
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set

#
# Display device support
#
CONFIG_DISPLAY_SUPPORT=m

#
# Display hardware drivers
#

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
CONFIG_SOUND=m
CONFIG_SOUND_OSS_CORE=y
CONFIG_SOUND_OSS_CORE_PRECLAIM=y
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_JACK=y
CONFIG_SND_SEQUENCER=m
CONFIG_SND_SEQ_DUMMY=m
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
CONFIG_SND_HRTIMER=m
CONFIG_SND_SEQ_HRTIMER_DEFAULT=y
CONFIG_SND_DYNAMIC_MINORS=y
# CONFIG_SND_SUPPORT_OLD_API is not set
CONFIG_SND_VERBOSE_PROCFS=y
CONFIG_SND_VERBOSE_PRINTK=y
CONFIG_SND_DEBUG=y
# CONFIG_SND_DEBUG_VERBOSE is not set
CONFIG_SND_PCM_XRUN_DEBUG=y
CONFIG_SND_VMASTER=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_RAWMIDI_SEQ=m
CONFIG_SND_OPL3_LIB_SEQ=m
# CONFIG_SND_OPL4_LIB_SEQ is not set
# CONFIG_SND_SBAWE_SEQ is not set
CONFIG_SND_EMU10K1_SEQ=m
CONFIG_SND_MPU401_UART=m
CONFIG_SND_OPL3_LIB=m
CONFIG_SND_VX_LIB=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DRIVERS=y
CONFIG_SND_PCSP=m
CONFIG_SND_DUMMY=m
# CONFIG_SND_ALOOP is not set
CONFIG_SND_VIRMIDI=m
CONFIG_SND_MTPAV=m
CONFIG_SND_MTS64=m
CONFIG_SND_SERIAL_U16550=m
CONFIG_SND_MPU401=m
CONFIG_SND_PORTMAN2X4=m
CONFIG_SND_AC97_POWER_SAVE=y
CONFIG_SND_AC97_POWER_SAVE_DEFAULT=0
CONFIG_SND_SB_COMMON=m
CONFIG_SND_SB16_DSP=m
CONFIG_SND_PCI=y
CONFIG_SND_AD1889=m
CONFIG_SND_ALS300=m
CONFIG_SND_ALS4000=m
CONFIG_SND_ALI5451=m
CONFIG_SND_ASIHPI=m
CONFIG_SND_ATIIXP=m
CONFIG_SND_ATIIXP_MODEM=m
CONFIG_SND_AU8810=m
CONFIG_SND_AU8820=m
CONFIG_SND_AU8830=m
# CONFIG_SND_AW2 is not set
CONFIG_SND_AZT3328=m
CONFIG_SND_BT87X=m
# CONFIG_SND_BT87X_OVERCLOCK is not set
CONFIG_SND_CA0106=m
CONFIG_SND_CMIPCI=m
CONFIG_SND_OXYGEN_LIB=m
CONFIG_SND_OXYGEN=m
CONFIG_SND_CS4281=m
CONFIG_SND_CS46XX=m
CONFIG_SND_CS46XX_NEW_DSP=y
CONFIG_SND_CS5530=m
CONFIG_SND_CS5535AUDIO=m
CONFIG_SND_CTXFI=m
CONFIG_SND_DARLA20=m
CONFIG_SND_GINA20=m
CONFIG_SND_LAYLA20=m
CONFIG_SND_DARLA24=m
CONFIG_SND_GINA24=m
CONFIG_SND_LAYLA24=m
CONFIG_SND_MONA=m
CONFIG_SND_MIA=m
CONFIG_SND_ECHO3G=m
CONFIG_SND_INDIGO=m
CONFIG_SND_INDIGOIO=m
CONFIG_SND_INDIGODJ=m
CONFIG_SND_INDIGOIOX=m
CONFIG_SND_INDIGODJX=m
CONFIG_SND_EMU10K1=m
CONFIG_SND_EMU10K1X=m
CONFIG_SND_ENS1370=m
CONFIG_SND_ENS1371=m
CONFIG_SND_ES1938=m
CONFIG_SND_ES1968=m
CONFIG_SND_ES1968_INPUT=y
CONFIG_SND_FM801=m
CONFIG_SND_FM801_TEA575X_BOOL=y
CONFIG_SND_FM801_TEA575X=m
CONFIG_SND_HDA_INTEL=m
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_RECONFIG=y
CONFIG_SND_HDA_INPUT_BEEP=y
CONFIG_SND_HDA_INPUT_BEEP_MODE=0
CONFIG_SND_HDA_INPUT_JACK=y
CONFIG_SND_HDA_PATCH_LOADER=y
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_HDMI=y
CONFIG_SND_HDA_CODEC_CIRRUS=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CA0110=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
CONFIG_SND_HDSP=m
CONFIG_SND_HDSPM=m
CONFIG_SND_ICE1712=m
CONFIG_SND_ICE1724=m
CONFIG_SND_INTEL8X0=m
CONFIG_SND_INTEL8X0M=m
CONFIG_SND_KORG1212=m
CONFIG_SND_LX6464ES=m
CONFIG_SND_MAESTRO3=m
CONFIG_SND_MAESTRO3_INPUT=y
CONFIG_SND_MIXART=m
CONFIG_SND_NM256=m
CONFIG_SND_PCXHR=m
CONFIG_SND_RIPTIDE=m
CONFIG_SND_RME32=m
CONFIG_SND_RME96=m
CONFIG_SND_RME9652=m
CONFIG_SND_SONICVIBES=m
CONFIG_SND_TRIDENT=m
CONFIG_SND_VIA82XX=m
CONFIG_SND_VIA82XX_MODEM=m
CONFIG_SND_VIRTUOSO=m
CONFIG_SND_VX222=m
CONFIG_SND_YMFPCI=m
CONFIG_SND_USB=y
CONFIG_SND_USB_AUDIO=m
CONFIG_SND_USB_UA101=m
CONFIG_SND_USB_USX2Y=m
CONFIG_SND_USB_CAIAQ=m
CONFIG_SND_USB_CAIAQ_INPUT=y
CONFIG_SND_USB_US122L=m
CONFIG_SND_PCMCIA=y
CONFIG_SND_VXPOCKET=m
CONFIG_SND_PDAUDIOCF=m
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m
CONFIG_HID_SUPPORT=y
CONFIG_HID=y
CONFIG_HIDRAW=y

#
# USB Input Devices
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# Special HID drivers
#
CONFIG_HID_3M_PCT=y
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACRUX_FF is not set
CONFIG_HID_APPLE=y
CONFIG_HID_BELKIN=y
CONFIG_HID_CANDO=m
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
CONFIG_HID_PRODIKEYS=m
CONFIG_HID_CYPRESS=y
CONFIG_HID_DRAGONRISE=m
CONFIG_DRAGONRISE_FF=y
# CONFIG_HID_EMS_FF is not set
CONFIG_HID_EGALAX=m
CONFIG_HID_EZKEY=y
CONFIG_HID_KYE=y
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
CONFIG_HID_GYRATION=m
CONFIG_HID_TWINHAN=m
CONFIG_HID_KENSINGTON=y
CONFIG_HID_LOGITECH=y
CONFIG_LOGITECH_FF=y
CONFIG_LOGIRUMBLEPAD2_FF=y
CONFIG_LOGIG940_FF=y
# CONFIG_LOGIWII_FF is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MOSART=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
CONFIG_HID_NTRIG=y
CONFIG_HID_ORTEK=m
CONFIG_HID_PANTHERLORD=m
CONFIG_PANTHERLORD_FF=y
CONFIG_HID_PETALYNX=m
CONFIG_HID_PICOLCD=m
CONFIG_HID_PICOLCD_FB=y
CONFIG_HID_PICOLCD_BACKLIGHT=y
CONFIG_HID_PICOLCD_LCD=y
CONFIG_HID_PICOLCD_LEDS=y
CONFIG_HID_QUANTA=y
CONFIG_HID_ROCCAT=m
CONFIG_HID_ROCCAT_KONE=m
# CONFIG_HID_ROCCAT_KONEPLUS is not set
# CONFIG_HID_ROCCAT_PYRA is not set
CONFIG_HID_SAMSUNG=m
CONFIG_HID_SONY=m
CONFIG_HID_STANTUM=y
CONFIG_HID_SUNPLUS=m
CONFIG_HID_GREENASIA=m
CONFIG_GREENASIA_FF=y
CONFIG_HID_SMARTJOYPLUS=m
CONFIG_SMARTJOYPLUS_FF=y
CONFIG_HID_TOPSEED=m
CONFIG_HID_THRUSTMASTER=m
CONFIG_THRUSTMASTER_FF=y
CONFIG_HID_ZEROPLUS=m
CONFIG_ZEROPLUS_FF=y
CONFIG_HID_ZYDACRON=m
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_DEVICE_CLASS is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
CONFIG_USB_SUSPEND=y
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
CONFIG_USB_WUSB=m
CONFIG_USB_WUSB_CBAF=m
# CONFIG_USB_WUSB_CBAF_DEBUG is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=m
# CONFIG_USB_XHCI_HCD_DEBUGGING is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_ISP1362_HCD=m
CONFIG_USB_OHCI_HCD=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=y
CONFIG_USB_U132_HCD=m
CONFIG_USB_SL811_HCD=m
# CONFIG_USB_SL811_CS is not set
# CONFIG_USB_R8A66597_HCD is not set
CONFIG_USB_WHCI_HCD=m
CONFIG_USB_HWA_HCD=m

#
# USB Device Class drivers
#
CONFIG_USB_ACM=m
CONFIG_USB_PRINTER=m
CONFIG_USB_WDM=m
CONFIG_USB_TMC=m

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_DATAFAB=m
CONFIG_USB_STORAGE_FREECOM=m
CONFIG_USB_STORAGE_ISD200=m
CONFIG_USB_STORAGE_USBAT=m
CONFIG_USB_STORAGE_SDDR09=m
CONFIG_USB_STORAGE_SDDR55=m
CONFIG_USB_STORAGE_JUMPSHOT=m
CONFIG_USB_STORAGE_ALAUDA=m
CONFIG_USB_STORAGE_ONETOUCH=m
CONFIG_USB_STORAGE_KARMA=m
CONFIG_USB_STORAGE_CYPRESS_ATACB=m
# CONFIG_USB_UAS is not set
# CONFIG_USB_LIBUSUAL is not set

#
# USB Imaging devices
#
CONFIG_USB_MDC800=m
CONFIG_USB_MICROTEK=m

#
# USB port drivers
#
CONFIG_USB_USS720=m
CONFIG_USB_SERIAL=m
CONFIG_USB_EZUSB=y
CONFIG_USB_SERIAL_GENERIC=y
CONFIG_USB_SERIAL_AIRCABLE=m
CONFIG_USB_SERIAL_ARK3116=m
CONFIG_USB_SERIAL_BELKIN=m
CONFIG_USB_SERIAL_CH341=m
CONFIG_USB_SERIAL_WHITEHEAT=m
CONFIG_USB_SERIAL_DIGI_ACCELEPORT=m
CONFIG_USB_SERIAL_CP210X=m
CONFIG_USB_SERIAL_CYPRESS_M8=m
CONFIG_USB_SERIAL_EMPEG=m
CONFIG_USB_SERIAL_FTDI_SIO=m
CONFIG_USB_SERIAL_FUNSOFT=m
CONFIG_USB_SERIAL_VISOR=m
CONFIG_USB_SERIAL_IPAQ=m
CONFIG_USB_SERIAL_IR=m
CONFIG_USB_SERIAL_EDGEPORT=m
CONFIG_USB_SERIAL_EDGEPORT_TI=m
CONFIG_USB_SERIAL_GARMIN=m
CONFIG_USB_SERIAL_IPW=m
CONFIG_USB_SERIAL_IUU=m
CONFIG_USB_SERIAL_KEYSPAN_PDA=m
CONFIG_USB_SERIAL_KEYSPAN=m
CONFIG_USB_SERIAL_KLSI=m
CONFIG_USB_SERIAL_KOBIL_SCT=m
CONFIG_USB_SERIAL_MCT_U232=m
CONFIG_USB_SERIAL_MOS7720=m
CONFIG_USB_SERIAL_MOS7715_PARPORT=y
CONFIG_USB_SERIAL_MOS7840=m
CONFIG_USB_SERIAL_MOTOROLA=m
CONFIG_USB_SERIAL_NAVMAN=m
CONFIG_USB_SERIAL_PL2303=m
CONFIG_USB_SERIAL_OTI6858=m
CONFIG_USB_SERIAL_QCAUX=m
CONFIG_USB_SERIAL_QUALCOMM=m
CONFIG_USB_SERIAL_SPCP8X5=m
CONFIG_USB_SERIAL_HP4X=m
CONFIG_USB_SERIAL_SAFE=m
CONFIG_USB_SERIAL_SAFE_PADDED=y
# CONFIG_USB_SERIAL_SAMBA is not set
CONFIG_USB_SERIAL_SIEMENS_MPI=m
CONFIG_USB_SERIAL_SIERRAWIRELESS=m
CONFIG_USB_SERIAL_SYMBOL=m
CONFIG_USB_SERIAL_TI=m
CONFIG_USB_SERIAL_CYBERJACK=m
CONFIG_USB_SERIAL_XIRCOM=m
CONFIG_USB_SERIAL_WWAN=m
CONFIG_USB_SERIAL_OPTION=m
CONFIG_USB_SERIAL_OMNINET=m
CONFIG_USB_SERIAL_OPTICON=m
CONFIG_USB_SERIAL_VIVOPAY_SERIAL=m
# CONFIG_USB_SERIAL_ZIO is not set
# CONFIG_USB_SERIAL_SSU100 is not set
CONFIG_USB_SERIAL_DEBUG=m

#
# USB Miscellaneous drivers
#
CONFIG_USB_EMI62=m
CONFIG_USB_EMI26=m
CONFIG_USB_ADUTUX=m
CONFIG_USB_SEVSEG=m
# CONFIG_USB_RIO500 is not set
CONFIG_USB_LEGOTOWER=m
CONFIG_USB_LCD=m
CONFIG_USB_LED=m
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
CONFIG_USB_IDMOUSE=m
CONFIG_USB_FTDI_ELAN=m
CONFIG_USB_APPLEDISPLAY=m
CONFIG_USB_SISUSBVGA=m
CONFIG_USB_SISUSBVGA_CON=y
CONFIG_USB_LD=m
CONFIG_USB_TRANCEVIBRATOR=m
CONFIG_USB_IOWARRIOR=m
# CONFIG_USB_TEST is not set
CONFIG_USB_ISIGHTFW=m
# CONFIG_USB_YUREX is not set
# CONFIG_USB_GADGET is not set

#
# OTG and related infrastructure
#
CONFIG_USB_OTG_UTILS=y
# CONFIG_USB_GPIO_VBUS is not set
CONFIG_NOP_USB_XCEIV=m
CONFIG_UWB=m
CONFIG_UWB_HWA=m
CONFIG_UWB_WHCI=m
CONFIG_UWB_I1480U=m
CONFIG_MMC=m
# CONFIG_MMC_DEBUG is not set
# CONFIG_MMC_UNSAFE_RESUME is not set
# CONFIG_MMC_CLKGATE is not set

#
# MMC/SD/SDIO Card Drivers
#
CONFIG_MMC_BLOCK=m
CONFIG_MMC_BLOCK_MINORS=8
CONFIG_MMC_BLOCK_BOUNCE=y
CONFIG_SDIO_UART=m
# CONFIG_MMC_TEST is not set

#
# MMC/SD/SDIO Host Controller Drivers
#
CONFIG_MMC_SDHCI=m
CONFIG_MMC_SDHCI_PCI=m
CONFIG_MMC_RICOH_MMC=y
CONFIG_MMC_SDHCI_PLTFM=m
CONFIG_MMC_WBSD=m
CONFIG_MMC_TIFM_SD=m
CONFIG_MMC_SDRICOH_CS=m
CONFIG_MMC_CB710=m
CONFIG_MMC_VIA_SDMMC=m
# CONFIG_MMC_USHC is not set
CONFIG_MEMSTICK=y
# CONFIG_MEMSTICK_DEBUG is not set

#
# MemoryStick drivers
#
# CONFIG_MEMSTICK_UNSAFE_RESUME is not set
CONFIG_MSPRO_BLOCK=m

#
# MemoryStick Host Controller Drivers
#
CONFIG_MEMSTICK_TIFM_MS=m
CONFIG_MEMSTICK_JMICRON_38X=m
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_ALIX2 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_LP5523 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_LT3593 is not set
CONFIG_LEDS_TRIGGERS=y

#
# LED Triggers
#
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_GPIO is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_NFC_DEVICES is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
CONFIG_RTC_DRV_DS1374=m
CONFIG_RTC_DRV_DS1672=m
# CONFIG_RTC_DRV_DS3232 is not set
CONFIG_RTC_DRV_MAX6900=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
# CONFIG_RTC_DRV_ISL12022 is not set
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
CONFIG_RTC_DRV_M41T80=m
CONFIG_RTC_DRV_M41T80_WDT=y
CONFIG_RTC_DRV_BQ32K=m
# CONFIG_RTC_DRV_S35390A is not set
CONFIG_RTC_DRV_FM3130=m
CONFIG_RTC_DRV_RX8581=m
CONFIG_RTC_DRV_RX8025=m

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
CONFIG_RTC_DRV_DS1286=m
CONFIG_RTC_DRV_DS1511=m
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_STK17TA8=m
# CONFIG_RTC_DRV_M48T86 is not set
CONFIG_RTC_DRV_M48T35=m
CONFIG_RTC_DRV_M48T59=m
CONFIG_RTC_DRV_MSM6242=m
CONFIG_RTC_DRV_BQ4802=m
CONFIG_RTC_DRV_RP5C01=m
CONFIG_RTC_DRV_V3020=m

#
# on-CPU RTC drivers
#
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_INTEL_MID_DMAC=m
CONFIG_INTEL_IOATDMA=m
# CONFIG_TIMB_DMA is not set
# CONFIG_PCH_DMA is not set
CONFIG_DMA_ENGINE=y

#
# DMA Clients
#
CONFIG_NET_DMA=y
CONFIG_ASYNC_TX_DMA=y
# CONFIG_DMATEST is not set
CONFIG_DCA=m
# CONFIG_AUXDISPLAY is not set
CONFIG_UIO=m
# CONFIG_UIO_CIF is not set
# CONFIG_UIO_PDRV is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
# CONFIG_UIO_AEC is not set
# CONFIG_UIO_SERCOS3 is not set
# CONFIG_UIO_PCI_GENERIC is not set
# CONFIG_UIO_NETX is not set

#
# Xen driver support
#
# CONFIG_XEN_BALLOON is not set
# CONFIG_XEN_DEV_EVTCHN is not set
CONFIG_XEN_BACKEND=y
# CONFIG_XENFS is not set
# CONFIG_XEN_SYS_HYPERVISOR is not set
CONFIG_XEN_XENBUS_FRONTEND=y
# CONFIG_XEN_GNTDEV is not set
# CONFIG_XEN_PLATFORM_PCI is not set
CONFIG_SWIOTLB_XEN=y
CONFIG_STAGING=y
# CONFIG_STAGING_EXCLUDE_BUILD is not set
# CONFIG_ET131X is not set
# CONFIG_SLICOSS is not set
# CONFIG_VIDEO_GO7007 is not set
# CONFIG_VIDEO_CX25821 is not set
# CONFIG_VIDEO_TM6000 is not set
CONFIG_USB_DABUSB=m
CONFIG_USB_SE401=m
CONFIG_VIDEO_USBVIDEO=m
CONFIG_USB_VICAM=m
# CONFIG_USB_IP_COMMON is not set
# CONFIG_ECHO is not set
# CONFIG_COMEDI is not set
# CONFIG_ASUS_OLED is not set
# CONFIG_PANEL is not set
# CONFIG_TRANZPORT is not set
# CONFIG_POHMELFS is not set
# CONFIG_AUTOFS_FS is not set
# CONFIG_IDE_PHISON is not set
# CONFIG_LINE6_USB is not set
CONFIG_DRM_VMWGFX=m
CONFIG_DRM_NOUVEAU=m
CONFIG_DRM_NOUVEAU_BACKLIGHT=y
CONFIG_DRM_NOUVEAU_DEBUG=y

#
# I2C encoder or helper chips
#
CONFIG_DRM_I2C_CH7006=m
CONFIG_DRM_I2C_SIL164=m
# CONFIG_USB_SERIAL_QUATECH2 is not set
# CONFIG_USB_SERIAL_QUATECH_USB2 is not set
# CONFIG_HYPERV is not set
# CONFIG_VME_BUS is not set
# CONFIG_DX_SEP is not set
# CONFIG_IIO is not set
# CONFIG_ZRAM is not set
# CONFIG_FB_SM7XX is not set
# CONFIG_VIDEO_DT3155 is not set
CONFIG_CRYSTALHD=m

#
# Texas Instruments shared transport line discipline
#
# CONFIG_FB_XGI is not set
CONFIG_LIRC_STAGING=y
CONFIG_LIRC_BT829=m
CONFIG_LIRC_IGORPLUGUSB=m
CONFIG_LIRC_IMON=m
CONFIG_LIRC_IT87=m
CONFIG_LIRC_ITE8709=m
# CONFIG_LIRC_PARALLEL is not set
CONFIG_LIRC_SASEM=m
CONFIG_LIRC_SERIAL=m
CONFIG_LIRC_SERIAL_TRANSMITTER=y
CONFIG_LIRC_SIR=m
CONFIG_LIRC_TTUSBIR=m
CONFIG_LIRC_ZILOG=m
# CONFIG_SMB_FS is not set
# CONFIG_EASYCAP is not set
# CONFIG_SOLO6X10 is not set
# CONFIG_ACPI_QUICKSTART is not set
CONFIG_MACH_NO_WESTBRIDGE=y
# CONFIG_USB_ENESTORAGE is not set
# CONFIG_BCM_WIMAX is not set
# CONFIG_FT1000 is not set

#
# Speakup console speech
#
# CONFIG_SPEAKUP is not set
# CONFIG_TOUCHSCREEN_CLEARPAD_TM1217 is not set
# CONFIG_TOUCHSCREEN_SYNAPTICS_I2C_RMI4 is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_FUJITSU_LAPTOP is not set
CONFIG_PANASONIC_LAPTOP=y
# CONFIG_THINKPAD_ACPI is not set
CONFIG_SENSORS_HDAPS=m
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_ACPI_ASUS is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_IBM_RTL is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_EFI_VARS is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
# CONFIG_DMIID is not set
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
CONFIG_EXT2_FS=m
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
CONFIG_EXT2_FS_XIP=y
CONFIG_EXT3_FS=y
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_FS_XATTR=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_FS_XIP=y
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=m
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_FANOTIFY is not set
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_PRINT_QUOTA_WARNING is not set
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=y
CONFIG_FUSE_FS=m
CONFIG_CUSE=m
CONFIG_GENERIC_ACL=y

#
# Caches
#
CONFIG_FSCACHE=m
CONFIG_FSCACHE_STATS=y
# CONFIG_FSCACHE_HISTOGRAM is not set
# CONFIG_FSCACHE_DEBUG is not set
CONFIG_FSCACHE_OBJECT_LIST=y
CONFIG_CACHEFILES=m
# CONFIG_CACHEFILES_DEBUG is not set
# CONFIG_CACHEFILES_HISTOGRAM is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
# CONFIG_JOLIET is not set
# CONFIG_ZISOFS is not set
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="ascii"
CONFIG_NTFS_FS=m
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_LOGFS is not set
CONFIG_CRAMFS=m
CONFIG_SQUASHFS=m
# CONFIG_SQUASHFS_XATTR is not set
# CONFIG_SQUASHFS_LZO is not set
# CONFIG_SQUASHFS_XZ is not set
# CONFIG_SQUASHFS_EMBEDDED is not set
CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=3
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_EXOFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
CONFIG_NFS_V4_1=y
CONFIG_PNFS_FILE_LAYOUT=m
CONFIG_NFS_FSCACHE=y
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
# CONFIG_NFS_USE_NEW_IDMAPPER is not set
CONFIG_NFSD=m
CONFIG_NFSD_DEPRECATED=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=m
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=m
CONFIG_SUNRPC_GSS=m
CONFIG_RPCSEC_GSS_KRB5=m
# CONFIG_CEPH_FS is not set
CONFIG_CIFS=m
# CONFIG_CIFS_STATS is not set
CONFIG_CIFS_WEAK_PW_HASH=y
CONFIG_CIFS_UPCALL=y
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CIFS_DEBUG2=y
# CONFIG_CIFS_DFS_UPCALL is not set
# CONFIG_CIFS_FSCACHE is not set
# CONFIG_CIFS_ACL is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
CONFIG_NLS_CODEPAGE_936=m
CONFIG_NLS_CODEPAGE_950=m
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=m
CONFIG_DLM=m
CONFIG_DLM_DEBUG=y

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_STRIP_ASM_SYMS=y
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
CONFIG_HEADERS_CHECK=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SHIRQ=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1
CONFIG_DETECT_HUNG_TASK=y
CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=1
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y
# CONFIG_DEBUG_OBJECTS is not set
CONFIG_SLUB_DEBUG_ON=y
# CONFIG_SLUB_STATS is not set
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_PI_LIST=y
CONFIG_RT_MUTEX_TESTER=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_BKL=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y
CONFIG_SPARSE_RCU_POINTER=y
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_LOCKDEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_LOCKING_API_SELFTESTS=y
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_REDUCED is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_LIST=y
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
CONFIG_BOOT_PRINTK_DELAY=y
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_LKDTM=y
# CONFIG_CPU_NOTIFIER_ERROR_INJECT is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
# CONFIG_SYSCTL_SYSCALL_CHECK is not set
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FTRACE_NMI_ENTER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_EVENT_TRACING=y
CONFIG_EVENT_POWER_TRACING_DEPRECATED=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_RING_BUFFER_ALLOW_SWAP=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENT=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
# CONFIG_EVENT_TRACE_TEST_SYSCALLS is not set
CONFIG_MMIOTRACE=y
# CONFIG_MMIOTRACE_TEST is not set
CONFIG_RING_BUFFER_BENCHMARK=m
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
CONFIG_BUILD_DOCSRC=y
CONFIG_DYNAMIC_DEBUG=y
# CONFIG_DMA_API_DEBUG is not set
CONFIG_ATOMIC64_SELFTEST=y
CONFIG_ASYNC_RAID6_TEST=m
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_DEBUG_STACKOVERFLOW=y
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_X86_PTDUMP=y
CONFIG_DEBUG_RODATA=y
CONFIG_DEBUG_RODATA_TEST=y
# CONFIG_DEBUG_SET_MODULE_RONX is not set
CONFIG_DEBUG_NX_TEST=m
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
CONFIG_X86_DECODER_SELFTEST=y
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
# CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is not set

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_TRUSTED_KEYS is not set
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
# CONFIG_SECURITY is not set
CONFIG_SECURITYFS=y
CONFIG_INTEL_TXT=y
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_DEFAULT_SECURITY=""
CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_ASYNC_PQ=m
CONFIG_ASYNC_RAID6_RECOV=m
CONFIG_ASYNC_TX_DISABLE_PQ_VAL_DMA=y
CONFIG_ASYNC_TX_DISABLE_XOR_VAL_DMA=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=m
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=m
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP=m
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
CONFIG_CRYPTO_GF128MUL=m
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CRYPTO_CRYPTD=m
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
CONFIG_CRYPTO_SEQIV=m

#
# Block modes
#
CONFIG_CRYPTO_CBC=m
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=m
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=m
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_GHASH is not set
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_SHA256=m
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_ARC4=m
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
CONFIG_CRYPTO_DES=m
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_ZLIB=m
CONFIG_CRYPTO_LZO=m

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
# CONFIG_CRYPTO_HW is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
CONFIG_KVM_AMD=m
CONFIG_KVM_MMU_AUDIT=y
CONFIG_VHOST_NET=m
CONFIG_VIRTIO=m
CONFIG_VIRTIO_RING=m
CONFIG_VIRTIO_PCI=m
CONFIG_VIRTIO_BALLOON=m
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=m
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_GENERIC_FIND_LAST_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=m
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_NLATTR=y

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics Paul Turner
@ 2011-02-22  3:14   ` Balbir Singh
  2011-02-22  4:13     ` Bharata B Rao
  2011-02-23 13:32   ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2011-02-22  3:14 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

* Paul Turner <pjt@google.com> [2011-02-15 19:18:36]:

> From: Nikhil Rao <ncrao@google.com>
> 
> This change introduces statistics exports for the cpu sub-system, these are
> added through the use of a stat file similar to that exported by other
> subsystems.
> 
> The following exports are included:
> 
> nr_periods:	number of periods in which execution occurred
> nr_throttled:	the number of periods above in which execution was throttled
> throttled_time:	cumulative wall-time that any cpus have been throttled for
> this group
> 
> Signed-off-by: Paul Turner <pjt@google.com>
> Signed-off-by: Nikhil Rao <ncrao@google.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> ---
>  kernel/sched.c      |   26 ++++++++++++++++++++++++++
>  kernel/sched_fair.c |   16 +++++++++++++++-
>  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> Index: tip/kernel/sched.c
> ===================================================================
> --- tip.orig/kernel/sched.c
> +++ tip/kernel/sched.c
> @@ -254,6 +254,11 @@ struct cfs_bandwidth {
>  	ktime_t			period;
>  	u64			runtime, quota;
>  	struct hrtimer		period_timer;
> +
> +	/* throttle statistics */
> +	u64			nr_periods;
> +	u64			nr_throttled;
> +	u64			throttled_time;
>  };
>  #endif
> 
> @@ -389,6 +394,7 @@ struct cfs_rq {
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	u64 quota_assigned, quota_used;
>  	int throttled;
> +	u64 throttled_timestamp;
>  #endif
>  #endif
>  };
> @@ -426,6 +432,10 @@ void init_cfs_bandwidth(struct cfs_bandw
> 
>  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>  	cfs_b->period_timer.function = sched_cfs_period_timer;
> +
> +	cfs_b->nr_periods = 0;
> +	cfs_b->nr_throttled = 0;
> +	cfs_b->throttled_time = 0;
>  }
> 
>  static
> @@ -9332,6 +9342,18 @@ static int cpu_cfs_period_write_u64(stru
>  	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
>  }
> 
> +static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
> +		struct cgroup_map_cb *cb)
> +{
> +	struct task_group *tg = cgroup_tg(cgrp);
> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> +
> +	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
> +	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
> +	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
> +
> +	return 0;
> +}
>  #endif /* CONFIG_CFS_BANDWIDTH */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> 
> @@ -9378,6 +9400,10 @@ static struct cftype cpu_files[] = {
>  		.read_u64 = cpu_cfs_period_read_u64,
>  		.write_u64 = cpu_cfs_period_write_u64,
>  	},
> +	{
> +		.name = "stat",
> +		.read_map = cpu_stats_show,
> +	},
>  #endif
>  #ifdef CONFIG_RT_GROUP_SCHED
>  	{
> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -1519,17 +1519,25 @@ static void throttle_cfs_rq(struct cfs_r
> 
>  out_throttled:
>  	cfs_rq->throttled = 1;
> +	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
>  	update_cfs_rq_load_contribution(cfs_rq, 1);
>  }
> 
>  static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>  	struct sched_entity *se;
> 
>  	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> 
>  	update_rq_clock(rq);
> +	/* update stats */
> +	raw_spin_lock(&cfs_b->lock);
> +	cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
> +	raw_spin_unlock(&cfs_b->lock);
> +	cfs_rq->throttled_timestamp = 0;
> +
>  	/* (Try to) avoid maintaining share statistics for idle time */
>  	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
> 
> @@ -1571,7 +1579,7 @@ static void account_cfs_rq_quota(struct 
> 
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
> -	int i, idle = 1;
> +	int i, idle = 1, num_throttled = 0;
>  	u64 delta;
>  	const struct cpumask *span;
> 
> @@ -1593,6 +1601,7 @@ static int do_sched_cfs_period_timer(str
> 
>  		if (!cfs_rq_throttled(cfs_rq))
>  			continue;
> +		num_throttled++;
> 
>  		delta = tg_request_cfs_quota(cfs_rq->tg);
> 
> @@ -1608,6 +1617,11 @@ static int do_sched_cfs_period_timer(str
>  		}
>  	}
> 
> +	/* update throttled stats */
> +	cfs_b->nr_periods++;
> +	if (num_throttled)
> +		cfs_b->nr_throttled++;
> +
>  	return idle;
>  }
>

Should we consider integrating this in cpuacct, it would be difficult
if we spill over stats between controllers. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
@ 2011-02-22  3:17   ` Balbir Singh
  2011-02-23  8:05     ` Paul Turner
  2011-02-23  2:02   ` Hidetoshi Seto
  2011-02-23 13:32   ` Peter Zijlstra
  2 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2011-02-22  3:17 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

* Paul Turner <pjt@google.com> [2011-02-15 19:18:37]:

> +#ifdef CONFIG_CFS_BANDWIDTH
> +/* maintain hierarchal task counts on group entities */
> +static void account_hier_tasks(struct sched_entity *se, int delta)

I don't like the use of hier, I'd expand it to hierarchical

I am not too sure about the RT bits, but other than that


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-22  3:14   ` Balbir Singh
@ 2011-02-22  4:13     ` Bharata B Rao
  2011-02-22  4:40       ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Bharata B Rao @ 2011-02-22  4:13 UTC (permalink / raw
  To: Balbir Singh
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Tue, Feb 22, 2011 at 08:44:20AM +0530, Balbir Singh wrote:
> * Paul Turner <pjt@google.com> [2011-02-15 19:18:36]:
> 
> > From: Nikhil Rao <ncrao@google.com>
> > 
> > This change introduces statistics exports for the cpu sub-system, these are
> > added through the use of a stat file similar to that exported by other
> > subsystems.
> > 
> > The following exports are included:
> > 
> > nr_periods:	number of periods in which execution occurred
> > nr_throttled:	the number of periods above in which execution was throttled
> > throttled_time:	cumulative wall-time that any cpus have been throttled for
> > this group
> > 
> > Signed-off-by: Paul Turner <pjt@google.com>
> > Signed-off-by: Nikhil Rao <ncrao@google.com>
> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > ---
> >  kernel/sched.c      |   26 ++++++++++++++++++++++++++
> >  kernel/sched_fair.c |   16 +++++++++++++++-
> >  2 files changed, 41 insertions(+), 1 deletion(-)
> > 
> > Index: tip/kernel/sched.c
> > ===================================================================
> > --- tip.orig/kernel/sched.c
> > +++ tip/kernel/sched.c
> > @@ -254,6 +254,11 @@ struct cfs_bandwidth {
> >  	ktime_t			period;
> >  	u64			runtime, quota;
> >  	struct hrtimer		period_timer;
> > +
> > +	/* throttle statistics */
> > +	u64			nr_periods;
> > +	u64			nr_throttled;
> > +	u64			throttled_time;
> >  };
> >  #endif
> > 
> > @@ -389,6 +394,7 @@ struct cfs_rq {
> >  #ifdef CONFIG_CFS_BANDWIDTH
> >  	u64 quota_assigned, quota_used;
> >  	int throttled;
> > +	u64 throttled_timestamp;
> >  #endif
> >  #endif
> >  };
> > @@ -426,6 +432,10 @@ void init_cfs_bandwidth(struct cfs_bandw
> > 
> >  	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> >  	cfs_b->period_timer.function = sched_cfs_period_timer;
> > +
> > +	cfs_b->nr_periods = 0;
> > +	cfs_b->nr_throttled = 0;
> > +	cfs_b->throttled_time = 0;
> >  }
> > 
> >  static
> > @@ -9332,6 +9342,18 @@ static int cpu_cfs_period_write_u64(stru
> >  	return tg_set_cfs_period(cgroup_tg(cgrp), cfs_period_us);
> >  }
> > 
> > +static int cpu_stats_show(struct cgroup *cgrp, struct cftype *cft,
> > +		struct cgroup_map_cb *cb)
> > +{
> > +	struct task_group *tg = cgroup_tg(cgrp);
> > +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> > +
> > +	cb->fill(cb, "nr_periods", cfs_b->nr_periods);
> > +	cb->fill(cb, "nr_throttled", cfs_b->nr_throttled);
> > +	cb->fill(cb, "throttled_time", cfs_b->throttled_time);
> > +
> > +	return 0;
> > +}
> >  #endif /* CONFIG_CFS_BANDWIDTH */
> >  #endif /* CONFIG_FAIR_GROUP_SCHED */
> > 
> > @@ -9378,6 +9400,10 @@ static struct cftype cpu_files[] = {
> >  		.read_u64 = cpu_cfs_period_read_u64,
> >  		.write_u64 = cpu_cfs_period_write_u64,
> >  	},
> > +	{
> > +		.name = "stat",
> > +		.read_map = cpu_stats_show,
> > +	},
> >  #endif
> >  #ifdef CONFIG_RT_GROUP_SCHED
> >  	{
> > Index: tip/kernel/sched_fair.c
> > ===================================================================
> > --- tip.orig/kernel/sched_fair.c
> > +++ tip/kernel/sched_fair.c
> > @@ -1519,17 +1519,25 @@ static void throttle_cfs_rq(struct cfs_r
> > 
> >  out_throttled:
> >  	cfs_rq->throttled = 1;
> > +	cfs_rq->throttled_timestamp = rq_of(cfs_rq)->clock;
> >  	update_cfs_rq_load_contribution(cfs_rq, 1);
> >  }
> > 
> >  static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> >  {
> >  	struct rq *rq = rq_of(cfs_rq);
> > +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> >  	struct sched_entity *se;
> > 
> >  	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > 
> >  	update_rq_clock(rq);
> > +	/* update stats */
> > +	raw_spin_lock(&cfs_b->lock);
> > +	cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
> > +	raw_spin_unlock(&cfs_b->lock);
> > +	cfs_rq->throttled_timestamp = 0;
> > +
> >  	/* (Try to) avoid maintaining share statistics for idle time */
> >  	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
> > 
> > @@ -1571,7 +1579,7 @@ static void account_cfs_rq_quota(struct 
> > 
> >  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> >  {
> > -	int i, idle = 1;
> > +	int i, idle = 1, num_throttled = 0;
> >  	u64 delta;
> >  	const struct cpumask *span;
> > 
> > @@ -1593,6 +1601,7 @@ static int do_sched_cfs_period_timer(str
> > 
> >  		if (!cfs_rq_throttled(cfs_rq))
> >  			continue;
> > +		num_throttled++;
> > 
> >  		delta = tg_request_cfs_quota(cfs_rq->tg);
> > 
> > @@ -1608,6 +1617,11 @@ static int do_sched_cfs_period_timer(str
> >  		}
> >  	}
> > 
> > +	/* update throttled stats */
> > +	cfs_b->nr_periods++;
> > +	if (num_throttled)
> > +		cfs_b->nr_throttled++;
> > +
> >  	return idle;
> >  }
> >
> 
> Should we consider integrating this in cpuacct, it would be difficult
> if we spill over stats between controllers. 

Given that cpuacct controller can be mounted independently, I am not sure
if we should integrate these stats. These stats come from cpu controller.
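
To illustrate what "mounted independently" looks like in practice (the mount
points /cgroup/cpu and /cgroup/cpuacct below are just example paths):

# mkdir -p /cgroup/cpu /cgroup/cpuacct
# mount -t cgroup -o cpu none /cgroup/cpu
# mount -t cgroup -o cpuacct none /cgroup/cpuacct

With the two controllers on separate hierarchies like this, anything exported
by the cpu controller is simply not visible under the cpuacct mount.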

I initially had similar stats as part of /proc/sched_debug since there
are a bunch of other group specific stats (including rt throttle stats)
in /proc/sched_debug.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-22  4:13     ` Bharata B Rao
@ 2011-02-22  4:40       ` Balbir Singh
  2011-02-23  8:03         ` Paul Turner
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2011-02-22  4:40 UTC (permalink / raw
  To: Bharata B Rao
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Vaidyanathan Srinivasan,
	Gautham R Shenoy, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

* Bharata B Rao <bharata@linux.vnet.ibm.com> [2011-02-22 09:43:33]:

> > 
> > Should we consider integrating this in cpuacct, it would be difficult
> > if we spill over stats between controllers. 
> 
> Given that cpuacct controller can be mounted independently, I am not sure
> if we should integrate these stats. These stats come from cpu controller.

The accounting controller was created to account. I'd still prefer
cpuacct, so that I can find everything in one place. NOTE: cpuacct was
created so that we do accounting without control - just account. I think
splitting stats creates a usability mess - no?

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-21  2:47 ` [CFS Bandwidth Control v4 0/7] Introduction Xiao Guangrong
@ 2011-02-22 10:28   ` Bharata B Rao
  2011-02-23  7:42   ` Paul Turner
  1 sibling, 0 replies; 71+ messages in thread
From: Bharata B Rao @ 2011-02-22 10:28 UTC (permalink / raw
  To: Xiao Guangrong
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen

On Mon, Feb 21, 2011 at 10:47:12AM +0800, Xiao Guangrong wrote:
> On 02/16/2011 11:18 AM, Paul Turner wrote:
> > Hi all,
> > 
> > Please find attached v4 of CFS bandwidth control; while this rebase against
> > some of the latest SCHED_NORMAL code is new, the features and methodology are
> > fairly mature at this point and have proved both effective and stable for
> > several workloads.
> > 
> > As always, all comments/feedback welcome.
> > 
> 
> Hi Paul,
> 
> Thanks for the great features!
> 
> I applied the patchset to kvm tree, then tested with kvm guest, unfortunately,
> it seems don't work normally. 
> 
> The steps is follow:
> 
> # mount -t cgroup -o cpu none /mnt/
> # qemu-system-x86_64 -enable-kvm  -smp 4 -m 512M -drive file=fc64.img,index=0,media=disk
> 
> Don't do any configuration in cgroup, and run the kvm guest directly (don't use libvirt),
> the guest booted very slowly and i saw some "soft lockup" bugs reported in the guest,
> i also noticed one CPU usage is 100% for more than 60s and other CPUs is 10%~30% in the host
> when guest was booting.
> 
> And if cgroup is not mounted, the guest runs well.
> 

Hi Xiao Guangrong,

Thanks for testing the patches. I do see some soft lockups in the
guest when I mount the cgroup and start VM using qemu-kvm.

Will get back after further investigation.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
  2011-02-22  3:17   ` Balbir Singh
@ 2011-02-23  2:02   ` Hidetoshi Seto
  2011-02-23  2:20     ` Paul Turner
  2011-02-23  2:43     ` Balbir Singh
  2011-02-23 13:32   ` Peter Zijlstra
  2 siblings, 2 replies; 71+ messages in thread
From: Hidetoshi Seto @ 2011-02-23  2:02 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

(2011/02/16 12:18), Paul Turner wrote:
> @@ -1428,6 +1464,7 @@ enqueue_task_fair(struct rq *rq, struct 
>  		update_cfs_shares(cfs_rq);
>  	}
>  
> +	account_hier_tasks(&p->se, 1);
>  	hrtick_update(rq);
>  }
>  
> @@ -1461,6 +1498,7 @@ static void dequeue_task_fair(struct rq 
>  		update_cfs_shares(cfs_rq);
>  	}
>  
> +	account_hier_tasks(&p->se, -1);
>  	hrtick_update(rq);
>  }
>  

Why does hrtick_update() need to be delayed until after nr_running is modified?
You should not impact the current hrtick logic without at least mentioning it.

> Index: tip/kernel/sched_rt.c
> ===================================================================
> --- tip.orig/kernel/sched_rt.c
> +++ tip/kernel/sched_rt.c
> @@ -906,6 +906,8 @@ enqueue_task_rt(struct rq *rq, struct ta
>  
>  	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
>  		enqueue_pushable_task(rq, p);
> +
> +	inc_nr_running(rq);
>  }
>  
>  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> @@ -916,6 +918,8 @@ static void dequeue_task_rt(struct rq *r
>  	dequeue_rt_entity(rt_se);
>  
>  	dequeue_pushable_task(rq, p);
> +
> +	dec_nr_running(rq);
>  }
>  
>  /*

I think similar change for sched_stoptask.c is required.

In fact I could not boot tip/master with this v4 patchset.
It reports rcu stall warns without applying following tweak:

--- a/kernel/sched_stoptask.c
+++ b/kernel/sched_stoptask.c
@@ -35,11 +35,13 @@ static struct task_struct *pick_next_task_stop(struct rq *rq
 static void
 enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+       inc_nr_running(rq);
 }

 static void
 dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
 {
+       dec_nr_running(rq);
 }


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-23  2:02   ` Hidetoshi Seto
@ 2011-02-23  2:20     ` Paul Turner
  2011-02-23  2:43     ` Balbir Singh
  1 sibling, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-23  2:20 UTC (permalink / raw
  To: Hidetoshi Seto
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

On Tue, Feb 22, 2011 at 6:02 PM, Hidetoshi Seto
<seto.hidetoshi@jp.fujitsu.com> wrote:
> (2011/02/16 12:18), Paul Turner wrote:
>> @@ -1428,6 +1464,7 @@ enqueue_task_fair(struct rq *rq, struct
>>               update_cfs_shares(cfs_rq);
>>       }
>>
>> +     account_hier_tasks(&p->se, 1);
>>       hrtick_update(rq);
>>  }
>>
>> @@ -1461,6 +1498,7 @@ static void dequeue_task_fair(struct rq
>>               update_cfs_shares(cfs_rq);
>>       }
>>
>> +     account_hier_tasks(&p->se, -1);
>>       hrtick_update(rq);
>>  }
>>
>
> Why does hrtick_update() need to be delayed until after nr_running is modified?
> You should not impact the current hrtick logic without at least mentioning it.

There shouldn't be any impact -- hrtick_update only cares about
cfs_rq->nr_running, which is independent of rq->nr_running (maintained
by hierarchical accounting).

>
>> Index: tip/kernel/sched_rt.c
>> ===================================================================
>> --- tip.orig/kernel/sched_rt.c
>> +++ tip/kernel/sched_rt.c
>> @@ -906,6 +906,8 @@ enqueue_task_rt(struct rq *rq, struct ta
>>
>>       if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
>>               enqueue_pushable_task(rq, p);
>> +
>> +     inc_nr_running(rq);
>>  }
>>
>>  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>> @@ -916,6 +918,8 @@ static void dequeue_task_rt(struct rq *r
>>       dequeue_rt_entity(rt_se);
>>
>>       dequeue_pushable_task(rq, p);
>> +
>> +     dec_nr_running(rq);
>>  }
>>
>>  /*
>
> I think similar change for sched_stoptask.c is required.

Aha!

Yes, I missed this addition when re-basing.

It is needed since, when something is placed into the stop-class via
set_sched, it goes through an activate/deactivate.

Will add, thanks!

>
> In fact I could not boot tip/master with this v4 patchset.
> It reports rcu stall warns without applying following tweak:
>
> --- a/kernel/sched_stoptask.c
> +++ b/kernel/sched_stoptask.c
> @@ -35,11 +35,13 @@ static struct task_struct *pick_next_task_stop(struct rq *rq
>  static void
>  enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
>  {
> +       inc_nr_running(rq);
>  }
>
>  static void
>  dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
>  {
> +       dec_nr_running(rq);
>  }
>
>
> Thanks,
> H.Seto
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-23  2:02   ` Hidetoshi Seto
  2011-02-23  2:20     ` Paul Turner
@ 2011-02-23  2:43     ` Balbir Singh
  1 sibling, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2011-02-23  2:43 UTC (permalink / raw
  To: Hidetoshi Seto
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

* Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> [2011-02-23 11:02:40]:

> (2011/02/16 12:18), Paul Turner wrote:
> > @@ -1428,6 +1464,7 @@ enqueue_task_fair(struct rq *rq, struct 
> >  		update_cfs_shares(cfs_rq);
> >  	}
> >  
> > +	account_hier_tasks(&p->se, 1);
> >  	hrtick_update(rq);
> >  }
> >  
> > @@ -1461,6 +1498,7 @@ static void dequeue_task_fair(struct rq 
> >  		update_cfs_shares(cfs_rq);
> >  	}
> >  
> > +	account_hier_tasks(&p->se, -1);
> >  	hrtick_update(rq);
> >  }
> >  
> 
> Why does hrtick_update() need to be delayed until after nr_running is modified?
> You should not impact the current hrtick logic without at least mentioning it.
> 
> > Index: tip/kernel/sched_rt.c
> > ===================================================================
> > --- tip.orig/kernel/sched_rt.c
> > +++ tip/kernel/sched_rt.c
> > @@ -906,6 +906,8 @@ enqueue_task_rt(struct rq *rq, struct ta
> >  
> >  	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
> >  		enqueue_pushable_task(rq, p);
> > +
> > +	inc_nr_running(rq);
> >  }
> >  
> >  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> > @@ -916,6 +918,8 @@ static void dequeue_task_rt(struct rq *r
> >  	dequeue_rt_entity(rt_se);
> >  
> >  	dequeue_pushable_task(rq, p);
> > +
> > +	dec_nr_running(rq);
> >  }
> >  
> >  /*
> 
> I think similar change for sched_stoptask.c is required.
> 
> In fact I could not boot tip/master with this v4 patchset.
> It reports rcu stall warns without applying following tweak:
> 
> --- a/kernel/sched_stoptask.c
> +++ b/kernel/sched_stoptask.c
> @@ -35,11 +35,13 @@ static struct task_struct *pick_next_task_stop(struct rq *rq
>  static void
>  enqueue_task_stop(struct rq *rq, struct task_struct *p, int flags)
>  {
> +       inc_nr_running(rq);
>  }
> 
>  static void
>  dequeue_task_stop(struct rq *rq, struct task_struct *p, int flags)
>  {
> +       dec_nr_running(rq);
>  }
>

Good catch!

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-21  2:47 ` [CFS Bandwidth Control v4 0/7] Introduction Xiao Guangrong
  2011-02-22 10:28   ` Bharata B Rao
@ 2011-02-23  7:42   ` Paul Turner
  2011-02-23  7:51     ` Balbir Singh
  1 sibling, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-02-23  7:42 UTC (permalink / raw
  To: Xiao Guangrong
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

Thanks for the report Xiao -- I wasn't able to reproduce this yet with
a simple guest, I will try a more modern image tomorrow.

One suspicion is that this might be connected with the missing
runnable accounting in sched_stoptask.c.

On Sun, Feb 20, 2011 at 6:47 PM, Xiao Guangrong
<xiaoguangrong@cn.fujitsu.com> wrote:
> On 02/16/2011 11:18 AM, Paul Turner wrote:
>> Hi all,
>>
>> Please find attached v4 of CFS bandwidth control; while this rebase against
>> some of the latest SCHED_NORMAL code is new, the features and methodology are
>> fairly mature at this point and have proved both effective and stable for
>> several workloads.
>>
>> As always, all comments/feedback welcome.
>>
>
> Hi Paul,
>
> Thanks for the great features!
>
> I applied the patchset to kvm tree, then tested with kvm guest, unfortunately,
> it seems don't work normally.
>
> The steps is follow:
>
> # mount -t cgroup -o cpu none /mnt/
> # qemu-system-x86_64 -enable-kvm  -smp 4 -m 512M -drive file=fc64.img,index=0,media=disk
>
> Don't do any configuration in cgroup, and run the kvm guest directly (don't use libvirt),
> the guest booted very slowly and i saw some "soft lockup" bugs reported in the guest,
> i also noticed one CPU usage is 100% for more than 60s and other CPUs is 10%~30% in the host
> when guest was booting.
>
> And if cgroup is not mounted, the guest runs well.
>
> The kernel config file is attached and my system cpu info is:
>
> # cat /proc/cpuinfo
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 30
> model name      : Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
> stepping        : 5
> cpu MHz         : 1197.000
> cache size      : 8192 KB
> physical id     : 0
> siblings        : 4
> core id         : 0
> cpu cores       : 4
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
> bogomips        : 5584.73
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
> processor       : 1
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 30
> model name      : Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
> stepping        : 5
> cpu MHz         : 1197.000
> cache size      : 8192 KB
> physical id     : 0
> siblings        : 4
> core id         : 1
> cpu cores       : 4
> apicid          : 2
> initial apicid  : 2
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
> bogomips        : 5585.03
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
> processor       : 2
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 30
> model name      : Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
> stepping        : 5
> cpu MHz         : 1197.000
> cache size      : 8192 KB
> physical id     : 0
> siblings        : 4
> core id         : 2
> cpu cores       : 4
> apicid          : 4
> initial apicid  : 4
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
> bogomips        : 5585.03
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
> processor       : 3
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 30
> model name      : Intel(R) Core(TM) i5 CPU         760  @ 2.80GHz
> stepping        : 5
> cpu MHz         : 1197.000
> cache size      : 8192 KB
> physical id     : 0
> siblings        : 4
> core id         : 3
> cpu cores       : 4
> apicid          : 6
> initial apicid  : 6
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 11
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
> bogomips        : 5585.03
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 36 bits physical, 48 bits virtual
> power management:
>
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-23  7:42   ` Paul Turner
@ 2011-02-23  7:51     ` Balbir Singh
  2011-02-23  7:56       ` Paul Turner
  0 siblings, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2011-02-23  7:51 UTC (permalink / raw
  To: Paul Turner
  Cc: Xiao Guangrong, linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

* Paul Turner <pjt@google.com> [2011-02-22 23:42:48]:

> Thanks for the report Xiao -- I wasn't able to reproduce this yet with
> a simple guest, I will try a more modern image tomorrow.
> 
> One suspicion is that this might be connected with the missing
> runnable accounting in sched_stoptask.c.
>

I can confirm that, my guests work fine after the changes posted this
morning. I still see some lockdep errors, but none associated with the
scheduler. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-23  7:51     ` Balbir Singh
@ 2011-02-23  7:56       ` Paul Turner
  2011-02-23  8:31         ` Bharata B Rao
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-02-23  7:56 UTC (permalink / raw
  To: balbir
  Cc: Xiao Guangrong, linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

On Tue, Feb 22, 2011 at 11:51 PM, Balbir Singh
<balbir@linux.vnet.ibm.com> wrote:
> * Paul Turner <pjt@google.com> [2011-02-22 23:42:48]:
>
>> Thanks for the report Xiao -- I wasn't able to reproduce this yet with
>> a simple guest, I will try a more modern image tomorrow.
>>
>> One suspicion is that this might be connected with the missing
>> runnable accounting in sched_stoptask.c.
>>
>
> I can confirm that, my guests work fine after the changes posted this
> morning. I still see some lockdep errors, but none associated with the
> scheduler.
>

Excellent!

Ok, if this is resolved I'll roll this up and repost tomorrow, thanks!

> --
>        Three Cheers,
>        Balbir
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-22  4:40       ` Balbir Singh
@ 2011-02-23  8:03         ` Paul Turner
  2011-02-23 10:13           ` Balbir Singh
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-02-23  8:03 UTC (permalink / raw
  To: balbir
  Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

On Mon, Feb 21, 2011 at 8:40 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Bharata B Rao <bharata@linux.vnet.ibm.com> [2011-02-22 09:43:33]:
>
>> >
>> > Should we consider integrating this in cpuacct, it would be difficult
>> > if we spill over stats between controllers.
>>
>> Given that cpuacct controller can be mounted independently, I am not sure
>> if we should integrate these stats. These stats come from cpu controller.
>
> The accounting controller was created to account. I'd still prefer
> cpuacct, so that I can find everything in one place. NOTE: cpuacct was
> created so that we do accounting without control - just account. I think
> splitting stats creates a usability mess - no?
>

One problem with rolling it into cpuacct is that some of the
statistics have a 1:1 association with the hierarchy being throttled.
For example, the number of periods in which throttling occurred or the
count of elapsed periods.

If it were rolled into cpuacct the only meaningful export would be the
total throttled time -- perhaps this is sufficient?
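
For concreteness, a sketch of how these read back from cgroupfs -- the mount
point /cgroup/cpu, the group name "test" and the values shown are made up:

# cat /cgroup/cpu/test/cpu.stat
nr_periods 50
nr_throttled 12
throttled_time 240000000

nr_periods/nr_throttled only mean something relative to the group's own
period and quota settings (the 1:1 association above); throttled_time is
the one figure that could stand alone in cpuacct.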

> --
>        Three Cheers,
>        Balbir
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-22  3:17   ` Balbir Singh
@ 2011-02-23  8:05     ` Paul Turner
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-23  8:05 UTC (permalink / raw
  To: balbir
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen

On Mon, Feb 21, 2011 at 7:17 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> * Paul Turner <pjt@google.com> [2011-02-15 19:18:37]:
>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +/* maintain hierarchal task counts on group entities */
>> +static void account_hier_tasks(struct sched_entity *se, int delta)
>
> I don't like the use of hier, I'd expand it to hierarchical
>

Sure

> I am not too sure about the RT bits, but other than that
>
>
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>
>
> --
>        Three Cheers,
>        Balbir
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-23  7:56       ` Paul Turner
@ 2011-02-23  8:31         ` Bharata B Rao
  0 siblings, 0 replies; 71+ messages in thread
From: Bharata B Rao @ 2011-02-23  8:31 UTC (permalink / raw
  To: Paul Turner
  Cc: balbir, Xiao Guangrong, linux-kernel, Dhaval Giani,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Peter Zijlstra, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen

On Tue, Feb 22, 2011 at 11:56:20PM -0800, Paul Turner wrote:
> On Tue, Feb 22, 2011 at 11:51 PM, Balbir Singh
> <balbir@linux.vnet.ibm.com> wrote:
> > * Paul Turner <pjt@google.com> [2011-02-22 23:42:48]:
> >
> >> Thanks for the report Xiao -- I wasn't able to reproduce this yet with
> >> a simple guest, I will try a more modern image tomorrow.
> >>
> >> One suspicion is that this might be connected with the missing
> >> runnable accounting in sched_stoptask.c.
> >>
> >
> > I can confirm that, my guests work fine after the changes posted this
> > morning. I still see some lockdep errors, but none associated with the
> > scheduler.
> >
> 
> Excellent!
> 
> Ok, if this is resolved I'll roll this up and repost tomorrow, thanks!

As I said in an earlier reply, I too saw the lockups in guests. However those
lockups didn't occur consistently. After the sched_stoptask.c changes, I haven't
seen any lockups till now.

BTW, the lockups looked like this for me:

...
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading /lib/kbd/keymaps/i386/qwerty/us.map
BUG: soft lockup - CPU#0 stuck for 61s! [init:1]
Modules linked in:

Pid: 1, comm: init Not tainted (2.6.27.24-170.2.68.fc10.i686 #1) KVM
EIP: 0060:[<c041b93e>] EFLAGS: 00000297 CPU: 0
EIP is at __ticket_spin_lock+0x13/0x19
EAX: c080ff00 EBX: 00000000 ECX: c05751ca EDX: 00008584
ESI: df92c904 EDI: dd700d80 EBP: df819e50 ESP: df819e50
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 0805d3dc CR3: 1d706000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c06ab6cc>] lock_kernel+0x1f/0x2d
 [<c05751dc>] tty_open+0x12/0x2aa
 [<c0494a8c>] ? exact_match+0x0/0x7
 [<c0494dc7>] chrdev_open+0x12b/0x142
 [<c049141e>] __dentry_open+0x10e/0x1fc
 [<c0491593>] nameidata_to_filp+0x1f/0x33
 [<c0494c9c>] ? chrdev_open+0x0/0x142
 [<c049b0b7>] do_filp_open+0x31c/0x611
 [<c0422602>] ? set_next_entity+0x8b/0xf7
 [<c041f81a>] ? need_resched+0x18/0x22
 [<c049123c>] do_sys_open+0x42/0xb7
 [<c04912f3>] sys_open+0x1e/0x26
 [<c0404c8a>] syscall_call+0x7/0xb
 =======================
BUG: soft lockup - CPU#3 stuck for 61s! [plymouthd:562]
Modules linked in:

Pid: 562, comm: plymouthd Not tainted (2.6.27.24-170.2.68.fc10.i686 #1) KVM
EIP: 0060:[<c041b93e>] EFLAGS: 00000293 CPU: 3
EIP is at __ticket_spin_lock+0x13/0x19
EAX: c080ff00 EBX: 00000000 ECX: df469198 EDX: 00008684
ESI: df469198 EDI: df424018 EBP: dd6b9dc0 ESP: dd6b9dc0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: 0804e767 CR3: 1d6b7000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c06ab6cc>] lock_kernel+0x1f/0x2d
 [<c04c827a>] proc_lookup_de+0x15/0xc0
 [<c04c8337>] proc_lookup+0x12/0x17
 [<c04c4934>] proc_root_lookup+0x11/0x2b
 [<c0498fe2>] do_lookup+0xae/0x11e
 [<c049a408>] __link_path_walk+0x57e/0x6b5
 [<c049a92b>] path_walk+0x4c/0x9b
 [<c049ab27>] do_path_lookup+0x12d/0x175
 [<c049abb4>] __path_lookup_intent_open+0x45/0x76
 [<c049abf5>] path_lookup_open+0x10/0x12
 [<c049ae3c>] do_filp_open+0xa1/0x611
 [<c04fadaa>] ? selinux_file_alloc_security+0x22/0x41
 [<c052048c>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c0404cd7>] ? restore_nocheck_notrace+0x0/0xe
 [<c049123c>] do_sys_open+0x42/0xb7
 [<c04912f3>] sys_open+0x1e/0x26
 [<c0404c8a>] syscall_call+0x7/0xb
 =======================
BUG: soft lockup - CPU#2 stuck for 63s! [setfont:559]
Modules linked in:

Pid: 559, comm: setfont Not tainted (2.6.27.24-170.2.68.fc10.i686 #1) KVM
EIP: 0060:[<c053a892>] EFLAGS: 00010283 CPU: 2
EIP is at vgacon_do_font_op+0x177/0x3ec
EAX: c095a1f8 EBX: df904000 ECX: 00000001 EDX: 00000043
ESI: 00000004 EDI: c095a1f4 EBP: dd743e0c ESP: dd743df0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 0826d000 CR3: 1d73c000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c053abac>] vgacon_font_set+0x59/0x208
 [<c04427d8>] ? down+0x2b/0x2f
 [<c053ab53>] ? vgacon_font_set+0x0/0x208
 [<c057f659>] con_font_op+0x15c/0x378
 [<c041f81a>] ? need_resched+0x18/0x22
 [<c057a80f>] vt_ioctl+0x1338/0x14fd
 [<c04fa419>] ? inode_has_perm+0x5b/0x65
 [<c05794d7>] ? vt_ioctl+0x0/0x14fd
 [<c0574b88>] tty_ioctl+0x665/0x6cf
 [<c04fa689>] ? file_has_perm+0x7b/0x84
 [<c0574523>] ? tty_ioctl+0x0/0x6cf
 [<c049c74a>] vfs_ioctl+0x22/0x69
 [<c049c9cc>] do_vfs_ioctl+0x23b/0x247
 [<c04fa7ac>] ? selinux_file_ioctl+0x35/0x38
 [<c049ca18>] sys_ioctl+0x40/0x5c
 [<c0404c8a>] syscall_call+0x7/0xb
 =======================
BUG: soft lockup - CPU#1 stuck for 63s! [loadkeys:560]
Modules linked in:

Pid: 560, comm: loadkeys Not tainted (2.6.27.24-170.2.68.fc10.i686 #1) KVM
EIP: 0060:[<c06ab54e>] EFLAGS: 00000282 CPU: 1
EIP is at _spin_unlock_irqrestore+0x2d/0x38
EAX: 00000282 EBX: 00000282 ECX: dd5b7f80 EDX: 00000282
ESI: c096f740 EDI: c096f7fc EBP: c08a5f78 ESP: c08a5f74
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 8005003b CR2: 009f0210 CR3: 1d73b000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<c0593456>] serial8250_handle_port+0x220/0x230
 [<c059114e>] ? serial_in+0x5a/0x61
 [<c05934af>] serial8250_interrupt+0x49/0xc9
 [<c0465313>] handle_IRQ_event+0x2f/0x64
 [<c04662b5>] handle_edge_irq+0xb2/0xf4
 [<c0466203>] ? handle_edge_irq+0x0/0xf4
 [<c0406e6e>] do_IRQ+0xc7/0xfe
 [<c0405668>] common_interrupt+0x28/0x30
 [<c06ab54e>] ? _spin_unlock_irqrestore+0x2d/0x38
 [<c058ef7d>] uart_start+0x4e/0x53
 [<c058fa1e>] uart_write+0xce/0xd9
 [<c0575d88>] write_chan+0x1e5/0x2b0
 [<c0428218>] ? default_wake_function+0x0/0xd
 [<c05742c9>] tty_write+0x155/0x1d5
 [<c0575ba3>] ? write_chan+0x0/0x2b0
 [<c05743a9>] redirected_tty_write+0x60/0x6d
 [<c0574349>] ? redirected_tty_write+0x0/0x6d
 [<c04930d6>] vfs_write+0x84/0xdf
 [<c04931ca>] sys_write+0x3b/0x60
 [<c0404c8a>] syscall_call+0x7/0xb
 =======================
input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
Setting up hotplug.
Creating block device nodes.
Creating character device nodes.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-23  8:03         ` Paul Turner
@ 2011-02-23 10:13           ` Balbir Singh
  0 siblings, 0 replies; 71+ messages in thread
From: Balbir Singh @ 2011-02-23 10:13 UTC (permalink / raw
  To: Paul Turner
  Cc: Bharata B Rao, linux-kernel, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Peter Zijlstra, Pavel Emelyanov,
	Herbert Poetzl, Avi Kivity, Chris Friesen, Nikhil Rao

* Paul Turner <pjt@google.com> [2011-02-23 00:03:42]:

> On Mon, Feb 21, 2011 at 8:40 PM, Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > * Bharata B Rao <bharata@linux.vnet.ibm.com> [2011-02-22 09:43:33]:
> >
> >> >
> >> > Should we consider integrating this in cpuacct, it would be difficult
> >> > if we spill over stats between controllers.
> >>
> >> Given that cpuacct controller can be mounted independently, I am not sure
> >> if we should integrate these stats. These stats come from cpu controller.
> >
> > The accounting controller was created to account. I'd still prefer
> > cpuacct, so that I can find everything in one place. NOTE: cpuacct was
> > created so that we do accounting without control - just account. I think
> > splitting stats creates a usability mess - no?
> >
> 
> One problem with rolling it into cpuacct is that some of the
> statistics have a 1:1 association with the hierarchy being throttled.
> For example, the number of periods in which throttling occurred or the
> count of elapsed periods.
> 
> If it were rolled into cpuacct the only meaningful export would be the
> total throttled time -- perhaps this is sufficient?
>

Good point, let's keep it as is. nr_throttled and nr_periods are also
important (although they can be derived).

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
  2011-02-18  7:19   ` Balbir Singh
@ 2011-02-23 12:23   ` Peter Zijlstra
  2011-02-23 13:32   ` Peter Zijlstra
  2 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-23 12:23 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:

> +static inline const struct cpumask *sched_bw_period_mask(void)
> +{
> +       return cpu_rq(smp_processor_id())->rd->span;
> +}

Some day I'm going to bite the bullet and merge part of cpusets into
this controller and have Balbir take the node part into the memcg thing
and then make (cpu||memcg) ^ cpuset

Feh.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
  2011-02-22  3:17   ` Balbir Singh
  2011-02-23  2:02   ` Hidetoshi Seto
@ 2011-02-23 13:32   ` Peter Zijlstra
  2011-02-25  3:25     ` Paul Turner
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-23 13:32 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen

On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:

> @@ -1846,6 +1846,11 @@ static const struct sched_class rt_sched
>  
>  #include "sched_stats.h"
>  
> +static void mod_nr_running(struct rq *rq, long delta)
> +{
> +	rq->nr_running += delta;
> +}

I personally don't see much use in such trivial wrappers.. if you're
going to rework all the nr_running stuff you might as well remove all of
that.

> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -81,6 +81,8 @@ unsigned int normalized_sysctl_sched_wak
>  
>  const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
>  
> +static void account_hier_tasks(struct sched_entity *se, int delta);
> +
>  /*
>   * The exponential sliding  window over which load is averaged for shares
>   * distribution.
> @@ -933,6 +935,40 @@ static inline void update_entity_shares_
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/* maintain hierarchal task counts on group entities */
> +static void account_hier_tasks(struct sched_entity *se, int delta)
> +{
> +	struct rq *rq = rq_of(cfs_rq_of(se));
> +	struct cfs_rq *cfs_rq;
> +
> +	for_each_sched_entity(se) {
> +		/* a throttled entity cannot affect its parent hierarchy */
> +		if (group_cfs_rq(se) && cfs_rq_throttled(group_cfs_rq(se)))
> +			break;
> +
> +		/* we affect our queuing entity */
> +		cfs_rq = cfs_rq_of(se);
> +		cfs_rq->h_nr_tasks += delta;
> +	}
> +
> +	/* account for global nr_running delta to hierarchy change */
> +	if (!se)
> +		mod_nr_running(rq, delta);
> +}
> +#else
> +/*
> + * In the absence of group throttling, all operations are guaranteed to be
> + * globally visible at the root rq level.
> + */
> +static void account_hier_tasks(struct sched_entity *se, int delta)
> +{
> +	struct rq *rq = rq_of(cfs_rq_of(se));
> +
> +	mod_nr_running(rq, delta);
> +}
> +#endif

While Balbir suggested expanding the _hier_ thing, I'd suggest simply
dropping it altogether, way too much typing ;-), but that only holds if you
cannot get rid of the extra hierarchy iteration, see below.

> +
>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
>  #ifdef CONFIG_SCHEDSTATS
> @@ -1428,6 +1464,7 @@ enqueue_task_fair(struct rq *rq, struct 
>  		update_cfs_shares(cfs_rq);
>  	}
>  
> +	account_hier_tasks(&p->se, 1);
>  	hrtick_update(rq);
>  }
>  
> @@ -1461,6 +1498,7 @@ static void dequeue_task_fair(struct rq 
>  		update_cfs_shares(cfs_rq);
>  	}
>  
> +	account_hier_tasks(&p->se, -1);
>  	hrtick_update(rq);
>  }
>  
> @@ -1488,6 +1526,8 @@ static u64 tg_request_cfs_quota(struct t
>  	return delta;
>  }
>  
> +static void account_hier_tasks(struct sched_entity *se, int delta);
> +
>  static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct sched_entity *se;
> @@ -1507,6 +1547,7 @@ static void throttle_cfs_rq(struct cfs_r
>  	if (!se->on_rq)
>  		goto out_throttled;
>  
> +	account_hier_tasks(se, -cfs_rq->h_nr_tasks);
>  	for_each_sched_entity(se) {
>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
> @@ -1541,6 +1582,7 @@ static void unthrottle_cfs_rq(struct cfs
>  	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
>  
>  	cfs_rq->throttled = 0;
> +	account_hier_tasks(se, cfs_rq->h_nr_tasks);
>  	for_each_sched_entity(se) {
>  		if (se->on_rq)
>  			break;

All call-sites are right next to a for_each_sched_entity() iteration, is
there really no way to fold those loops?

> Index: tip/kernel/sched_rt.c
> ===================================================================
> --- tip.orig/kernel/sched_rt.c
> +++ tip/kernel/sched_rt.c
> @@ -906,6 +906,8 @@ enqueue_task_rt(struct rq *rq, struct ta
>  
>  	if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
>  		enqueue_pushable_task(rq, p);
> +
> +	inc_nr_running(rq);
>  }
>  
>  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
> @@ -916,6 +918,8 @@ static void dequeue_task_rt(struct rq *r
>  	dequeue_rt_entity(rt_se);
>  
>  	dequeue_pushable_task(rq, p);
> +
> +	dec_nr_running(rq);
>  }
>  
>  /*
> @@ -1783,4 +1787,3 @@ static void print_rt_stats(struct seq_fi
>  	rcu_read_unlock();
>  }
>  #endif /* CONFIG_SCHED_DEBUG */


You mentioned something about -rt having the same problem, yet you don't
fix it.. tskkk :-)


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics Paul Turner
  2011-02-22  3:14   ` Balbir Singh
@ 2011-02-23 13:32   ` Peter Zijlstra
  2011-02-25  3:26     ` Paul Turner
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-23 13:32 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> +       raw_spin_lock(&cfs_b->lock);
> +       cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
> +       raw_spin_unlock(&cfs_b->lock); 

That seems to put the cost of things on the wrong side. Read is rare,
update is frequent, and you made the frequent thing the most expensive
one.
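
One way to flip that cost around -- purely an illustrative userspace
sketch, not kernel code, with made-up names -- is to accumulate the
throttled time per cpu without any shared lock and only fold the
contributions together on the rare cpu.stat read:

#include <stdint.h>

#define NR_CPUS 8

static uint64_t per_cpu_throttled_time[NR_CPUS];

/* hot path: runs every time a cfs_rq on this cpu leaves the throttled state */
static void account_throttled(int cpu, uint64_t now, uint64_t throttled_timestamp)
{
	per_cpu_throttled_time[cpu] += now - throttled_timestamp;	/* no shared lock */
}

/* cold path: the cpu.stat read sums the per-cpu contributions */
static uint64_t total_throttled_time(void)
{
	uint64_t sum = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += per_cpu_throttled_time[cpu];
	return sum;
}

int main(void)
{
	account_throttled(0, 2000, 1500);	/* 500 ns throttled on cpu 0 */
	account_throttled(1, 9000, 8000);	/* 1000 ns throttled on cpu 1 */
	return total_throttled_time() == 1500 ? 0 : 1;
}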



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
  2011-02-18  7:19   ` Balbir Singh
  2011-02-23 12:23   ` Peter Zijlstra
@ 2011-02-23 13:32   ` Peter Zijlstra
  2011-02-24  7:04     ` Bharata B Rao
  2011-02-26  0:02     ` Paul Turner
  2 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-23 13:32 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:

> +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +	struct rq *rq = rq_of(cfs_rq);
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +
> +	update_rq_clock(rq);
> +	/* (Try to) avoid maintaining share statistics for idle time */
> +	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;

Ok, so here you try to compensate for some of the weirdness from the
previous patch.. wouldn't it be much saner to fully consider the
throttled things dequeued for the load calculation etc.?

> +
> +	cfs_rq->throttled = 0;
> +	for_each_sched_entity(se) {
> +		if (se->on_rq)
> +			break;
> +
> +		cfs_rq = cfs_rq_of(se);
> +		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> +		if (cfs_rq_throttled(cfs_rq))
> +			break;

That's just weird, it was throttled, you enqueued it but find it
throttled.

> +	}
> +
> +	/* determine whether we need to wake up potentally idle cpu */

SP: potentially, also isn't there a determiner missing?

> +	if (rq->curr == rq->idle && rq->cfs.nr_running)
> +		resched_task(rq->curr);
> +}
> +
>  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>  		unsigned long delta_exec)
>  {
> @@ -1535,8 +1569,46 @@ static void account_cfs_rq_quota(struct 
>  
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>  {
> -	return 1;
> +	int i, idle = 1;
> +	u64 delta;
> +	const struct cpumask *span;
> +
> +	if (cfs_b->quota == RUNTIME_INF)
> +		return 1;
> +
> +	/* reset group quota */
> +	raw_spin_lock(&cfs_b->lock);
> +	cfs_b->runtime = cfs_b->quota;

Shouldn't that be something like:

cfs_b->runtime = 
   min(cfs_b->runtime + overrun * cfs_b->quota, cfs_b->quota);

afaict runtime can go negative, in which case we need to compensate for
that, but we should never end up with more than quota; since we allow for
overcommit, not limiting things would allow us to accrue an unlimited
amount of runtime.

Or can only the per-cpu quota muck go negative? In that case it should
probably be propagated back into the global bw on throttle, otherwise
you can get deficits on CPUs that remain unused for a while.
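
To make the suggested clamping concrete, a standalone toy model (signed
arithmetic, made-up millisecond values; the real code would work in ns
under cfs_b->lock):

#include <stdio.h>
#include <stdint.h>

/* pay off any debt, add one quota per elapsed period, never exceed one quota */
static int64_t replenish(int64_t runtime, int64_t quota, int64_t overrun)
{
	int64_t new_runtime = runtime + overrun * quota;

	return new_runtime > quota ? quota : new_runtime;
}

int main(void)
{
	printf("%lld\n", (long long)replenish(-10, 50, 1));	/* 40: debt repaid first */
	printf("%lld\n", (long long)replenish(-10, 50, 3));	/* 50: clamped at quota */
	return 0;
}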

> +	raw_spin_unlock(&cfs_b->lock);
> +
> +	span = sched_bw_period_mask();
> +	for_each_cpu(i, span) {
> +		struct rq *rq = cpu_rq(i);
> +		struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
> +
> +		if (cfs_rq->nr_running)
> +			idle = 0;
> +
> +		if (!cfs_rq_throttled(cfs_rq))
> +			continue;
> +
> +		delta = tg_request_cfs_quota(cfs_rq->tg);
> +
> +		if (delta) {
> +			raw_spin_lock(&rq->lock);
> +			cfs_rq->quota_assigned += delta;
> +
> +			/* avoid race with tg_set_cfs_bandwidth */

*what* race, and *how*

> +			if (cfs_rq_throttled(cfs_rq) &&
> +			     cfs_rq->quota_used < cfs_rq->quota_assigned)
> +				unthrottle_cfs_rq(cfs_rq);
> +			raw_spin_unlock(&rq->lock);
> +		}
> +	}
> +
> +	return idle;
>  }

This whole positive quota muck makes my head hurt, whatever did you do
that for? Also it doesn't deal with wrapping, which admittedly won't
really happen but still.




^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
  2011-02-18  6:52   ` Balbir Singh
@ 2011-02-23 13:32   ` Peter Zijlstra
  2011-02-24  5:21     ` Bharata B Rao
  2011-02-25  3:10     ` Paul Turner
  2011-03-02  7:23   ` Bharata B Rao
  2 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-23 13:32 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:

> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> +{
> +	return cfs_rq->throttled;
> +}
> +
> +/* it's possible to be 'on_rq' in a dequeued (e.g. throttled) hierarchy */
> +static inline int entity_on_rq(struct sched_entity *se)
> +{
> +	for_each_sched_entity(se)
> +		if (!se->on_rq)
> +			return 0;

Please add block braces over multi line stmts even if not strictly
needed.

> +
> +	return 1;
> +}


> @@ -761,7 +788,11 @@ static void update_cfs_load(struct cfs_r
>  	u64 now, delta;
>  	unsigned long load = cfs_rq->load.weight;
>  
> -	if (cfs_rq->tg == &root_task_group)
> +	/*
> +	 * Don't maintain averages for the root task group, or while we are
> +	 * throttled.
> +	 */
> +	if (cfs_rq->tg == &root_task_group || cfs_rq_throttled(cfs_rq))
>  		return;
>  
>  	now = rq_of(cfs_rq)->clock_task;

Placing the return there avoids updating the timestamps, so once we get
unthrottled we'll observe a very long period and skew the load avg?

Ideally we'd never call this on throttled groups to begin with and
handle them as full dequeue/enqueue-like events.

> @@ -1015,6 +1046,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
> +	     !group_cfs_rq(se)->nr_running))
> +		return;
> +#endif
> +
>  	update_cfs_load(cfs_rq, 0);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);
> @@ -1087,6 +1126,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>  	 */
>  	update_curr(cfs_rq);
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	if (!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)))
> +		return;
> +#endif
> +
>  	update_stats_dequeue(cfs_rq, se);
>  	if (flags & DEQUEUE_SLEEP) {
>  #ifdef CONFIG_SCHEDSTATS

These make me very nervous: on enqueue you bail after adding
min_vruntime to ->vruntime and calling update_curr(), but on dequeue you
bail before subtracting min_vruntime from ->vruntime.

> @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct 
>  			break;
>  		cfs_rq = cfs_rq_of(se);
>  		enqueue_entity(cfs_rq, se, flags);
> +		/* don't continue to enqueue if our parent is throttled */
> +		if (cfs_rq_throttled(cfs_rq))
> +			break;
>  		flags = ENQUEUE_WAKEUP;
>  	}
>  
> @@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq 
>  		cfs_rq = cfs_rq_of(se);
>  		dequeue_entity(cfs_rq, se, flags);
>  
> -		/* Don't dequeue parent if it has other entities besides us */
> -		if (cfs_rq->load.weight)
> +		/*
> +		 * Don't dequeue parent if it has other entities besides us,
> +		 * or if it is throttled
> +		 */
> +		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
>  			break;
>  		flags |= DEQUEUE_SLEEP;
>  	}

How could we even be running if our parent was throttled?

> @@ -1430,6 +1480,42 @@ static u64 tg_request_cfs_quota(struct t
>  	return delta;
>  }
>  
> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> +{
> +	struct sched_entity *se;
> +
> +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> +
> +	/* account load preceeding throttle */

My spell checker thinks that should be written as: preceding.

> +	update_cfs_load(cfs_rq, 0);
> +
> +	/* prevent previous buddy nominations from re-picking this se */
> +	clear_buddies(cfs_rq_of(se), se);
> +
> +	/*
> +	 * It's possible for the current task to block and re-wake before task
> +	 * switch, leading to a throttle within enqueue_task->update_curr()
> +	 * versus an an entity that has not technically been enqueued yet.

I'm not quite seeing how this would happen.. care to expand on this?

> +	 * In this case, since we haven't actually done the enqueue yet, cut
> +	 * out and allow enqueue_entity() to short-circuit
> +	 */
> +	if (!se->on_rq)
> +		goto out_throttled;
> +
> +	for_each_sched_entity(se) {
> +		struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +
> +		dequeue_entity(cfs_rq, se, 1);
> +		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
> +			break;
> +	}
> +
> +out_throttled:
> +	cfs_rq->throttled = 1;
> +	update_cfs_rq_load_contribution(cfs_rq, 1);
> +}
> +
>  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>  		unsigned long delta_exec)
>  {
> @@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct 
>  
>  	cfs_rq->quota_used += delta_exec;
>  
> -	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> +	if (cfs_rq_throttled(cfs_rq) ||
> +		cfs_rq->quota_used < cfs_rq->quota_assigned)
>  		return;

So we are throttled but running anyway, I suppose this comes from the PI
ceiling muck?

>  	cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> +
> +	if (cfs_rq->quota_used >= cfs_rq->quota_assigned) {
> +		throttle_cfs_rq(cfs_rq);
> +		resched_task(cfs_rq->rq->curr);
> +	}
>  }
>  
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
> @@ -1941,6 +2033,12 @@ static void check_preempt_wakeup(struct 
>  	if (unlikely(se == pse))
>  		return;
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	/* avoid pre-emption check/buddy nomination for throttled tasks */

Somehow my spell checker doesn't like that hyphen.

> +	if (!entity_on_rq(pse))
> +		return;
> +#endif

Ideally that #ifdef'ery would go away too. 

>  	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
>  		set_next_buddy(pse);
>  
> @@ -2060,7 +2158,8 @@ static bool yield_to_task_fair(struct rq
>  {
>  	struct sched_entity *se = &p->se;
>  
> -	if (!se->on_rq)
> +	/* ensure entire hierarchy is on rq (e.g. running & not throttled) */
> +	if (!entity_on_rq(se))
>  		return false;

like here..

>  	/* Tell the scheduler that we'd really like pse to run next. */
> @@ -2280,7 +2379,8 @@ static void update_shares(int cpu)
>  
>  	rcu_read_lock();
>  	for_each_leaf_cfs_rq(rq, cfs_rq)
> -		update_shares_cpu(cfs_rq->tg, cpu);
> +		if (!cfs_rq_throttled(cfs_rq))
> +			update_shares_cpu(cfs_rq->tg, cpu);

This wants extra braces

>  	rcu_read_unlock();
>  }
>  
> @@ -2304,9 +2404,10 @@ load_balance_fair(struct rq *this_rq, in
>  		u64 rem_load, moved_load;
>  
>  		/*
> -		 * empty group
> +		 * empty group or throttled cfs_rq
>  		 */
> -		if (!busiest_cfs_rq->task_weight)
> +		if (!busiest_cfs_rq->task_weight ||
> +				cfs_rq_throttled(busiest_cfs_rq))
>  			continue;
>  
>  		rem_load = (u64)rem_load_move * busiest_weight;
> 
> 


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage Paul Turner
  2011-02-16 17:45   ` Balbir Singh
@ 2011-02-23 13:32   ` Peter Zijlstra
  2011-02-25  3:33     ` Paul Turner
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-23 13:32 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:

> @@ -609,6 +631,9 @@ static void update_curr(struct cfs_rq *c
>  		cpuacct_charge(curtask, delta_exec);
>  		account_group_exec_runtime(curtask, delta_exec);
>  	}
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	account_cfs_rq_quota(cfs_rq, delta_exec);
> +#endif
>  }

Not too hard to make the #ifdef'ery go away I'd guess.

>  static inline void
> @@ -1382,6 +1407,43 @@ static void dequeue_task_fair(struct rq 
>  }
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> +static u64 tg_request_cfs_quota(struct task_group *tg)
> +{
> +	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> +	u64 delta = 0;
> +
> +	if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> +		raw_spin_lock(&cfs_b->lock);
> +		/*
> +		 * it's possible a bandwidth update has changed the global
> +		 * pool.
> +		 */
> +		if (cfs_b->quota == RUNTIME_INF)
> +			delta = sched_cfs_bandwidth_slice();

Why do we bother at all when there's infinite time? Shouldn't the action
that sets it to infinite also set cfs_rq->quota_assigned to
RUNTIME_INF, in which case the below check will make it all go away?

> +		else {
> +			delta = min(cfs_b->runtime,
> +					sched_cfs_bandwidth_slice());
> +			cfs_b->runtime -= delta;
> +		}
> +		raw_spin_unlock(&cfs_b->lock);
> +	}
> +	return delta;
> +}

Also, shouldn't this all try to steal time from other cpus when the
global limit runs out? Suppose you have say 24 cpus, and you had a short
burst where all 24 cpus had some runtime, so you distribute 240ms. But
23 of those cpus only ran for 0.5ms, leaving 9.5ms of unused time on
each of those 23 cpus while your one active cpu will then throttle.

I would much rather see all the accounting tight first and optimize
later than start with leaky stuff and try to plug the holes afterwards.
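
The leak in numbers, as a throwaway sketch (values straight from the
example above, 10ms being the default slice):

#include <stdio.h>

int main(void)
{
	double slice = 10.0, quota = 240.0;	/* ms */
	int idle_cpus = 23;
	double idle_used = 0.5;			/* ms each idle cpu actually ran */

	double handed_out = 24 * slice;				/* 240.0: global pool now empty */
	double stranded = idle_cpus * (slice - idle_used);	/* 218.5 ms stuck on idle cpus */
	double ran = quota - stranded;				/* 21.5 ms of real usage */

	printf("handed out %.1fms, stranded %.1fms, only %.1fms run before throttling\n",
	       handed_out, stranded, ran);
	return 0;
}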

> +
> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> +		unsigned long delta_exec)
> +{
> +	if (cfs_rq->quota_assigned == RUNTIME_INF)
> +		return;
> +
> +	cfs_rq->quota_used += delta_exec;
> +
> +	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> +		return;
> +
> +	cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> +}

So why isn't this hierarchical? Also, all this positive quota stuff
looks weird; why not decrement and try to supplement when negative?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
  2011-02-16 16:52   ` Balbir Singh
@ 2011-02-23 13:32   ` Peter Zijlstra
  2011-02-25  3:11     ` Paul Turner
  2011-02-25 20:53     ` Paul Turner
  1 sibling, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-23 13:32 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:

> @@ -245,6 +248,15 @@ struct cfs_rq;
>  
>  static LIST_HEAD(task_groups);
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +struct cfs_bandwidth {
> +	raw_spinlock_t		lock;
> +	ktime_t			period;
> +	u64			runtime, quota;
> +	struct hrtimer		period_timer;
> +};
> +#endif

If you write that as:

struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
	...
#endif
};

>  /* task group related information */
>  struct task_group {
>  	struct cgroup_subsys_state css;
> @@ -276,6 +288,10 @@ struct task_group {
>  #ifdef CONFIG_SCHED_AUTOGROUP
>  	struct autogroup *autogroup;
>  #endif
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	struct cfs_bandwidth cfs_bandwidth;
> +#endif
>  };

You can avoid the #ifdef'ery here

>  /* task_group_lock serializes the addition/removal of task groups */
> @@ -370,9 +386,76 @@ struct cfs_rq {

> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
> +
> +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
> +{
> +	struct cfs_bandwidth *cfs_b =
> +		container_of(timer, struct cfs_bandwidth, period_timer);
> +	ktime_t now;
> +	int overrun;
> +	int idle = 0;
> +
> +	for (;;) {
> +		now = hrtimer_cb_get_time(timer);
> +		overrun = hrtimer_forward(timer, now, cfs_b->period);
> +
> +		if (!overrun)
> +			break;
> +
> +		idle = do_sched_cfs_period_timer(cfs_b, overrun);
> +	}
> +
> +	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
> +}
> +
> +static
> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
> +{
> +	raw_spin_lock_init(&cfs_b->lock);
> +	cfs_b->quota = cfs_b->runtime = quota;
> +	cfs_b->period = ns_to_ktime(period);
> +
> +	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	cfs_b->period_timer.function = sched_cfs_period_timer;
> +}
> +
> +static
> +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
> +{
> +	cfs_rq->quota_used = 0;
> +	if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
> +		cfs_rq->quota_assigned = RUNTIME_INF;
> +	else
> +		cfs_rq->quota_assigned = 0;
> +}
> +
> +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +	if (cfs_b->quota == RUNTIME_INF)
> +		return;
> +
> +	if (hrtimer_active(&cfs_b->period_timer))
> +		return;
> +
> +	raw_spin_lock(&cfs_b->lock);
> +	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
> +	raw_spin_unlock(&cfs_b->lock);
> +}
> +
> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> +{
> +	hrtimer_cancel(&cfs_b->period_timer);
> +}
> +#endif

and #else

stubs
#endif
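
i.e. roughly the following (a sketch only; the declarations are the ones
from the quoted patch, the point is just where the #ifdef lives):

struct cfs_bandwidth {
#ifdef CONFIG_CFS_BANDWIDTH
	raw_spinlock_t		lock;
	ktime_t			period;
	u64			runtime, quota;
	struct hrtimer		period_timer;
#endif
};

#ifdef CONFIG_CFS_BANDWIDTH
/* ... real implementations as in the patch ... */
#else
static void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period) {}
static void init_cfs_rq_quota(struct cfs_rq *cfs_rq) {}
static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {}
#endif

/* callers below can then lose their #ifdef CONFIG_CFS_BANDWIDTH blocks */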

>  /* Real-Time classes' related field in a runqueue: */
>  struct rt_rq {
>  	struct rt_prio_array active;
> @@ -8038,6 +8121,9 @@ static void init_tg_cfs_entry(struct tas
>  	tg->cfs_rq[cpu] = cfs_rq;
>  	init_cfs_rq(cfs_rq, rq);
>  	cfs_rq->tg = tg;
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	init_cfs_rq_quota(cfs_rq);
> +#endif

also avoids #ifdef'ery here

>  	tg->se[cpu] = se;
>  	/* se could be NULL for root_task_group */
> @@ -8173,6 +8259,10 @@ void __init sched_init(void)
>  		 * We achieve this by letting root_task_group's tasks sit
>  		 * directly in rq->cfs (i.e root_task_group->se[] = NULL).
>  		 */
> +#ifdef CONFIG_CFS_BANDWIDTH
> +		init_cfs_bandwidth(&root_task_group.cfs_bandwidth,
> +				RUNTIME_INF, sched_cfs_bandwidth_period);
> +#endif

and here

>  		init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  
> @@ -8415,6 +8505,10 @@ static void free_fair_sched_group(struct
>  {
>  	int i;
>  
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	destroy_cfs_bandwidth(&tg->cfs_bandwidth);
> +#endif

and here

>  	for_each_possible_cpu(i) {
>  		if (tg->cfs_rq)
>  			kfree(tg->cfs_rq[i]);
> @@ -8442,7 +8536,10 @@ int alloc_fair_sched_group(struct task_g
>  		goto err;
>  
>  	tg->shares = NICE_0_LOAD;
> -
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
> +			sched_cfs_bandwidth_period);
> +#endif

and here

>  	for_each_possible_cpu(i) {
>  		rq = cpu_rq(i);
>  

> @@ -9107,6 +9204,116 @@ static u64 cpu_shares_read_u64(struct cg
>  
>  	return (u64) tg->shares;
>  }
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +{
> +	int i;
> +	static DEFINE_MUTEX(mutex);
> +
> +	if (tg == &root_task_group)
> +		return -EINVAL;
> +
> +	if (!period)
> +		return -EINVAL;
> +
> +	/*
> +	 * Ensure we have at least one tick of bandwidth every period.  This is
> +	 * to prevent reaching a state of large arrears when throttled via
> +	 * entity_tick() resulting in prolonged exit starvation.
> +	 */
> +	if (NS_TO_JIFFIES(quota) < 1)
> +		return -EINVAL;
> +
> +	mutex_lock(&mutex);
> +	raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
> +	tg->cfs_bandwidth.period = ns_to_ktime(period);
> +	tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
> +	raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
> +
> +	for_each_possible_cpu(i) {
> +		struct cfs_rq *cfs_rq = tg->cfs_rq[i];
> +		struct rq *rq = rq_of(cfs_rq);
> +
> +		raw_spin_lock_irq(&rq->lock);
> +		init_cfs_rq_quota(cfs_rq);
> +		raw_spin_unlock_irq(&rq->lock);

Any particular reason you didn't mirror rt_rq->rt_runtime_lock?

> +	}
> +	mutex_unlock(&mutex);
> +
> +	return 0;
> +}


> Index: tip/kernel/sched_fair.c
> ===================================================================
> --- tip.orig/kernel/sched_fair.c
> +++ tip/kernel/sched_fair.c
> @@ -88,6 +88,15 @@ const_debug unsigned int sysctl_sched_mi
>   */
>  unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>  
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +/*
> + * default period for cfs group bandwidth.
> + * default: 0.5s, units: nanoseconds
> + */
> +static u64 sched_cfs_bandwidth_period = 500000000ULL;
> +#endif
> +
>  static const struct sched_class fair_sched_class;
>  
>  /**************************************************************
> @@ -397,6 +406,9 @@ static void __enqueue_entity(struct cfs_
>  
>  	rb_link_node(&se->run_node, parent, link);
>  	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
> +#endif
>  }

This really needs to live elsewhere, __*_entity() functions are for
rb-tree muck.
 
>  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-23 13:32   ` Peter Zijlstra
@ 2011-02-24  5:21     ` Bharata B Rao
  2011-02-24 11:05       ` Peter Zijlstra
  2011-02-25  3:10     ` Paul Turner
  1 sibling, 1 reply; 71+ messages in thread
From: Bharata B Rao @ 2011-02-24  5:21 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Nikhil Rao

Hi Peter,

I will only answer a couple of your questions and let Paul clarify the rest...

On Wed, Feb 23, 2011 at 02:32:13PM +0100, Peter Zijlstra wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> 
> 
> > @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct 
> >  			break;
> >  		cfs_rq = cfs_rq_of(se);
> >  		enqueue_entity(cfs_rq, se, flags);
> > +		/* don't continue to enqueue if our parent is throttled */
> > +		if (cfs_rq_throttled(cfs_rq))
> > +			break;
> >  		flags = ENQUEUE_WAKEUP;
> >  	}
> >  
> > @@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq 
> >  		cfs_rq = cfs_rq_of(se);
> >  		dequeue_entity(cfs_rq, se, flags);
> >  
> > -		/* Don't dequeue parent if it has other entities besides us */
> > -		if (cfs_rq->load.weight)
> > +		/*
> > +		 * Don't dequeue parent if it has other entities besides us,
> > +		 * or if it is throttled
> > +		 */
> > +		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
> >  			break;
> >  		flags |= DEQUEUE_SLEEP;
> >  	}
> 
> How could we even be running if our parent was throttled?

The task isn't actually running. One of its parents up in the hierarchy has
been throttled and already dequeued. Now this task sits on its immediate
parent's runqueue, which isn't throttled, but it isn't really running either
since the hierarchy is throttled. In this situation, the load balancer can
try to pull this task. When that happens, the load balancer tries to dequeue
it and this check will ensure that we don't attempt to dequeue a group
entity in our hierarchy which has already been dequeued.

> > @@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct 
> >  
> >  	cfs_rq->quota_used += delta_exec;
> >  
> > -	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> > +	if (cfs_rq_throttled(cfs_rq) ||
> > +		cfs_rq->quota_used < cfs_rq->quota_assigned)
> >  		return;
> 
> So we are throttled but running anyway, I suppose this comes from the PI
> ceiling muck?

When a cfs_rq is throttled, its representative se (and all its parent
se's) get dequeued and the task is marked for resched. But the task entity is
still on its throttled parent's cfs_rq (=> task->se.on_rq = 1). Next, during
put_prev_task_fair(), we enqueue the task back on its throttled parent's
cfs_rq, at which time we end up calling update_curr() on the throttled cfs_rq.
This check would help us bail out of that situation.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-02-23 13:32   ` Peter Zijlstra
@ 2011-02-24  7:04     ` Bharata B Rao
  2011-02-24 11:14       ` Peter Zijlstra
  2011-02-26  0:02     ` Paul Turner
  1 sibling, 1 reply; 71+ messages in thread
From: Bharata B Rao @ 2011-02-24  7:04 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Wed, Feb 23, 2011 at 02:32:12PM +0100, Peter Zijlstra wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> 
> > +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > +{
> > +	struct rq *rq = rq_of(cfs_rq);
> > +	struct sched_entity *se;
> > +
> > +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > +
> > +	update_rq_clock(rq);
> > +	/* (Try to) avoid maintaining share statistics for idle time */
> > +	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
> 
> Ok, so here you try to compensate for some of the weirdness from the
> previous patch.. wouldn't it be much saner to fully consider the
> throttled things dequeued for the load calculation etc.?
> 
> > +
> > +	cfs_rq->throttled = 0;
> > +	for_each_sched_entity(se) {
> > +		if (se->on_rq)
> > +			break;
> > +
> > +		cfs_rq = cfs_rq_of(se);
> > +		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> > +		if (cfs_rq_throttled(cfs_rq))
> > +			break;
> 
> That's just weird, it was throttled, you enqueued it but find it
> throttled.

The se got enqueued to the cfs_rq, but we find that the cfs_rq is
throttled and hence refrain from enqueueing the cfs_rq any further up.

So essentially enqueueing to a throttled cfs_rq is allowed, but a throttled
group entity can't be enqueued further.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-24  5:21     ` Bharata B Rao
@ 2011-02-24 11:05       ` Peter Zijlstra
  2011-02-24 15:45         ` Bharata B Rao
  2011-02-25  3:41         ` Paul Turner
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-24 11:05 UTC (permalink / raw
  To: bharata
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Nikhil Rao

On Thu, 2011-02-24 at 10:51 +0530, Bharata B Rao wrote:
> Hi Peter,
> 
> I will only answer a couple of your questions and let Paul clarify the rest...
> 
> On Wed, Feb 23, 2011 at 02:32:13PM +0100, Peter Zijlstra wrote:
> > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> > 
> > 
> > > @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct 
> > >  			break;
> > >  		cfs_rq = cfs_rq_of(se);
> > >  		enqueue_entity(cfs_rq, se, flags);
> > > +		/* don't continue to enqueue if our parent is throttled */
> > > +		if (cfs_rq_throttled(cfs_rq))
> > > +			break;
> > >  		flags = ENQUEUE_WAKEUP;
> > >  	}
> > >  
> > > @@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq 
> > >  		cfs_rq = cfs_rq_of(se);
> > >  		dequeue_entity(cfs_rq, se, flags);
> > >  
> > > -		/* Don't dequeue parent if it has other entities besides us */
> > > -		if (cfs_rq->load.weight)
> > > +		/*
> > > +		 * Don't dequeue parent if it has other entities besides us,
> > > +		 * or if it is throttled
> > > +		 */
> > > +		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
> > >  			break;
> > >  		flags |= DEQUEUE_SLEEP;
> > >  	}
> > 
> > How could we even be running if our parent was throttled?
> 
> The task isn't running actually. One of its parents up in the heirarchy has
> been throttled and been already dequeued. Now this task sits on its immediate
> parent's runqueue which isn't throttled but not really running also since
> the hierarchy is throttled. In this situation, load balancer can try to pull
> this task. When that happens, load balancer tries to dequeue it and this
> check will ensure that we don't attempt to dequeue a group entity in our
> hierarchy which has already been dequeued.

That's insane, it's throttled, that means it should be dequeued and
should thus be invisible to the load-balancer. If it is visible the
load-balancer will try to move tasks around to balance load, but all in
vain, it'll move phantom loads around and get most confused at best.

Pure and utter suckage if you ask me.

> > > @@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct 
> > >  
> > >  	cfs_rq->quota_used += delta_exec;
> > >  
> > > -	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> > > +	if (cfs_rq_throttled(cfs_rq) ||
> > > +		cfs_rq->quota_used < cfs_rq->quota_assigned)
> > >  		return;
> > 
> > So we are throttled but running anyway, I suppose this comes from the PI
> > ceiling muck?
> 
> When a cfs_rq is throttled, its representative se (and all its parent
> se's) get dequeued and the task is marked for resched. But the task entity is
> still on its throttled parent's cfs_rq (=> task->se.on_rq = 1). Next during
> put_prev_task_fair(), we enqueue the task back on its throttled parent's
> cfs_rq at which time we end up calling update_curr() on throttled cfs_rq.
> This check would help us bail out from that situation.

But why bother with this early exit? At worst you'll call
tg_request_cfs_quota() in vain, at best you'll find there is runtime
because the period tick just happened on another cpu and you're good to
go, yay!



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-02-24  7:04     ` Bharata B Rao
@ 2011-02-24 11:14       ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-24 11:14 UTC (permalink / raw
  To: bharata
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Thu, 2011-02-24 at 12:34 +0530, Bharata B Rao wrote:
> On Wed, Feb 23, 2011 at 02:32:12PM +0100, Peter Zijlstra wrote:
> > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> > 
> > > +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> > > +{
> > > +	struct rq *rq = rq_of(cfs_rq);
> > > +	struct sched_entity *se;
> > > +
> > > +	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> > > +
> > > +	update_rq_clock(rq);
> > > +	/* (Try to) avoid maintaining share statistics for idle time */
> > > +	cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
> > 
> > Ok, so here you try to compensate for some of the weirdness from the
> > previous patch.. wouldn't it be much saner to fully consider the
> > throttled things dequeued for the load calculation etc.?
> > 
> > > +
> > > +	cfs_rq->throttled = 0;
> > > +	for_each_sched_entity(se) {
> > > +		if (se->on_rq)
> > > +			break;
> > > +
> > > +		cfs_rq = cfs_rq_of(se);
> > > +		enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> > > +		if (cfs_rq_throttled(cfs_rq))
> > > +			break;
> > 
> > That's just weird, it was throttled, you enqueued it but find it
> > throttled.
> 
> se got enqueued to cfs_rq, but we find that cfs_rq is throttled and hence
> refrain from enqueueing cfs_rq futher.
> 
> So essentially enqueing to a throttled cfs_rq is allowed, but a throttled 
> group entitiy can't be enqueued further.

Argh, so this is about the total trainwreck you have for hierarchy
semantics (which you've basically inherited from the RT bits I guess,
which are similarly broken).

Do you want to support per-cgroup throttle periods? If so, we need to sit
down and work this out; if not, the above should not be possible because
each cgroup can basically run from the same refresh timer and everybody
will get throttled at the same time.



^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-24 11:05       ` Peter Zijlstra
@ 2011-02-24 15:45         ` Bharata B Rao
  2011-02-24 15:52           ` Peter Zijlstra
  2011-02-25  3:41         ` Paul Turner
  1 sibling, 1 reply; 71+ messages in thread
From: Bharata B Rao @ 2011-02-24 15:45 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Nikhil Rao

On Thu, Feb 24, 2011 at 12:05:01PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-02-24 at 10:51 +0530, Bharata B Rao wrote:
> > Hi Peter,
> > 
> > I will only answer a couple of your questions and let Paul clarify the rest...
> > 
> > On Wed, Feb 23, 2011 at 02:32:13PM +0100, Peter Zijlstra wrote:
> > > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> > > 
> > > 
> > > > @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct 
> > > >  			break;
> > > >  		cfs_rq = cfs_rq_of(se);
> > > >  		enqueue_entity(cfs_rq, se, flags);
> > > > +		/* don't continue to enqueue if our parent is throttled */
> > > > +		if (cfs_rq_throttled(cfs_rq))
> > > > +			break;
> > > >  		flags = ENQUEUE_WAKEUP;
> > > >  	}
> > > >  
> > > > @@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq 
> > > >  		cfs_rq = cfs_rq_of(se);
> > > >  		dequeue_entity(cfs_rq, se, flags);
> > > >  
> > > > -		/* Don't dequeue parent if it has other entities besides us */
> > > > -		if (cfs_rq->load.weight)
> > > > +		/*
> > > > +		 * Don't dequeue parent if it has other entities besides us,
> > > > +		 * or if it is throttled
> > > > +		 */
> > > > +		if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
> > > >  			break;
> > > >  		flags |= DEQUEUE_SLEEP;
> > > >  	}
> > > 
> > > How could we even be running if our parent was throttled?
> > 
> > The task isn't running actually. One of its parents up in the heirarchy has
> > been throttled and been already dequeued. Now this task sits on its immediate
> > parent's runqueue which isn't throttled but not really running also since
> > the hierarchy is throttled. In this situation, load balancer can try to pull
> > this task. When that happens, load balancer tries to dequeue it and this
> > check will ensure that we don't attempt to dequeue a group entity in our
> > hierarchy which has already been dequeued.
> 
> That's insane, its throttled, that means it should be dequeued and
> should thus invisible for the load-balancer. If it is visible the
> load-balancer will try and move tasks around to balance load, but all in
> vain, it'll move phantom loads around and get most confused at best.

We can't walk the se hierarchy downwards and hence can't really dequeue
the entire hierarchy if any one entity in the hierarchy is throttled.
However, this semantic of keeping the child entities enqueued while
dequeuing the entities upwards of a throttled entity makes our life simple
during unthrottling. We just have to enqueue the entities upwards of the
throttled entity and the rest of the entities downwards automatically become
available.

While I admit that our load balancing semantics wrt throttled entities are
not consistent (we don't allow pulling of tasks directly from throttled
cfs_rqs, while allowing pulling of tasks from a throttled hierarchy as in
the above case), I am beginning to wonder if it works out to be
advantageous. Is there a chance that the task gets to run on another CPU
where the hierarchy isn't throttled since runtime is still available?

> 
> Pure and utter suckage if you ask me.
> 
> > > > @@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct 
> > > >  
> > > >  	cfs_rq->quota_used += delta_exec;
> > > >  
> > > > -	if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> > > > +	if (cfs_rq_throttled(cfs_rq) ||
> > > > +		cfs_rq->quota_used < cfs_rq->quota_assigned)
> > > >  		return;
> > > 
> > > So we are throttled but running anyway, I suppose this comes from the PI
> > > ceiling muck?
> > 
> > When a cfs_rq is throttled, its representative se (and all its parent
> > se's) get dequeued and the task is marked for resched. But the task entity is
> > still on its throttled parent's cfs_rq (=> task->se.on_rq = 1). Next during
> > put_prev_task_fair(), we enqueue the task back on its throttled parent's
> > cfs_rq at which time we end up calling update_curr() on throttled cfs_rq.
> > This check would help us bail out from that situation.
> 
> But why bother with this early exit? At worst you'll call
> tg_request_cfs_quota() in vain, at best you'll find there is runtime
> because the period tick just happened on another cpu and you're good to
> go, yay!

I see your point. I had this check in my version of hard limits patches earlier
for the reason I described above. Lets see if Paul had any other reason
to retain this check.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-24 15:45         ` Bharata B Rao
@ 2011-02-24 15:52           ` Peter Zijlstra
  2011-02-24 16:39             ` Bharata B Rao
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-24 15:52 UTC (permalink / raw
  To: bharata
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Nikhil Rao

On Thu, 2011-02-24 at 21:15 +0530, Bharata B Rao wrote:
> While I admit that our load balancing semantics wrt thorttled entities are
> not consistent (we don't allow pulling of tasks directly from throttled
> cfs_rqs, while allow pulling of tasks from a throttled hierarchy as in the
> above case), I am beginning to think if it works out to be advantageous.
> Is there a chance that the task gets to run on other CPU where the hierarchy
> isn't throttled since runtime is still available ? 

Possible yes, but the load-balancer doesn't know about that, nor should
it (it's complicated, and broken, enough, no need to add more cruft to
it).

I'm starting to think you all should just toss all this and start over,
it's just too smelly.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-24 15:52           ` Peter Zijlstra
@ 2011-02-24 16:39             ` Bharata B Rao
  2011-02-24 17:20               ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Bharata B Rao @ 2011-02-24 16:39 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Nikhil Rao

On Thu, Feb 24, 2011 at 04:52:53PM +0100, Peter Zijlstra wrote:
> On Thu, 2011-02-24 at 21:15 +0530, Bharata B Rao wrote:
> > While I admit that our load balancing semantics wrt thorttled entities are
> > not consistent (we don't allow pulling of tasks directly from throttled
> > cfs_rqs, while allow pulling of tasks from a throttled hierarchy as in the
> > above case), I am beginning to think if it works out to be advantageous.
> > Is there a chance that the task gets to run on other CPU where the hierarchy
> > isn't throttled since runtime is still available ? 
> 
> Possible yes, but the load-balancer doesn't know about that, not should
> it (its complicated, and broken, enough, no need to add more cruft to
> it).
> 
> I'm starting to think you all should just toss all this and start over,
> its just too smelly.

Hmm... You have brought up 3 concerns:

1. Hierarchy semantics

If you look at the hierarchy semantics we currently have while ignoring the
load balancer interactions for a moment, I guess what we have is a reasonable
one.

- Only group entities are throttled
- Throttled entities are taken off the runqueue and hence they never
  get picked up for scheduling.
- New or child entities are queued up to the throttled entities and not
  further up. As I said in another thread, having the tree intact and correct
  underneath the throttled entity allows us to rebuild the hierarchy during
  unthrottling with the least amount of effort.
- Group entities in a hierarchy are throttled independently of each other based
  on their bandwidth specification.

2. Handling of throttled entities by load balancer

This definitely needs to improve and be more consistent. We can work on this.

3. per-cgroup vs global period specification

I thought per-cgroup specification would be most flexible and hence started
out with that. This would allow groups/workloads/VMs to define their
own bandwidth rate.

Let us know if you have other design concerns besides these.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-24 16:39             ` Bharata B Rao
@ 2011-02-24 17:20               ` Peter Zijlstra
  2011-02-25  3:59                 ` Paul Turner
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-24 17:20 UTC (permalink / raw
  To: bharata
  Cc: Paul Turner, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Nikhil Rao

On Thu, 2011-02-24 at 22:09 +0530, Bharata B Rao wrote:
> On Thu, Feb 24, 2011 at 04:52:53PM +0100, Peter Zijlstra wrote:
> > On Thu, 2011-02-24 at 21:15 +0530, Bharata B Rao wrote:
> > > While I admit that our load balancing semantics wrt thorttled entities are
> > > not consistent (we don't allow pulling of tasks directly from throttled
> > > cfs_rqs, while allow pulling of tasks from a throttled hierarchy as in the
> > > above case), I am beginning to think if it works out to be advantageous.
> > > Is there a chance that the task gets to run on other CPU where the hierarchy
> > > isn't throttled since runtime is still available ? 
> > 
> > Possible yes, but the load-balancer doesn't know about that, not should
> > it (its complicated, and broken, enough, no need to add more cruft to
> > it).
> > 
> > I'm starting to think you all should just toss all this and start over,
> > its just too smelly.
> 
> Hmm... You have brought up 3 concerns:
> 
> 1. Hierarchy semantics
> 
> If you look at the heirarchy semantics we currently have while ignoring the
> load balancer interactions for a moment, I guess what we have is a reasonable
> one.
> 
> - Only group entities are throttled
> - Throttled entities are taken off the runqueue and hence they never
>   get picked up for scheduling.
> - New or child entites are queued up to the throttled entities and not
>   further up. As I said in another thread, having the tree intact and correct
>   underneath the throttled entity allows us to rebuild the hierarchy during
>   unthrottling with least amount of effort.

It also gets you into all that load-balancer mess, and I'm not going to
let you off lightly there.

> - Group entities in a hierarchy are throttled independent of each other based
>   on their bandwidth specification.

That's missing out quite a few details.. for one there is no mention of
the hierarchical implications of/constraints on bandwidth; can children have
more bandwidth than their parent (I hope not)?

> 2. Handling of throttled entities by load balancer
> 
> This definetely needs to improve and be more consistent. We can work on this.

Feh, improve is being nice about it, it needs a complete overhaul, the
current situation is a cobbled together leaky mess.

> 3. per-cgroup vs global period specification
> 
> I thought per-cgroup specification would be most flexible and hence started
> out with that. This would allow groups/workloads/VMs to define their
> own bandwidth rate.

Most flexible yes, most 'interesting' too. Now if you consider that running a
child task is also running the parent entity, and therefore you're
consuming bandwidth up the entire hierarchy, what happens when the
parent has a much larger period than the child?

In that case your child doesn't get run while the parent is throttled,
and the child's period is violated.
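
With toy numbers (made up for illustration), say the parent has
period=1000ms, quota=100ms and the child period=10ms, quota=5ms; once the
parent throttles, the child is starved for the rest of the parent's period
no matter what its own settings say:

#include <stdio.h>

int main(void)
{
	double parent_period = 1000.0, parent_quota = 100.0;	/* ms */
	double child_period = 10.0;				/* ms */

	/* parent burns its quota and throttles for the remainder of its period */
	double parent_throttled = parent_period - parent_quota;	/* 900 ms */

	/* child periods that elapse with the child unable to run at all */
	double missed = parent_throttled / child_period;	/* 90 periods */

	printf("child misses ~%.0f of its own periods per parent period\n", missed);
	return 0;
}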


> Let us know if you have other design concerns besides these.

Yeah, that weird time accounting muck: bandwidth should decrease on
usage and be incremented on replenishment; this gets you 0 as the natural
boundary between credit and debt, no need to keep two variables.
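
Roughly, for illustration (a toy userspace model of that single-variable
scheme; names invented):

#include <stdbool.h>
#include <stdint.h>

struct toy_pool {
	int64_t runtime;	/* ns left; zero or negative means debt */
};

/* usage path: just decrement */
static void charge(struct toy_pool *p, int64_t delta_exec)
{
	p->runtime -= delta_exec;
}

static bool exhausted(const struct toy_pool *p)
{
	return p->runtime <= 0;
}

/* replenishment path: top up, implicitly paying off any accumulated debt */
static void replenish(struct toy_pool *p, int64_t amount)
{
	p->runtime += amount;
}

int main(void)
{
	struct toy_pool p = { .runtime = 5 };

	charge(&p, 8);		/* now -3: in debt, throttle */
	replenish(&p, 10);	/* now 7: debt paid off, 7 left */
	return exhausted(&p) ? 1 : 0;
}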

Also, the above just about covers all the patch set does, isn't that
enough justification to throw the thing out and start over?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-23 13:32   ` Peter Zijlstra
  2011-02-24  5:21     ` Bharata B Rao
@ 2011-02-25  3:10     ` Paul Turner
  2011-02-25 13:58       ` Bharata B Rao
  2011-02-28 13:48       ` Peter Zijlstra
  1 sibling, 2 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-25  3:10 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>
>> +static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>> +{
>> +     return cfs_rq->throttled;
>> +}
>> +
>> +/* it's possible to be 'on_rq' in a dequeued (e.g. throttled) hierarchy */
>> +static inline int entity_on_rq(struct sched_entity *se)
>> +{
>> +     for_each_sched_entity(se)
>> +             if (!se->on_rq)
>> +                     return 0;
>
> Please add block braces over multi line stmts even if not strictly
> needed.
>

Done

>> +
>> +     return 1;
>> +}
>
>
>> @@ -761,7 +788,11 @@ static void update_cfs_load(struct cfs_r
>>       u64 now, delta;
>>       unsigned long load = cfs_rq->load.weight;
>>
>> -     if (cfs_rq->tg == &root_task_group)
>> +     /*
>> +      * Don't maintain averages for the root task group, or while we are
>> +      * throttled.
>> +      */
>> +     if (cfs_rq->tg == &root_task_group || cfs_rq_throttled(cfs_rq))
>>               return;
>>
>>       now = rq_of(cfs_rq)->clock_task;
>
> Placing the return there avoids updating the timestamps, so once we get
> unthrottled we'll observe a very long period and skew the load avg?
>

It's easier to avoid this by fixing up the load average on unthrottle,
since there's no point in moving up the intermediate timestamps on
each throttled update.

The one "gotcha" in either case is that it's possible for time to
drift on the child of a throttled group and I don't see an easy way
around this.

> Ideally we'd never call this on throttled groups to begin with and
> handle them like full dequeue/enqueue like things.
>

This is what is attempted -- however there are still actions, such as
wakeup, which may occur against throttled groups regardless of their
queue state.

In this case we still need to preserve the correct child hierarchy
state so that it can be re-enqueued when bandwidth is again available.

>> @@ -1015,6 +1046,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>>        * Update run-time statistics of the 'current'.
>>        */
>>       update_curr(cfs_rq);
>> +
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
>> +          !group_cfs_rq(se)->nr_running))
>> +             return;
>> +#endif
>> +
>>       update_cfs_load(cfs_rq, 0);
>>       account_entity_enqueue(cfs_rq, se);
>>       update_cfs_shares(cfs_rq);
>> @@ -1087,6 +1126,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>>        */
>>       update_curr(cfs_rq);
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     if (!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)))
>> +             return;
>> +#endif
>> +
>>       update_stats_dequeue(cfs_rq, se);
>>       if (flags & DEQUEUE_SLEEP) {
>>  #ifdef CONFIG_SCHEDSTATS
>
> These make me very nervous, on enqueue you bail after adding
> min_vruntime to ->vruntime and calling update_curr(), but on dequeue you
> bail before subtracting min_vruntime from ->vruntime.
>

min_vruntime shouldn't be added in enqueue since unthrottling is
treated as a wakeup (which results in placement versus min as opposed
to normalization).

>> @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct
>>                       break;
>>               cfs_rq = cfs_rq_of(se);
>>               enqueue_entity(cfs_rq, se, flags);
>> +             /* don't continue to enqueue if our parent is throttled */
>> +             if (cfs_rq_throttled(cfs_rq))
>> +                     break;
>>               flags = ENQUEUE_WAKEUP;
>>       }
>>
>> @@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq
>>               cfs_rq = cfs_rq_of(se);
>>               dequeue_entity(cfs_rq, se, flags);
>>
>> -             /* Don't dequeue parent if it has other entities besides us */
>> -             if (cfs_rq->load.weight)
>> +             /*
>> +              * Don't dequeue parent if it has other entities besides us,
>> +              * or if it is throttled
>> +              */
>> +             if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
>>                       break;
>>               flags |= DEQUEUE_SLEEP;
>>       }
>
> How could we even be running if our parent was throttled?
>

It's possible we throttled within the preceding dequeue_entity -- the
partial update_curr against cfs_rq might be just enough to push it
over the edge.  In which case that entity has already been dequeued
and we want to bail out.

>> @@ -1430,6 +1480,42 @@ static u64 tg_request_cfs_quota(struct t
>>       return delta;
>>  }
>>
>> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> +{
>> +     struct sched_entity *se;
>> +
>> +     se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> +
>> +     /* account load preceeding throttle */
>
> My spell checker thinks that should be written as: preceding.
>

My fat fingers have corrected this typo.

>> +     update_cfs_load(cfs_rq, 0);
>> +
>> +     /* prevent previous buddy nominations from re-picking this se */
>> +     clear_buddies(cfs_rq_of(se), se);
>> +
>> +     /*
>> +      * It's possible for the current task to block and re-wake before task
>> +      * switch, leading to a throttle within enqueue_task->update_curr()
>> +      * versus an an entity that has not technically been enqueued yet.
>
> I'm not quite seeing how this would happen.. care to expand on this?
>

I'm not sure the example Bharata gave is correct -- I'm going to treat
that discussion separately as it's not the intent here.

Here the task _is_ running.

Specifically:

- Suppose the current task on a cfs_rq blocks
- Accordingly we issue dequeue against that task (however it remains
as curr until the put)
- Before we get to the put some other activity (e.g. network bottom
half) gets to run and re-wake the task
- The time elapsed for this is charged to the task, which might push
it over its reservation; it then gets throttled while we're trying to
enqueue it

BUT

We haven't actually done any of the enqueue work yet, so there's
nothing to do to take it off the rq.  So we just mark it throttled
and make sure that the rest of the enqueue work gets short-circuited.

The clock_task helps reduce the occurrence of this, since the task will
be spared the majority of the SI (softirq) time, but it's still possible
to push it over.


>> +      * In this case, since we haven't actually done the enqueue yet, cut
>> +      * out and allow enqueue_entity() to short-circuit
>> +      */
>> +     if (!se->on_rq)
>> +             goto out_throttled;
>> +
>> +     for_each_sched_entity(se) {
>> +             struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +
>> +             dequeue_entity(cfs_rq, se, 1);
>> +             if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
>> +                     break;
>> +     }
>> +
>> +out_throttled:
>> +     cfs_rq->throttled = 1;
>> +     update_cfs_rq_load_contribution(cfs_rq, 1);
>> +}
>> +
>>  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>>               unsigned long delta_exec)
>>  {
>> @@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct
>>
>>       cfs_rq->quota_used += delta_exec;
>>
>> -     if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> +     if (cfs_rq_throttled(cfs_rq) ||
>> +             cfs_rq->quota_used < cfs_rq->quota_assigned)
>>               return;
>
> So we are throttled but running anyway, I suppose this comes from the PI
> ceiling muck?
>

No -- this is just the fact that there are cases where the reschedule
can't evict the task immediately,

e.g. softirq, or any kernel time without CONFIG_PREEMPT.

Once we're throttled we know there's no time left, and no point in
trying to acquire more, so we just short-circuit these updates until we
get to a point where the task can be removed from the rq.


>>       cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
>> +
>> +     if (cfs_rq->quota_used >= cfs_rq->quota_assigned) {
>> +             throttle_cfs_rq(cfs_rq);
>> +             resched_task(cfs_rq->rq->curr);
>> +     }
>>  }
>>
>>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>> @@ -1941,6 +2033,12 @@ static void check_preempt_wakeup(struct
>>       if (unlikely(se == pse))
>>               return;
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     /* avoid pre-emption check/buddy nomination for throttled tasks */
>
> Somehow my spell checker doesn't like that hyphen.
>

Fixed

>> +     if (!entity_on_rq(pse))
>> +             return;
>> +#endif
>
> Ideally that #ifdef'ery would go away too.

This can 100% go away (the test is already inside the #ifdefs); it will
always be true in the !BANDWIDTH case, so dropping the #ifdef only adds
a micro-overhead there.  The accompanying micro-optimization isn't
really needed :)

>
>>       if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK))
>>               set_next_buddy(pse);
>>
>> @@ -2060,7 +2158,8 @@ static bool yield_to_task_fair(struct rq
>>  {
>>       struct sched_entity *se = &p->se;
>>
>> -     if (!se->on_rq)
>> +     /* ensure entire hierarchy is on rq (e.g. running & not throttled) */
>> +     if (!entity_on_rq(se))
>>               return false;
>
> like here..
>
>>       /* Tell the scheduler that we'd really like pse to run next. */
>> @@ -2280,7 +2379,8 @@ static void update_shares(int cpu)
>>
>>       rcu_read_lock();
>>       for_each_leaf_cfs_rq(rq, cfs_rq)
>> -             update_shares_cpu(cfs_rq->tg, cpu);
>> +             if (!cfs_rq_throttled(cfs_rq))
>> +                     update_shares_cpu(cfs_rq->tg, cpu);
>
> This wants extra braces
>

Fixed

>>       rcu_read_unlock();
>>  }
>>
>> @@ -2304,9 +2404,10 @@ load_balance_fair(struct rq *this_rq, in
>>               u64 rem_load, moved_load;
>>
>>               /*
>> -              * empty group
>> +              * empty group or throttled cfs_rq
>>                */
>> -             if (!busiest_cfs_rq->task_weight)
>> +             if (!busiest_cfs_rq->task_weight ||
>> +                             cfs_rq_throttled(busiest_cfs_rq))
>>                       continue;
>>
>>               rem_load = (u64)rem_load_move * busiest_weight;
>>
>>
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking
  2011-02-23 13:32   ` Peter Zijlstra
@ 2011-02-25  3:11     ` Paul Turner
  2011-02-25 20:53     ` Paul Turner
  1 sibling, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-25  3:11 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>
>> @@ -245,6 +248,15 @@ struct cfs_rq;
>>
>>  static LIST_HEAD(task_groups);
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +struct cfs_bandwidth {
>> +     raw_spinlock_t          lock;
>> +     ktime_t                 period;
>> +     u64                     runtime, quota;
>> +     struct hrtimer          period_timer;
>> +};
>> +#endif
>
> If you write that as:
>
> struct cfs_bandwidth {
> #ifdef CONFIG_CFS_BANDWIDTH
>        ...
> #endif
> };
>

While I prefer (entirely subjectively) making the #ifdefs in cfs_rq
explicit, I have no real objection, and this lets us kill the #ifdefs
around init_cfs_bandwidth() (since it does reference the member).

Done.

>>  /* task group related information */
>>  struct task_group {
>>       struct cgroup_subsys_state css;
>> @@ -276,6 +288,10 @@ struct task_group {
>>  #ifdef CONFIG_SCHED_AUTOGROUP
>>       struct autogroup *autogroup;
>>  #endif
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     struct cfs_bandwidth cfs_bandwidth;
>> +#endif
>>  };
>
> You can avoid the #ifdef'ery here
>

Done

>>  /* task_group_lock serializes the addition/removal of task groups */
>> @@ -370,9 +386,76 @@ struct cfs_rq {
>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
>> +
>> +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> +{
>> +     struct cfs_bandwidth *cfs_b =
>> +             container_of(timer, struct cfs_bandwidth, period_timer);
>> +     ktime_t now;
>> +     int overrun;
>> +     int idle = 0;
>> +
>> +     for (;;) {
>> +             now = hrtimer_cb_get_time(timer);
>> +             overrun = hrtimer_forward(timer, now, cfs_b->period);
>> +
>> +             if (!overrun)
>> +                     break;
>> +
>> +             idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> +     }
>> +
>> +     return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
>> +}
>> +
>> +static
>> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
>> +{
>> +     raw_spin_lock_init(&cfs_b->lock);
>> +     cfs_b->quota = cfs_b->runtime = quota;
>> +     cfs_b->period = ns_to_ktime(period);
>> +
>> +     hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +     cfs_b->period_timer.function = sched_cfs_period_timer;
>> +}
>> +
>> +static
>> +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
>> +{
>> +     cfs_rq->quota_used = 0;
>> +     if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
>> +             cfs_rq->quota_assigned = RUNTIME_INF;
>> +     else
>> +             cfs_rq->quota_assigned = 0;
>> +}
>> +
>> +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> +{
>> +     if (cfs_b->quota == RUNTIME_INF)
>> +             return;
>> +
>> +     if (hrtimer_active(&cfs_b->period_timer))
>> +             return;
>> +
>> +     raw_spin_lock(&cfs_b->lock);
>> +     start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
>> +     raw_spin_unlock(&cfs_b->lock);
>> +}
>> +
>> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> +{
>> +     hrtimer_cancel(&cfs_b->period_timer);
>> +}
>> +#endif
>
> and #else
>
> stubs
> #endif
>
>>  /* Real-Time classes' related field in a runqueue: */
>>  struct rt_rq {
>>       struct rt_prio_array active;
>> @@ -8038,6 +8121,9 @@ static void init_tg_cfs_entry(struct tas
>>       tg->cfs_rq[cpu] = cfs_rq;
>>       init_cfs_rq(cfs_rq, rq);
>>       cfs_rq->tg = tg;
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     init_cfs_rq_quota(cfs_rq);
>> +#endif
>
> also avoids #ifdef'ery here
>

Done

>>       tg->se[cpu] = se;
>>       /* se could be NULL for root_task_group */
>> @@ -8173,6 +8259,10 @@ void __init sched_init(void)
>>                * We achieve this by letting root_task_group's tasks sit
>>                * directly in rq->cfs (i.e root_task_group->se[] = NULL).
>>                */
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +             init_cfs_bandwidth(&root_task_group.cfs_bandwidth,
>> +                             RUNTIME_INF, sched_cfs_bandwidth_period);
>> +#endif
>
> and here
>

Done

>>               init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>
>> @@ -8415,6 +8505,10 @@ static void free_fair_sched_group(struct
>>  {
>>       int i;
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     destroy_cfs_bandwidth(&tg->cfs_bandwidth);
>> +#endif
>
> and here
>

Done

>>       for_each_possible_cpu(i) {
>>               if (tg->cfs_rq)
>>                       kfree(tg->cfs_rq[i]);
>> @@ -8442,7 +8536,10 @@ int alloc_fair_sched_group(struct task_g
>>               goto err;
>>
>>       tg->shares = NICE_0_LOAD;
>> -
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
>> +                     sched_cfs_bandwidth_period);
>> +#endif
>
> and here
>

Done

>>       for_each_possible_cpu(i) {
>>               rq = cpu_rq(i);
>>
>
>> @@ -9107,6 +9204,116 @@ static u64 cpu_shares_read_u64(struct cg
>>
>>       return (u64) tg->shares;
>>  }
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> +{
>> +     int i;
>> +     static DEFINE_MUTEX(mutex);
>> +
>> +     if (tg == &root_task_group)
>> +             return -EINVAL;
>> +
>> +     if (!period)
>> +             return -EINVAL;
>> +
>> +     /*
>> +      * Ensure we have at least one tick of bandwidth every period.  This is
>> +      * to prevent reaching a state of large arrears when throttled via
>> +      * entity_tick() resulting in prolonged exit starvation.
>> +      */
>> +     if (NS_TO_JIFFIES(quota) < 1)
>> +             return -EINVAL;
>> +
>> +     mutex_lock(&mutex);
>> +     raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
>> +     tg->cfs_bandwidth.period = ns_to_ktime(period);
>> +     tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
>> +     raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
>> +
>> +     for_each_possible_cpu(i) {
>> +             struct cfs_rq *cfs_rq = tg->cfs_rq[i];
>> +             struct rq *rq = rq_of(cfs_rq);
>> +
>> +             raw_spin_lock_irq(&rq->lock);
>> +             init_cfs_rq_quota(cfs_rq);
>> +             raw_spin_unlock_irq(&rq->lock);
>
> Any particular reason you didn't mirror rt_rq->rt_runtime_lock?
>
>> +     }
>> +     mutex_unlock(&mutex);
>> +
>> +     return 0;
>> +}
>
>
>> Index: tip/kernel/sched_fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched_fair.c
>> +++ tip/kernel/sched_fair.c
>> @@ -88,6 +88,15 @@ const_debug unsigned int sysctl_sched_mi
>>   */
>>  unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>>
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +/*
>> + * default period for cfs group bandwidth.
>> + * default: 0.5s, units: nanoseconds
>> + */
>> +static u64 sched_cfs_bandwidth_period = 500000000ULL;
>> +#endif
>> +
>>  static const struct sched_class fair_sched_class;
>>
>>  /**************************************************************
>> @@ -397,6 +406,9 @@ static void __enqueue_entity(struct cfs_
>>
>>       rb_link_node(&se->run_node, parent, link);
>>       rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
>> +#endif
>>  }
>
> This really needs to life elsewhere, __*_entity() functions are for
> rb-tree muck.
>

Moved to enqueue_entity

>>  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-23 13:32   ` Peter Zijlstra
@ 2011-02-25  3:25     ` Paul Turner
  2011-02-25 12:17       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-02-25  3:25 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>
>> @@ -1846,6 +1846,11 @@ static const struct sched_class rt_sched
>>
>>  #include "sched_stats.h"
>>
>> +static void mod_nr_running(struct rq *rq, long delta)
>> +{
>> +     rq->nr_running += delta;
>> +}
>
> I personally don't see much use in such trivial wrappers.. if you're
> going to rework all the nr_running stuff you might as well remove all of
> that.

I'm ok with that; I was trying to preserve some of the existing
encapsulation, but there isn't a need for it.

Will do.

>
>> Index: tip/kernel/sched_fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched_fair.c
>> +++ tip/kernel/sched_fair.c
>> @@ -81,6 +81,8 @@ unsigned int normalized_sysctl_sched_wak
>>
>>  const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
>>
>> +static void account_hier_tasks(struct sched_entity *se, int delta);
>> +
>>  /*
>>   * The exponential sliding  window over which load is averaged for shares
>>   * distribution.
>> @@ -933,6 +935,40 @@ static inline void update_entity_shares_
>>  }
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +/* maintain hierarchal task counts on group entities */
>> +static void account_hier_tasks(struct sched_entity *se, int delta)
>> +{
>> +     struct rq *rq = rq_of(cfs_rq_of(se));
>> +     struct cfs_rq *cfs_rq;
>> +
>> +     for_each_sched_entity(se) {
>> +             /* a throttled entity cannot affect its parent hierarchy */
>> +             if (group_cfs_rq(se) && cfs_rq_throttled(group_cfs_rq(se)))
>> +                     break;
>> +
>> +             /* we affect our queuing entity */
>> +             cfs_rq = cfs_rq_of(se);
>> +             cfs_rq->h_nr_tasks += delta;
>> +     }
>> +
>> +     /* account for global nr_running delta to hierarchy change */
>> +     if (!se)
>> +             mod_nr_running(rq, delta);
>> +}
>> +#else
>> +/*
>> + * In the absence of group throttling, all operations are guaranteed to be
>> + * globally visible at the root rq level.
>> + */
>> +static void account_hier_tasks(struct sched_entity *se, int delta)
>> +{
>> +     struct rq *rq = rq_of(cfs_rq_of(se));
>> +
>> +     mod_nr_running(rq, delta);
>> +}
>> +#endif
>
> While Balbir suggested expanding the _hier_ thing, I'd suggest to simply
> drop it altogether, way too much typing ;-), but that is if you cannot
> get rid of the extra hierarchy iteration, see below.
>
>> +
>>  static void enqueue_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>  {
>>  #ifdef CONFIG_SCHEDSTATS
>> @@ -1428,6 +1464,7 @@ enqueue_task_fair(struct rq *rq, struct
>>               update_cfs_shares(cfs_rq);
>>       }
>>
>> +     account_hier_tasks(&p->se, 1);
>>       hrtick_update(rq);
>>  }
>>
>> @@ -1461,6 +1498,7 @@ static void dequeue_task_fair(struct rq
>>               update_cfs_shares(cfs_rq);
>>       }
>>
>> +     account_hier_tasks(&p->se, -1);
>>       hrtick_update(rq);
>>  }
>>
>> @@ -1488,6 +1526,8 @@ static u64 tg_request_cfs_quota(struct t
>>       return delta;
>>  }
>>
>> +static void account_hier_tasks(struct sched_entity *se, int delta);
>> +
>>  static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>>  {
>>       struct sched_entity *se;
>> @@ -1507,6 +1547,7 @@ static void throttle_cfs_rq(struct cfs_r
>>       if (!se->on_rq)
>>               goto out_throttled;
>>
>> +     account_hier_tasks(se, -cfs_rq->h_nr_tasks);
>>       for_each_sched_entity(se) {
>>               struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>
>> @@ -1541,6 +1582,7 @@ static void unthrottle_cfs_rq(struct cfs
>>       cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
>>
>>       cfs_rq->throttled = 0;
>> +     account_hier_tasks(se, cfs_rq->h_nr_tasks);
>>       for_each_sched_entity(se) {
>>               if (se->on_rq)
>>                       break;
>
> All call-sites are right next to a for_each_sched_entity() iteration, is
> there really no way to fold those loops?
>

I was trying to avoid some duplication since many of those for_each
iterations are partial and I didn't want to dump the same loop for the
remnants everywhere.

However, thinking about it, I think this can be cleaned up and refined
by having account_hier iterate from wherever they finish, avoiding that
nastiness.


>> Index: tip/kernel/sched_rt.c
>> ===================================================================
>> --- tip.orig/kernel/sched_rt.c
>> +++ tip/kernel/sched_rt.c
>> @@ -906,6 +906,8 @@ enqueue_task_rt(struct rq *rq, struct ta
>>
>>       if (!task_current(rq, p) && p->rt.nr_cpus_allowed > 1)
>>               enqueue_pushable_task(rq, p);
>> +
>> +     inc_nr_running(rq);
>>  }
>>
>>  static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flags)
>> @@ -916,6 +918,8 @@ static void dequeue_task_rt(struct rq *r
>>       dequeue_rt_entity(rt_se);
>>
>>       dequeue_pushable_task(rq, p);
>> +
>> +     dec_nr_running(rq);
>>  }
>>
>>  /*
>> @@ -1783,4 +1787,3 @@ static void print_rt_stats(struct seq_fi
>>       rcu_read_unlock();
>>  }
>>  #endif /* CONFIG_SCHED_DEBUG */
>
>
> You mentioned something about -rt having the same problem, yet you don't
> fix it.. tskkk :-)
>

Yeah I agree :) -- I was planning on fixing this in a follow-up
[honest!] as it's not as important since it doesn't affect idle
balancing to the same magnitude.  I wanted to first air the notion of
hierarchical task accounting in one of the schedulers, then mirror the
support across the other schedulers once consensus is reached.

I can make that part of this series if you'd prefer.



>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-23 13:32   ` Peter Zijlstra
@ 2011-02-25  3:26     ` Paul Turner
  2011-02-25  8:54       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-02-25  3:26 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>> +       raw_spin_lock(&cfs_b->lock);
>> +       cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
>> +       raw_spin_unlock(&cfs_b->lock);
>
> That seems to put the cost of things on the wrong side. Read is rare,
> update is frequent, and you made the frequent thing the most expensive
> one.

Hum.. the trade-off here is non-trivial I think

- This update is only once per-quota period (*if* we throttled within
that period).  This places the frequency in the 10s-100s of ms range.
- Sampling would probably occur on an order of once a second (assuming
some enterprise management system that cares about these statistics).

If we make the update cheaper by moving this per-cpu then, yes, the
updates are cheaper, but the reads now carrying a per-cpu cost makes the
overall cost about the same (multiplying frequency by the cost of each
access).

We could move the global accrual to an atomic, but this isn't any
cheaper given that this lock shouldn't be
contended.
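
For concreteness, a toy user-space sketch of the two layouts being
weighed (a pthread mutex and a plain array stand in for the kernel's
raw spinlock and per-cpu machinery; none of this is code from the
patch):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 4

/* Layout A: one shared counter, every unthrottle takes the global lock. */
static pthread_mutex_t stat_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t throttled_time_global;

static void unthrottle_account_global(uint64_t delta_ns)
{
    pthread_mutex_lock(&stat_lock);
    throttled_time_global += delta_ns;
    pthread_mutex_unlock(&stat_lock);
}

/* Layout B: per-cpu counters, the (rare) reader pays by summing them. */
static uint64_t throttled_time_percpu[NR_CPUS];

static void unthrottle_account_percpu(int cpu, uint64_t delta_ns)
{
    throttled_time_percpu[cpu] += delta_ns;    /* no shared lock on update */
}

static uint64_t throttled_time_read_percpu(void)
{
    uint64_t sum = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += throttled_time_percpu[cpu];
    return sum;
}

int main(void)
{
    unthrottle_account_global(1000000);
    unthrottle_account_percpu(0, 1000000);
    printf("global: %llu ns, per-cpu sum: %llu ns\n",
           (unsigned long long)throttled_time_global,
           (unsigned long long)throttled_time_read_percpu());
    return 0;
}

With updates in the 10s-100s of ms range and reads on the order of once
a second, frequency times per-access cost comes out roughly even either
way, which is the point above.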

>
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage
  2011-02-23 13:32   ` Peter Zijlstra
@ 2011-02-25  3:33     ` Paul Turner
  2011-02-25 12:31       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-02-25  3:33 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>
>> @@ -609,6 +631,9 @@ static void update_curr(struct cfs_rq *c
>>               cpuacct_charge(curtask, delta_exec);
>>               account_group_exec_runtime(curtask, delta_exec);
>>       }
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     account_cfs_rq_quota(cfs_rq, delta_exec);
>> +#endif
>>  }
>
> Not too hard to make the #ifdef'ery go away I'd guess.
>

Done

>>  static inline void
>> @@ -1382,6 +1407,43 @@ static void dequeue_task_fair(struct rq
>>  }
>>
>>  #ifdef CONFIG_CFS_BANDWIDTH
>> +static u64 tg_request_cfs_quota(struct task_group *tg)
>> +{
>> +     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
>> +     u64 delta = 0;
>> +
>> +     if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
>> +             raw_spin_lock(&cfs_b->lock);
>> +             /*
>> +              * it's possible a bandwidth update has changed the global
>> +              * pool.
>> +              */
>> +             if (cfs_b->quota == RUNTIME_INF)
>> +                     delta = sched_cfs_bandwidth_slice();
>
> Why do we bother at all when there's infinite time? Shouldn't the action
> that sets it to infinite also make cfs_rq->quota_assinged to to
> RUNTIME_INF, in which case the below check will make it all go away?
>

The bandwidth update might be racing with an entity's request for
bandwidth.

e.g. someone updates cpu.cfs_quota_us to infinite while there's
bandwidth distribution in flight.

In this case we need to return some sane value so that the thread
requesting bandwidth can complete that operation (releasing the lock
which will then be taken to set quota_assigned to INF).

But more importantly we don't want to decrement the value doled out
FROM cfs_b->runtime since that would change it from the magic
RUNTIME_INF.  That's why the check exists.


>> +             else {
>> +                     delta = min(cfs_b->runtime,
>> +                                     sched_cfs_bandwidth_slice());
>> +                     cfs_b->runtime -= delta;
>> +             }
>> +             raw_spin_unlock(&cfs_b->lock);
>> +     }
>> +     return delta;
>> +}
>
> Also, shouldn't this all try and steal time from other cpus when the
> global limit ran out? Suppose you have say 24 cpus, and you had a short
> burst where all 24 cpus had some runtime, so you distribute 240ms. But
> 23 of those cpus only ran for 0.5ms, leaving 23.5ms of unused time on 23
> cpus while your one active cpu will then throttle.
>

In practice this only affects the first period since that slightly
stale bandwidth is then available on those other 23 cpus the next time
a micro-burst occurs.  In testing this has resulted in very stable
performance and "smooth" perturbations that resemble hardcapping by
affinity (for integer points).

The notion of stealing could certainly be introduced, the juncture of
reaching the zero point here would be a possible place to consider
that (although we would need to do a steal that avoids the asymptotic
convergence problem of RT).

I think returning (most) of the bandwidth to the global pool on
(voluntary) dequeue is a more scalable solution

> I would much rather see all the accounting tight first and optimize
> later than start with leaky stuff and try and plug holes later.
>

The complexity this introduces is non-trivial since the idea of
returning quota to the pool means you need to introduce the notion of
when that quota came to life (otherwise you get leaks in the reverse
direction!) -- as well as reversing some of the lock ordering.

While quota generations would address this, they don't greatly increase
the efficacy, and I think there is value in performing the detailed
review we are doing now in isolation of that.
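
For reference, a toy user-space sketch of the "when that quota came to
life" idea -- i.e. generations -- purely illustrative, not code from
any posting:

#include <stdio.h>

struct pool {
    long long runtime;      /* us left in the global pool this period */
    unsigned long gen;      /* bumped on every period refill */
};

struct local {
    long long runtime;      /* us cached locally (per-cpu) */
    unsigned long gen;      /* generation it was pulled under */
};

static void refill(struct pool *p, long long quota)
{
    p->runtime = quota;
    p->gen++;               /* older cached runtime is now stale */
}

static void pull(struct pool *p, struct local *l, long long slice)
{
    if (l->gen != p->gen)
        l->runtime = 0;     /* drop runtime minted in an old period */
    if (slice > p->runtime)
        slice = p->runtime;
    p->runtime -= slice;
    l->runtime += slice;
    l->gen = p->gen;
}

static void give_back(struct pool *p, struct local *l)
{
    /* only return runtime minted in the current period */
    if (l->gen == p->gen)
        p->runtime += l->runtime;
    l->runtime = 0;
}

int main(void)
{
    struct pool p = { 0, 0 };
    struct local l = { 0, 0 };

    refill(&p, 100000);     /* period 1: pool = 100ms */
    pull(&p, &l, 10000);    /* cpu caches a 10ms slice */
    refill(&p, 100000);     /* period 2: cached slice is now stale */
    give_back(&p, &l);      /* discarded rather than inflating the pool */
    printf("pool: %lld us (would be 110000 without the generation check)\n",
           p.runtime);
    return 0;
}

Without the generation check the stale 10ms would be returned into the
new period's pool, which is the reverse-direction leak mentioned above.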

It's also still consistent regarding leakage, in that in any N
consecutive periods the maximum additional quota (by a user abusing
this) that can be received is N+1.  Does the complexity trade-off
merit improving this bound at this point?
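
(Concretely, and purely as an illustrative reading of that bound: with
quota = 100ms per period, at most 100ms of already-distributed but
unused runtime can be carried into any window of N periods; adding the
N refills of 100ms gives at most (N+1) * 100ms of cpu time consumable
in those N periods -- one extra period's worth, which amortizes away as
N grows.)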

>> +
>> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> +             unsigned long delta_exec)
>> +{
>> +     if (cfs_rq->quota_assigned == RUNTIME_INF)
>> +             return;
>> +
>> +     cfs_rq->quota_used += delta_exec;
>> +
>> +     if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> +             return;
>> +
>> +     cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
>> +}
>
> So why isn't this hierarchical?,

It is, naturally (since charging occurs within the existing hierarchical
accounting).

> also all this positive quota stuff
> looks weird, why not decrement and try to supplement when negative?
>

I thought I had a good reason for this.. at least I remember at one
point I think I did.. but I cannot see any need for it in the current
form.  I will revise it to a single decremented quota_remaining.
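
As a toy user-space model of the single-counter form end to end
(constants and names here are illustrative stand-ins, not the actual
repost):

#include <stdio.h>

#define SLICE_US   10000    /* stand-in for sched_cfs_bandwidth_slice */
#define QUOTA_US  100000    /* pool refill per period */

static long long pool_runtime = QUOTA_US;   /* global (per-tg) pool */
static long long local_runtime;             /* per-cfs_rq remaining */
static int throttled;

static long long request_slice(void)
{
    long long delta = pool_runtime < SLICE_US ? pool_runtime : SLICE_US;

    pool_runtime -= delta;
    return delta;
}

static void account(long long delta_exec_us)
{
    local_runtime -= delta_exec_us;
    if (local_runtime > 0)
        return;

    local_runtime += request_slice();
    if (local_runtime <= 0)
        throttled = 1;      /* the hierarchy would be dequeued here */
}

int main(void)
{
    int ticks = 0;

    /* charge 5ms of execution per tick until throttled */
    while (!throttled && ticks < 100) {
        account(5000);
        ticks++;
    }
    printf("throttled after %d ticks (%d ms charged), pool left %lld us\n",
           ticks, ticks * 5, pool_runtime);
    return 0;
}

Execution is charged against the local counter; when it goes
non-positive a slice is pulled from the pool, and throttling triggers
exactly when the pool can no longer cover the deficit.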

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-24 11:05       ` Peter Zijlstra
  2011-02-24 15:45         ` Bharata B Rao
@ 2011-02-25  3:41         ` Paul Turner
  1 sibling, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-25  3:41 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Nikhil Rao

On Thu, Feb 24, 2011 at 3:05 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-02-24 at 10:51 +0530, Bharata B Rao wrote:
>> Hi Peter,
>>
>> I will only answer a couple of your questions and let Paul clarify the rest...
>>
>> On Wed, Feb 23, 2011 at 02:32:13PM +0100, Peter Zijlstra wrote:
>> > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>> >
>> >
>> > > @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct
>> > >                   break;
>> > >           cfs_rq = cfs_rq_of(se);
>> > >           enqueue_entity(cfs_rq, se, flags);
>> > > +         /* don't continue to enqueue if our parent is throttled */
>> > > +         if (cfs_rq_throttled(cfs_rq))
>> > > +                 break;
>> > >           flags = ENQUEUE_WAKEUP;
>> > >   }
>> > >
>> > > @@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq
>> > >           cfs_rq = cfs_rq_of(se);
>> > >           dequeue_entity(cfs_rq, se, flags);
>> > >
>> > > -         /* Don't dequeue parent if it has other entities besides us */
>> > > -         if (cfs_rq->load.weight)
>> > > +         /*
>> > > +          * Don't dequeue parent if it has other entities besides us,
>> > > +          * or if it is throttled
>> > > +          */
>> > > +         if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
>> > >                   break;
>> > >           flags |= DEQUEUE_SLEEP;
>> > >   }
>> >
>> > How could we even be running if our parent was throttled?

The only way this can happen is if we are on our way out.  I don't
think the example given applies to this case.
>>
>> The task isn't running actually. One of its parents up in the heirarchy has
>> been throttled and been already dequeued. Now this task sits on its immediate
>> parent's runqueue which isn't throttled but not really running also since
>> the hierarchy is throttled.
>> In this situation, load balancer can try to pull
>> this task. When that happens, load balancer tries to dequeue it and this
>> check will ensure that we don't attempt to dequeue a group entity in our
>> hierarchy which has already been dequeued.
>
> That's insane, its throttled, that means it should be dequeued and
> should thus invisible for the load-balancer.

I agree.  We ensure this does not happen by making the h_load zero --
something I thought I was doing, but apparently not; will fix in the
repost.

> load-balancer will try and move tasks around to balance load, but all in
> vain, it'll move phantom loads around and get most confused at best.
>

Yeah, this shouldn't happen; I don't think this example is a valid one.

> Pure and utter suckage if you ask me.
>
>> > > @@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct
>> > >
>> > >   cfs_rq->quota_used += delta_exec;
>> > >
>> > > - if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> > > + if (cfs_rq_throttled(cfs_rq) ||
>> > > +         cfs_rq->quota_used < cfs_rq->quota_assigned)
>> > >           return;
>> >
>> > So we are throttled but running anyway, I suppose this comes from the PI
>> > ceiling muck?
>>
>> When a cfs_rq is throttled, its representative se (and all its parent
>> se's) get dequeued and the task is marked for resched. But the task entity is
>> still on its throttled parent's cfs_rq (=> task->se.on_rq = 1). Next during
>> put_prev_task_fair(), we enqueue the task back on its throttled parent's
>> cfs_rq at which time we end up calling update_curr() on throttled cfs_rq.
>> This check would help us bail out from that situation.
>
> But why bother with this early exit? At worst you'll call
> tg_request_cfs_quota() in vain, at best you'll find there is runtime
> because the period tick just happened on another cpu and you're good to
> go, yay!

It's for the non-preemptible case where we could be running for a
non-trivial time after the reschedule().

Considering your second point, I suppose there could be a micro-benefit
in checking whether the period tick did just happen to occur and then
self-unthrottling... but I don't think it's really worth it.


>
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-24 17:20               ` Peter Zijlstra
@ 2011-02-25  3:59                 ` Paul Turner
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-25  3:59 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: bharata, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Srivatsa Vaddagiri, Kamalesh Babulal,
	Ingo Molnar, Pavel Emelyanov, Herbert Poetzl, Avi Kivity,
	Chris Friesen, Nikhil Rao

On Thu, Feb 24, 2011 at 9:20 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-02-24 at 22:09 +0530, Bharata B Rao wrote:
>> On Thu, Feb 24, 2011 at 04:52:53PM +0100, Peter Zijlstra wrote:
>> > On Thu, 2011-02-24 at 21:15 +0530, Bharata B Rao wrote:
>> > > While I admit that our load balancing semantics wrt thorttled entities are
>> > > not consistent (we don't allow pulling of tasks directly from throttled
>> > > cfs_rqs, while allow pulling of tasks from a throttled hierarchy as in the
>> > > above case), I am beginning to think if it works out to be advantageous.
>> > > Is there a chance that the task gets to run on other CPU where the hierarchy
>> > > isn't throttled since runtime is still available ?
>> >
>> > Possible yes, but the load-balancer doesn't know about that, not should
>> > it (its complicated, and broken, enough, no need to add more cruft to
>> > it).
>> >
>> > I'm starting to think you all should just toss all this and start over,
>> > its just too smelly.
>>
>> Hmm... You have brought up 3 concerns:
>>
>> 1. Hierarchy semantics
>>
>> If you look at the heirarchy semantics we currently have while ignoring the
>> load balancer interactions for a moment, I guess what we have is a reasonable
>> one.
>>
>> - Only group entities are throttled
>> - Throttled entities are taken off the runqueue and hence they never
>>   get picked up for scheduling.
>> - New or child entites are queued up to the throttled entities and not
>>   further up. As I said in another thread, having the tree intact and correct
>>   underneath the throttled entity allows us to rebuild the hierarchy during
>>   unthrottling with least amount of effort.
>
> It also gets you into all that load-balancer mess, and I'm not going to
> let you off lightly there.
>

I think the example was a little cuckoo.  As you say, it's dequeued
and invisible to the load balancer.

The special case of block->wakeup->throttle->put only exists for the
current task which is ineligible for non-active load-balance anyway.

>> - Group entities in a hierarchy are throttled independent of each other based
>>   on their bandwidth specification.
>
> That's missing out quite a few details.. for one there is no mention of
> hierarchical implication of/constraints on bandwidth, can children have
> more bandwidth than their parent (I hope not).
>

I wasn't planning to enforce it since I believe there is value in
non-conformant constraints:

Consider:

- I have some application that I want to limit to 3 cpus.  Within it I
have 2 workers; across a period I would like each of those workers to
be able to use a maximum of, say, 2.5 cpus (suppose they serve some
sort of co-processor request per user and we want to prevent a single
user from eating our entire limit and starving out everything else).

The goal in this case is not preventing over-subscription, but
ensuring that some subset of threads is not allowed to blow our entire
quota, while not destroying the (relatively) work-conserving aspect of
its performance in general.

The above occurs sufficiently often that at the very least I think
conformance checking would have to be gated by a sysctl so that this
use case is still enabled.

- There's also the case of "I want to manage a newly abusive user;
being smart, I've given his hierarchy a unique root so that I can
constrain them."
A non-conformant constraint avoids the adversarial problem of having
to find all of their (possibly maliciously large) configured limits and
bring them within the global limit I want to impose upon them.

My viewpoint was that if some idiot wants to set up such a tree
(unintentionally) it's their own damn fault, but I suppose we should at
least give them a safety :)  I'll add it.

>> 2. Handling of throttled entities by load balancer
>>
>> This definetely needs to improve and be more consistent. We can work on this.
>
> Feh, improve is being nice about it, it needs a complete overhaul, the
> current situation is a cobbled together leaky mess.
>

I think as long as the higher level semantics are correct and
throttling happens /sanely/ this is a non-issue.

>> 3. per-cgroup vs global period specification
>>
>> I thought per-cgroup specification would be most flexible and hence started
>> out with that. This would allow groups/workloads/VMs to define their
>> own bandwidth rate.
>
> Most flexible yes, most 'interesting' too, now if you consider running a
> child task is also running the parent entity and therefore you're
> consuming bandwidth up the entire hierarchy, what happens when the
> parent has a much larger period than the child?
>
> In that case your child doesn't get ran while the parent is throttled,
> and the child's period is violated.
>

There are definitely cases where this is both valid and useful.  I
think gating conformance checking (behind a sysctl) allows for both
(especially if it defaults to "on").

>
>> Let us know if you have other design concerns besides these.
>
> Yeah, that weird time accounting muck, bandwidth should decrease on
> usage and incremented on replenishment, this gets you 0 as the natural
> boundary between credit and debt, no need to keep two variables.
>

Yes, agreed!  Fixing :)

> Also, the above just about covers all the patch set does, isn't that
> enough justification to throw the thing out and start over?
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics
  2011-02-25  3:26     ` Paul Turner
@ 2011-02-25  8:54       ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-25  8:54 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Thu, 2011-02-24 at 19:26 -0800, Paul Turner wrote:
> On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> >> +       raw_spin_lock(&cfs_b->lock);
> >> +       cfs_b->throttled_time += (rq->clock - cfs_rq->throttled_timestamp);
> >> +       raw_spin_unlock(&cfs_b->lock);
> >
> > That seems to put the cost of things on the wrong side. Read is rare,
> > update is frequent, and you made the frequent thing the most expensive
> > one.
> 
> Hum.. the trade-off here is non-trivial I think
> 
> - This update is only once per-quota period (*if* we throttled within
> that period).  This places the frequency in the 10s-100s of ms range.
> - Sampling would probably occur on an order of once a second (assuming
> some enterprise management system that cares about these statistics).

Ugh.. people are really polling state like that? If the event is rare,
pushing state is much saner.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
       [not found] ` <20110224161111.7d83a884@jacob-laptop>
@ 2011-02-25 10:03   ` Paul Turner
  2011-02-25 13:06     ` jacob pan
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-02-25 10:03 UTC (permalink / raw
  To: jacob pan
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Arjan van de Ven, Rafael J. Wysocki, Matt Helsley

On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
<jacob.jun.pan@linux.intel.com> wrote:
> On Tue, 15 Feb 2011 19:18:31 -0800
> Paul Turner <pjt@google.com> wrote:
>
>> Hi all,
>>
>> Please find attached v4 of CFS bandwidth control; while this rebase
>> against some of the latest SCHED_NORMAL code is new, the features and
>> methodology are fairly mature at this point and have proved both
>> effective and stable for several workloads.
>>
>> As always, all comments/feedback welcome.
>>
>
> Hi Paul,
>
> Your patches provide a very useful but slightly different feature for
> what we need to manage idle time in order to save power. What we
> need is kind of a quota/period in terms of idle time. I have been
> playing with your patches and noticed that when the cgroup cpu usage
> exceeds the quota the effect of throttling is similar to what I have
> been trying to do with freezer subsystem. i.e. freeze and thaw at given
> period and percentage runtime.
> https://lkml.org/lkml/2011/2/15/314
>
> Have you thought about adding such feature (please see detailed
> description in the link above) to your patches?
>

So reading the description it seems like rooting everything in a
'freezer' container and then setting up a quota of

(1 - frozen_percentage)  * nr_cpus * frozen_period * sec_to_usec

on a period of

frozen_period * sec_to_usec

Would provide the same functionality.  Is there other unduplicated
functionality beyond this?
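
Spelled out, a minimal user-space sketch of that arithmetic (the knob
names come from the freezer proposal, frozen_percentage is taken as a
fraction here, and none of this is code from either patch set):

#include <stdio.h>

int main(void)
{
    double frozen_percentage = 0.90;    /* fraction of each period frozen */
    double frozen_period = 5.0;         /* seconds */
    int nr_cpus = 4;                    /* example machine */
    double sec_to_usec = 1000000.0;

    double cfs_period_us = frozen_period * sec_to_usec;
    double cfs_quota_us = (1 - frozen_percentage) * nr_cpus *
                          frozen_period * sec_to_usec;

    /* e.g. period 5000000us, quota 2000000us for 4 cpus at 90% "frozen" */
    printf("cpu.cfs_period_us = %.0f\n", cfs_period_us);
    printf("cpu.cfs_quota_us  = %.0f\n", cfs_quota_us);
    return 0;
}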

One thing that does seem undesirable about your approach is (as it
seems to be described) threads will not be able to take advantage of
naturally occurring idle cycles and will incur a potential performance
penalty even at use << frozen_percentage.

e.g. From your post

       |  |<-- 90% frozen -     ->|  |                               |  |
____|  |________________x_|  |__________________|  |_____

        |<---- 5 seconds     ---->|


Suppose no threads active until the wake up at x, suppose there is an
accompanying 1 second of work for that thread to do.  That execution
time will be dilated to ~1.5 seconds (as it will span the 0.5 seconds
the freezer will stall for).  But the true usage for this period is
~20% <<< 90%

> Thanks,
>
> Jacob
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER
  2011-02-25  3:25     ` Paul Turner
@ 2011-02-25 12:17       ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-25 12:17 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen

On Thu, 2011-02-24 at 19:25 -0800, Paul Turner wrote:
> 
> Yeah I agree :) -- I was planning on fixing this in a follow-up
> [honest!] as it's not as important since it doesn't affect idle
> balancing to the same magnitude.  I wanted to first air the notion of
> hierarchical task accounting in one of the schedulers, then mirror the
> support across the other schedulers once consensus is reached.

Right, I think I once argued against this, but the effect on a lot of
things is against me, so let's do this.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage
  2011-02-25  3:33     ` Paul Turner
@ 2011-02-25 12:31       ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-25 12:31 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Thu, 2011-02-24 at 19:33 -0800, Paul Turner wrote:

> >>  #ifdef CONFIG_CFS_BANDWIDTH
> >> +static u64 tg_request_cfs_quota(struct task_group *tg)
> >> +{
> >> +     struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> >> +     u64 delta = 0;
> >> +
> >> +     if (cfs_b->runtime > 0 || cfs_b->quota == RUNTIME_INF) {
> >> +             raw_spin_lock(&cfs_b->lock);
> >> +             /*
> >> +              * it's possible a bandwidth update has changed the global
> >> +              * pool.
> >> +              */
> >> +             if (cfs_b->quota == RUNTIME_INF)
> >> +                     delta = sched_cfs_bandwidth_slice();
> >
> > Why do we bother at all when there's infinite time? Shouldn't the action
> > that sets it to infinite also make cfs_rq->quota_assinged to to
> > RUNTIME_INF, in which case the below check will make it all go away?

> But more importantly we don't want to decrement the value doled out
> FROM cfs_b->runtime since that would change it from the magic
> RUNTIME_INF.  That's why the check exists.

Ah, quite so.

> >> +             else {
> >> +                     delta = min(cfs_b->runtime,
> >> +                                     sched_cfs_bandwidth_slice());
> >> +                     cfs_b->runtime -= delta;
> >> +             }
> >> +             raw_spin_unlock(&cfs_b->lock);
> >> +     }
> >> +     return delta;
> >> +}
> >
> > Also, shouldn't this all try and steal time from other cpus when the
> > global limit ran out? Suppose you have say 24 cpus, and you had a short
> > burst where all 24 cpus had some runtime, so you distribute 240ms. But
> > 23 of those cpus only ran for 0.5ms, leaving 23.5ms of unused time on 23
> > cpus while your one active cpu will then throttle.
> >
> 
> In practice this only affects the first period since that slightly
> stale bandwidth is then available on those other 23 cpus the next time
> a micro-burst occurs.  In testing this has resulted in very stable
> performance and "smooth" perturbations that resemble hardcapping by
> affinity (for integer points).
> 
> The notion of stealing could certainly be introduced, the juncture of
> reaching the zero point here would be a possible place to consider
> that (although we would need to do a steal that avoids the asymptotic
> convergence problem of RT).
> 
> I think returning (most) of the bandwidth to the global pool on
> (voluntary) dequeue is a more scalable solution
> 
> > I would much rather see all the accounting tight first and optimize
> > later than start with leaky stuff and try and plug holes later.
> >
> 
> The complexity this introduces is non-trivial since the idea of
> returning quota to the pool means you need to introduce the notion of
> when that quota came to life (otherwise you get leaks in the reverse
> direction!) -- as well as reversing some of the lock ordering.

Right, nasty that, RT doesn't suffer this because of the lack of
over-commit.

> While quota generations would address this, they don't greatly increase
> the efficacy, and I think there is value in performing the detailed
> review we are doing now in isolation of that.
> 
> It's also still consistent regarding leakage, in that in any N
> consecutive periods the maximum additional quota (by a user abusing
> this) that can be received is N+1.  Does the complexity trade-off
> merit improving this bound at this point?

Well, something, yes; with N being potentially very large indeed these
days, we need some feedback.

One idea would be to keep a cpu mask in the bandwidth structure, setting
a cpu's bit when it claims bandwidth from the global pool, and iterating
over and clearing the complete mask on the period tick.

That also limits the scope of where to look for stealing time.
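
Something like this toy sketch, with user-space stand-ins for the
cpumask helpers; entirely illustrative:

#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 24

static uint32_t claimed_mask;       /* bit set => cpu holds pool runtime */

static void note_claim(int cpu)
{
    claimed_mask |= 1u << cpu;
}

static void period_tick(void)
{
    /* visit only cpus that actually took runtime this period */
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(claimed_mask & (1u << cpu)))
            continue;
        printf("cpu%d: candidate for reclaim/steal\n", cpu);
    }
    claimed_mask = 0;               /* clear the whole mask each tick */
}

int main(void)
{
    note_claim(0);
    note_claim(17);
    period_tick();
    return 0;
}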

> >> +static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> >> +             unsigned long delta_exec)
> >> +{
> >> +     if (cfs_rq->quota_assigned == RUNTIME_INF)
> >> +             return;
> >> +
> >> +     cfs_rq->quota_used += delta_exec;
> >> +
> >> +     if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> >> +             return;
> >> +
> >> +     cfs_rq->quota_assigned += tg_request_cfs_quota(cfs_rq->tg);
> >> +}
> >
> > So why isn't this hierarchical?,
> 
> It is, naturally (since charging occurs within the existing hierarchical
> accounting).

D'0h yes.. somehow I totally missed that.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-25 10:03   ` Paul Turner
@ 2011-02-25 13:06     ` jacob pan
  2011-03-08  3:57       ` Balbir Singh
  2011-03-09 10:12       ` Paul Turner
  0 siblings, 2 replies; 71+ messages in thread
From: jacob pan @ 2011-02-25 13:06 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Arjan van de Ven, Rafael J. Wysocki, Matt Helsley

On Fri, 25 Feb 2011 02:03:54 -0800
Paul Turner <pjt@google.com> wrote:

> On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
> <jacob.jun.pan@linux.intel.com> wrote:
> > On Tue, 15 Feb 2011 19:18:31 -0800
> > Paul Turner <pjt@google.com> wrote:
> >
> >> Hi all,
> >>
> >> Please find attached v4 of CFS bandwidth control; while this rebase
> >> against some of the latest SCHED_NORMAL code is new, the features
> >> and methodology are fairly mature at this point and have proved
> >> both effective and stable for several workloads.
> >>
> >> As always, all comments/feedback welcome.
> >>
> >
> > Hi Paul,
> >
> > Your patches provide a very useful but slightly different feature
> > for what we need to manage idle time in order to save power. What we
> > need is kind of a quota/period in terms of idle time. I have been
> > playing with your patches and noticed that when the cgroup cpu usage
> > exceeds the quota the effect of throttling is similar to what I have
> > been trying to do with freezer subsystem. i.e. freeze and thaw at
> > given period and percentage runtime.
> > https://lkml.org/lkml/2011/2/15/314
> >
> > Have you thought about adding such feature (please see detailed
> > description in the link above) to your patches?
> >
> 
> So reading the description it seems like rooting everything in a
> 'freezer' container and then setting up a quota of
> 
> (1 - frozen_percentage)  * nr_cpus * frozen_period * sec_to_usec
> 
I guess you meant frozen_percentage is less than 1, i.e. 90 is .90.  My
code treats 90 as 90; just a clarification.
> on a period of
> 
> frozen_period * sec_to_usec
> 
> Would provide the same functionality.  Is there other unduplicated
> functionality beyond this?
Do you mean the same functionality as your patch? Not really, since my
approach will stop the tasks based on hard time slices, but it seems
your patch will allow them to run if they don't exceed the quota.  Am I
missing something?
That is the only functionality difference I know of.

As the reviewer of the freezer patch pointed out, it is a more logical
fit to implement such a feature in the scheduler (i.e. yours) instead of
the freezer.  So I am wondering if your patch can be extended to include
limiting quota in terms of real (wall-clock) time.

I did a comparison study between CFS BW and the freezer patch on Skype,
with identical quota settings as you pointed out earlier.  Both use a 2
sec period and .2 sec quota (10%).  Skype typically uses 5% of the CPU
on my system when placing a call (below the cfs quota) and it wakes up
every 100ms to do some quick checks.  Then I ran Skype in the cpu and
then the freezer cgroup (with all its children).  Here are my results
based on timechart and powertop.

patch name	wakeups		skype call?
------------------------------------------------------------------
CFS BW		10/sec		yes
freezer		1/sec		no

Skype might not be the best example to illustrate the real usage of the
feature, but we are targeting mobile devices, which are mostly off or
often have only one application allowed in the foreground.  So we want
to reduce wakeups coming from the tasks that are not in the foreground.
> One thing that does seem undesirable about your approach is (as it
> seems to be described) threads will not be able to take advantage of
> naturally occurring idle cycles and will incur a potential performance
> penalty even at use << frozen_percentage.
> 
> e.g. From your post
> 
>        |  |<-- 90% frozen -     ->|  |                               |  |
> ____|  |________________x_|  |__________________|  |_____
> 
>         |<---- 5 seconds     ---->|
> 
> 
> Suppose no threads active until the wake up at x, suppose there is an
> accompanying 1 second of work for that thread to do.  That execution
> time will be dilated to ~1.5 seconds (as it will span the 0.5 seconds
> the freezer will stall for).  But the true usage for this period is
> ~20% <<< 90%
I agree my approach does not consider naturally occurring idle cycles.
But I am not sure if a thread can wake up at x when FROZEN.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-25  3:10     ` Paul Turner
@ 2011-02-25 13:58       ` Bharata B Rao
  2011-02-25 20:51         ` Paul Turner
  2011-02-28 13:48       ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Bharata B Rao @ 2011-02-25 13:58 UTC (permalink / raw
  To: Paul Turner
  Cc: Peter Zijlstra, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Thu, Feb 24, 2011 at 07:10:58PM -0800, Paul Turner wrote:
> On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> 
> >> +     update_cfs_load(cfs_rq, 0);
> >> +
> >> +     /* prevent previous buddy nominations from re-picking this se */
> >> +     clear_buddies(cfs_rq_of(se), se);
> >> +
> >> +     /*
> >> +      * It's possible for the current task to block and re-wake before task
> >> +      * switch, leading to a throttle within enqueue_task->update_curr()
> >> +      * versus an an entity that has not technically been enqueued yet.
> >
> > I'm not quite seeing how this would happen.. care to expand on this?
> >
> 
> I'm not sure the example Bharata gave is correct -- I'm going to treat
> that discussion separately as it's not the intent here.

Just for the record, my examples were not given for the above question from
Peter.

I answered two questions and I am tempted to stand by those until proven
wrong :)

1. Why do we have the cfs_rq_throttled() check in dequeue_task_fair() ?
( => How could we be running if our parent was throttled ?)

Consider the following hierarchy.

Root Group
   |
   |
Group 1 (Bandwidth constrained group)
   |
   |
Group 2 (Infinite runtime group)

Assume both the groups have tasks in them.

When Group 1 is throttled, its cfs_rq is marked throttled, and is removed from
Root group's runqueue. But leaf tasks in Group 2 continue to be enqueued in
Group 1's runqueue.

Load balancer kicks in on CPU A and figures out that it can pull a few tasks
from CPU B (busiest_cpu). It iterates through all the task groups
(load_balance_fair) and considers Group 2 also. It tries to pull a task from
CPU B's cfs_rq for Group 2. I don't see anything that would prevent the
load balancer from bailing out here. Note that Group 2 is technically
not throttled, only its parent Group 1 is. Load balancer goes ahead and
starts pulling individual tasks from Group 2's cfs_rq on CPU B. This
results in dequeuing of task whose hierarchy is throttled.

When load balancer iterates through Group 1's cfs_rqs, the situation is
different because we have already marked Group 1's cfs_rqs as throttled.
And we check this in load_balance_fair() and bail out from pulling tasks
from throttled hierarchy.

This is my understanding. Let me know what I am missing. Specifically I would
like to understand how you ensure that the load balancer doesn't consider
tasks from throttled cfs_rqs for pulling.

2. Why there is cfs_rq_throttled() check in account_cfs_rq_quota() ?

In addition to the case you described, I believe the situation I described
is also valid.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-25 13:58       ` Bharata B Rao
@ 2011-02-25 20:51         ` Paul Turner
  2011-02-28  3:50           ` Bharata B Rao
  0 siblings, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-02-25 20:51 UTC (permalink / raw
  To: bharata
  Cc: Peter Zijlstra, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Fri, Feb 25, 2011 at 5:58 AM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> On Thu, Feb 24, 2011 at 07:10:58PM -0800, Paul Turner wrote:
>> On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>>
>> >> +     update_cfs_load(cfs_rq, 0);
>> >> +
>> >> +     /* prevent previous buddy nominations from re-picking this se */
>> >> +     clear_buddies(cfs_rq_of(se), se);
>> >> +
>> >> +     /*
>> >> +      * It's possible for the current task to block and re-wake before task
>> >> +      * switch, leading to a throttle within enqueue_task->update_curr()
>> >> +      * versus an an entity that has not technically been enqueued yet.
>> >
>> > I'm not quite seeing how this would happen.. care to expand on this?
>> >
>>
>> I'm not sure the example Bharata gave is correct -- I'm going to treat
>> that discussion separately as it's not the intent here.
>
> Just for the record, my examples were not given for the above question from
> Peter.
>
> I answered two questions and I am tempted to stand by those until proven
> wrong :)

This is important to get right, I'm happy to elaborate.

>
> 1. Why do we have cfs_rq_throtted() check in dequeue_task_fair() ?

The check is primarily needed because we could become throttled as
part of a regular dequeue.  At which point we bail because the parent
dequeue is actually complete.

(Were it necessitated by load balance we could actually not do this
and just perform a hierarchical check within load_balance_fair.)
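
Concretely, the walk in dequeue_task_fair() looks roughly like this
(simplified sketch of the patched loop, not verbatim):

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                dequeue_entity(cfs_rq, se, flags);

                /*
                 * If update_curr() inside dequeue_entity() pushed us over
                 * quota, throttle_cfs_rq() has already dequeued the group
                 * entity -- the parent dequeue is complete, so bail.
                 */
                if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
                        break;
                flags |= DEQUEUE_SLEEP;
        }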

> ( => How could we be running if our parent was throttled ?)
>

The only way we can be running if our parent was throttled is if /we/
triggered that throttle and have been marked for re-schedule.

> Consider the following hierarchy.
>
> Root Group
>   |
>   |
> Group 1 (Bandwidth constrained group)
>   |
>   |
> Group 2 (Infinite runtime group)
>
> Assume both the groups have tasks in them.
>
> When Group 1 is throttled, its cfs_rq is marked throttled, and is removed from
> Root group's runqueue. But leaf tasks in Group 2 continue to be enqueued in
> Group 1's runqueue.
>

Yes, the hierarchy state is maintained in isolation.

> Load balancer kicks in on CPU A and figures out that it can pull a few tasks
> from CPU B (busiest_cpu). It iterates through all the task groups
> (load_balance_fair) and considers Group 2 also. It tries to pull a task from
> CPU B's cfs_rq for Group 2. I don't see anything that would prevent the
> load balancer from bailing out here.

Per above, the descendants of a throttled group are also identified
(and appropriately skipped) using h_load.

> Note that Group 2 is technically
> not throttled, only its parent Group 1 is. Load balancer goes ahead and
> starts pulling individual tasks from Group 2's cfs_rq on CPU B.

In general, not true -- load balancing against a throttled hierarchy
is crazy[*].

>  This results in dequeuing of task whose hierarchy is throttled.
>

[*]: There is one edge case in which it may sanely occur:

Namely, if the load balance races with a throttle (since we don't take
rq->locks until we start actually moving tasks).  In this case it's
still ok because the cached h_load ensures the load balancer is still
working from a sane load view; the effect is a minute re-ordering, as
if the load balance had occurred fractionally before the throttle
instead of fractionally after.


> When load balancer iterates through Group 1's cfs_rqs, the situation is
> different because we have already marked Group 1's cfs_rqs as throttled.
> And we check this in load_balance_fair() and bail out from pulling tasks
> from throttled hierarchy.
>
> This is my understanding. Let me know what I miss. Specifically I would
> like to understand how do you ensure that load balancer doesn't consider
> tasks from throttled cfs_rqs for pulling.
>
> 2. Why there is cfs_rq_throttled() check in account_cfs_rq_quota() ?
>
> In addition to the case you described, I believe the situation I described
> is also valid.
>

The point made above was that it's actually for any update that may
(legitimately) occur as throttling can not always be aligned with
eviction.  The case you gave is one of several -- there's nothing
particularly unique about it (nor did I actually disagree with it).

> Regards,
> Bharata.
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking
  2011-02-23 13:32   ` Peter Zijlstra
  2011-02-25  3:11     ` Paul Turner
@ 2011-02-25 20:53     ` Paul Turner
  1 sibling, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-25 20:53 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>
>> @@ -245,6 +248,15 @@ struct cfs_rq;
>>
>>  static LIST_HEAD(task_groups);
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +struct cfs_bandwidth {
>> +     raw_spinlock_t          lock;
>> +     ktime_t                 period;
>> +     u64                     runtime, quota;
>> +     struct hrtimer          period_timer;
>> +};
>> +#endif
>
> If you write that as:
>
> struct cfs_bandwidth {
> #ifdef CONFIG_CFS_BANDWIDTH
>        ...
> #endif
> };
>
>>  /* task group related information */
>>  struct task_group {
>>       struct cgroup_subsys_state css;
>> @@ -276,6 +288,10 @@ struct task_group {
>>  #ifdef CONFIG_SCHED_AUTOGROUP
>>       struct autogroup *autogroup;
>>  #endif
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     struct cfs_bandwidth cfs_bandwidth;
>> +#endif
>>  };
>
> You can avoid the #ifdef'ery here
>
>>  /* task_group_lock serializes the addition/removal of task groups */
>> @@ -370,9 +386,76 @@ struct cfs_rq {
>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun);
>> +
>> +static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
>> +{
>> +     struct cfs_bandwidth *cfs_b =
>> +             container_of(timer, struct cfs_bandwidth, period_timer);
>> +     ktime_t now;
>> +     int overrun;
>> +     int idle = 0;
>> +
>> +     for (;;) {
>> +             now = hrtimer_cb_get_time(timer);
>> +             overrun = hrtimer_forward(timer, now, cfs_b->period);
>> +
>> +             if (!overrun)
>> +                     break;
>> +
>> +             idle = do_sched_cfs_period_timer(cfs_b, overrun);
>> +     }
>> +
>> +     return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
>> +}
>> +
>> +static
>> +void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, u64 quota, u64 period)
>> +{
>> +     raw_spin_lock_init(&cfs_b->lock);
>> +     cfs_b->quota = cfs_b->runtime = quota;
>> +     cfs_b->period = ns_to_ktime(period);
>> +
>> +     hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>> +     cfs_b->period_timer.function = sched_cfs_period_timer;
>> +}
>> +
>> +static
>> +void init_cfs_rq_quota(struct cfs_rq *cfs_rq)
>> +{
>> +     cfs_rq->quota_used = 0;
>> +     if (cfs_rq->tg->cfs_bandwidth.quota == RUNTIME_INF)
>> +             cfs_rq->quota_assigned = RUNTIME_INF;
>> +     else
>> +             cfs_rq->quota_assigned = 0;
>> +}
>> +
>> +static void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> +{
>> +     if (cfs_b->quota == RUNTIME_INF)
>> +             return;
>> +
>> +     if (hrtimer_active(&cfs_b->period_timer))
>> +             return;
>> +
>> +     raw_spin_lock(&cfs_b->lock);
>> +     start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
>> +     raw_spin_unlock(&cfs_b->lock);
>> +}
>> +
>> +static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>> +{
>> +     hrtimer_cancel(&cfs_b->period_timer);
>> +}
>> +#endif
>
> and #else
>
> stubs
> #endif
>
>>  /* Real-Time classes' related field in a runqueue: */
>>  struct rt_rq {
>>       struct rt_prio_array active;
>> @@ -8038,6 +8121,9 @@ static void init_tg_cfs_entry(struct tas
>>       tg->cfs_rq[cpu] = cfs_rq;
>>       init_cfs_rq(cfs_rq, rq);
>>       cfs_rq->tg = tg;
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     init_cfs_rq_quota(cfs_rq);
>> +#endif
>
> also avoids #ifdef'ery here
>
>>       tg->se[cpu] = se;
>>       /* se could be NULL for root_task_group */
>> @@ -8173,6 +8259,10 @@ void __init sched_init(void)
>>                * We achieve this by letting root_task_group's tasks sit
>>                * directly in rq->cfs (i.e root_task_group->se[] = NULL).
>>                */
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +             init_cfs_bandwidth(&root_task_group.cfs_bandwidth,
>> +                             RUNTIME_INF, sched_cfs_bandwidth_period);
>> +#endif
>
> and here
>
>>               init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL);
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>
>> @@ -8415,6 +8505,10 @@ static void free_fair_sched_group(struct
>>  {
>>       int i;
>>
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     destroy_cfs_bandwidth(&tg->cfs_bandwidth);
>> +#endif
>
> and here
>
>>       for_each_possible_cpu(i) {
>>               if (tg->cfs_rq)
>>                       kfree(tg->cfs_rq[i]);
>> @@ -8442,7 +8536,10 @@ int alloc_fair_sched_group(struct task_g
>>               goto err;
>>
>>       tg->shares = NICE_0_LOAD;
>> -
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     init_cfs_bandwidth(&tg->cfs_bandwidth, RUNTIME_INF,
>> +                     sched_cfs_bandwidth_period);
>> +#endif
>
> and here
>
>>       for_each_possible_cpu(i) {
>>               rq = cpu_rq(i);
>>
>
>> @@ -9107,6 +9204,116 @@ static u64 cpu_shares_read_u64(struct cg
>>
>>       return (u64) tg->shares;
>>  }
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>> +{
>> +     int i;
>> +     static DEFINE_MUTEX(mutex);
>> +
>> +     if (tg == &root_task_group)
>> +             return -EINVAL;
>> +
>> +     if (!period)
>> +             return -EINVAL;
>> +
>> +     /*
>> +      * Ensure we have at least one tick of bandwidth every period.  This is
>> +      * to prevent reaching a state of large arrears when throttled via
>> +      * entity_tick() resulting in prolonged exit starvation.
>> +      */
>> +     if (NS_TO_JIFFIES(quota) < 1)
>> +             return -EINVAL;
>> +
>> +     mutex_lock(&mutex);
>> +     raw_spin_lock_irq(&tg->cfs_bandwidth.lock);
>> +     tg->cfs_bandwidth.period = ns_to_ktime(period);
>> +     tg->cfs_bandwidth.runtime = tg->cfs_bandwidth.quota = quota;
>> +     raw_spin_unlock_irq(&tg->cfs_bandwidth.lock);
>> +
>> +     for_each_possible_cpu(i) {
>> +             struct cfs_rq *cfs_rq = tg->cfs_rq[i];
>> +             struct rq *rq = rq_of(cfs_rq);
>> +
>> +             raw_spin_lock_irq(&rq->lock);
>> +             init_cfs_rq_quota(cfs_rq);
>> +             raw_spin_unlock_irq(&rq->lock);
>
> Any particular reason you didn't mirror rt_rq->rt_runtime_lock?
>

Missed this in original reply -- just that no additional locking is
required so we can avoid the overhead.  The existing rq->lock
synchronization against cfs_rq is sufficient.

>> +     }
>> +     mutex_unlock(&mutex);
>> +
>> +     return 0;
>> +}
>
>
>> Index: tip/kernel/sched_fair.c
>> ===================================================================
>> --- tip.orig/kernel/sched_fair.c
>> +++ tip/kernel/sched_fair.c
>> @@ -88,6 +88,15 @@ const_debug unsigned int sysctl_sched_mi
>>   */
>>  unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>>
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +/*
>> + * default period for cfs group bandwidth.
>> + * default: 0.5s, units: nanoseconds
>> + */
>> +static u64 sched_cfs_bandwidth_period = 500000000ULL;
>> +#endif
>> +
>>  static const struct sched_class fair_sched_class;
>>
>>  /**************************************************************
>> @@ -397,6 +406,9 @@ static void __enqueue_entity(struct cfs_
>>
>>       rb_link_node(&se->run_node, parent, link);
>>       rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
>> +#endif
>>  }
>
> This really needs to life elsewhere, __*_entity() functions are for
> rb-tree muck.
>
>>  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh
  2011-02-23 13:32   ` Peter Zijlstra
  2011-02-24  7:04     ` Bharata B Rao
@ 2011-02-26  0:02     ` Paul Turner
  1 sibling, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-26  0:02 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

Oops missed this one before:

On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>
>> +static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> +{
>> +     struct rq *rq = rq_of(cfs_rq);
>> +     struct sched_entity *se;
>> +
>> +     se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> +
>> +     update_rq_clock(rq);
>> +     /* (Try to) avoid maintaining share statistics for idle time */
>> +     cfs_rq->load_stamp = cfs_rq->load_last = rq->clock_task;
>
> Ok, so here you try to compensate for some of the weirdness from the
> previous patch.. wouldn't it be much saner to fully consider the
> throttled things dequeued for the load calculation etc.?
>

That's attempted -- but there's no way to control wakeups, which will
trigger the usual updates, so we do have to do something.

The alternative is more invasive re-ordering of the dequeue/enqueue
paths which I think actually ends up pretty ugly without improving
things.

>> +
>> +     cfs_rq->throttled = 0;
>> +     for_each_sched_entity(se) {
>> +             if (se->on_rq)
>> +                     break;
>> +
>> +             cfs_rq = cfs_rq_of(se);
>> +             enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
>> +             if (cfs_rq_throttled(cfs_rq))
>> +                     break;
>
> That's just weird, it was throttled, you enqueued it but find it
> throttled.
>

Two reasons:

a) We might be unthrottling a child in a throttled hierarchy.  This
can occur regardless of conformancy (e.g. with different periods a
child's refresh can unthrottle it while its parent remains throttled).
b) Edge case: suppose there's no bandwidth remaining and the enqueue
pushes things back into a throttled state.

>> +     }
>> +
>> +     /* determine whether we need to wake up potentally idle cpu */
>
> SP: potentially, also isn't there a determiner missing?

Spelling fixed, I think the determiner is ok though:

- We know nr_running must have been zero before, since rq->curr ==
rq->idle (also, if this *has* changed then there's already a resched
for that in flight and we don't need to issue one).  This also
implies that rq->cfs.nr_running was == 0.

- Root cfs_rq.nr_running now being greater than zero tells us that our
unthrottle was root-visible (specifically, it was not a throttled
child of another throttled hierarchy), which tells us that there's a
task waiting.

Am I missing a case?

>
>> +     if (rq->curr == rq->idle && rq->cfs.nr_running)
>> +             resched_task(rq->curr);
>> +}
>> +
>>  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>>               unsigned long delta_exec)
>>  {
>> @@ -1535,8 +1569,46 @@ static void account_cfs_rq_quota(struct
>>
>>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
>>  {
>> -     return 1;
>> +     int i, idle = 1;
>> +     u64 delta;
>> +     const struct cpumask *span;
>> +
>> +     if (cfs_b->quota == RUNTIME_INF)
>> +             return 1;
>> +
>> +     /* reset group quota */
>> +     raw_spin_lock(&cfs_b->lock);
>> +     cfs_b->runtime = cfs_b->quota;
>
> Shouldn't that be something like:
>
> cfs_b->runtime =
>   min(cfs_b->runtime + overrun * cfs_b->quota, cfs_b->quota);
>
> afaict runtime can go negative in which case we need to compensate for
> that, but we cannot ever get more than quota because we allow for
> overcommit, so not limiting things would allow us to accrue an unlimited
> amount of runtime.
>
> Or can only the per-cpu quota muck go negative?

The over-run can only occur on a local cpu (e.g. due to us being
unable to immediately evict a throttled entity).  By injecting a
constant amount of bandwidth into the global pool we are able to
correct that over-run in the subsequent period.

> In that case it should
> probably be propagated back into the global bw on throttle, otherwise
> you can get deficits on CPUs that remain unused for a while.
>

I think you mean surplus :).  Yes, there is potentially a small amount
of surplus quota in the system; the "hard" bound is that across N
periods you can receive [N periods' worth + (slice size * NR_CPUs)] of
quota, since this is what may be outstanding as the above surpluses.
Since the slice size is fairly small this over-run stays fairly tight
to the stated bound (as well as being manageable through the slice size
sysctl when required).
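
For illustration (numbers purely as an example): on an 8-cpu machine
with a 10ms bandwidth slice and a 100ms quota per period, the extra
quota that can be outstanding across those N periods is bounded by
8 * 10ms = 80ms -- less than one additional period's quota -- and it
shrinks further if the slice sysctl is reduced.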


>> +     raw_spin_unlock(&cfs_b->lock);
>> +
>> +     span = sched_bw_period_mask();
>> +     for_each_cpu(i, span) {
>> +             struct rq *rq = cpu_rq(i);
>> +             struct cfs_rq *cfs_rq = cfs_bandwidth_cfs_rq(cfs_b, i);
>> +
>> +             if (cfs_rq->nr_running)
>> +                     idle = 0;
>> +
>> +             if (!cfs_rq_throttled(cfs_rq))
>> +                     continue;
>> +
>> +             delta = tg_request_cfs_quota(cfs_rq->tg);
>> +
>> +             if (delta) {
>> +                     raw_spin_lock(&rq->lock);
>> +                     cfs_rq->quota_assigned += delta;
>> +
>> +                     /* avoid race with tg_set_cfs_bandwidth */
>
> *what* race, and *how*
>

When a user sets a new bandwidth limit for the cgroup (e.g. removes
it or sets unlimited bandwidth), that operation may in itself unthrottle
the group.  Since we synchronize on rq->lock, rechecking this
condition is sufficient to avoid a double unthrottle here.

>> +                     if (cfs_rq_throttled(cfs_rq) &&
>> +                          cfs_rq->quota_used < cfs_rq->quota_assigned)
>> +                             unthrottle_cfs_rq(cfs_rq);
>> +                     raw_spin_unlock(&rq->lock);
>> +             }
>> +     }
>> +
>> +     return idle;
>>  }
>
> This whole positive quota muck makes my head hurt, whatever did you do
> that for? Also it doesn't deal with wrapping, which admittedly won't
> really happen but still.
>

Ah-ha! In going through and swapping things to a single counter I
remember the reason now:

Since we can overflow usage on the per-cpu tracking, when using a
single counter care must be taken to avoid colliding with
RUNTIME_INF, since it's (-1).

Now I'm debating whether the ugliness of these checks is worth it.
Perhaps moving RUNTIME_INF out of quota_remaining and having a
separate per-cpu quota enabled indicator would be the cleanest of all
three.
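
e.g. a rough sketch of that third option (field and helper names are
illustrative, not from the posted patch):

        struct cfs_rq {
                ...
                s64     quota_remaining; /* signed; local over-run may go negative */
                int     quota_enabled;   /* set iff the tg's quota != RUNTIME_INF */
        };

        static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
                        unsigned long delta_exec)
        {
                if (!cfs_rq->quota_enabled)
                        return;

                cfs_rq->quota_remaining -= delta_exec;
                if (cfs_rq->quota_remaining > 0)
                        return;

                /* try to refill from the global pool, else throttle ... */
        }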

>
>
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-25 20:51         ` Paul Turner
@ 2011-02-28  3:50           ` Bharata B Rao
  2011-02-28  6:38             ` Paul Turner
  0 siblings, 1 reply; 71+ messages in thread
From: Bharata B Rao @ 2011-02-28  3:50 UTC (permalink / raw
  To: Paul Turner
  Cc: Peter Zijlstra, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Fri, Feb 25, 2011 at 12:51:01PM -0800, Paul Turner wrote:
> On Fri, Feb 25, 2011 at 5:58 AM, Bharata B Rao
> <bharata@linux.vnet.ibm.com> wrote:
> > On Thu, Feb 24, 2011 at 07:10:58PM -0800, Paul Turner wrote:
> >> On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >> > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
> >>
> >> >> +     update_cfs_load(cfs_rq, 0);
> >> >> +
> >> >> +     /* prevent previous buddy nominations from re-picking this se */
> >> >> +     clear_buddies(cfs_rq_of(se), se);
> >> >> +
> >> >> +     /*
> >> >> +      * It's possible for the current task to block and re-wake before task
> >> >> +      * switch, leading to a throttle within enqueue_task->update_curr()
> >> >> +      * versus an an entity that has not technically been enqueued yet.
> >> >
> >> > I'm not quite seeing how this would happen.. care to expand on this?
> >> >
> >>
> >> I'm not sure the example Bharata gave is correct -- I'm going to treat
> >> that discussion separately as it's not the intent here.
> >
> > Just for the record, my examples were not given for the above question from
> > Peter.
> >
> > I answered two questions and I am tempted to stand by those until proven
> > wrong :)
> 
> This is important to get right, I'm happy to elaborate.
> 
> >
> > 1. Why do we have cfs_rq_throtted() check in dequeue_task_fair() ?
> 
> The check is primarily needed because we could become throttled as
> part of a regular dequeue.  At which point we bail because the parent
> dequeue is actually complete.
> 
> (Were it necessitated by load balance we could actually not do this
> and just perform a hierarchal check within load_balance_fair)
> 
> > ( => How could we be running if our parent was throttled ?)
> >
> 
> The only way we can be running if our parent was throttled is if /we/
> triggered that throttle and have been marked for re-schedule.
> 
> > Consider the following hierarchy.
> >
> > Root Group
> >   |
> >   |
> > Group 1 (Bandwidth constrained group)
> >   |
> >   |
> > Group 2 (Infinite runtime group)
> >
> > Assume both the groups have tasks in them.
> >
> > When Group 1 is throttled, its cfs_rq is marked throttled, and is removed from
> > Root group's runqueue. But leaf tasks in Group 2 continue to be enqueued in
> > Group 1's runqueue.
> >
> 
> Yes, the hierarchy state is maintained in isolation.
> 
> > Load balancer kicks in on CPU A and figures out that it can pull a few tasks
> > from CPU B (busiest_cpu). It iterates through all the task groups
> > (load_balance_fair) and considers Group 2 also. It tries to pull a task from
> > CPU B's cfs_rq for Group 2. I don't see anything that would prevent the
> > load balancer from bailing out here.
> 
> Per above, the descendants of a throttled group are also identified
> (and appropriately skipped) using h_load.

This bit is still unclear to me. We do nothing in tg_load_down() to treat
throttled cfs_rqs differently when calculating h_load. Nor do we do
anything in load_balance_fair() to explicitly identify descendants of a
throttled group using h_load AFAICS. All we have is the
cfs_rq_throttled() check, which I think should be converted to entity_on_rq()
to check for the throttled hierarchy and discard pulling from throttled
hierarchies.
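
Something along these lines perhaps (just a sketch; the helper name is
illustrative):

        /* is any cfs_rq in this entity's hierarchy throttled? */
        static int throttled_hierarchy(struct cfs_rq *cfs_rq)
        {
                struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

                if (cfs_rq_throttled(cfs_rq))
                        return 1;

                for_each_sched_entity(se) {
                        if (cfs_rq_throttled(cfs_rq_of(se)))
                                return 1;
                }

                return 0;
        }

load_balance_fair() could then skip such groups entirely instead of
checking only the group's own cfs_rq.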

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-28  3:50           ` Bharata B Rao
@ 2011-02-28  6:38             ` Paul Turner
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-02-28  6:38 UTC (permalink / raw
  To: bharata
  Cc: Peter Zijlstra, linux-kernel, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Sun, Feb 27, 2011 at 7:50 PM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> On Fri, Feb 25, 2011 at 12:51:01PM -0800, Paul Turner wrote:
>> On Fri, Feb 25, 2011 at 5:58 AM, Bharata B Rao
>> <bharata@linux.vnet.ibm.com> wrote:
>> > On Thu, Feb 24, 2011 at 07:10:58PM -0800, Paul Turner wrote:
>> >> On Wed, Feb 23, 2011 at 5:32 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> >> > On Tue, 2011-02-15 at 19:18 -0800, Paul Turner wrote:
>> >>
>> >> >> +     update_cfs_load(cfs_rq, 0);
>> >> >> +
>> >> >> +     /* prevent previous buddy nominations from re-picking this se */
>> >> >> +     clear_buddies(cfs_rq_of(se), se);
>> >> >> +
>> >> >> +     /*
>> >> >> +      * It's possible for the current task to block and re-wake before task
>> >> >> +      * switch, leading to a throttle within enqueue_task->update_curr()
>> >> >> +      * versus an an entity that has not technically been enqueued yet.
>> >> >
>> >> > I'm not quite seeing how this would happen.. care to expand on this?
>> >> >
>> >>
>> >> I'm not sure the example Bharata gave is correct -- I'm going to treat
>> >> that discussion separately as it's not the intent here.
>> >
>> > Just for the record, my examples were not given for the above question from
>> > Peter.
>> >
>> > I answered two questions and I am tempted to stand by those until proven
>> > wrong :)
>>
>> This is important to get right, I'm happy to elaborate.
>>
>> >
>> > 1. Why do we have cfs_rq_throtted() check in dequeue_task_fair() ?
>>
>> The check is primarily needed because we could become throttled as
>> part of a regular dequeue.  At which point we bail because the parent
>> dequeue is actually complete.
>>
>> (Were it necessitated by load balance we could actually not do this
>> and just perform a hierarchal check within load_balance_fair)
>>
>> > ( => How could we be running if our parent was throttled ?)
>> >
>>
>> The only way we can be running if our parent was throttled is if /we/
>> triggered that throttle and have been marked for re-schedule.
>>
>> > Consider the following hierarchy.
>> >
>> > Root Group
>> >   |
>> >   |
>> > Group 1 (Bandwidth constrained group)
>> >   |
>> >   |
>> > Group 2 (Infinite runtime group)
>> >
>> > Assume both the groups have tasks in them.
>> >
>> > When Group 1 is throttled, its cfs_rq is marked throttled, and is removed from
>> > Root group's runqueue. But leaf tasks in Group 2 continue to be enqueued in
>> > Group 1's runqueue.
>> >
>>
>> Yes, the hierarchy state is maintained in isolation.
>>
>> > Load balancer kicks in on CPU A and figures out that it can pull a few tasks
>> > from CPU B (busiest_cpu). It iterates through all the task groups
>> > (load_balance_fair) and considers Group 2 also. It tries to pull a task from
>> > CPU B's cfs_rq for Group 2. I don't see anything that would prevent the
>> > load balancer from bailing out here.
>>
>> Per above, the descendants of a throttled group are also identified
>> (and appropriately skipped) using h_load.
>
> This bit is still unclear to me. We do nothing in tg_load_down() to treat
> throttled cfs_rqs differently when calculating h_load.

From above:

"I agree.  We ensure this does not happen by making the h_load zero.
Something I thought I was doing but apparently not, will fix in
repost."

>Nor do we do
> anything in load_balance_fair() to explicitly identify descendents of
> throttled group using h_load AFAICS. All we have is
> cfs_rq_throttled() check, which I think should be converted to entity_on_rq()
> to check for the throttled hierarchy and discard pulling from throttled
> hierarchies.
>
> Regards,
> Bharata.
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-25  3:10     ` Paul Turner
  2011-02-25 13:58       ` Bharata B Rao
@ 2011-02-28 13:48       ` Peter Zijlstra
  2011-03-01  8:31         ` Paul Turner
  1 sibling, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2011-02-28 13:48 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Thu, 2011-02-24 at 19:10 -0800, Paul Turner wrote:

> >> @@ -761,7 +788,11 @@ static void update_cfs_load(struct cfs_r
> >>       u64 now, delta;
> >>       unsigned long load = cfs_rq->load.weight;
> >>
> >> -     if (cfs_rq->tg == &root_task_group)
> >> +     /*
> >> +      * Don't maintain averages for the root task group, or while we are
> >> +      * throttled.
> >> +      */
> >> +     if (cfs_rq->tg == &root_task_group || cfs_rq_throttled(cfs_rq))
> >>               return;
> >>
> >>       now = rq_of(cfs_rq)->clock_task;
> >
> > Placing the return there avoids updating the timestamps, so once we get
> > unthrottled we'll observe a very long period and skew the load avg?
> >
> 
> It's easier to avoid this by fixing up the load average on unthrottle,
> since there's no point in moving up the intermediate timestamps on
> each throttled update.
> 
> The one "gotcha" in either case is that it's possible for time to
> drift on the child of a throttled group and I don't see an easy way
> around this.

drift how? running while being throttled due to non-preempt and other
things?

> > Ideally we'd never call this on throttled groups to begin with and
> > handle them like full dequeue/enqueue like things.
> >
> 
> This is what is attempted -- however it's still possible actions such
> as wakeup which may still occur against throttled groups regardless of
> their queue state.
> 
> In this case we still need to preserve the correct child hierarchy
> state so that it can be re-enqueued when there is again bandwidth.

If wakeup is the one sore spot, why not terminate the hierarchy
iteration in enqueue_task_fair that does all the load bits?

> >> @@ -1015,6 +1046,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
> >>        * Update run-time statistics of the 'current'.
> >>        */
> >>       update_curr(cfs_rq);
> >> +
> >> +
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> +     if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
> >> +          !group_cfs_rq(se)->nr_running))
> >> +             return;
> >> +#endif
> >> +
> >>       update_cfs_load(cfs_rq, 0);
> >>       account_entity_enqueue(cfs_rq, se);
> >>       update_cfs_shares(cfs_rq);
> >> @@ -1087,6 +1126,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
> >>        */
> >>       update_curr(cfs_rq);
> >>
> >> +#ifdef CONFIG_CFS_BANDWIDTH
> >> +     if (!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)))
> >> +             return;
> >> +#endif
> >> +
> >>       update_stats_dequeue(cfs_rq, se);
> >>       if (flags & DEQUEUE_SLEEP) {
> >>  #ifdef CONFIG_SCHEDSTATS
> >
> > These make me very nervous, on enqueue you bail after adding
> > min_vruntime to ->vruntime and calling update_curr(), but on dequeue you
> > bail before subtracting min_vruntime from ->vruntime.
> >
> 
> min_vruntime shouldn't be added in enqueue since unthrottling is
> treated as a wakeup (which results in placement versus min as opposed
> to normalization).

Sure, but at least put a comment there, I mean that's a glaring
asymmetry.

> >> @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct
> >>                       break;
> >>               cfs_rq = cfs_rq_of(se);
> >>               enqueue_entity(cfs_rq, se, flags);
> >> +             /* don't continue to enqueue if our parent is throttled */
> >> +             if (cfs_rq_throttled(cfs_rq))
> >> +                     break;
> >>               flags = ENQUEUE_WAKEUP;
> >>       }
> >>
> >> @@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq
> >>               cfs_rq = cfs_rq_of(se);
> >>               dequeue_entity(cfs_rq, se, flags);
> >>
> >> -             /* Don't dequeue parent if it has other entities besides us */
> >> -             if (cfs_rq->load.weight)
> >> +             /*
> >> +              * Don't dequeue parent if it has other entities besides us,
> >> +              * or if it is throttled
> >> +              */
> >> +             if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
> >>                       break;
> >>               flags |= DEQUEUE_SLEEP;
> >>       }
> >
> > How could we even be running if our parent was throttled?
> >
> 
> It's possible we throttled within the preceding dequeue_entity -- the
> partial update_curr against cfs_rq might be just enough to push it
> over the edge.  In which case that entity has already been dequeued
> and we want to bail out.

right.

> 
> >> @@ -1430,6 +1480,42 @@ static u64 tg_request_cfs_quota(struct t
> >>       return delta;
> >>  }
> >>
> >> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >> +{
> >> +     struct sched_entity *se;
> >> +
> >> +     se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
> >> +
> >> +     /* account load preceeding throttle */
> >> +     update_cfs_load(cfs_rq, 0);
> >> +
> >> +     /* prevent previous buddy nominations from re-picking this se */
> >> +     clear_buddies(cfs_rq_of(se), se);
> >> +
> >> +     /*
> >> +      * It's possible for the current task to block and re-wake before task
> >> +      * switch, leading to a throttle within enqueue_task->update_curr()
> >> +      * versus an an entity that has not technically been enqueued yet.
> >
> > I'm not quite seeing how this would happen.. care to expand on this?
> >
> 
> I'm not sure the example Bharata gave is correct -- I'm going to treat
> that discussion separately as it's not the intent here.
> 
> Here the task _is_ running.
> 
> Specifically:
> 
> - Suppose the current task on a cfs_rq blocks
> - Accordingly we issue dequeue against that task (however it remains
> as curr until the put)
> - Before we get to the put some other activity (e.g. network bottom
> half) gets to run and re-wake the task
> - The time elapsed for this is charged to the task, which might push
> it over its reservation, it then gets throttled while we're trying to
> queue it
> 
> BUT
> 
> We haven't actually done any of the enqueue work yet so there's
> nothing to do to take it off rq.  So what we just mark it throttled
> and make sure that the rest of the enqueue work gets short circuited.
> 
> The clock_task helps reduce the occurrence of this since the task will
> be spared the majority of the SI time but it's still possible to push
> it over.

Ah, uhm, so this is all due to us dropping rq->lock after dequeue,
right? Would 

  https://lkml.org/lkml/2011/1/4/228

help here?

> >> +      * In this case, since we haven't actually done the enqueue yet, cut
> >> +      * out and allow enqueue_entity() to short-circuit
> >> +      */
> >> +     if (!se->on_rq)
> >> +             goto out_throttled;
> >> +
> >> +     for_each_sched_entity(se) {
> >> +             struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >> +
> >> +             dequeue_entity(cfs_rq, se, 1);
> >> +             if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
> >> +                     break;
> >> +     }
> >> +
> >> +out_throttled:
> >> +     cfs_rq->throttled = 1;
> >> +     update_cfs_rq_load_contribution(cfs_rq, 1);
> >> +}
> >> +
> >>  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
> >>               unsigned long delta_exec)
> >>  {
> >> @@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct
> >>
> >>       cfs_rq->quota_used += delta_exec;
> >>
> >> -     if (cfs_rq->quota_used < cfs_rq->quota_assigned)
> >> +     if (cfs_rq_throttled(cfs_rq) ||
> >> +             cfs_rq->quota_used < cfs_rq->quota_assigned)
> >>               return;
> >
> > So we are throttled but running anyway, I suppose this comes from the PI
> > ceiling muck?
> >
> 
> No -- this is just the fact that there are cases where reschedule
> can't evict the task immediately.
> 
> e.g. softirq or any kernel time without config_preempt
> 
> Once we're throttled we know there's no time left or point in trying
> to acquire it so just short circuit these until we get to a point
> where this task can be removed from rq.

Right, but like I argued in another email, it could be refreshed on
another cpu and you now miss it.. :-)

> >> +     if (!entity_on_rq(pse))
> >> +             return;
> >> +#endif
> >
> > Ideally that #ifdef'ery would go away too.
> 
> This can 100% go away (and is already in the #ifdefs), but it will
> always be true in the !BANDWIDTH case, so it's a micro-overhead.
> Accompanying micro-optimization isn't really needed :)

Wouldn't gcc be able to optimize if (!true) stmt; with DCE ?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-28 13:48       ` Peter Zijlstra
@ 2011-03-01  8:31         ` Paul Turner
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-03-01  8:31 UTC (permalink / raw
  To: Peter Zijlstra
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Kamalesh Babulal, Ingo Molnar, Pavel Emelyanov, Herbert Poetzl,
	Avi Kivity, Chris Friesen, Nikhil Rao

On Mon, Feb 28, 2011 at 5:48 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2011-02-24 at 19:10 -0800, Paul Turner wrote:
>
>> >> @@ -761,7 +788,11 @@ static void update_cfs_load(struct cfs_r
>> >>       u64 now, delta;
>> >>       unsigned long load = cfs_rq->load.weight;
>> >>
>> >> -     if (cfs_rq->tg == &root_task_group)
>> >> +     /*
>> >> +      * Don't maintain averages for the root task group, or while we are
>> >> +      * throttled.
>> >> +      */
>> >> +     if (cfs_rq->tg == &root_task_group || cfs_rq_throttled(cfs_rq))
>> >>               return;
>> >>
>> >>       now = rq_of(cfs_rq)->clock_task;
>> >
>> > Placing the return there avoids updating the timestamps, so once we get
>> > unthrottled we'll observe a very long period and skew the load avg?
>> >
>>
>> It's easier to avoid this by fixing up the load average on unthrottle,
>> since there's no point in moving up the intermediate timestamps on
>> each throttled update.
>>
>> The one "gotcha" in either case is that it's possible for time to
>> drift on the child of a throttled group and I don't see an easy way
>> around this.
>
> drift how? running while being throttled due to non-preempt and other
> things?
>

Not quite -- that time will actually be omitted since we nuke the
last_update on unthrottle.

What I was referring to here is that it's not easy (and not currently
done) to freeze the load average clock for descendants of a throttled
entity.

Plugging this properly is actually rather annoying since we'd have to
do a tree walk at both throttle and unthrottle (it's not sufficient to
just fix things up at unthrottle because wakeups can lead to updates
in the interim, meaning we'd need to mark some state).

Whether this drift is worth all that hassle is questionable at this point.

>> > Ideally we'd never call this on throttled groups to begin with and
>> > handle them like full dequeue/enqueue like things.
>> >
>>
>> This is what is attempted -- however it's still possible actions such
>> as wakeup which may still occur against throttled groups regardless of
>> their queue state.
>>
>> In this case we still need to preserve the correct child hierarchy
>> state so that it can be re-enqueued when there is again bandwidth.
>
> If wakeup is the one sore spot, why not terminate the hierarchy
> iteration in enqueue_task_fair that does all the load bits?
>

It does actually terminate early already; the problem here is that the
processing for the children has already occurred, since it's a bottom-up
enqueue.

Perhaps this can be simplified:

If we do a single walk at the start to validate whether the entity is
throttled (we can use entity_on_rq to keep things clean) then we can
pass a flag into enqueue_entity indicating whether it is part of a
throttled hierarchy.

This would allow us to skip the accounting and keep things accurate
above without having to do all the tree walks at throttle/unthrottle.

Let me see what this looks like for the next posting.
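
Roughly, the idea is something like this (sketch only; the flag name is
illustrative and not settled):

        /* enqueue_task_fair(): a single walk up front to see whether the
         * task's hierarchy is currently throttled (se == &p->se on entry) */
        int throttled = 0;

        for_each_sched_entity(se) {
                if (cfs_rq_throttled(cfs_rq_of(se))) {
                        throttled = 1;
                        break;
                }
        }

        se = &p->se;
        for_each_sched_entity(se) {
                if (se->on_rq)
                        break;
                cfs_rq = cfs_rq_of(se);
                enqueue_entity(cfs_rq, se,
                               flags | (throttled ? ENQUEUE_THROTTLED : 0));
                flags = ENQUEUE_WAKEUP;
        }

enqueue_entity() would then still do the vruntime placement but skip the
load/shares accounting when the flag is set, keeping the hierarchy state
consistent without walking the tree at throttle/unthrottle.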

>> >> @@ -1015,6 +1046,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>> >>        * Update run-time statistics of the 'current'.
>> >>        */
>> >>       update_curr(cfs_rq);
>> >> +
>> >> +
>> >> +#ifdef CONFIG_CFS_BANDWIDTH
>> >> +     if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
>> >> +          !group_cfs_rq(se)->nr_running))
>> >> +             return;
>> >> +#endif
>> >> +
>> >>       update_cfs_load(cfs_rq, 0);
>> >>       account_entity_enqueue(cfs_rq, se);
>> >>       update_cfs_shares(cfs_rq);
>> >> @@ -1087,6 +1126,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, st
>> >>        */
>> >>       update_curr(cfs_rq);
>> >>
>> >> +#ifdef CONFIG_CFS_BANDWIDTH
>> >> +     if (!entity_is_task(se) && cfs_rq_throttled(group_cfs_rq(se)))
>> >> +             return;
>> >> +#endif
>> >> +
>> >>       update_stats_dequeue(cfs_rq, se);
>> >>       if (flags & DEQUEUE_SLEEP) {
>> >>  #ifdef CONFIG_SCHEDSTATS
>> >
>> > These make me very nervous, on enqueue you bail after adding
>> > min_vruntime to ->vruntime and calling update_curr(), but on dequeue you
>> > bail before subtracting min_vruntime from ->vruntime.
>> >
>>
>> min_vruntime shouldn't be added in enqueue since unthrottling is
>> treated as a wakeup (which results in placement versus min as opposed
>> to normalization).
>
> Sure, but at least put a comment there, I mean that's a glaring
> asymmetry.
>
>> >> @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct
>> >>                       break;
>> >>               cfs_rq = cfs_rq_of(se);
>> >>               enqueue_entity(cfs_rq, se, flags);
>> >> +             /* don't continue to enqueue if our parent is throttled */
>> >> +             if (cfs_rq_throttled(cfs_rq))
>> >> +                     break;
>> >>               flags = ENQUEUE_WAKEUP;
>> >>       }
>> >>
>> >> @@ -1390,8 +1437,11 @@ static void dequeue_task_fair(struct rq
>> >>               cfs_rq = cfs_rq_of(se);
>> >>               dequeue_entity(cfs_rq, se, flags);
>> >>
>> >> -             /* Don't dequeue parent if it has other entities besides us */
>> >> -             if (cfs_rq->load.weight)
>> >> +             /*
>> >> +              * Don't dequeue parent if it has other entities besides us,
>> >> +              * or if it is throttled
>> >> +              */
>> >> +             if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
>> >>                       break;
>> >>               flags |= DEQUEUE_SLEEP;
>> >>       }
>> >
>> > How could we even be running if our parent was throttled?
>> >
>>
>> It's possible we throttled within the preceding dequeue_entity -- the
>> partial update_curr against cfs_rq might be just enough to push it
>> over the edge.  In which case that entity has already been dequeued
>> and we want to bail out.
>
> right.
>
>>
>> >> @@ -1430,6 +1480,42 @@ static u64 tg_request_cfs_quota(struct t
>> >>       return delta;
>> >>  }
>> >>
>> >> +static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
>> >> +{
>> >> +     struct sched_entity *se;
>> >> +
>> >> +     se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
>> >> +
>> >> +     /* account load preceeding throttle */
>> >> +     update_cfs_load(cfs_rq, 0);
>> >> +
>> >> +     /* prevent previous buddy nominations from re-picking this se */
>> >> +     clear_buddies(cfs_rq_of(se), se);
>> >> +
>> >> +     /*
>> >> +      * It's possible for the current task to block and re-wake before task
>> >> +      * switch, leading to a throttle within enqueue_task->update_curr()
>> >> +      * versus an an entity that has not technically been enqueued yet.
>> >
>> > I'm not quite seeing how this would happen.. care to expand on this?
>> >
>>
>> I'm not sure the example Bharata gave is correct -- I'm going to treat
>> that discussion separately as it's not the intent here.
>>
>> Here the task _is_ running.
>>
>> Specifically:
>>
>> - Suppose the current task on a cfs_rq blocks
>> - Accordingly we issue dequeue against that task (however it remains
>> as curr until the put)
>> - Before we get to the put some other activity (e.g. network bottom
>> half) gets to run and re-wake the task
>> - The time elapsed for this is charged to the task, which might push
>> it over its reservation, it then gets throttled while we're trying to
>> queue it
>>
>> BUT
>>
>> We haven't actually done any of the enqueue work yet so there's
>> nothing to do to take it off rq.  So what we just mark it throttled
>> and make sure that the rest of the enqueue work gets short circuited.
>>
>> The clock_task helps reduce the occurrence of this since the task will
>> be spared the majority of the SI time but it's still possible to push
>> it over.
>
> Ah, uhm, so this is all due to us dropping rq->lock after dequeue,
> right? Would
>
>  https://lkml.org/lkml/2011/1/4/228
>
> help here?
>

Hmm, at a glance it'll still be cfs_rq->curr until we actually issue the put.

I don't think we can skip update_curr in the !p->on_rq case since for
the non-preempt kernel case we should be counting that time; so I
think this case would unfortunately still exist.

>> >> +      * In this case, since we haven't actually done the enqueue yet, cut
>> >> +      * out and allow enqueue_entity() to short-circuit
>> >> +      */
>> >> +     if (!se->on_rq)
>> >> +             goto out_throttled;
>> >> +
>> >> +     for_each_sched_entity(se) {
>> >> +             struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> >> +
>> >> +             dequeue_entity(cfs_rq, se, 1);
>> >> +             if (cfs_rq->load.weight || cfs_rq_throttled(cfs_rq))
>> >> +                     break;
>> >> +     }
>> >> +
>> >> +out_throttled:
>> >> +     cfs_rq->throttled = 1;
>> >> +     update_cfs_rq_load_contribution(cfs_rq, 1);
>> >> +}
>> >> +
>> >>  static void account_cfs_rq_quota(struct cfs_rq *cfs_rq,
>> >>               unsigned long delta_exec)
>> >>  {
>> >> @@ -1438,10 +1524,16 @@ static void account_cfs_rq_quota(struct
>> >>
>> >>       cfs_rq->quota_used += delta_exec;
>> >>
>> >> -     if (cfs_rq->quota_used < cfs_rq->quota_assigned)
>> >> +     if (cfs_rq_throttled(cfs_rq) ||
>> >> +             cfs_rq->quota_used < cfs_rq->quota_assigned)
>> >>               return;
>> >
>> > So we are throttled but running anyway, I suppose this comes from the PI
>> > ceiling muck?
>> >
>>
>> No -- this is just the fact that there are cases where reschedule
>> can't evict the task immediately.
>>
>> e.g. softirq or any kernel time without config_preempt
>>
>> Once we're throttled we know there's no time left or point in trying
>> to acquire it so just short circuit these until we get to a point
>> where this task can be removed from rq.
>
> Right, but like I argued in another email, it could be refreshed on
> another cpu and you now miss it.. :-)
>

Yes, that would be undesirable.  Hmm, synchronizing the check under
rq->lock in do_sched_cfs_period_timer() should cover this.

>> >> +     if (!entity_on_rq(pse))
>> >> +             return;
>> >> +#endif
>> >
>> > Ideally that #ifdef'ery would go away too.
>>
>> This can 100% go away (and is already in the #ifdefs), but it will
>> always be true in the !BANDWIDTH case, so it's a micro-overhead.
>> Accompanying micro-optimization isn't really needed :)
>
> Wouldn't gcc be able to optimize if (!true) stmt; with DCE ?
>

It's p->on_rq, so it's a memory ref; I don't think gcc can reduce
it since the path from ttwu -> check_preempt goes through a function
pointer.

Although if it can handle that bifurcation then I'm super impressed :)
-- I'll check against the disassembly.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-02-16  3:18 ` [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
  2011-02-18  6:52   ` Balbir Singh
  2011-02-23 13:32   ` Peter Zijlstra
@ 2011-03-02  7:23   ` Bharata B Rao
  2011-03-02  8:05     ` Paul Turner
  2 siblings, 1 reply; 71+ messages in thread
From: Bharata B Rao @ 2011-03-02  7:23 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
	Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Nikhil Rao

Hi Paul,

On Tue, Feb 15, 2011 at 07:18:34PM -0800, Paul Turner wrote:
> @@ -1015,6 +1046,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +
> +
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
> +	     !group_cfs_rq(se)->nr_running))
> +		return;
> +#endif
> +
>  	update_cfs_load(cfs_rq, 0);
>  	account_entity_enqueue(cfs_rq, se);
>  	update_cfs_shares(cfs_rq);

> @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct 
>  			break;
>  		cfs_rq = cfs_rq_of(se);
>  		enqueue_entity(cfs_rq, se, flags);
> +		/* don't continue to enqueue if our parent is throttled */
> +		if (cfs_rq_throttled(cfs_rq))
> +			break;

1. This check (in enqueue_task_fair) ensures that if the cfs_rq we just enqueued
the se to is throttled, we bail out from further enqueueing of the hierarchy.

2. In enqueue_entity() we check if the entity we are enqueueing owns a throttled
hierarchy and refuse to enqueue if true. And we silently refuse but continue
with further enqueue attempts.

I see that 1 can happen when a task belonging to a throttled group
wakes up.

Can you please explain the scenario when 2 is needed?

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota
  2011-03-02  7:23   ` Bharata B Rao
@ 2011-03-02  8:05     ` Paul Turner
  0 siblings, 0 replies; 71+ messages in thread
From: Paul Turner @ 2011-03-02  8:05 UTC (permalink / raw
  To: bharata
  Cc: linux-kernel, Balbir Singh, Vaidyanathan Srinivasan,
	Srivatsa Vaddagiri, Kamalesh Babulal, Ingo Molnar, Peter Zijlstra,
	Pavel Emelyanov, Herbert Poetzl, Avi Kivity, Chris Friesen,
	Nikhil Rao

On Tue, Mar 1, 2011 at 11:23 PM, Bharata B Rao
<bharata@linux.vnet.ibm.com> wrote:
> Hi Paul,
>
> On Tue, Feb 15, 2011 at 07:18:34PM -0800, Paul Turner wrote:
>> @@ -1015,6 +1046,14 @@ enqueue_entity(struct cfs_rq *cfs_rq, st
>>        * Update run-time statistics of the 'current'.
>>        */
>>       update_curr(cfs_rq);
>> +
>> +
>> +#ifdef CONFIG_CFS_BANDWIDTH
>> +     if (!entity_is_task(se) && (cfs_rq_throttled(group_cfs_rq(se)) ||
>> +          !group_cfs_rq(se)->nr_running))
>> +             return;
>> +#endif
>> +
>>       update_cfs_load(cfs_rq, 0);
>>       account_entity_enqueue(cfs_rq, se);
>>       update_cfs_shares(cfs_rq);
>
>> @@ -1363,6 +1407,9 @@ enqueue_task_fair(struct rq *rq, struct
>>                       break;
>>               cfs_rq = cfs_rq_of(se);
>>               enqueue_entity(cfs_rq, se, flags);
>> +             /* don't continue to enqueue if our parent is throttled */
>> +             if (cfs_rq_throttled(cfs_rq))
>> +                     break;
>
> 1. This check (in enqueue_task_fair) ensures that if the cfs_rq we just enqueued
> se to is throttled, we bail our from futher enqueueing of the hierarchy.
>
> 2. In enqueue_entity() we check if the entity we are enqueing owns a throttled
> hieararchy and refuse to enqueue if true. And we silently refuse but continue
> with futher enqueue attempts.
>
> I see that 1 can happen when there a task belonging to a throttled group
> wakes up.
>
> Can you pls explain the scenario when 2 is needed ?
>

As I recall this was for entity reweight.

In this case on_rq is cached and it's possible we'll tip over into a
throttled state when we update_curr().  I think with reweight_entity()
this may no longer be required.

In my current v5 stack I've actually reworked the throttling logic to make
update_curr() non-state changing again and thus eliminated the
associated checks and cases.

> Regards,
> Bharata.
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-25 13:06     ` jacob pan
@ 2011-03-08  3:57       ` Balbir Singh
  2011-03-08 18:18         ` Jacob Pan
  2011-03-09 10:12       ` Paul Turner
  1 sibling, 1 reply; 71+ messages in thread
From: Balbir Singh @ 2011-03-08  3:57 UTC (permalink / raw
  To: jacob pan
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Arjan van de Ven, Rafael J. Wysocki, Matt Helsley

* jacob pan <jacob.jun.pan@linux.intel.com> [2011-02-25 05:06:46]:

> On Fri, 25 Feb 2011 02:03:54 -0800
> Paul Turner <pjt@google.com> wrote:
> 
> > On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
> > <jacob.jun.pan@linux.intel.com> wrote:
> > > On Tue, 15 Feb 2011 19:18:31 -0800
> > > Paul Turner <pjt@google.com> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Please find attached v4 of CFS bandwidth control; while this rebase
> > >> against some of the latest SCHED_NORMAL code is new, the features
> > >> and methodology are fairly mature at this point and have proved
> > >> both effective and stable for several workloads.
> > >>
> > >> As always, all comments/feedback welcome.
> > >>
> > >
> > > Hi Paul,
> > >
> > > Your patches provide a very useful but slightly different feature
> > > for what we need to manage idle time in order to save power. What we
> > > need is kind of a quota/period in terms of idle time. I have been
> > > playing with your patches and noticed that when the cgroup cpu usage
> > > exceeds the quota the effect of throttling is similar to what I have
> > > been trying to do with freezer subsystem. i.e. freeze and thaw at
> > > given period and percentage runtime.
> > > https://lkml.org/lkml/2011/2/15/314
> > >
> > > Have you thought about adding such feature (please see detailed
> > > description in the link above) to your patches?
> > >
> > 
> > So reading the description it seems like rooting everything in a
> > 'freezer' container and then setting up a quota of
> > 
> > (1 - frozen_percentage)  * nr_cpus * frozen_period * sec_to_usec
> > 
> I guess you meant frozen_percentage is less than 1, i.e. 90 is .90. my
> code treat 90 as 90. just a clarification.
> > on a period of
> > 
> > frozen_period * sec_to_usec
> > 
> > Would provide the same functionality.  Is there other unduplicated
> > functionality beyond this?
> Do you mean the same functionality as your patch? Not really, since my
> approach will stop the tasks based on hard time slices. But seems your
> patch will allow them to run if they don't exceed the quota. Am i
> missing something?
> That is the only functionality difference i know.
> 
> Like the reviewer of freezer patch pointed out, it is a more logical
> fit to implement such feature in scheduler/yours in stead of freezer. So
> i am wondering if your patch can be expended to include limiting quota
> on real time.
>

Do you mean the sched RT group controller? Have you looked at
cpu.rt_runtime_us and cpu.rt_period_us?
 
> I did a comparison study between CFS BW and freezer patch on skype with
> identical quota setting as you pointed out earlier. Both use 2 sec
> period and .2 sec quota (10%). Skype typically uses 5% of the CPU on my
> system when placing a call(below cfs quota) and it wakes up every 100ms
> to do some quick checks. Then I run skype in cpu then freezer cgroup
> (with all its children). Here is my result based on timechart and
> powertop.
> 
> patch name	wakeups		skype call?
> ------------------------------------------------------------------
> CFS BW		10/sec		yes
> freezer		1/sec		no
>

Is this good or bad for CFS BW?

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-03-08  3:57       ` Balbir Singh
@ 2011-03-08 18:18         ` Jacob Pan
  0 siblings, 0 replies; 71+ messages in thread
From: Jacob Pan @ 2011-03-08 18:18 UTC (permalink / raw
  To: balbir
  Cc: Paul Turner, linux-kernel, Bharata B Rao, Dhaval Giani,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Arjan van de Ven, Rafael J. Wysocki, Matt Helsley

On Tue, 8 Mar 2011 09:27:59 +0530 Balbir Singh wrote:
>* jacob pan <jacob.jun.pan@linux.intel.com> [2011-02-25 05:06:46]:
>
>> On Fri, 25 Feb 2011 02:03:54 -0800
>> Paul Turner <pjt@google.com> wrote:
>> 
>> > On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
>> > <jacob.jun.pan@linux.intel.com> wrote:
>> > > On Tue, 15 Feb 2011 19:18:31 -0800
>> > > Paul Turner <pjt@google.com> wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> Please find attached v4 of CFS bandwidth control; while this rebase
>> > >> against some of the latest SCHED_NORMAL code is new, the features
>> > >> and methodology are fairly mature at this point and have proved
>> > >> both effective and stable for several workloads.
>> > >>
>> > >> As always, all comments/feedback welcome.
>> > >>
>> > >
>> > > Hi Paul,
>> > >
>> > > Your patches provide a very useful but slightly different feature
>> > > for what we need to manage idle time in order to save power. What we
>> > > need is kind of a quota/period in terms of idle time. I have been
>> > > playing with your patches and noticed that when the cgroup cpu usage
>> > > exceeds the quota the effect of throttling is similar to what I have
>> > > been trying to do with freezer subsystem. i.e. freeze and thaw at
>> > > given period and percentage runtime.
>> > > https://lkml.org/lkml/2011/2/15/314
>> > >
>> > > Have you thought about adding such feature (please see detailed
>> > > description in the link above) to your patches?
>> > >
>> > 
>> > So reading the description it seems like rooting everything in a
>> > 'freezer' container and then setting up a quota of
>> > 
>> > (1 - frozen_percentage)  * nr_cpus * frozen_period * sec_to_usec
>> > 
>> I guess you meant frozen_percentage is less than 1, i.e. 90 is .90. my
>> code treat 90 as 90. just a clarification.
>> > on a period of
>> > 
>> > frozen_period * sec_to_usec
>> > 
>> > Would provide the same functionality.  Is there other unduplicated
>> > functionality beyond this?
>> Do you mean the same functionality as your patch? Not really, since my
>> approach will stop the tasks based on hard time slices. But seems your
>> patch will allow them to run if they don't exceed the quota. Am i
>> missing something?
>> That is the only functionality difference i know.
>> 
>> Like the reviewer of freezer patch pointed out, it is a more logical
>> fit to implement such feature in scheduler/yours in stead of freezer. So
>> i am wondering if your patch can be expended to include limiting quota
>> on real time.
>>
>
>Do you mean the sched RT group controller? Have you looked at
>cpu.rt_runtime_us and cpu.rt_period_us?
> 
>> I did a comparison study between CFS BW and freezer patch on skype with
>> identical quota setting as you pointed out earlier. Both use 2 sec
>> period and .2 sec quota (10%). Skype typically uses 5% of the CPU on my
>> system when placing a call(below cfs quota) and it wakes up every 100ms
>> to do some quick checks. Then I run skype in cpu then freezer cgroup
>> (with all its children). Here is my result based on timechart and
>> powertop.
>> 
>> patch name	wakeups		skype call?
>> ------------------------------------------------------------------
>> CFS BW		10/sec		yes
>> freezer		1/sec		no
>>
>
>Is this good or bad for CFS BW?
In terms of power saving for this particular use case, it is bad for
CFS BW, since I am trying to use cgroups to manage applications that
are not written with power saving in mind. CFS BW does not prevent
unnecessary wake-ups from these apps, so the system consumes more
power than it does with the freezer duty-cycling patch.
In my use case, as soon as skype is switched to the UI foreground, it
will be moved to another cgroup where enough quota will be given to
allow it to place calls. Therefore, not being able to make calls while
being throttled is not a concern.
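
For concreteness, the settings used in the comparison above map onto the
CFS BW interface roughly as follows -- a sketch assuming the cpu controller
is mounted at /sys/fs/cgroup/cpu and the group is called "background"; the
mount point, group name and PID are illustrative only:

import os

# Sketch: the 10% setting from the comparison (2s period, 0.2s quota),
# plus moving one task into the throttled group.
CG = "/sys/fs/cgroup/cpu/background"

def write(name, value):
    with open(os.path.join(CG, name), "w") as f:
        f.write(str(value) + "\n")

if not os.path.isdir(CG):
    os.mkdir(CG)
write("cpu.cfs_period_us", 2000000)  # 2 second period
write("cpu.cfs_quota_us", 200000)    # 0.2s of cpu per period (10%)
write("tasks", 12345)                # placeholder PID of the background app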

Mobile devices often have just one app in the foreground, so
throttling background apps may not impact the user experience but can
still save power.

Since the CFS BW patch already has the period and quota concepts for
bandwidth control, I am asking whether it is worth extending it to have
an idle-time quota, perhaps by adding another parameter that limits idle
time in parallel to cfs_quota.

Rafael (CCed) wants to get an opinion from the scheduler folks before
considering the freezer patch.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-02-25 13:06     ` jacob pan
  2011-03-08  3:57       ` Balbir Singh
@ 2011-03-09 10:12       ` Paul Turner
  2011-03-09 21:57         ` jacob pan
  1 sibling, 1 reply; 71+ messages in thread
From: Paul Turner @ 2011-03-09 10:12 UTC (permalink / raw
  To: jacob pan
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Arjan van de Ven, Rafael J. Wysocki, Matt Helsley

On Fri, Feb 25, 2011 at 5:06 AM, jacob pan
<jacob.jun.pan@linux.intel.com> wrote:
> On Fri, 25 Feb 2011 02:03:54 -0800
> Paul Turner <pjt@google.com> wrote:
>
>> On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
>> <jacob.jun.pan@linux.intel.com> wrote:
>> > On Tue, 15 Feb 2011 19:18:31 -0800
>> > Paul Turner <pjt@google.com> wrote:
>> >
>> >> Hi all,
>> >>
>> >> Please find attached v4 of CFS bandwidth control; while this rebase
>> >> against some of the latest SCHED_NORMAL code is new, the features
>> >> and methodology are fairly mature at this point and have proved
>> >> both effective and stable for several workloads.
>> >>
>> >> As always, all comments/feedback welcome.
>> >>
>> >
>> > Hi Paul,
>> >
>> > Your patches provide a very useful but slightly different feature
>> > for what we need to manage idle time in order to save power. What we
>> > need is kind of a quota/period in terms of idle time. I have been
>> > playing with your patches and noticed that when the cgroup cpu usage
>> > exceeds the quota the effect of throttling is similar to what I have
>> > been trying to do with freezer subsystem. i.e. freeze and thaw at
>> > given period and percentage runtime.
>> > https://lkml.org/lkml/2011/2/15/314
>> >
>> > Have you thought about adding such feature (please see detailed
>> > description in the link above) to your patches?
>> >
>>
>> So reading the description it seems like rooting everything in a
>> 'freezer' container and then setting up a quota of
>>
>> (1 - frozen_percentage)  * nr_cpus * frozen_period * sec_to_usec
>>
> I guess you meant frozen_percentage is less than 1, i.e. 90 is .90. my
> code treat 90 as 90. just a clarification.
>> on a period of
>>
>> frozen_period * sec_to_usec
>>
>> Would provide the same functionality.  Is there other unduplicated
>> functionality beyond this?

Sorry -- I was out last week; comments inline.

> Do you mean the same functionality as your patch? Not really, since my
> approach will stop the tasks based on hard time slices
>. But seems your
> patch will allow them to run if they don't exceed the quota. Am i
> missing something?

Right, this is what was discussed above.
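
For concreteness, the quota/period mapping quoted above works out as follows.
This is only a sketch: it writes the frozen percentage in Jacob's "90 means
90%" form (equivalent to the (1 - frozen_percentage) fraction form quoted
above), and the 4-CPU / 5-second numbers are picked purely for illustration:

# Sketch: freezer-style parameters -> CFS BW period/quota, illustrative numbers.
sec_to_usec    = 1000 * 1000
frozen_percent = 90        # frozen 90% of the time (fraction form: 0.90)
frozen_period  = 5         # seconds, matching the 5-second diagram quoted below
nr_cpus        = 4         # illustrative

cfs_period_us = frozen_period * sec_to_usec
cfs_quota_us  = (100 - frozen_percent) * nr_cpus * frozen_period * sec_to_usec // 100

print(cfs_period_us, cfs_quota_us)   # 5000000 2000000: 0.5s per cpu every 5s
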

> That is the only functionality difference i know.
>
> Like the reviewer of freezer patch pointed out, it is a more logical
> fit to implement such feature in scheduler/yours in stead of freezer. So
> i am wondering if your patch can be expended to include limiting quota
> on real time.

The following two configurations should effectively mirror the freezer
behavior exactly, without modification.

A) A background while(1) thread on each cpu within the cgroup
This will result in synchronous consumption / exhaustion of quota in a
manner that duplicates the periodic freezing.

Given the goal is power-saving, this is obviously non-ideal.  However:

B) A userspace daemon toggles quota at the desired interval

Supposing you wanted the group frozen for 100ms out of every second:
having a daemon wake up 900ms into the interval and set a quota amount
that is effectively zero will "freeze" the group.  Said daemon can then
release things by returning the group to an infinite quota 100ms later,
and then sleeping for another 900ms.

Is there particular advantage of doing this in-kernel?
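
A minimal sketch of option B, purely for illustration -- assuming the
group's cpu cgroup directory is /sys/fs/cgroup/cpu/bg, that writing -1 to
cpu.cfs_quota_us removes the limit, and that a 1000us quota is close enough
to "effectively zero"; all three are assumptions, not something taken from
the patch set:

import time

# Sketch: freeze a group for 100ms out of every second by toggling its quota.
CG = "/sys/fs/cgroup/cpu/bg"

def set_quota(us):
    with open(CG + "/cpu.cfs_quota_us", "w") as f:
        f.write(str(us) + "\n")

while True:
    set_quota(-1)      # assumed to mean "no limit": group runs freely
    time.sleep(0.9)    # 900ms of the 1s interval
    set_quota(1000)    # effectively zero quota: group is throttled ("frozen")
    time.sleep(0.1)    # 100ms frozen, then repeat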


>
> I did a comparison study between CFS BW and freezer patch on skype with
> identical quota setting as you pointed out earlier. Both use 2 sec
> period and .2 sec quota (10%). Skype typically uses 5% of the CPU on my
> system when placing a call(below cfs quota) and it wakes up every 100ms
> to do some quick checks. Then I run skype in cpu then freezer cgroup
> (with all its children). Here is my result based on timechart and
> powertop.
>
> patch name      wakeups         skype call?
> ------------------------------------------------------------------
> CFS BW          10/sec          yes
> freezer         1/sec           no
>

Is this a true saving?  While the actual task wake-up has been hidden,
the cpu is still coming out of a halt/idle state and processing the
interrupt/etc.

Have you had the chance to measure the actual comparative power-usage
in this case?

> Skype might not be the best example to illustrate the real usage of the
> feature, but we are targeting mobile device where they are mostly off or
> often have only one application allowed in foreground. So we want to
> reduce wakeups coming from the tasks that are not in the foreground.
>

If reducing wake-ups (at the userspace level) is proven to deliver
performance improvements, then it might be more productive to approach
that directly by considering strategies such as batching wakeups and
processing them periodically.

This would not have the negative performance impact of the current
approach, as well as being more deterministic.
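
This is outside the scope of the patch set, but as a generic illustration of
that idea: rather than every worker arming its own short timer, pending work
can be funnelled through one coarse periodic wakeup -- a sketch only, not
tied to any particular application:

import queue, threading, time

# Sketch: coalesce many per-task periodic polls into one periodic drain.
work = queue.Queue()           # producers enqueue callables instead of polling themselves

def batcher(period_s=1.0):
    while True:
        time.sleep(period_s)   # one wakeup per period for the whole process
        while True:
            try:
                fn = work.get_nowait()
            except queue.Empty:
                break
            fn()               # run everything that accumulated this period

threading.Thread(target=batcher, daemon=True).start()
work.put(lambda: print("quick check"))   # e.g. what used to be a 100ms poll
time.sleep(2)                            # let the batcher run once before exiting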

>> One thing that does seem undesirable about your approach is (as it
>> seems to be described) threads will not be able to take advantage of
>> naturally occurring idle cycles and will incur a potential performance
>> penalty even at use << frozen_percentage.
>>
>> e.g. From your post
>>
>>        |  |<-- 90% frozen -     ->|  |
>> |  | ____|  |________________x_|  |__________________|  |_____
>>
>>         |<---- 5 seconds     ---->|
>>
>>
>> Suppose no threads active until the wake up at x, suppose there is an
>> accompanying 1 second of work for that thread to do.  That execution
>> time will be dilated to ~1.5 seconds (as it will span the 0.5 seconds
>> the freezer will stall for).  But the true usage for this period is
>> ~20% <<< 90%
> I agree my approach does not consider the natural cycle. But I am not
> sure if a thread can wake up at x when FROZEN.
>

While the ascii is a little mailer-mangled, in the diagram above x was
intended to precede the "frozen" time segment, but at a point where
the work it wants to do exceeds the time-before-freeze resulting in
dilation of execution and a performance regression.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [CFS Bandwidth Control v4 0/7] Introduction
  2011-03-09 10:12       ` Paul Turner
@ 2011-03-09 21:57         ` jacob pan
  0 siblings, 0 replies; 71+ messages in thread
From: jacob pan @ 2011-03-09 21:57 UTC (permalink / raw
  To: Paul Turner
  Cc: linux-kernel, Bharata B Rao, Dhaval Giani, Balbir Singh,
	Vaidyanathan Srinivasan, Gautham R Shenoy, Srivatsa Vaddagiri,
	Arjan van de Ven, Rafael J. Wysocki, Matt Helsley

On Wed, 9 Mar 2011 02:12:36 -0800
Paul Turner <pjt@google.com> wrote:

> On Fri, Feb 25, 2011 at 5:06 AM, jacob pan
> <jacob.jun.pan@linux.intel.com> wrote:
> > On Fri, 25 Feb 2011 02:03:54 -0800
> > Paul Turner <pjt@google.com> wrote:
> >
> >> On Thu, Feb 24, 2011 at 4:11 PM, jacob pan
> >> <jacob.jun.pan@linux.intel.com> wrote:
> >> > On Tue, 15 Feb 2011 19:18:31 -0800
> >> > Paul Turner <pjt@google.com> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >> Please find attached v4 of CFS bandwidth control; while this
> >> >> rebase against some of the latest SCHED_NORMAL code is new, the
> >> >> features and methodology are fairly mature at this point and
> >> >> have proved both effective and stable for several workloads.
> >> >>
> >> >> As always, all comments/feedback welcome.
> >> >>
> >> >
> >> > Hi Paul,
> >> >
> >> > Your patches provide a very useful but slightly different feature
> >> > for what we need to manage idle time in order to save power.
> >> > What we need is kind of a quota/period in terms of idle time. I
> >> > have been playing with your patches and noticed that when the
> >> > cgroup cpu usage exceeds the quota the effect of throttling is
> >> > similar to what I have been trying to do with freezer subsystem.
> >> > i.e. freeze and thaw at given period and percentage runtime.
> >> > https://lkml.org/lkml/2011/2/15/314
> >> >
> >> > Have you thought about adding such feature (please see detailed
> >> > description in the link above) to your patches?
> >> >
> >>
> >> So reading the description it seems like rooting everything in a
> >> 'freezer' container and then setting up a quota of
> >>
> >> (1 - frozen_percentage)  * nr_cpus * frozen_period * sec_to_usec
> >>
> > I guess you meant frozen_percentage is less than 1, i.e. 90 is .90.
> > my code treat 90 as 90. just a clarification.
> >> on a period of
> >>
> >> frozen_period * sec_to_usec
> >>
> >> Would provide the same functionality.  Is there other unduplicated
> >> functionality beyond this?
> 
> Sorry -- I was out last week; comments inline.
> 
> > Do you mean the same functionality as your patch? Not really, since
> > my approach will stop the tasks based on hard time slices
> >. But seems your
> > patch will allow them to run if they don't exceed the quota. Am i
> > missing something?
> 
> Right, this is what was discussed above.
> 
> > That is the only functionality difference i know.
> >
> > Like the reviewer of freezer patch pointed out, it is a more logical
> > fit to implement such feature in scheduler/yours in stead of
> > freezer. So i am wondering if your patch can be expended to include
> > limiting quota on real time.
> 
> The following two configurations should effectively mirror the freezer
> behavior exactly, without modification.
>
> A) A background while(1) thread on each cpu within the cgroup
> This will result in synchronous consumption / exhaustion of quota in a
> manner that duplicates the periodic freezing.
> 
> Given the goal is power-saving, this is obviously non-ideal.  However:
> 
> B) A userspace daemon toggles quota at the desired interval
> 
> Supposing you wanted the group frozen for 100ms out of every second:
> having a daemon wake up 900ms into the interval and set a quota amount
> that is effectively zero will "freeze" the group.  Said daemon can then
> release things by returning the group to an infinite quota 100ms later,
> and then sleeping for another 900ms.
> 
> Is there particular advantage of doing this in-kernel?
> 
Yes, option B will mirror the behavior of the freezer patch. My concern
is that doing this in user space will be less efficient than doing it
in the kernel. For each period, the user daemon has to wake up twice to
adjust the quota. I guess if the idle-time quota check were done in the
kernel, those extra wake-ups might not be needed?
I do plan to have multiple cgroups with different periods and runtime
quotas, so the wake-ups will add up.

> 
> >
> > I did a comparison study between CFS BW and freezer patch on skype
> > with identical quota setting as you pointed out earlier. Both use 2
> > sec period and .2 sec quota (10%). Skype typically uses 5% of the
> > CPU on my system when placing a call(below cfs quota) and it wakes
> > up every 100ms to do some quick checks. Then I run skype in cpu
> > then freezer cgroup (with all its children). Here is my result
> > based on timechart and powertop.
> >
> > patch name      wakeups         skype call?
> > ------------------------------------------------------------------
> > CFS BW          10/sec          yes
> > freezer         1/sec           no
> >
> 
> Is this a true saving?  While the actual task wake-up has been hidden,
> the cpu is still coming out of a halt/idle state and processing the
> interrupt/etc.
> 
I think it is a true power saving: since wake-ups from CPU C-states
result from either timer or device IRQs, freezing the process directly
reduces timer IRQs.
> Have you had the chance to measure the actual comparative power-usage
> in this case?
> 
I have yet to do such a study, but it is in my plan.

> > Skype might not be the best example to illustrate the real usage of
> > the feature, but we are targeting mobile device where they are
> > mostly off or often have only one application allowed in
> > foreground. So we want to reduce wakeups coming from the tasks that
> > are not in the foreground.
> >
> 
> If reducing wake-ups (at the userspace level) is proven to deliver
> performance improvements, then it might be more productive to approach
> that directly by considering strategies such as batching wakeups and
> processing them periodically.
> 
> This would not have the negative performance impact of the current
> approach, as well as being more deterministic.
> 
> >> One thing that does seem undesirable about your approach is (as it
> >> seems to be described) threads will not be able to take advantage
> >> of naturally occurring idle cycles and will incur a potential
> >> performance penalty even at use << frozen_percentage.
> >>
> >> e.g. From your post
> >>
> >>        |  |<-- 90% frozen -     ->|  |
> >> |  | ____|  |________________x_|  |__________________|  |_____
> >>
> >>         |<---- 5 seconds     ---->|
> >>
> >>
> >> Suppose no threads active until the wake up at x, suppose there is
> >> an accompanying 1 second of work for that thread to do.  That
> >> execution time will be dilated to ~1.5 seconds (as it will span
> >> the 0.5 seconds the freezer will stall for).  But the true usage
> >> for this period is ~20% <<< 90%
> > I agree my approach does not consider the natural cycle. But I am
> > not sure if a thread can wake up at x when FROZEN.
> >
> 
> While the ascii is a little mailer-mangled, in the diagram above x was
> intended to precede the "frozen" time segment, but at a point where
> the work it wants to do exceeds the time-before-freeze resulting in
> dilation of execution and a performance regression.
Thanks for explaining again.

Jacob

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2011-03-09 21:57 UTC | newest]

Thread overview: 71+ messages
2011-02-16  3:18 [CFS Bandwidth Control v4 0/7] Introduction Paul Turner
2011-02-16  3:18 ` [CFS Bandwidth Control v4 1/7] sched: introduce primitives to account for CFS bandwidth tracking Paul Turner
2011-02-16 16:52   ` Balbir Singh
2011-02-17  2:54     ` Bharata B Rao
2011-02-23 13:32   ` Peter Zijlstra
2011-02-25  3:11     ` Paul Turner
2011-02-25 20:53     ` Paul Turner
2011-02-16  3:18 ` [CFS Bandwidth Control v4 2/7] sched: accumulate per-cfs_rq cpu usage Paul Turner
2011-02-16 17:45   ` Balbir Singh
2011-02-23 13:32   ` Peter Zijlstra
2011-02-25  3:33     ` Paul Turner
2011-02-25 12:31       ` Peter Zijlstra
2011-02-16  3:18 ` [CFS Bandwidth Control v4 3/7] sched: throttle cfs_rq entities which exceed their local quota Paul Turner
2011-02-18  6:52   ` Balbir Singh
2011-02-23 13:32   ` Peter Zijlstra
2011-02-24  5:21     ` Bharata B Rao
2011-02-24 11:05       ` Peter Zijlstra
2011-02-24 15:45         ` Bharata B Rao
2011-02-24 15:52           ` Peter Zijlstra
2011-02-24 16:39             ` Bharata B Rao
2011-02-24 17:20               ` Peter Zijlstra
2011-02-25  3:59                 ` Paul Turner
2011-02-25  3:41         ` Paul Turner
2011-02-25  3:10     ` Paul Turner
2011-02-25 13:58       ` Bharata B Rao
2011-02-25 20:51         ` Paul Turner
2011-02-28  3:50           ` Bharata B Rao
2011-02-28  6:38             ` Paul Turner
2011-02-28 13:48       ` Peter Zijlstra
2011-03-01  8:31         ` Paul Turner
2011-03-02  7:23   ` Bharata B Rao
2011-03-02  8:05     ` Paul Turner
2011-02-16  3:18 ` [CFS Bandwidth Control v4 4/7] sched: unthrottle cfs_rq(s) who ran out of quota at period refresh Paul Turner
2011-02-18  7:19   ` Balbir Singh
2011-02-18  8:10     ` Bharata B Rao
2011-02-23 12:23   ` Peter Zijlstra
2011-02-23 13:32   ` Peter Zijlstra
2011-02-24  7:04     ` Bharata B Rao
2011-02-24 11:14       ` Peter Zijlstra
2011-02-26  0:02     ` Paul Turner
2011-02-16  3:18 ` [CFS Bandwidth Control v4 5/7] sched: add exports tracking cfs bandwidth control statistics Paul Turner
2011-02-22  3:14   ` Balbir Singh
2011-02-22  4:13     ` Bharata B Rao
2011-02-22  4:40       ` Balbir Singh
2011-02-23  8:03         ` Paul Turner
2011-02-23 10:13           ` Balbir Singh
2011-02-23 13:32   ` Peter Zijlstra
2011-02-25  3:26     ` Paul Turner
2011-02-25  8:54       ` Peter Zijlstra
2011-02-16  3:18 ` [CFS Bandwidth Control v4 6/7] sched: hierarchical task accounting for SCHED_OTHER Paul Turner
2011-02-22  3:17   ` Balbir Singh
2011-02-23  8:05     ` Paul Turner
2011-02-23  2:02   ` Hidetoshi Seto
2011-02-23  2:20     ` Paul Turner
2011-02-23  2:43     ` Balbir Singh
2011-02-23 13:32   ` Peter Zijlstra
2011-02-25  3:25     ` Paul Turner
2011-02-25 12:17       ` Peter Zijlstra
2011-02-16  3:18 ` [CFS Bandwidth Control v4 7/7] sched: add documentation for bandwidth control Paul Turner
2011-02-21  2:47 ` [CFS Bandwidth Control v4 0/7] Introduction Xiao Guangrong
2011-02-22 10:28   ` Bharata B Rao
2011-02-23  7:42   ` Paul Turner
2011-02-23  7:51     ` Balbir Singh
2011-02-23  7:56       ` Paul Turner
2011-02-23  8:31         ` Bharata B Rao
     [not found] ` <20110224161111.7d83a884@jacob-laptop>
2011-02-25 10:03   ` Paul Turner
2011-02-25 13:06     ` jacob pan
2011-03-08  3:57       ` Balbir Singh
2011-03-08 18:18         ` Jacob Pan
2011-03-09 10:12       ` Paul Turner
2011-03-09 21:57         ` jacob pan
