* [sched] 36932a9e2b1: +121.2% boot-meminfo.SUnreclaim
@ 2014-08-05 21:29 ` Fengguang Wu
  0 siblings, 0 replies; 4+ messages in thread
From: Fengguang Wu @ 2014-08-05 21:29 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Dave Hansen, LKML, lkp

Hi Peter,

FYI, we noticed the below changes on

git://internal_merge_and_test_tree devel-athens-alpha-201408042000
commit 36932a9e2b1043506d06414fa988cd5b9592841f ("sched: Fix finish_task_switch vs prev_state")

test case: vm-vp-1G/boot/1

11b4855f1d9f1cc  36932a9e2b1043506d06414fa 
---------------  ------------------------- 
      8524 ± 1%    +121.2%      18858 ± 0%  TOTAL boot-meminfo.SUnreclaim
      1321 ± 9%   +3196.6%      43568 ± 0%  TOTAL boot-meminfo.KernelStack
      8756 ± 0%     +29.8%      11363 ± 0%  TOTAL boot-slabinfo.num_pages
     34982 ± 0%     +29.5%      45311 ± 0%  TOTAL boot-meminfo.Slab
    535601 ± 0%      -9.9%     482375 ± 0%  TOTAL boot-meminfo.MemFree
    116621 ± 0%      +9.4%     127548 ± 0%  TOTAL boot-slabinfo.num_objs

Legend:
	±XX%    - standard deviation as a percentage of the mean
	[+-]XX% - change as a percentage of the base commit's value
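
For example, the change in the first row is just (new - old) / old,
relative to the base commit:

	(18858 - 8524) / 8524 * 100% ≈ +121.2%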

Thanks,
Fengguang


* Re: [sched] 36932a9e2b1: +121.2% boot-meminfo.SUnreclaim
  2014-08-05 21:29 ` Fengguang Wu
@ 2014-08-06  8:21   ` Peter Zijlstra
  -1 siblings, 0 replies; 4+ messages in thread
From: Peter Zijlstra @ 2014-08-06  8:21 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Dave Hansen, LKML, lkp, Oleg Nesterov

On Wed, Aug 06, 2014 at 05:29:48AM +0800, Fengguang Wu wrote:
> Hi Peter,
> 
> FYI, we noticed the below changes on
> 
> git://internal_merge_and_test_tree devel-athens-alpha-201408042000
> commit 36932a9e2b1043506d06414fa988cd5b9592841f ("sched: Fix finish_task_switch vs prev_state")
> 
> test case: vm-vp-1G/boot/1
> 
> 11b4855f1d9f1cc  36932a9e2b1043506d06414fa 
> ---------------  ------------------------- 
>       8524 ± 1%    +121.2%      18858 ± 0%  TOTAL boot-meminfo.SUnreclaim
>       1321 ± 9%   +3196.6%      43568 ± 0%  TOTAL boot-meminfo.KernelStack
>       8756 ± 0%     +29.8%      11363 ± 0%  TOTAL boot-slabinfo.num_pages
>      34982 ± 0%     +29.5%      45311 ± 0%  TOTAL boot-meminfo.Slab
>     535601 ± 0%      -9.9%     482375 ± 0%  TOTAL boot-meminfo.MemFree
>     116621 ± 0%      +9.4%     127548 ± 0%  TOTAL boot-slabinfo.num_objs


*blink*.. how does that happen?

For reference, that's the below patch (I've since given it an actual
Changelog, but the patch is the same).

---
Subject: sched: Fix finish_task_switch vs prev_state
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue, 29 Jul 2014 11:22:37 +0200

Oleg wondered about the prev_state comment in finish_task_switch().

Aside from it being confusingly worded -- neither of us initially
understood the actual problem being described -- we found that we'd
broken the described scenario.

Ever since commit e4a52bcb9a18 ("sched: Remove rq->lock from the first
half of ttwu()") we don't actually acquire rq->lock on wakeups anymore
and therefore holding rq->lock after the switch is no good.

Even if we did, __ARCH_WANT_UNLOCKED_CTXSW was already broken, because
it explicitly didn't hold the rq->lock anymore.

We could fix things by placing a full barrier between the prev->state
read and the prev->on_cpu = 0 store, seeing how the remote wakeup code
waits for ->on_cpu to become false before proceeding with the wakeup
(so as to avoid having the task running on two CPUs at the same time);
however, full barriers are expensive.
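
As an illustration only -- this is neither kernel code nor the approach
the patch below takes -- here is a minimal userspace model of that
rejected full-barrier ordering, built with C11 atomics and pthreads.
The thread names, the globals, and the TASK_* values are stand-ins:

/*
 * Standalone model of the rejected "full barrier" variant -- NOT kernel
 * code.  finisher() stands in for finish_task_switch() on the old CPU;
 * waker() stands in for a remote try_to_wake_up() spinning on ->on_cpu.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define TASK_RUNNING	0L
#define TASK_DEAD	64L	/* illustrative value */

static atomic_long task_state  = TASK_RUNNING;
static atomic_int  task_on_cpu = 1;	/* old CPU still references the task */

static void *finisher(void *arg)
{
	long prev_state;

	(void)arg;

	/* Rejected variant: read the state late, after the switch... */
	prev_state = atomic_load_explicit(&task_state, memory_order_relaxed);

	/* ...and pay for a full barrier so the read cannot pass the store. */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&task_on_cpu, 0, memory_order_relaxed);

	/*
	 * The fence guarantees this sees the state as it was before the
	 * wakeup could proceed, so the reference is dropped at most once.
	 */
	if (prev_state == TASK_DEAD)
		printf("old CPU drops the final task reference\n");
	return NULL;
}

static void *waker(void *arg)
{
	(void)arg;

	/* Remote wakeup: wait until the old CPU is done with the task. */
	while (atomic_load_explicit(&task_on_cpu, memory_order_acquire))
		;
	/* Only now may the task run -- and possibly die -- elsewhere. */
	atomic_store_explicit(&task_state, TASK_DEAD, memory_order_relaxed);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, finisher, NULL);
	pthread_create(&b, NULL, waker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}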

Instead we read prev->state before the context switch and propagate it
into finish_task_switch(). This trivially solves the problem without
adding extra (and costly) memory barriers.

I'm not sure why we've never seen crashes due to this; it appears
entirely possible.

Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Cc: Ingo Molnar <mingo@kernel.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140729092237.GU12054@laptop.lan
---
 kernel/sched/core.c |   36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2190,6 +2190,7 @@ prepare_task_switch(struct rq *rq, struc
  * finish_task_switch - clean up after a task-switch
  * @rq: runqueue associated with task-switch
  * @prev: the thread we just switched away from.
+ * @prev_state: the state of @prev before we switched away from it.
  *
  * finish_task_switch must be called after the context switch, paired
  * with a prepare_task_switch call before the context switch.
@@ -2201,26 +2202,14 @@ prepare_task_switch(struct rq *rq, struc
  * with the lock held can cause deadlocks; see schedule() for
  * details.)
  */
-static void finish_task_switch(struct rq *rq, struct task_struct *prev)
+static void
+finish_task_switch(struct rq *rq, struct task_struct *prev, long prev_state)
 	__releases(rq->lock)
 {
 	struct mm_struct *mm = rq->prev_mm;
-	long prev_state;
 
 	rq->prev_mm = NULL;
 
-	/*
-	 * A task struct has one reference for the use as "current".
-	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
-	 * schedule one last time. The schedule call will never return, and
-	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
-	 */
-	prev_state = prev->state;
 	vtime_task_switch(prev);
 	finish_arch_switch(prev);
 	perf_event_task_sched_in(prev, current);
@@ -2279,7 +2268,7 @@ asmlinkage __visible void schedule_tail(
 {
 	struct rq *rq = this_rq();
 
-	finish_task_switch(rq, prev);
+	finish_task_switch(rq, prev, 0);
 
 	/*
 	 * FIXME: do we need to worry about rq being invalidated by the
@@ -2304,6 +2293,21 @@ context_switch(struct rq *rq, struct tas
 	       struct task_struct *next)
 {
 	struct mm_struct *mm, *oldmm;
+	/*
+	 * A task struct has one reference for the use as "current".
+	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
+	 * schedule one last time. The schedule call will never return, and
+	 * the scheduled task must drop that reference.
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could race with its RUNNING -> DEAD
+	 * transition, and then the reference would be dropped twice.
+	 *
+	 * We avoid the race by observing prev->state while it is still
+	 * current.
+	 */
+	long prev_state = prev->state;
 
 	prepare_task_switch(rq, prev, next);
 
@@ -2347,7 +2351,7 @@ context_switch(struct rq *rq, struct tas
 	 * CPUs since it called schedule(), thus the 'rq' on its stack
 	 * frame will be invalid.
 	 */
-	finish_task_switch(this_rq(), prev);
+	finish_task_switch(this_rq(), prev, prev_state);
 }
 
 /*

