Date: Thu, 8 May 2008 14:01:30 +0200
From: Ingo Molnar
To: Linus Torvalds
Cc: "Zhang, Yanmin", Andi Kleen, Matthew Wilcox, LKML, Alexander Viro,
	Andrew Morton, Thomas Gleixner, "H. Peter Anvin"
Subject: [patch] speed up / fix the new generic semaphore code (fix AIM7 40% regression with 2.6.26-rc1)
Message-ID: <20080508120130.GA2860@elte.hu>
References: <1210131712.3453.43.camel@ymzhang>
	<87lk2mbcqp.fsf@basil.nowhere.org>
	<20080507114643.GR19219@parisc-linux.org>
	<87hcdab8zp.fsf@basil.nowhere.org>
	<1210214696.3453.87.camel@ymzhang>
	<1210219729.3453.97.camel@ymzhang>

* Linus Torvalds wrote:

> > > That said, "idle 0%" is easy when you spin. Do you also have
> > > actual performance numbers?
> >
> > Yes. My conclusion is based on the actual number. cpu idle 0% is
> > just a behavior it should be.
>
> Thanks, that's all I wanted to verify.
>
> I'll leave this overnight, and see if somebody has come up with some
> smart and wonderful patch. And if not, I think I'll apply mine as
> "known to fix a regression", and we can perhaps then improve on things
> further from there.

hey, i happen to have such a smart and wonderful patch =B-)

i reproduced the AIM7 workload and can confirm Yanmin's findings that
.26-rc1 regresses over .25 - by over 67% here.

Looking at the workload i found and fixed what i believe to be the real
bug causing the AIM7 regression: inefficient wakeup / scheduling /
locking behavior of the new generic semaphore code, causing suboptimal
performance.

The problem comes from the following code. The new semaphore code does
this on down():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(sem->count > 0))
                sem->count--;
        else
                __down(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

and this on up():

        spin_lock_irqsave(&sem->lock, flags);
        if (likely(list_empty(&sem->wait_list)))
                sem->count++;
        else
                __up(sem);
        spin_unlock_irqrestore(&sem->lock, flags);

where __up() does:

        list_del(&waiter->list);
        waiter->up = 1;
        wake_up_process(waiter->task);

and where __down() does this in essence:

        list_add_tail(&waiter.list, &sem->wait_list);
        waiter.task = task;
        waiter.up = 0;
        for (;;) {
                [...]
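                /*
                 * Note: the waiter is queued once, before this loop;
                 * the wakeup side hands us the semaphore by setting
                 * waiter.up - sem->count is not re-checked here.
                 */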
                spin_unlock_irq(&sem->lock);
                timeout = schedule_timeout(timeout);
                spin_lock_irq(&sem->lock);
                if (waiter.up)
                        return 0;
        }

the fastpath looks good and obvious, but note the following property of
the contended path: if there's a task on the ->wait_list, the up() of
the current owner will "pass over" ownership to that waiting task, in a
wake-one manner, via the waiter->up flag and by removing the waiter
from the wait list.

That is all fine in principle, but as implemented in kernel/semaphore.c
it also creates a nasty, hidden source of contention!

The contention comes from the following property of the new semaphore
code: the new owner owns the semaphore exclusively, even if it is not
running yet. So if the old owner, even just a few instructions later,
does a down() [lock_kernel()] again, it will be blocked and will have
to wait on the new owner to eventually be scheduled (possibly on
another CPU)! Or if another task gets to lock_kernel() sooner than the
"new owner" is scheduled, it will be blocked unnecessarily - and for a
very long time when there are 2000 tasks running.

I.e. the implementation of the new semaphore code does wake-one and
lock ownership in a very restrictive way - it does not allow
opportunistic re-locking of the lock at all and keeps the scheduler
from picking task order intelligently.

This kind of scheduling, with 2000 AIM7 processes running, creates
awful cross-scheduling between those 2000 tasks, causes reduced
parallelism, a throttled runqueue length and a lot of idle time. With
an increasing number of CPUs it causes exponentially worse behavior in
AIM7, as the chance of a newly woken new-owner task actually running
anytime soon gets smaller and smaller.

Note that it takes just a tiny bit of contention for the
'new-semaphore catastrophe' to happen: the wakeup latencies get added
to whatever small contention there is, and quickly snowball out of
control! I believe Yanmin's findings and numbers support this analysis
too.

The best fix for this problem is to use the same scheduling logic that
the kernel/mutex.c code uses: keep the wake-one behavior (that is OK
and wanted because we do not want to over-schedule), but also allow
opportunistic locking of the lock even if a wakee is already "in
flight" (a sketch of that mutex.c logic is included near the end of
this mail).

The patch below implements this new logic. With this patch applied the
AIM7 regression is largely fixed on my quad testbox:

  # v2.6.25 vanilla:
  ..................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    56096.4         91      207.5   789.7   0.4675
  2000    55894.4         94      208.2   792.7   0.4658

  # v2.6.26-rc1-166-gc0a1811 vanilla:
  ...................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    33230.6         83      350.3   784.5   0.2769
  2000    31778.1         86      366.3   783.6   0.2648

  # v2.6.26-rc1-166-gc0a1811 + semaphore-speedup:
  ...............................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55707.1         92      209.0   795.6   0.4642
  2000    55704.4         96      209.0   796.0   0.4642

i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
performance levels and have zero idle time during the test, as
expected.

Btw., interactivity also improved dramatically with the fix - for
example console-switching became almost instantaneous during this
workload (which after all is running 2000 tasks at once!); without the
patch it was stuck for a minute at times.
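With the patch applied, the contended path boils down to this - a
consolidated sketch of the resulting __down_common() loop, the full
diff below has the exact code:

        waiter.task = task;

        for (;;) {
                list_add_tail(&waiter.list, &sem->wait_list);
                [...]
                spin_unlock_irq(&sem->lock);
                timeout = schedule_timeout(timeout);
                spin_lock_irq(&sem->lock);
                /*
                 * Anyone may have taken the semaphore while we slept,
                 * so re-check the count instead of relying on an
                 * exclusive ownership handoff from up():
                 */
                if (sem->count > 0)
                        return 0;
        }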
I also ran Linus's spinlock-BKL patch as well:

  # v2.6.26-rc1-166-gc0a1811 + Linus-BKL-spinlock-patch:
  ......................................................
  Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
  2000    55889.0         92      208.3   793.3   0.4657
  2000    55891.7         96      208.3   793.3   0.4658

it is about 0.3% faster - but note that this is within the general
noise level of this test.

I'd expect Linus's spinlock-BKL patch to give some small speedup,
because the BKL acquire times are short, 2000 tasks running all at
once really increases the context-switching cost, and most BKL
contentions are shorter than the cost of a context switch. But i
believe the better solution for that is to remove BKL use from all
hotpaths, not to hide some of its costs by reintroducing it as a
spinlock. Reintroducing the spinlock-based BKL would have other
disadvantages too: it could reintroduce per-CPU-ness assumptions in
BKL-using code, among other complications. It's also not a very
realistic workload - with 2000 tasks running the system was barely
serviceable.

I'd much rather make BKL costs more apparent and more visible - but a
50% regression was of course too much. A 0.3% cost for a 2000-tasks
workload, which is near the noise level ... is acceptable i think -
especially as this discussion has now reinvigorated the remove-the-BKL
discussions and patches. Linus, we can do your spinlock-BKL patch too
if you feel strongly about it, but i'd rather not - we fought so hard
for the preemptible BKL :-)

The spinlock-based-BKL patch only worked around the real problem i
believe, because it eliminated the use of the suboptimal new semaphore
code: with spinlocks there's no scheduling at all, so the
wakeup/locking bug of the new semaphore code did not apply. It was not
about any fastpath overhead AFAICS. [we'd have seen that with the
CONFIG_PREEMPT_BKL=y code as well, which has been the default setting
since v2.6.8.]

There's another nice side-effect of this speedup patch: the new
generic semaphore code got even smaller:

     text    data     bss     dec     hex filename
     1241       0       0    1241     4d9 semaphore.o.before
     1207       0       0    1207     4b7 semaphore.o.after

(because the waiter.up complication got removed.)

Longer-term we should look into using the mutex code for the generic
semaphore code as well - but it's not easy due to legacies, and it's
outside the scope of v2.6.26 and of this patch as well. Hm?
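For reference, the kernel/mutex.c slowpath logic referred to above
looks roughly like this - a simplified sketch, not the literal source;
the point is that each waiter re-tries the acquire itself after waking
up, so the lock is never reserved for a task that is not running yet:

        for (;;) {
                /*
                 * Try to take the lock ourselves - this succeeds even
                 * if we are not the task that the last unlock woke up:
                 */
                if (atomic_xchg(&lock->count, -1) == 1)
                        break;
                [...]
                __set_task_state(task, state);
                spin_unlock_mutex(&lock->wait_lock, flags);
                schedule();
                spin_lock_mutex(&lock->wait_lock, flags);
        }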
	Ingo

Signed-off-by: Ingo Molnar
---
 kernel/semaphore.c |   36 ++++++++++++++++--------------------
 1 file changed, 16 insertions(+), 20 deletions(-)

Index: linux/kernel/semaphore.c
===================================================================
--- linux.orig/kernel/semaphore.c
+++ linux/kernel/semaphore.c
@@ -54,10 +54,9 @@ void down(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		__down(sem);
+	sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
 EXPORT_SYMBOL(down);
@@ -77,10 +76,10 @@ int down_interruptible(struct semaphore
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_interruptible(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -103,10 +102,10 @@ int down_killable(struct semaphore *sem)
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_killable(sem);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -157,10 +156,10 @@ int down_timeout(struct semaphore *sem,
 	int result = 0;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(sem->count > 0))
-		sem->count--;
-	else
+	if (unlikely(sem->count <= 0))
 		result = __down_timeout(sem, jiffies);
+	if (!result)
+		sem->count--;
 	spin_unlock_irqrestore(&sem->lock, flags);
 
 	return result;
@@ -179,9 +178,8 @@ void up(struct semaphore *sem)
 	unsigned long flags;
 
 	spin_lock_irqsave(&sem->lock, flags);
-	if (likely(list_empty(&sem->wait_list)))
-		sem->count++;
-	else
+	sem->count++;
+	if (unlikely(!list_empty(&sem->wait_list)))
 		__up(sem);
 	spin_unlock_irqrestore(&sem->lock, flags);
 }
@@ -192,7 +190,6 @@ EXPORT_SYMBOL(up);
 struct semaphore_waiter {
 	struct list_head list;
 	struct task_struct *task;
-	int up;
 };
 
 /*
@@ -206,11 +203,11 @@ static inline int __sched __down_common(
 	struct task_struct *task = current;
 	struct semaphore_waiter waiter;
 
-	list_add_tail(&waiter.list, &sem->wait_list);
 	waiter.task = task;
-	waiter.up = 0;
 
 	for (;;) {
+		list_add_tail(&waiter.list, &sem->wait_list);
+
 		if (state == TASK_INTERRUPTIBLE && signal_pending(task))
 			goto interrupted;
 		if (state == TASK_KILLABLE && fatal_signal_pending(task))
@@ -221,7 +218,7 @@ static inline int __sched __down_common(
 		spin_unlock_irq(&sem->lock);
 		timeout = schedule_timeout(timeout);
 		spin_lock_irq(&sem->lock);
-		if (waiter.up)
+		if (sem->count > 0)
 			return 0;
 	}
 
@@ -259,6 +256,5 @@ static noinline void __sched __up(struct
 	struct semaphore_waiter *waiter = list_first_entry(&sem->wait_list,
 						struct semaphore_waiter, list);
 	list_del(&waiter->list);
-	waiter->up = 1;
 	wake_up_process(waiter->task);
 }