Possible regression? 2.6.26-rc1: T61s failure after suspend/resume


* Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
@ 2008-05-08 19:53 Theodore Ts'o
  2008-05-08 21:48 ` Hugh Dickins
  2008-05-08 21:52 ` Rafael J. Wysocki
  0 siblings, 2 replies; 19+ messages in thread
From: Theodore Ts'o @ 2008-05-08 19:53 UTC
  To: linux-kernel

I'm running a kernel based off of commit afa26be8 (just six commits
after 2.6.26-rc1), and very shortly after I suspend/resume my X61s (with
the Intel video chipset), the X server will lock up.  I can ssh into
the machine remotely, and restart the X server, but the newly restarted
X server will shortly lock up again, and the only way to solve the
problem is to reboot.  If I drop back to a 2.6.25 based kernel, the
problem goes away.

I've tried bisecting it, but the bisection points picked by git don't
boot at all, and given that I'm travelling I haven't had much time to try
doing more bisecting; since I know a number of kernel developers have
Lenovo X61 laptops, I thought before I wasted more time trying to get
the git bisection to work, I'd check to see if anyone has seen this
problem and if the fix is known.  I'll also try the latest bleeding edge
kernel and hope it's fixed there....

					- Ted


* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-08 19:53 Possible regression? 2.6.26-rc1: T61s failure after suspend/resume Theodore Ts'o
@ 2008-05-08 21:48 ` Hugh Dickins
  2008-05-09  2:59   ` Theodore Tso
                     ` (2 more replies)
  2008-05-08 21:52 ` Rafael J. Wysocki
  1 sibling, 3 replies; 19+ messages in thread
From: Hugh Dickins @ 2008-05-08 21:48 UTC
  To: Theodore Ts'o; +Cc: Glauber Costa, Ingo Molnar, linux-kernel

On Thu, 8 May 2008, Theodore Ts'o wrote:
> 
> I'm running a kernel based off of commit afa26be8 (just six commits
> after 2.6.26-rc1), and very shortly after I suspend/resume my X61s (with
> the Intel video chipset), the X server will lock up.  I can ssh into
> the machine remotely, and restart the X server, but the newly restarted
> X server will shortly lock up again, and the only way to solve the
> problem is to reboot.  If I drop back to a 2.6.25 based kernel, the
> problem goes away.
> 
> I've tried bisecting it, but the bisection points picked by git don't
> boot at all, and given that I'm travelling I haven't had much time to try
> doing more bisecting; since I know a number of kernel developers have
> Lenovo X61 laptops, I thought before I wasted more time trying to get
> the git bisection to work, I'd check to see if anyone has seen this
> problem and if the fix is known.  I'll also try the latest bleeding edge
> kernel and hope it's fixed there....

I don't have a Lenovo X61, and I've no problem on my uniprocessor T43p.
But I also have a Fujitsu Siemens Esprimo Mobile, Core2 Duo and Intel
graphics like yours, and that's been behaving strangely after resume
from RAM since somewhere between 2.6.25 and 2.6.26-rc1.

Sounds like it might be the same problem, though I quickly moved away
from trying it with X, and have been trying to investigate just from the
console for some days now.  Weird memory corruption after resume.

Like you, little success with bisection: probably-other bugs get in
the way.  Some bisection points don't boot, some don't come back from
resume at all, some hang before getting to test.  When, as a working
hypothesis, I assumed that not coming back from resume might be the
same problem manifesting in the return from resume itself, and shifted
around bisection points a bit to avoid non-booting, then it arrived at

commit 4fe29a85642544503cf81e9cf251ef0f4e65b162
Author: Glauber de Oliveira Costa <gcosta@redhat.com>
Date:   Wed Mar 19 14:25:23 2008 -0300
    x86: use specialized routine for setup per-cpu area

as the suspect commit.  But I couldn't see anything obviously wrong
with that; and it could well be no more guilty than shifting around
the kernel address space somewhat.  I've rather given up on the
bisection angle; and indeed, have since found that how the problem
manifests varies somewhat from one day's git to another,
from one config to another.

It does not happen with maxcpus=1.  Yesterday it occurred to me
to try without CONFIG_PREEMPT=y; but reached no conclusion on that,
it turns out preemption has been somehow essential to resume from
RAM on this machine since before 2.6.25: clearly a separate issue.
And resume from RAM running 64-bit on it is also long problematic.

To reproduce the problem, I start off by building a kernel with
make -j3 (from habit, perhaps with priming the pagecache in mind),
then interrupt that around the time it gets to filemap.o, bootmem.o.
I pm-suspend, close the lid, wait a few seconds, open the lid;
make mrproper and start a make -j3 build again.  (Though the very
first time I noticed the problem, it was a segfault in a git pull
after resume.)

How quickly it goes bad varies a lot: often hangs right at the
start while sedding stuff before getting down to the build itself.
Often gets well into the build before gcc reports Real-time signal
(most commonly 14 but others seen) killed cc1.  But my favourite,
the most distinctive failure, is segfault (usually in sh or make)
at 20295564 ip .....2f2 error 6 in ld-2.6.1.so (openSUSE 10.3).

Always 20295564; and objdumping ld-2.6.1.so shows 0x14 of that is
just the offset from %edi, so the crucial address is 0x20295550.
Which is "PU) ", though I've not found that string anywhere in
the running vmlinux (but of course it does appear in kernel source).
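
The arithmetic above can be checked mechanically.  A minimal Python
sketch, assuming the value is read as a little-endian 32-bit word (as on
x86); the variable names are illustrative only:

```python
import struct

fault_addr = 0x20295564   # address reported in the segfault message
displacement = 0x14       # offset from %edi, per the objdump reading above

base = fault_addr - displacement

# Lay the 32-bit value out as it would sit in memory on a little-endian
# machine, then read those four bytes as ASCII text.
text = struct.pack("<I", base).decode("ascii")

print(hex(base), repr(text))
```

Seeing printable ASCII where a pointer was expected is what suggests that
a string has been written over the pointer.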

Yesterday morning's git looked promising: because of the libata
70sec delay, I got diverted after the resume from RAM, left that
laptop idle, and found hald-something-or-other had come in every
few minutes and got that segfault at 20295564 (but with increasing
ip addresses: some address-space randomization effect, I suppose).
Well, I suppose it probably got run more often, but I'd only notice
the segfaulting ones.  So it can happen when close to idle; but
I've not been able to reproduce that since.

It's such a good signature, but I've failed to make progress with it.
Ted, please try doing the same (and check your logs for existing
segfault messages): let's see if you get the same number ;)
though I've no idea what it'd tell us.

Hugh


* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-08 19:53 Possible regression? 2.6.26-rc1: T61s failure after suspend/resume Theodore Ts'o
  2008-05-08 21:48 ` Hugh Dickins
@ 2008-05-08 21:52 ` Rafael J. Wysocki
  2008-05-09  2:44   ` Possible regression? 2.6.26-rc1: X61s " Theodore Tso
  1 sibling, 1 reply; 19+ messages in thread
From: Rafael J. Wysocki @ 2008-05-08 21:52 UTC
  To: Theodore Ts'o; +Cc: linux-kernel, Jesse Barnes

On Thursday, 8 of May 2008, Theodore Ts'o wrote:
> 
> I'm running a kernel based off of commit afa26be8 (just six commits
> after 2.6.26-rc1), and very shortly after I suspend/resume my X61s (with
> the Intel video chipset), the X server will lock up.  I can ssh into
> the machine remotely, and restart the X server, but the newly restarted
> X server will shortly lock up again, and the only way to solve the
> problem is to reboot.  If I drop back to a 2.6.25 based kernel, the
> problem goes away.
> 
> I've tried bisecting it, but the bisection points picked by git don't
> boot at all, and given that I'm travelling I haven't had much time to try
> doing more bisecting; since I know a number of kernel developers have
> Lenovo X61 laptops, I thought before I wasted more time trying to get
> the git bisection to work, I'd check to see if anyone has seen this
> problem and if the fix is known.  I'll also try the latest bleeding edge
> kernel and hope it's fixed there....

This looks like another manifestation of
http://bugzilla.kernel.org/show_bug.cgi?id=10620

Thanks,
Rafael


* Re: Possible regression?  2.6.26-rc1: X61s failure after suspend/resume
  2008-05-08 21:52 ` Rafael J. Wysocki
@ 2008-05-09  2:44   ` Theodore Tso
  2008-05-09  9:49     ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Theodore Tso @ 2008-05-09  2:44 UTC
  To: Rafael J. Wysocki; +Cc: linux-kernel, Jesse Barnes

On Thu, May 08, 2008 at 11:52:03PM +0200, Rafael J. Wysocki wrote:
> On Thursday, 8 of May 2008, Theodore Ts'o wrote:
> > 
> > I'm running a kernel based off of commit afa26be8 (just six commits
> > after 2.6.26-rc1), and very shortly after I suspend/resume my X61s (with
> > the Intel video chipset), the X server will lock up.  I can ssh into
> > the machine remotely, and restart the X server, but the newly restarted
> > X server will shortly lock up again, and the only way to solve the
> > problem is to reboot.  If I drop back to a 2.6.25 based kernel, the
> > problem goes away.
> > 
> > I've tried bisecting it, but the bisection points picked by git don't
> > boot at all, and given that I'm travelling I haven't had much time to try
> > doing more bisecting; since I know a number of kernel developers have
> > Lenovo X61 laptops, I thought before I wasted more time trying to get
> > the git bisection to work, I'd check to see if anyone has seen this
> > problem and if the fix is known.  I'll also try the latest bleeding edge
> > kernel and hope it's fixed there....
> 
> This looks like another manifestation of
> http://bugzilla.kernel.org/show_bug.cgi?id=10620

Could be.  On my system, the X server runs for anywhere from 15 seconds
to five minutes before it wedges and locks up.  This is why it took me
a while before I finally figured out that the way to reliably
reproduce the problem was to do a suspend/resume.  So it's not
*identical* to the report, but it's really close....

When I have more time I'll try to find some bisection points
that will actually boot successfully on the X61s laptop, and not die
within 6-8 seconds of the kernel loading.....


							- Ted


* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-08 21:48 ` Hugh Dickins
@ 2008-05-09  2:59   ` Theodore Tso
  2008-05-09  7:31     ` Hugh Dickins
  2008-05-09 11:16     ` Peter Zijlstra
  2008-05-09 14:18   ` Carlos R. Mafra
  2008-05-09 15:28   ` Glauber Costa
  2 siblings, 2 replies; 19+ messages in thread
From: Theodore Tso @ 2008-05-09  2:59 UTC
  To: Hugh Dickins; +Cc: Glauber Costa, Ingo Molnar, linux-kernel

On Thu, May 08, 2008 at 10:48:57PM +0100, Hugh Dickins wrote:
> It's such a good signature, but I've failed to make progress with it.
> Ted, please try doing the same (and check your logs for existing
> segfault messages): let's see if you get the same number ;)
> though I've no idea what it'd tell us.

I can't say I see any such segfaults.  The only ones in my logs are
these, and they seem to be correlated to right after a system boot:

May  5 09:35:02 closure kernel: [ 2249.245926] hald-addon-keyb[8631]: segfault at fffffffd ip b7e827bc sp bffcf1a8 error 4 in libc-2.6.1.so[b7e15000+144000]
May  5 09:35:02 closure kernel: [ 2249.252562] hald-addon-keyb[8630]: segfault at fffffffd ip b7dcf7bc sp bf81b9d8 error 4 in libc-2.6.1.so[b7d62000+144000]
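
For comparing such messages, a minimal Python sketch that pulls the
fields out of a segfault line and decodes the x86 page-fault error code
(bit 0: page was present, bit 1: write access, bit 2: user mode).  The
helper name decode_segfault and the regular expression are illustrative
assumptions based only on the two lines above:

```python
import re

# Matches the segfault message format as it appears in the log lines above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Return the mapped object, the ip offset into it, and the fault type."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "object": m.group("obj"),
        # Offset of the faulting instruction from the start of the mapping.
        "offset": int(m.group("ip"), 16) - int(m.group("base"), 16),
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }

lines = [
    "hald-addon-keyb[8631]: segfault at fffffffd ip b7e827bc sp bffcf1a8 "
    "error 4 in libc-2.6.1.so[b7e15000+144000]",
    "hald-addon-keyb[8630]: segfault at fffffffd ip b7dcf7bc sp bf81b9d8 "
    "error 4 in libc-2.6.1.so[b7d62000+144000]",
]
for line in lines:
    info = decode_segfault(line)
    print(info["object"], hex(info["offset"]), info)
```

Both lines decode to the same offset within libc despite the different
load addresses, and "error 4" is a user-mode read of an unmapped page;
"error 6" would be a user-mode write.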

I did find this, but it was from an attempt to do a bisect (see
below).  In this case the system lasted half-way through the boot
sequence (although not before the X server started) before it crashed.
I'm beginning to think that "git bisecting" in the middle of the merge
window just doesn't work well because some people aren't adequately
checking to make sure that their patch series are "git bisectable" in
terms of being bootable between arbitrary patches in their series.  So
what I plan to do when I have a spare 10-15 hours is to fetch the git
IDs from patch-2.6.25-git*.id, which should hopefully represent
bisection points that are somewhat more likely to boot, and try to do a
git bisect using those points to see if the resulting kernels are a
bit more likely to last long enough so I can test for this particular
regression.
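
The plan above amounts to a bisection over a fixed, ordered list of
snapshot points, skipping any that will not boot.  A minimal Python
sketch of that search follows; it is a simulation only, not a wrapper
around git, and the snapshot labels, the bisect_snapshots helper and the
fake test function are all hypothetical:

```python
def bisect_snapshots(snapshots, test):
    """Narrow down (last_good, first_bad) among ordered snapshot points.

    snapshots is ordered oldest to newest; snapshots[0] is assumed good and
    snapshots[-1] is assumed bad.  test(s) returns "good", "bad" or "skip"
    (for example, when the kernel built from s does not boot).
    """
    lo, hi = 0, len(snapshots) - 1
    skipped = set()
    while True:
        candidates = [i for i in range(lo + 1, hi) if i not in skipped]
        if not candidates:
            break
        mid = (lo + hi) // 2
        # Pick the untested point closest to the midpoint.
        i = min(candidates, key=lambda c: abs(c - mid))
        verdict = test(snapshots[i])
        if verdict == "good":
            lo = i
        elif verdict == "bad":
            hi = i
        else:
            skipped.add(i)
    return snapshots[lo], snapshots[hi]


# Hypothetical labels standing in for the snapshot IDs.
points = ["v2.6.25"] + ["git%d" % n for n in range(1, 9)] + ["v2.6.26-rc1"]
unbootable = {"git4"}     # made-up example of a point that cannot be tested
first_broken = "git5"     # made-up location of the regression

def fake_test(snapshot):
    if snapshot in unbootable:
        return "skip"
    if points.index(snapshot) >= points.index(first_broken):
        return "bad"
    return "good"

print(bisect_snapshots(points, fake_test))
```

When a skipped point sits next to the culprit, the result is a range
rather than a single point, which is the price of routing around
unbootable kernels.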

       	       	    	      	- Ted

P.S.  The "-numa" is due to a mistake that crept in via one of the
patch trees (and which set -LOCALVERSION in the top-level Makefile; it
got reverted later.)


[    2.097917] 
[    2.097919] =================================
[    2.098059] [ INFO: inconsistent lock state ]
[    2.098136] 2.6.25-numa-04462-g10c993a #23
[    2.098209] ---------------------------------
[    2.098284] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
[    2.098359] swapper/0 [HC0[0]:SC0[0]:HE1:SE1] takes:
[    2.098434]  (&rq->rq_lock_key){++..}, at: [sched_clock_idle_wakeup_event+67/116] sched_clock_idle_wakeup_event+0x43/0x74
[    2.098753] {in-hardirq-W} state was registered at:
[    2.098833]   [__lock_acquire+1023/2834] __lock_acquire+0x3ff/0xb12
[    2.099082]   [lock_acquire+106/144] lock_acquire+0x6a/0x90
[    2.099329]   [_spin_lock+28/73] _spin_lock+0x1c/0x49
[    2.099590]   [scheduler_tick+67/443] scheduler_tick+0x43/0x1bb
[    2.099836]   [update_process_times+61/73] update_process_times+0x3d/0x49
[    2.100083]   [tick_periodic+102/114] tick_periodic+0x66/0x72
[    2.100327]   [tick_handle_periodic+25/106] tick_handle_periodic+0x19/0x6a
[    2.100574]   [timer_interrupt+72/115] timer_interrupt+0x48/0x73
[    2.100822]   [handle_IRQ_event+26/79] handle_IRQ_event+0x1a/0x4f
[    2.101064]   [handle_level_irq+127/202] handle_level_irq+0x7f/0xca
[    2.101316]   [do_IRQ+169/210] do_IRQ+0xa9/0xd2
[    2.101563]   [<ffffffff>] 0xffffffff
[    2.101806] irq event stamp: 1772935
[    2.101883] hardirqs last  enabled at (1772935): [native_sched_clock+231/255] native_sched_clock+0xe7/0xff
[    2.102091] hardirqs last disabled at (1772934): [native_sched_clock+109/255] native_sched_clock+0x6d/0xff
[    2.102298] softirqs last  enabled at (1772496): [__do_softirq+249/255] __do_softirq+0xf9/0xff
[    2.102501] softirqs last disabled at (1772491): [do_softirq+113/206] do_softirq+0x71/0xce
[    2.102708] 
[    2.102709] other info that might help us debug this:
[    2.102850] no locks held by swapper/0.
[    2.102923] 
[    2.102924] stack backtrace:
[    2.103067] Pid: 0, comm: swapper Not tainted 2.6.25-numa-04462-g10c993a #23
[    2.103145]  [print_usage_bug+263/276] print_usage_bug+0x107/0x114
[    2.103278]  [mark_lock+491/924] mark_lock+0x1eb/0x39c
[    2.103417]  [__lock_acquire+1140/2834] __lock_acquire+0x474/0xb12
[    2.103547]  [restore_nocheck+18/21] ? restore_nocheck+0x12/0x15
[    2.103740]  [native_sched_clock+231/255] ? native_sched_clock+0xe7/0xff
[    2.103929]  [lock_acquire+106/144] lock_acquire+0x6a/0x90
[    2.104065]  [sched_clock_idle_wakeup_event+67/116] ? sched_clock_idle_wakeup_event+0x43/0x74
[    2.104255]  [_spin_lock+28/73] _spin_lock+0x1c/0x49
[    2.104392]  [sched_clock_idle_wakeup_event+67/116] ? sched_clock_idle_wakeup_event+0x43/0x74
[    2.104582]  [sched_clock_idle_wakeup_event+67/116] sched_clock_idle_wakeup_event+0x43/0x74
[    2.104715]  [<f886c3b3>] acpi_idle_enter_simple+0x19a/0x21b [processor]
[    2.104858]  [<f886bfa5>] acpi_idle_enter_bm+0xbe/0x332 [processor]
[    2.104999]  [cpuidle_idle_call+99/143] cpuidle_idle_call+0x63/0x8f
[    2.105135]  [cpuidle_idle_call+0/143] ? cpuidle_idle_call+0x0/0x8f
[    2.105330]  [cpu_idle+182/214] cpu_idle+0xb6/0xd6
[    2.105463]  [rest_init+73/75] rest_init+0x49/0x4b
[    2.105596]  =======================
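
For readers unfamiliar with the "[HC0[0]:SC0[0]:HE1:SE1]" field in the
report above, a minimal Python sketch that decodes it.  The field
meanings (HC/SC: in hardirq/softirq context, with the nesting count in
brackets; HE/SE: hardirqs/softirqs enabled) are stated here as my reading
of lockdep's output format, and the decode_context helper is
hypothetical:

```python
import re

CTX_RE = re.compile(r"\[HC(\d+)\[(\d+)\]:SC(\d+)\[(\d+)\]:HE(\d+):SE(\d+)\]")

def decode_context(line):
    """Decode the irq-context summary from a lockdep 'takes:' line."""
    m = CTX_RE.search(line)
    if m is None:
        return None
    hc, hc_count, sc, sc_count, he, se = (int(g) for g in m.groups())
    return {
        "in_hardirq": bool(hc),
        "hardirq_count": hc_count,
        "in_softirq": bool(sc),
        "softirq_count": sc_count,
        "hardirqs_enabled": bool(he),
        "softirqs_enabled": bool(se),
    }

print(decode_context("swapper/0 [HC0[0]:SC0[0]:HE1:SE1] takes:"))
```

Read this way, the lock was taken from process context with hardirqs
enabled, which matches the {hardirq-on-W} half of the warning.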


* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-09  2:59   ` Theodore Tso
@ 2008-05-09  7:31     ` Hugh Dickins
  2008-05-09 11:16     ` Peter Zijlstra
  1 sibling, 0 replies; 19+ messages in thread
From: Hugh Dickins @ 2008-05-09  7:31 UTC
  To: Theodore Tso; +Cc: Glauber Costa, Ingo Molnar, linux-kernel

On Thu, 8 May 2008, Theodore Tso wrote:
> On Thu, May 08, 2008 at 10:48:57PM +0100, Hugh Dickins wrote:
> > It's such a good signature, but I've failed to make progress with it.
> > Ted, please try doing the same (and check your logs for existing
> > segfault messages): let's see if you get the same number ;)
> > though I've no idea what it'd tell us.
> 
> I can't say I see any such segfaults.  The only ones in my logs are
> these, and they seem to be correlated to right after a system boot:
> 
> May  5 09:35:02 closure kernel: [ 2249.245926] hald-addon-keyb[8631]: segfault at fffffffd ip b7e827bc sp bffcf1a8 error 4 in libc-2.6.1.so[b7e15000+144000]
> May  5 09:35:02 closure kernel: [ 2249.252562] hald-addon-keyb[8630]: segfault at fffffffd ip b7dcf7bc sp bf81b9d8 error 4 in libc-2.6.1.so[b7d62000+144000]

I've grown accustomed to having a hal-whatever segfault just after startup
on the T43p, and pay no attention to those: I agree yours above are in
that category, and of no relevance to our -rc1 concerns.

Yours may indeed have nothing to do with mine: the absence of (interesting)
segfault in the logs doesn't let me draw any conclusion.  If you can
get through a successful make -j3 kernel build after resume, without
X in the way, then I shall conclude yours is not mine.  But for now
I'll assume I'm on my own and plug away at it somehow.

> 
> I did find this, but it was from an attempt to do a bisect (see
> below).  In this case the system lasted half-way through the boot
> sequence (although not before the X server started) before it crashed.
> I'm beginning to think that "git bisecting" in the middle of the merge
> window just doesn't work well because some people aren't adequately
> checking to make sure that their patch series are "git bisectable" in
> terms of being bootable between arbitrary patches in their series.

I'm actually impressed by how well people generally keep to that
discipline; I see the problem as more that when there's such a
quantity of changes flowing in, the chance of one bug interfering
with the hunt for another bug goes up and up.

> So what I plan to do when I have a spare 10-15 hours is to fetch the git
> id's from patch-2.6.25-git*.id, which should hopefully represent
> somewhat more likely-to-bootable git bisection points, and try to do a
> git bisect using those points to see if the resulting kernels are a
> bit more likely to last long enough so I can test for this particular
> regression.

Good luck: doesn't sound like so much fun that I'd want to do the same!

Hugh



* Re: Possible regression?  2.6.26-rc1: X61s failure after suspend/resume
  2008-05-09  2:44   ` Possible regression? 2.6.26-rc1: X61s " Theodore Tso
@ 2008-05-09  9:49     ` Ingo Molnar
  2008-05-12  2:03       ` Theodore Tso
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2008-05-09  9:49 UTC
  To: Theodore Tso, Rafael J. Wysocki, linux-kernel, Jesse Barnes


* Theodore Tso <tytso@MIT.EDU> wrote:

> > This looks like another manifestation of
> > http://bugzilla.kernel.org/show_bug.cgi?id=10620
> 
> Could be.  On my system, the X server runs for about 15 seconds to 
> five minutes before it wedges up and locks up.  This is why it took me 
> a while before I finally figured out that the way to reliably 
> reproduce the problem was to do a suspend/resume.  So it's not 
> *identical* to the report, but it's really close....
> 
> When I have more time I'll try to find some actual bisection points 
> that actually will successfully boot on the X61s laptop, and not die 
> within 6-8 seconds of the kernel loading.....

on the off chance that this might be related: could you try to boot with 
nopat?

and on the off chance that this is a problem that has already been 
fixed, you might want to try x86.git/latest:

   http://people.redhat.com/mingo/x86.git/README

	Ingo


* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-09  2:59   ` Theodore Tso
  2008-05-09  7:31     ` Hugh Dickins
@ 2008-05-09 11:16     ` Peter Zijlstra
  2008-05-09 11:36       ` Hugh Dickins
  1 sibling, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2008-05-09 11:16 UTC
  To: Theodore Tso
  Cc: Hugh Dickins, Glauber Costa, Ingo Molnar, linux-kernel,
	Venki Pallipadi

On Thu, 2008-05-08 at 22:59 -0400, Theodore Tso wrote:

> [    2.097919] =================================
> [    2.098059] [ INFO: inconsistent lock state ]
> [    2.098136] 2.6.25-numa-04462-g10c993a #23
> [    2.098209] ---------------------------------
> [    2.098284] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
> [    2.098359] swapper/0 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [    2.098434]  (&rq->rq_lock_key){++..}, at: [sched_clock_idle_wakeup_event+67/116] sched_clock_idle_wakeup_event+0x43/0x74
> [    2.098753] {in-hardirq-W} state was registered at:
> [    2.098833]   [__lock_acquire+1023/2834] __lock_acquire+0x3ff/0xb12
> [    2.099082]   [lock_acquire+106/144] lock_acquire+0x6a/0x90
> [    2.099329]   [_spin_lock+28/73] _spin_lock+0x1c/0x49
> [    2.099590]   [scheduler_tick+67/443] scheduler_tick+0x43/0x1bb
> [    2.099836]   [update_process_times+61/73] update_process_times+0x3d/0x49
> [    2.100083]   [tick_periodic+102/114] tick_periodic+0x66/0x72
> [    2.100327]   [tick_handle_periodic+25/106] tick_handle_periodic+0x19/0x6a
> [    2.100574]   [timer_interrupt+72/115] timer_interrupt+0x48/0x73
> [    2.100822]   [handle_IRQ_event+26/79] handle_IRQ_event+0x1a/0x4f
> [    2.101064]   [handle_level_irq+127/202] handle_level_irq+0x7f/0xca
> [    2.101316]   [do_IRQ+169/210] do_IRQ+0xa9/0xd2
> [    2.101563]   [<ffffffff>] 0xffffffff
> [    2.101806] irq event stamp: 1772935
> [    2.101883] hardirqs last  enabled at (1772935): [native_sched_clock+231/255] native_sched_clock+0xe7/0xff
> [    2.102091] hardirqs last disabled at (1772934): [native_sched_clock+109/255] native_sched_clock+0x6d/0xff
> [    2.102298] softirqs last  enabled at (1772496): [__do_softirq+249/255] __do_softirq+0xf9/0xff
> [    2.102501] softirqs last disabled at (1772491): [do_softirq+113/206] do_softirq+0x71/0xce
> [    2.102708] 
> [    2.102709] other info that might help us debug this:
> [    2.102850] no locks held by swapper/0.
> [    2.102923] 
> [    2.102924] stack backtrace:
> [    2.103067] Pid: 0, comm: swapper Not tainted 2.6.25-numa-04462-g10c993a #23
> [    2.103145]  [print_usage_bug+263/276] print_usage_bug+0x107/0x114
> [    2.103278]  [mark_lock+491/924] mark_lock+0x1eb/0x39c
> [    2.103417]  [__lock_acquire+1140/2834] __lock_acquire+0x474/0xb12
> [    2.103547]  [restore_nocheck+18/21] ? restore_nocheck+0x12/0x15
> [    2.103740]  [native_sched_clock+231/255] ? native_sched_clock+0xe7/0xff
> [    2.103929]  [lock_acquire+106/144] lock_acquire+0x6a/0x90
> [    2.104065]  [sched_clock_idle_wakeup_event+67/116] ? sched_clock_idle_wakeup_event+0x43/0x74
> [    2.104255]  [_spin_lock+28/73] _spin_lock+0x1c/0x49
> [    2.104392]  [sched_clock_idle_wakeup_event+67/116] ? sched_clock_idle_wakeup_event+0x43/0x74
> [    2.104582]  [sched_clock_idle_wakeup_event+67/116] sched_clock_idle_wakeup_event+0x43/0x74
> [    2.104715]  [<f886c3b3>] acpi_idle_enter_simple+0x19a/0x21b [processor]
> [    2.104858]  [<f886bfa5>] acpi_idle_enter_bm+0xbe/0x332 [processor]
> [    2.104999]  [cpuidle_idle_call+99/143] cpuidle_idle_call+0x63/0x8f
> [    2.105135]  [cpuidle_idle_call+0/143] ? cpuidle_idle_call+0x0/0x8f
> [    2.105330]  [cpu_idle+182/214] cpu_idle+0xb6/0xd6
> [    2.105463]  [rest_init+73/75] rest_init+0x49/0x4b
> [    2.105596]  =======================

That's not good...

But I'm failing to see how IRQs get enabled between
sched_clock_idle_sleep_event() and sched_clock_idle_wakeup_event() in
acpi_idle_enter_simple().
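
For context on what the report is complaining about: a lock that is ever
taken from hardirq context must never be taken with hardirqs enabled,
since an interrupt arriving on the same CPU while the lock is held would
spin on it forever.  A minimal Python model of that rule follows; it is
a deliberately simplified sketch, not the kernel's lockdep code, and the
class and function names are hypothetical:

```python
class LockClass:
    def __init__(self, name):
        self.name = name
        self.used_in_hardirq = False     # ever acquired from hardirq context
        self.used_hardirqs_on = False    # ever acquired with hardirqs enabled


def acquire(lock, in_hardirq, hardirqs_enabled):
    """Record one acquisition; return a warning if the usage is inconsistent."""
    if in_hardirq:
        lock.used_in_hardirq = True
    elif hardirqs_enabled:
        lock.used_hardirqs_on = True
    if lock.used_in_hardirq and lock.used_hardirqs_on:
        return "inconsistent lock state: %s" % lock.name
    return None


rq_lock = LockClass("&rq->rq_lock_key")

# scheduler_tick() reached from the timer interrupt: hardirq context.
print(acquire(rq_lock, in_hardirq=True, hardirqs_enabled=False))

# sched_clock_idle_wakeup_event() from the idle loop, reported as HE1.
print(acquire(rq_lock, in_hardirq=False, hardirqs_enabled=True))
```

The first call records the in-hardirq usage quietly; the second trips the
check, mirroring the {in-hardirq-W} -> {hardirq-on-W} transition in the
report.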






* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-09 11:16     ` Peter Zijlstra
@ 2008-05-09 11:36       ` Hugh Dickins
  2008-05-09 11:49         ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Hugh Dickins @ 2008-05-09 11:36 UTC
  To: Peter Zijlstra
  Cc: Theodore Tso, Glauber Costa, Ingo Molnar, linux-kernel,
	Venki Pallipadi

On Fri, 9 May 2008, Peter Zijlstra wrote:
> On Thu, 2008-05-08 at 22:59 -0400, Theodore Tso wrote:
> 
> > [    2.097919] =================================
> > [    2.098059] [ INFO: inconsistent lock state ]
> > [    2.098136] 2.6.25-numa-04462-g10c993a #23
> > [    2.098209] ---------------------------------
> > [    2.098284] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
> > [    2.098359] swapper/0 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > [    2.098434]  (&rq->rq_lock_key){++..}, at: [sched_clock_idle_wakeup_event+67/116] sched_clock_idle_wakeup_event+0x43/0x74
> > [    2.098753] {in-hardirq-W} state was registered at:
> > [    2.098833]   [__lock_acquire+1023/2834] __lock_acquire+0x3ff/0xb12
> > [    2.099082]   [lock_acquire+106/144] lock_acquire+0x6a/0x90
> > [    2.099329]   [_spin_lock+28/73] _spin_lock+0x1c/0x49
> > [    2.099590]   [scheduler_tick+67/443] scheduler_tick+0x43/0x1bb
> > [    2.099836]   [update_process_times+61/73] update_process_times+0x3d/0x49
> > [    2.100083]   [tick_periodic+102/114] tick_periodic+0x66/0x72
> > [    2.100327]   [tick_handle_periodic+25/106] tick_handle_periodic+0x19/0x6a
> > [    2.100574]   [timer_interrupt+72/115] timer_interrupt+0x48/0x73
> > [    2.100822]   [handle_IRQ_event+26/79] handle_IRQ_event+0x1a/0x4f
> > [    2.101064]   [handle_level_irq+127/202] handle_level_irq+0x7f/0xca
> > [    2.101316]   [do_IRQ+169/210] do_IRQ+0xa9/0xd2
> > [    2.101563]   [<ffffffff>] 0xffffffff
> > [    2.101806] irq event stamp: 1772935
> > [    2.101883] hardirqs last  enabled at (1772935): [native_sched_clock+231/255] native_sched_clock+0xe7/0xff
> > [    2.102091] hardirqs last disabled at (1772934): [native_sched_clock+109/255] native_sched_clock+0x6d/0xff
> > [    2.102298] softirqs last  enabled at (1772496): [__do_softirq+249/255] __do_softirq+0xf9/0xff
> > [    2.102501] softirqs last disabled at (1772491): [do_softirq+113/206] do_softirq+0x71/0xce
> > [    2.102708] 
> > [    2.102709] other info that might help us debug this:
> > [    2.102850] no locks held by swapper/0.
> > [    2.102923] 
> > [    2.102924] stack backtrace:
> > [    2.103067] Pid: 0, comm: swapper Not tainted 2.6.25-numa-04462-g10c993a #23
> > [    2.103145]  [print_usage_bug+263/276] print_usage_bug+0x107/0x114
> > [    2.103278]  [mark_lock+491/924] mark_lock+0x1eb/0x39c
> > [    2.103417]  [__lock_acquire+1140/2834] __lock_acquire+0x474/0xb12
> > [    2.103547]  [restore_nocheck+18/21] ? restore_nocheck+0x12/0x15
> > [    2.103740]  [native_sched_clock+231/255] ? native_sched_clock+0xe7/0xff
> > [    2.103929]  [lock_acquire+106/144] lock_acquire+0x6a/0x90
> > [    2.104065]  [sched_clock_idle_wakeup_event+67/116] ? sched_clock_idle_wakeup_event+0x43/0x74
> > [    2.104255]  [_spin_lock+28/73] _spin_lock+0x1c/0x49
> > [    2.104392]  [sched_clock_idle_wakeup_event+67/116] ? sched_clock_idle_wakeup_event+0x43/0x74
> > [    2.104582]  [sched_clock_idle_wakeup_event+67/116] sched_clock_idle_wakeup_event+0x43/0x74
> > [    2.104715]  [<f886c3b3>] acpi_idle_enter_simple+0x19a/0x21b [processor]
> > [    2.104858]  [<f886bfa5>] acpi_idle_enter_bm+0xbe/0x332 [processor]
> > [    2.104999]  [cpuidle_idle_call+99/143] cpuidle_idle_call+0x63/0x8f
> > [    2.105135]  [cpuidle_idle_call+0/143] ? cpuidle_idle_call+0x0/0x8f
> > [    2.105330]  [cpu_idle+182/214] cpu_idle+0xb6/0xd6
> > [    2.105463]  [rest_init+73/75] rest_init+0x49/0x4b
> > [    2.105596]  =======================
> 
> That's not good...
> 
> But I'm failing to see how IRQs get enabled between
> sched_clock_idle_sleep_event() and sched_clock_idle_wakeup_event() in
> acpi_idle_enter_simple().

If this point in Ted's bisection was without your idle irq fixes
(that patch that added trace_hardirqs_on to __sti_mwait, amongst
other things), would that account for it?

Hugh


* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-09 11:36       ` Hugh Dickins
@ 2008-05-09 11:49         ` Peter Zijlstra
  2008-05-09 13:53           ` Theodore Tso
  0 siblings, 1 reply; 19+ messages in thread
From: Peter Zijlstra @ 2008-05-09 11:49 UTC
  To: Hugh Dickins
  Cc: Theodore Tso, Glauber Costa, Ingo Molnar, linux-kernel,
	Venki Pallipadi

On Fri, 2008-05-09 at 12:36 +0100, Hugh Dickins wrote:
> On Fri, 9 May 2008, Peter Zijlstra wrote:
> > On Thu, 2008-05-08 at 22:59 -0400, Theodore Tso wrote:
> > 
> > > [    2.097919] =================================
> > > [    2.098059] [ INFO: inconsistent lock state ]
> > > [    2.098136] 2.6.25-numa-04462-g10c993a #23
> > > [    2.098209] ---------------------------------
> > > [    2.098284] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
> > > [    2.098359] swapper/0 [HC0[0]:SC0[0]:HE1:SE1] takes:
> > > [    2.098434]  (&rq->rq_lock_key){++..}, at: [sched_clock_idle_wakeup_event+67/116] sched_clock_idle_wakeup_event+0x43/0x74
> > > [    2.098753] {in-hardirq-W} state was registered at:
> > > [    2.098833]   [__lock_acquire+1023/2834] __lock_acquire+0x3ff/0xb12
> > > [    2.099082]   [lock_acquire+106/144] lock_acquire+0x6a/0x90
> > > [    2.099329]   [_spin_lock+28/73] _spin_lock+0x1c/0x49
> > > [    2.099590]   [scheduler_tick+67/443] scheduler_tick+0x43/0x1bb
> > > [    2.099836]   [update_process_times+61/73] update_process_times+0x3d/0x49
> > > [    2.100083]   [tick_periodic+102/114] tick_periodic+0x66/0x72
> > > [    2.100327]   [tick_handle_periodic+25/106] tick_handle_periodic+0x19/0x6a
> > > [    2.100574]   [timer_interrupt+72/115] timer_interrupt+0x48/0x73
> > > [    2.100822]   [handle_IRQ_event+26/79] handle_IRQ_event+0x1a/0x4f
> > > [    2.101064]   [handle_level_irq+127/202] handle_level_irq+0x7f/0xca
> > > [    2.101316]   [do_IRQ+169/210] do_IRQ+0xa9/0xd2
> > > [    2.101563]   [<ffffffff>] 0xffffffff
> > > [    2.101806] irq event stamp: 1772935
> > > [    2.101883] hardirqs last  enabled at (1772935): [native_sched_clock+231/255] native_sched_clock+0xe7/0xff
> > > [    2.102091] hardirqs last disabled at (1772934): [native_sched_clock+109/255] native_sched_clock+0x6d/0xff
> > > [    2.102298] softirqs last  enabled at (1772496): [__do_softirq+249/255] __do_softirq+0xf9/0xff
> > > [    2.102501] softirqs last disabled at (1772491): [do_softirq+113/206] do_softirq+0x71/0xce
> > > [    2.102708] 
> > > [    2.102709] other info that might help us debug this:
> > > [    2.102850] no locks held by swapper/0.
> > > [    2.102923] 
> > > [    2.102924] stack backtrace:
> > > [    2.103067] Pid: 0, comm: swapper Not tainted 2.6.25-numa-04462-g10c993a #23
> > > [    2.103145]  [print_usage_bug+263/276] print_usage_bug+0x107/0x114
> > > [    2.103278]  [mark_lock+491/924] mark_lock+0x1eb/0x39c
> > > [    2.103417]  [__lock_acquire+1140/2834] __lock_acquire+0x474/0xb12
> > > [    2.103547]  [restore_nocheck+18/21] ? restore_nocheck+0x12/0x15
> > > [    2.103740]  [native_sched_clock+231/255] ? native_sched_clock+0xe7/0xff
> > > [    2.103929]  [lock_acquire+106/144] lock_acquire+0x6a/0x90
> > > [    2.104065]  [sched_clock_idle_wakeup_event+67/116] ? sched_clock_idle_wakeup_event+0x43/0x74
> > > [    2.104255]  [_spin_lock+28/73] _spin_lock+0x1c/0x49
> > > [    2.104392]  [sched_clock_idle_wakeup_event+67/116] ? sched_clock_idle_wakeup_event+0x43/0x74
> > > [    2.104582]  [sched_clock_idle_wakeup_event+67/116] sched_clock_idle_wakeup_event+0x43/0x74
> > > [    2.104715]  [<f886c3b3>] acpi_idle_enter_simple+0x19a/0x21b [processor]
> > > [    2.104858]  [<f886bfa5>] acpi_idle_enter_bm+0xbe/0x332 [processor]
> > > [    2.104999]  [cpuidle_idle_call+99/143] cpuidle_idle_call+0x63/0x8f
> > > [    2.105135]  [cpuidle_idle_call+0/143] ? cpuidle_idle_call+0x0/0x8f
> > > [    2.105330]  [cpu_idle+182/214] cpu_idle+0xb6/0xd6
> > > [    2.105463]  [rest_init+73/75] rest_init+0x49/0x4b
> > > [    2.105596]  =======================
> > 
> > That's not good...
> > 
> > But I'm failing to see how IRQs get enabled between
> > sched_clock_idle_sleep_event() and sched_clock_idle_wakeup_event() in
> > acpi_idle_enter_simple().
> 
> If this point in Ted's bisection was without your idle irq fixes
> (that patch that added trace_hardirqs_on to __sti_mwait, amongst
> other things), would that account for it?

Ah, that might, I'll do a checkout of his specific revision to verify.
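
For readers following the report quoted above: lockdep complains because
the runqueue lock class was first seen taken in hardirq context (the
scheduler_tick path under do_IRQ), and is now being taken with hardirqs
enabled and outside hardirq context (HC0 and HE1 in the "takes:" line).
The Python toy model below sketches only that one rule.  It is not how
lockdep is implemented; LockClass and acquire() are hypothetical names
made up for this illustration.

```python
class LockClass:
    """Toy stand-in for a lock class: remembers how it has been used."""

    def __init__(self, name):
        self.name = name
        self.used_in_hardirq = False        # ever taken from hardirq context
        self.used_with_hardirqs_on = False  # ever taken with hardirqs enabled


def acquire(lock, in_hardirq, hardirqs_enabled):
    """Record one acquisition.

    Returns a message if this usage conflicts with an earlier one,
    otherwise None.
    """
    if in_hardirq:
        if lock.used_with_hardirqs_on:
            return "inconsistent {hardirq-on-W} -> {in-hardirq-W} usage"
        lock.used_in_hardirq = True
    elif hardirqs_enabled:
        if lock.used_in_hardirq:
            return "inconsistent {in-hardirq-W} -> {hardirq-on-W} usage"
        lock.used_with_hardirqs_on = True
    return None


rq_lock = LockClass("rq_lock")
# First sighting: taken from the timer interrupt, as in scheduler_tick().
first = acquire(rq_lock, in_hardirq=True, hardirqs_enabled=False)
# Second sighting: taken from the idle path with interrupts enabled.
second = acquire(rq_lock, in_hardirq=False, hardirqs_enabled=True)
print(first, "|", second)
```

The danger the rule guards against is a deadlock: if the interrupt fires
on the same CPU while the lock is held with hardirqs enabled, the handler
spins on a lock its own CPU already owns.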



* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-09 11:49         ` Peter Zijlstra
@ 2008-05-09 13:53           ` Theodore Tso
  0 siblings, 0 replies; 19+ messages in thread
From: Theodore Tso @ 2008-05-09 13:53 UTC
  To: Peter Zijlstra
  Cc: Hugh Dickins, Glauber Costa, Ingo Molnar, linux-kernel,
	Venki Pallipadi

On Fri, May 09, 2008 at 01:49:45PM +0200, Peter Zijlstra wrote:
> > > But I'm failing to see how IRQs get enabled between
> > > sched_clock_idle_sleep_event() and sched_clock_idle_wakeup_event() in
> > > acpi_idle_enter_simple().
> > 
> > If this point in Ted's bisection was without your idle irq fixes
> > (that patch that added trace_hardirqs_on to __sti_mwait, amongst
> > other things), would that account for it?
> 
> Ah, that might, I'll do a checkout of his specific revision to verify.
> 

I wouldn't worry about it too much, it's not there in 2.6.26-rc1 as
far as I can tell.  And it may or may not have anything to do with the
fact that my system locked up about 60 seconds afterwards, before I
had a chance to log in via X and do a suspend/resume cycle (since
that's the most reliable way to reproduce my regression).  I'm
travelling this week, but when I have time, I'll try disabling X,
doing a suspend/resume cycle, and then doing a kernel build -j3 and
see if I can provoke your symptoms.

					- Ted


* Re: Possible regression?  2.6.26-rc1: T61s failure after suspend/resume
  2008-05-08 21:48 ` Hugh Dickins
  2008-05-09  2:59   ` Theodore Tso
@ 2008-05-09 14:18   ` Carlos R. Mafra
  2008-05-09 15:28   ` Glauber Costa
  2 siblings, 0 replies; 19+ messages in thread
From: Carlos R. Mafra @ 2008-05-09 14:18 UTC
  To: Hugh Dickins; +Cc: Theodore Ts'o, Glauber Costa, Ingo Molnar, linux-kernel


On Thu  8.May'08 at 22:48:57 +0100, Hugh Dickins wrote:
> [...] 
> To reproduce the problem, I start off by building a kernel with
> make -j3 (from habit, perhaps with priming the pagecache in mind),
> then interrupt that around the time it gets to filemap.o, bootmem.o.
> I pm-suspend, close the lid, wait a few seconds, open the lid;
> make mrproper and start a make -j3 build again.  (Though the very
> first time I noticed the problem, it was a segfault in a git pull
> after resume.)
> 
> How quickly it goes bad varies a lot: often hangs right at the
> start while sedding stuff before getting down to the build itself.
> Often gets well into the build before gcc reports Real-time signal
> (most commonly 14 but others seen) killed cc1.  But my favourite,
> the most distinctive failure, is segfault (usually in sh or make)
> at 20295564 ip .....2f2 error 6 in ld-2.6.1.so (openSUSE 10.3).
> 
> Always 20295564; and objdumping ld-2.6.1.so shows 0x14 of that is
> just the offset from %edi, so the crucial address is 0x20295550.
> Which is "PU) ", though I've not found that string anywhere in
> the running vmlinux (but of course it does appear in kernel source).
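
The arithmetic in the quoted paragraph can be checked with a few lines
of Python.  This is only a sketch of the decoding step: the fault
address and the 0x14 offset come from Hugh's message, while the function
name is made up for illustration.

```python
import struct


def decode_fault_address(fault_addr, offset):
    """Subtract the instruction's offset and render the base as ASCII."""
    base = fault_addr - offset
    # x86 is little-endian, so the lowest-order byte is first in memory.
    raw = struct.pack("<I", base)
    return base, raw.decode("ascii", errors="replace")


base, text = decode_fault_address(0x20295564, 0x14)
print(hex(base), repr(text))  # prints: 0x20295550 'PU) '
```

A pointer whose bytes are all printable ASCII is a common sign that
string data was written over a pointer field.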

I have an issue similar to yours, Hugh.

I see a lot of segfaults when compiling the kernel with make -j3
after resuming from an "echo standby > /sys/power/state" on
my desktop.

For example,

sh[10355]: segfault at 80fd0a4 ip 0807a9a8 sp bfe56800 error 5 in bash[8048000+b4000]
make[12628]: segfault at 80593d0 ip 080593d0 sp bfd8087c error 5 in make[8048000+26000]
wmnet[3570]: segfault at 804c000 ip 0804c000 sp bffa5dd0 error 5 in wmnet[8048000+6000]
sh[13992]: segfault at 80b3030 ip 080b3030 sp bf8b050c error 5 in bash[8048000+b4000]
make[16300]: segfault at 804d230 ip 0804d230 sp bfcbf78c error 5 in make[8048000+26000]
sh[22895]: segfault at 80b75f0 ip 080b75f0 sp bfb3985c error 5 in bash[8048000+b4000]
sh[23507]: segfault at 80fd0a4 ip 0807a9a8 sp bffa27c0 error 5 in bash[8048000+b4000]
sh[23656]: segfault at 80fd0a4 ip 0807a9a8 sp bfb3eb70 error 5 in bash[8048000+b4000]
make[25753]: segfault at 805f4d0 ip 0805f4d0 sp bf9716cc error 5 in make[8048000+26000]
make[4857]: segfault at 804953c ip 0804953c sp bfe8fd6c error 5 in make[8048000+26000]
sh[15677]: segfault at 805bf18 ip 0805bf18 sp bf96b62c error 5 in bash[8048000+b4000]
make[15475]: segfault at 80496ac ip 080496ac sp bf92638c error 5 in make[8048000+26000]
gcc[19534]: segfault at 805ec3b ip 0805ec3b sp bfe99060 error 5 in gcc-4.2.3[8048000+2f000]
as[23425]: segfault at 8057920 ip 08057920 sp bfa91260 error 5 in as[8048000+46000]
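
Lines like these can be picked apart mechanically.  The sketch below,
in Python, parses the fields and decodes the low three bits of the x86
page-fault error code (bit 0: the page was present, bit 1: write access,
bit 2: user mode).  The function name is made up for illustration, and
it assumes the fields are printed in hex as in the log above.

```python
import re

SEGV_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def parse_segfault(line):
    """Return the decoded fields of one segfault line, or None."""
    m = SEGV_RE.match(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    ip = int(m.group("ip"), 16)
    err = int(m.group("err"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": addr,
        "ip": ip,
        # Fault address equal to ip means the instruction fetch itself faulted.
        "fault_on_fetch": addr == ip,
        "present": bool(err & 1),  # page was mapped: protection violation
        "write": bool(err & 2),    # the access was a write
        "user": bool(err & 4),     # the access came from user mode
    }


# Two lines copied from the log above.
SAMPLE = [
    "sh[10355]: segfault at 80fd0a4 ip 0807a9a8 sp bfe56800 error 5 in bash[8048000+b4000]",
    "make[12628]: segfault at 80593d0 ip 080593d0 sp bfd8087c error 5 in make[8048000+26000]",
]

for line in SAMPLE:
    print(parse_segfault(line))
```

In the log above, 11 of the 14 lines have the fault address equal to ip,
and every line carries error 5: a user-mode read of a page that was
present.  Hugh's signature differs in that it reports error 6, a
user-mode write to a page that was not present.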

I've just tried Ingo's suggestion of "nopat" but it did not solve my
problem. I think I will try to bisect it this weekend if I manage
to get some time.

For now I just wanted to report that I am seeing this problem too.  More
info about my desktop is here:
http://www.ift.unesp.br/users/crmafra/cfs-debug-info-2008.05.09-10.58.50


* Re: Possible regression? 2.6.26-rc1: T61s failure after suspend/resume
  2008-05-08 21:48 ` Hugh Dickins
  2008-05-09  2:59   ` Theodore Tso
  2008-05-09 14:18   ` Carlos R. Mafra
@ 2008-05-09 15:28   ` Glauber Costa
  2008-05-09 16:21     ` Hugh Dickins
  2 siblings, 1 reply; 19+ messages in thread
From: Glauber Costa @ 2008-05-09 15:28 UTC
  To: Hugh Dickins; +Cc: Theodore Ts'o, Glauber Costa, Ingo Molnar, linux-kernel

On Thu, May 8, 2008 at 6:48 PM, Hugh Dickins <hugh@veritas.com> wrote:
> On Thu, 8 May 2008, Theodore Ts'o wrote:
>  >
>  > I'm running a kernel based off of commit afa26be8 (just six commits
>  > after 2.6.26-rc1), and very shortly after I suspend/resume my X61s (with
>  > the Intel video chipset), the X server will lock up.  I can ssh into
>  > the machine remotely, and restart the X server, but the newly restarted
>  > X server will shortly lock up again, and the only way to solve the
>  > problem is to reboot.  If I drop back to a 2.6.25 based kernel, the
>  > problem goes away.
>  >
>  > I've tried bisecting it, but the bisection points picked by git don't
>  > boot at all, and given that I'm travelling I havent had much time to try
>  > doing more bisecting; since I know a number of kernel developers have
>  > Lenovo X61 laptops, I thought before I wasted more time trying to get
>  > the git bisection to work, I'd check to see if anyone has seen this
>  > problem and if the fix is known.  I'll also try the latest bleeding edge
>  > kernel and hope it's fixed there....
>
>  I don't have a Lenovo X61, and I've no problem on my uniprocessor T43p.
>  But I also have a Fujitsu Siemens Esprimo Mobile, Core2 Duo and Intel
>  graphics like yours, and that's been behaving strangely after resume
>  from RAM since somewhere between 2.6.25 and 2.6.26-rc1.
>
>  Sounds like it might be the same problem, though I quickly moved away
>  trying it with X, and have been trying to investigate just from the
>  console for some days now.  Weird memory corruption after resume.
>
>  Like you, little success with bisection: probably-other bugs get in
>  the way.  Some bisection points don't boot, some don't come back from
>  resume at all, some hang before getting to test.  When, as a working
>  hypothesis, I assumed that not coming back from resume might be the
>  same problem manifesting in the return from resume itself, and shifted
>  around bisection points a bit to avoid non-booting, then it arrived at
>
>  commit 4fe29a85642544503cf81e9cf251ef0f4e65b162
>  Author: Glauber de Oliveira Costa <gcosta@redhat.com>
>  Date:   Wed Mar 19 14:25:23 2008 -0300
>     x86: use specialized routine for setup per-cpu area
>
>  as the suspect commit.  But I couldn't see anything obviously wrong
>  with that; and it could well be no more guilty than shifting around
>  the kernel address space somewhat.  I've rather given up on the
>  bisection angle; and indeed, since found that how the problem
>  manifests varies somewhat from one day's git to another,
>  from one config to another.
>
>  It does not happen with maxcpus=1.  Yesterday it occurred to me
>  to try without CONFIG_PREEMPT=y; but reached no conclusion on that,
>  it turns out preemption has been somehow essential to resume from
>  RAM on this machine since before 2.6.25: clearly a separate issue.
>  And resume from RAM running 64-bit on it is also long problematic.
>
>  To reproduce the problem, I start off by building a kernel with
>  make -j3 (from habit, perhaps with priming the pagecache in mind),
>  then interrupt that around the time it gets to filemap.o, bootmem.o.
>  I pm-suspend, close the lid, wait a few seconds, open the lid;
>  make mrproper and start a make -j3 build again.  (Though the very
>  first time I noticed the problem, it was a segfault in a git pull
>  after resume.)
>
>  How quickly it goes bad varies a lot: often hangs right at the
>  start while sedding stuff before getting down to the build itself.
>  Often gets well into the build before gcc reports Real-time signal
>  (most commonly 14 but others seen) killed cc1.  But my favourite,
>  the most distinctive failure, is segfault (usually in sh or make)
>  at 20295564 ip .....2f2 error 6 in ld-2.6.1.so (openSUSE 10.3).
>
>  Always 20295564; and objdumping ld-2.6.1.so shows 0x14 of that is
>  just the offset from %edi, so the crucial address is 0x20295550.
>  Which is "PU) ", though I've not found that string anywhere in
>  the running vmlinux (but of course it does appear in kernel source).
>
>  Yesterday morning's git looked promising: because of the libata
>  70sec delay, I got diverted after the resume from RAM, left that
>  laptop idle, and found hald-something-or-other had come in every
>  few minutes and got that segfault at 20295564 (but with increasing
>  ip addresses: some address-space randomization effect, I suppose).
>  Well, I suppose it probably got run more often, but I'd only notice
>  the segfaulting ones.  So it can happen when close to idle; but
>  I've not been able to reproduce that since.
>
>  It's such a good signature, but I've failed to make progress with it.
>  Ted, please try doing the same (and check your logs for existing
>  segfault messages): let's see if you get the same number ;)
>  though I've no idea what it'd tell us.
>
>  Hugh
>

I can't reproduce it either, and looking at the code over and over
again, I see no obvious point of breakage. I'll try to reproduce it
myself, to see if I can spot something. But correct me if I'm wrong:
these are all 64-bit machines, right?

I'm stuck with mostly 32-bit hardware, but will give it a try anyway.

-- 
Glauber Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."


* Re: Possible regression? 2.6.26-rc1: T61s failure after suspend/resume
  2008-05-09 15:28   ` Glauber Costa
@ 2008-05-09 16:21     ` Hugh Dickins
  2008-05-09 16:26       ` Glauber Costa
  2008-05-09 16:47       ` Adrian Bunk
  0 siblings, 2 replies; 19+ messages in thread
From: Hugh Dickins @ 2008-05-09 16:21 UTC
  To: Glauber Costa
  Cc: Theodore Ts'o, Glauber Costa, Ingo Molnar, Carlos R. Mafra,
	linux-kernel

On Fri, 9 May 2008, Glauber Costa wrote:
> 
> I can't reproduce it either, and looking at the code over and over
> again, I see no obvious point of breakage. I'll try to reproduce it
> myself, to see if I can spot something. But correct me if I'm wrong:
> these are all 64-bit machines, right?
> 
> I'm stuck with mostly 32-bit hardware, but will give it a try anyway.

The machine is 64-bit capable (Core2 Duo), but the kernels I'm running
for this are 32-bit, so I doubt that the 64-bitability is relevant.
I'd love to see what happens with a 64-bit kernel, but I never get
back from suspend with it (and that's not a recent regression).
Carlos is also seeing this with a 32-bit kernel (on P4 Xeon with HT).

Please don't take my git bisection result too seriously: that's where
it led when I fudged things around enough, and treated blank screens
as manifestations of the problem, which very likely they're not
(there's some other bug which makes it very variable how quickly
I resume).  And also, I wasn't checking how many cpus came up each
time: I wouldn't be surprised if at some points in your series only
one would come up, which would then look like a "good" point to me.

Thanks,
Hugh
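
One way to avoid the "only one cpu came up" trap described above is to
record the CPU count at every bisection point before judging it good.
The Python sketch below counts the "processor" entries in the text of
/proc/cpuinfo.  The function names and the expected count of 2 (for a
Core2 Duo) are assumptions made for illustration.

```python
def count_cpus(cpuinfo_text):
    """Count 'processor : N' entries in the contents of /proc/cpuinfo."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def usable_bisect_point(cpuinfo_text, expected_cpus=2):
    """A point only counts as a valid test if all expected CPUs came up."""
    return count_cpus(cpuinfo_text) >= expected_cpus


# Trimmed example text in the /proc/cpuinfo layout.
EXAMPLE = "processor\t: 0\nmodel name\t: example\n\nprocessor\t: 1\nmodel name\t: example\n"
print(count_cpus(EXAMPLE), usable_bisect_point(EXAMPLE))
```

On a live system the text would come from reading /proc/cpuinfo after
boot and again after resume.  A point where the check fails is better
skipped than marked good.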


* Re: Possible regression? 2.6.26-rc1: T61s failure after suspend/resume
  2008-05-09 16:21     ` Hugh Dickins
@ 2008-05-09 16:26       ` Glauber Costa
  2008-05-09 16:47       ` Adrian Bunk
  1 sibling, 0 replies; 19+ messages in thread
From: Glauber Costa @ 2008-05-09 16:26 UTC
  To: Hugh Dickins
  Cc: Theodore Ts'o, Glauber Costa, Ingo Molnar, Carlos R. Mafra,
	linux-kernel

On Fri, May 9, 2008 at 1:21 PM, Hugh Dickins <hugh@veritas.com> wrote:
> On Fri, 9 May 2008, Glauber Costa wrote:
>  >
>  > I can't reproduce it either, and looking at the code over and over
>  > again, I see no obvious point of breakage. I'll try to reproduce it
>  > myself, to see if I can spot something. But correct me if I'm wrong:
>  > these are all 64-bit machines, right?
>  >
>  > I'm stuck with mostly 32-bit hardware, but will give it a try anyway.
>
>  The machine is 64-bit capable (Core2 Duo), but the kernels I'm running
>  for this are 32-bit, so I doubt that the 64-bitability is relevant.
>  I'd love to see what happens with a 64-bit kernel, but I never get
>  back from suspend with it (and that's not a recent regression).
>  Carlos is also seeing this with a 32-bit kernel (on P4 Xeon with HT).
>
>  Please don't take my git bisection result too seriously: that's where
>  it led when I fudged things around enough, and treated blank screens
>  as manifestations of the problem, which very likely they're not
>  (there's some other bug which makes it very variable how quickly
>  I resume).  And also, I wasn't checking how many cpus came up each
>  time: I wouldn't be surprised if at some points in your series only
>  one would come up, which would then look like a "good" point to me.
>
This is very unlikely. Precisely because I knew problems were likely to
arise in such a delicate area, I was extremely careful to make sure it
would not be an issue. But yes, ultimately, it can happen.


-- 
Glauber Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."


* Re: Possible regression? 2.6.26-rc1: T61s failure after suspend/resume
  2008-05-09 16:21     ` Hugh Dickins
  2008-05-09 16:26       ` Glauber Costa
@ 2008-05-09 16:47       ` Adrian Bunk
  1 sibling, 0 replies; 19+ messages in thread
From: Adrian Bunk @ 2008-05-09 16:47 UTC
  To: Hugh Dickins
  Cc: Glauber Costa, Theodore Ts'o, Glauber Costa, Ingo Molnar,
	Carlos R. Mafra, linux-kernel

On Fri, May 09, 2008 at 05:21:14PM +0100, Hugh Dickins wrote:
> On Fri, 9 May 2008, Glauber Costa wrote:
> > 
> > I can't reproduce it either, and looking at the code over and over
> > again, I see no obvious point of breakage. I'll try to reproduce it
> > myself, to see if I can spot something. But correct me if I'm wrong:
> > these are all 64-bit machines, right?
> > 
> > I'm stuck with mostly 32-bit hardware, but will give it a try anyway.
> 
> The machine is 64-bit capable (Core2 Duo), but the kernels I'm running
> for this are 32-bit, so I doubt that the 64-bitability is relevant.
> I'd love to see what happens with a 64-bit kernel, but I never get
> back from suspend with it (and that's not a recent regression).
> Carlos is also seeing this with a 32-bit kernel (on P4 Xeon with HT).
> 
> Please don't take my git bisection result too seriously: that's where
> it led when I fudged things around enough, and treated blank screens
> as manifestations of the problem, which very likely they're not
> (there's some other bug which makes it very variable how quickly
> I resume).  And also, I wasn't checking how many cpus came up each
> time: I wouldn't be surprised if at some points in your series only
> one would come up, which would then look like a "good" point to me.

For 64bit kernels this commit seems to be a nop, but not for
32bit kernels.

Looking at commit 4fe29a85642544503cf81e9cf251ef0f4e65b162 it contained 
two changes for 32bit SMP kernels:
- it enables the previously not enabled HAVE_SETUP_PER_CPU_AREA
- the following hunk:

--- a/arch/x86/kernel/smpboot_32.c
+++ b/arch/x86/kernel/smpboot_32.c
@@ -665,6 +665,7 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu)
 		unmap_cpu_to_logical_apicid(cpu);
 		cpu_clear(cpu, cpu_callout_map); /* was set here (do_boot_cpu()) */
 		cpu_clear(cpu, cpu_initialized); /* was set by cpu_init() */
+		cpu_clear(cpu, cpu_possible_map);
 		cpucount--;
 	} else {
 		per_cpu(x86_cpu_to_apicid, cpu) = apicid;
@@ -743,6 +744,7 @@ EXPORT_SYMBOL(xquad_portio);
 
 static void __init disable_smp(void)
 {
+	cpu_possible_map = cpumask_of_cpu(0);
 	smpboot_clear_io_apic_irqs();
 	phys_cpu_present_map = physid_mask_of_physid(0);
 	map_cpu_to_logical_apicid();



Before assuming the bisection pointed to a wrong commit it might be 
worth checking these.

BTW:
I don't understand why these were mixed into a commit that
moved much stuff previously only used on 64bit kernels from 
arch/x86/kernel/setup64.c to arch/x86/kernel/setup.c ...


> Thanks,
> Hugh

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed



* Re: Possible regression?  2.6.26-rc1: X61s failure after suspend/resume
  2008-05-09  9:49     ` Ingo Molnar
@ 2008-05-12  2:03       ` Theodore Tso
  2008-05-12 11:35         ` Theodore Tso
  2008-05-12 12:28         ` Theodore Tso
  0 siblings, 2 replies; 19+ messages in thread
From: Theodore Tso @ 2008-05-12  2:03 UTC
  To: Ingo Molnar; +Cc: Rafael J. Wysocki, linux-kernel, Jesse Barnes

On Fri, May 09, 2008 at 11:49:41AM +0200, Ingo Molnar wrote:
> 
> on the off chance that this might be related: could you try to boot with 
> nopat?

I did some more testing, and on my X61s laptop, using a somewhat more
recent snapshot (v2.6.26-rc1-434-g9662369 from Linus's tree Sunday
morning), I was able to reproduce the failure after a suspend/resume
without nopat, but after adding nopat, the problem seems to have gone
away.

I'll do some more testing, but it looks hopeful that this might be
another manifestation of

http://bugzilla.kernel.org/show_bug.cgi?id=10620

and booting with nopat works around the problem.

							- Ted


* Re: Possible regression?  2.6.26-rc1: X61s failure after suspend/resume
  2008-05-12  2:03       ` Theodore Tso
@ 2008-05-12 11:35         ` Theodore Tso
  2008-05-12 12:28         ` Theodore Tso
  1 sibling, 0 replies; 19+ messages in thread
From: Theodore Tso @ 2008-05-12 11:35 UTC
  To: Ingo Molnar, Rafael J. Wysocki, linux-kernel, Jesse Barnes

On Sun, May 11, 2008 at 10:03:01PM -0400, Theodore Tso wrote:
> On Fri, May 09, 2008 at 11:49:41AM +0200, Ingo Molnar wrote:
> > 
> > on the off chance that this might be related: could you try to boot with 
> > nopat?
> 
> I did some more testing, and on my X61s laptop, using a somewhat more
> recent snapshot (v2.6.26-rc1-434-g9662369 from Linus's tree Sunday
> morning), I was able to reproduce the failure after a suspend/resume
> without nopat, but after adding nopat, the problem seems to have gone
> away.

Actually, it looks like I spoke too quickly.  I'll do some more
testing, but it looks like either (a) I got lucky, or (b) it's a lot
harder to trigger with nopat.  In any case, after getting home and
docking my workstation, I started a kernel build, and approximately 15
minutes later, the X server hung with the same symptoms as before.

	       	     	    	      	  - Ted


* Re: Possible regression?  2.6.26-rc1: X61s failure after suspend/resume
  2008-05-12  2:03       ` Theodore Tso
  2008-05-12 11:35         ` Theodore Tso
@ 2008-05-12 12:28         ` Theodore Tso
  1 sibling, 0 replies; 19+ messages in thread
From: Theodore Tso @ 2008-05-12 12:28 UTC
  To: Ingo Molnar, Rafael J. Wysocki, linux-kernel, Jesse Barnes

On Sun, May 11, 2008 at 10:03:01PM -0400, Theodore Tso wrote:
> On Fri, May 09, 2008 at 11:49:41AM +0200, Ingo Molnar wrote:
> > 
> > on the off chance that this might be related: could you try to boot with 
> > nopat?

Hey Ingo, which bug/regression did you think my issue might be related
to?  I just looked up Kernel Bug #10620, which is what I had assumed
(from context) you were referring to, but looking at the bug, it
doesn't mention anything about nopat at all.

In any case, I'm doing some more testing to see if nopat really makes
the problem less likely to recur...

						- Ted
