Hang and Soft Lockup problems with generic time code


* Hang and Soft Lockup problems with generic time code
@ 2006-07-07 23:11 James Bottomley
  2006-07-07 23:39 ` john stultz
  0 siblings, 1 reply; 5+ messages in thread
From: James Bottomley @ 2006-07-07 23:11 UTC
  To: john stultz, Andrew Morton; +Cc: linux-kernel

Ever since the 2.6.17 kernel pulled in the generic timer code, I've been
experiencing hangs and softlockups with the aic94xx driver (which I
thought were driver related).  Finally, after a lot of debugging, I've
isolated the culprit to linux/time.h:timespec_add_ns().

What is happening is that a->tv_nsec is coming in here negative, and the
function then loops for a huge amount of time.
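
A minimal userspace sketch of how a negative tv_nsec can produce such a
long loop.  It assumes timespec_add_ns() adds tv_nsec to an unsigned
64-bit nanosecond count and then normalizes by subtracting one second at
a time; that is an assumption about the implementation, which this
thread does not show.  The helper name normalize_iterations is
hypothetical, and it computes the iteration count by division instead of
actually looping:

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper (not kernel code): returns how many passes a
 * "subtract NSEC_PER_SEC until ns < NSEC_PER_SEC" loop would need
 * after tv_nsec has been added to an unsigned 64-bit ns value.
 */
static uint64_t normalize_iterations(int64_t tv_nsec, uint64_t ns)
{
	/* A negative tv_nsec converts modulo 2^64, so the sum can wrap
	 * to a value just below 2^64 instead of going negative. */
	ns += (uint64_t)tv_nsec;

	/* Each loop pass removes one second, so the pass count is the
	 * integer quotient. */
	return ns / NSEC_PER_SEC;
}
```

With tv_nsec = -1000 and ns = 500, the sum wraps to just under 2^64 and
normalize_iterations() returns 18446744073, i.e. about 1.8e10 passes,
which would be enough to look like a hang.  With a non-negative tv_nsec
the count stays at 0 or 1 for inputs of this size.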

The reason tv_nsec is negative appears to be related to massive cycle
adjustments in kernel/timer.c:update_wall_time().  With the TSC as my
clocksource, I've seen clocksource_read() return increments in the
200s range.  No idea why this is happening.  The same strange
discontinuous jumps in cycle count also occur with pm_acpi as the clock
source.

I can't get a good enough handle on all the generic time code changes to
reverse them.  However, this machine is a P4, so I was able to boot it
with an x86_64 kernel (which doesn't yet use the generic time code) and
confirm that all the hangs and softlockups go away.

The machine in question is an IBM x206m dual core P4.

James


* Re: Hang and Soft Lockup problems with generic time code
  2006-07-07 23:11 Hang and Soft Lockup problems with generic time code James Bottomley
@ 2006-07-07 23:39 ` john stultz
  2006-07-08  4:36   ` James Bottomley
  0 siblings, 1 reply; 5+ messages in thread
From: john stultz @ 2006-07-07 23:39 UTC
  To: James Bottomley; +Cc: Andrew Morton, linux-kernel, Roman Zippel

On Fri, 2006-07-07 at 18:11 -0500, James Bottomley wrote:
> Ever since the 2.6.17 kernel pulled in the generic timer code, I've been
> experiencing hangs and softlockups with the aic94xx driver (which I
> thought were driver related).  Finally, after a lot of debugging I've
> isolated the culprit to linux/time.h:timespec_add_ns()
> 
> What is happening is that a->tv_nsec is coming in here negative and
> looping for huge amounts of time.

Yep. This has been seen where a large number of ticks are lost. Roman
and I are working on a solution for this (I sent a patch out to the list
earlier today for it, and Roman *just* posted his version a moment ago -
if you can give one or both of them a try it would be appreciated).

> The reason tv_nsec is negative appears to be related to massive cycle
> adjustments in kernel/timer.c:update_wall_time().  With the TSC as my
> clocksource, I've seen clocksource_read() return increments in the
> 200s range.  No idea why this is happening.  The same strange
> discontinuous jumps in cycle count also occur with pm_acpi as the clock
> source.

Did you really mean jumps of 200 seconds? Hmmm. The issue Roman and I
have been looking into does occur when we lose a number of ticks and
that confuses the clocksource adjustment code. The fix we're working on
corrects the adjustment confusion, but doesn't fix the lost ticks.

However, 200 seconds of lost ticks sounds very off. Could the driver be
disabling interrupts for such a long period of time?

> I can't get a good enough handle on all the generic time code changes to
> reverse them.  However, this machine is a P4, so I was able to boot it
> with an x86_64 kernel (which doesn't yet use the generic time code) and
> confirm that all the hangs and softlockups go away.
> 
> The machine in question is an IBM x206m dual core P4.

I appreciate the report and apologize for the trouble.

thanks
-john


* Re: Hang and Soft Lockup problems with generic time code
  2006-07-07 23:39 ` john stultz
@ 2006-07-08  4:36   ` James Bottomley
  2006-07-08 21:47     ` john stultz
  0 siblings, 1 reply; 5+ messages in thread
From: James Bottomley @ 2006-07-08  4:36 UTC
  To: john stultz; +Cc: Andrew Morton, linux-kernel, Roman Zippel

On Fri, 2006-07-07 at 16:39 -0700, john stultz wrote:
> Yep. This has been seen where a large number of ticks are lost. Roman
> and I are working on a solution for this (I sent a patch out to the
> list earlier today for it, and Roman *just* posted his version a moment
> ago - if you can give one or both of them a try it would be appreciated).

Well, the patch you posted here:

Message-ID: <1152298515.5330.12.camel@localhost.localdomain>

Seems to work fine, thanks.  I'm not sure what I'm looking for for the
other one.


> Did you really mean jumps of 200 seconds? Hmmm. The issue Roman and I
> have been looking into does occur when we lose a number of ticks and
> that confuses the clocksource adjustment code. The fix we're working
> on corrects the adjustment confusion, but doesn't fix the lost ticks.
> 
> However, 200 seconds of lost ticks sounds very off. Could the driver be
> disabling interrupts for such a long period of time?

Well, what I was seeing was that

clocksource_read(clock) - clock->cycle_last

is returning a value of about 200 x clock->cycle_interval, according to
the debugging printks I put into update_wall_time().  I was assuming
this was caused by a jump in the TSC count, but I suppose it could also
be caused by spurious alterations to cycle_last or other effects I
haven't traced.
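
The measurement described above can be sketched in plain C roughly as
follows.  This is an illustration only: cycle_t here is a local typedef
standing in for the kernel type, intervals_elapsed is a hypothetical
helper rather than anything in kernel/timer.c, and the numbers in the
usage note are made up.

```c
#include <stdint.h>

typedef uint64_t cycle_t;	/* local stand-in for the kernel's cycle_t */

/*
 * Hypothetical helper: given the current counter reading, the reading
 * recorded at the last update, and the number of cycles per tick
 * interval, return how many whole intervals have elapsed.
 */
static uint64_t intervals_elapsed(cycle_t now, cycle_t cycle_last,
				  cycle_t cycle_interval)
{
	/* Unsigned subtraction stays correct across counter wraparound. */
	cycle_t delta = now - cycle_last;

	if (cycle_interval == 0)	/* avoid dividing by zero */
		return 0;

	return delta / cycle_interval;
}
```

For example, intervals_elapsed(11000, 1000, 50) returns 200, the same
ratio as the one reported by the printks.  A result far above 1 in a
single update means either the counter jumped or cycle_last was not
advanced for many intervals; the arithmetic alone cannot distinguish
the two.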

James




* Re: Hang and Soft Lockup problems with generic time code
  2006-07-08  4:36   ` James Bottomley
@ 2006-07-08 21:47     ` john stultz
  2006-07-08 22:13       ` James Bottomley
  0 siblings, 1 reply; 5+ messages in thread
From: john stultz @ 2006-07-08 21:47 UTC
  To: James Bottomley; +Cc: Andrew Morton, linux-kernel, Roman Zippel

On Fri, 2006-07-07 at 23:36 -0500, James Bottomley wrote:
> > Did you really mean jumps of 200 seconds? Hmmm. The issue Roman and I
> > have been looking into does occur when we lose a number of ticks and
> > that confuses the clocksource adjustment code. The fix we're working
> > on corrects the adjustment confusion, but doesn't fix the lost ticks.
> > 
> > However, 200 seconds of lost ticks sounds very off. Could the driver be
> > disabling interrupts for such a long period of time?
> 
> Well, what I was seeing was that 
> 
> clocksource_read(clock) - clock->cycle_last
> 
> is returning a value about 200 x clock->cycle_interval

That would then be ~200 ticks. Is this at HZ=1000?

> According to the debugging printks I put into update_wall_time().  I was
> assuming this was caused by a jump in the TSC count, but I suppose it
> could also be caused by spurious alterations to cycle_last or other
> effects I haven't traced.

Since this issue affected both the TSC and the ACPI PM timer, I'd more
likely suspect something is holding off the timer interrupt. This could
be some kernel code like a driver, or it could be something like an SMI
from the BIOS.

thanks
-john



* Re: Hang and Soft Lockup problems with generic time code
  2006-07-08 21:47     ` john stultz
@ 2006-07-08 22:13       ` James Bottomley
  0 siblings, 0 replies; 5+ messages in thread
From: James Bottomley @ 2006-07-08 22:13 UTC
  To: john stultz; +Cc: Andrew Morton, linux-kernel, Roman Zippel

On Sat, 2006-07-08 at 14:47 -0700, john stultz wrote:
> > Well, what I was seeing was that 
> > 
> > clocksource_read(clock) - clock->cycle_last
> > 
> > is returning a value about 200 x clock->cycle_interval
> 
> That would then be ~200 ticks. Is this at HZ=1000?

no, 250.
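
For scale, the ~200-tick gap can be converted to wall-clock time with
simple arithmetic.  A small sketch, assuming one tick lasts 1/HZ
seconds; ticks_to_msecs is a hypothetical helper written for this
example, not a kernel function:

```c
/*
 * Hypothetical helper: convert a tick count to milliseconds for a
 * given tick rate (hz ticks per second).  Returns 0 for hz == 0.
 */
static unsigned long ticks_to_msecs(unsigned long ticks, unsigned int hz)
{
	if (hz == 0)	/* avoid dividing by zero */
		return 0;

	/* One tick is 1000/hz milliseconds; multiply first to keep
	 * precision with integer division. */
	return ticks * 1000UL / hz;
}
```

ticks_to_msecs(200, 250) gives 800, so 200 ticks at HZ=250 is about
0.8 seconds; at HZ=1000 the same count would be 200 ms.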

> > According to the debugging printks I put into update_wall_time().  I was
> > assuming this was caused by a jump in the TSC count, but I suppose it
> > could also be caused by spurious alterations to cycle_last or other
> > effects I haven't traced.
> 
> Since this issue affected both the TSC and the ACPI PM timer, I'd more
> likely suspect something is holding off the timer interrupt. This could
> be some kernel code like a driver, or it could be something like an SMI
> from the BIOS.

The driver takes only ~10s to insert and these cycle jumps occur within
that time frame, so it's not a real 200s.  The timer system has somehow
manufactured the cycle jump.

James



