All the mail mirrored from lore.kernel.org
* [uml-devel] Lockups with the fixed timer code :/
@ 2008-04-05 16:45 Nix
  2008-04-14 14:40 ` Jeff Dike
  0 siblings, 1 reply; 8+ messages in thread
From: Nix @ 2008-04-05 16:45 UTC
  To: user-mode-linux-devel; +Cc: Jeff Dike, Thomas Gleixner

The fixed timer patch you posted a few weeks back has indeed fixed my
select()-based timeout woes.

Unfortunately, both with the old kludgy approach and with the new
remain-versus-max estimator code, I see intermittent tight lockups of
the UML kernel-space ptrace thread, with that thread chewing all
available CPU time and the virtual machine, unsurprisingly, going
unresponsive. sysprof seems unwilling to extract anything out of this,
even though it has debugging info; gdb-attachment is somewhat more
informative. A few snapshots a few seconds apart in an instance that had
been looping for some time:

0x080836c2 in update_wall_time ()
0x080836b3 in update_wall_time ()
0x08083b32 in current_tick_length ()
0x08083644 in update_wall_time ()
0x0808368e in update_wall_time ()

I'm willing to bet that, in the loop in update_wall_time(), `offset' is
somehow going negative: and as it's an unsigned value that's a bit
problematic. Unfortunately I can't see any way that could happen, unless
something is smashing it or changing clock->cycle_interval and racing
with the loop conditional test.

I'll instrument that loop and try to figure it out.

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings
                                                   describes USENET

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Register now and save $200. Hurry, offer ends at 11:59 p.m., 
Monday, April 7! Use priority code J8TLD2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
User-mode-linux-devel mailing list
User-mode-linux-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/user-mode-linux-devel


* Re: [uml-devel] Lockups with the fixed timer code :/
  2008-04-05 16:45 [uml-devel] Lockups with the fixed timer code :/ Nix
@ 2008-04-14 14:40 ` Jeff Dike
  2008-04-15 19:47   ` Nix
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Dike @ 2008-04-14 14:40 UTC
  To: Nix; +Cc: Thomas Gleixner, user-mode-linux-devel

On Sat, Apr 05, 2008 at 05:45:39PM +0100, Nix wrote:
> The fixed timer patch you posted a few weeks back has indeed fixed my
> select()-based timeout woes.
> 
> Unfortunately, both with the old kludgy approach and with the new
> remain-versus-max estimator code, I see intermittent tight lockups of
> the UML kernel-space ptrace thread, with that thread chewing all
> available CPU time and the virtual machine, unsurprisingly, going
> unresponsive. 

Below is another patch for you to try.  I spent most of last week
chasing this one.  The symptoms are somewhat similar to yours -
intermittent UML hangs, although not with UML spinning, and it still
pings.

The problem is NTP adjusting the multiplier part of the clock-provided
cycles-to-ns conversion function.  UML pretended to have a ns clock,
with a multiplier of 1.  When NTP adjusted that down in order to slow
down the clock, that became 0, and time stopped.

The fix below is to switch to a usec clock, with a multiplier of 1000,
which can be adjusted with much more granularity.

      	     	      Jeff

-- 
Work email - jdike at linux dot intel dot com

Index: linux-2.6.22/arch/um/kernel/time.c
===================================================================
--- linux-2.6.22.orig/arch/um/kernel/time.c	2008-04-10 12:53:32.000000000 -0400
+++ linux-2.6.22/arch/um/kernel/time.c	2008-04-14 10:30:00.000000000 -0400
@@ -75,7 +75,7 @@ static irqreturn_t um_timer(int irq, voi
 
 static cycle_t itimer_read(void)
 {
-	return os_nsecs();
+	return os_nsecs() / 1000;
 }
 
 static struct clocksource itimer_clocksource = {
@@ -83,7 +83,7 @@ static struct clocksource itimer_clockso
 	.rating		= 300,
 	.read		= itimer_read,
 	.mask		= CLOCKSOURCE_MASK(64),
-	.mult		= 1,
+	.mult		= 1000,
 	.shift		= 0,
 	.flags		= CLOCK_SOURCE_IS_CONTINUOUS,
 };


* Re: [uml-devel] Lockups with the fixed timer code :/
  2008-04-14 14:40 ` Jeff Dike
@ 2008-04-15 19:47   ` Nix
  2008-04-15 20:59     ` Ryan Finnie
  2008-04-16 19:44     ` Nix
  0 siblings, 2 replies; 8+ messages in thread
From: Nix @ 2008-04-15 19:47 UTC
  To: Jeff Dike; +Cc: Thomas Gleixner, user-mode-linux-devel

On 14 Apr 2008, Jeff Dike spake thusly:
> Below is another patch for you to try.  I spent most of last week
> chasing this one.  The symptoms are somewhat similar to yours -
> intermittent UML hangs, although not with UML spinning, and it still
> pings.

Having not quite the same symptoms is interesting.

I added instrumentation to spot huge offsets on the off chance that it
was going negative... whereupon it ceased to hang for an entire week
after a week of dying multiple times a day :/

> The problem is NTP adjusting the multiplier part of the clock-provided
> cycles-to-ns conversion function.  UML pretended to have a ns clock,
> with a multiplier of 1.  When NTP adjusted that down in order to slow
> down the clock, that became 0, and time stopped.

Oh, I didn't think that cycle_interval might be *variable*. So much for
my instrumentation hack then. :)

... hang on, wouldn't this only happen if you had NTP running on your
guest? (I don't. For that matter, *why* would you have NTP running on
your guest? The guest just gets the time of day from the host anyway!)

(And why does the clocksource system allow NTP to reduce the clock
interval so far that it becomes zero anyway? Shouldn't this be barred?
What am I missing?)

> The fix below is to switch to a usec clock, with a multiplier of 1000,
> which can be adjusted with much more granularity.

OK, trying that. (I'll extend the instrumentation patch to watch for
zero cycle_interval as well, and see what happens. With luck nothing
will happen except that the crashes will stop... except that they
already *have* stopped for me. Annoying.)

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings
                                                   describes USENET


* Re: [uml-devel] Lockups with the fixed timer code :/
  2008-04-15 19:47   ` Nix
@ 2008-04-15 20:59     ` Ryan Finnie
  2008-04-16 19:44     ` Nix
  1 sibling, 0 replies; 8+ messages in thread
From: Ryan Finnie @ 2008-04-15 20:59 UTC
  To: Nix; +Cc: Jeff Dike, Thomas Gleixner, user-mode-linux-devel

On Tue, Apr 15, 2008 at 12:47 PM, Nix <nix@esperi.org.uk> wrote:
>  > The problem is NTP adjusting the multiplier part of the clock-provided
>  > cycles-to-ns conversion function.  UML pretended to have a ns clock,
>  > with a multiplier of 1.  When NTP adjusted that down in order to slow
>  > down the clock, that became 0, and time stopped.
>
>  ... hang on, wouldn't this only happen if you had NTP running on your
>  guest? (I don't. For that matter, *why* would you have NTP running on
>  your guest? The guest just gets the time of day from the host anyway!)

Yes, ntpd would be required to trigger it (or anything else that adjusts
mult, but I can't think of anything other than ntpd that would).

(Anyway, this would be my doing; I was the only one who reported the
hanging a few weeks ago, and Jeff had been banging on my test
server, where with a dozen or so guests running concurrently, at least
one of them would trigger within an hour.  And the reason I have ntpd
running on UML guests is I use the same deployment script for both UML
guests and physical dedicated servers.)

RF


* Re: [uml-devel] Lockups with the fixed timer code :/
  2008-04-15 19:47   ` Nix
  2008-04-15 20:59     ` Ryan Finnie
@ 2008-04-16 19:44     ` Nix
  2008-04-24 19:57       ` Nix
  1 sibling, 1 reply; 8+ messages in thread
From: Nix @ 2008-04-16 19:44 UTC
  To: Jeff Dike; +Cc: Thomas Gleixner, user-mode-linux-devel

On 15 Apr 2008, nix@esperi.org.uk said:
> OK, trying that. (I'll extend the instrumentation patch to watch for
> zero cycle_interval as well, and see what happens. With luck nothing
> will happen except that the crashes will stop... except that they
> already *have* stopped for me. Annoying.)

Hard-spinning again, inside gettimeofday() inside os_nsecs(). (The stack
above that point is corrupt, or at least gdb 6.7.1 won't show
it. Nonetheless if it's not stuck inside idle_sleep() I'll eat my
hat. Note that I am *not* running CONFIG_NO_HZ because, hah, I thought
that as rather new code it might be unstable...)

This *is* a change in behaviour: the backtrace is different! Yay! :)

-- 
`The rest is a tale of post and counter-post.' --- Ian Rawlings
                                                   describes USENET


* Re: [uml-devel] Lockups with the fixed timer code :/
  2008-04-16 19:44     ` Nix
@ 2008-04-24 19:57       ` Nix
  2008-04-24 20:37         ` Jeff Dike
  0 siblings, 1 reply; 8+ messages in thread
From: Nix @ 2008-04-24 19:57 UTC
  To: Jeff Dike; +Cc: Thomas Gleixner, user-mode-linux-devel

On 16 Apr 2008, nix@esperi.org.uk stated:
> This *is* a change in behaviour: the backtrace is different! Yay! :)

I upgraded the guest to 2.6.25 a week ago and it stopped happening.
There is hope.  (Mind you it's stopped going wrong for week-long periods
before...)

-- 
`If you are having a "ua luea luea le ua le" kind of day, I can only
 assume that you are doing no work due [to] incapacitating nausea caused 
 by numerous lazy demons.' --- Frossie


* Re: [uml-devel] Lockups with the fixed timer code :/
  2008-04-24 19:57       ` Nix
@ 2008-04-24 20:37         ` Jeff Dike
  2008-04-24 22:18           ` Nix
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Dike @ 2008-04-24 20:37 UTC
  To: Nix; +Cc: Thomas Gleixner, user-mode-linux-devel

On Thu, Apr 24, 2008 at 08:57:54PM +0100, Nix wrote:
> On 16 Apr 2008, nix@esperi.org.uk stated:
> > This *is* a change in behaviour: the backtrace is different! Yay! :)
> 
> I upgraded the guest to 2.6.25 a week ago and it stopped happening.
> There is hope.  (Mind you it's stopped going wrong for week-long periods
> before...)

OK, yell if it starts happening again...

    	       Jeff

-- 
Work email - jdike at linux dot intel dot com


* Re: [uml-devel] Lockups with the fixed timer code :/
  2008-04-24 20:37         ` Jeff Dike
@ 2008-04-24 22:18           ` Nix
  0 siblings, 0 replies; 8+ messages in thread
From: Nix @ 2008-04-24 22:18 UTC
  To: Jeff Dike; +Cc: Thomas Gleixner, user-mode-linux-devel

On 24 Apr 2008, Jeff Dike uttered the following:
> OK, yell if it starts happening again...

I tempted fate, and lo, it happens within the hour...

(gdb) bt
#0  0x08083e3a in getnstimeofday ()

I will now do what I should have done long since and turn on frame
pointers and debugging info so I can get a damn backtrace. (I really
thought that was on, but apparently not...)


One 2.6.25 improvement: in 2.6.24, whenever this happened the TUN/TAP
driver seemed to get some sort of nasty lock taken out, all future
packets sent over it vanished into the aether rather than going over the
(virtual) ether(net), and I had to reboot the host to fix it. No
longer. :)

