LKML Archive mirror
 help / color / mirror / Atom feed
* [patch] xenfb: fix xenfb suspend/resume race
@ 2010-12-30 12:56 Joe Jin
  2010-12-30 16:40 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Jin @ 2010-12-30 12:56 UTC (permalink / raw
  To: jeremy, ian.campbell, Andrew Morton
  Cc: linux-fbdev, xen-devel, linux-kernel, gurudas.pai, greg.marsden,
	guru.anbalagane, joe.jin

Hi,

when do migration test, we hit the panic as below:
<1>BUG: unable to handle kernel paging request at 0000000b819fdb98
<1>IP: [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
<4>PGD 94b10067 PUD 0
<0>Oops: 0000 [#1] SMP
<0>last sysfs file: /sys/class/misc/autofs/dev
<4>CPU 3
<4>Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U)
auth_rpcgss(U) rfcomm(U) l2cap(U) bluetooth(U) rfkill(U) lockd(U) sunrpc(U)
nf_conntrack_netbios_ns(U) ipt_REJECT(U) nf_conntrack_ipv4(U)
nf_defrag_ipv4(U) xt_state(U) nf_conntrack(U) iptable_filter(U) ip_tables(U)
ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U)
ipv6(U) parport_pc(U) lp(U) parport(U) snd_seq_dummy(U) snd_seq_oss(U)
snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U)
snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U)
snd_page_alloc(U) joydev(U) xen_netfront(U) pcspkr(U) xen_blkfront(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 18, comm: events/3 Not tainted 2.6.32
RIP: e030:[<ffffffff812a588f>]  [<ffffffff812a588f>]
ify_remote_via_irq+0x13/0x34
RSP: e02b:ffff8800e7bf7bd0  EFLAGS: 00010202
RAX: ffff8800e61c8000 RBX: ffff8800e62f82c0 RCX: 0000000000000000
RDX: 00000000000001e3 RSI: ffff8800e7bf7c68 RDI: 0000000bfffffff4
RBP: ffff8800e7bf7be0 R08: 00000000000001e2 R09: ffff8800e62f82c0
R10: 0000000000000001 R11: ffff8800e6386110 R12: 0000000000000000
R13: 0000000000000007 R14: ffff8800e62f82e0 R15: 0000000000000240
FS:  00007f409d3906e0(0000) GS:ffff8800028b8000(0000)
GS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000b819fdb98 CR3: 000000003ee3b000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process events/3 (pid: 18, threadinfo ffff8800e7bf6000, task
f8800e7bf4540)
Stack:
 0000000000000200 ffff8800e61c8000 ffff8800e7bf7c00 ffffffff812712c9
<0> ffffffff8100ea5f ffffffff81438d80 ffff8800e7bf7cd0 ffffffff812714ee
<0> 0000000000000000 ffffffff81270568 000000000000e030 0000000000010202
Call Trace:
 [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
 [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
 [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
 [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
 [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
 [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
 [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
 [<ffffffff81071812>] worker_thread+0x14d/0x1ed
 [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
 [<ffffffff810756e3>] kthread+0x6e/0x76
 [<ffffffff81012dea>] child_rip+0xa/0x20
 [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012de0>] ? child_rip+0x0/0x20
Code: 6b ff 0c 8b 87 a4 db 9f 81 66 85 c0 74 08 0f b7 f8 e8 3b ff ff ff c9
c3 55 48 89 e5 48 83 ec 10 0f 1f 44 00 00 89 ff 48 6b ff 0c <8b> 87 a4 db 9f
81 66 85 c0 74 14 48 8d 75 f0 0f b7 c0 bf 04 00
RIP  [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
 RSP <ffff8800e7bf7bd0>
CR2: 0000000b819fdb98
---[ end trace 098b4b74827595d0 ]---
Kernel panic - not syncing: Fatal exception
Pid: 18, comm: events/3 Tainted: G      D    2.6.32
Call Trace:
 [<ffffffff812a029e>] ? card_probe+0x99/0x123
 [<ffffffff81056a96>] panic+0xa5/0x162
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff81079824>] ? down_trylock+0x30/0x38
 [<ffffffff812a029e>] ? card_probe+0x99/0x123
 [<ffffffff8105744c>] ? console_unblank+0x23/0x6f
 [<ffffffff81056763>] ? print_oops_end_marker+0x23/0x25
 [<ffffffff812a029e>] ? card_probe+0x99/0x123
 [<ffffffff81439c76>] oops_end+0xb7/0xc7
 [<ffffffff810366de>] no_context+0x1f1/0x200
 [<ffffffff812a029e>] ? card_probe+0x99/0x123
 [<ffffffff81036931>] __bad_area_nosemaphore+0x183/0x1a6
 [<ffffffff812af119>] ? extract_buf+0xbd/0x134
 [<ffffffff81030c7b>] ? pvclock_clocksource_read+0x47/0x9e
 [<ffffffff810369de>] bad_area_nosemaphore+0x13/0x15
 [<ffffffff8143b0ed>] do_page_fault+0x147/0x26c
 [<ffffffff81439185>] page_fault+0x25/0x30
 [<ffffffff812a588f>] ? notify_remote_via_irq+0x13/0x34
 [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
 [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
 [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
 [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
 [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
 [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
 [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
 [<ffffffff81071812>] worker_thread+0x14d/0x1ed
 [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
 [<ffffffff810756e3>] kthread+0x6e/0x76
 [<ffffffff81012dea>] child_rip+0xa/0x20
 [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012de0>] ? child_rip+0x0/0x20
 [<ffffffff81012de0>] ? child_rip+0x0/0x20

Check the source found this maybe caused by kernel tried to used not ready
xenfb when resume.

Below is the potential fix, please reivew it

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andrew Morton <akpm@linux-foundation.org>

---
 xen-fbfront.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
index dc72563..367fb1c 100644
--- a/drivers/video/xen-fbfront.c
+++ b/drivers/video/xen-fbfront.c
@@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
 static int xenfb_connect_backend(struct xenbus_device *dev,
 				 struct xenfb_info *info)
 {
-	int ret, evtchn;
+	int ret, evtchn, irq;
 	struct xenbus_transaction xbt;
 
 	ret = xenbus_alloc_evtchn(dev, &evtchn);
 	if (ret)
 		return ret;
-	ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
+	irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
 					0, dev->devicetype, info);
-	if (ret < 0) {
+	if (irq < 0) {
 		xenbus_free_evtchn(dev, evtchn);
 		xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
-		return ret;
+		return irq;
 	}
-	info->irq = ret;
-
  again:
 	ret = xenbus_transaction_start(&xbt);
 	if (ret) {
 		xenbus_dev_fatal(dev, ret, "starting transaction");
-		return ret;
+		goto unbind_irq;
 	}
 	ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
 			    virt_to_mfn(info->page));
@@ -602,15 +600,20 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
 		if (ret == -EAGAIN)
 			goto again;
 		xenbus_dev_fatal(dev, ret, "completing transaction");
-		return ret;
+		goto unbind_irq;
 	}
 
 	xenbus_switch_state(dev, XenbusStateInitialised);
+	info->irq = irq;
 	return 0;
 
  error_xenbus:
 	xenbus_transaction_end(xbt, 1);
 	xenbus_dev_fatal(dev, ret, "writing xenstore");
+ unbind_irq:
+	printk(KERN_ERR "xenfb_connect_backend failed!\n");
+	unbind_from_irqhandler(irq, info);
+	xenbus_free_evtchn(dev, evtchn);
 	return ret;
 }
 


-- 
Oracle <http://www.oracle.com>
Joe Jin | Team Leader, Software Development | +8610.8278.6295
ORACLE | Linux and Virtualization
Incubator Building 2-A ZPark | Beijing China, 100094



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [patch] xenfb: fix xenfb suspend/resume race
  2010-12-30 12:56 [patch] xenfb: fix xenfb suspend/resume race Joe Jin
@ 2010-12-30 16:40 ` Konrad Rzeszutek Wilk
  2010-12-31  0:56   ` Joe Jin
  2011-01-04 11:15   ` Ian Campbell
  0 siblings, 2 replies; 14+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-12-30 16:40 UTC (permalink / raw
  To: Joe Jin
  Cc: jeremy, ian.campbell, Andrew Morton, linux-fbdev, xen-devel,
	linux-kernel, gurudas.pai, greg.marsden, guru.anbalagane

On Thu, Dec 30, 2010 at 08:56:16PM +0800, Joe Jin wrote:
> Hi,

Joe,

Patch looks good, however..

I am unclear from your description whether the patch fixes
the problem (I would presume so). Or does it take a long time
to hit this race?

> 
> when do migration test, we hit the panic as below:
> <1>BUG: unable to handle kernel paging request at 0000000b819fdb98
> <1>IP: [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
> <4>PGD 94b10067 PUD 0
> <0>Oops: 0000 [#1] SMP
> <0>last sysfs file: /sys/class/misc/autofs/dev
> <4>CPU 3
> <4>Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U)
> auth_rpcgss(U) rfcomm(U) l2cap(U) bluetooth(U) rfkill(U) lockd(U) sunrpc(U)
> nf_conntrack_netbios_ns(U) ipt_REJECT(U) nf_conntrack_ipv4(U)
> nf_defrag_ipv4(U) xt_state(U) nf_conntrack(U) iptable_filter(U) ip_tables(U)
> ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U)
> ipv6(U) parport_pc(U) lp(U) parport(U) snd_seq_dummy(U) snd_seq_oss(U)
> snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U)
> snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U)
> snd_page_alloc(U) joydev(U) xen_netfront(U) pcspkr(U) xen_blkfront(U)
> uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> Pid: 18, comm: events/3 Not tainted 2.6.32
> RIP: e030:[<ffffffff812a588f>]  [<ffffffff812a588f>]
> ify_remote_via_irq+0x13/0x34
> RSP: e02b:ffff8800e7bf7bd0  EFLAGS: 00010202
> RAX: ffff8800e61c8000 RBX: ffff8800e62f82c0 RCX: 0000000000000000
> RDX: 00000000000001e3 RSI: ffff8800e7bf7c68 RDI: 0000000bfffffff4
> RBP: ffff8800e7bf7be0 R08: 00000000000001e2 R09: ffff8800e62f82c0
> R10: 0000000000000001 R11: ffff8800e6386110 R12: 0000000000000000
> R13: 0000000000000007 R14: ffff8800e62f82e0 R15: 0000000000000240
> FS:  00007f409d3906e0(0000) GS:ffff8800028b8000(0000)
> GS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000b819fdb98 CR3: 000000003ee3b000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process events/3 (pid: 18, threadinfo ffff8800e7bf6000, task
> f8800e7bf4540)
> Stack:
>  0000000000000200 ffff8800e61c8000 ffff8800e7bf7c00 ffffffff812712c9
> <0> ffffffff8100ea5f ffffffff81438d80 ffff8800e7bf7cd0 ffffffff812714ee
> <0> 0000000000000000 ffffffff81270568 000000000000e030 0000000000010202
> Call Trace:
>  [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
>  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
>  [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
>  [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
>  [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
>  [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
>  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
>  [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
>  [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
>  [<ffffffff81071812>] worker_thread+0x14d/0x1ed
>  [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
>  [<ffffffff810756e3>] kthread+0x6e/0x76
>  [<ffffffff81012dea>] child_rip+0xa/0x20
>  [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81012de0>] ? child_rip+0x0/0x20
> Code: 6b ff 0c 8b 87 a4 db 9f 81 66 85 c0 74 08 0f b7 f8 e8 3b ff ff ff c9
> c3 55 48 89 e5 48 83 ec 10 0f 1f 44 00 00 89 ff 48 6b ff 0c <8b> 87 a4 db 9f
> 81 66 85 c0 74 14 48 8d 75 f0 0f b7 c0 bf 04 00
> RIP  [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
>  RSP <ffff8800e7bf7bd0>
> CR2: 0000000b819fdb98
> ---[ end trace 098b4b74827595d0 ]---
> Kernel panic - not syncing: Fatal exception
> Pid: 18, comm: events/3 Tainted: G      D    2.6.32
> Call Trace:
>  [<ffffffff812a029e>] ? card_probe+0x99/0x123
>  [<ffffffff81056a96>] panic+0xa5/0x162
>  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff81079824>] ? down_trylock+0x30/0x38
>  [<ffffffff812a029e>] ? card_probe+0x99/0x123
>  [<ffffffff8105744c>] ? console_unblank+0x23/0x6f
>  [<ffffffff81056763>] ? print_oops_end_marker+0x23/0x25
>  [<ffffffff812a029e>] ? card_probe+0x99/0x123
>  [<ffffffff81439c76>] oops_end+0xb7/0xc7
>  [<ffffffff810366de>] no_context+0x1f1/0x200
>  [<ffffffff812a029e>] ? card_probe+0x99/0x123
>  [<ffffffff81036931>] __bad_area_nosemaphore+0x183/0x1a6
>  [<ffffffff812af119>] ? extract_buf+0xbd/0x134
>  [<ffffffff81030c7b>] ? pvclock_clocksource_read+0x47/0x9e
>  [<ffffffff810369de>] bad_area_nosemaphore+0x13/0x15
>  [<ffffffff8143b0ed>] do_page_fault+0x147/0x26c
>  [<ffffffff81439185>] page_fault+0x25/0x30
>  [<ffffffff812a588f>] ? notify_remote_via_irq+0x13/0x34
>  [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
>  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
>  [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
>  [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
>  [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
>  [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
>  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
>  [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
>  [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
>  [<ffffffff81071812>] worker_thread+0x14d/0x1ed
>  [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
>  [<ffffffff810756e3>] kthread+0x6e/0x76
>  [<ffffffff81012dea>] child_rip+0xa/0x20
>  [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81012de0>] ? child_rip+0x0/0x20
>  [<ffffffff81012de0>] ? child_rip+0x0/0x20
> 
> Check the source found this maybe caused by kernel tried to used not ready
> xenfb when resume.
> 
> Below is the potential fix, please reivew it
> 
> Signed-off-by: Joe Jin <joe.jin@oracle.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Jeremy Fitzhardinge <jeremy@goop.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> 
> ---
>  xen-fbfront.c |   19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
> index dc72563..367fb1c 100644
> --- a/drivers/video/xen-fbfront.c
> +++ b/drivers/video/xen-fbfront.c
> @@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
>  static int xenfb_connect_backend(struct xenbus_device *dev,
>  				 struct xenfb_info *info)
>  {
> -	int ret, evtchn;
> +	int ret, evtchn, irq;
>  	struct xenbus_transaction xbt;
>  
>  	ret = xenbus_alloc_evtchn(dev, &evtchn);
>  	if (ret)
>  		return ret;
> -	ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
> +	irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
>  					0, dev->devicetype, info);
> -	if (ret < 0) {
> +	if (irq < 0) {
>  		xenbus_free_evtchn(dev, evtchn);
>  		xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
> -		return ret;
> +		return irq;
>  	}
> -	info->irq = ret;
> -
>   again:
>  	ret = xenbus_transaction_start(&xbt);
>  	if (ret) {
>  		xenbus_dev_fatal(dev, ret, "starting transaction");
> -		return ret;
> +		goto unbind_irq;
>  	}
>  	ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
>  			    virt_to_mfn(info->page));
> @@ -602,15 +600,20 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
>  		if (ret == -EAGAIN)
>  			goto again;
>  		xenbus_dev_fatal(dev, ret, "completing transaction");
> -		return ret;
> +		goto unbind_irq;
>  	}
>  
>  	xenbus_switch_state(dev, XenbusStateInitialised);
> +	info->irq = irq;
>  	return 0;
>  
>   error_xenbus:
>  	xenbus_transaction_end(xbt, 1);
>  	xenbus_dev_fatal(dev, ret, "writing xenstore");
> + unbind_irq:
> +	printk(KERN_ERR "xenfb_connect_backend failed!\n");
> +	unbind_from_irqhandler(irq, info);
> +	xenbus_free_evtchn(dev, evtchn);
>  	return ret;
>  }
>  
> 
> 
> -- 
> Oracle <http://www.oracle.com>
> Joe Jin | Team Leader, Software Development | +8610.8278.6295
> ORACLE | Linux and Virtualization
> Incubator Building 2-A ZPark | Beijing China, 100094
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] xenfb: fix xenfb suspend/resume race
  2010-12-30 16:40 ` Konrad Rzeszutek Wilk
@ 2010-12-31  0:56   ` Joe Jin
  2011-01-03 16:34     ` [Xen-devel] " Konrad Rzeszutek Wilk
  2011-01-04 11:15   ` Ian Campbell
  1 sibling, 1 reply; 14+ messages in thread
From: Joe Jin @ 2010-12-31  0:56 UTC (permalink / raw
  To: Konrad Rzeszutek Wilk
  Cc: jeremy, ian.campbell, Andrew Morton, linux-fbdev, xen-devel,
	linux-kernel, gurudas.pai, greg.marsden, guru.anbalagane

On 12/31/10 00:40, Konrad Rzeszutek Wilk wrote:
> On Thu, Dec 30, 2010 at 08:56:16PM +0800, Joe Jin wrote:
>> Hi,
> 
> Joe,
> 
> Patch looks good, however..
> 
> I am unclear from your description whether the patch fixes
> the problem (I would presume so). Or does it take a long time
> to hit this race?
> 
Yes, more than 100 migrations. we hit this issue around 3 times.

I dumped vmcore when guest crashed, from vmcore everything
looked good, fb_info, xenfb_info and so on.

Checked the calltrace I suspected when guest resuming, the process kevent
scheduled and refresh xenfb. look like when call notify_remote_via_irq(),  better
to confirm irq is valid? 

Please review new patch.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
 
---
 video/xen-fbfront.c |   19 +++++++++++--------
 xen/events.c        |    2 ++
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
index dc72563..367fb1c 100644
--- a/drivers/video/xen-fbfront.c
+++ b/drivers/video/xen-fbfront.c
@@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
 static int xenfb_connect_backend(struct xenbus_device *dev,
 				 struct xenfb_info *info)
 {
-	int ret, evtchn;
+	int ret, evtchn, irq;
 	struct xenbus_transaction xbt;
 
 	ret = xenbus_alloc_evtchn(dev, &evtchn);
 	if (ret)
 		return ret;
-	ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
+	irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
 					0, dev->devicetype, info);
-	if (ret < 0) {
+	if (irq < 0) {
 		xenbus_free_evtchn(dev, evtchn);
 		xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
-		return ret;
+		return irq;
 	}
-	info->irq = ret;
-
  again:
 	ret = xenbus_transaction_start(&xbt);
 	if (ret) {
 		xenbus_dev_fatal(dev, ret, "starting transaction");
-		return ret;
+		goto unbind_irq;
 	}
 	ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
 			    virt_to_mfn(info->page));
@@ -602,15 +600,20 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
 		if (ret == -EAGAIN)
 			goto again;
 		xenbus_dev_fatal(dev, ret, "completing transaction");
-		return ret;
+		goto unbind_irq;
 	}
 
 	xenbus_switch_state(dev, XenbusStateInitialised);
+	info->irq = irq;
 	return 0;
 
  error_xenbus:
 	xenbus_transaction_end(xbt, 1);
 	xenbus_dev_fatal(dev, ret, "writing xenstore");
+ unbind_irq:
+	printk(KERN_ERR "xenfb_connect_backend failed!\n");
+	unbind_from_irqhandler(irq, info);
+	xenbus_free_evtchn(dev, evtchn);
 	return ret;
 }
 
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
index ac7b42f..4cfb5e2 100644
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -175,6 +175,8 @@ static struct irq_info *info_for_irq(unsigned irq)
 
 static unsigned int evtchn_from_irq(unsigned irq)
 {
+	if (unlikely(irq < 0 || irq >= nr_irqs))
+		return 0;
 	return info_for_irq(irq)->evtchn;
 }
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [Xen-devel] Re: [patch] xenfb: fix xenfb suspend/resume race
  2010-12-31  0:56   ` Joe Jin
@ 2011-01-03 16:34     ` Konrad Rzeszutek Wilk
  2011-01-04  0:34       ` Joe Jin
  0 siblings, 1 reply; 14+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-03 16:34 UTC (permalink / raw
  To: Joe Jin
  Cc: jeremy, xen-devel, ian.campbell, gurudas.pai, guru.anbalagane,
	greg.marsden, linux-kernel, linux-fbdev, Andrew Morton

> > I am unclear from your description whether the patch fixes
> > the problem (I would presume so). Or does it take a long time
> > to hit this race?
> > 
> Yes, more than 100 migrations. we hit this issue around 3 times.

OK, so you are still trying to find the culprit.

Did you look at this patch from Ian:

https://patchwork.kernel.org/patch/403192/

?
> 
> I dumped vmcore when guest crashed, from vmcore everything
> looked good, fb_info, xenfb_info and so on.

And the event channels are correct?

.. snip..
> diff --git a/drivers/xen/events.c b/drivers/xen/events.c
> index ac7b42f..4cfb5e2 100644
> --- a/drivers/xen/events.c
> +++ b/drivers/xen/events.c
> @@ -175,6 +175,8 @@ static struct irq_info *info_for_irq(unsigned irq)
>  
>  static unsigned int evtchn_from_irq(unsigned irq)
>  {
> +	if (unlikely(irq < 0 || irq >= nr_irqs))
> +		return 0;

You could insert a WARN_ON here to see see if you get this during your
migration process.

Or use xen_raw_printk in case the guest is hung for good.

>  	return info_for_irq(irq)->evtchn;
>  }
>  
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Xen-devel] Re: [patch] xenfb: fix xenfb suspend/resume race
  2011-01-03 16:34     ` [Xen-devel] " Konrad Rzeszutek Wilk
@ 2011-01-04  0:34       ` Joe Jin
  0 siblings, 0 replies; 14+ messages in thread
From: Joe Jin @ 2011-01-04  0:34 UTC (permalink / raw
  To: Konrad Rzeszutek Wilk
  Cc: jeremy, xen-devel, ian.campbell, gurudas.pai, guru.anbalagane,
	greg.marsden, linux-kernel, linux-fbdev, Andrew Morton

On 01/04/11 00:34, Konrad Rzeszutek Wilk wrote:
>>> I am unclear from your description whether the patch fixes
>>> the problem (I would presume so). Or does it take a long time
>>> to hit this race?
>>>
>> Yes, more than 100 migrations. we hit this issue around 3 times.
> 
> OK, so you are still trying to find the culprit.
> 
> Did you look at this patch from Ian:
> 
> https://patchwork.kernel.org/patch/403192/

We have reproduced the issue with the patch.

> 
> ?
>>
>> I dumped vmcore when guest crashed, from vmcore everything
>> looked good, fb_info, xenfb_info and so on.
> 
> And the event channels are correct?
> 
> .. snip..
>> diff --git a/drivers/xen/events.c b/drivers/xen/events.c
>> index ac7b42f..4cfb5e2 100644
>> --- a/drivers/xen/events.c
>> +++ b/drivers/xen/events.c
>> @@ -175,6 +175,8 @@ static struct irq_info *info_for_irq(unsigned irq)
>>  
>>  static unsigned int evtchn_from_irq(unsigned irq)
>>  {
>> +	if (unlikely(irq < 0 || irq >= nr_irqs))
>> +		return 0;
> 
> You could insert a WARN_ON here to see see if you get this during your
> migration process.
> 
> Or use xen_raw_printk in case the guest is hung for good.
> 

Thanks for your advice, will try it.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] xenfb: fix xenfb suspend/resume race
  2010-12-30 16:40 ` Konrad Rzeszutek Wilk
  2010-12-31  0:56   ` Joe Jin
@ 2011-01-04 11:15   ` Ian Campbell
  2011-01-06  7:14     ` Joe Jin
  1 sibling, 1 reply; 14+ messages in thread
From: Ian Campbell @ 2011-01-04 11:15 UTC (permalink / raw
  To: Konrad Rzeszutek Wilk
  Cc: Joe Jin, jeremy@goop.org, Andrew Morton,
	linux-fbdev@vger.kernel.org, xen-devel@lists.xensource.com,
	linux-kernel@vger.kernel.org, gurudas.pai@oracle.com,
	greg.marsden@oracle.com, guru.anbalagane@oracle.com

On Thu, 2010-12-30 at 16:40 +0000, Konrad Rzeszutek Wilk wrote:
> On Thu, Dec 30, 2010 at 08:56:16PM +0800, Joe Jin wrote:
> > Hi,
> 
> Joe,
> 
> Patch looks good, however..
> 
> I am unclear from your description whether the patch fixes
> the problem (I would presume so). Or does it take a long time
> to hit this race?

I also don't see how the patch relates to the stack trace.

Is the issue is that xenfb_send_event is called between xenfb_resume
(which tears down the state, including evtchn->irq binding) and the
probe/connect of the new fb?

I suspect xenfb_resume should set info->update_wanted to 0. This will
defer updates until we have successfully reconnected.

Otherwise, since xenfb_resume also calls xenfb_connect I presume the
call of xenfb_send_event is asynchronous. Which would suggest that some
other locking is missing.

Ian.


> 
> > 
> > when do migration test, we hit the panic as below:
> > <1>BUG: unable to handle kernel paging request at 0000000b819fdb98
> > <1>IP: [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
> > <4>PGD 94b10067 PUD 0
> > <0>Oops: 0000 [#1] SMP
> > <0>last sysfs file: /sys/class/misc/autofs/dev
> > <4>CPU 3
> > <4>Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U)
> > auth_rpcgss(U) rfcomm(U) l2cap(U) bluetooth(U) rfkill(U) lockd(U) sunrpc(U)
> > nf_conntrack_netbios_ns(U) ipt_REJECT(U) nf_conntrack_ipv4(U)
> > nf_defrag_ipv4(U) xt_state(U) nf_conntrack(U) iptable_filter(U) ip_tables(U)
> > ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U)
> > ipv6(U) parport_pc(U) lp(U) parport(U) snd_seq_dummy(U) snd_seq_oss(U)
> > snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U)
> > snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U)
> > snd_page_alloc(U) joydev(U) xen_netfront(U) pcspkr(U) xen_blkfront(U)
> > uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> > Pid: 18, comm: events/3 Not tainted 2.6.32
> > RIP: e030:[<ffffffff812a588f>]  [<ffffffff812a588f>]
> > ify_remote_via_irq+0x13/0x34
> > RSP: e02b:ffff8800e7bf7bd0  EFLAGS: 00010202
> > RAX: ffff8800e61c8000 RBX: ffff8800e62f82c0 RCX: 0000000000000000
> > RDX: 00000000000001e3 RSI: ffff8800e7bf7c68 RDI: 0000000bfffffff4
> > RBP: ffff8800e7bf7be0 R08: 00000000000001e2 R09: ffff8800e62f82c0
> > R10: 0000000000000001 R11: ffff8800e6386110 R12: 0000000000000000
> > R13: 0000000000000007 R14: ffff8800e62f82e0 R15: 0000000000000240
> > FS:  00007f409d3906e0(0000) GS:ffff8800028b8000(0000)
> > GS:0000000000000000
> > CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000b819fdb98 CR3: 000000003ee3b000 CR4: 0000000000002660
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process events/3 (pid: 18, threadinfo ffff8800e7bf6000, task
> > f8800e7bf4540)
> > Stack:
> >  0000000000000200 ffff8800e61c8000 ffff8800e7bf7c00 ffffffff812712c9
> > <0> ffffffff8100ea5f ffffffff81438d80 ffff8800e7bf7cd0 ffffffff812714ee
> > <0> 0000000000000000 ffffffff81270568 000000000000e030 0000000000010202
> > Call Trace:
> >  [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
> >  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
> >  [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
> >  [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
> >  [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
> >  [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
> >  [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
> >  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
> >  [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
> >  [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
> >  [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
> >  [<ffffffff81071812>] worker_thread+0x14d/0x1ed
> >  [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
> >  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
> >  [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
> >  [<ffffffff810756e3>] kthread+0x6e/0x76
> >  [<ffffffff81012dea>] child_rip+0xa/0x20
> >  [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
> >  [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
> >  [<ffffffff81012de0>] ? child_rip+0x0/0x20
> > Code: 6b ff 0c 8b 87 a4 db 9f 81 66 85 c0 74 08 0f b7 f8 e8 3b ff ff ff c9
> > c3 55 48 89 e5 48 83 ec 10 0f 1f 44 00 00 89 ff 48 6b ff 0c <8b> 87 a4 db 9f
> > 81 66 85 c0 74 14 48 8d 75 f0 0f b7 c0 bf 04 00
> > RIP  [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
> >  RSP <ffff8800e7bf7bd0>
> > CR2: 0000000b819fdb98
> > ---[ end trace 098b4b74827595d0 ]---
> > Kernel panic - not syncing: Fatal exception
> > Pid: 18, comm: events/3 Tainted: G      D    2.6.32
> > Call Trace:
> >  [<ffffffff812a029e>] ? card_probe+0x99/0x123
> >  [<ffffffff81056a96>] panic+0xa5/0x162
> >  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
> >  [<ffffffff81079824>] ? down_trylock+0x30/0x38
> >  [<ffffffff812a029e>] ? card_probe+0x99/0x123
> >  [<ffffffff8105744c>] ? console_unblank+0x23/0x6f
> >  [<ffffffff81056763>] ? print_oops_end_marker+0x23/0x25
> >  [<ffffffff812a029e>] ? card_probe+0x99/0x123
> >  [<ffffffff81439c76>] oops_end+0xb7/0xc7
> >  [<ffffffff810366de>] no_context+0x1f1/0x200
> >  [<ffffffff812a029e>] ? card_probe+0x99/0x123
> >  [<ffffffff81036931>] __bad_area_nosemaphore+0x183/0x1a6
> >  [<ffffffff812af119>] ? extract_buf+0xbd/0x134
> >  [<ffffffff81030c7b>] ? pvclock_clocksource_read+0x47/0x9e
> >  [<ffffffff810369de>] bad_area_nosemaphore+0x13/0x15
> >  [<ffffffff8143b0ed>] do_page_fault+0x147/0x26c
> >  [<ffffffff81439185>] page_fault+0x25/0x30
> >  [<ffffffff812a588f>] ? notify_remote_via_irq+0x13/0x34
> >  [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
> >  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
> >  [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
> >  [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
> >  [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
> >  [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
> >  [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
> >  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
> >  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
> >  [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
> >  [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
> >  [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
> >  [<ffffffff81071812>] worker_thread+0x14d/0x1ed
> >  [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
> >  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
> >  [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
> >  [<ffffffff810756e3>] kthread+0x6e/0x76
> >  [<ffffffff81012dea>] child_rip+0xa/0x20
> >  [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
> >  [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
> >  [<ffffffff81012de0>] ? child_rip+0x0/0x20
> >  [<ffffffff81012de0>] ? child_rip+0x0/0x20
> > 
> > Check the source found this maybe caused by kernel tried to used not ready
> > xenfb when resume.
> > 
> > Below is the potential fix, please reivew it
> > 
> > Signed-off-by: Joe Jin <joe.jin@oracle.com>
> > Cc: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Jeremy Fitzhardinge <jeremy@goop.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > 
> > ---
> >  xen-fbfront.c |   19 +++++++++++--------
> >  1 file changed, 11 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
> > index dc72563..367fb1c 100644
> > --- a/drivers/video/xen-fbfront.c
> > +++ b/drivers/video/xen-fbfront.c
> > @@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
> >  static int xenfb_connect_backend(struct xenbus_device *dev,
> >  				 struct xenfb_info *info)
> >  {
> > -	int ret, evtchn;
> > +	int ret, evtchn, irq;
> >  	struct xenbus_transaction xbt;
> >  
> >  	ret = xenbus_alloc_evtchn(dev, &evtchn);
> >  	if (ret)
> >  		return ret;
> > -	ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
> > +	irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
> >  					0, dev->devicetype, info);
> > -	if (ret < 0) {
> > +	if (irq < 0) {
> >  		xenbus_free_evtchn(dev, evtchn);
> >  		xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
> > -		return ret;
> > +		return irq;
> >  	}
> > -	info->irq = ret;
> > -
> >   again:
> >  	ret = xenbus_transaction_start(&xbt);
> >  	if (ret) {
> >  		xenbus_dev_fatal(dev, ret, "starting transaction");
> > -		return ret;
> > +		goto unbind_irq;
> >  	}
> >  	ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
> >  			    virt_to_mfn(info->page));
> > @@ -602,15 +600,20 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
> >  		if (ret == -EAGAIN)
> >  			goto again;
> >  		xenbus_dev_fatal(dev, ret, "completing transaction");
> > -		return ret;
> > +		goto unbind_irq;
> >  	}
> >  
> >  	xenbus_switch_state(dev, XenbusStateInitialised);
> > +	info->irq = irq;
> >  	return 0;
> >  
> >   error_xenbus:
> >  	xenbus_transaction_end(xbt, 1);
> >  	xenbus_dev_fatal(dev, ret, "writing xenstore");
> > + unbind_irq:
> > +	printk(KERN_ERR "xenfb_connect_backend failed!\n");
> > +	unbind_from_irqhandler(irq, info);
> > +	xenbus_free_evtchn(dev, evtchn);
> >  	return ret;
> >  }
> >  
> > 
> > 
> > -- 
> > Oracle <http://www.oracle.com>
> > Joe Jin | Team Leader, Software Development | +8610.8278.6295
> > ORACLE | Linux and Virtualization
> > Incubator Building 2-A ZPark | Beijing China, 100094
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] xenfb: fix xenfb suspend/resume race
  2011-01-04 11:15   ` Ian Campbell
@ 2011-01-06  7:14     ` Joe Jin
  2011-01-06  8:02       ` Ian Campbell
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Jin @ 2011-01-06  7:14 UTC (permalink / raw
  To: Ian Campbell
  Cc: Konrad Rzeszutek Wilk, jeremy@goop.org, Andrew Morton,
	linux-fbdev@vger.kernel.org, xen-devel@lists.xensource.com,
	linux-kernel@vger.kernel.org, gurudas.pai@oracle.com,
	greg.marsden@oracle.com, guru.anbalagane@oracle.com

On 01/04/11 19:15, Ian Campbell wrote:
> On Thu, 2010-12-30 at 16:40 +0000, Konrad Rzeszutek Wilk wrote:
>> On Thu, Dec 30, 2010 at 08:56:16PM +0800, Joe Jin wrote:
>>> Hi,
>>
>> Joe,
>>
>> Patch looks good, however..
>>
>> I am unclear from your description whether the patch fixes
>> the problem (I would presume so). Or does it take a long time
>> to hit this race?
> 
> I also don't see how the patch relates to the stack trace.
> 
> Is the issue is that xenfb_send_event is called between xenfb_resume
> (which tears down the state, including evtchn->irq binding) and the
> probe/connect of the new fb?

Yes, when hit this issue, with debugging kernel found irq is invalid(-1).
Check if irq is valid will fix this issue.

And, when failed to connect to backend, need to release the resource.

Please review new patch for this issue.
Thanks,
Joe


Signed-off-by: Joe Jin <joe.jin@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Andrew Morton <akpm@linux-foundation.org>

---
 video/xen-fbfront.c |   19 +++++++++++--------
 xen/events.c        |    4 ++++
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
index dc72563..367fb1c 100644
--- a/drivers/video/xen-fbfront.c
+++ b/drivers/video/xen-fbfront.c
@@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
 static int xenfb_connect_backend(struct xenbus_device *dev,
 				 struct xenfb_info *info)
 {
-	int ret, evtchn;
+	int ret, evtchn, irq;
 	struct xenbus_transaction xbt;
 
 	ret = xenbus_alloc_evtchn(dev, &evtchn);
 	if (ret)
 		return ret;
-	ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
+	irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
 					0, dev->devicetype, info);
-	if (ret < 0) {
+	if (irq < 0) {
 		xenbus_free_evtchn(dev, evtchn);
 		xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
-		return ret;
+		return irq;
 	}
-	info->irq = ret;
-
  again:
 	ret = xenbus_transaction_start(&xbt);
 	if (ret) {
 		xenbus_dev_fatal(dev, ret, "starting transaction");
-		return ret;
+		goto unbind_irq;
 	}
 	ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
 			    virt_to_mfn(info->page));
@@ -602,15 +600,20 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
 		if (ret == -EAGAIN)
 			goto again;
 		xenbus_dev_fatal(dev, ret, "completing transaction");
-		return ret;
+		goto unbind_irq;
 	}
 
 	xenbus_switch_state(dev, XenbusStateInitialised);
+	info->irq = irq;
 	return 0;
 
  error_xenbus:
 	xenbus_transaction_end(xbt, 1);
 	xenbus_dev_fatal(dev, ret, "writing xenstore");
+ unbind_irq:
+	printk(KERN_ERR "xenfb_connect_backend failed!\n");
+	unbind_from_irqhandler(irq, info);
+	xenbus_free_evtchn(dev, evtchn);
 	return ret;
 }
 
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
index ac7b42f..4028704 100644
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -175,6 +175,10 @@ static struct irq_info *info_for_irq(unsigned irq)
 
 static unsigned int evtchn_from_irq(unsigned irq)
 {
+	if (unlikely(irq < 0 || irq >= nr_irqs)) {
+		WARN_ON(1, "[%s]: Invalid irq(%d)!\n", __func__, irq);
+		return 0;
+	}
 	return info_for_irq(irq)->evtchn;
 }
 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [patch] xenfb: fix xenfb suspend/resume race
  2011-01-06  7:14     ` Joe Jin
@ 2011-01-06  8:02       ` Ian Campbell
  2011-01-06  8:14         ` Joe Jin
  2011-01-06  8:47         ` Ian Campbell
  0 siblings, 2 replies; 14+ messages in thread
From: Ian Campbell @ 2011-01-06  8:02 UTC (permalink / raw
  To: Joe Jin
  Cc: Konrad Rzeszutek Wilk, jeremy@goop.org, Andrew Morton,
	linux-fbdev@vger.kernel.org, xen-devel@lists.xensource.com,
	linux-kernel@vger.kernel.org, gurudas.pai@oracle.com,
	greg.marsden@oracle.com, guru.anbalagane@oracle.com

On Thu, 2011-01-06 at 07:14 +0000, Joe Jin wrote: 
> On 01/04/11 19:15, Ian Campbell wrote:
> > On Thu, 2010-12-30 at 16:40 +0000, Konrad Rzeszutek Wilk wrote:
> >> On Thu, Dec 30, 2010 at 08:56:16PM +0800, Joe Jin wrote:
> >>> Hi,
> >>
> >> Joe,
> >>
> >> Patch looks good, however..
> >>
> >> I am unclear from your description whether the patch fixes
> >> the problem (I would presume so). Or does it take a long time
> >> to hit this race?
> > 
> > I also don't see how the patch relates to the stack trace.
> > 
> > Is the issue is that xenfb_send_event is called between xenfb_resume
> > (which tears down the state, including evtchn->irq binding) and the
> > probe/connect of the new fb?
> 
> Yes, when hit this issue, with debugging kernel found irq is invalid(-1).

But why is it -1? I really don't think you have identified the root
cause here. If you really have identified the root cause then your
changelog needs to go into much greater depth regarding your analysis.

> Check if irq is valid will fix this issue.

No, it papers over the issue, the code should never have been allowed to
get this far if the connection to the backend is not yet fully resumed
(i.e. when irq == -1).

The call to xenfb_send_event should have been gated further up the call
chain, AFAICT by the check of info->update_wanted in xenfb_refresh. This
suggests that the correct fix is to set info->update_wanted = 0 in
xenfb_resume.

I said all this in my previous mail and you ignored it. Did you try this
approach?

> And, when failed to connect to backend, need to release the resource.

So the changes to xenfb_connect_backend are independent of the irq == -1
issue? In which case this part, which seems like a reasonable and valid
fix, should be split into a separate patch.

> Please review new patch for this issue.

Nacked-by: Ian Campbell <ian.campbell@citrix.com>

Ian.

> Thanks,
> Joe
> 
> 
> Signed-off-by: Joe Jin <joe.jin@oracle.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Jeremy Fitzhardinge <jeremy@goop.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> 
> ---
>  video/xen-fbfront.c |   19 +++++++++++--------
>  xen/events.c        |    4 ++++
>  2 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
> index dc72563..367fb1c 100644
> --- a/drivers/video/xen-fbfront.c
> +++ b/drivers/video/xen-fbfront.c
> @@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
>  static int xenfb_connect_backend(struct xenbus_device *dev,
>  				 struct xenfb_info *info)
>  {
> -	int ret, evtchn;
> +	int ret, evtchn, irq;
>  	struct xenbus_transaction xbt;
>  
>  	ret = xenbus_alloc_evtchn(dev, &evtchn);
>  	if (ret)
>  		return ret;
> -	ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
> +	irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
>  					0, dev->devicetype, info);
> -	if (ret < 0) {
> +	if (irq < 0) {
>  		xenbus_free_evtchn(dev, evtchn);
>  		xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
> -		return ret;
> +		return irq;
>  	}
> -	info->irq = ret;
> -
>   again:
>  	ret = xenbus_transaction_start(&xbt);
>  	if (ret) {
>  		xenbus_dev_fatal(dev, ret, "starting transaction");
> -		return ret;
> +		goto unbind_irq;
>  	}
>  	ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
>  			    virt_to_mfn(info->page));
> @@ -602,15 +600,20 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
>  		if (ret == -EAGAIN)
>  			goto again;
>  		xenbus_dev_fatal(dev, ret, "completing transaction");
> -		return ret;
> +		goto unbind_irq;
>  	}
>  
>  	xenbus_switch_state(dev, XenbusStateInitialised);
> +	info->irq = irq;
>  	return 0;
>  
>   error_xenbus:
>  	xenbus_transaction_end(xbt, 1);
>  	xenbus_dev_fatal(dev, ret, "writing xenstore");
> + unbind_irq:
> +	printk(KERN_ERR "xenfb_connect_backend failed!\n");
> +	unbind_from_irqhandler(irq, info);
> +	xenbus_free_evtchn(dev, evtchn);
>  	return ret;
>  }
>  
> diff --git a/drivers/xen/events.c b/drivers/xen/events.c
> index ac7b42f..4028704 100644
> --- a/drivers/xen/events.c
> +++ b/drivers/xen/events.c
> @@ -175,6 +175,10 @@ static struct irq_info *info_for_irq(unsigned irq)
>  
>  static unsigned int evtchn_from_irq(unsigned irq)
>  {
> +	if (unlikely(irq < 0 || irq >= nr_irqs)) {
> +		WARN_ON(1, "[%s]: Invalid irq(%d)!\n", __func__, irq);
> +		return 0;
> +	}
>  	return info_for_irq(irq)->evtchn;
>  }
>  



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] xenfb: fix xenfb suspend/resume race
  2011-01-06  8:02       ` Ian Campbell
@ 2011-01-06  8:14         ` Joe Jin
  2011-01-07  6:43           ` [Xen-devel] " Joe Jin
  2011-01-06  8:47         ` Ian Campbell
  1 sibling, 1 reply; 14+ messages in thread
From: Joe Jin @ 2011-01-06  8:14 UTC (permalink / raw
  To: Ian Campbell
  Cc: Konrad Rzeszutek Wilk, jeremy@goop.org, Andrew Morton,
	linux-fbdev@vger.kernel.org, xen-devel@lists.xensource.com,
	linux-kernel@vger.kernel.org, gurudas.pai@oracle.com,
	greg.marsden@oracle.com, guru.anbalagane@oracle.com


> 
> No, it papers over the issue, the code should never have been allowed to
> get this far if the connection to the backend is not yet fully resumed
> (i.e. when irq == -1).
> 
> The call to xenfb_send_event should have been gated further up the call
> chain, AFAICT by the check of info->update_wanted in xenfb_refresh. This
> suggests that the correct fix is to set info->update_wanted = 0 in
> xenfb_resume.
> 
> I said all this in my previous mail and you ignored it. Did you try this
> approach?

Will try this then update you.

Thanks,
Joe

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Xen-devel] Re: [patch] xenfb: fix xenfb suspend/resume race
  2011-01-06  8:02       ` Ian Campbell
  2011-01-06  8:14         ` Joe Jin
@ 2011-01-06  8:47         ` Ian Campbell
  1 sibling, 0 replies; 14+ messages in thread
From: Ian Campbell @ 2011-01-06  8:47 UTC (permalink / raw
  To: Joe Jin
  Cc: jeremy@goop.org, xen-devel@lists.xensource.com,
	gurudas.pai@oracle.com, Konrad Rzeszutek Wilk,
	guru.anbalagane@oracle.com, greg.marsden@oracle.com,
	linux-kernel@vger.kernel.org, linux-fbdev@vger.kernel.org,
	Andrew Morton

On Thu, 2011-01-06 at 08:02 +0000, Ian Campbell wrote:
> 
> > Check if irq is valid will fix this issue.
> 
> No, it papers over the issue, the code should never have been allowed
> to get this far if the connection to the backend is not yet fully
> resumed (i.e. when irq == -1). 

On the other hand changing a kernel crash into a WARN_ON has merit in
its own right, I just think it is a stretch to say the xenfb issue is
fixed by it! You'd need to also make the warning go away to be able to
say that.

> @@ -175,6 +175,10 @@ static struct irq_info *info_for_irq(unsigned irq)
>  
>  static unsigned int evtchn_from_irq(unsigned irq)
>  {
> +     if (unlikely(irq < 0 || irq >= nr_irqs)) {
> +             WARN_ON(1, "[%s]: Invalid irq(%d)!\n", __func__, irq);

Did this compile without warnings? That isn't the correct syntax for a
WARN_ON since it takes only a condition and not the additional message.

I think this can simplified as:
	if (WARN(irq < 0 || irq >= nr_irqs, "invalid irq %d!\n", irq))
             return 0;

I think WARN includes the file name and line number as well as a stack
trace so including the function name is unnecessary.

I'm not sure the condition used is totally valid for a modern kernel
with sparse IRQs (it probably needs to do a irq_to_desc lookup) but I
think we can defer that until events.c moves to the more dynamic scheme
this implies (I should resurrect those patches now that thiongs have
settled down after the dom0 merge).

Ian.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch] xenfb: fix xenfb suspend/resume race.
@ 2011-01-07  6:40 Joe Jin
  2011-01-07  9:17 ` Ian Campbell
  0 siblings, 1 reply; 14+ messages in thread
From: Joe Jin @ 2011-01-07  6:40 UTC (permalink / raw
  To: jeremy, ian.campbell, Andrew Morton
  Cc: linux-fbdev, xen-devel, gurudas.pai, guru.anbalagane,
	greg.marsden, joe.jin, linux-kernel, Konrad Rzeszutek Wilk

Hi,

when do migration test, we hit the panic as below:
<1>BUG: unable to handle kernel paging request at 0000000b819fdb98
<1>IP: [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
<4>PGD 94b10067 PUD 0
<0>Oops: 0000 [#1] SMP
<0>last sysfs file: /sys/class/misc/autofs/dev
<4>CPU 3
<4>Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U)
auth_rpcgss(U) rfcomm(U) l2cap(U) bluetooth(U) rfkill(U) lockd(U) sunrpc(U)
nf_conntrack_netbios_ns(U) ipt_REJECT(U) nf_conntrack_ipv4(U)
nf_defrag_ipv4(U) xt_state(U) nf_conntrack(U) iptable_filter(U) ip_tables(U)
ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U)
ipv6(U) parport_pc(U) lp(U) parport(U) snd_seq_dummy(U) snd_seq_oss(U)
snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U)
snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U)
snd_page_alloc(U) joydev(U) xen_netfront(U) pcspkr(U) xen_blkfront(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 18, comm: events/3 Not tainted 2.6.32
RIP: e030:[<ffffffff812a588f>]  [<ffffffff812a588f>]
ify_remote_via_irq+0x13/0x34
RSP: e02b:ffff8800e7bf7bd0  EFLAGS: 00010202
RAX: ffff8800e61c8000 RBX: ffff8800e62f82c0 RCX: 0000000000000000
RDX: 00000000000001e3 RSI: ffff8800e7bf7c68 RDI: 0000000bfffffff4
RBP: ffff8800e7bf7be0 R08: 00000000000001e2 R09: ffff8800e62f82c0
R10: 0000000000000001 R11: ffff8800e6386110 R12: 0000000000000000
R13: 0000000000000007 R14: ffff8800e62f82e0 R15: 0000000000000240
FS:  00007f409d3906e0(0000) GS:ffff8800028b8000(0000)
GS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000b819fdb98 CR3: 000000003ee3b000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process events/3 (pid: 18, threadinfo ffff8800e7bf6000, task
f8800e7bf4540)
Stack:
 0000000000000200 ffff8800e61c8000 ffff8800e7bf7c00 ffffffff812712c9
<0> ffffffff8100ea5f ffffffff81438d80 ffff8800e7bf7cd0 ffffffff812714ee
<0> 0000000000000000 ffffffff81270568 000000000000e030 0000000000010202
Call Trace:
 [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
 [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
 [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
 [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
 [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
 [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
 [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
 [<ffffffff81071812>] worker_thread+0x14d/0x1ed
 [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
 [<ffffffff810756e3>] kthread+0x6e/0x76
 [<ffffffff81012dea>] child_rip+0xa/0x20
 [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012de0>] ? child_rip+0x0/0x20
Code: 6b ff 0c 8b 87 a4 db 9f 81 66 85 c0 74 08 0f b7 f8 e8 3b ff ff ff c9
c3 55 48 89 e5 48 83 ec 10 0f 1f 44 00 00 89 ff 48 6b ff 0c <8b> 87 a4 db 9f
81 66 85 c0 74 14 48 8d 75 f0 0f b7 c0 bf 04 00
RIP  [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
 RSP <ffff8800e7bf7bd0>
CR2: 0000000b819fdb98
---[ end trace 098b4b74827595d0 ]---

The root cause of the panic is try to refresh xenfb when suspend/resume.
Clear refresh flag of xenfb before disconnect backend would fix this issue.
Also below patch will fixed mem leak when connect to xenfb backend failed.

Please review and comment.

Signed-off-by: Joe Jin <joe.jin@oracle.com>
Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>

---
 xen-fbfront.c |   21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
index dc72563..6f23797 100644
--- a/drivers/video/xen-fbfront.c
+++ b/drivers/video/xen-fbfront.c
@@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
 static int xenfb_connect_backend(struct xenbus_device *dev,
 				 struct xenfb_info *info)
 {
-	int ret, evtchn;
+	int ret, evtchn, irq;
 	struct xenbus_transaction xbt;
 
 	ret = xenbus_alloc_evtchn(dev, &evtchn);
 	if (ret)
 		return ret;
-	ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
+	irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
 					0, dev->devicetype, info);
-	if (ret < 0) {
+	if (irq < 0) {
 		xenbus_free_evtchn(dev, evtchn);
 		xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
-		return ret;
+		return irq;
 	}
-	info->irq = ret;
-
  again:
 	ret = xenbus_transaction_start(&xbt);
 	if (ret) {
 		xenbus_dev_fatal(dev, ret, "starting transaction");
-		return ret;
+		goto unbind_irq;
 	}
 	ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
 			    virt_to_mfn(info->page));
@@ -602,20 +600,27 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
 		if (ret == -EAGAIN)
 			goto again;
 		xenbus_dev_fatal(dev, ret, "completing transaction");
-		return ret;
+		goto unbind_irq;
 	}
 
 	xenbus_switch_state(dev, XenbusStateInitialised);
+	info->irq = irq;
 	return 0;
 
  error_xenbus:
 	xenbus_transaction_end(xbt, 1);
 	xenbus_dev_fatal(dev, ret, "writing xenstore");
+ unbind_irq:
+	printk(KERN_ERR "xenfb_connect_backend failed!\n");
+	unbind_from_irqhandler(irq, info);
+	xenbus_free_evtchn(dev, evtchn);
 	return ret;
 }
 
 static void xenfb_disconnect_backend(struct xenfb_info *info)
 {
+	/* Prevent refresh xenfb */
+	info->update_wanted = 0;
 	if (info->irq >= 0)
 		unbind_from_irqhandler(info->irq, info);
 	info->irq = -1;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [Xen-devel] Re: [patch] xenfb: fix xenfb suspend/resume race
  2011-01-06  8:14         ` Joe Jin
@ 2011-01-07  6:43           ` Joe Jin
  0 siblings, 0 replies; 14+ messages in thread
From: Joe Jin @ 2011-01-07  6:43 UTC (permalink / raw
  To: Joe Jin
  Cc: Ian Campbell, jeremy@goop.org, xen-devel@lists.xensource.com,
	gurudas.pai@oracle.com, Konrad Rzeszutek Wilk,
	guru.anbalagane@oracle.com, greg.marsden@oracle.com,
	linux-kernel@vger.kernel.org, linux-fbdev@vger.kernel.org,
	Andrew Morton

On 01/06/11 16:14, Joe Jin wrote:
> 
>>
>> No, it papers over the issue, the code should never have been allowed to
>> get this far if the connection to the backend is not yet fully resumed
>> (i.e. when irq == -1).
>>
>> The call to xenfb_send_event should have been gated further up the call
>> chain, AFAICT by the check of info->update_wanted in xenfb_refresh. This
>> suggests that the correct fix is to set info->update_wanted = 0 in
>> xenfb_resume.
>>
>> I said all this in my previous mail and you ignored it. Did you try this
>> approach?
> 

Disable xenfb refresh flag before suspend will fix this issue, sent new patch
to you by another mail, please review it.

Thanks,
Joe

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch] xenfb: fix xenfb suspend/resume race.
  2011-01-07  6:40 Joe Jin
@ 2011-01-07  9:17 ` Ian Campbell
  0 siblings, 0 replies; 14+ messages in thread
From: Ian Campbell @ 2011-01-07  9:17 UTC (permalink / raw
  To: Joe Jin
  Cc: jeremy@goop.org, Andrew Morton, linux-fbdev@vger.kernel.org,
	xen-devel@lists.xensource.com, gurudas.pai@oracle.com,
	guru.anbalagane@oracle.com, greg.marsden@oracle.com,
	linux-kernel@vger.kernel.org, Konrad Rzeszutek Wilk

On Fri, 2011-01-07 at 06:40 +0000, Joe Jin wrote:
> Hi,
> 
> when do migration test, we hit the panic as below:
> <1>BUG: unable to handle kernel paging request at 0000000b819fdb98
> <1>IP: [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
> <4>PGD 94b10067 PUD 0
> <0>Oops: 0000 [#1] SMP
> <0>last sysfs file: /sys/class/misc/autofs/dev
> <4>CPU 3
> <4>Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U)
> auth_rpcgss(U) rfcomm(U) l2cap(U) bluetooth(U) rfkill(U) lockd(U) sunrpc(U)
> nf_conntrack_netbios_ns(U) ipt_REJECT(U) nf_conntrack_ipv4(U)
> nf_defrag_ipv4(U) xt_state(U) nf_conntrack(U) iptable_filter(U) ip_tables(U)
> ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U)
> ipv6(U) parport_pc(U) lp(U) parport(U) snd_seq_dummy(U) snd_seq_oss(U)
> snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U)
> snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U)
> snd_page_alloc(U) joydev(U) xen_netfront(U) pcspkr(U) xen_blkfront(U)
> uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
> Pid: 18, comm: events/3 Not tainted 2.6.32
> RIP: e030:[<ffffffff812a588f>]  [<ffffffff812a588f>]
> ify_remote_via_irq+0x13/0x34
> RSP: e02b:ffff8800e7bf7bd0  EFLAGS: 00010202
> RAX: ffff8800e61c8000 RBX: ffff8800e62f82c0 RCX: 0000000000000000
> RDX: 00000000000001e3 RSI: ffff8800e7bf7c68 RDI: 0000000bfffffff4
> RBP: ffff8800e7bf7be0 R08: 00000000000001e2 R09: ffff8800e62f82c0
> R10: 0000000000000001 R11: ffff8800e6386110 R12: 0000000000000000
> R13: 0000000000000007 R14: ffff8800e62f82e0 R15: 0000000000000240
> FS:  00007f409d3906e0(0000) GS:ffff8800028b8000(0000)
> GS:0000000000000000
> CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000b819fdb98 CR3: 000000003ee3b000 CR4: 0000000000002660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process events/3 (pid: 18, threadinfo ffff8800e7bf6000, task
> f8800e7bf4540)
> Stack:
>  0000000000000200 ffff8800e61c8000 ffff8800e7bf7c00 ffffffff812712c9
> <0> ffffffff8100ea5f ffffffff81438d80 ffff8800e7bf7cd0 ffffffff812714ee
> <0> 0000000000000000 ffffffff81270568 000000000000e030 0000000000010202
> Call Trace:
>  [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
>  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
>  [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
>  [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
>  [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
>  [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
>  [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
>  [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
>  [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
>  [<ffffffff81071812>] worker_thread+0x14d/0x1ed
>  [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
>  [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
>  [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
>  [<ffffffff810756e3>] kthread+0x6e/0x76
>  [<ffffffff81012dea>] child_rip+0xa/0x20
>  [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
>  [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
>  [<ffffffff81012de0>] ? child_rip+0x0/0x20
> Code: 6b ff 0c 8b 87 a4 db 9f 81 66 85 c0 74 08 0f b7 f8 e8 3b ff ff ff c9
> c3 55 48 89 e5 48 83 ec 10 0f 1f 44 00 00 89 ff 48 6b ff 0c <8b> 87 a4 db 9f
> 81 66 85 c0 74 14 48 8d 75 f0 0f b7 c0 bf 04 00
> RIP  [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
>  RSP <ffff8800e7bf7bd0>
> CR2: 0000000b819fdb98
> ---[ end trace 098b4b74827595d0 ]---
> 
> The root cause of the panic is try to refresh xenfb when suspend/resume.

perhaps work "... between the resume and reconnecting to the backend."
into that sentence somewhere.

> Clear refresh flag of xenfb before disconnect backend would fix this issue.

s/refresh/update_wanted/

> Also below patch will fixed mem leak when connect to xenfb backend failed.
> 
> Please review and comment.
> 
> Signed-off-by: Joe Jin <joe.jin@oracle.com>
> Tested-by: Gurudas Pai <gurudas.pai@oracle.com>

Looks good. But please separate the mem leak fix into its own patch, it
has nothing to do with this crash (hiding a 1 line fix for a crash in
amongst 30 lines of something else does nobody any favours, as the
length of this thread testifies).

You can add
Acked-by: Ian Campbell <ian.campbell@citrix.com>
to the xenfb_disconnect_backend change once you've split it out.

I have some further comments on the xenfb_connect_backend change below.

> Cc: Jeremy Fitzhardinge <jeremy@goop.org>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> 
> ---
>  xen-fbfront.c |   21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
> index dc72563..6f23797 100644
> --- a/drivers/video/xen-fbfront.c
> +++ b/drivers/video/xen-fbfront.c
> @@ -561,26 +561,24 @@ static void xenfb_init_shared_page(struct xenfb_info *info,
>  static int xenfb_connect_backend(struct xenbus_device *dev,
>  				 struct xenfb_info *info)
>  {
> -	int ret, evtchn;
> +	int ret, evtchn, irq;
>  	struct xenbus_transaction xbt;
>  
>  	ret = xenbus_alloc_evtchn(dev, &evtchn);
>  	if (ret)
>  		return ret;
> -	ret = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
> +	irq = bind_evtchn_to_irqhandler(evtchn, xenfb_event_handler,
>  					0, dev->devicetype, info);
> -	if (ret < 0) {
> +	if (irq < 0) {
>  		xenbus_free_evtchn(dev, evtchn);
>  		xenbus_dev_fatal(dev, ret, "bind_evtchn_to_irqhandler");
> -		return ret;
> +		return irq;
>  	}
> -	info->irq = ret;
> -
>   again:
>  	ret = xenbus_transaction_start(&xbt);
>  	if (ret) {
>  		xenbus_dev_fatal(dev, ret, "starting transaction");
> -		return ret;
> +		goto unbind_irq;
>  	}
>  	ret = xenbus_printf(xbt, dev->nodename, "page-ref", "%lu",
>  			    virt_to_mfn(info->page));
> @@ -602,20 +600,27 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
>  		if (ret == -EAGAIN)
>  			goto again;
>  		xenbus_dev_fatal(dev, ret, "completing transaction");
> -		return ret;
> +		goto unbind_irq;
>  	}
>  
>  	xenbus_switch_state(dev, XenbusStateInitialised);
> +	info->irq = irq;
>  	return 0;
>  
>   error_xenbus:
>  	xenbus_transaction_end(xbt, 1);
>  	xenbus_dev_fatal(dev, ret, "writing xenstore");
> + unbind_irq:
> +	printk(KERN_ERR "xenfb_connect_backend failed!\n");

If anything this should be xenbus_dev_BLAH(). However all the places
which "goto unbind_irq" already include a xenbus_dev_fatal with more
specific context information and so does the case which falls through
from error_xenbus so I think this new message is redundant.

> +	unbind_from_irqhandler(irq, info);
> +	xenbus_free_evtchn(dev, evtchn);

unbind_from_irqhandler will also close the event channel for you so the
call to xenbus_free_evtchn is not necessary here.

Thanks,
Ian.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch] xenfb: fix xenfb suspend/resume race.
@ 2011-01-07 10:17 Joe Jin
  0 siblings, 0 replies; 14+ messages in thread
From: Joe Jin @ 2011-01-07 10:17 UTC (permalink / raw
  To: jeremy, ian.campbell, Andrew Morton
  Cc: linux-fbdev, xen-devel, gurudas.pai, guru.anbalagane,
	greg.marsden, joe.jin, linux-kernel, Konrad Rzeszutek Wilk

Hi,

when do migration test, we hit the panic as below:

<1>BUG: unable to handle kernel paging request at 0000000b819fdb98
<1>IP: [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
<4>PGD 94b10067 PUD 0
<0>Oops: 0000 [#1] SMP
<0>last sysfs file: /sys/class/misc/autofs/dev
<4>CPU 3
<4>Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U)
auth_rpcgss(U) rfcomm(U) l2cap(U) bluetooth(U) rfkill(U) lockd(U) sunrpc(U)
nf_conntrack_netbios_ns(U) ipt_REJECT(U) nf_conntrack_ipv4(U)
nf_defrag_ipv4(U) xt_state(U) nf_conntrack(U) iptable_filter(U) ip_tables(U)
ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U)
ipv6(U) parport_pc(U) lp(U) parport(U) snd_seq_dummy(U) snd_seq_oss(U)
snd_seq_midi_event(U) snd_seq(U) snd_seq_device(U) snd_pcm_oss(U)
snd_mixer_oss(U) snd_pcm(U) snd_timer(U) snd(U) soundcore(U)
snd_page_alloc(U) joydev(U) xen_netfront(U) pcspkr(U) xen_blkfront(U)
uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 18, comm: events/3 Not tainted 2.6.32
RIP: e030:[<ffffffff812a588f>]  [<ffffffff812a588f>]
ify_remote_via_irq+0x13/0x34
RSP: e02b:ffff8800e7bf7bd0  EFLAGS: 00010202
RAX: ffff8800e61c8000 RBX: ffff8800e62f82c0 RCX: 0000000000000000
RDX: 00000000000001e3 RSI: ffff8800e7bf7c68 RDI: 0000000bfffffff4
RBP: ffff8800e7bf7be0 R08: 00000000000001e2 R09: ffff8800e62f82c0
R10: 0000000000000001 R11: ffff8800e6386110 R12: 0000000000000000
R13: 0000000000000007 R14: ffff8800e62f82e0 R15: 0000000000000240
FS:  00007f409d3906e0(0000) GS:ffff8800028b8000(0000)
GS:0000000000000000
CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000b819fdb98 CR3: 000000003ee3b000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process events/3 (pid: 18, threadinfo ffff8800e7bf6000, task
f8800e7bf4540)
Stack:
 0000000000000200 ffff8800e61c8000 ffff8800e7bf7c00 ffffffff812712c9
<0> ffffffff8100ea5f ffffffff81438d80 ffff8800e7bf7cd0 ffffffff812714ee
<0> 0000000000000000 ffffffff81270568 000000000000e030 0000000000010202
Call Trace:
 [<ffffffff812712c9>] xenfb_send_event+0x5c/0x5e
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff812714ee>] xenfb_refresh+0x1b1/0x1d7
 [<ffffffff81270568>] ? sys_imageblit+0x1ac/0x458
 [<ffffffff81271786>] xenfb_imageblit+0x2f/0x34
 [<ffffffff8126a3e5>] soft_cursor+0x1b5/0x1c8
 [<ffffffff8126a137>] bit_cursor+0x4b6/0x4d7
 [<ffffffff8100ea5f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff81269c81>] ? bit_cursor+0x0/0x4d7
 [<ffffffff812656b7>] fb_flashcursor+0xff/0x111
 [<ffffffff812655b8>] ? fb_flashcursor+0x0/0x111
 [<ffffffff81071812>] worker_thread+0x14d/0x1ed
 [<ffffffff81075a8c>] ? autoremove_wake_function+0x0/0x3d
 [<ffffffff81438d80>] ? _spin_unlock_irqrestore+0x16/0x18
 [<ffffffff810716c5>] ? worker_thread+0x0/0x1ed
 [<ffffffff810756e3>] kthread+0x6e/0x76
 [<ffffffff81012dea>] child_rip+0xa/0x20
 [<ffffffff81011fd1>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff8101275d>] ? retint_restore_args+0x5/0x6
 [<ffffffff81012de0>] ? child_rip+0x0/0x20
Code: 6b ff 0c 8b 87 a4 db 9f 81 66 85 c0 74 08 0f b7 f8 e8 3b ff ff ff c9
c3 55 48 89 e5 48 83 ec 10 0f 1f 44 00 00 89 ff 48 6b ff 0c <8b> 87 a4 db 9f
81 66 85 c0 74 14 48 8d 75 f0 0f b7 c0 bf 04 00
RIP  [<ffffffff812a588f>] notify_remote_via_irq+0x13/0x34
 RSP <ffff8800e7bf7bd0>
CR2: 0000000b819fdb98
---[ end trace 098b4b74827595d0 ]---

The root cause of race between the resume and reconnecting to the backend
Clear update_wanted flag of xenfb before disconnect backend would fix this issue.
Also below patch will fixed mem leak when connect to xenfb backend failed.


Signed-off-by: Joe Jin <joe.jin@oracle.com>
Tested-by: Gurudas Pai <gurudas.pai@oracle.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>

---
 xen-fbfront.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c
index dc72563..f2d9eb5 100644
--- a/drivers/video/xen-fbfront.c
+++ b/drivers/video/xen-fbfront.c
@@ -616,6 +616,8 @@ static int xenfb_connect_backend(struct xenbus_device *dev,
 
 static void xenfb_disconnect_backend(struct xenfb_info *info)
 {
+	/* Prevent xenfb refresh */
+	info->update_wanted = 0;
 	if (info->irq >= 0)
 		unbind_from_irqhandler(info->irq, info);
 	info->irq = -1;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2011-01-07 10:18 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-12-30 12:56 [patch] xenfb: fix xenfb suspend/resume race Joe Jin
2010-12-30 16:40 ` Konrad Rzeszutek Wilk
2010-12-31  0:56   ` Joe Jin
2011-01-03 16:34     ` [Xen-devel] " Konrad Rzeszutek Wilk
2011-01-04  0:34       ` Joe Jin
2011-01-04 11:15   ` Ian Campbell
2011-01-06  7:14     ` Joe Jin
2011-01-06  8:02       ` Ian Campbell
2011-01-06  8:14         ` Joe Jin
2011-01-07  6:43           ` [Xen-devel] " Joe Jin
2011-01-06  8:47         ` Ian Campbell
  -- strict thread matches above, loose matches on Subject: below --
2011-01-07  6:40 Joe Jin
2011-01-07  9:17 ` Ian Campbell
2011-01-07 10:17 Joe Jin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).