From: "Engel, Amit" <Amit.Engel@Dell.com>
To: Sagi Grimberg <sagi@grimberg.me>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Cc: "Anner, Ran" <Ran.Anner@dell.com>, "Grupi, Elad" <Elad.Grupi@dell.com>
Subject: RE: nvme_tcp BUG: unable to handle kernel NULL pointer dereference at 0000000000000230
Date: Wed, 9 Jun 2021 07:48:59 +0000	[thread overview]
Message-ID: <CO1PR19MB48850B44ED23C179A4E4F541EE369@CO1PR19MB4885.namprd19.prod.outlook.com> (raw)
In-Reply-To: <e40414a9-41ef-f82c-7d1c-2583d8251c26@grimberg.me>

Hi Sagi,

Indeed, RHEL 8.3 does not have the mutex protection on nvme_tcp_stop_queue.
However, in our case, based on the backtrace below, we do not reach
__nvme_tcp_stop_queue from nvme_tcp_stop_queue; we reach it from:
nvme_tcp_reconnect_ctrl_work --> nvme_tcp_setup_ctrl --> nvme_tcp_start_queue --> __nvme_tcp_stop_queue

so I'm not sure how this mutex protection would help in this case.

crash> bt -l
PID: 193053  TASK: ffff9491bdad17c0  CPU: 7   COMMAND: "kworker/u193:9"
 #0 [ffffb2e9cfdbbb70] machine_kexec at ffffffffb245bf3e
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/arch/x86/kernel/machine_kexec_64.c: 389
 #1 [ffffb2e9cfdbbbc8] __crash_kexec at ffffffffb256072d
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/kernel/kexec_core.c: 956
 #2 [ffffb2e9cfdbbc90] crash_kexec at ffffffffb256160d
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/./include/linux/compiler.h: 219
 #3 [ffffb2e9cfdbbca8] oops_end at ffffffffb2422d4d
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/arch/x86/kernel/dumpstack.c: 334
 #4 [ffffb2e9cfdbbcc8] no_context at ffffffffb246ba9e
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/arch/x86/mm/fault.c: 773
 #5 [ffffb2e9cfdbbd20] do_page_fault at ffffffffb246c5c2
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/./arch/x86/include/asm/jump_label.h: 38
 #6 [ffffb2e9cfdbbd50] page_fault at ffffffffb2e0122e
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/arch/x86/entry/entry_64.S: 1183
    [exception RIP: _raw_write_lock_bh+23]
    RIP: ffffffffb2cd6cc7  RSP: ffffb2e9cfdbbe00  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff94b2aefb4000  RCX: 0000000000000003
    RDX: 00000000000000ff  RSI: 00000000fffffe01  RDI: 0000000000000230
    RBP: ffff94923f793f40   R8: ffff9492ff1ea7f8   R9: 0000000000000000
    R10: 0000000000000000  R11: ffff9492ff1e8c64  R12: ffff94b2b7210338
    R13: 0000000000000000  R14: ffff94b27f7a4100  R15: ffff94b2b72110a0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/./arch/x86/include/asm/atomic.h: 194
 #7 [ffffb2e9cfdbbe00] __nvme_tcp_stop_queue at ffffffffc02dc0aa [nvme_tcp]
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/drivers/nvme/host/tcp.c: 1486
 #8 [ffffb2e9cfdbbe18] nvme_tcp_start_queue at ffffffffc02dcd18 [nvme_tcp]
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/drivers/nvme/host/tcp.c: 1525
 #9 [ffffb2e9cfdbbe38] nvme_tcp_setup_ctrl at ffffffffc02df258 [nvme_tcp]
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/drivers/nvme/host/tcp.c: 1814
#10 [ffffb2e9cfdbbe80] nvme_tcp_reconnect_ctrl_work at ffffffffc02df4bf [nvme_tcp]
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/drivers/nvme/host/tcp.c: 1962
#11 [ffffb2e9cfdbbe98] process_one_work at ffffffffb24d3477
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/./arch/x86/include/asm/jump_label.h: 38
#12 [ffffb2e9cfdbbed8] worker_thread at ffffffffb24d3b40
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/./include/linux/compiler.h: 193
#13 [ffffb2e9cfdbbf10] kthread at ffffffffb24d9502
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/kernel/kthread.c: 280
#14 [ffffb2e9cfdbbf50] ret_from_fork at ffffffffb2e00255
    /usr/src/debug/kernel-4.18.0-240.el8/linux-4.18.0-240.el8.x86_64/arch/x86/entry/entry_64.S: 360

-----Original Message-----
From: Sagi Grimberg <sagi@grimberg.me> 
Sent: Wednesday, June 9, 2021 2:39 AM
To: Engel, Amit; linux-nvme@lists.infradead.org
Cc: Anner, Ran; Grupi, Elad
Subject: Re: nvme_tcp BUG: unable to handle kernel NULL pointer dereference at 0000000000000230




> Hi Sagi,
> 
> A correction to the below analysis:
> It seems like sock->sk is NULL and not queue->sock
> 
> As part of __nvme_tcp_stop_queue,
> kernel_sock_shutdown and nvme_tcp_restore_sock_calls are called:
> kernel_sock_shutdown leads to nvme_tcp_state_change, which triggers err_work (nvme_tcp_error_recovery_work)
> 
> As part of nvme_tcp_error_recovery_work, nvme_tcp_free_queue is called, which releases the socket (sock_release)
> 
> In our case, based on the backtrace below:
> nvme_tcp_error_recovery_work is triggered (and so sock_release) before nvme_tcp_restore_sock_calls, which ends up with a NULL pointer dereference at 'rwlock_t sk_callback_lock' ?
> 
> Can you please review and share your input on this potential race?

Seems that RH8.3 is missing the mutex protection on nvme_tcp_stop_queue.
I'm assuming it doesn't happen upstream?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

Thread overview: 11+ messages
2021-06-01 17:51 nvme_tcp BUG: unable to handle kernel NULL pointer dereference at 0000000000000230 Engel, Amit
2021-06-02 12:28 ` Engel, Amit
2021-06-08 23:39   ` Sagi Grimberg
2021-06-09  7:48     ` Engel, Amit [this message]
2021-06-09  8:04       ` Sagi Grimberg
2021-06-09  8:39         ` Engel, Amit
2021-06-09  9:11           ` Sagi Grimberg
2021-06-09 11:14             ` Engel, Amit
2021-06-10  8:44               ` Engel, Amit
2021-06-10 20:03               ` Sagi Grimberg
2021-06-13  8:35                 ` Engel, Amit
