Linux-RDMA Archive mirror
* Segfault in mlx5 driver on infiniband after application fork
@ 2024-02-07 19:17 Rehm, Kevan
  2024-02-08  8:52 ` Leon Romanovsky
  0 siblings, 1 reply; 14+ messages in thread
From: Rehm, Kevan @ 2024-02-07 19:17 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org

Greetings,
 
I don’t see a way to open a ticket at rdma-core; it was suggested that I send this email instead.
 
I have been chasing a problem in rdma-core-47.1.  Originally, I opened a ticket in libfabric, but it was pointed out that mlx5 is not part of libfabric.  A full description of the problem plus debug notes is documented in the libfabric GitHub repository as issue 9792; please have a look there rather than have me repeat all of the background information in this email.
 
An application started by PyTorch does a fork, and the child process then attempts to use libfabric to open a new DAOS InfiniBand endpoint.  The original endpoint is owned and still in use by the parent process.
 
When the parent process created the endpoint (fi_fabric, fi_domain, fi_endpoint calls), the mlx5 driver allocated memory pages for use in SRQ creation and issued madvise(MADV_DONTFORK) on them.  These pages are associated with the domain’s ibv_device, which is cached in the driver.  After the fork, when the child process calls fi_domain for its new endpoint, it gets the ibv_device that was cached at the time it was created by the parent.  The child process immediately segfaults when trying to create an SRQ, because the pages associated with that ibv_device are not present in the child’s memory.  There doesn’t appear to be any way for a child process to create a fresh endpoint because of the caching being done for ibv_devices.
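For readers without the libfabric context, the pattern can be sketched with plain libibverbs.  This is a hedged illustration only (device selection, SRQ attributes, and error handling are assumptions, and it presumes fork protection is armed, e.g. IBV_FORK_SAFE=1); the real application goes through fi_fabric/fi_domain/fi_endpoint:

/* Illustrative sketch of the fork-then-create-SRQ pattern described above.
 * Assumptions: first device in the list, small SRQ attributes, and fork
 * protection enabled (e.g. IBV_FORK_SAFE=1 as used later in this thread). */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static struct ibv_srq *make_srq(struct ibv_context *ctx)
{
        struct ibv_pd *pd = ibv_alloc_pd(ctx);
        struct ibv_srq_init_attr attr = {
                .attr = { .max_wr = 64, .max_sge = 1 },
        };

        return pd ? ibv_create_srq(pd, &attr) : NULL;
}

int main(void)
{
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        struct ibv_context *ctx;

        if (!devs || num == 0)
                return 1;
        ctx = ibv_open_device(devs[0]);
        if (!ctx)
                return 1;

        /* Parent: creating the SRQ makes the provider allocate internal pages
         * and (with fork protection on) mark them MADV_DONTFORK. */
        if (!make_srq(ctx))
                return 1;

        if (fork() == 0) {
                /* Child: reuses the inherited/cached context; in the scenario
                 * described above, touching the provider's DONTFORK'ed pages
                 * segfaults here instead of failing cleanly. */
                struct ibv_srq *srq = make_srq(ctx);

                printf("child srq=%p\n", (void *)srq);
                _exit(srq ? 0 : 2);
        }
        wait(NULL);
        ibv_free_device_list(devs);
        return 0;
}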
 
Is this the proper way to “open a ticket” against rdma-core?
 
Regards, Kevan




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-07 19:17 Segfault in mlx5 driver on infiniband after application fork Rehm, Kevan
@ 2024-02-08  8:52 ` Leon Romanovsky
  2024-02-08  9:05   ` Mark Zhang
  0 siblings, 1 reply; 14+ messages in thread
From: Leon Romanovsky @ 2024-02-08  8:52 UTC (permalink / raw)
  To: Rehm, Kevan; +Cc: linux-rdma@vger.kernel.org, Yishai Hadas

On Wed, Feb 07, 2024 at 07:17:01PM +0000, Rehm, Kevan wrote:
> Greetings,
>  
> I don’t see a way to open a ticket at rdma-core; it was suggested that I send this email instead.
>  
> I have been chasing a problem in rdma-core-47.1.   Originally, I opened a ticket in libfabric, but it was pointed out that mlx5 is not part of libfabric.   Full description of the problem plus debug notes are documented at the github repository for libfabric, see issue 9792, please have a look there rather than repeating all of the background information in this email.
>  
> An application started by pytorch does a fork, then the child process attempts to use libfabric to open a new DAOS infiniband endpoint.    The original endpoint is owned and still in use by the parent process. 
>  
> When the parent process created the endpoint (fi_fabric, fi_domain, fi_endpoint calls), the mlx5 driver allocated memory pages for use in SRQ creation, and issued a madvise to say that the pages are DONTFORK.  These pages are associated with the domain’s ibv_device which is cached in the driver.   After the fork when the child process calls fi_domain for its new endpoint, it gets the ibv_device that was cached at the time it was created by the parent.   The child process immediately segfaults when trying to create a SRQ, because the pages associated with that ibv_device are not in the child’s memory.  There doesn’t appear to be any way for a child process to create a fresh endpoint because of the caching being done for ibv_devices.
>  
> Is this the proper way to “open a ticket” against rdma-core?

It is the right place, but I won't call it the "proper way".
For anyone who is interested in this issue, please follow the links below:
https://github.com/ofiwg/libfabric/issues/9792
https://daosio.atlassian.net/browse/DAOS-15117

Regarding the issue, I don't know whether mlx5 is actively used to run
libfabric, but the mentioned call to ibv_dontfork_range() has existed
since the prehistoric era.
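(For context, ibv_dontfork_range() is essentially a page-aligned madvise(MADV_DONTFORK); below is a simplified sketch of the idea only, not the actual rdma-core code, which also reference-counts overlapping ranges and does nothing when fork protection is not in use.)

/* Simplified illustration only; the real implementation in rdma-core tracks
 * overlapping registrations and is skipped when fork protection is unused. */
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static int dontfork_range_sketch(void *base, size_t size)
{
        uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)base & ~(page - 1);
        uintptr_t end = ((uintptr_t)base + size + page - 1) & ~(page - 1);

        /* After this, the pages are not inherited by a forked child: the child
         * sees an unmapped hole there and faults if it ever touches it. */
        return madvise((void *)start, end - start, MADV_DONTFORK);
}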

Do you have any environment variables set related to rdma-core?

Thanks

>  
> Regards, Kevan
> 
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-08  8:52 ` Leon Romanovsky
@ 2024-02-08  9:05   ` Mark Zhang
  0 siblings, 0 replies; 14+ messages in thread
From: Mark Zhang @ 2024-02-08  9:05 UTC (permalink / raw)
  To: Leon Romanovsky, Rehm, Kevan; +Cc: linux-rdma@vger.kernel.org, Yishai Hadas

On 2/8/2024 4:52 PM, Leon Romanovsky wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Wed, Feb 07, 2024 at 07:17:01PM +0000, Rehm, Kevan wrote:
>> Greetings,
>>
>> I don’t see a way to open a ticket at rdma-core; it was suggested that I send this email instead.
>>
>> I have been chasing a problem in rdma-core-47.1.   Originally, I opened a ticket in libfabric, but it was pointed out that mlx5 is not part of libfabric.   Full description of the problem plus debug notes are documented at the github repository for libfabric, see issue 9792, please have a look there rather than repeating all of the background information in this email.
>>
>> An application started by pytorch does a fork, then the child process attempts to use libfabric to open a new DAOS infiniband endpoint.    The original endpoint is owned and still in use by the parent process.
>>
>> When the parent process created the endpoint (fi_fabric, fi_domain, fi_endpoint calls), the mlx5 driver allocated memory pages for use in SRQ creation, and issued a madvise to say that the pages are DONTFORK.  These pages are associated with the domain’s ibv_device which is cached in the driver.   After the fork when the child process calls fi_domain for its new endpoint, it gets the ibv_device that was cached at the time it was created by the parent.   The child process immediately segfaults when trying to create a SRQ, because the pages associated with that ibv_device are not in the child’s memory.  There doesn’t appear to be any way for a child process to create a fresh endpoint because of the caching being done for ibv_devices.
>>
>> Is this the proper way to “open a ticket” against rdma-core?
> 
> It is right place, but I won't call it "proper way".
> For anyone who is interested in this issue, please follow the links below:
> https://github.com/ofiwg/libfabric/issues/9792
> https://daosio.atlassian.net/browse/DAOS-15117
> 
> Regarding the issue, I don't know if mlx5 actively used to run
> libfabric, but the mentioned call to ibv_dontfork_range() existed from
> prehistoric era.
> 
> Do you have any environment variables set related to rdma-core?
> 

Is it related to ibv_fork_init()? It must be called when fork() is called.
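For reference, the ordering that ibv_fork_init(3) documents is roughly the following; this is a minimal sketch with error handling elided, and whether it can help once the parent has already initialized is exactly what the rest of the thread discusses:

/* Sketch of the documented call order only, not a claim that it fixes this case. */
#include <infiniband/verbs.h>
#include <sys/types.h>
#include <unistd.h>

static int setup_then_fork(void)
{
        if (ibv_fork_init())    /* must run before devices are opened and memory is registered */
                return -1;

        /* ... open device, create PD/CQ/SRQ/QPs, register memory ... */

        pid_t pid = fork();     /* registered ranges are now protected across the fork */
        return pid < 0 ? -1 : 0;
}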


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
@ 2024-02-11 19:24 Kevan Rehm
  2024-02-12 13:33 ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Kevan Rehm @ 2024-02-11 19:24 UTC (permalink / raw)
  To: Mark Zhang, Leon Romanovsky
  Cc: linux-rdma@vger.kernel.org, Yishai Hadas, kevan.rehm


>> An application started by pytorch does a fork, then the child process attempts to use libfabric to open a new DAOS infiniband endpoint.    The original endpoint is owned and still in use by the parent process.
>>
>> When the parent process created the endpoint (fi_fabric, fi_domain, fi_endpoint calls), the mlx5 driver allocated memory pages for use in SRQ creation, and issued a madvise to say that the pages are DONTFORK.  These pages are associated with the domain’s ibv_device which is cached in the driver.   After the fork when the child process calls fi_domain for its new endpoint, it gets the ibv_device that was cached at the time it was created by the parent.   The child process immediately segfaults when trying to create a SRQ, because the pages associated with that ibv_device are not in the child’s memory.  There doesn’t appear to be any way for a child process to create a fresh endpoint because of the caching being done for ibv_devices.
>>

> For anyone who is interested in this issue, please follow the links below:
> https://github.com/ofiwg/libfabric/issues/9792
> https://daosio.atlassian.net/browse/DAOS-15117
> 
> Regarding the issue, I don't know if mlx5 actively used to run
> libfabric, but the mentioned call to ibv_dontfork_range() existed from
> prehistoric era.

Yes, libfabric has used mlx5 for a long time.

> Do you have any environment variables set related to rdma-core?
> 
IBV_FORK_SAFE is set to 1

> Is it related to ibv_fork_init()? It must be called when fork() is called.

Calling ibv_fork_init() doesn’t help, because it immediately checks mm_root, sees it is non-zero (from the parent process’s prior call), and returns doing nothing.
There is now a simplified test case, see https://github.com/ofiwg/libfabric/issues/9792 for ongoing analysis.
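For readers following along, the entry check being described looks roughly like this (a paraphrase of the behaviour, not the literal rdma-core source):

/* Paraphrase of the early-return path described above; simplified. */
#include <errno.h>

static void *mm_root;   /* tracking tree, built once; inherited non-NULL across fork() */
static int   too_late;  /* set if registrations already happened without protection    */

static int fork_init_sketch(void)
{
        if (mm_root)            /* parent initialized before forking, so in the   */
                return 0;       /* child this call is a no-op and returns success */
        if (too_late)
                return EINVAL;
        /* ...otherwise build the tracking tree and start marking ranges DONTFORK... */
        return 0;
}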

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-11 19:24 Kevan Rehm
@ 2024-02-12 13:33 ` Jason Gunthorpe
  2024-02-12 14:37   ` Kevan Rehm
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2024-02-12 13:33 UTC (permalink / raw)
  To: Kevan Rehm
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, kevan.rehm

On Sun, Feb 11, 2024 at 02:24:16PM -0500, Kevan Rehm wrote:
> 
> >> An application started by pytorch does a fork, then the child
> >> process attempts to use libfabric to open a new DAOS infiniband
> >> endpoint.  The original endpoint is owned and still in use by the
> >> parent process.
> >>
> >> When the parent process created the endpoint (fi_fabric,
> >> fi_domain, fi_endpoint calls), the mlx5 driver allocated memory
> >> pages for use in SRQ creation, and issued a madvise to say that
> >> the pages are DONTFORK.  These pages are associated with the
> >> domain’s ibv_device which is cached in the driver.  After the fork
> >> when the child process calls fi_domain for its new endpoint, it
> >> gets the ibv_device that was cached at the time it was created by
> >> the parent.  The child process immediately segfaults when trying
> >> to create a SRQ, because the pages associated with that
> >> ibv_device are not in the child’s memory.  There doesn’t appear
> >> to be any way for a child process to create a fresh endpoint
> >> because of the caching being done for ibv_devices.
> 
> > For anyone who is interested in this issue, please follow the links below:
> > https://github.com/ofiwg/libfabric/issues/9792
> > https://daosio.atlassian.net/browse/DAOS-15117
> > 
> > Regarding the issue, I don't know if mlx5 actively used to run
> > libfabric, but the mentioned call to ibv_dontfork_range() existed from
> > prehistoric era.
> 
> Yes, libfabric has used mlx5 for a long time.
> 
> > Do you have any environment variables set related to rdma-core?
> > 
> IBV_FORK_SAFE is set to 1
> 
> > Is it related to ibv_fork_init()? It must be called when fork() is called.
> 
> Calling ibv_fork_init() doesn’t help, because it immediately checks mm_root, sees it is non-zero (from the parent process’s prior call), and returns doing nothing.
> There is now a simplified test case, see https://github.com/ofiwg/libfabric/issues/9792 for ongoing analysis.

This was all fixed in the kernel, upgrade your kernel and forking
works much more reliably, but I'm not sure this case will work.

It is a libfabric problem if it is expecting memory to be registered
for RDMA and be used by both processes in a fork. That cannot work.

Don't do that, or make the memory MAP_SHARED so that the fork children
can access it.
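A hedged sketch of the MAP_SHARED alternative mentioned here, i.e. backing such a pool with memory that both parent and child can still see after fork() (whether libfabric can actually switch its pools to this is a separate question):

/* Illustration only: fork-shared anonymous memory instead of private heap pages. */
#include <stddef.h>
#include <sys/mman.h>

static void *alloc_fork_shared(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        return p == MAP_FAILED ? NULL : p;   /* visible to parent and forked children */
}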

The bug reports seem a bit confused: there is no issue with ibv_device
sharing, only with actually sharing underlying registered memory, i.e.
sharing an SRQ memory pool between the child and parent.

"fork safe" does not magically make all scenarios work, it is
targetted at a specific use case where a rdma using process forks and
the fork does not continue to use rdma.

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-12 13:33 ` Jason Gunthorpe
@ 2024-02-12 14:37   ` Kevan Rehm
  2024-02-12 14:40     ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Kevan Rehm @ 2024-02-12 14:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, kevan.rehm, chien.tin.tung



> On Feb 12, 2024, at 8:33 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Sun, Feb 11, 2024 at 02:24:16PM -0500, Kevan Rehm wrote:
>> 
>>>> An application started by pytorch does a fork, then the child
>>>> process attempts to use libfabric to open a new DAOS infiniband
>>>> endpoint.  The original endpoint is owned and still in use by the
>>>> parent process.
>>>> 
>>>> When the parent process created the endpoint (fi_fabric,
>>>> fi_domain, fi_endpoint calls), the mlx5 driver allocated memory
>>>> pages for use in SRQ creation, and issued a madvise to say that
>>>> the pages are DONTFORK.  These pages are associated with the
>>>> domain’s ibv_device which is cached in the driver.  After the fork
>>>> when the child process calls fi_domain for its new endpoint, it
>>>> gets the ibv_device that was cached at the time it was created by
>>>> the parent.  The child process immediately segfaults when trying
>>>> to create a SRQ, because the pages associated with that
>>>> ibv_device are not in the child’s memory.  There doesn’t appear
>>>> to be any way for a child process to create a fresh endpoint
>>>> because of the caching being done for ibv_devices.
>> 
>>> For anyone who is interested in this issue, please follow the links below:
>>> https://github.com/ofiwg/libfabric/issues/9792
>>> https://daosio.atlassian.net/browse/DAOS-15117
>>> 
>>> Regarding the issue, I don't know if mlx5 actively used to run
>>> libfabric, but the mentioned call to ibv_dontfork_range() existed from
>>> prehistoric era.
>> 
>> Yes, libfabric has used mlx5 for a long time.
>> 
>>> Do you have any environment variables set related to rdma-core?
>>> 
>> IBV_FORK_SAFE is set to 1
>> 
>>> Is it related to ibv_fork_init()? It must be called when fork() is called.
>> 
>> Calling ibv_fork_init() doesn’t help, because it immediately checks mm_root, sees it is non-zero (from the parent process’s prior call), and returns doing nothing.
>> There is now a simplified test case, see https://github.com/ofiwg/libfabric/issues/9792 for ongoing analysis.
> 
> This was all fixed in the kernel, upgrade your kernel and forking
> works much more reliably, but I'm not sure this case will work.

I agree, that won’t help here.

> It is a libfabric problem if it is expecting memory to be registered
> for RDMA and be used by both processes in a fork. That cannot work.
> 
> Don't do that, or make the memory MAP_SHARED so that the fork children
> can access it.

Libfabric agrees; it wants to use separate registered memory in the child, but there doesn’t seem to be a way to do this.
> 
> The bugs seem a bit confused, there is no issue with ibv_device
> sharing. Only with actually sharing underlying registered memory. Ie
> sharing a SRQ memory pool between the child and parent.

Libfabric calls rdma_get_devices(), then walks the list looking for the entry for the correct domain (mlx5_1).  It saves a pointer to the matching dev_list entry, which is an ibv_context structure.  Wrapped around that ibv_context is the mlx5 context, which contains the registered pages that had dontfork set when the parent established its connection.  When the child process calls rdma_get_devices(), wanting to create a fresh connection to the same mlx5_1 domain, it instead gets back the same ibv_context that the parent got, not a fresh one, and so creation of an SRQ will segfault.  How can libfabric force verbs to return a fresh ibv_context for mlx5_1 instead of the one returned to the parent process?
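A hedged sketch of the lookup pattern described above (helper name, device name, and error handling are illustrative; the real code is in libfabric's verbs provider):

/* Sketch: walk rdma_get_devices() for a context by device name.  Both parent
 * and child end up with the same library-cached ibv_context this way. */
#include <rdma/rdma_cma.h>
#include <string.h>

static struct ibv_context *find_ctx_by_name(const char *name /* e.g. "mlx5_1" */)
{
        int n = 0;
        struct ibv_context **list = rdma_get_devices(&n);
        struct ibv_context *found = NULL;

        if (!list)
                return NULL;
        for (int i = 0; i < n; i++) {
                if (!strcmp(ibv_get_device_name(list[i]->device), name)) {
                        found = list[i];
                        break;
                }
        }
        rdma_free_devices(list);   /* frees the array; the contexts stay owned by the library */
        return found;
}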
> 
> "fork safe" does not magically make all scenarios work, it is
> targeted at a specific use case where an RDMA-using process forks and
> the fork does not continue to use rdma.
> 
> Jason


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-12 14:37   ` Kevan Rehm
@ 2024-02-12 14:40     ` Jason Gunthorpe
  2024-02-12 16:04       ` Kevan Rehm
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2024-02-12 14:40 UTC (permalink / raw)
  To: Kevan Rehm
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, kevan.rehm, chien.tin.tung

On Mon, Feb 12, 2024 at 09:37:25AM -0500, Kevan Rehm wrote:

> > This was all fixed in the kernel, upgrade your kernel and forking
> > works much more reliably, but I'm not sure this case will work.
> 
> I agree, that won’t help here.
> 
> > It is a libfabric problem if it is expecting memory to be registered
> > for RDMA and be used by both processes in a fork. That cannot work.
> > 
> > Don't do that, or make the memory MAP_SHARED so that the fork children
> > can access it.
> 
> Libfabric agrees, it wants to use separate registered memory in the
> child, but there doesn’t seem to be a way to do this.

How can that be true? libfabric is the only entity that causes memory
to be registered :)

> > The bugs seem a bit confused, there is no issue with ibv_device
> > sharing. Only with actually sharing underlying registered memory. Ie
> > sharing a SRQ memory pool between the child and parent.
> 
> Libfabric calls rdma_get_devices(), then walks the list looking for
> the entry for the correct domain (mlx5_1).  It saves a pointer to
> the matching dev_list entry which is an ibv_context structure.
> Wrapped on that ibv_context is the mlx5 context which contains the
> registered pages that had dontfork set when the parent established
  ^^^^^^^^^^^^^^^^

It does not. Contexts don't have pages; your problem comes from
something else.

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-12 14:40     ` Jason Gunthorpe
@ 2024-02-12 16:04       ` Kevan Rehm
  2024-02-12 16:12         ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Kevan Rehm @ 2024-02-12 16:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, kevan.rehm, chien.tin.tung



> On Feb 12, 2024, at 9:40 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, Feb 12, 2024 at 09:37:25AM -0500, Kevan Rehm wrote:
> 
>>> This was all fixed in the kernel, upgrade your kernel and forking
>>> works much more reliably, but I'm not sure this case will work.
>> 
>> I agree, that won’t help here.
>> 
>>> It is a libfabric problem if it is expecting memory to be registered
>>> for RDMA and be used by both processes in a fork. That cannot work.
>>> 
>>> Don't do that, or make the memory MAP_SHARED so that the fork children
>>> can access it.
>> 
>> Libfabric agrees, it wants to use separate registered memory in the
>> child, but there doesn’t seem to be a way to do this.
> 
> How can that be true? libfabric is the only entity that causes memory
> to be registered :)
> 
>>> The bugs seem a bit confused, there is no issue with ibv_device
>>> sharing. Only with actually sharing underlying registered memory. Ie
>>> sharing a SRQ memory pool between the child and parent.
>> 
>> Libfabric calls rdma_get_devices(), then walks the list looking for
>> the entry for the correct domain (mlx5_1).  It saves a pointer to
>> the matching dev_list entry which is an ibv_context structure.
>> Wrapped on that ibv_context is the mlx5 context which contains the
>> registered pages that had dontfork set when the parent established
>  ^^^^^^^^^^^^^^^^
> 
> It does not. context don't have pages, your problem comes from
> something else.

My terminology may be incorrect; certainly my knowledge is limited.

See routine __add_page() in providers/mlx5/dbrec.c.  It calls either mlx5_alloc_buf() or mlx5_alloc_buf_extern() to allocate a page.  Those routines call ibv_dontfork_range() on the page after it has been allocated via posix_memalign().  __add_page() then adds the new page to the mlx5_context field dbr_available_pages.  Later, mlx5_create_srq() calls mlx5_alloc_dbrec() to allocate space out of the page; it returns a __be32 pointer, which mlx5_create_srq() stores in srq->db.  The routine then executes *srq->db = 0 to initialize the space.
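A heavily simplified sketch of that allocation path, paraphrasing the routines named above (the real dbrec.c keeps a list of pages with per-slot bitmaps on the mlx5 context; the slot size and bookkeeping here are illustrative):

/* Paraphrase of the doorbell-record path described above; not the mlx5 source. */
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

struct db_page_sketch {
        void  *buf;    /* one page from posix_memalign()                   */
        size_t used;   /* naive bump allocator instead of the real bitmap  */
};

static uint32_t *alloc_dbrec_sketch(struct db_page_sketch *pg, size_t page_size)
{
        if (!pg->buf) {
                if (posix_memalign(&pg->buf, page_size, page_size))
                        return NULL;
                /* The real code calls ibv_dontfork_range() here; the net effect: */
                madvise(pg->buf, page_size, MADV_DONTFORK);
        }

        uint32_t *db = (uint32_t *)((char *)pg->buf + pg->used);  /* __be32 * in mlx5 */
        pg->used += 8;   /* carve the next doorbell slot (size is illustrative) */
        return db;       /* mlx5_create_srq() then does *db = 0 -> SIGSEGV in the child */
}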

When the parent process calls mlx5_create_srq() to create an SRQ, a page gets allocated and dontfork is set.  After the fork, the child process calls rdma_get_devices(), which returns the parent's ibv_context containing the above-mentioned mlx5_context.  When the child calls mlx5_create_srq(), the *srq->db = 0 statement segfaults because the space is allocated out of the same page that was allocated by the parent and is not present in the child’s memory.
> 
> Jason


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-12 16:04       ` Kevan Rehm
@ 2024-02-12 16:12         ` Jason Gunthorpe
  2024-02-12 16:37           ` Kevan Rehm
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2024-02-12 16:12 UTC (permalink / raw)
  To: Kevan Rehm
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, kevan.rehm, chien.tin.tung

On Mon, Feb 12, 2024 at 11:04:36AM -0500, Kevan Rehm wrote:

> Those routines call ibv_dontfork_range on the page after it’s been
> allocated via posix_memalign().  _add_page() then adds the new page
> to the mlx5_context field dbr_available_pages.

Oh, if this is your trouble then upgrade your kernel. This part is
fixed on kernels that have working fork support.

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-12 16:12         ` Jason Gunthorpe
@ 2024-02-12 16:37           ` Kevan Rehm
  2024-02-12 16:45             ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Kevan Rehm @ 2024-02-12 16:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, kevan.rehm, chien.tin.tung



> On Feb 12, 2024, at 11:12 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> On Mon, Feb 12, 2024 at 11:04:36AM -0500, Kevan Rehm wrote:
> 
>> Those routines call ibv_dontfork_range on the page after it’s been
>> allocated via posix_memalign().  _add_page() then adds the new page
>> to the mlx5_context field dbr_available_pages.
> 
> Oh, if this is your trouble then upgrade your kernel. This part is
> fixed on kernels that have working fork support.

That’s the bit that confuses me: all of this is happening in user space, so what is different in the kernel that would prevent this problem from occurring in user space?  Any guess as to how much newer a kernel must be?
> 
> Jason


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-12 16:37           ` Kevan Rehm
@ 2024-02-12 16:45             ` Jason Gunthorpe
  2024-02-16 19:56               ` Kevan Rehm
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2024-02-12 16:45 UTC (permalink / raw)
  To: Kevan Rehm
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, kevan.rehm, chien.tin.tung

On Mon, Feb 12, 2024 at 11:37:39AM -0500, Kevan Rehm wrote:
> 
> 
> > On Feb 12, 2024, at 11:12 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > 
> > On Mon, Feb 12, 2024 at 11:04:36AM -0500, Kevan Rehm wrote:
> > 
> >> Those routines call ibv_dontfork_range on the page after it’s been
> >> allocated via posix_memalign().  _add_page() then adds the new page
> >> to the mlx5_context field dbr_available_pages.
> > 
> > Oh, if this is your trouble then upgrade your kernel. This part is
> > fixed on kernels that have working fork support.
> 
> That’s the bit that confuses me; all this is happening in user
> space, what is different in the kernel that would prevent this
> problem from occurring in user space?  Any guess as to how much
> newer a kernel must be?

Newer kernels are detected and disable the DONT_FORK calls in verbs.

rdma-core support is present since:

commit 67b00c3835a3480a035a9e1bcf5695f5c0e8568e
Author: Gal Pressman <galpress@amazon.com>
Date:   Sun Apr 4 17:24:54 2021 +0300

    verbs: Report when ibv_fork_init() is not needed
    
    Identify kernels which do not require ibv_fork_init() to be called and
    report it through the ibv_is_fork_initialized() verb.
    
    The feature detection is done through a new read-only attribute in the
    get sys netlink command. If the attribute is not reported, assume old
    kernel without COF support. If the attribute is reported, use the
    returned value.
    
    This allows ibv_is_fork_initialized() to return the previously unused
    IBV_FORK_UNNEEDED value, which takes precedence over the
    DISABLED/ENABLED values. Meaning that if the kernel does not require a
    call to ibv_fork_init(), IBV_FORK_UNNEEDED will be returned regardless
    of whether ibv_fork_init() was called or not.
    
    Signed-off-by: Gal Pressman <galpress@amazon.com>

The kernel support was in v5.13-rc1~78^2~1

And backported in a few cases.
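On the caller's side, the check that makes use of this looks roughly as follows (a sketch of the intended use of ibv_is_fork_initialized(); enum values as in the commit quoted above):

/* Sketch: only arm the old userspace fork tracking when the kernel does not
 * already handle copy-on-fork for pinned pages. */
#include <infiniband/verbs.h>

static int maybe_enable_fork_protection(void)
{
        if (ibv_is_fork_initialized() == IBV_FORK_UNNEEDED)
                return 0;           /* kernel has copy-on-fork; nothing to do */

        return ibv_fork_init();     /* older kernels: fall back to MADV_DONTFORK tracking */
}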

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
@ 2024-02-13 16:45 Kevan Rehm
  0 siblings, 0 replies; 14+ messages in thread
From: Kevan Rehm @ 2024-02-13 16:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma, Yishai Hadas,
	chien.tin.tung, kevan.rehm

Newer kernels are detected and disable the DONT_FORK calls in verbs.
> 
> rdma-core support is present since:
> 
> commit 67b00c3835a3480a035a9e1bcf5695f5c0e8568e
> Author: Gal Pressman <galpress@amazon.com>
> Date:   Sun Apr 4 17:24:54 2021 +0300
> 
>    verbs: Report when ibv_fork_init() is not needed
> 
>    Identify kernels which do not require ibv_fork_init() to be called and
>    report it through the ibv_is_fork_initialized() verb.
> 
>    The feature detection is done through a new read-only attribute in the
>    get sys netlink command. If the attribute is not reported, assume old
>    kernel without COF support. If the attribute is reported, use the
>    returned value.
> 
>    This allows ibv_is_fork_initialized() to return the previously unused
>    IBV_FORK_UNNEEDED value, which takes precedence over the
>    DISABLED/ENABLED values. Meaning that if the kernel does not require a
>    call to ibv_fork_init(), IBV_FORK_UNNEEDED will be returned regardless
>    of whether ibv_fork_init() was called or not.
> 
>    Signed-off-by: Gal Pressman <galpress@amazon.com>
> 
> The kernel support was in v5.13-rc1~78^2~1
> 
> And backported in a few cases.
> 
> Jason

The above info was immensely helpful, and I am running MOFED 23.10-OFED.23.10.0.5.5.1, so my kernel already has the fork improvements.  However, there are still issues, as the above requires all callers to check ibv_is_fork_initialized() before every call to ibv_fork_init().  Not everyone does this.

Routine ibv_get_device() unconditionally calls ibverbs_init() on the first call, and that routine calls ibv_fork_init() if either RDMA_FORK_SAFE or IBV_FORK_SAFE is set, even if the kernel has the fork enhancements.  I wrapped that check with a call to ibv_is_fork_initialized() and skipped the ibv_fork_init() call if IBV_FORK_UNNEEDED was returned.  This caused my little test program to run successfully, but the original benchmark still bombed.

The benchmark uses MPI.  It turns out that mpi4py calls PMPI_Init(), which eventually makes UCX calls, and routine uct_ib_md_open() in UCX calls ibv_fork_init() without first calling ibv_is_fork_initialized().  It is looking at some md_config->fork_init variable, not checking the kernel support.  In order to cover all potential cases, I changed my rdma-core patch to instead call ibv_is_fork_initialized() inside ibv_fork_init() itself and return 0 without creating mm_root if kernel support is there.  This makes MPI and the original benchmark work.

Is this a reasonable fix that could be added to rdma?

[root@delphi-029 libibverbs]# diff -C 5 memory.c.orig memory.c
*** memory.c.orig	2024-02-13 09:45:28.078997178 -0600
--- memory.c	2024-02-13 09:27:46.901699958 -0600
***************
*** 140,149 ****
--- 140,152 ----
  		huge_page_enabled = 1;
  
  	if (mm_root)
  		return 0;
  
+ 	if (ibv_is_fork_initialized() == IBV_FORK_UNNEEDED)
+ 		return 0;
+ 
  	if (too_late)
  		return EINVAL;
  
  	fprintf(stderr, "ibv_fork_init creating mm_root\n");
  	page_size = sysconf(_SC_PAGESIZE);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
  2024-02-12 16:45             ` Jason Gunthorpe
@ 2024-02-16 19:56               ` Kevan Rehm
  0 siblings, 0 replies; 14+ messages in thread
From: Kevan Rehm @ 2024-02-16 19:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, kevan.rehm, chien.tin.tung

> 
> Newer kernels are detected and disable the DONT_FORK calls in verbs.
> 
> rdma-core support is present since:
> 
> commit 67b00c3835a3480a035a9e1bcf5695f5c0e8568e
> Author: Gal Pressman <galpress@amazon.com>
> Date:   Sun Apr 4 17:24:54 2021 +0300
> 
>    verbs: Report when ibv_fork_init() is not needed
> 
>    Identify kernels which do not require ibv_fork_init() to be called and
>    report it through the ibv_is_fork_initialized() verb.
> 
>    The feature detection is done through a new read-only attribute in the
>    get sys netlink command. If the attribute is not reported, assume old
>    kernel without COF support. If the attribute is reported, use the
>    returned value.
> 
>    This allows ibv_is_fork_initialized() to return the previously unused
>    IBV_FORK_UNNEEDED value, which takes precedence over the
>    DISABLED/ENABLED values. Meaning that if the kernel does not require a
>    call to ibv_fork_init(), IBV_FORK_UNNEEDED will be returned regardless
>    of whether ibv_fork_init() was called or not.
> 
>    Signed-off-by: Gal Pressman <galpress@amazon.com>
> 
> The kernel support was in v5.13-rc1~78^2~1
> 
> And backported in a few cases.

To work around this, I had to use gdb on my benchmark to set a breakpoint in ibv_fork_init() in order to track down all the callers of that function, which turned out to be both UCX and libfabric.  I then had to download the source repos, examine the code, and for each repo determine which environment variable controls the calls to ibv_fork_init().  For libfabric I had to ensure that RDMA_FORK_SAFE and IBV_FORK_SAFE, which my team members routinely use, were not set.  For UCX I had to set UCX_IB_FORK_INIT=no; otherwise UCX always calls ibv_fork_init() by default.  With UCX_IB_FORK_INIT set to no, scary error messages about registered memory corruption print to stderr whenever there is a fork, even though that is no longer true with up-to-date kernels.  Folks who don’t know the details of ibv_fork_init() behavior are going to be reluctant to set UCX_IB_FORK_INIT=no.

If ibv_fork_init() would check the kernel and just return without initializing mm_root when the kernel has enhanced fork support, then all of the environment-variable hassles go away: the settings don’t matter, and ibv_fork_init() will always do the right thing.  This seems like a big win to me; am I missing some downside?

Thanks, Kevan





^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Segfault in mlx5 driver on infiniband after application fork
@ 2024-02-21 12:51 Kevan Rehm
  0 siblings, 0 replies; 14+ messages in thread
From: Kevan Rehm @ 2024-02-21 12:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, Kevan Rehm, chien.tin.tung@intel.com, Kevan Rehm

I posted PR #1431 for this.  I tested with IBV_FORK_SAFE and RDMA_FORK_SAFE set and unset, and with UCX_IB_FORK_INIT unset, set to no, and set to yes.  All combinations work correctly without a segfault.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-02-21 12:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-07 19:17 Segfault in mlx5 driver on infiniband after application fork Rehm, Kevan
2024-02-08  8:52 ` Leon Romanovsky
2024-02-08  9:05   ` Mark Zhang
  -- strict thread matches above, loose matches on Subject: below --
2024-02-11 19:24 Kevan Rehm
2024-02-12 13:33 ` Jason Gunthorpe
2024-02-12 14:37   ` Kevan Rehm
2024-02-12 14:40     ` Jason Gunthorpe
2024-02-12 16:04       ` Kevan Rehm
2024-02-12 16:12         ` Jason Gunthorpe
2024-02-12 16:37           ` Kevan Rehm
2024-02-12 16:45             ` Jason Gunthorpe
2024-02-16 19:56               ` Kevan Rehm
2024-02-13 16:45 Kevan Rehm
2024-02-21 12:51 Kevan Rehm
