AMD-GFX Archive mirror
 help / color / mirror / Atom feed
From: Felix Kuehling <felix.kuehling@amd.com>
To: Philip Yang <Philip.Yang@amd.com>, amd-gfx@lists.freedesktop.org
Cc: christian.koenig@amd.com
Subject: Re: [PATCH] drm/amdkfd: Remove arbitrary timeout for hmm_range_fault
Date: Mon, 6 May 2024 15:54:50 -0400	[thread overview]
Message-ID: <df6a0155-61b3-415b-b23b-ff52537abf40@amd.com> (raw)
In-Reply-To: <20240501225655.5215-1-Philip.Yang@amd.com>


On 2024-05-01 18:56, Philip Yang wrote:
> On system with khugepaged enabled and user cases with THP buffer, the
> hmm_range_fault may takes > 15 seconds to return -EBUSY, the arbitrary
> timeout value is not accurate, cause memory allocation failure.
>
> Remove the arbitrary timeout value, return EAGAIN to application if
> hmm_range_fault return EBUSY, then userspace libdrm and Thunk will call
> ioctl again.
>
> Change EAGAIN to debug message as this is not error.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>

Assuming this passes your stress testing without CPU stall warnings, 
this patch is

Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c |  5 ++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c          | 12 +++---------
>   drivers/gpu/drm/amd/amdkfd/kfd_svm.c             |  5 +----
>   3 files changed, 8 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 54198c3928c7..02696c2102f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1087,7 +1087,10 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr,
>   
>   	ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages, &range);
>   	if (ret) {
> -		pr_err("%s: Failed to get user pages: %d\n", __func__, ret);
> +		if (ret == -EAGAIN)
> +			pr_debug("Failed to get user pages, try again\n");
> +		else
> +			pr_err("%s: Failed to get user pages: %d\n", __func__, ret);
>   		goto unregister_out;
>   	}
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> index 431ec72655ec..e36fede7f74c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c
> @@ -202,20 +202,12 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
>   		pr_debug("hmm range: start = 0x%lx, end = 0x%lx",
>   			hmm_range->start, hmm_range->end);
>   
> -		/* Assuming 64MB takes maximum 1 second to fault page address */
> -		timeout = max((hmm_range->end - hmm_range->start) >> 26, 1UL);
> -		timeout *= HMM_RANGE_DEFAULT_TIMEOUT;
> -		timeout = jiffies + msecs_to_jiffies(timeout);
> +		timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>   
>   retry:
>   		hmm_range->notifier_seq = mmu_interval_read_begin(notifier);
>   		r = hmm_range_fault(hmm_range);
>   		if (unlikely(r)) {
> -			schedule();
> -			/*
> -			 * FIXME: This timeout should encompass the retry from
> -			 * mmu_interval_read_retry() as well.
> -			 */
>   			if (r == -EBUSY && !time_after(jiffies, timeout))
>   				goto retry;
>   			goto out_free_pfns;
> @@ -247,6 +239,8 @@ int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier,
>   out_free_range:
>   	kfree(hmm_range);
>   
> +	if (r == -EBUSY)
> +		r = -EAGAIN;
>   	return r;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index 94f83be2232d..e7040f809f33 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -1670,11 +1670,8 @@ static int svm_range_validate_and_map(struct mm_struct *mm,
>   						       readonly, owner, NULL,
>   						       &hmm_range);
>   			WRITE_ONCE(p->svms.faulting_task, NULL);
> -			if (r) {
> +			if (r)
>   				pr_debug("failed %d to get svm range pages\n", r);
> -				if (r == -EBUSY)
> -					r = -EAGAIN;
> -			}
>   		} else {
>   			r = -EFAULT;
>   		}

      parent reply	other threads:[~2024-05-06 19:55 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-01 22:56 [PATCH] drm/amdkfd: Remove arbitrary timeout for hmm_range_fault Philip Yang
2024-05-02  4:09 ` Chen, Xiaogang
2024-05-02 14:06   ` Philip Yang
2024-05-02 12:42 ` James Zhu
2024-05-02 15:05   ` Philip Yang
2024-05-06 19:54 ` Felix Kuehling [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=df6a0155-61b3-415b-b23b-ff52537abf40@amd.com \
    --to=felix.kuehling@amd.com \
    --cc=Philip.Yang@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).