dri-devel Archive mirror
* [PATCH v2 0/9] DRM scheduler changes for Xe
@ 2023-08-11  2:31 Matthew Brost
  2023-08-11  2:31 ` [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
                   ` (9 more replies)
  0 siblings, 10 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first.

This is a continuation of an RFC [3] with all comments addressed, ready for
a full review, and hopefully in a state that can be merged in the near
future. More details of this series can be found in the cover letter of the
RFC [3].

These changes have been tested with the Xe driver.

v2:
 - Break run job, free job, and process message into their own work items
 - This might break other drivers, as run job and free job can now run in
   parallel; can fix up if needed

Matt

Matthew Brost (9):
  drm/sched: Convert drm scheduler to use a work queue rather than
    kthread
  drm/sched: Move schedule policy to scheduler / entity
  drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
  drm/sched: Split free_job into own work item
  drm/sched: Add generic scheduler message interface
  drm/sched: Add drm_sched_start_timeout_unlocked helper
  drm/sched: Start run wq before TDR in drm_sched_start
  drm/sched: Submit job before starting TDR
  drm/sched: Add helper to set TDR timeout

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   3 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   5 +-
 drivers/gpu/drm/lima/lima_sched.c          |   5 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |   5 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c    |   5 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    |   5 +-
 drivers/gpu/drm/scheduler/sched_entity.c   |  85 ++++-
 drivers/gpu/drm/scheduler/sched_fence.c    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 408 ++++++++++++++++-----
 drivers/gpu/drm/v3d/v3d_sched.c            |  25 +-
 include/drm/gpu_scheduler.h                |  75 +++-
 11 files changed, 487 insertions(+), 136 deletions(-)

-- 
2.34.1



* [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-16 11:30   ` Danilo Krummrich
  2023-08-11  2:31 ` [PATCH v2 2/9] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

In Xe, the new Intel GPU driver, a choice has been made to have a 1 to 1
mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
seems a bit odd, but let us explain the reasoning below.

1. In Xe the submission order from multiple drm_sched_entity is not
guaranteed to match the completion order, even when targeting the same
hardware engine. This is because Xe has a firmware scheduler, the GuC, which
is allowed to reorder, timeslice, and preempt submissions. If a shared
drm_gpu_scheduler is used across multiple drm_sched_entity, the TDR falls
apart as the TDR expects submission order == completion order. Using a
dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.

2. In Xe submissions are done by programming a ring buffer (circular
buffer), and a drm_gpu_scheduler provides a limit on the number of in-flight
jobs. If that limit is set to RING_SIZE / MAX_SIZE_PER_JOB, we get flow
control on the ring for free.
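As a minimal sketch of that arithmetic (the values and names below are
placeholders for illustration, not code from this series): with a 16 KiB
ring and a worst-case job footprint of 1 KiB, the scheduler never hands the
ring more jobs than it can hold.

/* Hypothetical ring parameters, for illustration only. */
#define RING_SIZE		SZ_16K	/* bytes available in the ring */
#define MAX_SIZE_PER_JOB	SZ_1K	/* worst-case ring space per job */

u32 hw_submission = RING_SIZE / MAX_SIZE_PER_JOB;	/* 16 in-flight jobs */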

A problem with this design is that a drm_gpu_scheduler currently uses a
kthread for submission / job cleanup. This doesn't scale when a large number
of drm_gpu_scheduler are used. To work around the scaling issue, use a work
item rather than a kthread for submission / job cleanup.
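With this change a driver may also pass in its own submission workqueue
(NULL selects system_wq). A rough sketch of the per-entity setup described
above, where q, xe_sched_ops, submit_wq, hw_submission, timeout, and dev are
placeholder driver-side names rather than code from this patch:

/* One scheduler per entity; all schedulers share one submit workqueue. */
err = drm_sched_init(&q->sched, &xe_sched_ops, submit_wq,
		     hw_submission, 0 /* hang_limit */,
		     timeout, NULL /* timeout_wq */,
		     NULL /* score */, q->name, dev);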

v2:
  - (Rob Clark) Fix msm build
  - Pass in run work queue
v3:
  - (Boris) don't have loop in worker
v4:
  - (Tvrtko) break out submit ready, stop, start helpers into own patch

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   2 +-
 drivers/gpu/drm/lima/lima_sched.c          |   2 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |   2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c    |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 106 +++++++++------------
 drivers/gpu/drm/v3d/v3d_sched.c            |  10 +-
 include/drm/gpu_scheduler.h                |  12 ++-
 9 files changed, 65 insertions(+), 75 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5f5efec6c444..3ebf9882edba 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2331,7 +2331,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 			break;
 		}
 
-		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops, NULL,
 				   ring->num_hw_submission, 0,
 				   timeout, adev->reset_domain->wq,
 				   ring->sched_score, ring->name,
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 1ae87dfd19c4..8486a2923f1b 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -133,7 +133,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 {
 	int ret;
 
-	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
+	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
 			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
 			     msecs_to_jiffies(500), NULL, NULL,
 			     dev_name(gpu->dev), gpu->dev);
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index ffd91a5ee299..8d858aed0e56 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -488,7 +488,7 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 
 	INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
-	return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
+	return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
 			      lima_job_hang_limit,
 			      msecs_to_jiffies(timeout), NULL,
 			      NULL, name, pipe->ldev->dev);
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index b60199184409..79aa112118da 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -93,7 +93,7 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
 	 /* currently managing hangcheck ourselves: */
 	sched_timeout = MAX_SCHEDULE_TIMEOUT;
 
-	ret = drm_sched_init(&ring->sched, &msm_sched_ops,
+	ret = drm_sched_init(&ring->sched, &msm_sched_ops, NULL,
 			num_hw_submissions, 0, sched_timeout,
 			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
 	if (ret) {
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 3424a1bf6af3..2d607de5d393 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -407,7 +407,7 @@ int nouveau_sched_init(struct nouveau_drm *drm)
 	if (!drm->sched_wq)
 		return -ENOMEM;
 
-	return drm_sched_init(sched, &nouveau_sched_ops,
+	return drm_sched_init(sched, &nouveau_sched_ops, NULL,
 			      NOUVEAU_SCHED_HW_SUBMISSIONS, 0, job_hang_limit,
 			      NULL, NULL, "nouveau_sched", drm->dev->dev);
 }
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index a8b4827dc425..4efbc8aa3c2f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -831,7 +831,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 		js->queue[j].fence_context = dma_fence_context_alloc(1);
 
 		ret = drm_sched_init(&js->queue[j].sched,
-				     &panfrost_sched_ops,
+				     &panfrost_sched_ops, NULL,
 				     nentries, 0,
 				     msecs_to_jiffies(JOB_TIMEOUT_MS),
 				     pfdev->reset.wq,
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e4fa62abca41..614e8c97a622 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -48,7 +48,6 @@
  * through the jobs entity pointer.
  */
 
-#include <linux/kthread.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/completion.h>
@@ -256,6 +255,16 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 	return rb ? rb_entry(rb, struct drm_sched_entity, rb_tree_node) : NULL;
 }
 
+/**
+ * drm_sched_submit_queue - scheduler queue submission
+ * @sched: scheduler instance
+ */
+static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
+{
+	if (!READ_ONCE(sched->pause_submit))
+		queue_work(sched->submit_wq, &sched->work_submit);
+}
+
 /**
  * drm_sched_job_done - complete a job
  * @s_job: pointer to the job which is done
@@ -275,7 +284,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
 	dma_fence_get(&s_fence->finished);
 	drm_sched_fence_finished(s_fence, result);
 	dma_fence_put(&s_fence->finished);
-	wake_up_interruptible(&sched->wake_up_worker);
+	drm_sched_submit_queue(sched);
 }
 
 /**
@@ -868,7 +877,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
 void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
 {
 	if (drm_sched_can_queue(sched))
-		wake_up_interruptible(&sched->wake_up_worker);
+		drm_sched_submit_queue(sched);
 }
 
 /**
@@ -978,61 +987,42 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_blocked - check if the scheduler is blocked
- *
- * @sched: scheduler instance
- *
- * Returns true if blocked, otherwise false.
- */
-static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
-{
-	if (kthread_should_park()) {
-		kthread_parkme();
-		return true;
-	}
-
-	return false;
-}
-
 /**
  * drm_sched_main - main scheduler thread
  *
  * @param: scheduler instance
- *
- * Returns 0.
  */
-static int drm_sched_main(void *param)
+static void drm_sched_main(struct work_struct *w)
 {
-	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
+	struct drm_gpu_scheduler *sched =
+		container_of(w, struct drm_gpu_scheduler, work_submit);
+	struct drm_sched_entity *entity;
+	struct drm_sched_job *cleanup_job;
 	int r;
 
-	sched_set_fifo_low(current);
+	if (READ_ONCE(sched->pause_submit))
+		return;
 
-	while (!kthread_should_stop()) {
-		struct drm_sched_entity *entity = NULL;
-		struct drm_sched_fence *s_fence;
-		struct drm_sched_job *sched_job;
-		struct dma_fence *fence;
-		struct drm_sched_job *cleanup_job = NULL;
+	cleanup_job = drm_sched_get_cleanup_job(sched);
+	entity = drm_sched_select_entity(sched);
 
-		wait_event_interruptible(sched->wake_up_worker,
-					 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
-					 (!drm_sched_blocked(sched) &&
-					  (entity = drm_sched_select_entity(sched))) ||
-					 kthread_should_stop());
+	if (!entity && !cleanup_job)
+		return;	/* No more work */
 
-		if (cleanup_job)
-			sched->ops->free_job(cleanup_job);
+	if (cleanup_job)
+		sched->ops->free_job(cleanup_job);
 
-		if (!entity)
-			continue;
+	if (entity) {
+		struct dma_fence *fence;
+		struct drm_sched_fence *s_fence;
+		struct drm_sched_job *sched_job;
 
 		sched_job = drm_sched_entity_pop_job(entity);
-
 		if (!sched_job) {
 			complete_all(&entity->entity_idle);
-			continue;
+			if (!cleanup_job)
+				return;	/* No more work */
+			goto again;
 		}
 
 		s_fence = sched_job->s_fence;
@@ -1063,7 +1053,9 @@ static int drm_sched_main(void *param)
 
 		wake_up(&sched->job_scheduled);
 	}
-	return 0;
+
+again:
+	drm_sched_submit_queue(sched);
 }
 
 /**
@@ -1071,6 +1063,7 @@ static int drm_sched_main(void *param)
  *
  * @sched: scheduler instance
  * @ops: backend operations for this scheduler
+ * @submit_wq: workqueue to use for submission. If NULL, the system_wq is used
  * @hw_submission: number of hw submissions that can be in flight
  * @hang_limit: number of times to allow a job to hang before dropping it
  * @timeout: timeout value in jiffies for the scheduler
@@ -1084,14 +1077,16 @@ static int drm_sched_main(void *param)
  */
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
+		   struct workqueue_struct *submit_wq,
 		   unsigned hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev)
 {
-	int i, ret;
+	int i;
 	sched->ops = ops;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
+	sched->submit_wq = submit_wq ? : system_wq;
 	sched->timeout = timeout;
 	sched->timeout_wq = timeout_wq ? : system_wq;
 	sched->hang_limit = hang_limit;
@@ -1100,23 +1095,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
-	init_waitqueue_head(&sched->wake_up_worker);
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
+	INIT_WORK(&sched->work_submit, drm_sched_main);
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
-
-	/* Each scheduler will run on a seperate kernel thread */
-	sched->thread = kthread_run(drm_sched_main, sched, sched->name);
-	if (IS_ERR(sched->thread)) {
-		ret = PTR_ERR(sched->thread);
-		sched->thread = NULL;
-		DRM_DEV_ERROR(sched->dev, "Failed to create scheduler for %s.\n", name);
-		return ret;
-	}
+	sched->pause_submit = false;
 
 	sched->ready = true;
 	return 0;
@@ -1135,8 +1122,7 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	struct drm_sched_entity *s_entity;
 	int i;
 
-	if (sched->thread)
-		kthread_stop(sched->thread);
+	drm_sched_submit_stop(sched);
 
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
@@ -1216,7 +1202,7 @@ EXPORT_SYMBOL(drm_sched_increase_karma);
  */
 bool drm_sched_submit_ready(struct drm_gpu_scheduler *sched)
 {
-	return !!sched->thread;
+	return sched->ready;
 
 }
 EXPORT_SYMBOL(drm_sched_submit_ready);
@@ -1228,7 +1214,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
  */
 void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
 {
-	kthread_park(sched->thread);
+	WRITE_ONCE(sched->pause_submit, true);
+	cancel_work_sync(&sched->work_submit);
 }
 EXPORT_SYMBOL(drm_sched_submit_stop);
 
@@ -1239,6 +1226,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
  */
 void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
 {
-	kthread_unpark(sched->thread);
+	WRITE_ONCE(sched->pause_submit, false);
+	queue_work(sched->submit_wq, &sched->work_submit);
 }
 EXPORT_SYMBOL(drm_sched_submit_start);
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 06238e6d7f5c..38e092ea41e6 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -388,7 +388,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	int ret;
 
 	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
-			     &v3d_bin_sched_ops,
+			     &v3d_bin_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_bin", v3d->drm.dev);
@@ -396,7 +396,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		return ret;
 
 	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
-			     &v3d_render_sched_ops,
+			     &v3d_render_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_render", v3d->drm.dev);
@@ -404,7 +404,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		goto fail;
 
 	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
-			     &v3d_tfu_sched_ops,
+			     &v3d_tfu_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_tfu", v3d->drm.dev);
@@ -413,7 +413,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 
 	if (v3d_has_csd(v3d)) {
 		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
-				     &v3d_csd_sched_ops,
+				     &v3d_csd_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_csd", v3d->drm.dev);
@@ -421,7 +421,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 			goto fail;
 
 		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
-				     &v3d_cache_clean_sched_ops,
+				     &v3d_cache_clean_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_cache_clean", v3d->drm.dev);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index f12c5aea5294..278106e358a8 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -473,17 +473,16 @@ struct drm_sched_backend_ops {
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @sched_rq: priority wise array of run queues.
- * @wake_up_worker: the wait queue on which the scheduler sleeps until a job
- *                  is ready to be scheduled.
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
  * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
+ * @submit_wq: workqueue used to queue @work_submit
  * @timeout_wq: workqueue used to queue @work_tdr
+ * @work_submit: schedules jobs and cleans up entities
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
- * @thread: the kthread on which the scheduler which run.
  * @pending_list: the list of jobs which are currently in the job queue.
  * @job_list_lock: lock to protect the pending_list.
  * @hang_limit: once the hangs by a job crosses this limit then it is marked
@@ -492,6 +491,7 @@ struct drm_sched_backend_ops {
  * @_score: score used when the driver doesn't provide one
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
+ * @pause_submit: pause queuing of @work_submit on @submit_wq
  * @dev: system &struct device
  *
  * One scheduler is implemented for each hardware ring.
@@ -502,13 +502,13 @@ struct drm_gpu_scheduler {
 	long				timeout;
 	const char			*name;
 	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
-	wait_queue_head_t		wake_up_worker;
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
+	struct workqueue_struct		*submit_wq;
 	struct workqueue_struct		*timeout_wq;
+	struct work_struct		work_submit;
 	struct delayed_work		work_tdr;
-	struct task_struct		*thread;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
 	int				hang_limit;
@@ -516,11 +516,13 @@ struct drm_gpu_scheduler {
 	atomic_t                        _score;
 	bool				ready;
 	bool				free_guilty;
+	bool				pause_submit;
 	struct device			*dev;
 };
 
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
+		   struct workqueue_struct *submit_wq,
 		   uint32_t hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name, struct device *dev);
-- 
2.34.1



* [PATCH v2 2/9] drm/sched: Move schedule policy to scheduler / entity
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
  2023-08-11  2:31 ` [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-11 21:43   ` Maira Canal
  2023-08-11  2:31 ` [PATCH v2 3/9] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

Rather than a global modparam for the scheduling policy, move the policy to
the scheduler / entity so the user can control the policy of each
scheduler / entity.
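
For example, after this patch a driver chooses the policy at init time; a
minimal sketch, where my_sched_ops, hw_submission, timeout, and dev are
placeholder driver values:

/* Keep the modparam-selected default policy... */
ret = drm_sched_init(&sched, &my_sched_ops, NULL, hw_submission, 0,
		     timeout, NULL, NULL, "my_ring",
		     DRM_SCHED_POLICY_DEFAULT, dev);

/* ...or force FIFO for this scheduler regardless of the modparam. */
ret = drm_sched_init(&sched, &my_sched_ops, NULL, hw_submission, 0,
		     timeout, NULL, NULL, "my_ring",
		     DRM_SCHED_POLICY_FIFO, dev);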

v2:
  - s/DRM_SCHED_POLICY_MAX/DRM_SCHED_POLICY_COUNT (Luben)
  - Only include policy in scheduler (Luben)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  3 ++-
 drivers/gpu/drm/lima/lima_sched.c          |  3 ++-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |  3 ++-
 drivers/gpu/drm/nouveau/nouveau_sched.c    |  3 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c    |  3 ++-
 drivers/gpu/drm/scheduler/sched_entity.c   | 24 ++++++++++++++++++----
 drivers/gpu/drm/scheduler/sched_main.c     | 23 +++++++++++++++------
 drivers/gpu/drm/v3d/v3d_sched.c            | 15 +++++++++-----
 include/drm/gpu_scheduler.h                | 20 ++++++++++++------
 10 files changed, 72 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3ebf9882edba..993a637dca0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2335,6 +2335,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 				   ring->num_hw_submission, 0,
 				   timeout, adev->reset_domain->wq,
 				   ring->sched_score, ring->name,
+				   DRM_SCHED_POLICY_DEFAULT,
 				   adev->dev);
 		if (r) {
 			DRM_ERROR("Failed to create scheduler on ring %s.\n",
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 8486a2923f1b..61204a3f8b0b 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -136,7 +136,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops, NULL,
 			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
 			     msecs_to_jiffies(500), NULL, NULL,
-			     dev_name(gpu->dev), gpu->dev);
+			     dev_name(gpu->dev), DRM_SCHED_POLICY_DEFAULT,
+			     gpu->dev);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 8d858aed0e56..465d4bf3882b 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -491,7 +491,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 	return drm_sched_init(&pipe->base, &lima_sched_ops, NULL, 1,
 			      lima_job_hang_limit,
 			      msecs_to_jiffies(timeout), NULL,
-			      NULL, name, pipe->ldev->dev);
+			      NULL, name, DRM_SCHED_POLICY_DEFAULT,
+			      pipe->ldev->dev);
 }
 
 void lima_sched_pipe_fini(struct lima_sched_pipe *pipe)
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 79aa112118da..78f05be89b61 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -95,7 +95,8 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
 
 	ret = drm_sched_init(&ring->sched, &msm_sched_ops, NULL,
 			num_hw_submissions, 0, sched_timeout,
-			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
+			NULL, NULL, to_msm_bo(ring->bo)->name,
+			DRM_SCHED_POLICY_DEFAULT, gpu->dev->dev);
 	if (ret) {
 		goto fail;
 	}
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 2d607de5d393..59c263acf347 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -409,7 +409,8 @@ int nouveau_sched_init(struct nouveau_drm *drm)
 
 	return drm_sched_init(sched, &nouveau_sched_ops, NULL,
 			      NOUVEAU_SCHED_HW_SUBMISSIONS, 0, job_hang_limit,
-			      NULL, NULL, "nouveau_sched", drm->dev->dev);
+			      NULL, NULL, "nouveau_sched",
+			      DRM_SCHED_POLICY_DEFAULT, drm->dev->dev);
 }
 
 void nouveau_sched_fini(struct nouveau_drm *drm)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 4efbc8aa3c2f..c051aa01bb92 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -835,7 +835,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 				     nentries, 0,
 				     msecs_to_jiffies(JOB_TIMEOUT_MS),
 				     pfdev->reset.wq,
-				     NULL, "pan_js", pfdev->dev);
+				     NULL, "pan_js", DRM_SCHED_POLICY_DEFAULT,
+				     pfdev->dev);
 		if (ret) {
 			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
 			goto err_sched;
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index a42763e1429d..65a972b52eda 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -33,6 +33,20 @@
 #define to_drm_sched_job(sched_job)		\
 		container_of((sched_job), struct drm_sched_job, queue_node)
 
+static bool bad_policies(struct drm_gpu_scheduler **sched_list,
+			 unsigned int num_sched_list)
+{
+	enum drm_sched_policy sched_policy = sched_list[0]->sched_policy;
+	unsigned int i;
+
+	/* All schedule policies must match */
+	for (i = 1; i < num_sched_list; ++i)
+		if (sched_policy != sched_list[i]->sched_policy)
+			return true;
+
+	return false;
+}
+
 /**
  * drm_sched_entity_init - Init a context entity used by scheduler when
  * submit to HW ring.
@@ -62,7 +76,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 			  unsigned int num_sched_list,
 			  atomic_t *guilty)
 {
-	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])))
+	if (!(entity && sched_list && (num_sched_list == 0 || sched_list[0])) ||
+	    bad_policies(sched_list, num_sched_list))
 		return -EINVAL;
 
 	memset(entity, 0, sizeof(struct drm_sched_entity));
@@ -486,7 +501,7 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 	 * Update the entity's location in the min heap according to
 	 * the timestamp of the next job, if any.
 	 */
-	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO) {
+	if (entity->rq->sched->sched_policy == DRM_SCHED_POLICY_FIFO) {
 		struct drm_sched_job *next;
 
 		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
@@ -558,7 +573,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 {
 	struct drm_sched_entity *entity = sched_job->entity;
-	bool first;
+	bool first, fifo = entity->rq->sched->sched_policy ==
+		DRM_SCHED_POLICY_FIFO;
 	ktime_t submit_ts;
 
 	trace_drm_sched_job(sched_job, entity);
@@ -587,7 +603,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		drm_sched_rq_add_entity(entity->rq, entity);
 		spin_unlock(&entity->rq_lock);
 
-		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+		if (fifo)
 			drm_sched_rq_update_fifo(entity, submit_ts);
 
 		drm_sched_wakeup_if_can_queue(entity->rq->sched);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 614e8c97a622..545d5298c086 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -66,14 +66,14 @@
 #define to_drm_sched_job(sched_job)		\
 		container_of((sched_job), struct drm_sched_job, queue_node)
 
-int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
+int default_drm_sched_policy = DRM_SCHED_POLICY_FIFO;
 
 /**
  * DOC: sched_policy (int)
  * Used to override default entities scheduling policy in a run queue.
  */
-MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
-module_param_named(sched_policy, drm_sched_policy, int, 0444);
+MODULE_PARM_DESC(sched_policy, "Specify the default scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
+module_param_named(sched_policy, default_drm_sched_policy, int, 0444);
 
 static __always_inline bool drm_sched_entity_compare_before(struct rb_node *a,
 							    const struct rb_node *b)
@@ -177,7 +177,7 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 	if (rq->current_entity == entity)
 		rq->current_entity = NULL;
 
-	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
+	if (rq->sched->sched_policy == DRM_SCHED_POLICY_FIFO)
 		drm_sched_rq_remove_fifo_locked(entity);
 
 	spin_unlock(&rq->lock);
@@ -898,7 +898,7 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 
 	/* Kernel run queue has higher priority than normal run queue*/
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
-		entity = drm_sched_policy == DRM_SCHED_POLICY_FIFO ?
+		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
 			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
 			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
 		if (entity)
@@ -1071,6 +1071,7 @@ static void drm_sched_main(struct work_struct *w)
  *		used
  * @score: optional score atomic shared with other schedulers
  * @name: name used for debugging
+ * @sched_policy: schedule policy
  * @dev: target &struct device
  *
  * Return 0 on success, otherwise error code.
@@ -1080,9 +1081,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   struct workqueue_struct *submit_wq,
 		   unsigned hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
-		   atomic_t *score, const char *name, struct device *dev)
+		   atomic_t *score, const char *name,
+		   enum drm_sched_policy sched_policy,
+		   struct device *dev)
 {
 	int i;
+
+	if (sched_policy >= DRM_SCHED_POLICY_COUNT)
+		return -EINVAL;
+
 	sched->ops = ops;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
@@ -1092,6 +1099,10 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	sched->hang_limit = hang_limit;
 	sched->score = score ? score : &sched->_score;
 	sched->dev = dev;
+	if (sched_policy == DRM_SCHED_POLICY_DEFAULT)
+		sched->sched_policy = default_drm_sched_policy;
+	else
+		sched->sched_policy = sched_policy;
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 38e092ea41e6..5e3fe77fa991 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -391,7 +391,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 			     &v3d_bin_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
-			     NULL, "v3d_bin", v3d->drm.dev);
+			     NULL, "v3d_bin", DRM_SCHED_POLICY_DEFAULT,
+			     v3d->drm.dev);
 	if (ret)
 		return ret;
 
@@ -399,7 +400,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 			     &v3d_render_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
-			     NULL, "v3d_render", v3d->drm.dev);
+			     NULL, "v3d_render", DRM_SCHED_POLICY_DEFAULT,
+			     v3d->drm.dev);
 	if (ret)
 		goto fail;
 
@@ -407,7 +409,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 			     &v3d_tfu_sched_ops, NULL,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
-			     NULL, "v3d_tfu", v3d->drm.dev);
+			     NULL, "v3d_tfu", DRM_SCHED_POLICY_DEFAULT,
+			     v3d->drm.dev);
 	if (ret)
 		goto fail;
 
@@ -416,7 +419,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 				     &v3d_csd_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
-				     NULL, "v3d_csd", v3d->drm.dev);
+				     NULL, "v3d_csd", DRM_SCHED_POLICY_DEFAULT,
+				     v3d->drm.dev);
 		if (ret)
 			goto fail;
 
@@ -424,7 +428,8 @@ v3d_sched_init(struct v3d_dev *v3d)
 				     &v3d_cache_clean_sched_ops, NULL,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
-				     NULL, "v3d_cache_clean", v3d->drm.dev);
+				     NULL, "v3d_cache_clean",
+				     DRM_SCHED_POLICY_DEFAULT, v3d->drm.dev);
 		if (ret)
 			goto fail;
 	}
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 278106e358a8..897d52a4ff4f 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -72,11 +72,15 @@ enum drm_sched_priority {
 	DRM_SCHED_PRIORITY_UNSET = -2
 };
 
-/* Used to chose between FIFO and RR jobs scheduling */
-extern int drm_sched_policy;
-
-#define DRM_SCHED_POLICY_RR    0
-#define DRM_SCHED_POLICY_FIFO  1
+/* Used to choose the default scheduling policy */
+extern int default_drm_sched_policy;
+
+enum drm_sched_policy {
+	DRM_SCHED_POLICY_DEFAULT,
+	DRM_SCHED_POLICY_RR,
+	DRM_SCHED_POLICY_FIFO,
+	DRM_SCHED_POLICY_COUNT,
+};
 
 /**
  * struct drm_sched_entity - A wrapper around a job queue (typically
@@ -489,6 +493,7 @@ struct drm_sched_backend_ops {
  *              guilty and it will no longer be considered for scheduling.
  * @score: score to help loadbalancer pick a idle sched
  * @_score: score used when the driver doesn't provide one
+ * @sched_policy: Schedule policy for scheduler
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
  * @pause_submit: pause queuing of @work_submit on @submit_wq
@@ -514,6 +519,7 @@ struct drm_gpu_scheduler {
 	int				hang_limit;
 	atomic_t                        *score;
 	atomic_t                        _score;
+	enum drm_sched_policy		sched_policy;
 	bool				ready;
 	bool				free_guilty;
 	bool				pause_submit;
@@ -525,7 +531,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   struct workqueue_struct *submit_wq,
 		   uint32_t hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
-		   atomic_t *score, const char *name, struct device *dev);
+		   atomic_t *score, const char *name,
+		   enum drm_sched_policy sched_policy,
+		   struct device *dev);
 
 void drm_sched_fini(struct drm_gpu_scheduler *sched);
 int drm_sched_job_init(struct drm_sched_job *job,
-- 
2.34.1



* [PATCH v2 3/9] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
  2023-08-11  2:31 ` [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
  2023-08-11  2:31 ` [PATCH v2 2/9] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-29 17:37   ` Danilo Krummrich
  2023-08-11  2:31 ` [PATCH v2 4/9] drm/sched: Split free_job into own work item Matthew Brost
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

DRM_SCHED_POLICY_SINGLE_ENTITY creates a 1 to 1 relationship between
scheduler and entity. No priorities or run queues are used in this mode. It
is intended for devices with firmware schedulers.
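
A rough sketch of the intended usage, where q, fw_sched_ops, submit_wq, and
the surrounding values are placeholder driver names (the entity must be the
only one ever bound to this scheduler):

struct drm_gpu_scheduler *sched_list[] = { &q->sched };
int ret;

/* Scheduler in single-entity mode, e.g. one firmware-backed queue. */
ret = drm_sched_init(&q->sched, &fw_sched_ops, submit_wq,
		     hw_submission, 0, timeout, NULL, NULL, q->name,
		     DRM_SCHED_POLICY_SINGLE_ENTITY, dev);
if (ret)
	return ret;

/* Exactly one entity may be bound; the priority is ignored in this mode. */
ret = drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
			    sched_list, 1, NULL);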

v2:
  - Drop sched / rq union (Luben)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 69 ++++++++++++++++++------
 drivers/gpu/drm/scheduler/sched_fence.c  |  2 +-
 drivers/gpu/drm/scheduler/sched_main.c   | 63 +++++++++++++++++++---
 include/drm/gpu_scheduler.h              |  8 +++
 4 files changed, 119 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 65a972b52eda..1dec97caaba3 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -83,6 +83,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	memset(entity, 0, sizeof(struct drm_sched_entity));
 	INIT_LIST_HEAD(&entity->list);
 	entity->rq = NULL;
+	entity->single_sched = NULL;
 	entity->guilty = guilty;
 	entity->num_sched_list = num_sched_list;
 	entity->priority = priority;
@@ -90,8 +91,17 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	RCU_INIT_POINTER(entity->last_scheduled, NULL);
 	RB_CLEAR_NODE(&entity->rb_tree_node);
 
-	if(num_sched_list)
-		entity->rq = &sched_list[0]->sched_rq[entity->priority];
+	if (num_sched_list) {
+		if (sched_list[0]->sched_policy !=
+		    DRM_SCHED_POLICY_SINGLE_ENTITY) {
+			entity->rq = &sched_list[0]->sched_rq[entity->priority];
+		} else {
+			if (num_sched_list != 1 || sched_list[0]->single_entity)
+				return -EINVAL;
+			sched_list[0]->single_entity = entity;
+			entity->single_sched = sched_list[0];
+		}
+	}
 
 	init_completion(&entity->entity_idle);
 
@@ -124,7 +134,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				    struct drm_gpu_scheduler **sched_list,
 				    unsigned int num_sched_list)
 {
-	WARN_ON(!num_sched_list || !sched_list);
+	WARN_ON(!num_sched_list || !sched_list ||
+		!!entity->single_sched);
 
 	entity->sched_list = sched_list;
 	entity->num_sched_list = num_sched_list;
@@ -231,13 +242,15 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 {
 	struct drm_sched_job *job;
 	struct dma_fence *prev;
+	bool single_entity = !!entity->single_sched;
 
-	if (!entity->rq)
+	if (!entity->rq && !single_entity)
 		return;
 
 	spin_lock(&entity->rq_lock);
 	entity->stopped = true;
-	drm_sched_rq_remove_entity(entity->rq, entity);
+	if (!single_entity)
+		drm_sched_rq_remove_entity(entity->rq, entity);
 	spin_unlock(&entity->rq_lock);
 
 	/* Make sure this entity is not used by the scheduler at the moment */
@@ -259,6 +272,20 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 	dma_fence_put(prev);
 }
 
+/**
+ * drm_sched_entity_to_scheduler - Schedule entity to GPU scheduler
+ * @entity: scheduler entity
+ *
+ * Returns GPU scheduler for the entity
+ */
+struct drm_gpu_scheduler *
+drm_sched_entity_to_scheduler(struct drm_sched_entity *entity)
+{
+	bool single_entity = !!entity->single_sched;
+
+	return single_entity ? entity->single_sched : entity->rq->sched;
+}
+
 /**
  * drm_sched_entity_flush - Flush a context entity
  *
@@ -276,11 +303,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 	struct drm_gpu_scheduler *sched;
 	struct task_struct *last_user;
 	long ret = timeout;
+	bool single_entity = !!entity->single_sched;
 
-	if (!entity->rq)
+	if (!entity->rq && !single_entity)
 		return 0;
 
-	sched = entity->rq->sched;
+	sched = drm_sched_entity_to_scheduler(entity);
 	/**
 	 * The client will not queue more IBs during this fini, consume existing
 	 * queued IBs or discard them on SIGKILL
@@ -373,7 +401,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
 		container_of(cb, struct drm_sched_entity, cb);
 
 	drm_sched_entity_clear_dep(f, cb);
-	drm_sched_wakeup_if_can_queue(entity->rq->sched);
+	drm_sched_wakeup_if_can_queue(drm_sched_entity_to_scheduler(entity));
 }
 
 /**
@@ -387,6 +415,8 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
 void drm_sched_entity_set_priority(struct drm_sched_entity *entity,
 				   enum drm_sched_priority priority)
 {
+	WARN_ON(!!entity->single_sched);
+
 	spin_lock(&entity->rq_lock);
 	entity->priority = priority;
 	spin_unlock(&entity->rq_lock);
@@ -399,7 +429,7 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
  */
 static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 {
-	struct drm_gpu_scheduler *sched = entity->rq->sched;
+	struct drm_gpu_scheduler *sched = drm_sched_entity_to_scheduler(entity);
 	struct dma_fence *fence = entity->dependency;
 	struct drm_sched_fence *s_fence;
 
@@ -501,7 +531,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 	 * Update the entity's location in the min heap according to
 	 * the timestamp of the next job, if any.
 	 */
-	if (entity->rq->sched->sched_policy == DRM_SCHED_POLICY_FIFO) {
+	if (drm_sched_entity_to_scheduler(entity)->sched_policy ==
+	    DRM_SCHED_POLICY_FIFO) {
 		struct drm_sched_job *next;
 
 		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
@@ -524,6 +555,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_rq *rq;
 
+	WARN_ON(!!entity->single_sched);
+
 	/* single possible engine and already selected */
 	if (!entity->sched_list)
 		return;
@@ -573,12 +606,13 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 {
 	struct drm_sched_entity *entity = sched_job->entity;
-	bool first, fifo = entity->rq->sched->sched_policy ==
-		DRM_SCHED_POLICY_FIFO;
+	bool single_entity = !!entity->single_sched;
+	bool first;
 	ktime_t submit_ts;
 
 	trace_drm_sched_job(sched_job, entity);
-	atomic_inc(entity->rq->sched->score);
+	if (!single_entity)
+		atomic_inc(entity->rq->sched->score);
 	WRITE_ONCE(entity->last_user, current->group_leader);
 
 	/*
@@ -591,6 +625,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 
 	/* first job wakes up scheduler */
 	if (first) {
+		struct drm_gpu_scheduler *sched =
+			drm_sched_entity_to_scheduler(entity);
+		bool fifo = sched->sched_policy == DRM_SCHED_POLICY_FIFO;
+
 		/* Add the entity to the run queue */
 		spin_lock(&entity->rq_lock);
 		if (entity->stopped) {
@@ -600,13 +638,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 			return;
 		}
 
-		drm_sched_rq_add_entity(entity->rq, entity);
+		if (!single_entity)
+			drm_sched_rq_add_entity(entity->rq, entity);
 		spin_unlock(&entity->rq_lock);
 
 		if (fifo)
 			drm_sched_rq_update_fifo(entity, submit_ts);
 
-		drm_sched_wakeup_if_can_queue(entity->rq->sched);
+		drm_sched_wakeup_if_can_queue(sched);
 	}
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
index 06cedfe4b486..f6b926f5e188 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -225,7 +225,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
 {
 	unsigned seq;
 
-	fence->sched = entity->rq->sched;
+	fence->sched = drm_sched_entity_to_scheduler(entity);
 	seq = atomic_inc_return(&entity->fence_seq);
 	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
 		       &fence->lock, entity->fence_context, seq);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 545d5298c086..cede47afc800 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -32,7 +32,8 @@
  * backend operations to the scheduler like submitting a job to hardware run queue,
  * returning the dependencies of a job etc.
  *
- * The organisation of the scheduler is the following:
+ * The organisation of the scheduler is the following for scheduling policies
+ * DRM_SCHED_POLICY_RR and DRM_SCHED_POLICY_FIFO:
  *
  * 1. Each hw run queue has one scheduler
  * 2. Each scheduler has multiple run queues with different priorities
@@ -43,6 +44,23 @@
  *
  * The jobs in a entity are always scheduled in the order that they were pushed.
  *
+ * The organisation of the scheduler is the following for scheduling policy
+ * DRM_SCHED_POLICY_SINGLE_ENTITY:
+ *
+ * 1. One to one relationship between scheduler and entity
+ * 2. No priorities implemented per scheduler (single job queue)
+ * 3. No run queues in scheduler rather jobs are directly dequeued from entity
+ * 4. The entity maintains a queue of jobs that will be scheduled on the
+ * hardware
+ *
+ * The jobs in a entity are always scheduled in the order that they were pushed
+ * regardless of scheduling policy.
+ *
+ * A policy of DRM_SCHED_POLICY_RR or DRM_SCHED_POLICY_FIFO is expected to be used
+ * when the KMD is scheduling directly on the hardware while a scheduling policy
+ * of DRM_SCHED_POLICY_SINGLE_ENTITY is expected to be used when there is a
+ * firmware scheduler.
+ *
  * Note that once a job was taken from the entities queue and pushed to the
  * hardware, i.e. the pending queue, the entity must not be referenced anymore
  * through the jobs entity pointer.
@@ -96,6 +114,8 @@ static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
 
 void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
 {
+	WARN_ON(!!entity->single_sched);
+
 	/*
 	 * Both locks need to be grabbed, one to protect from entity->rq change
 	 * for entity from within concurrent drm_sched_entity_select_rq and the
@@ -126,6 +146,8 @@ void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
 static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
 			      struct drm_sched_rq *rq)
 {
+	WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	spin_lock_init(&rq->lock);
 	INIT_LIST_HEAD(&rq->entities);
 	rq->rb_tree_root = RB_ROOT_CACHED;
@@ -144,6 +166,8 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
 void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
 			     struct drm_sched_entity *entity)
 {
+	WARN_ON(!!entity->single_sched);
+
 	if (!list_empty(&entity->list))
 		return;
 
@@ -166,6 +190,8 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 				struct drm_sched_entity *entity)
 {
+	WARN_ON(!!entity->single_sched);
+
 	if (list_empty(&entity->list))
 		return;
 
@@ -641,7 +667,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
 		       struct drm_sched_entity *entity,
 		       void *owner)
 {
-	if (!entity->rq)
+	if (!entity->rq && !entity->single_sched)
 		return -ENOENT;
 
 	job->entity = entity;
@@ -674,13 +700,16 @@ void drm_sched_job_arm(struct drm_sched_job *job)
 {
 	struct drm_gpu_scheduler *sched;
 	struct drm_sched_entity *entity = job->entity;
+	bool single_entity = !!entity->single_sched;
 
 	BUG_ON(!entity);
-	drm_sched_entity_select_rq(entity);
-	sched = entity->rq->sched;
+	if (!single_entity)
+		drm_sched_entity_select_rq(entity);
+	sched = drm_sched_entity_to_scheduler(entity);
 
 	job->sched = sched;
-	job->s_priority = entity->rq - sched->sched_rq;
+	if (!single_entity)
+		job->s_priority = entity->rq - sched->sched_rq;
 	job->id = atomic64_inc_return(&sched->job_id_count);
 
 	drm_sched_fence_init(job->s_fence, job->entity);
@@ -896,6 +925,13 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 	if (!drm_sched_can_queue(sched))
 		return NULL;
 
+	if (sched->single_entity) {
+		if (drm_sched_entity_is_ready(sched->single_entity))
+			return sched->single_entity;
+
+		return NULL;
+	}
+
 	/* Kernel run queue has higher priority than normal run queue*/
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
@@ -1091,6 +1127,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		return -EINVAL;
 
 	sched->ops = ops;
+	sched->single_entity = NULL;
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
 	sched->submit_wq = submit_wq ? : system_wq;
@@ -1103,7 +1140,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 		sched->sched_policy = default_drm_sched_policy;
 	else
 		sched->sched_policy = sched_policy;
-	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
+	for (i = DRM_SCHED_PRIORITY_MIN; sched_policy !=
+	     DRM_SCHED_POLICY_SINGLE_ENTITY && i < DRM_SCHED_PRIORITY_COUNT;
+	     i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
 	init_waitqueue_head(&sched->job_scheduled);
@@ -1135,7 +1174,15 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 
 	drm_sched_submit_stop(sched);
 
-	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+	if (sched->single_entity) {
+		spin_lock(&sched->single_entity->rq_lock);
+		sched->single_entity->stopped = true;
+		spin_unlock(&sched->single_entity->rq_lock);
+	}
+
+	for (i = DRM_SCHED_PRIORITY_COUNT - 1; sched->sched_policy !=
+	     DRM_SCHED_POLICY_SINGLE_ENTITY && i >= DRM_SCHED_PRIORITY_MIN;
+	     i--) {
 		struct drm_sched_rq *rq = &sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
@@ -1176,6 +1223,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
 	struct drm_sched_entity *entity;
 	struct drm_gpu_scheduler *sched = bad->sched;
 
+	WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
+
 	/* don't change @bad's karma if it's from KERNEL RQ,
 	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
 	 * corrupt but keep in mind that kernel jobs always considered good.
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 897d52a4ff4f..04eec2d7635f 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -79,6 +79,7 @@ enum drm_sched_policy {
 	DRM_SCHED_POLICY_DEFAULT,
 	DRM_SCHED_POLICY_RR,
 	DRM_SCHED_POLICY_FIFO,
+	DRM_SCHED_POLICY_SINGLE_ENTITY,
 	DRM_SCHED_POLICY_COUNT,
 };
 
@@ -112,6 +113,9 @@ struct drm_sched_entity {
 	 */
 	struct drm_sched_rq		*rq;
 
+	/** @single_sched: Single scheduler */
+	struct drm_gpu_scheduler	*single_sched;
+
 	/**
 	 * @sched_list:
 	 *
@@ -473,6 +477,7 @@ struct drm_sched_backend_ops {
  * struct drm_gpu_scheduler - scheduler instance-specific data
  *
  * @ops: backend operations provided by the driver.
+ * @single_entity: Single entity for the scheduler
  * @hw_submission_limit: the max size of the hardware queue.
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
@@ -503,6 +508,7 @@ struct drm_sched_backend_ops {
  */
 struct drm_gpu_scheduler {
 	const struct drm_sched_backend_ops	*ops;
+	struct drm_sched_entity		*single_entity;
 	uint32_t			hw_submission_limit;
 	long				timeout;
 	const char			*name;
@@ -585,6 +591,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 			  struct drm_gpu_scheduler **sched_list,
 			  unsigned int num_sched_list,
 			  atomic_t *guilty);
+struct drm_gpu_scheduler *
+drm_sched_entity_to_scheduler(struct drm_sched_entity *entity);
 long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout);
 void drm_sched_entity_fini(struct drm_sched_entity *entity);
 void drm_sched_entity_destroy(struct drm_sched_entity *entity);
-- 
2.34.1



* [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
                   ` (2 preceding siblings ...)
  2023-08-11  2:31 ` [PATCH v2 3/9] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-17 13:39   ` Christian König
                     ` (2 more replies)
  2023-08-11  2:31 ` [PATCH v2 5/9] drm/sched: Add generic scheduler message interface Matthew Brost
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

Rather than calling free_job and run_job in the same work item, have a
dedicated work item for each. This aligns with the design and intended use
of work queues.
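
One consequence, also noted in the cover letter, is that run_job and
free_job may now execute in parallel when the driver passes an unordered (or
NULL) submit workqueue. A driver that wants to keep the old serialization
can pass an ordered workqueue instead; a sketch under that assumption, with
my_sched_ops and the other values as placeholders:

/* An ordered workqueue runs one work item at a time, so work_run_job and
 * work_free_job cannot overlap for this scheduler.
 */
submit_wq = alloc_ordered_workqueue("my-sched-submit", 0);
if (!submit_wq)
	return -ENOMEM;

ret = drm_sched_init(&sched, &my_sched_ops, submit_wq, hw_submission, 0,
		     timeout, NULL, NULL, "my_ring",
		     DRM_SCHED_POLICY_DEFAULT, dev);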

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
 include/drm/gpu_scheduler.h            |   8 +-
 2 files changed, 106 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index cede47afc800..b67469eac179 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
  * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
  *
  * @rq: scheduler run queue to check.
+ * @dequeue: dequeue selected entity
  *
  * Try to find a ready entity, returns NULL if none found.
  */
 static struct drm_sched_entity *
-drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
+drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
 {
 	struct drm_sched_entity *entity;
 
@@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
 	if (entity) {
 		list_for_each_entry_continue(entity, &rq->entities, list) {
 			if (drm_sched_entity_is_ready(entity)) {
-				rq->current_entity = entity;
-				reinit_completion(&entity->entity_idle);
+				if (dequeue) {
+					rq->current_entity = entity;
+					reinit_completion(&entity->entity_idle);
+				}
 				spin_unlock(&rq->lock);
 				return entity;
 			}
@@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
 	list_for_each_entry(entity, &rq->entities, list) {
 
 		if (drm_sched_entity_is_ready(entity)) {
-			rq->current_entity = entity;
-			reinit_completion(&entity->entity_idle);
+			if (dequeue) {
+				rq->current_entity = entity;
+				reinit_completion(&entity->entity_idle);
+			}
 			spin_unlock(&rq->lock);
 			return entity;
 		}
@@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
  * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
  *
  * @rq: scheduler run queue to check.
+ * @dequeue: dequeue selected entity
  *
  * Find oldest waiting ready entity, returns NULL if none found.
  */
 static struct drm_sched_entity *
-drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
+drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
 {
 	struct rb_node *rb;
 
@@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 
 		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
 		if (drm_sched_entity_is_ready(entity)) {
-			rq->current_entity = entity;
-			reinit_completion(&entity->entity_idle);
+			if (dequeue) {
+				rq->current_entity = entity;
+				reinit_completion(&entity->entity_idle);
+			}
 			break;
 		}
 	}
@@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 }
 
 /**
- * drm_sched_submit_queue - scheduler queue submission
+ * drm_sched_run_job_queue - queue job submission
  * @sched: scheduler instance
  */
-static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
 {
 	if (!READ_ONCE(sched->pause_submit))
-		queue_work(sched->submit_wq, &sched->work_submit);
+		queue_work(sched->submit_wq, &sched->work_run_job);
+}
+
+static struct drm_sched_entity *
+drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
+
+/**
+ * drm_sched_run_job_queue_if_ready - queue job submission if ready
+ * @sched: scheduler instance
+ */
+static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
+{
+	if (drm_sched_select_entity(sched, false))
+		drm_sched_run_job_queue(sched);
+}
+
+/**
+ * drm_sched_free_job_queue - queue free job
+ *
+ * @sched: scheduler instance to queue free job
+ */
+static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
+{
+	if (!READ_ONCE(sched->pause_submit))
+		queue_work(sched->submit_wq, &sched->work_free_job);
+}
+
+/**
+ * drm_sched_free_job_queue_if_ready - queue free job if ready
+ *
+ * @sched: scheduler instance to queue free job
+ */
+static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_job *job;
+
+	spin_lock(&sched->job_list_lock);
+	job = list_first_entry_or_null(&sched->pending_list,
+				       struct drm_sched_job, list);
+	if (job && dma_fence_is_signaled(&job->s_fence->finished))
+		drm_sched_free_job_queue(sched);
+	spin_unlock(&sched->job_list_lock);
 }
 
 /**
@@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
 	dma_fence_get(&s_fence->finished);
 	drm_sched_fence_finished(s_fence, result);
 	dma_fence_put(&s_fence->finished);
-	drm_sched_submit_queue(sched);
+	drm_sched_free_job_queue(sched);
 }
 
 /**
@@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
 void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
 {
 	if (drm_sched_can_queue(sched))
-		drm_sched_submit_queue(sched);
+		drm_sched_run_job_queue(sched);
 }
 
 /**
  * drm_sched_select_entity - Select next entity to process
  *
  * @sched: scheduler instance
+ * @dequeue: dequeue selected entity
  *
  * Returns the entity to process or NULL if none are found.
  */
 static struct drm_sched_entity *
-drm_sched_select_entity(struct drm_gpu_scheduler *sched)
+drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
 {
 	struct drm_sched_entity *entity;
 	int i;
@@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 	/* Kernel run queue has higher priority than normal run queue*/
 	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
 		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
-			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
-			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
+			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
+							dequeue) :
+			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
+						      dequeue);
 		if (entity)
 			break;
 	}
@@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 EXPORT_SYMBOL(drm_sched_pick_best);
 
 /**
- * drm_sched_main - main scheduler thread
+ * drm_sched_free_job_work - worker to call free_job
  *
- * @param: scheduler instance
+ * @w: free job work
  */
-static void drm_sched_main(struct work_struct *w)
+static void drm_sched_free_job_work(struct work_struct *w)
 {
 	struct drm_gpu_scheduler *sched =
-		container_of(w, struct drm_gpu_scheduler, work_submit);
-	struct drm_sched_entity *entity;
+		container_of(w, struct drm_gpu_scheduler, work_free_job);
 	struct drm_sched_job *cleanup_job;
-	int r;
 
 	if (READ_ONCE(sched->pause_submit))
 		return;
 
 	cleanup_job = drm_sched_get_cleanup_job(sched);
-	entity = drm_sched_select_entity(sched);
+	if (cleanup_job) {
+		sched->ops->free_job(cleanup_job);
+
+		drm_sched_free_job_queue_if_ready(sched);
+		drm_sched_run_job_queue_if_ready(sched);
+	}
+}
 
-	if (!entity && !cleanup_job)
-		return;	/* No more work */
+/**
+ * drm_sched_run_job_work - worker to call run_job
+ *
+ * @w: run job work
+ */
+static void drm_sched_run_job_work(struct work_struct *w)
+{
+	struct drm_gpu_scheduler *sched =
+		container_of(w, struct drm_gpu_scheduler, work_run_job);
+	struct drm_sched_entity *entity;
+	int r;
 
-	if (cleanup_job)
-		sched->ops->free_job(cleanup_job);
+	if (READ_ONCE(sched->pause_submit))
+		return;
 
+	entity = drm_sched_select_entity(sched, true);
 	if (entity) {
 		struct dma_fence *fence;
 		struct drm_sched_fence *s_fence;
@@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
 		sched_job = drm_sched_entity_pop_job(entity);
 		if (!sched_job) {
 			complete_all(&entity->entity_idle);
-			if (!cleanup_job)
-				return;	/* No more work */
-			goto again;
+			return;	/* No more work */
 		}
 
 		s_fence = sched_job->s_fence;
@@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
 		}
 
 		wake_up(&sched->job_scheduled);
+		drm_sched_run_job_queue_if_ready(sched);
 	}
-
-again:
-	drm_sched_submit_queue(sched);
 }
 
 /**
@@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
-	INIT_WORK(&sched->work_submit, drm_sched_main);
+	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
+	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
 	sched->pause_submit = false;
@@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
 void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
 {
 	WRITE_ONCE(sched->pause_submit, true);
-	cancel_work_sync(&sched->work_submit);
+	cancel_work_sync(&sched->work_run_job);
+	cancel_work_sync(&sched->work_free_job);
 }
 EXPORT_SYMBOL(drm_sched_submit_stop);
 
@@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
 void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
 {
 	WRITE_ONCE(sched->pause_submit, false);
-	queue_work(sched->submit_wq, &sched->work_submit);
+	queue_work(sched->submit_wq, &sched->work_run_job);
+	queue_work(sched->submit_wq, &sched->work_free_job);
 }
 EXPORT_SYMBOL(drm_sched_submit_start);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 04eec2d7635f..fbc083a92757 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
  *                 finished.
  * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
- * @submit_wq: workqueue used to queue @work_submit
+ * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
  * @timeout_wq: workqueue used to queue @work_tdr
- * @work_submit: schedules jobs and cleans up entities
+ * @work_run_job: schedules jobs
+ * @work_free_job: cleans up jobs
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
  * @pending_list: the list of jobs which are currently in the job queue.
@@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
 	atomic64_t			job_id_count;
 	struct workqueue_struct		*submit_wq;
 	struct workqueue_struct		*timeout_wq;
-	struct work_struct		work_submit;
+	struct work_struct		work_run_job;
+	struct work_struct		work_free_job;
 	struct delayed_work		work_tdr;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v2 5/9] drm/sched: Add generic scheduler message interface
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
                   ` (3 preceding siblings ...)
  2023-08-11  2:31 ` [PATCH v2 4/9] drm/sched: Split free_job into own work item Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-11  2:31 ` [PATCH v2 6/9] drm/sched: Add drm_sched_start_timeout_unlocked helper Matthew Brost
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

Add a generic scheduler message interface which sends messages to the
backend from the drm_gpu_scheduler main submission work queue. The idea
is that some of these messages modify state in drm_sched_entity which is
also modified during submission. By scheduling these messages and
submission on the same work queue there is no race when changing state
in drm_sched_entity.

This interface will be used in Xe, the new Intel GPU driver, to clean
up, suspend, resume, and change scheduling properties of a
drm_sched_entity.

The interface is designed to be generic and extensible, with only the
backend understanding the messages.
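
As a rough illustration (all my_* names below are invented for this
sketch, not taken from any actual driver, and my_exec_queue is assumed
to embed a drm_gpu_scheduler as 'sched'), a backend could use the
interface roughly like this:

/* Hypothetical backend usage, illustration only. */
enum my_msg_opcode {
	MY_MSG_SUSPEND,
	MY_MSG_RESUME,
};

static void my_process_msg(struct drm_sched_msg *msg)
{
	struct my_exec_queue *q = msg->private_data;

	/* Processed from the scheduler's submission work queue, so it
	 * does not race with job submission when touching entity state.
	 */
	switch (msg->opcode) {
	case MY_MSG_SUSPEND:
		my_exec_queue_suspend(q);
		break;
	case MY_MSG_RESUME:
		my_exec_queue_resume(q);
		break;
	}
	kfree(msg);	/* process_msg frees dynamically allocated messages */
}

static int my_exec_queue_suspend_async(struct my_exec_queue *q)
{
	struct drm_sched_msg *msg = kzalloc(sizeof(*msg), GFP_KERNEL);

	if (!msg)
		return -ENOMEM;

	msg->opcode = MY_MSG_SUSPEND;
	msg->private_data = q;
	drm_sched_add_msg(&q->sched, msg);

	return 0;
}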

v2:
 - (Christian) Use dedicated work item

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 98 ++++++++++++++++++++++++++
 include/drm/gpu_scheduler.h            | 34 ++++++++-
 2 files changed, 131 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index b67469eac179..fbd99f7e5b4a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -340,6 +340,35 @@ static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
 	spin_unlock(&sched->job_list_lock);
 }
 
+/**
+ * drm_sched_process_msg_queue - queue process msg worker
+ *
+ * @sched: scheduler instance to queue process_msg worker
+ */
+static void drm_sched_process_msg_queue(struct drm_gpu_scheduler *sched)
+{
+	if (!READ_ONCE(sched->pause_submit))
+		queue_work(sched->submit_wq, &sched->work_process_msg);
+}
+
+/**
+ * drm_sched_process_msg_queue_if_ready - queue process msg worker if ready
+ *
+ * @sched: scheduler instance to queue process_msg worker
+ */
+static void
+drm_sched_process_msg_queue_if_ready(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_msg *msg;
+
+	spin_lock(&sched->job_list_lock);
+	msg = list_first_entry_or_null(&sched->msgs,
+				       struct drm_sched_msg, link);
+	if (msg)
+		drm_sched_process_msg_queue(sched);
+	spin_unlock(&sched->job_list_lock);
+}
+
 /**
  * drm_sched_job_done - complete a job
  * @s_job: pointer to the job which is done
@@ -1075,6 +1104,71 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
+/**
+ * drm_sched_add_msg - add scheduler message
+ *
+ * @sched: scheduler instance
+ * @msg: message to be added
+ *
+ * Can and will pass any jobs waiting on dependencies or in a runnable queue.
+ * Message processing will stop if the scheduler run wq is stopped and resume
+ * when the run wq is started.
+ */
+void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
+		       struct drm_sched_msg *msg)
+{
+	spin_lock(&sched->job_list_lock);
+	list_add_tail(&msg->link, &sched->msgs);
+	spin_unlock(&sched->job_list_lock);
+
+	drm_sched_process_msg_queue(sched);
+}
+EXPORT_SYMBOL(drm_sched_add_msg);
+
+/**
+ * drm_sched_get_msg - get scheduler message
+ *
+ * @sched: scheduler instance
+ *
+ * Returns NULL or message
+ */
+static struct drm_sched_msg *
+drm_sched_get_msg(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_msg *msg;
+
+	spin_lock(&sched->job_list_lock);
+	msg = list_first_entry_or_null(&sched->msgs,
+				       struct drm_sched_msg, link);
+	if (msg)
+		list_del(&msg->link);
+	spin_unlock(&sched->job_list_lock);
+
+	return msg;
+}
+
+/**
+ * drm_sched_process_msg_work - worker to call process_msg
+ *
+ * @w: process msg work
+ */
+static void drm_sched_process_msg_work(struct work_struct *w)
+{
+	struct drm_gpu_scheduler *sched =
+		container_of(w, struct drm_gpu_scheduler, work_process_msg);
+	struct drm_sched_msg *msg;
+
+	if (READ_ONCE(sched->pause_submit))
+		return;
+
+	msg = drm_sched_get_msg(sched);
+	if (msg) {
+		sched->ops->process_msg(msg);
+
+		drm_sched_process_msg_queue_if_ready(sched);
+	}
+}
+
 /**
  * drm_sched_free_job_work - worker to call free_job
  *
@@ -1209,11 +1303,13 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
+	INIT_LIST_HEAD(&sched->msgs);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
 	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
 	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
+	INIT_WORK(&sched->work_process_msg, drm_sched_process_msg_work);
 	atomic_set(&sched->_score, 0);
 	atomic64_set(&sched->job_id_count, 0);
 	sched->pause_submit = false;
@@ -1340,6 +1436,7 @@ void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
 	WRITE_ONCE(sched->pause_submit, true);
 	cancel_work_sync(&sched->work_run_job);
 	cancel_work_sync(&sched->work_free_job);
+	cancel_work_sync(&sched->work_process_msg);
 }
 EXPORT_SYMBOL(drm_sched_submit_stop);
 
@@ -1353,5 +1450,6 @@ void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
 	WRITE_ONCE(sched->pause_submit, false);
 	queue_work(sched->submit_wq, &sched->work_run_job);
 	queue_work(sched->submit_wq, &sched->work_free_job);
+	queue_work(sched->submit_wq, &sched->work_process_msg);
 }
 EXPORT_SYMBOL(drm_sched_submit_start);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index fbc083a92757..5d753ecb5d71 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -394,6 +394,23 @@ enum drm_gpu_sched_stat {
 	DRM_GPU_SCHED_STAT_ENODEV,
 };
 
+/**
+ * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue)
+ * message
+ *
+ * Generic enough for backend defined messages, backend can expand if needed.
+ */
+struct drm_sched_msg {
+	/** @link: list link into the gpu scheduler list of messages */
+	struct list_head		link;
+	/**
+	 * @private_data: opaque pointer to message private data (backend defined)
+	 */
+	void				*private_data;
+	/** @opcode: opcode of message (backend defined) */
+	unsigned int			opcode;
+};
+
 /**
  * struct drm_sched_backend_ops - Define the backend operations
  *	called by the scheduler
@@ -471,6 +488,12 @@ struct drm_sched_backend_ops {
          * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @process_msg: Process a message. Allowed to block, it is this
+	 * function's responsibility to free message if dynamically allocated.
+	 */
+	void (*process_msg)(struct drm_sched_msg *msg);
 };
 
 /**
@@ -482,15 +505,18 @@ struct drm_sched_backend_ops {
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @sched_rq: priority wise array of run queues.
+ * @msgs: list of messages to be processed in @work_process_msg
  * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler
  *                 waits on this wait queue until all the scheduled jobs are
  *                 finished.
  * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
- * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
+ * @submit_wq: workqueue used to queue @work_run_job, @work_free_job, and
+ *             @work_process_msg
  * @timeout_wq: workqueue used to queue @work_tdr
  * @work_run_job: schedules jobs
  * @work_free_job: cleans up jobs
+ * @work_process_msg: processes messages
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
  * @pending_list: the list of jobs which are currently in the job queue.
@@ -502,6 +528,8 @@ struct drm_sched_backend_ops {
  * @sched_policy: Schedule policy for scheduler
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
+ * @pause_submit: pause queuing of @work_run_job, @work_free_job, and
+ *                @work_process_msg on @submit_wq
  * @pause_submit: pause queuing of @work_submit on @submit_wq
  * @dev: system &struct device
  *
@@ -514,6 +542,7 @@ struct drm_gpu_scheduler {
 	long				timeout;
 	const char			*name;
 	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
+	struct list_head		msgs;
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
@@ -521,6 +550,7 @@ struct drm_gpu_scheduler {
 	struct workqueue_struct		*timeout_wq;
 	struct work_struct		work_run_job;
 	struct work_struct		work_free_job;
+	struct work_struct		work_process_msg;
 	struct delayed_work		work_tdr;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
@@ -568,6 +598,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
+void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
+		       struct drm_sched_msg *msg);
 bool drm_sched_submit_ready(struct drm_gpu_scheduler *sched);
 void drm_sched_submit_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_submit_start(struct drm_gpu_scheduler *sched);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v2 6/9] drm/sched: Add drm_sched_start_timeout_unlocked helper
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
                   ` (4 preceding siblings ...)
  2023-08-11  2:31 ` [PATCH v2 5/9] drm/sched: Add generic scheduler message interface Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-11  2:31 ` [PATCH v2 7/9] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

Add a drm_sched_start_timeout_unlocked() helper which takes the
job_list_lock around drm_sched_start_timeout() and use it to replace the
open-coded lock/start/unlock sequences. Also add a lockdep assert to
drm_sched_start_timeout().

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index fbd99f7e5b4a..d5f6d86985c5 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -412,11 +412,20 @@ static void drm_sched_job_done_cb(struct dma_fence *f, struct dma_fence_cb *cb)
  */
 static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
 {
+	lockdep_assert_held(&sched->job_list_lock);
+
 	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
 	    !list_empty(&sched->pending_list))
 		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
 }
 
+static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
+{
+	spin_lock(&sched->job_list_lock);
+	drm_sched_start_timeout(sched);
+	spin_unlock(&sched->job_list_lock);
+}
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
@@ -529,11 +538,8 @@ static void drm_sched_job_timedout(struct work_struct *work)
 		spin_unlock(&sched->job_list_lock);
 	}
 
-	if (status != DRM_GPU_SCHED_STAT_ENODEV) {
-		spin_lock(&sched->job_list_lock);
-		drm_sched_start_timeout(sched);
-		spin_unlock(&sched->job_list_lock);
-	}
+	if (status != DRM_GPU_SCHED_STAT_ENODEV)
+		drm_sched_start_timeout_unlocked(sched);
 }
 
 /**
@@ -659,11 +665,8 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			drm_sched_job_done(s_job, -ECANCELED);
 	}
 
-	if (full_recovery) {
-		spin_lock(&sched->job_list_lock);
-		drm_sched_start_timeout(sched);
-		spin_unlock(&sched->job_list_lock);
-	}
+	if (full_recovery)
+		drm_sched_start_timeout_unlocked(sched);
 
 	drm_sched_submit_start(sched);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v2 7/9] drm/sched: Start run wq before TDR in drm_sched_start
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
                   ` (5 preceding siblings ...)
  2023-08-11  2:31 ` [PATCH v2 6/9] drm/sched: Add drm_sched_start_timeout_unlocked helper Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-11  2:31 ` [PATCH v2 8/9] drm/sched: Submit job before starting TDR Matthew Brost
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

If the TDR is set to a very small value it can fire before the run wq is
started in drm_sched_start. The run wq is expected to be running when
the TDR fires; fix the ordering so this expectation is always met.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index d5f6d86985c5..0e7d9e227a6a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -665,10 +665,10 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			drm_sched_job_done(s_job, -ECANCELED);
 	}
 
+	drm_sched_submit_start(sched);
+
 	if (full_recovery)
 		drm_sched_start_timeout_unlocked(sched);
-
-	drm_sched_submit_start(sched);
 }
 EXPORT_SYMBOL(drm_sched_start);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v2 8/9] drm/sched: Submit job before starting TDR
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
                   ` (6 preceding siblings ...)
  2023-08-11  2:31 ` [PATCH v2 7/9] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-11  2:31 ` [PATCH v2 9/9] drm/sched: Add helper to set TDR timeout Matthew Brost
  2023-08-24  0:08 ` [PATCH v2 0/9] DRM scheduler changes for Xe Danilo Krummrich
  9 siblings, 0 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

If the TDR is set to a small enough value, it can fire before a job is
submitted in drm_sched_run_job_work. The job should always be submitted
before the TDR fires; fix this ordering.

v2:
  - Add to pending list before run_job, start TDR after (Luben, Boris)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 0e7d9e227a6a..6aa3a35f55dc 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -498,7 +498,6 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
 
 	spin_lock(&sched->job_list_lock);
 	list_add_tail(&s_job->list, &sched->pending_list);
-	drm_sched_start_timeout(sched);
 	spin_unlock(&sched->job_list_lock);
 }
 
@@ -1231,6 +1230,7 @@ static void drm_sched_run_job_work(struct work_struct *w)
 		fence = sched->ops->run_job(sched_job);
 		complete_all(&entity->entity_idle);
 		drm_sched_fence_scheduled(s_fence, fence);
+		drm_sched_start_timeout_unlocked(sched);
 
 		if (!IS_ERR_OR_NULL(fence)) {
 			/* Drop for original kref_init of the fence */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH v2 9/9] drm/sched: Add helper to set TDR timeout
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
                   ` (7 preceding siblings ...)
  2023-08-11  2:31 ` [PATCH v2 8/9] drm/sched: Submit job before starting TDR Matthew Brost
@ 2023-08-11  2:31 ` Matthew Brost
  2023-08-24  0:08 ` [PATCH v2 0/9] DRM scheduler changes for Xe Danilo Krummrich
  9 siblings, 0 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-11  2:31 UTC (permalink / raw)
  To: dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, Matthew Brost, sarah.walker,
	ketil.johnsen, Liviu.Dudau, luben.tuikov, lina, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

Add a helper to set the TDR timeout and restart the TDR with the new
timeout value. This will be used in Xe, the new Intel GPU driver, to
trigger the TDR to clean up a drm_sched_entity that encounters errors.
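
For example (hypothetical driver snippet, assuming the driver embeds a
drm_gpu_scheduler in its queue object as q->sched):

	/* Shrink the timeout so the TDR fires promptly and cleans up an
	 * entity that encountered an error.
	 */
	drm_sched_set_timeout(&q->sched, msecs_to_jiffies(1));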

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 19 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 6aa3a35f55dc..67e0fb6e7d18 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched)
 	spin_unlock(&sched->job_list_lock);
 }
 
+/**
+ * drm_sched_set_timeout - set timeout for reset worker
+ *
+ * @sched: scheduler instance to set and (re)-start the worker for
+ * @timeout: timeout period
+ *
+ * Set and (re)-start the timeout for the given scheduler.
+ */
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout)
+{
+	spin_lock(&sched->job_list_lock);
+	sched->timeout = timeout;
+	cancel_delayed_work(&sched->work_tdr);
+	drm_sched_start_timeout(sched);
+	spin_unlock(&sched->job_list_lock);
+}
+EXPORT_SYMBOL(drm_sched_set_timeout);
+
 /**
  * drm_sched_fault - immediately start timeout handler
  *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5d753ecb5d71..b7b818cd81b6 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -596,6 +596,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 				    struct drm_gpu_scheduler **sched_list,
                                    unsigned int num_sched_list);
 
+void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched);
 void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 2/9] drm/sched: Move schedule policy to scheduler / entity
  2023-08-11  2:31 ` [PATCH v2 2/9] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
@ 2023-08-11 21:43   ` Maira Canal
  2023-08-12  3:20     ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Maira Canal @ 2023-08-11 21:43 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, luben.tuikov, donald.robson, boris.brezillon,
	christian.koenig, faith.ekstrand

Hi Matthew,

I'm not sure in which tree you plan to apply this series, but if you
plan to apply it on drm-misc-next, it would be nice to rebase on top of
it. It would make it easier for driver maintainers to review it.

Apart from the small nit below, I tested the Xe tree on v3d and things
seem to be running smoothly.

On 8/10/23 23:31, Matthew Brost wrote:
> Rather than a global modparam for scheduling policy, move the scheduling
> policy to scheduler / entity so user can control each scheduler / entity
> policy.
> 
> v2:
>    - s/DRM_SCHED_POLICY_MAX/DRM_SCHED_POLICY_COUNT (Luben)
>    - Only include policy in scheduler (Luben)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
>   drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  3 ++-
>   drivers/gpu/drm/lima/lima_sched.c          |  3 ++-
>   drivers/gpu/drm/msm/msm_ringbuffer.c       |  3 ++-
>   drivers/gpu/drm/nouveau/nouveau_sched.c    |  3 ++-
>   drivers/gpu/drm/panfrost/panfrost_job.c    |  3 ++-
>   drivers/gpu/drm/scheduler/sched_entity.c   | 24 ++++++++++++++++++----
>   drivers/gpu/drm/scheduler/sched_main.c     | 23 +++++++++++++++------
>   drivers/gpu/drm/v3d/v3d_sched.c            | 15 +++++++++-----
>   include/drm/gpu_scheduler.h                | 20 ++++++++++++------
>   10 files changed, 72 insertions(+), 26 deletions(-)
> 

[...]

>   
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index 38e092ea41e6..5e3fe77fa991 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -391,7 +391,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>   			     &v3d_bin_sched_ops, NULL,
>   			     hw_jobs_limit, job_hang_limit,
>   			     msecs_to_jiffies(hang_limit_ms), NULL,
> -			     NULL, "v3d_bin", v3d->drm.dev);
> +			     NULL, "v3d_bin", DRM_SCHED_POLICY_DEFAULT,
> +			     v3d->drm.dev);
>   	if (ret)
>   		return ret;
>   
> @@ -399,7 +400,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>   			     &v3d_render_sched_ops, NULL,
>   			     hw_jobs_limit, job_hang_limit,
>   			     msecs_to_jiffies(hang_limit_ms), NULL,
> -			     NULL, "v3d_render", v3d->drm.dev);
> +			     ULL, "v3d_render", DRM_SCHED_POLICY_DEFAULT,

Small nit: s/ULL/NULL

Best Regards,
- Maíra

> +			     v3d->drm.dev);
>   	if (ret)
>   		goto fail;
>   
> @@ -407,7 +409,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>   			     &v3d_tfu_sched_ops, NULL,
>   			     hw_jobs_limit, job_hang_limit,
>   			     msecs_to_jiffies(hang_limit_ms), NULL,
> -			     NULL, "v3d_tfu", v3d->drm.dev);
> +			     NULL, "v3d_tfu", DRM_SCHED_POLICY_DEFAULT,
> +			     v3d->drm.dev);
>   	if (ret)
>   		goto fail;
>   
> @@ -416,7 +419,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>   				     &v3d_csd_sched_ops, NULL,
>   				     hw_jobs_limit, job_hang_limit,
>   				     msecs_to_jiffies(hang_limit_ms), NULL,
> -				     NULL, "v3d_csd", v3d->drm.dev);
> +				     NULL, "v3d_csd", DRM_SCHED_POLICY_DEFAULT,
> +				     v3d->drm.dev);
>   		if (ret)
>   			goto fail;
>   
> @@ -424,7 +428,8 @@ v3d_sched_init(struct v3d_dev *v3d)
>   				     &v3d_cache_clean_sched_ops, NULL,
>   				     hw_jobs_limit, job_hang_limit,
>   				     msecs_to_jiffies(hang_limit_ms), NULL,
> -				     NULL, "v3d_cache_clean", v3d->drm.dev);
> +				     NULL, "v3d_cache_clean",
> +				     DRM_SCHED_POLICY_DEFAULT, v3d->drm.dev);
>   		if (ret)
>   			goto fail;
>   	}
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 278106e358a8..897d52a4ff4f 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -72,11 +72,15 @@ enum drm_sched_priority {
>   	DRM_SCHED_PRIORITY_UNSET = -2
>   };
>   
> -/* Used to chose between FIFO and RR jobs scheduling */
> -extern int drm_sched_policy;
> -
> -#define DRM_SCHED_POLICY_RR    0
> -#define DRM_SCHED_POLICY_FIFO  1
> +/* Used to chose default scheduling policy*/
> +extern int default_drm_sched_policy;
> +
> +enum drm_sched_policy {
> +	DRM_SCHED_POLICY_DEFAULT,
> +	DRM_SCHED_POLICY_RR,
> +	DRM_SCHED_POLICY_FIFO,
> +	DRM_SCHED_POLICY_COUNT,
> +};
>   
>   /**
>    * struct drm_sched_entity - A wrapper around a job queue (typically
> @@ -489,6 +493,7 @@ struct drm_sched_backend_ops {
>    *              guilty and it will no longer be considered for scheduling.
>    * @score: score to help loadbalancer pick a idle sched
>    * @_score: score used when the driver doesn't provide one
> + * @sched_policy: Schedule policy for scheduler
>    * @ready: marks if the underlying HW is ready to work
>    * @free_guilty: A hit to time out handler to free the guilty job.
>    * @pause_submit: pause queuing of @work_submit on @submit_wq
> @@ -514,6 +519,7 @@ struct drm_gpu_scheduler {
>   	int				hang_limit;
>   	atomic_t                        *score;
>   	atomic_t                        _score;
> +	enum drm_sched_policy		sched_policy;
>   	bool				ready;
>   	bool				free_guilty;
>   	bool				pause_submit;
> @@ -525,7 +531,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   		   struct workqueue_struct *submit_wq,
>   		   uint32_t hw_submission, unsigned hang_limit,
>   		   long timeout, struct workqueue_struct *timeout_wq,
> -		   atomic_t *score, const char *name, struct device *dev);
> +		   atomic_t *score, const char *name,
> +		   enum drm_sched_policy sched_policy,
> +		   struct device *dev);
>   
>   void drm_sched_fini(struct drm_gpu_scheduler *sched);
>   int drm_sched_job_init(struct drm_sched_job *job,

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 2/9] drm/sched: Move schedule policy to scheduler / entity
  2023-08-11 21:43   ` Maira Canal
@ 2023-08-12  3:20     ` Matthew Brost
  0 siblings, 0 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-12  3:20 UTC (permalink / raw)
  To: Maira Canal
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, christian.koenig, luben.tuikov,
	donald.robson, boris.brezillon, intel-xe, faith.ekstrand

On Fri, Aug 11, 2023 at 06:43:22PM -0300, Maira Canal wrote:
> Hi Matthew,
> 
> I'm not sure in which tree you plan to apply this series, but if you
> plan to apply it on drm-misc-next, it would be nice to rebase on top of
> it. It would make it easier for driver maintainers to review it.
> 

I rebased this on drm-tip but forgot the first patch in the series.

Let me make sure I get this correct and I will send a rev3 early next
week.

> Apart from the small nit below it, I tested the Xe tree on v3d and things
> seems to be running smoothly.
> 
> On 8/10/23 23:31, Matthew Brost wrote:
> > Rather than a global modparam for scheduling policy, move the scheduling
> > policy to scheduler / entity so user can control each scheduler / entity
> > policy.
> > 
> > v2:
> >    - s/DRM_SCHED_POLICY_MAX/DRM_SCHED_POLICY_COUNT (Luben)
> >    - Only include policy in scheduler (Luben)
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  3 ++-
> >   drivers/gpu/drm/lima/lima_sched.c          |  3 ++-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c       |  3 ++-
> >   drivers/gpu/drm/nouveau/nouveau_sched.c    |  3 ++-
> >   drivers/gpu/drm/panfrost/panfrost_job.c    |  3 ++-
> >   drivers/gpu/drm/scheduler/sched_entity.c   | 24 ++++++++++++++++++----
> >   drivers/gpu/drm/scheduler/sched_main.c     | 23 +++++++++++++++------
> >   drivers/gpu/drm/v3d/v3d_sched.c            | 15 +++++++++-----
> >   include/drm/gpu_scheduler.h                | 20 ++++++++++++------
> >   10 files changed, 72 insertions(+), 26 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> > index 38e092ea41e6..5e3fe77fa991 100644
> > --- a/drivers/gpu/drm/v3d/v3d_sched.c
> > +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> > @@ -391,7 +391,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >   			     &v3d_bin_sched_ops, NULL,
> >   			     hw_jobs_limit, job_hang_limit,
> >   			     msecs_to_jiffies(hang_limit_ms), NULL,
> > -			     NULL, "v3d_bin", v3d->drm.dev);
> > +			     NULL, "v3d_bin", DRM_SCHED_POLICY_DEFAULT,
> > +			     v3d->drm.dev);
> >   	if (ret)
> >   		return ret;
> > @@ -399,7 +400,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >   			     &v3d_render_sched_ops, NULL,
> >   			     hw_jobs_limit, job_hang_limit,
> >   			     msecs_to_jiffies(hang_limit_ms), NULL,
> > -			     NULL, "v3d_render", v3d->drm.dev);
> > +			     ULL, "v3d_render", DRM_SCHED_POLICY_DEFAULT,
> 
> Small nit: s/ULL/NULL
> 

Yep, will fix.

Matt

> Best Regards,
> - Maíra
> 
> > +			     v3d->drm.dev);
> >   	if (ret)
> >   		goto fail;
> > @@ -407,7 +409,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >   			     &v3d_tfu_sched_ops, NULL,
> >   			     hw_jobs_limit, job_hang_limit,
> >   			     msecs_to_jiffies(hang_limit_ms), NULL,
> > -			     NULL, "v3d_tfu", v3d->drm.dev);
> > +			     NULL, "v3d_tfu", DRM_SCHED_POLICY_DEFAULT,
> > +			     v3d->drm.dev);
> >   	if (ret)
> >   		goto fail;
> > @@ -416,7 +419,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >   				     &v3d_csd_sched_ops, NULL,
> >   				     hw_jobs_limit, job_hang_limit,
> >   				     msecs_to_jiffies(hang_limit_ms), NULL,
> > -				     NULL, "v3d_csd", v3d->drm.dev);
> > +				     NULL, "v3d_csd", DRM_SCHED_POLICY_DEFAULT,
> > +				     v3d->drm.dev);
> >   		if (ret)
> >   			goto fail;
> > @@ -424,7 +428,8 @@ v3d_sched_init(struct v3d_dev *v3d)
> >   				     &v3d_cache_clean_sched_ops, NULL,
> >   				     hw_jobs_limit, job_hang_limit,
> >   				     msecs_to_jiffies(hang_limit_ms), NULL,
> > -				     NULL, "v3d_cache_clean", v3d->drm.dev);
> > +				     NULL, "v3d_cache_clean",
> > +				     DRM_SCHED_POLICY_DEFAULT, v3d->drm.dev);
> >   		if (ret)
> >   			goto fail;
> >   	}
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 278106e358a8..897d52a4ff4f 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -72,11 +72,15 @@ enum drm_sched_priority {
> >   	DRM_SCHED_PRIORITY_UNSET = -2
> >   };
> > -/* Used to chose between FIFO and RR jobs scheduling */
> > -extern int drm_sched_policy;
> > -
> > -#define DRM_SCHED_POLICY_RR    0
> > -#define DRM_SCHED_POLICY_FIFO  1
> > +/* Used to chose default scheduling policy*/
> > +extern int default_drm_sched_policy;
> > +
> > +enum drm_sched_policy {
> > +	DRM_SCHED_POLICY_DEFAULT,
> > +	DRM_SCHED_POLICY_RR,
> > +	DRM_SCHED_POLICY_FIFO,
> > +	DRM_SCHED_POLICY_COUNT,
> > +};
> >   /**
> >    * struct drm_sched_entity - A wrapper around a job queue (typically
> > @@ -489,6 +493,7 @@ struct drm_sched_backend_ops {
> >    *              guilty and it will no longer be considered for scheduling.
> >    * @score: score to help loadbalancer pick a idle sched
> >    * @_score: score used when the driver doesn't provide one
> > + * @sched_policy: Schedule policy for scheduler
> >    * @ready: marks if the underlying HW is ready to work
> >    * @free_guilty: A hit to time out handler to free the guilty job.
> >    * @pause_submit: pause queuing of @work_submit on @submit_wq
> > @@ -514,6 +519,7 @@ struct drm_gpu_scheduler {
> >   	int				hang_limit;
> >   	atomic_t                        *score;
> >   	atomic_t                        _score;
> > +	enum drm_sched_policy		sched_policy;
> >   	bool				ready;
> >   	bool				free_guilty;
> >   	bool				pause_submit;
> > @@ -525,7 +531,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >   		   struct workqueue_struct *submit_wq,
> >   		   uint32_t hw_submission, unsigned hang_limit,
> >   		   long timeout, struct workqueue_struct *timeout_wq,
> > -		   atomic_t *score, const char *name, struct device *dev);
> > +		   atomic_t *score, const char *name,
> > +		   enum drm_sched_policy sched_policy,
> > +		   struct device *dev);
> >   void drm_sched_fini(struct drm_gpu_scheduler *sched);
> >   int drm_sched_job_init(struct drm_sched_job *job,

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-11  2:31 ` [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
@ 2023-08-16 11:30   ` Danilo Krummrich
  2023-08-16 14:05     ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-16 11:30 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

Hi Matt,

On 8/11/23 04:31, Matthew Brost wrote:
> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> seems a bit odd but let us explain the reasoning below.
> 
> 1. In XE the submission order from multiple drm_sched_entity is not
> guaranteed to be the same completion even if targeting the same hardware
> engine. This is because in XE we have a firmware scheduler, the GuC,
> which allowed to reorder, timeslice, and preempt submissions. If a using
> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> apart as the TDR expects submission order == completion order. Using a
> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
> 
> 2. In XE submissions are done via programming a ring buffer (circular
> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
> control on the ring for free.

In XE, where does the limitation of MAX_SIZE_PER_JOB come from?

In Nouveau we currently do have such a limitation as well, but it is 
derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would 
always be 1. However, I think most jobs won't actually utilize the whole 
ring.

Given that, it seems like it would be better to let the scheduler keep
track of empty ring "slots" instead, such that the scheduler can decide
whether a subsequent job will still fit on the ring and, if not,
re-evaluate once a previous job has finished. Of course each submitted
job would be required to carry the number of slots it requires on the
ring.

What do you think of implementing this as an alternative flow control
mechanism? Implementation-wise this could be a union with the existing
hw_submission_limit.
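
A very rough sketch of what I mean (all field and helper names below are
invented for illustration, this is not actual scheduler code):

static bool drm_sched_job_fits(struct drm_gpu_scheduler *sched,
			       struct drm_sched_job *job)
{
	/* job->ring_slots: slots this job needs, filled in by the driver.
	 * sched->ring_slots_free: currently free slots, reserved before
	 * run_job() and given back from the free_job() path.
	 */
	return job->ring_slots <= atomic_read(&sched->ring_slots_free);
}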

- Danilo

> 
> A problem with this design is currently a drm_gpu_scheduler uses a
> kthread for submission / job cleanup. This doesn't scale if a large
> number of drm_gpu_scheduler are used. To work around the scaling issue,
> use a worker rather than kthread for submission / job cleanup.
> 
> v2:
>    - (Rob Clark) Fix msm build
>    - Pass in run work queue
> v3:
>    - (Boris) don't have loop in worker
> v4:
>    - (Tvrtko) break out submit ready, stop, start helpers into own patch
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-16 14:05     ` Christian König
@ 2023-08-16 12:30       ` Danilo Krummrich
  2023-08-16 14:38         ` Matthew Brost
  2023-08-16 14:59         ` Christian König
  0 siblings, 2 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-16 12:30 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 8/16/23 16:05, Christian König wrote:
> Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
>> Hi Matt,
>>
>> On 8/11/23 04:31, Matthew Brost wrote:
>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>>> seems a bit odd but let us explain the reasoning below.
>>>
>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>> guaranteed to be the same completion even if targeting the same hardware
>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>> which allowed to reorder, timeslice, and preempt submissions. If a using
>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
>>> apart as the TDR expects submission order == completion order. Using a
>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>>
>>> 2. In XE submissions are done via programming a ring buffer (circular
>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
>>> control on the ring for free.
>>
>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>>
>> In Nouveau we currently do have such a limitation as well, but it is 
>> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would 
>> always be 1. However, I think most jobs won't actually utilize the 
>> whole ring.
> 
> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = 
> hw_submission_limit (or even hw_submission_limit - 1 when the hw can't 
> distinct full vs empty ring buffer).

Not sure if I get you right, let me try to clarify what I was trying to 
say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really 
limited by anything other than the RING_SIZE and hence we'd never allow 
more than 1 active job.

However, it seems to be more efficient to base ring flow control on the 
actual size of each incoming job rather than the worst case, namely the 
maximum size of a job.

> 
> Otherwise your scheduler might just overwrite the ring buffer by pushing 
> things to fast.
> 
> Christian.
> 
>>
>> Given that, it seems like it would be better to let the scheduler keep 
>> track of empty ring "slots" instead, such that the scheduler can 
>> deceide whether a subsequent job will still fit on the ring and if not 
>> re-evaluate once a previous job finished. Of course each submitted job 
>> would be required to carry the number of slots it requires on the ring.
>>
>> What to you think of implementing this as alternative flow control 
>> mechanism? Implementation wise this could be a union with the existing 
>> hw_submission_limit.
>>
>> - Danilo
>>
>>>
>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>> kthread for submission / job cleanup. This doesn't scale if a large
>>> number of drm_gpu_scheduler are used. To work around the scaling issue,
>>> use a worker rather than kthread for submission / job cleanup.
>>>
>>> v2:
>>>    - (Rob Clark) Fix msm build
>>>    - Pass in run work queue
>>> v3:
>>>    - (Boris) don't have loop in worker
>>> v4:
>>>    - (Tvrtko) break out submit ready, stop, start helpers into own patch
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-16 11:30   ` Danilo Krummrich
@ 2023-08-16 14:05     ` Christian König
  2023-08-16 12:30       ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-16 14:05 UTC (permalink / raw)
  To: Danilo Krummrich, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
> Hi Matt,
>
> On 8/11/23 04:31, Matthew Brost wrote:
>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>> seems a bit odd but let us explain the reasoning below.
>>
>> 1. In XE the submission order from multiple drm_sched_entity is not
>> guaranteed to be the same completion even if targeting the same hardware
>> engine. This is because in XE we have a firmware scheduler, the GuC,
>> which allowed to reorder, timeslice, and preempt submissions. If a using
>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
>> apart as the TDR expects submission order == completion order. Using a
>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>
>> 2. In XE submissions are done via programming a ring buffer (circular
>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
>> control on the ring for free.
>
> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>
> In Nouveau we currently do have such a limitation as well, but it is 
> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would 
> always be 1. However, I think most jobs won't actually utilize the 
> whole ring.

Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB =
hw_submission_limit (or even hw_submission_limit - 1 when the hw can't
distinguish a full from an empty ring buffer).

Otherwise your scheduler might just overwrite the ring buffer by pushing
things too fast.

Christian.

>
> Given that, it seems like it would be better to let the scheduler keep 
> track of empty ring "slots" instead, such that the scheduler can 
> deceide whether a subsequent job will still fit on the ring and if not 
> re-evaluate once a previous job finished. Of course each submitted job 
> would be required to carry the number of slots it requires on the ring.
>
> What to you think of implementing this as alternative flow control 
> mechanism? Implementation wise this could be a union with the existing 
> hw_submission_limit.
>
> - Danilo
>
>>
>> A problem with this design is currently a drm_gpu_scheduler uses a
>> kthread for submission / job cleanup. This doesn't scale if a large
>> number of drm_gpu_scheduler are used. To work around the scaling issue,
>> use a worker rather than kthread for submission / job cleanup.
>>
>> v2:
>>    - (Rob Clark) Fix msm build
>>    - Pass in run work queue
>> v3:
>>    - (Boris) don't have loop in worker
>> v4:
>>    - (Tvrtko) break out submit ready, stop, start helpers into own patch
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-16 12:30       ` Danilo Krummrich
@ 2023-08-16 14:38         ` Matthew Brost
  2023-08-16 15:40           ` Danilo Krummrich
  2023-08-16 14:59         ` Christian König
  1 sibling, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-16 14:38 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, Christian König, faith.ekstrand

On Wed, Aug 16, 2023 at 02:30:38PM +0200, Danilo Krummrich wrote:
> On 8/16/23 16:05, Christian König wrote:
> > Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
> > > Hi Matt,
> > > 
> > > On 8/11/23 04:31, Matthew Brost wrote:
> > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > seems a bit odd but let us explain the reasoning below.
> > > > 
> > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > guaranteed to be the same completion even if targeting the same hardware
> > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > which allowed to reorder, timeslice, and preempt submissions. If a using
> > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > > apart as the TDR expects submission order == completion order. Using a
> > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
> > > > 
> > > > 2. In XE submissions are done via programming a ring buffer (circular
> > > > buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
> > > > limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
> > > > control on the ring for free.
> > > 
> > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
> > > 

In Xe the job submission is a series of ring instructions done by the
KMD. The instructions are cache flushes, seqno writes, a jump to the
user BB, etc. The exact instructions for each job vary based on hw
engine type, platform, etc. We derive MAX_SIZE_PER_JOB from the largest
set of instructions needed to submit a job and have a define in the
driver for this. I believe it is currently set to 192 bytes (the actual
define is MAX_JOB_SIZE_BYTES). So a 16k ring lets Xe have 85 jobs in
flight at once.
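
Roughly (illustrative defines, not the exact Xe code):

#define MAX_JOB_SIZE_BYTES	192	/* worst case ring space per job */
#define XE_RING_SIZE		16384	/* 16k ring */
#define HW_SUBMISSION_LIMIT	(XE_RING_SIZE / MAX_JOB_SIZE_BYTES)	/* 85 */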

> > > In Nouveau we currently do have such a limitation as well, but it is
> > > derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would
> > > always be 1. However, I think most jobs won't actually utilize the
> > > whole ring.
> > 
> > Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB =
> > hw_submission_limit (or even hw_submission_limit - 1 when the hw can't

Yes, hw_submission_limit = RING_SIZE / MAX_SIZE_PER_JOB in Xe.


> > distinct full vs empty ring buffer).
> 
> Not sure if I get you right, let me try to clarify what I was trying to say:
> I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really limited by
> anything other than the RING_SIZE and hence we'd never allow more than 1
> active job.

I'm confused how there isn't a limit on the size of the job in Nouveau?
Based on what you have said, a job could be larger than the ring then?

> 
> However, it seems to be more efficient to base ring flow control on the
> actual size of each incoming job rather than the worst case, namely the
> maximum size of a job.
>

If this doesn't work for Nouveau, feel free to flow control the ring
differently, but this works rather well (and is simple) for Xe.

Matt

> > 
> > Otherwise your scheduler might just overwrite the ring buffer by pushing
> > things to fast.
> > 
> > Christian.
> > 
> > > 
> > > Given that, it seems like it would be better to let the scheduler
> > > keep track of empty ring "slots" instead, such that the scheduler
> > > can deceide whether a subsequent job will still fit on the ring and
> > > if not re-evaluate once a previous job finished. Of course each
> > > submitted job would be required to carry the number of slots it
> > > requires on the ring.
> > > 
> > > What to you think of implementing this as alternative flow control
> > > mechanism? Implementation wise this could be a union with the
> > > existing hw_submission_limit.
> > > 
> > > - Danilo
> > > 
> > > > 
> > > > A problem with this design is currently a drm_gpu_scheduler uses a
> > > > kthread for submission / job cleanup. This doesn't scale if a large
> > > > number of drm_gpu_scheduler are used. To work around the scaling issue,
> > > > use a worker rather than kthread for submission / job cleanup.
> > > > 
> > > > v2:
> > > >    - (Rob Clark) Fix msm build
> > > >    - Pass in run work queue
> > > > v3:
> > > >    - (Boris) don't have loop in worker
> > > > v4:
> > > >    - (Tvrtko) break out submit ready, stop, start helpers into own patch
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-16 12:30       ` Danilo Krummrich
  2023-08-16 14:38         ` Matthew Brost
@ 2023-08-16 14:59         ` Christian König
  2023-08-16 16:33           ` Danilo Krummrich
  1 sibling, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-16 14:59 UTC (permalink / raw)
  To: Danilo Krummrich, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 16.08.23 um 14:30 schrieb Danilo Krummrich:
> On 8/16/23 16:05, Christian König wrote:
>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
>>> Hi Matt,
>>>
>>> On 8/11/23 04:31, Matthew Brost wrote:
>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first 
>>>> this
>>>> seems a bit odd but let us explain the reasoning below.
>>>>
>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>> guaranteed to be the same completion even if targeting the same 
>>>> hardware
>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>>> which allowed to reorder, timeslice, and preempt submissions. If a 
>>>> using
>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR 
>>>> falls
>>>> apart as the TDR expects submission order == completion order. Using a
>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>>>
>>>> 2. In XE submissions are done via programming a ring buffer (circular
>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if 
>>>> the
>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get 
>>>> flow
>>>> control on the ring for free.
>>>
>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>>>
>>> In Nouveau we currently do have such a limitation as well, but it is 
>>> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would 
>>> always be 1. However, I think most jobs won't actually utilize the 
>>> whole ring.
>>
>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = 
>> hw_submission_limit (or even hw_submission_limit - 1 when the hw 
>> can't distinct full vs empty ring buffer).
>
> Not sure if I get you right, let me try to clarify what I was trying 
> to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really 
> limited by anything other than the RING_SIZE and hence we'd never 
> allow more than 1 active job.

But that lets the hw run dry between submissions. That is usually a 
pretty horrible idea for performance.

>
> However, it seems to be more efficient to base ring flow control on 
> the actual size of each incoming job rather than the worst case, 
> namely the maximum size of a job.

That doesn't sound like a good idea to me. See, we don't limit the
number of submitted jobs based on the ring size, but rather we calculate
the ring size based on the number of submitted jobs.

In other words, the hw_submission_limit defines the ring size, not the 
other way around. And you usually want the hw_submission_limit as low as 
possible for good scheduler granularity and to avoid extra overhead.
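
Spelled out, the relation in question is simply:

    hw_submission_limit = RING_SIZE / MAX_SIZE_PER_JOB

(minus one job when the hw can't tell a completely full ring apart from 
an empty one), the point being that the limit and the maximum job size 
are chosen first and the ring size then follows from them.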

Christian.

>
>>
>> Otherwise your scheduler might just overwrite the ring buffer by 
>> pushing things to fast.
>>
>> Christian.
>>
>>>
>>> Given that, it seems like it would be better to let the scheduler 
>>> keep track of empty ring "slots" instead, such that the scheduler 
>>> can deceide whether a subsequent job will still fit on the ring and 
>>> if not re-evaluate once a previous job finished. Of course each 
>>> submitted job would be required to carry the number of slots it 
>>> requires on the ring.
>>>
>>> What to you think of implementing this as alternative flow control 
>>> mechanism? Implementation wise this could be a union with the 
>>> existing hw_submission_limit.
>>>
>>> - Danilo
>>>
>>>>
>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>> kthread for submission / job cleanup. This doesn't scale if a large
>>>> number of drm_gpu_scheduler are used. To work around the scaling 
>>>> issue,
>>>> use a worker rather than kthread for submission / job cleanup.
>>>>
>>>> v2:
>>>>    - (Rob Clark) Fix msm build
>>>>    - Pass in run work queue
>>>> v3:
>>>>    - (Boris) don't have loop in worker
>>>> v4:
>>>>    - (Tvrtko) break out submit ready, stop, start helpers into own 
>>>> patch
>>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>
>>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-16 14:38         ` Matthew Brost
@ 2023-08-16 15:40           ` Danilo Krummrich
  0 siblings, 0 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-16 15:40 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, Christian König, faith.ekstrand

On 8/16/23 16:38, Matthew Brost wrote:
> On Wed, Aug 16, 2023 at 02:30:38PM +0200, Danilo Krummrich wrote:
>> On 8/16/23 16:05, Christian König wrote:
>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
>>>> Hi Matt,
>>>>
>>>> On 8/11/23 04:31, Matthew Brost wrote:
>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
>>>>> seems a bit odd but let us explain the reasoning below.
>>>>>
>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>>> guaranteed to be the same completion even if targeting the same hardware
>>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>>>> which allowed to reorder, timeslice, and preempt submissions. If a using
>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
>>>>> apart as the TDR expects submission order == completion order. Using a
>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>>>>
>>>>> 2. In XE submissions are done via programming a ring buffer (circular
>>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if the
>>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get flow
>>>>> control on the ring for free.
>>>>
>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>>>>
> 
> In Xe the job submission is series of ring instructions done by the KMD.
> The instructions are cache flushes, seqno writes, jump to user BB,
> etc... The exact instructions for each job vary based on hw engine type,
> platform, etc... We dervive MAX_SIZE_PER_JOB from the largest set of
> instructions to submit a job and have a define in the driver for this. I
> believe it is currently set to 192 bytes (the actual define is
> MAX_JOB_SIZE_BYTES). So a 16k ring lets Xe have 85 jobs inflight at
> once.

Ok, that sounds different from how Nouveau works. The "largest set of 
instructions to submit a job" really is given by how the hardware works 
rather than being an arbitrary limit?

In Nouveau, userspace can submit an arbitrary number of addresses of 
indirect buffers containing the ring instructions. The ring on the 
kernel side takes the addresses of the indirect buffers rather than the 
instructions themselves. Hence, technically there isn't really a limit 
on the number of IBs submitted by a job except for the ring size.
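
To make the contrast concrete, here is a minimal standalone sketch of 
the two flow control models being compared (illustrative code only, not 
actual Xe or Nouveau source; apart from the 16k ring and the 192 byte 
MAX_JOB_SIZE_BYTES quoted above, all names are made up):

    #include <stdbool.h>

    /* Xe-style: every job reserves the worst case, the limit is constant. */
    #define RING_SIZE          (16 * 1024)  /* bytes */
    #define MAX_JOB_SIZE_BYTES 192          /* largest possible job */

    static unsigned int fixed_submission_limit(void)
    {
            return RING_SIZE / MAX_JOB_SIZE_BYTES;  /* 85 jobs in flight */
    }

    /*
     * Nouveau-style: a job carries N IB addresses, and N is only bounded
     * by how much space is actually left on the ring.
     */
    struct example_job {
            unsigned int num_ibs;  /* ring space this job really occupies */
    };

    static bool variable_job_fits(unsigned int free_ring_slots,
                                  const struct example_job *job)
    {
            return job->num_ibs <= free_ring_slots;
    }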

> 
>>>> In Nouveau we currently do have such a limitation as well, but it is
>>>> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would
>>>> always be 1. However, I think most jobs won't actually utilize the
>>>> whole ring.
>>>
>>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB =
>>> hw_submission_limit (or even hw_submission_limit - 1 when the hw can't
> 
> Yes, hw_submission_limit = RING_SIZE / MAX_SIZE_PER_JOB in Xe.
> 
> 
>>> distinct full vs empty ring buffer).
>>
>> Not sure if I get you right, let me try to clarify what I was trying to say:
>> I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really limited by
>> anything other than the RING_SIZE and hence we'd never allow more than 1
>> active job.
> 
> I'm confused how there isn't a limit on the size of the job in Nouveau?
> Based on what you have said, a job could be larger than the ring then?

As explained above, theoretically it could. It's only limited by the 
ring size.

> 
>>
>> However, it seems to be more efficient to base ring flow control on the
>> actual size of each incoming job rather than the worst case, namely the
>> maximum size of a job.
>>
> 
> If this doesn't work for Nouveau, feel free flow control the ring
> differently but this works rather well (and simple) for Xe.
> 
> Matt
> 
>>>
>>> Otherwise your scheduler might just overwrite the ring buffer by pushing
>>> things to fast.
>>>
>>> Christian.
>>>
>>>>
>>>> Given that, it seems like it would be better to let the scheduler
>>>> keep track of empty ring "slots" instead, such that the scheduler
>>>> can deceide whether a subsequent job will still fit on the ring and
>>>> if not re-evaluate once a previous job finished. Of course each
>>>> submitted job would be required to carry the number of slots it
>>>> requires on the ring.
>>>>
>>>> What to you think of implementing this as alternative flow control
>>>> mechanism? Implementation wise this could be a union with the
>>>> existing hw_submission_limit.
>>>>
>>>> - Danilo
>>>>
>>>>>
>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>> kthread for submission / job cleanup. This doesn't scale if a large
>>>>> number of drm_gpu_scheduler are used. To work around the scaling issue,
>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>
>>>>> v2:
>>>>>     - (Rob Clark) Fix msm build
>>>>>     - Pass in run work queue
>>>>> v3:
>>>>>     - (Boris) don't have loop in worker
>>>>> v4:
>>>>>     - (Tvrtko) break out submit ready, stop, start helpers into own patch
>>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-16 14:59         ` Christian König
@ 2023-08-16 16:33           ` Danilo Krummrich
  2023-08-17  5:33             ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-16 16:33 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 8/16/23 16:59, Christian König wrote:
> Am 16.08.23 um 14:30 schrieb Danilo Krummrich:
>> On 8/16/23 16:05, Christian König wrote:
>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
>>>> Hi Matt,
>>>>
>>>> On 8/11/23 04:31, Matthew Brost wrote:
>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first 
>>>>> this
>>>>> seems a bit odd but let us explain the reasoning below.
>>>>>
>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>>> guaranteed to be the same completion even if targeting the same 
>>>>> hardware
>>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>>>> which allowed to reorder, timeslice, and preempt submissions. If a 
>>>>> using
>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR 
>>>>> falls
>>>>> apart as the TDR expects submission order == completion order. Using a
>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>>>>
>>>>> 2. In XE submissions are done via programming a ring buffer (circular
>>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, if 
>>>>> the
>>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we get 
>>>>> flow
>>>>> control on the ring for free.
>>>>
>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>>>>
>>>> In Nouveau we currently do have such a limitation as well, but it is 
>>>> derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would 
>>>> always be 1. However, I think most jobs won't actually utilize the 
>>>> whole ring.
>>>
>>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = 
>>> hw_submission_limit (or even hw_submission_limit - 1 when the hw 
>>> can't distinct full vs empty ring buffer).
>>
>> Not sure if I get you right, let me try to clarify what I was trying 
>> to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really 
>> limited by anything other than the RING_SIZE and hence we'd never 
>> allow more than 1 active job.
> 
> But that lets the hw run dry between submissions. That is usually a 
> pretty horrible idea for performance.

Correct, that's the reason why I said it seems to be more efficient to 
base ring flow control on the actual size of each incoming job rather 
than the maximum size of a job.

> 
>>
>> However, it seems to be more efficient to base ring flow control on 
>> the actual size of each incoming job rather than the worst case, 
>> namely the maximum size of a job.
> 
> That doesn't sounds like a good idea to me. See we don't limit the 
> number of submitted jobs based on the ring size, but rather we calculate 
> the ring size based on the number of submitted jobs.
> 

My point isn't really about whether we derive the ring size from the job 
limit or the other way around. It's more about the job size (or its 
maximum size) being arbitrary.

As mentioned in my reply to Matt:

"In Nouveau, userspace can submit an arbitrary amount of addresses of 
indirect bufferes containing the ring instructions. The ring on the 
kernel side takes the addresses of the indirect buffers rather than the 
instructions themself. Hence, technically there isn't really a limit on 
the amount of IBs submitted by a job except for the ring size."

So, my point is that I don't really want to limit the job size 
artificially just to be able to fit multiple jobs into the ring even if 
they're submitted at their "artificial" maximum size, but rather track 
how much of the ring the submitted job actually occupies.

> In other words the hw_submission_limit defines the ring size, not the 
> other way around. And you usually want the hw_submission_limit as low as 
> possible for good scheduler granularity and to avoid extra overhead.

I don't think you really mean "as low as possible", do you? Because one 
really is the minimum if you want to do work at all, but as you 
mentioned above a job limit of one can let the ring run dry.

In the end my proposal comes down to tracking the actual size of a job 
rather than just assuming a pre-defined maximum job size, and hence a 
dynamic job limit.

I don't think this would hurt the scheduler granularity. In fact, it 
should serve the goal of not letting the ring run dry even better, 
especially for sequences of small jobs, where the current implementation 
might wrongly assume the ring is already full although there would 
actually still be enough space left.

> 
> Christian.
> 
>>
>>>
>>> Otherwise your scheduler might just overwrite the ring buffer by 
>>> pushing things to fast.
>>>
>>> Christian.
>>>
>>>>
>>>> Given that, it seems like it would be better to let the scheduler 
>>>> keep track of empty ring "slots" instead, such that the scheduler 
>>>> can deceide whether a subsequent job will still fit on the ring and 
>>>> if not re-evaluate once a previous job finished. Of course each 
>>>> submitted job would be required to carry the number of slots it 
>>>> requires on the ring.
>>>>
>>>> What to you think of implementing this as alternative flow control 
>>>> mechanism? Implementation wise this could be a union with the 
>>>> existing hw_submission_limit.
>>>>
>>>> - Danilo
>>>>
>>>>>
>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>> kthread for submission / job cleanup. This doesn't scale if a large
>>>>> number of drm_gpu_scheduler are used. To work around the scaling 
>>>>> issue,
>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>
>>>>> v2:
>>>>>    - (Rob Clark) Fix msm build
>>>>>    - Pass in run work queue
>>>>> v3:
>>>>>    - (Boris) don't have loop in worker
>>>>> v4:
>>>>>    - (Tvrtko) break out submit ready, stop, start helpers into own 
>>>>> patch
>>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-16 16:33           ` Danilo Krummrich
@ 2023-08-17  5:33             ` Christian König
  2023-08-17 11:13               ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-17  5:33 UTC (permalink / raw)
  To: Danilo Krummrich, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 16.08.23 um 18:33 schrieb Danilo Krummrich:
> On 8/16/23 16:59, Christian König wrote:
>> Am 16.08.23 um 14:30 schrieb Danilo Krummrich:
>>> On 8/16/23 16:05, Christian König wrote:
>>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
>>>>> Hi Matt,
>>>>>
>>>>> On 8/11/23 04:31, Matthew Brost wrote:
>>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At 
>>>>>> first this
>>>>>> seems a bit odd but let us explain the reasoning below.
>>>>>>
>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>>>> guaranteed to be the same completion even if targeting the same 
>>>>>> hardware
>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>>>>> which allowed to reorder, timeslice, and preempt submissions. If 
>>>>>> a using
>>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the 
>>>>>> TDR falls
>>>>>> apart as the TDR expects submission order == completion order. 
>>>>>> Using a
>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>>>>>
>>>>>> 2. In XE submissions are done via programming a ring buffer 
>>>>>> (circular
>>>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, 
>>>>>> if the
>>>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we 
>>>>>> get flow
>>>>>> control on the ring for free.
>>>>>
>>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>>>>>
>>>>> In Nouveau we currently do have such a limitation as well, but it 
>>>>> is derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB 
>>>>> would always be 1. However, I think most jobs won't actually 
>>>>> utilize the whole ring.
>>>>
>>>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = 
>>>> hw_submission_limit (or even hw_submission_limit - 1 when the hw 
>>>> can't distinct full vs empty ring buffer).
>>>
>>> Not sure if I get you right, let me try to clarify what I was trying 
>>> to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't 
>>> really limited by anything other than the RING_SIZE and hence we'd 
>>> never allow more than 1 active job.
>>
>> But that lets the hw run dry between submissions. That is usually a 
>> pretty horrible idea for performance.
>
> Correct, that's the reason why I said it seems to be more efficient to 
> base ring flow control on the actual size of each incoming job rather 
> than the maximum size of a job.
>
>>
>>>
>>> However, it seems to be more efficient to base ring flow control on 
>>> the actual size of each incoming job rather than the worst case, 
>>> namely the maximum size of a job.
>>
>> That doesn't sounds like a good idea to me. See we don't limit the 
>> number of submitted jobs based on the ring size, but rather we 
>> calculate the ring size based on the number of submitted jobs.
>>
>
> My point isn't really about whether we derive the ring size from the 
> job limit or the other way around. It's more about the job size (or 
> its maximum size) being arbitrary.
>
> As mentioned in my reply to Matt:
>
> "In Nouveau, userspace can submit an arbitrary amount of addresses of 
> indirect bufferes containing the ring instructions. The ring on the 
> kernel side takes the addresses of the indirect buffers rather than 
> the instructions themself. Hence, technically there isn't really a 
> limit on the amount of IBs submitted by a job except for the ring size."
>
> So, my point is that I don't really want to limit the job size 
> artificially just to be able to fit multiple jobs into the ring even 
> if they're submitted at their "artificial" maximum size, but rather 
> track how much of the ring the submitted job actually occupies.
>
>> In other words the hw_submission_limit defines the ring size, not the 
>> other way around. And you usually want the hw_submission_limit as low 
>> as possible for good scheduler granularity and to avoid extra overhead.
>
> I don't think you really mean "as low as possible", do you?

No, I do mean as low as possible or in other words as few as possible.

Ideally the scheduler would submit only the minimum amount of work to 
the hardware to keep the hardware busy.

The hardware seems to work mostly the same for all vendors, but you 
seem to think that filling the ring is somehow beneficial, which is 
really not the case as far as I can see.

Regards,
Christian.

> Because one really is the minimum if you want to do work at all, but 
> as you mentioned above a job limit of one can let the ring run dry.
>
> In the end my proposal comes down to tracking the actual size of a job 
> rather than just assuming a pre-defined maximum job size, and hence a 
> dynamic job limit.
>
> I don't think this would hurt the scheduler granularity. In fact, it 
> should even contribute to the desire of not letting the ring run dry 
> even better. Especially for sequences of small jobs, where the current 
> implementation might wrongly assume the ring is already full although 
> actually there would still be enough space left.
>
>>
>> Christian.
>>
>>>
>>>>
>>>> Otherwise your scheduler might just overwrite the ring buffer by 
>>>> pushing things to fast.
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>> Given that, it seems like it would be better to let the scheduler 
>>>>> keep track of empty ring "slots" instead, such that the scheduler 
>>>>> can deceide whether a subsequent job will still fit on the ring 
>>>>> and if not re-evaluate once a previous job finished. Of course 
>>>>> each submitted job would be required to carry the number of slots 
>>>>> it requires on the ring.
>>>>>
>>>>> What to you think of implementing this as alternative flow control 
>>>>> mechanism? Implementation wise this could be a union with the 
>>>>> existing hw_submission_limit.
>>>>>
>>>>> - Danilo
>>>>>
>>>>>>
>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>>> kthread for submission / job cleanup. This doesn't scale if a large
>>>>>> number of drm_gpu_scheduler are used. To work around the scaling 
>>>>>> issue,
>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>
>>>>>> v2:
>>>>>>    - (Rob Clark) Fix msm build
>>>>>>    - Pass in run work queue
>>>>>> v3:
>>>>>>    - (Boris) don't have loop in worker
>>>>>> v4:
>>>>>>    - (Tvrtko) break out submit ready, stop, start helpers into 
>>>>>> own patch
>>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>
>>>>
>>>
>>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-17  5:33             ` Christian König
@ 2023-08-17 11:13               ` Danilo Krummrich
  2023-08-17 13:35                 ` Christian König
                                   ` (2 more replies)
  0 siblings, 3 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-17 11:13 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 8/17/23 07:33, Christian König wrote:
> Am 16.08.23 um 18:33 schrieb Danilo Krummrich:
>> On 8/16/23 16:59, Christian König wrote:
>>> Am 16.08.23 um 14:30 schrieb Danilo Krummrich:
>>>> On 8/16/23 16:05, Christian König wrote:
>>>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
>>>>>> Hi Matt,
>>>>>>
>>>>>> On 8/11/23 04:31, Matthew Brost wrote:
>>>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>>>>>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At 
>>>>>>> first this
>>>>>>> seems a bit odd but let us explain the reasoning below.
>>>>>>>
>>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>>>>> guaranteed to be the same completion even if targeting the same 
>>>>>>> hardware
>>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>>>>>> which allowed to reorder, timeslice, and preempt submissions. If 
>>>>>>> a using
>>>>>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the 
>>>>>>> TDR falls
>>>>>>> apart as the TDR expects submission order == completion order. 
>>>>>>> Using a
>>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>>>>>>
>>>>>>> 2. In XE submissions are done via programming a ring buffer 
>>>>>>> (circular
>>>>>>> buffer), a drm_gpu_scheduler provides a limit on number of jobs, 
>>>>>>> if the
>>>>>>> limit of number jobs is set to RING_SIZE / MAX_SIZE_PER_JOB we 
>>>>>>> get flow
>>>>>>> control on the ring for free.
>>>>>>
>>>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>>>>>>
>>>>>> In Nouveau we currently do have such a limitation as well, but it 
>>>>>> is derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB 
>>>>>> would always be 1. However, I think most jobs won't actually 
>>>>>> utilize the whole ring.
>>>>>
>>>>> Well that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = 
>>>>> hw_submission_limit (or even hw_submission_limit - 1 when the hw 
>>>>> can't distinct full vs empty ring buffer).
>>>>
>>>> Not sure if I get you right, let me try to clarify what I was trying 
>>>> to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't 
>>>> really limited by anything other than the RING_SIZE and hence we'd 
>>>> never allow more than 1 active job.
>>>
>>> But that lets the hw run dry between submissions. That is usually a 
>>> pretty horrible idea for performance.
>>
>> Correct, that's the reason why I said it seems to be more efficient to 
>> base ring flow control on the actual size of each incoming job rather 
>> than the maximum size of a job.
>>
>>>
>>>>
>>>> However, it seems to be more efficient to base ring flow control on 
>>>> the actual size of each incoming job rather than the worst case, 
>>>> namely the maximum size of a job.
>>>
>>> That doesn't sounds like a good idea to me. See we don't limit the 
>>> number of submitted jobs based on the ring size, but rather we 
>>> calculate the ring size based on the number of submitted jobs.
>>>
>>
>> My point isn't really about whether we derive the ring size from the 
>> job limit or the other way around. It's more about the job size (or 
>> its maximum size) being arbitrary.
>>
>> As mentioned in my reply to Matt:
>>
>> "In Nouveau, userspace can submit an arbitrary amount of addresses of 
>> indirect bufferes containing the ring instructions. The ring on the 
>> kernel side takes the addresses of the indirect buffers rather than 
>> the instructions themself. Hence, technically there isn't really a 
>> limit on the amount of IBs submitted by a job except for the ring size."
>>
>> So, my point is that I don't really want to limit the job size 
>> artificially just to be able to fit multiple jobs into the ring even 
>> if they're submitted at their "artificial" maximum size, but rather 
>> track how much of the ring the submitted job actually occupies.
>>
>>> In other words the hw_submission_limit defines the ring size, not the 
>>> other way around. And you usually want the hw_submission_limit as low 
>>> as possible for good scheduler granularity and to avoid extra overhead.
>>
>> I don't think you really mean "as low as possible", do you?
> 
> No, I do mean as low as possible or in other words as few as possible.
> 
> Ideally the scheduler would submit only the minimum amount of work to 
> the hardware to keep the hardware busy. >
> The hardware seems to work mostly the same for all vendors, but you 
> somehow seem to think that filling the ring is somehow beneficial which 
> is really not the case as far as I can see.

I think that's a misunderstanding. I'm not trying to say that it is 
*always* beneficial to fill up the ring as much as possible. But I think 
it is under certain circumstances, exactly those circumstances I 
described for Nouveau.

As mentioned, in Nouveau the size of a job is only really limited by the 
ring size, which means that one job can (but does not necessarily) fill 
up the whole ring. We both agree that this is inefficient, because it 
potentially results in the HW running dry due to hw_submission_limit == 1.

I recognize you said that one should define hw_submission_limit and 
adjust the other parts of the equation accordingly; the options I see are:

(1) Increase the ring size while keeping the maximum job size.
(2) Decrease the maximum job size while keeping the ring size.
(3) Let the scheduler track the actual job size rather than the maximum 
job size.

(1) results in potentially wasted ring memory, because we're not 
always reaching the maximum job size, but the scheduler assumes so.

(2) results in more IOCTLs from userspace for the same number of IBs, 
and more jobs result in more memory allocations and more work being 
submitted to the workqueue (with Matt's patches).

(3) doesn't seem to have any of those drawbacks.

What would be your take on that?

Actually, if none of the other drivers is interested in a more precise 
way of keeping track of the ring utilization, I'd be totally fine to do 
it in a driver-specific way. However, unfortunately I don't see how this 
would be possible.

My proposal would be to just keep the hw_submission_limit (maybe rename 
it to submission_unit_limit) and add a submission_units field to struct 
drm_sched_job. By default a job's submission_units field would be 0 and 
the scheduler would behave the exact same way as it does now.

Accordingly, jobs with submission_units > 1 would contribute more than 
one unit to the submission_unit_limit.

What do you think about that?
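
For illustration, a rough standalone sketch of how such unit based 
accounting could look (the names below are made up for the example and 
this is not an actual patch; only submission_unit_limit and 
submission_units mirror the proposal):

    #include <stdbool.h>

    struct sketch_sched {
            unsigned int submission_unit_limit; /* replaces hw_submission_limit */
            unsigned int units_in_flight;       /* sum of units of pushed jobs */
    };

    struct sketch_job {
            unsigned int submission_units;      /* 0 == default, counts as 1 */
    };

    static bool sketch_can_push(const struct sketch_sched *sched,
                                const struct sketch_job *job)
    {
            unsigned int units = job->submission_units ?
                                 job->submission_units : 1;

            return sched->units_in_flight + units <=
                   sched->submission_unit_limit;
    }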

Besides all that, you said that filling up the ring just enough to not 
let the HW run dry, rather than filling it up entirely, is desirable. 
Why do you think so? I tend to think that in most cases it shouldn't 
make a difference.

- Danilo

> 
> Regards,
> Christian.
> 
>> Because one really is the minimum if you want to do work at all, but 
>> as you mentioned above a job limit of one can let the ring run dry.
>>
>> In the end my proposal comes down to tracking the actual size of a job 
>> rather than just assuming a pre-defined maximum job size, and hence a 
>> dynamic job limit.
>>
>> I don't think this would hurt the scheduler granularity. In fact, it 
>> should even contribute to the desire of not letting the ring run dry 
>> even better. Especially for sequences of small jobs, where the current 
>> implementation might wrongly assume the ring is already full although 
>> actually there would still be enough space left.
>>
>>>
>>> Christian.
>>>
>>>>
>>>>>
>>>>> Otherwise your scheduler might just overwrite the ring buffer by 
>>>>> pushing things to fast.
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Given that, it seems like it would be better to let the scheduler 
>>>>>> keep track of empty ring "slots" instead, such that the scheduler 
>>>>>> can deceide whether a subsequent job will still fit on the ring 
>>>>>> and if not re-evaluate once a previous job finished. Of course 
>>>>>> each submitted job would be required to carry the number of slots 
>>>>>> it requires on the ring.
>>>>>>
>>>>>> What to you think of implementing this as alternative flow control 
>>>>>> mechanism? Implementation wise this could be a union with the 
>>>>>> existing hw_submission_limit.
>>>>>>
>>>>>> - Danilo
>>>>>>
>>>>>>>
>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>>>> kthread for submission / job cleanup. This doesn't scale if a large
>>>>>>> number of drm_gpu_scheduler are used. To work around the scaling 
>>>>>>> issue,
>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>
>>>>>>> v2:
>>>>>>>    - (Rob Clark) Fix msm build
>>>>>>>    - Pass in run work queue
>>>>>>> v3:
>>>>>>>    - (Boris) don't have loop in worker
>>>>>>> v4:
>>>>>>>    - (Tvrtko) break out submit ready, stop, start helpers into 
>>>>>>> own patch
>>>>>>>
>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-17 13:35                 ` Christian König
@ 2023-08-17 12:48                   ` Danilo Krummrich
  2023-08-17 16:17                     ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-17 12:48 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 8/17/23 15:35, Christian König wrote:
> Am 17.08.23 um 13:13 schrieb Danilo Krummrich:
>> On 8/17/23 07:33, Christian König wrote:
>>> [SNIP]
>>> The hardware seems to work mostly the same for all vendors, but you 
>>> somehow seem to think that filling the ring is somehow beneficial 
>>> which is really not the case as far as I can see.
>>
>> I think that's a misunderstanding. I'm not trying to say that it is 
>> *always* beneficial to fill up the ring as much as possible. But I 
>> think it is under certain circumstances, exactly those circumstances I 
>> described for Nouveau.
> 
> As far as I can see this is not correct for Nouveau either.
> 
>>
>> As mentioned, in Nouveau the size of a job is only really limited by 
>> the ring size, which means that one job can (but does not necessarily) 
>> fill up the whole ring. We both agree that this is inefficient, 
>> because it potentially results into the HW run dry due to 
>> hw_submission_limit == 1.
>>
>> I recognize you said that one should define hw_submission_limit and 
>> adjust the other parts of the equation accordingly, the options I see 
>> are:
>>
>> (1) Increase the ring size while keeping the maximum job size.
>> (2) Decrease the maximum job size while keeping the ring size.
>> (3) Let the scheduler track the actual job size rather than the 
>> maximum job size.
>>
>> (1) results into potentially wasted ring memory, because we're not 
>> always reaching the maximum job size, but the scheduler assumes so.
>>
>> (2) results into more IOCTLs from userspace for the same amount of IBs 
>> and more jobs result into more memory allocations and more work being 
>> submitted to the workqueue (with Matt's patches).
>>
>> (3) doesn't seem to have any of those draw backs.
>>
>> What would be your take on that?
>>
>> Actually, if none of the other drivers is interested into a more 
>> precise way of keeping track of the ring utilization, I'd be totally 
>> fine to do it in a driver specific way. However, unfortunately I don't 
>> see how this would be possible.
>>
>> My proposal would be to just keep the hw_submission_limit (maybe 
>> rename it to submission_unit_limit) and add a submission_units field 
>> to struct drm_sched_job. By default a jobs submission_units field 
>> would be 0 and the scheduler would behave the exact same way as it 
>> does now.
>>
>> Accordingly, jobs with submission_units > 1 would contribute more than 
>> one unit to the submission_unit_limit.
>>
>> What do you think about that?
> 
> I think you are approaching this from the completely wrong side.

First of all, thanks for keeping up the discussion - I appreciate it. 
Some more comments / questions below.

> 
> See the UAPI needs to be stable, so you need a maximum job size 
> otherwise it can happen that a combination of large and small 
> submissions work while a different combination doesn't.

How is this related to the uAPI being stable? What do you mean by 
'stable' in this context?

The Nouveau uAPI allows userspace to pass EXEC jobs by supplying the 
ring ID (channel), in-/out-syncs and a certain number of indirect push 
buffers. The number of IBs per job is limited by the number of IBs 
fitting into the ring. Just to be clear, when I say 'job size' I mean 
the number of IBs per job.
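
Roughly, the shape of such an EXEC job looks like this (a simplified 
illustration of the information flow, not the actual Nouveau uAPI 
structs or field names):

    #include <stdint.h>

    struct example_exec_push {
            uint64_t va;       /* GPU address of an indirect push buffer (IB) */
            uint32_t length;   /* length of the IB in bytes */
    };

    struct example_exec_job {
            uint32_t channel;      /* ring ID */
            uint32_t num_in_syncs;
            uint32_t num_out_syncs;
            uint32_t num_pushes;   /* only bounded by what fits on the ring */
            uint64_t pushes;       /* user pointer to example_exec_push[] */
    };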

Maybe I should also mention that the rings we are talking about are 
software rings managed by a firmware scheduler. We can have an arbitrary 
number of software rings and even multiple ones per FD.

Given a constant ring size I really don't see why I should limit the 
maximum number of IBs userspace can push per job just to end up with a 
hw_submission_limit > 1.

For example, let's just assume the ring can take 128 IBs: why would I 
limit userspace to submitting just, e.g., 16 IBs at a time, such that 
the hw_submission_limit becomes 8?

What is the advantage of doing that, rather than letting userspace 
submit *up to* 128 IBs per job and just letting the scheduler push IBs 
to the ring as long as there's actually space left on the ring?
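
To put numbers on that (a worked example with the figures above, 
assuming one ring slot per IB):

    ring capacity:                128 IBs
    fixed cap of 16 IBs per job:  hw_submission_limit = 128 / 16 = 8
                                  -> eight 1-IB jobs occupy only 8 slots,
                                     yet the scheduler treats the ring
                                     as full
    dynamic accounting:           anything from a single 128-IB job up to
                                  128 1-IB jobs can be in flight, as long
                                  as the per-job IB counts sum to at
                                  most 128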

> 
> So what you usually do, and this is driver independent because simply a 
> requirement of the UAPI, is that you say here that's my maximum job size 
> as well as the number of submission which should be pushed to the hw at 
> the same time. And then get the resulting ring size by the product of 
> the two.

Given the above, how is that a requirement of the uAPI?

> 
> That the ring in this use case can't be fully utilized is not a draw 
> back, this is completely intentional design which should apply to all 
> drivers independent of the vendor.

Why wouldn't we want to fully utilize the ring size?

- Danilo

> 
>>
>> Besides all that, you said that filling up the ring just enough to not 
>> let the HW run dry rather than filling it up entirely is desirable. 
>> Why do you think so? I tend to think that in most cases it shouldn't 
>> make difference.
> 
> That results in better scheduling behavior. It's mostly beneficial if 
> you don't have a hw scheduler, but as far as I can see there is no need 
> to pump everything to the hw as fast as possible.
> 
> Regards,
> Christian.
> 
>>
>> - Danilo
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>> Because one really is the minimum if you want to do work at all, but 
>>>> as you mentioned above a job limit of one can let the ring run dry.
>>>>
>>>> In the end my proposal comes down to tracking the actual size of a 
>>>> job rather than just assuming a pre-defined maximum job size, and 
>>>> hence a dynamic job limit.
>>>>
>>>> I don't think this would hurt the scheduler granularity. In fact, it 
>>>> should even contribute to the desire of not letting the ring run dry 
>>>> even better. Especially for sequences of small jobs, where the 
>>>> current implementation might wrongly assume the ring is already full 
>>>> although actually there would still be enough space left.
>>>>
>>>>>
>>>>> Christian.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Otherwise your scheduler might just overwrite the ring buffer by 
>>>>>>> pushing things to fast.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>> Given that, it seems like it would be better to let the 
>>>>>>>> scheduler keep track of empty ring "slots" instead, such that 
>>>>>>>> the scheduler can deceide whether a subsequent job will still 
>>>>>>>> fit on the ring and if not re-evaluate once a previous job 
>>>>>>>> finished. Of course each submitted job would be required to 
>>>>>>>> carry the number of slots it requires on the ring.
>>>>>>>>
>>>>>>>> What to you think of implementing this as alternative flow 
>>>>>>>> control mechanism? Implementation wise this could be a union 
>>>>>>>> with the existing hw_submission_limit.
>>>>>>>>
>>>>>>>> - Danilo
>>>>>>>>
>>>>>>>>>
>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a 
>>>>>>>>> large
>>>>>>>>> number of drm_gpu_scheduler are used. To work around the 
>>>>>>>>> scaling issue,
>>>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>>    - (Rob Clark) Fix msm build
>>>>>>>>>    - Pass in run work queue
>>>>>>>>> v3:
>>>>>>>>>    - (Boris) don't have loop in worker
>>>>>>>>> v4:
>>>>>>>>>    - (Tvrtko) break out submit ready, stop, start helpers into 
>>>>>>>>> own patch
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-17 11:13               ` Danilo Krummrich
@ 2023-08-17 13:35                 ` Christian König
  2023-08-17 12:48                   ` Danilo Krummrich
  2023-08-18  3:08                 ` Matthew Brost
  2023-09-12 14:28                 ` Boris Brezillon
  2 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-17 13:35 UTC (permalink / raw)
  To: Danilo Krummrich, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 17.08.23 um 13:13 schrieb Danilo Krummrich:
> On 8/17/23 07:33, Christian König wrote:
>> [SNIP]
>> The hardware seems to work mostly the same for all vendors, but you 
>> somehow seem to think that filling the ring is somehow beneficial 
>> which is really not the case as far as I can see.
>
> I think that's a misunderstanding. I'm not trying to say that it is 
> *always* beneficial to fill up the ring as much as possible. But I 
> think it is under certain circumstances, exactly those circumstances I 
> described for Nouveau.

As far as I can see this is not correct for Nouveau either.

>
> As mentioned, in Nouveau the size of a job is only really limited by 
> the ring size, which means that one job can (but does not necessarily) 
> fill up the whole ring. We both agree that this is inefficient, 
> because it potentially results into the HW run dry due to 
> hw_submission_limit == 1.
>
> I recognize you said that one should define hw_submission_limit and 
> adjust the other parts of the equation accordingly, the options I see 
> are:
>
> (1) Increase the ring size while keeping the maximum job size.
> (2) Decrease the maximum job size while keeping the ring size.
> (3) Let the scheduler track the actual job size rather than the 
> maximum job size.
>
> (1) results into potentially wasted ring memory, because we're not 
> always reaching the maximum job size, but the scheduler assumes so.
>
> (2) results into more IOCTLs from userspace for the same amount of IBs 
> and more jobs result into more memory allocations and more work being 
> submitted to the workqueue (with Matt's patches).
>
> (3) doesn't seem to have any of those draw backs.
>
> What would be your take on that?
>
> Actually, if none of the other drivers is interested into a more 
> precise way of keeping track of the ring utilization, I'd be totally 
> fine to do it in a driver specific way. However, unfortunately I don't 
> see how this would be possible.
>
> My proposal would be to just keep the hw_submission_limit (maybe 
> rename it to submission_unit_limit) and add a submission_units field 
> to struct drm_sched_job. By default a jobs submission_units field 
> would be 0 and the scheduler would behave the exact same way as it 
> does now.
>
> Accordingly, jobs with submission_units > 1 would contribute more than 
> one unit to the submission_unit_limit.
>
> What do you think about that?

I think you are approaching this from the completely wrong side.

See, the UAPI needs to be stable, so you need a maximum job size; 
otherwise it can happen that a combination of large and small 
submissions works while a different combination doesn't.

So what you usually do, and this is driver independent because it is 
simply a requirement of the UAPI, is to say: here is my maximum job size 
as well as the number of submissions which should be pushed to the hw at 
the same time. And then you get the resulting ring size as the product 
of the two.

That the ring in this use case can't be fully utilized is not a 
drawback; this is a completely intentional design which should apply to 
all drivers independent of the vendor.

>
> Besides all that, you said that filling up the ring just enough to not 
> let the HW run dry rather than filling it up entirely is desirable. 
> Why do you think so? I tend to think that in most cases it shouldn't 
> make difference.

That results in better scheduling behavior. It's mostly beneficial if 
you don't have a hw scheduler, but as far as I can see there is no need 
to pump everything to the hw as fast as possible.

Regards,
Christian.

>
> - Danilo
>
>>
>> Regards,
>> Christian.
>>
>>> Because one really is the minimum if you want to do work at all, but 
>>> as you mentioned above a job limit of one can let the ring run dry.
>>>
>>> In the end my proposal comes down to tracking the actual size of a 
>>> job rather than just assuming a pre-defined maximum job size, and 
>>> hence a dynamic job limit.
>>>
>>> I don't think this would hurt the scheduler granularity. In fact, it 
>>> should even contribute to the desire of not letting the ring run dry 
>>> even better. Especially for sequences of small jobs, where the 
>>> current implementation might wrongly assume the ring is already full 
>>> although actually there would still be enough space left.
>>>
>>>>
>>>> Christian.
>>>>
>>>>>
>>>>>>
>>>>>> Otherwise your scheduler might just overwrite the ring buffer by 
>>>>>> pushing things to fast.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Given that, it seems like it would be better to let the 
>>>>>>> scheduler keep track of empty ring "slots" instead, such that 
>>>>>>> the scheduler can deceide whether a subsequent job will still 
>>>>>>> fit on the ring and if not re-evaluate once a previous job 
>>>>>>> finished. Of course each submitted job would be required to 
>>>>>>> carry the number of slots it requires on the ring.
>>>>>>>
>>>>>>> What to you think of implementing this as alternative flow 
>>>>>>> control mechanism? Implementation wise this could be a union 
>>>>>>> with the existing hw_submission_limit.
>>>>>>>
>>>>>>> - Danilo
>>>>>>>
>>>>>>>>
>>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a 
>>>>>>>> large
>>>>>>>> number of drm_gpu_scheduler are used. To work around the 
>>>>>>>> scaling issue,
>>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>>
>>>>>>>> v2:
>>>>>>>>    - (Rob Clark) Fix msm build
>>>>>>>>    - Pass in run work queue
>>>>>>>> v3:
>>>>>>>>    - (Boris) don't have loop in worker
>>>>>>>> v4:
>>>>>>>>    - (Tvrtko) break out submit ready, stop, start helpers into 
>>>>>>>> own patch
>>>>>>>>
>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-11  2:31 ` [PATCH v2 4/9] drm/sched: Split free_job into own work item Matthew Brost
@ 2023-08-17 13:39   ` Christian König
  2023-08-17 17:54     ` Matthew Brost
  2023-08-24 23:04   ` Danilo Krummrich
  2023-08-28 18:04   ` Danilo Krummrich
  2 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-17 13:39 UTC (permalink / raw)
  To: Matthew Brost, dri-devel, intel-xe
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen,
	Liviu.Dudau, luben.tuikov, lina, donald.robson, boris.brezillon,
	faith.ekstrand

Am 11.08.23 um 04:31 schrieb Matthew Brost:
> Rather than call free_job and run_job in same work item have a dedicated
> work item for each. This aligns with the design and intended use of work
> queues.

I would rather say we should get rid of the free_job callback completely.

Essentially the job is just the container which carries the information 
that is necessary before you push it to the hw. The real representation 
of the submission is actually the scheduler fence.

All the lifetime issues we had came from ignoring this fact, and I think 
we should push for fixing this design up again.
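
For context, a condensed view of the relationship being referred to 
(simplified from the existing scheduler structures, not a proposed 
change):

    /*
     * Simplified view of the existing objects (many members omitted):
     *
     *   struct drm_sched_job {                 container, freed via ops->free_job()
     *           struct drm_sched_fence *s_fence;   the long-lived representation
     *   };
     *
     *   struct drm_sched_fence {               refcounted via its embedded dma_fences
     *           struct dma_fence scheduled;    signals when the job reaches the hw
     *           struct dma_fence finished;     signals once the job has completed
     *   };
     *
     * The point being that after submission the scheduler fence, not the
     * job, is the object whose lifetime actually matters.
     */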

Regards,
Christian.

>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>   include/drm/gpu_scheduler.h            |   8 +-
>   2 files changed, 106 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index cede47afc800..b67469eac179 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>    *
>    * @rq: scheduler run queue to check.
> + * @dequeue: dequeue selected entity
>    *
>    * Try to find a ready entity, returns NULL if none found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>   {
>   	struct drm_sched_entity *entity;
>   
> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   	if (entity) {
>   		list_for_each_entry_continue(entity, &rq->entities, list) {
>   			if (drm_sched_entity_is_ready(entity)) {
> -				rq->current_entity = entity;
> -				reinit_completion(&entity->entity_idle);
> +				if (dequeue) {
> +					rq->current_entity = entity;
> +					reinit_completion(&entity->entity_idle);
> +				}
>   				spin_unlock(&rq->lock);
>   				return entity;
>   			}
> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   	list_for_each_entry(entity, &rq->entities, list) {
>   
>   		if (drm_sched_entity_is_ready(entity)) {
> -			rq->current_entity = entity;
> -			reinit_completion(&entity->entity_idle);
> +			if (dequeue) {
> +				rq->current_entity = entity;
> +				reinit_completion(&entity->entity_idle);
> +			}
>   			spin_unlock(&rq->lock);
>   			return entity;
>   		}
> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>    *
>    * @rq: scheduler run queue to check.
> + * @dequeue: dequeue selected entity
>    *
>    * Find oldest waiting ready entity, returns NULL if none found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>   {
>   	struct rb_node *rb;
>   
> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>   
>   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>   		if (drm_sched_entity_is_ready(entity)) {
> -			rq->current_entity = entity;
> -			reinit_completion(&entity->entity_idle);
> +			if (dequeue) {
> +				rq->current_entity = entity;
> +				reinit_completion(&entity->entity_idle);
> +			}
>   			break;
>   		}
>   	}
> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>   }
>   
>   /**
> - * drm_sched_submit_queue - scheduler queue submission
> + * drm_sched_run_job_queue - queue job submission
>    * @sched: scheduler instance
>    */
> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>   {
>   	if (!READ_ONCE(sched->pause_submit))
> -		queue_work(sched->submit_wq, &sched->work_submit);
> +		queue_work(sched->submit_wq, &sched->work_run_job);
> +}
> +
> +static struct drm_sched_entity *
> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> +
> +/**
> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> +	if (drm_sched_select_entity(sched, false))
> +		drm_sched_run_job_queue(sched);
> +}
> +
> +/**
> + * drm_sched_free_job_queue - queue free job
> + *
> + * @sched: scheduler instance to queue free job
> + */
> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> +{
> +	if (!READ_ONCE(sched->pause_submit))
> +		queue_work(sched->submit_wq, &sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_free_job_queue_if_ready - queue free job if ready
> + *
> + * @sched: scheduler instance to queue free job
> + */
> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> +	struct drm_sched_job *job;
> +
> +	spin_lock(&sched->job_list_lock);
> +	job = list_first_entry_or_null(&sched->pending_list,
> +				       struct drm_sched_job, list);
> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> +		drm_sched_free_job_queue(sched);
> +	spin_unlock(&sched->job_list_lock);
>   }
>   
>   /**
> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>   	dma_fence_get(&s_fence->finished);
>   	drm_sched_fence_finished(s_fence, result);
>   	dma_fence_put(&s_fence->finished);
> -	drm_sched_submit_queue(sched);
> +	drm_sched_free_job_queue(sched);
>   }
>   
>   /**
> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>   void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>   {
>   	if (drm_sched_can_queue(sched))
> -		drm_sched_submit_queue(sched);
> +		drm_sched_run_job_queue(sched);
>   }
>   
>   /**
>    * drm_sched_select_entity - Select next entity to process
>    *
>    * @sched: scheduler instance
> + * @dequeue: dequeue selected entity
>    *
>    * Returns the entity to process or NULL if none are found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>   {
>   	struct drm_sched_entity *entity;
>   	int i;
> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>   	/* Kernel run queue has higher priority than normal run queue*/
>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>   		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> +							dequeue) :
> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> +						      dequeue);
>   		if (entity)
>   			break;
>   	}
> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>   EXPORT_SYMBOL(drm_sched_pick_best);
>   
>   /**
> - * drm_sched_main - main scheduler thread
> + * drm_sched_free_job_work - worker to call free_job
>    *
> - * @param: scheduler instance
> + * @w: free job work
>    */
> -static void drm_sched_main(struct work_struct *w)
> +static void drm_sched_free_job_work(struct work_struct *w)
>   {
>   	struct drm_gpu_scheduler *sched =
> -		container_of(w, struct drm_gpu_scheduler, work_submit);
> -	struct drm_sched_entity *entity;
> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
>   	struct drm_sched_job *cleanup_job;
> -	int r;
>   
>   	if (READ_ONCE(sched->pause_submit))
>   		return;
>   
>   	cleanup_job = drm_sched_get_cleanup_job(sched);
> -	entity = drm_sched_select_entity(sched);
> +	if (cleanup_job) {
> +		sched->ops->free_job(cleanup_job);
> +
> +		drm_sched_free_job_queue_if_ready(sched);
> +		drm_sched_run_job_queue_if_ready(sched);
> +	}
> +}
>   
> -	if (!entity && !cleanup_job)
> -		return;	/* No more work */
> +/**
> + * drm_sched_run_job_work - worker to call run_job
> + *
> + * @w: run job work
> + */
> +static void drm_sched_run_job_work(struct work_struct *w)
> +{
> +	struct drm_gpu_scheduler *sched =
> +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> +	struct drm_sched_entity *entity;
> +	int r;
>   
> -	if (cleanup_job)
> -		sched->ops->free_job(cleanup_job);
> +	if (READ_ONCE(sched->pause_submit))
> +		return;
>   
> +	entity = drm_sched_select_entity(sched, true);
>   	if (entity) {
>   		struct dma_fence *fence;
>   		struct drm_sched_fence *s_fence;
> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>   		sched_job = drm_sched_entity_pop_job(entity);
>   		if (!sched_job) {
>   			complete_all(&entity->entity_idle);
> -			if (!cleanup_job)
> -				return;	/* No more work */
> -			goto again;
> +			return;	/* No more work */
>   		}
>   
>   		s_fence = sched_job->s_fence;
> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>   		}
>   
>   		wake_up(&sched->job_scheduled);
> +		drm_sched_run_job_queue_if_ready(sched);
>   	}
> -
> -again:
> -	drm_sched_submit_queue(sched);
>   }
>   
>   /**
> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   	spin_lock_init(&sched->job_list_lock);
>   	atomic_set(&sched->hw_rq_count, 0);
>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> -	INIT_WORK(&sched->work_submit, drm_sched_main);
> +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>   	atomic_set(&sched->_score, 0);
>   	atomic64_set(&sched->job_id_count, 0);
>   	sched->pause_submit = false;
> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>   void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>   {
>   	WRITE_ONCE(sched->pause_submit, true);
> -	cancel_work_sync(&sched->work_submit);
> +	cancel_work_sync(&sched->work_run_job);
> +	cancel_work_sync(&sched->work_free_job);
>   }
>   EXPORT_SYMBOL(drm_sched_submit_stop);
>   
> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>   void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>   {
>   	WRITE_ONCE(sched->pause_submit, false);
> -	queue_work(sched->submit_wq, &sched->work_submit);
> +	queue_work(sched->submit_wq, &sched->work_run_job);
> +	queue_work(sched->submit_wq, &sched->work_free_job);
>   }
>   EXPORT_SYMBOL(drm_sched_submit_start);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 04eec2d7635f..fbc083a92757 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>    *                 finished.
>    * @hw_rq_count: the number of jobs currently in the hardware queue.
>    * @job_id_count: used to assign unique id to the each job.
> - * @submit_wq: workqueue used to queue @work_submit
> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>    * @timeout_wq: workqueue used to queue @work_tdr
> - * @work_submit: schedules jobs and cleans up entities
> + * @work_run_job: schedules jobs
> + * @work_free_job: cleans up jobs
>    * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>    *            timeout interval is over.
>    * @pending_list: the list of jobs which are currently in the job queue.
> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>   	atomic64_t			job_id_count;
>   	struct workqueue_struct		*submit_wq;
>   	struct workqueue_struct		*timeout_wq;
> -	struct work_struct		work_submit;
> +	struct work_struct		work_run_job;
> +	struct work_struct		work_free_job;
>   	struct delayed_work		work_tdr;
>   	struct list_head		pending_list;
>   	spinlock_t			job_list_lock;


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-17 12:48                   ` Danilo Krummrich
@ 2023-08-17 16:17                     ` Christian König
  2023-08-18 11:58                       ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-17 16:17 UTC (permalink / raw)
  To: Danilo Krummrich, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 17.08.23 um 14:48 schrieb Danilo Krummrich:
> On 8/17/23 15:35, Christian König wrote:
>> Am 17.08.23 um 13:13 schrieb Danilo Krummrich:
>>> On 8/17/23 07:33, Christian König wrote:
>>>> [SNIP]
>>>> My proposal would be to just keep the hw_submission_limit (maybe 
>>>> rename it to submission_unit_limit) and add a submission_units 
>>>> field to struct drm_sched_job. By default a jobs submission_units 
>>>> field would be 0 and the scheduler would behave the exact same way 
>>>> as it does now.
>>>>
>>>> Accordingly, jobs with submission_units > 1 would contribute more 
>>>> than one unit to the submission_unit_limit.
>>>>
>>>> What do you think about that?
>>
>> I think you are approaching this from the completely wrong side.
>
> First of all, thanks for keeping up the discussion - I appreciate it. 
> Some more comments / questions below.
>
>>
>> See the UAPI needs to be stable, so you need a maximum job size 
>> otherwise it can happen that a combination of large and small 
>> submissions work while a different combination doesn't.
>
> How is this related to the uAPI being stable? What do you mean by 
> 'stable' in this context?

Stable as in you don't get inconsistent behavior, not stable in the 
sense of backward compatibility. Sorry for the confusing wording :)

>
> The Nouveau uAPI allows userspace to pass EXEC jobs by supplying the 
> ring ID (channel), in-/out-syncs and a certain amount of indirect push 
> buffers. The amount of IBs per job is limited by the amount of IBs 
> fitting into the ring. Just to be clear, when I say 'job size' I mean 
> the amount of IBs per job.

Well that more or less sounds identical to all other hardware I know of, 
e.g. AMD, Intel and the different ARM chips seem to all work like this. 
But on those drivers the job size limit is not the ring size, but rather 
a fixed value (at least as far as I know).

>
> Maybe I should also mention that the rings we are talking about are 
> software rings managed by a firmware scheduler. We can have an 
> arbitrary amount of software rings and even multiple ones per FD.
>
> Given a constant ring size I really don't see why I should limit the 
> maximum amount of IBs userspace can push per job just to end up with a 
> hw_submission_limit > 1.
>
> For example, let's just assume the ring can take 128 IBs, why would I 
> limit userspace to submit just e.g. 16 IBs at a time, such that the 
> hw_submission_limit becomes 8?

Well the question is what happens when you have two submissions back to 
back which use more than half of the ring buffer?

I only see two possible outcomes:
1. You return an -EBUSY (or similar) error code indicating that the hw can't 
receive more commands.
2. Wait on previously pushed commands to be executed.
(3. Your driver crashes because you accidentally overwrite stuff in the 
ring buffer which is still being executed. I just assume that's prevented).

Resolution #1 with -EBUSY is actually something the UAPI should not do, 
because your UAPI then depends on the specific timing of submissions 
which is a really bad idea.

Resolution #2 is usually bad because it forces the hw to run dry between 
submissions and so degrades performance.
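To make that concrete with made-up numbers (illustrative only, not code from 
any driver): with a 128 entry ring and two back to back 80 entry submissions, 
the second one simply doesn't fit until the first has been consumed, so the 
writer can only fail or wait:

#include <errno.h>

/* 'struct ring' and 'ring_try_emit' are invented for this example and are
 * not part of the DRM scheduler or any driver. */
#define RING_SIZE 128u

struct ring {
	unsigned int free_entries;	/* entries not yet consumed by the hw */
};

static int ring_try_emit(struct ring *r, unsigned int job_size)
{
	if (job_size > r->free_entries)
		return -EBUSY;	/* option #1 above; the alternative is to wait (#2) */

	r->free_entries -= job_size;	/* the job fits, copy it into the ring */
	return 0;
}

With free_entries starting at RING_SIZE, the first 80 entry job succeeds and 
leaves 48 free entries, so the second 80 entry job hits the -EBUSY path.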

>
> What is the advantage of doing that, rather than letting userspace 
> submit *up to* 128 IBs per job and just letting the scheduler push IBs 
> to the ring as long as there's actually space left on the ring?

Predictable behavior I think. Basically you want to organize things so that 
the hw is at least kept busy all the time without depending on actual 
timing.

>
>>
>> So what you usually do, and this is driver independent because it is simply 
>> a requirement of the UAPI, is that you say here that's my maximum job 
>> size as well as the number of submissions which should be pushed to 
>> the hw at the same time. And then get the resulting ring size by the 
>> product of the two.
>
> Given the above, how is that a requirement of the uAPI?

The requirement of the UAPI is actually pretty simple: You should get 
consistent results, independent of the timing (at least as long as you 
don't do stuff in parallel).

Otherwise you can run into issues when on a certain configuration stuff 
suddenly runs faster or slower than expected. In other words you should 
not depend on stuff finishing in a certain amount of time.

>
>>
>> That the ring in this use case can't be fully utilized is not a 
>> drawback, this is a completely intentional design which should apply to all 
>> drivers independent of the vendor.
>
> Why wouldn't we want to fully utilize the ring size?

As far as I know everybody restricts the submission size to something 
fixed which is at least smaller than half the ring size to avoid the 
problems mentioned above.
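As a purely made-up example: with a fixed maximum job size of 16 ring entries 
and 8 submissions allowed in flight you would size the ring as

	ring_size = hw_submission_limit * max_job_size = 8 * 16 = 128 entries

so any single job occupies at most an eighth of the ring and the scheduler can 
always keep several submissions queued up behind the one currently executing.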

Regards,
Christian.

>
> - Danilo
>
>>
>>>
>>> Besides all that, you said that filling up the ring just enough to 
>>> not let the HW run dry rather than filling it up entirely is 
>>> desirable. Why do you think so? I tend to think that in most cases 
>>> it shouldn't make difference.
>>
>> That results in better scheduling behavior. It's mostly beneficial if 
>> you don't have a hw scheduler, but as far as I can see there is no 
>> need to pump everything to the hw as fast as possible.
>>
>> Regards,
>> Christian.
>>
>>>
>>> - Danilo
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Because one really is the minimum if you want to do work at all, 
>>>>> but as you mentioned above a job limit of one can let the ring run 
>>>>> dry.
>>>>>
>>>>> In the end my proposal comes down to tracking the actual size of a 
>>>>> job rather than just assuming a pre-defined maximum job size, and 
>>>>> hence a dynamic job limit.
>>>>>
>>>>> I don't think this would hurt the scheduler granularity. In fact, 
>>>>> it should even contribute to the desire of not letting the ring 
>>>>> run dry even better. Especially for sequences of small jobs, where 
>>>>> the current implementation might wrongly assume the ring is 
>>>>> already full although actually there would still be enough space 
>>>>> left.
>>>>>
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Otherwise your scheduler might just overwrite the ring buffer 
>>>>>>>> by pushing things to fast.
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Given that, it seems like it would be better to let the 
>>>>>>>>> scheduler keep track of empty ring "slots" instead, such that 
>>>>>>>>> the scheduler can deceide whether a subsequent job will still 
>>>>>>>>> fit on the ring and if not re-evaluate once a previous job 
>>>>>>>>> finished. Of course each submitted job would be required to 
>>>>>>>>> carry the number of slots it requires on the ring.
>>>>>>>>>
>>>>>>>>> What to you think of implementing this as alternative flow 
>>>>>>>>> control mechanism? Implementation wise this could be a union 
>>>>>>>>> with the existing hw_submission_limit.
>>>>>>>>>
>>>>>>>>> - Danilo
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler 
>>>>>>>>>> uses a
>>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a 
>>>>>>>>>> large
>>>>>>>>>> number of drm_gpu_scheduler are used. To work around the 
>>>>>>>>>> scaling issue,
>>>>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>>>>
>>>>>>>>>> v2:
>>>>>>>>>>    - (Rob Clark) Fix msm build
>>>>>>>>>>    - Pass in run work queue
>>>>>>>>>> v3:
>>>>>>>>>>    - (Boris) don't have loop in worker
>>>>>>>>>> v4:
>>>>>>>>>>    - (Tvrtko) break out submit ready, stop, start helpers 
>>>>>>>>>> into own patch
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-17 13:39   ` Christian König
@ 2023-08-17 17:54     ` Matthew Brost
  2023-08-18  5:27       ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-17 17:54 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen,
	Liviu.Dudau, dri-devel, luben.tuikov, lina, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On Thu, Aug 17, 2023 at 03:39:40PM +0200, Christian König wrote:
> Am 11.08.23 um 04:31 schrieb Matthew Brost:
> > Rather than call free_job and run_job in same work item have a dedicated
> > work item for each. This aligns with the design and intended use of work
> > queues.
> 
> I would rather say we should get completely rid of the free_job callback.
> 

Would we still have a work item? e.g. would we still want to call
drm_sched_get_cleanup_job, which removes the job from the pending list
and adjusts the TDR? Trying to figure out what this looks like. We
probably can't do all of this from an IRQ context.
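To spell out the bookkeeping I mean, roughly (a simplified sketch of what
drm_sched_get_cleanup_job does today; locking details and the TDR re-arm are
glossed over, and get_finished_job is just a placeholder name):

static struct drm_sched_job *
get_finished_job(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
		/* drop it from the pending list; the real code also
		 * re-arms the TDR for the next job here */
		list_del_init(&job->list);
	} else {
		job = NULL;
	}
	spin_unlock(&sched->job_list_lock);

	return job;
}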

> Essentially the job is just the container which carries the information
> which is necessary before you push it to the hw. The real representation of
> the submission is actually the scheduler fence.
>

Most free_job callbacks boil down to drm_sched_job_cleanup plus a put on the
job. In Xe this cannot be called from an IRQ context either.
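For example, something along these lines (illustrative only, not Xe's actual
code; struct my_job and its refcounting are made up):

struct my_job {
	struct drm_sched_job base;
	struct kref refcount;
	/* driver private state ... */
};

static void my_job_release(struct kref *ref)
{
	kfree(container_of(ref, struct my_job, refcount));
}

static void my_driver_free_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	drm_sched_job_cleanup(sched_job);		/* release scheduler-side resources */
	kref_put(&job->refcount, my_job_release);	/* final put may free the job */
}

The final put in particular may do more than a plain kfree() in a real driver,
hence the IRQ context concern.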

I'm just confused what exactly you are suggesting here.

Matt

> All the lifetime issues we had came from ignoring this fact and I think we
> should push for fixing this design up again.
> 
> Regards,
> Christian.
> 
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> >   include/drm/gpu_scheduler.h            |   8 +-
> >   2 files changed, 106 insertions(+), 39 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index cede47afc800..b67469eac179 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> >    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> >    *
> >    * @rq: scheduler run queue to check.
> > + * @dequeue: dequeue selected entity
> >    *
> >    * Try to find a ready entity, returns NULL if none found.
> >    */
> >   static struct drm_sched_entity *
> > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> >   {
> >   	struct drm_sched_entity *entity;
> > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >   	if (entity) {
> >   		list_for_each_entry_continue(entity, &rq->entities, list) {
> >   			if (drm_sched_entity_is_ready(entity)) {
> > -				rq->current_entity = entity;
> > -				reinit_completion(&entity->entity_idle);
> > +				if (dequeue) {
> > +					rq->current_entity = entity;
> > +					reinit_completion(&entity->entity_idle);
> > +				}
> >   				spin_unlock(&rq->lock);
> >   				return entity;
> >   			}
> > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >   	list_for_each_entry(entity, &rq->entities, list) {
> >   		if (drm_sched_entity_is_ready(entity)) {
> > -			rq->current_entity = entity;
> > -			reinit_completion(&entity->entity_idle);
> > +			if (dequeue) {
> > +				rq->current_entity = entity;
> > +				reinit_completion(&entity->entity_idle);
> > +			}
> >   			spin_unlock(&rq->lock);
> >   			return entity;
> >   		}
> > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> >    *
> >    * @rq: scheduler run queue to check.
> > + * @dequeue: dequeue selected entity
> >    *
> >    * Find oldest waiting ready entity, returns NULL if none found.
> >    */
> >   static struct drm_sched_entity *
> > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> >   {
> >   	struct rb_node *rb;
> > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> >   		if (drm_sched_entity_is_ready(entity)) {
> > -			rq->current_entity = entity;
> > -			reinit_completion(&entity->entity_idle);
> > +			if (dequeue) {
> > +				rq->current_entity = entity;
> > +				reinit_completion(&entity->entity_idle);
> > +			}
> >   			break;
> >   		}
> >   	}
> > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >   }
> >   /**
> > - * drm_sched_submit_queue - scheduler queue submission
> > + * drm_sched_run_job_queue - queue job submission
> >    * @sched: scheduler instance
> >    */
> > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> >   {
> >   	if (!READ_ONCE(sched->pause_submit))
> > -		queue_work(sched->submit_wq, &sched->work_submit);
> > +		queue_work(sched->submit_wq, &sched->work_run_job);
> > +}
> > +
> > +static struct drm_sched_entity *
> > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > +
> > +/**
> > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > + * @sched: scheduler instance
> > + */
> > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > +{
> > +	if (drm_sched_select_entity(sched, false))
> > +		drm_sched_run_job_queue(sched);
> > +}
> > +
> > +/**
> > + * drm_sched_free_job_queue - queue free job
> > + *
> > + * @sched: scheduler instance to queue free job
> > + */
> > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > +{
> > +	if (!READ_ONCE(sched->pause_submit))
> > +		queue_work(sched->submit_wq, &sched->work_free_job);
> > +}
> > +
> > +/**
> > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > + *
> > + * @sched: scheduler instance to queue free job
> > + */
> > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > +{
> > +	struct drm_sched_job *job;
> > +
> > +	spin_lock(&sched->job_list_lock);
> > +	job = list_first_entry_or_null(&sched->pending_list,
> > +				       struct drm_sched_job, list);
> > +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > +		drm_sched_free_job_queue(sched);
> > +	spin_unlock(&sched->job_list_lock);
> >   }
> >   /**
> > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> >   	dma_fence_get(&s_fence->finished);
> >   	drm_sched_fence_finished(s_fence, result);
> >   	dma_fence_put(&s_fence->finished);
> > -	drm_sched_submit_queue(sched);
> > +	drm_sched_free_job_queue(sched);
> >   }
> >   /**
> > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> >   void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> >   {
> >   	if (drm_sched_can_queue(sched))
> > -		drm_sched_submit_queue(sched);
> > +		drm_sched_run_job_queue(sched);
> >   }
> >   /**
> >    * drm_sched_select_entity - Select next entity to process
> >    *
> >    * @sched: scheduler instance
> > + * @dequeue: dequeue selected entity
> >    *
> >    * Returns the entity to process or NULL if none are found.
> >    */
> >   static struct drm_sched_entity *
> > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> >   {
> >   	struct drm_sched_entity *entity;
> >   	int i;
> > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> >   	/* Kernel run queue has higher priority than normal run queue*/
> >   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >   		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > +							dequeue) :
> > +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > +						      dequeue);
> >   		if (entity)
> >   			break;
> >   	}
> > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> >   EXPORT_SYMBOL(drm_sched_pick_best);
> >   /**
> > - * drm_sched_main - main scheduler thread
> > + * drm_sched_free_job_work - worker to call free_job
> >    *
> > - * @param: scheduler instance
> > + * @w: free job work
> >    */
> > -static void drm_sched_main(struct work_struct *w)
> > +static void drm_sched_free_job_work(struct work_struct *w)
> >   {
> >   	struct drm_gpu_scheduler *sched =
> > -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > -	struct drm_sched_entity *entity;
> > +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> >   	struct drm_sched_job *cleanup_job;
> > -	int r;
> >   	if (READ_ONCE(sched->pause_submit))
> >   		return;
> >   	cleanup_job = drm_sched_get_cleanup_job(sched);
> > -	entity = drm_sched_select_entity(sched);
> > +	if (cleanup_job) {
> > +		sched->ops->free_job(cleanup_job);
> > +
> > +		drm_sched_free_job_queue_if_ready(sched);
> > +		drm_sched_run_job_queue_if_ready(sched);
> > +	}
> > +}
> > -	if (!entity && !cleanup_job)
> > -		return;	/* No more work */
> > +/**
> > + * drm_sched_run_job_work - worker to call run_job
> > + *
> > + * @w: run job work
> > + */
> > +static void drm_sched_run_job_work(struct work_struct *w)
> > +{
> > +	struct drm_gpu_scheduler *sched =
> > +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > +	struct drm_sched_entity *entity;
> > +	int r;
> > -	if (cleanup_job)
> > -		sched->ops->free_job(cleanup_job);
> > +	if (READ_ONCE(sched->pause_submit))
> > +		return;
> > +	entity = drm_sched_select_entity(sched, true);
> >   	if (entity) {
> >   		struct dma_fence *fence;
> >   		struct drm_sched_fence *s_fence;
> > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> >   		sched_job = drm_sched_entity_pop_job(entity);
> >   		if (!sched_job) {
> >   			complete_all(&entity->entity_idle);
> > -			if (!cleanup_job)
> > -				return;	/* No more work */
> > -			goto again;
> > +			return;	/* No more work */
> >   		}
> >   		s_fence = sched_job->s_fence;
> > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> >   		}
> >   		wake_up(&sched->job_scheduled);
> > +		drm_sched_run_job_queue_if_ready(sched);
> >   	}
> > -
> > -again:
> > -	drm_sched_submit_queue(sched);
> >   }
> >   /**
> > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >   	spin_lock_init(&sched->job_list_lock);
> >   	atomic_set(&sched->hw_rq_count, 0);
> >   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > -	INIT_WORK(&sched->work_submit, drm_sched_main);
> > +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> >   	atomic_set(&sched->_score, 0);
> >   	atomic64_set(&sched->job_id_count, 0);
> >   	sched->pause_submit = false;
> > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> >   void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> >   {
> >   	WRITE_ONCE(sched->pause_submit, true);
> > -	cancel_work_sync(&sched->work_submit);
> > +	cancel_work_sync(&sched->work_run_job);
> > +	cancel_work_sync(&sched->work_free_job);
> >   }
> >   EXPORT_SYMBOL(drm_sched_submit_stop);
> > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> >   void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> >   {
> >   	WRITE_ONCE(sched->pause_submit, false);
> > -	queue_work(sched->submit_wq, &sched->work_submit);
> > +	queue_work(sched->submit_wq, &sched->work_run_job);
> > +	queue_work(sched->submit_wq, &sched->work_free_job);
> >   }
> >   EXPORT_SYMBOL(drm_sched_submit_start);
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 04eec2d7635f..fbc083a92757 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> >    *                 finished.
> >    * @hw_rq_count: the number of jobs currently in the hardware queue.
> >    * @job_id_count: used to assign unique id to the each job.
> > - * @submit_wq: workqueue used to queue @work_submit
> > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> >    * @timeout_wq: workqueue used to queue @work_tdr
> > - * @work_submit: schedules jobs and cleans up entities
> > + * @work_run_job: schedules jobs
> > + * @work_free_job: cleans up jobs
> >    * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> >    *            timeout interval is over.
> >    * @pending_list: the list of jobs which are currently in the job queue.
> > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> >   	atomic64_t			job_id_count;
> >   	struct workqueue_struct		*submit_wq;
> >   	struct workqueue_struct		*timeout_wq;
> > -	struct work_struct		work_submit;
> > +	struct work_struct		work_run_job;
> > +	struct work_struct		work_free_job;
> >   	struct delayed_work		work_tdr;
> >   	struct list_head		pending_list;
> >   	spinlock_t			job_list_lock;
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-17 11:13               ` Danilo Krummrich
  2023-08-17 13:35                 ` Christian König
@ 2023-08-18  3:08                 ` Matthew Brost
  2023-08-18  5:40                   ` Christian König
  2023-09-12 14:28                 ` Boris Brezillon
  2 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-18  3:08 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, Christian König, faith.ekstrand

On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote:
> On 8/17/23 07:33, Christian König wrote:
> > Am 16.08.23 um 18:33 schrieb Danilo Krummrich:
> > > On 8/16/23 16:59, Christian König wrote:
> > > > Am 16.08.23 um 14:30 schrieb Danilo Krummrich:
> > > > > On 8/16/23 16:05, Christian König wrote:
> > > > > > Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
> > > > > > > Hi Matt,
> > > > > > > 
> > > > > > > On 8/11/23 04:31, Matthew Brost wrote:
> > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > > > > > mapping between a drm_gpu_scheduler and
> > > > > > > > drm_sched_entity. At first this
> > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > 
> > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > guaranteed to be the same completion even if
> > > > > > > > targeting the same hardware
> > > > > > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > > > > > which allowed to reorder, timeslice, and preempt
> > > > > > > > submissions. If a using
> > > > > > > > shared drm_gpu_scheduler across multiple
> > > > > > > > drm_sched_entity, the TDR falls
> > > > > > > > apart as the TDR expects submission order ==
> > > > > > > > completion order. Using a
> > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
> > > > > > > > 
> > > > > > > > 2. In XE submissions are done via programming a
> > > > > > > > ring buffer (circular
> > > > > > > > buffer), a drm_gpu_scheduler provides a limit on
> > > > > > > > number of jobs, if the
> > > > > > > > limit of number jobs is set to RING_SIZE /
> > > > > > > > MAX_SIZE_PER_JOB we get flow
> > > > > > > > control on the ring for free.
> > > > > > > 
> > > > > > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
> > > > > > > 
> > > > > > > In Nouveau we currently do have such a limitation as
> > > > > > > well, but it is derived from the RING_SIZE, hence
> > > > > > > RING_SIZE / MAX_SIZE_PER_JOB would always be 1.
> > > > > > > However, I think most jobs won't actually utilize
> > > > > > > the whole ring.
> > > > > > 
> > > > > > Well that should probably rather be RING_SIZE /
> > > > > > MAX_SIZE_PER_JOB = hw_submission_limit (or even
> > > > > > hw_submission_limit - 1 when the hw can't distinct full
> > > > > > vs empty ring buffer).
> > > > > 
> > > > > Not sure if I get you right, let me try to clarify what I
> > > > > was trying to say: I wanted to say that in Nouveau
> > > > > MAX_SIZE_PER_JOB isn't really limited by anything other than
> > > > > the RING_SIZE and hence we'd never allow more than 1 active
> > > > > job.
> > > > 
> > > > But that lets the hw run dry between submissions. That is
> > > > usually a pretty horrible idea for performance.
> > > 
> > > Correct, that's the reason why I said it seems to be more efficient
> > > to base ring flow control on the actual size of each incoming job
> > > rather than the maximum size of a job.
> > > 
> > > > 
> > > > > 
> > > > > However, it seems to be more efficient to base ring flow
> > > > > control on the actual size of each incoming job rather than
> > > > > the worst case, namely the maximum size of a job.
> > > > 
> > > > That doesn't sounds like a good idea to me. See we don't limit
> > > > the number of submitted jobs based on the ring size, but rather
> > > > we calculate the ring size based on the number of submitted
> > > > jobs.
> > > > 
> > > 
> > > My point isn't really about whether we derive the ring size from the
> > > job limit or the other way around. It's more about the job size (or
> > > its maximum size) being arbitrary.
> > > 
> > > As mentioned in my reply to Matt:
> > > 
> > > "In Nouveau, userspace can submit an arbitrary amount of addresses
> > > of indirect bufferes containing the ring instructions. The ring on
> > > the kernel side takes the addresses of the indirect buffers rather
> > > than the instructions themself. Hence, technically there isn't
> > > really a limit on the amount of IBs submitted by a job except for
> > > the ring size."
> > > 
> > > So, my point is that I don't really want to limit the job size
> > > artificially just to be able to fit multiple jobs into the ring even
> > > if they're submitted at their "artificial" maximum size, but rather
> > > track how much of the ring the submitted job actually occupies.
> > > 
> > > > In other words the hw_submission_limit defines the ring size,
> > > > not the other way around. And you usually want the
> > > > hw_submission_limit as low as possible for good scheduler
> > > > granularity and to avoid extra overhead.
> > > 
> > > I don't think you really mean "as low as possible", do you?
> > 
> > No, I do mean as low as possible or in other words as few as possible.
> > 
> > Ideally the scheduler would submit only the minimum amount of work to
> > the hardware to keep the hardware busy. >
> > The hardware seems to work mostly the same for all vendors, but you
> > somehow seem to think that filling the ring is somehow beneficial which
> > is really not the case as far as I can see.
> 
> I think that's a misunderstanding. I'm not trying to say that it is *always*
> beneficial to fill up the ring as much as possible. But I think it is under
> certain circumstances, exactly those circumstances I described for Nouveau.
> 
> As mentioned, in Nouveau the size of a job is only really limited by the
> ring size, which means that one job can (but does not necessarily) fill up
> the whole ring. We both agree that this is inefficient, because it
> potentially results in the HW running dry due to hw_submission_limit == 1.
> 
> I recognize you said that one should define hw_submission_limit and adjust
> the other parts of the equation accordingly, the options I see are:
> 
> (1) Increase the ring size while keeping the maximum job size.
> (2) Decrease the maximum job size while keeping the ring size.
> (3) Let the scheduler track the actual job size rather than the maximum job
> size.
> 
> (1) results in potentially wasted ring memory, because we're not always
> reaching the maximum job size, but the scheduler assumes so.
> 
> (2) results in more IOCTLs from userspace for the same amount of IBs, and
> more jobs result in more memory allocations and more work being submitted
> to the workqueue (with Matt's patches).
> 
> (3) doesn't seem to have any of those drawbacks.
> 
> What would be your take on that?
> 
> Actually, if none of the other drivers is interested in a more precise way
> of keeping track of the ring utilization, I'd be totally fine to do it in a
> driver specific way. However, unfortunately I don't see how this would be
> possible.
> 
> My proposal would be to just keep the hw_submission_limit (maybe rename it
> to submission_unit_limit) and add a submission_units field to struct
> drm_sched_job. By default a job's submission_units field would be 0 and the
> scheduler would behave the exact same way as it does now.
> 
> Accordingly, jobs with submission_units > 1 would contribute more than one
> unit to the submission_unit_limit.
> 
> What do you think about that?
> 

This seems reasonable to me and a very minimal change to the scheduler.
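Just to sketch what I understand the accounting to be, with made-up names
(submission_units, submission_unit_limit and units_in_flight are not existing
fields), something like:

static bool sched_units_fit(unsigned int units_in_flight,
			    unsigned int submission_unit_limit,
			    unsigned int job_units)
{
	/* a job that doesn't set submission_units counts as one unit,
	 * which keeps the current hw_submission_limit behavior */
	if (!job_units)
		job_units = 1;

	return units_in_flight + job_units <= submission_unit_limit;
}

i.e. something like drm_sched_can_queue() would compare accumulated units
instead of a plain job count.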

Matt 

> Besides all that, you said that filling up the ring just enough to not let
> the HW run dry rather than filling it up entirely is desirable. Why do you
> think so? I tend to think that in most cases it shouldn't make a difference.
> 
> - Danilo
> 
> > 
> > Regards,
> > Christian.
> > 
> > > Because one really is the minimum if you want to do work at all, but
> > > as you mentioned above a job limit of one can let the ring run dry.
> > > 
> > > In the end my proposal comes down to tracking the actual size of a
> > > job rather than just assuming a pre-defined maximum job size, and
> > > hence a dynamic job limit.
> > > 
> > > I don't think this would hurt the scheduler granularity. In fact, it
> > > should even contribute to the desire of not letting the ring run dry
> > > even better. Especially for sequences of small jobs, where the
> > > current implementation might wrongly assume the ring is already full
> > > although actually there would still be enough space left.
> > > 
> > > > 
> > > > Christian.
> > > > 
> > > > > 
> > > > > > 
> > > > > > Otherwise your scheduler might just overwrite the ring
> > > > > > buffer by pushing things to fast.
> > > > > > 
> > > > > > Christian.
> > > > > > 
> > > > > > > 
> > > > > > > Given that, it seems like it would be better to let
> > > > > > > the scheduler keep track of empty ring "slots"
> > > > > > > instead, such that the scheduler can deceide whether
> > > > > > > a subsequent job will still fit on the ring and if
> > > > > > > not re-evaluate once a previous job finished. Of
> > > > > > > course each submitted job would be required to carry
> > > > > > > the number of slots it requires on the ring.
> > > > > > > 
> > > > > > > What to you think of implementing this as
> > > > > > > alternative flow control mechanism? Implementation
> > > > > > > wise this could be a union with the existing
> > > > > > > hw_submission_limit.
> > > > > > > 
> > > > > > > - Danilo
> > > > > > > 
> > > > > > > > 
> > > > > > > > A problem with this design is currently a drm_gpu_scheduler uses a
> > > > > > > > kthread for submission / job cleanup. This doesn't scale if a large
> > > > > > > > number of drm_gpu_scheduler are used. To work
> > > > > > > > around the scaling issue,
> > > > > > > > use a worker rather than kthread for submission / job cleanup.
> > > > > > > > 
> > > > > > > > v2:
> > > > > > > >    - (Rob Clark) Fix msm build
> > > > > > > >    - Pass in run work queue
> > > > > > > > v3:
> > > > > > > >    - (Boris) don't have loop in worker
> > > > > > > > v4:
> > > > > > > >    - (Tvrtko) break out submit ready, stop,
> > > > > > > > start helpers into own patch
> > > > > > > > 
> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-17 17:54     ` Matthew Brost
@ 2023-08-18  5:27       ` Christian König
  2023-08-18 13:13         ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-18  5:27 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen,
	Liviu.Dudau, dri-devel, luben.tuikov, lina, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 17.08.23 um 19:54 schrieb Matthew Brost:
> On Thu, Aug 17, 2023 at 03:39:40PM +0200, Christian König wrote:
>> Am 11.08.23 um 04:31 schrieb Matthew Brost:
>>> Rather than call free_job and run_job in same work item have a dedicated
>>> work item for each. This aligns with the design and intended use of work
>>> queues.
>> I would rather say we should get completely rid of the free_job callback.
>>
> Would we still have a work item? e.g. would we still want to call
> drm_sched_get_cleanup_job, which removes the job from the pending list
> and adjusts the TDR? Trying to figure out what this looks like. We
> probably can't do all of this from an IRQ context.
>
>> Essentially the job is just the container which carries the information
>> which is necessary before you push it to the hw. The real representation of
>> the submission is actually the scheduler fence.
>>
> Most free_job callbacks boil down to drm_sched_job_cleanup plus a put on the
> job. In Xe this cannot be called from an IRQ context either.
>
> I'm just confused what exactly you are suggesting here.

To summarize in one sentence: instead of the job we keep the scheduler 
and hardware fences around after pushing the job to the hw.

The free_job callback would then be replaced by dropping the reference 
on the scheduler and hw fence.
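Very roughly, and only to illustrate the direction (sched_retire_submission is
not an existing function):

/* Instead of ops->free_job(job) the scheduler would only keep references
 * on the fences of a submission and drop them once the hw fence signaled. */
static void sched_retire_submission(struct drm_sched_fence *s_fence,
				    struct dma_fence *hw_fence)
{
	dma_fence_put(hw_fence);		/* reference on the hw fence */
	dma_fence_put(&s_fence->finished);	/* reference on the scheduler fence */
}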

Would that work for you?

Christian.

>
> Matt
>
>> All the lifetime issues we had came from ignoring this fact and I think we
>> should push for fixing this design up again.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>>>    include/drm/gpu_scheduler.h            |   8 +-
>>>    2 files changed, 106 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index cede47afc800..b67469eac179 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>     * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>     *
>>>     * @rq: scheduler run queue to check.
>>> + * @dequeue: dequeue selected entity
>>>     *
>>>     * Try to find a ready entity, returns NULL if none found.
>>>     */
>>>    static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>>>    {
>>>    	struct drm_sched_entity *entity;
>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>    	if (entity) {
>>>    		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>    			if (drm_sched_entity_is_ready(entity)) {
>>> -				rq->current_entity = entity;
>>> -				reinit_completion(&entity->entity_idle);
>>> +				if (dequeue) {
>>> +					rq->current_entity = entity;
>>> +					reinit_completion(&entity->entity_idle);
>>> +				}
>>>    				spin_unlock(&rq->lock);
>>>    				return entity;
>>>    			}
>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>    	list_for_each_entry(entity, &rq->entities, list) {
>>>    		if (drm_sched_entity_is_ready(entity)) {
>>> -			rq->current_entity = entity;
>>> -			reinit_completion(&entity->entity_idle);
>>> +			if (dequeue) {
>>> +				rq->current_entity = entity;
>>> +				reinit_completion(&entity->entity_idle);
>>> +			}
>>>    			spin_unlock(&rq->lock);
>>>    			return entity;
>>>    		}
>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>     * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>     *
>>>     * @rq: scheduler run queue to check.
>>> + * @dequeue: dequeue selected entity
>>>     *
>>>     * Find oldest waiting ready entity, returns NULL if none found.
>>>     */
>>>    static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>>>    {
>>>    	struct rb_node *rb;
>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>    		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>    		if (drm_sched_entity_is_ready(entity)) {
>>> -			rq->current_entity = entity;
>>> -			reinit_completion(&entity->entity_idle);
>>> +			if (dequeue) {
>>> +				rq->current_entity = entity;
>>> +				reinit_completion(&entity->entity_idle);
>>> +			}
>>>    			break;
>>>    		}
>>>    	}
>>> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>    }
>>>    /**
>>> - * drm_sched_submit_queue - scheduler queue submission
>>> + * drm_sched_run_job_queue - queue job submission
>>>     * @sched: scheduler instance
>>>     */
>>> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>    {
>>>    	if (!READ_ONCE(sched->pause_submit))
>>> -		queue_work(sched->submit_wq, &sched->work_submit);
>>> +		queue_work(sched->submit_wq, &sched->work_run_job);
>>> +}
>>> +
>>> +static struct drm_sched_entity *
>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
>>> +
>>> +/**
>>> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
>>> + * @sched: scheduler instance
>>> + */
>>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	if (drm_sched_select_entity(sched, false))
>>> +		drm_sched_run_job_queue(sched);
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_free_job_queue - queue free job
>>> + *
>>> + * @sched: scheduler instance to queue free job
>>> + */
>>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	if (!READ_ONCE(sched->pause_submit))
>>> +		queue_work(sched->submit_wq, &sched->work_free_job);
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_free_job_queue_if_ready - queue free job if ready
>>> + *
>>> + * @sched: scheduler instance to queue free job
>>> + */
>>> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	struct drm_sched_job *job;
>>> +
>>> +	spin_lock(&sched->job_list_lock);
>>> +	job = list_first_entry_or_null(&sched->pending_list,
>>> +				       struct drm_sched_job, list);
>>> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>> +		drm_sched_free_job_queue(sched);
>>> +	spin_unlock(&sched->job_list_lock);
>>>    }
>>>    /**
>>> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>    	dma_fence_get(&s_fence->finished);
>>>    	drm_sched_fence_finished(s_fence, result);
>>>    	dma_fence_put(&s_fence->finished);
>>> -	drm_sched_submit_queue(sched);
>>> +	drm_sched_free_job_queue(sched);
>>>    }
>>>    /**
>>> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>    void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>>    {
>>>    	if (drm_sched_can_queue(sched))
>>> -		drm_sched_submit_queue(sched);
>>> +		drm_sched_run_job_queue(sched);
>>>    }
>>>    /**
>>>     * drm_sched_select_entity - Select next entity to process
>>>     *
>>>     * @sched: scheduler instance
>>> + * @dequeue: dequeue selected entity
>>>     *
>>>     * Returns the entity to process or NULL if none are found.
>>>     */
>>>    static struct drm_sched_entity *
>>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>>>    {
>>>    	struct drm_sched_entity *entity;
>>>    	int i;
>>> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>    	/* Kernel run queue has higher priority than normal run queue*/
>>>    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>    		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
>>> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
>>> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
>>> +							dequeue) :
>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
>>> +						      dequeue);
>>>    		if (entity)
>>>    			break;
>>>    	}
>>> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>>>    EXPORT_SYMBOL(drm_sched_pick_best);
>>>    /**
>>> - * drm_sched_main - main scheduler thread
>>> + * drm_sched_free_job_work - worker to call free_job
>>>     *
>>> - * @param: scheduler instance
>>> + * @w: free job work
>>>     */
>>> -static void drm_sched_main(struct work_struct *w)
>>> +static void drm_sched_free_job_work(struct work_struct *w)
>>>    {
>>>    	struct drm_gpu_scheduler *sched =
>>> -		container_of(w, struct drm_gpu_scheduler, work_submit);
>>> -	struct drm_sched_entity *entity;
>>> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
>>>    	struct drm_sched_job *cleanup_job;
>>> -	int r;
>>>    	if (READ_ONCE(sched->pause_submit))
>>>    		return;
>>>    	cleanup_job = drm_sched_get_cleanup_job(sched);
>>> -	entity = drm_sched_select_entity(sched);
>>> +	if (cleanup_job) {
>>> +		sched->ops->free_job(cleanup_job);
>>> +
>>> +		drm_sched_free_job_queue_if_ready(sched);
>>> +		drm_sched_run_job_queue_if_ready(sched);
>>> +	}
>>> +}
>>> -	if (!entity && !cleanup_job)
>>> -		return;	/* No more work */
>>> +/**
>>> + * drm_sched_run_job_work - worker to call run_job
>>> + *
>>> + * @w: run job work
>>> + */
>>> +static void drm_sched_run_job_work(struct work_struct *w)
>>> +{
>>> +	struct drm_gpu_scheduler *sched =
>>> +		container_of(w, struct drm_gpu_scheduler, work_run_job);
>>> +	struct drm_sched_entity *entity;
>>> +	int r;
>>> -	if (cleanup_job)
>>> -		sched->ops->free_job(cleanup_job);
>>> +	if (READ_ONCE(sched->pause_submit))
>>> +		return;
>>> +	entity = drm_sched_select_entity(sched, true);
>>>    	if (entity) {
>>>    		struct dma_fence *fence;
>>>    		struct drm_sched_fence *s_fence;
>>> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>>>    		sched_job = drm_sched_entity_pop_job(entity);
>>>    		if (!sched_job) {
>>>    			complete_all(&entity->entity_idle);
>>> -			if (!cleanup_job)
>>> -				return;	/* No more work */
>>> -			goto again;
>>> +			return;	/* No more work */
>>>    		}
>>>    		s_fence = sched_job->s_fence;
>>> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>>>    		}
>>>    		wake_up(&sched->job_scheduled);
>>> +		drm_sched_run_job_queue_if_ready(sched);
>>>    	}
>>> -
>>> -again:
>>> -	drm_sched_submit_queue(sched);
>>>    }
>>>    /**
>>> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>    	spin_lock_init(&sched->job_list_lock);
>>>    	atomic_set(&sched->hw_rq_count, 0);
>>>    	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>> -	INIT_WORK(&sched->work_submit, drm_sched_main);
>>> +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>> +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>>    	atomic_set(&sched->_score, 0);
>>>    	atomic64_set(&sched->job_id_count, 0);
>>>    	sched->pause_submit = false;
>>> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>>>    void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>>>    {
>>>    	WRITE_ONCE(sched->pause_submit, true);
>>> -	cancel_work_sync(&sched->work_submit);
>>> +	cancel_work_sync(&sched->work_run_job);
>>> +	cancel_work_sync(&sched->work_free_job);
>>>    }
>>>    EXPORT_SYMBOL(drm_sched_submit_stop);
>>> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>>>    void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>>>    {
>>>    	WRITE_ONCE(sched->pause_submit, false);
>>> -	queue_work(sched->submit_wq, &sched->work_submit);
>>> +	queue_work(sched->submit_wq, &sched->work_run_job);
>>> +	queue_work(sched->submit_wq, &sched->work_free_job);
>>>    }
>>>    EXPORT_SYMBOL(drm_sched_submit_start);
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 04eec2d7635f..fbc083a92757 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>>>     *                 finished.
>>>     * @hw_rq_count: the number of jobs currently in the hardware queue.
>>>     * @job_id_count: used to assign unique id to the each job.
>>> - * @submit_wq: workqueue used to queue @work_submit
>>> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>>     * @timeout_wq: workqueue used to queue @work_tdr
>>> - * @work_submit: schedules jobs and cleans up entities
>>> + * @work_run_job: schedules jobs
>>> + * @work_free_job: cleans up jobs
>>>     * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>>>     *            timeout interval is over.
>>>     * @pending_list: the list of jobs which are currently in the job queue.
>>> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>>>    	atomic64_t			job_id_count;
>>>    	struct workqueue_struct		*submit_wq;
>>>    	struct workqueue_struct		*timeout_wq;
>>> -	struct work_struct		work_submit;
>>> +	struct work_struct		work_run_job;
>>> +	struct work_struct		work_free_job;
>>>    	struct delayed_work		work_tdr;
>>>    	struct list_head		pending_list;
>>>    	spinlock_t			job_list_lock;


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-18  3:08                 ` Matthew Brost
@ 2023-08-18  5:40                   ` Christian König
  2023-08-18 12:49                     ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-18  5:40 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 18.08.23 um 05:08 schrieb Matthew Brost:
> On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote:
>> On 8/17/23 07:33, Christian König wrote:
>>> Am 16.08.23 um 18:33 schrieb Danilo Krummrich:
>>>> On 8/16/23 16:59, Christian König wrote:
>>>>> Am 16.08.23 um 14:30 schrieb Danilo Krummrich:
>>>>>> On 8/16/23 16:05, Christian König wrote:
>>>>>>> Am 16.08.23 um 13:30 schrieb Danilo Krummrich:
>>>>>>>> Hi Matt,
>>>>>>>>
>>>>>>>> On 8/11/23 04:31, Matthew Brost wrote:
>>>>>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>>>>>>>>> mapping between a drm_gpu_scheduler and
>>>>>>>>> drm_sched_entity. At first this
>>>>>>>>> seems a bit odd but let us explain the reasoning below.
>>>>>>>>>
>>>>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>>>>>>> guaranteed to be the same completion even if
>>>>>>>>> targeting the same hardware
>>>>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>>>>>>>> which allowed to reorder, timeslice, and preempt
>>>>>>>>> submissions. If a using
>>>>>>>>> shared drm_gpu_scheduler across multiple
>>>>>>>>> drm_sched_entity, the TDR falls
>>>>>>>>> apart as the TDR expects submission order ==
>>>>>>>>> completion order. Using a
>>>>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>>>>>>>>
>>>>>>>>> 2. In XE submissions are done via programming a
>>>>>>>>> ring buffer (circular
>>>>>>>>> buffer), a drm_gpu_scheduler provides a limit on
>>>>>>>>> number of jobs, if the
>>>>>>>>> limit of number jobs is set to RING_SIZE /
>>>>>>>>> MAX_SIZE_PER_JOB we get flow
>>>>>>>>> control on the ring for free.
>>>>>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>>>>>>>>
>>>>>>>> In Nouveau we currently do have such a limitation as
>>>>>>>> well, but it is derived from the RING_SIZE, hence
>>>>>>>> RING_SIZE / MAX_SIZE_PER_JOB would always be 1.
>>>>>>>> However, I think most jobs won't actually utilize
>>>>>>>> the whole ring.
>>>>>>> Well that should probably rather be RING_SIZE /
>>>>>>> MAX_SIZE_PER_JOB = hw_submission_limit (or even
>>>>>>> hw_submission_limit - 1 when the hw can't distinct full
>>>>>>> vs empty ring buffer).
>>>>>> Not sure if I get you right, let me try to clarify what I
>>>>>> was trying to say: I wanted to say that in Nouveau
>>>>>> MAX_SIZE_PER_JOB isn't really limited by anything other than
>>>>>> the RING_SIZE and hence we'd never allow more than 1 active
>>>>>> job.
>>>>> But that lets the hw run dry between submissions. That is
>>>>> usually a pretty horrible idea for performance.
>>>> Correct, that's the reason why I said it seems to be more efficient
>>>> to base ring flow control on the actual size of each incoming job
>>>> rather than the maximum size of a job.
>>>>
>>>>>> However, it seems to be more efficient to base ring flow
>>>>>> control on the actual size of each incoming job rather than
>>>>>> the worst case, namely the maximum size of a job.
>>>>> That doesn't sounds like a good idea to me. See we don't limit
>>>>> the number of submitted jobs based on the ring size, but rather
>>>>> we calculate the ring size based on the number of submitted
>>>>> jobs.
>>>>>
>>>> My point isn't really about whether we derive the ring size from the
>>>> job limit or the other way around. It's more about the job size (or
>>>> its maximum size) being arbitrary.
>>>>
>>>> As mentioned in my reply to Matt:
>>>>
>>>> "In Nouveau, userspace can submit an arbitrary amount of addresses
>>>> of indirect bufferes containing the ring instructions. The ring on
>>>> the kernel side takes the addresses of the indirect buffers rather
>>>> than the instructions themself. Hence, technically there isn't
>>>> really a limit on the amount of IBs submitted by a job except for
>>>> the ring size."
>>>>
>>>> So, my point is that I don't really want to limit the job size
>>>> artificially just to be able to fit multiple jobs into the ring even
>>>> if they're submitted at their "artificial" maximum size, but rather
>>>> track how much of the ring the submitted job actually occupies.
>>>>
>>>>> In other words the hw_submission_limit defines the ring size,
>>>>> not the other way around. And you usually want the
>>>>> hw_submission_limit as low as possible for good scheduler
>>>>> granularity and to avoid extra overhead.
>>>> I don't think you really mean "as low as possible", do you?
>>> No, I do mean as low as possible or in other words as few as possible.
>>>
>>> Ideally the scheduler would submit only the minimum amount of work to
>>> the hardware to keep the hardware busy. >
>>> The hardware seems to work mostly the same for all vendors, but you
>>> somehow seem to think that filling the ring is somehow beneficial which
>>> is really not the case as far as I can see.
>> I think that's a misunderstanding. I'm not trying to say that it is *always*
>> beneficial to fill up the ring as much as possible. But I think it is under
>> certain circumstances, exactly those circumstances I described for Nouveau.
>>
>> As mentioned, in Nouveau the size of a job is only really limited by the
>> ring size, which means that one job can (but does not necessarily) fill up
>> the whole ring. We both agree that this is inefficient, because it
>> potentially results in the HW running dry due to hw_submission_limit == 1.
>>
>> I recognize you said that one should define hw_submission_limit and adjust
>> the other parts of the equation accordingly, the options I see are:
>>
>> (1) Increase the ring size while keeping the maximum job size.
>> (2) Decrease the maximum job size while keeping the ring size.
>> (3) Let the scheduler track the actual job size rather than the maximum job
>> size.
>>
>> (1) results in potentially wasted ring memory, because we're not always
>> reaching the maximum job size, but the scheduler assumes so.
>>
>> (2) results in more IOCTLs from userspace for the same amount of IBs, and
>> more jobs result in more memory allocations and more work being submitted
>> to the workqueue (with Matt's patches).
>>
>> (3) doesn't seem to have any of those drawbacks.
>>
>> What would be your take on that?
>>
>> Actually, if none of the other drivers is interested in a more precise way
>> of keeping track of the ring utilization, I'd be totally fine to do it in a
>> driver specific way. However, unfortunately I don't see how this would be
>> possible.
>>
>> My proposal would be to just keep the hw_submission_limit (maybe rename it
>> to submission_unit_limit) and add a submission_units field to struct
>> drm_sched_job. By default a job's submission_units field would be 0 and the
>> scheduler would behave the exact same way as it does now.
>>
>> Accordingly, jobs with submission_units > 1 would contribute more than one
>> unit to the submission_unit_limit.
>>
>> What do you think about that?
>>
> This seems reasonable to me and a very minimal change to the scheduler.

If you have a good use case for this then the approach sounds sane to me 
as well.

My question is rather how exactly does Nouveau come to have this use 
case? Allowing the full ring size in the UAPI sounds a bit questionable.

Christian.

>
> Matt
>
>> Besides all that, you said that filling up the ring just enough to not let
>> the HW run dry rather than filling it up entirely is desirable. Why do you
>> think so? I tend to think that in most cases it shouldn't make a difference.
>>
>> - Danilo
>>
>>> Regards,
>>> Christian.
>>>
>>>> Because one really is the minimum if you want to do work at all, but
>>>> as you mentioned above a job limit of one can let the ring run dry.
>>>>
>>>> In the end my proposal comes down to tracking the actual size of a
>>>> job rather than just assuming a pre-defined maximum job size, and
>>>> hence a dynamic job limit.
>>>>
>>>> I don't think this would hurt the scheduler granularity. In fact, it
>>>> should even contribute to the desire of not letting the ring run dry
>>>> even better. Especially for sequences of small jobs, where the
>>>> current implementation might wrongly assume the ring is already full
>>>> although actually there would still be enough space left.
>>>>
>>>>> Christian.
>>>>>
>>>>>>> Otherwise your scheduler might just overwrite the ring
>>>>>>> buffer by pushing things to fast.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Given that, it seems like it would be better to let
>>>>>>>> the scheduler keep track of empty ring "slots"
>>>>>>>> instead, such that the scheduler can deceide whether
>>>>>>>> a subsequent job will still fit on the ring and if
>>>>>>>> not re-evaluate once a previous job finished. Of
>>>>>>>> course each submitted job would be required to carry
>>>>>>>> the number of slots it requires on the ring.
>>>>>>>>
>>>>>>>> What to you think of implementing this as
>>>>>>>> alternative flow control mechanism? Implementation
>>>>>>>> wise this could be a union with the existing
>>>>>>>> hw_submission_limit.
>>>>>>>>
>>>>>>>> - Danilo
>>>>>>>>
>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a large
>>>>>>>>> number of drm_gpu_scheduler are used. To work
>>>>>>>>> around the scaling issue,
>>>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>>     - (Rob Clark) Fix msm build
>>>>>>>>>     - Pass in run work queue
>>>>>>>>> v3:
>>>>>>>>>     - (Boris) don't have loop in worker
>>>>>>>>> v4:
>>>>>>>>>     - (Tvrtko) break out submit ready, stop,
>>>>>>>>> start helpers into own patch
>>>>>>>>>
>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-17 16:17                     ` Christian König
@ 2023-08-18 11:58                       ` Danilo Krummrich
  2023-08-21 14:07                         ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-18 11:58 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 8/17/23 18:17, Christian König wrote:
> On 17.08.23 at 14:48, Danilo Krummrich wrote:
>> On 8/17/23 15:35, Christian König wrote:
>>> On 17.08.23 at 13:13, Danilo Krummrich wrote:
>>>> On 8/17/23 07:33, Christian König wrote:
>>>>> [SNIP]
>>>>> My proposal would be to just keep the hw_submission_limit (maybe 
>>>>> rename it to submission_unit_limit) and add a submission_units 
>>>>> field to struct drm_sched_job. By default a jobs submission_units 
>>>>> field would be 0 and the scheduler would behave the exact same way 
>>>>> as it does now.
>>>>>
>>>>> Accordingly, jobs with submission_units > 1 would contribute more 
>>>>> than one unit to the submission_unit_limit.
>>>>>
>>>>> What do you think about that?
>>>
>>> I think you are approaching this from the completely wrong side.
>>
>> First of all, thanks for keeping up the discussion - I appreciate it. 
>> Some more comments / questions below.
>>
>>>
>>> See the UAPI needs to be stable, so you need a maximum job size 
>>> otherwise it can happen that a combination of large and small 
>>> submissions work while a different combination doesn't.
>>
>> How is this related to the uAPI being stable? What do you mean by 
>> 'stable' in this context?
> 
> Stable as in you don't get inconsistent behavior, not stable in the 
> sense of backward compatibility. Sorry for the confusing wording :)
> 
>>
>> The Nouveau uAPI allows userspace to pass EXEC jobs by supplying the 
>> ring ID (channel), in-/out-syncs and a certain amount of indirect push 
>> buffers. The amount of IBs per job is limited by the amount of IBs 
>> fitting into the ring. Just to be clear, when I say 'job size' I mean 
>> the amount of IBs per job.
> 
> Well that more or less sounds identical to all other hardware I know of, 
> e.g. AMD, Intel and the different ARM chips seem to all work like this. 
> But on those drivers the job size limit is not the ring size, but rather 
> a fixed value (at least as far as I know).
> 
>>
>> Maybe I should also mention that the rings we are talking about are 
>> software rings managed by a firmware scheduler. We can have an 
>> arbitrary amount of software rings and even multiple ones per FD.
>>
>> Given a constant ring size I really don't see why I should limit the 
>> maximum amount of IBs userspace can push per job just to end up with a 
>> hw_submission_limit > 1.
>>
>> For example, let's just assume the ring can take 128 IBs, why would I 
>> limit userspace to submit just e.g. 16 IBs at a time, such that the 
>> hw_submission_limit becomes 8?
> 
> Well the question is what happens when you have two submissions back to 
> back which use more than half of the ring buffer?
> 
> I only see two possible outcomes:
> 1. You return an -EBUSY (or similar) error code indicating that the hw can't 
> receive more commands.
> 2. Wait on previously pushed commands to be executed.
> (3. Your driver crashes because you accidentally overwrite stuff in the 
> ring buffer which is still being executed. I just assume that's prevented).
> 
> Resolution #1 with -EBUSY is actually something the UAPI should not do, 
> because your UAPI then depends on the specific timing of submissions 
> which is a really bad idea.
> 
> Resolution #2 is usually bad because it forces the hw to run dry between 
> submissions and so degrades performance.

I agree, that is a good reason for at least limiting the maximum job 
size to half of the ring size.

However, there could still be cases where two subsequent jobs are 
submitted with just a single IB each, which as-is would still block 
further jobs from being pushed to the ring although there is still plenty 
of space. Depending on the (CPU) scheduler latency, such a case can let 
the HW run dry as well.

Surely, we could just continue to decrease the maximum job size even 
further, but this would result in further overhead in userspace and the 
kernel for larger IB counts. Tracking the actual job size seems to be the 
better solution for drivers where the job size can vary over a rather 
huge range.
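
To make that a bit more concrete, here is a minimal sketch of the 
accounting I have in mind (submission_units, submission_unit_limit and 
units_in_flight are purely illustrative names, not existing scheduler 
fields; such a check would conceptually take the place of the current 
hw_rq_count vs. hw_submission_limit comparison):

  /* Illustrative sketch only, not actual patch code. A job with
   * submission_units == 0 counts as a single unit, so existing drivers
   * keep today's behaviour unchanged. */
  static u32 drm_sched_job_units(struct drm_sched_job *job)
  {
          return job->submission_units ?: 1;
  }

  static bool drm_sched_units_available(struct drm_gpu_scheduler *sched,
                                        struct drm_sched_job *job)
  {
          return atomic_read(&sched->units_in_flight) +
                 drm_sched_job_units(job) <= sched->submission_unit_limit;
  }

The units would be added when a job is pushed to the hardware and 
subtracted again once its hardware fence signals.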

- Danilo

> 
>>
>> What is the advantage of doing that, rather than letting userspace 
>> submit *up to* 128 IBs per job and just letting the scheduler push IBs 
>> to the ring as long as there's actually space left on the ring?
> 
> Predictable behavior I think. Basically you want to organize things so that 
> the hw is at least kept busy all the time without depending on actual 
> timing.
> 
>>
>>>
>>> So what you usually do, and this is driver independent because simply 
>>> a requirement of the UAPI, is that you say here that's my maximum job 
>>> size as well as the number of submission which should be pushed to 
>>> the hw at the same time. And then get the resulting ring size by the 
>>> product of the two.
>>
>> Given the above, how is that a requirement of the uAPI?
> 
> The requirement of the UAPI is actually pretty simple: You should get 
> consistent results, independent of the timing (at least as long as you 
> don't do stuff in parallel).
> 
> Otherwise you can run into issues when on a certain configuration stuff 
> suddenly runs faster or slower than expected. In other words you should 
> not depend on stuff finishing in a certain amount of time.
> 
>>
>>>
>>> That the ring in this use case can't be fully utilized is not a draw 
>>> back, this is completely intentional design which should apply to all 
>>> drivers independent of the vendor.
>>
>> Why wouldn't we want to fully utilize the ring size?
> 
> As far as I know everybody restricts the submission size to something 
> fixed which is at least smaller than half the ring size to avoid the 
> problems mentioned above.
> 
> Regards,
> Christian.
> 
>>
>> - Danilo
>>
>>>
>>>>
>>>> Besides all that, you said that filling up the ring just enough to 
>>>> not let the HW run dry rather than filling it up entirely is 
>>>> desirable. Why do you think so? I tend to think that in most cases 
>>>> it shouldn't make difference.
>>>
>>> That results in better scheduling behavior. It's mostly beneficial if 
>>> you don't have a hw scheduler, but as far as I can see there is no 
>>> need to pump everything to the hw as fast as possible.
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> - Danilo
>>>>
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Because one really is the minimum if you want to do work at all, 
>>>>>> but as you mentioned above a job limit of one can let the ring run 
>>>>>> dry.
>>>>>>
>>>>>> In the end my proposal comes down to tracking the actual size of a 
>>>>>> job rather than just assuming a pre-defined maximum job size, and 
>>>>>> hence a dynamic job limit.
>>>>>>
>>>>>> I don't think this would hurt the scheduler granularity. In fact, 
>>>>>> it should even contribute to the desire of not letting the ring 
>>>>>> run dry even better. Especially for sequences of small jobs, where 
>>>>>> the current implementation might wrongly assume the ring is 
>>>>>> already full although actually there would still be enough space 
>>>>>> left.
>>>>>>
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Otherwise your scheduler might just overwrite the ring buffer 
>>>>>>>>> by pushing things to fast.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Given that, it seems like it would be better to let the 
>>>>>>>>>> scheduler keep track of empty ring "slots" instead, such that 
>>>>>>>>>> the scheduler can deceide whether a subsequent job will still 
>>>>>>>>>> fit on the ring and if not re-evaluate once a previous job 
>>>>>>>>>> finished. Of course each submitted job would be required to 
>>>>>>>>>> carry the number of slots it requires on the ring.
>>>>>>>>>>
>>>>>>>>>> What to you think of implementing this as alternative flow 
>>>>>>>>>> control mechanism? Implementation wise this could be a union 
>>>>>>>>>> with the existing hw_submission_limit.
>>>>>>>>>>
>>>>>>>>>> - Danilo
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler 
>>>>>>>>>>> uses a
>>>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a 
>>>>>>>>>>> large
>>>>>>>>>>> number of drm_gpu_scheduler are used. To work around the 
>>>>>>>>>>> scaling issue,
>>>>>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>>>>>
>>>>>>>>>>> v2:
>>>>>>>>>>>    - (Rob Clark) Fix msm build
>>>>>>>>>>>    - Pass in run work queue
>>>>>>>>>>> v3:
>>>>>>>>>>>    - (Boris) don't have loop in worker
>>>>>>>>>>> v4:
>>>>>>>>>>>    - (Tvrtko) break out submit ready, stop, start helpers 
>>>>>>>>>>> into own patch
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-18 12:49                     ` Matthew Brost
@ 2023-08-18 12:06                       ` Danilo Krummrich
  0 siblings, 0 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-18 12:06 UTC (permalink / raw)
  To: Matthew Brost, Christian König
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 8/18/23 14:49, Matthew Brost wrote:
> On Fri, Aug 18, 2023 at 07:40:41AM +0200, Christian König wrote:
>> On 18.08.23 at 05:08, Matthew Brost wrote:
>>> On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote:
>>>> On 8/17/23 07:33, Christian König wrote:
>>>>> On 16.08.23 at 18:33, Danilo Krummrich wrote:
>>>>>> On 8/16/23 16:59, Christian König wrote:
>>>>>>> On 16.08.23 at 14:30, Danilo Krummrich wrote:
>>>>>>>> On 8/16/23 16:05, Christian König wrote:
>>>>>>>>> On 16.08.23 at 13:30, Danilo Krummrich wrote:
>>>>>>>>>> Hi Matt,
>>>>>>>>>>
>>>>>>>>>> On 8/11/23 04:31, Matthew Brost wrote:
>>>>>>>>>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
>>>>>>>>>>> mapping between a drm_gpu_scheduler and
>>>>>>>>>>> drm_sched_entity. At first this
>>>>>>>>>>> seems a bit odd but let us explain the reasoning below.
>>>>>>>>>>>
>>>>>>>>>>> 1. In XE the submission order from multiple drm_sched_entity is not
>>>>>>>>>>> guaranteed to be the same completion even if
>>>>>>>>>>> targeting the same hardware
>>>>>>>>>>> engine. This is because in XE we have a firmware scheduler, the GuC,
>>>>>>>>>>> which allowed to reorder, timeslice, and preempt
>>>>>>>>>>> submissions. If a using
>>>>>>>>>>> shared drm_gpu_scheduler across multiple
>>>>>>>>>>> drm_sched_entity, the TDR falls
>>>>>>>>>>> apart as the TDR expects submission order ==
>>>>>>>>>>> completion order. Using a
>>>>>>>>>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
>>>>>>>>>>>
>>>>>>>>>>> 2. In XE submissions are done via programming a
>>>>>>>>>>> ring buffer (circular
>>>>>>>>>>> buffer), a drm_gpu_scheduler provides a limit on
>>>>>>>>>>> number of jobs, if the
>>>>>>>>>>> limit of number jobs is set to RING_SIZE /
>>>>>>>>>>> MAX_SIZE_PER_JOB we get flow
>>>>>>>>>>> control on the ring for free.
>>>>>>>>>> In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
>>>>>>>>>>
>>>>>>>>>> In Nouveau we currently do have such a limitation as
>>>>>>>>>> well, but it is derived from the RING_SIZE, hence
>>>>>>>>>> RING_SIZE / MAX_SIZE_PER_JOB would always be 1.
>>>>>>>>>> However, I think most jobs won't actually utilize
>>>>>>>>>> the whole ring.
>>>>>>>>> Well that should probably rather be RING_SIZE /
>>>>>>>>> MAX_SIZE_PER_JOB = hw_submission_limit (or even
>>>>>>>>> hw_submission_limit - 1 when the hw can't distinct full
>>>>>>>>> vs empty ring buffer).
>>>>>>>> Not sure if I get you right, let me try to clarify what I
>>>>>>>> was trying to say: I wanted to say that in Nouveau
>>>>>>>> MAX_SIZE_PER_JOB isn't really limited by anything other than
>>>>>>>> the RING_SIZE and hence we'd never allow more than 1 active
>>>>>>>> job.
>>>>>>> But that lets the hw run dry between submissions. That is
>>>>>>> usually a pretty horrible idea for performance.
>>>>>> Correct, that's the reason why I said it seems to be more efficient
>>>>>> to base ring flow control on the actual size of each incoming job
>>>>>> rather than the maximum size of a job.
>>>>>>
>>>>>>>> However, it seems to be more efficient to base ring flow
>>>>>>>> control on the actual size of each incoming job rather than
>>>>>>>> the worst case, namely the maximum size of a job.
>>>>>>> That doesn't sounds like a good idea to me. See we don't limit
>>>>>>> the number of submitted jobs based on the ring size, but rather
>>>>>>> we calculate the ring size based on the number of submitted
>>>>>>> jobs.
>>>>>>>
>>>>>> My point isn't really about whether we derive the ring size from the
>>>>>> job limit or the other way around. It's more about the job size (or
>>>>>> its maximum size) being arbitrary.
>>>>>>
>>>>>> As mentioned in my reply to Matt:
>>>>>>
>>>>>> "In Nouveau, userspace can submit an arbitrary amount of addresses
>>>>>> of indirect bufferes containing the ring instructions. The ring on
>>>>>> the kernel side takes the addresses of the indirect buffers rather
>>>>>> than the instructions themself. Hence, technically there isn't
>>>>>> really a limit on the amount of IBs submitted by a job except for
>>>>>> the ring size."
>>>>>>
>>>>>> So, my point is that I don't really want to limit the job size
>>>>>> artificially just to be able to fit multiple jobs into the ring even
>>>>>> if they're submitted at their "artificial" maximum size, but rather
>>>>>> track how much of the ring the submitted job actually occupies.
>>>>>>
>>>>>>> In other words the hw_submission_limit defines the ring size,
>>>>>>> not the other way around. And you usually want the
>>>>>>> hw_submission_limit as low as possible for good scheduler
>>>>>>> granularity and to avoid extra overhead.
>>>>>> I don't think you really mean "as low as possible", do you?
>>>>> No, I do mean as low as possible or in other words as few as possible.
>>>>>
>>>>> Ideally the scheduler would submit only the minimum amount of work to
>>>>> the hardware to keep the hardware busy. >
>>>>> The hardware seems to work mostly the same for all vendors, but you
>>>>> somehow seem to think that filling the ring is somehow beneficial which
>>>>> is really not the case as far as I can see.
>>>> I think that's a misunderstanding. I'm not trying to say that it is *always*
>>>> beneficial to fill up the ring as much as possible. But I think it is under
>>>> certain circumstances, exactly those circumstances I described for Nouveau.
>>>>
>>>> As mentioned, in Nouveau the size of a job is only really limited by the
>>>> ring size, which means that one job can (but does not necessarily) fill up
>>>> the whole ring. We both agree that this is inefficient, because it
>>>> potentially results into the HW run dry due to hw_submission_limit == 1.
>>>>
>>>> I recognize you said that one should define hw_submission_limit and adjust
>>>> the other parts of the equation accordingly, the options I see are:
>>>>
>>>> (1) Increase the ring size while keeping the maximum job size.
>>>> (2) Decrease the maximum job size while keeping the ring size.
>>>> (3) Let the scheduler track the actual job size rather than the maximum job
>>>> size.
>>>>
>>>> (1) results into potentially wasted ring memory, because we're not always
>>>> reaching the maximum job size, but the scheduler assumes so.
>>>>
>>>> (2) results into more IOCTLs from userspace for the same amount of IBs and
>>>> more jobs result into more memory allocations and more work being submitted
>>>> to the workqueue (with Matt's patches).
>>>>
>>>> (3) doesn't seem to have any of those draw backs.
>>>>
>>>> What would be your take on that?
>>>>
>>>> Actually, if none of the other drivers is interested into a more precise way
>>>> of keeping track of the ring utilization, I'd be totally fine to do it in a
>>>> driver specific way. However, unfortunately I don't see how this would be
>>>> possible.
>>>>
>>>> My proposal would be to just keep the hw_submission_limit (maybe rename it
>>>> to submission_unit_limit) and add a submission_units field to struct
>>>> drm_sched_job. By default a jobs submission_units field would be 0 and the
>>>> scheduler would behave the exact same way as it does now.
>>>>
>>>> Accordingly, jobs with submission_units > 1 would contribute more than one
>>>> unit to the submission_unit_limit.
>>>>
>>>> What do you think about that?
>>>>
>>> This seems reasonible to me and a very minimal change to the scheduler.
>>
>> If you have a good use case for this then the approach sounds sane to me as
>> well.
>>
> 
> Xe does not have a use case as the difference between the minimum size
> of a job and the maximum is not all that large (maybe 100-192 bytes is
> the range) so the accounting of a unit of 1 per job is just fine for now
> even though it may waste space.
> 
> In Nouveau it seems like the min / max size of a job can vary wildly,
> so it needs finer-grained units to make effective use of the ring space.
> Updating the scheduler to support this is rather trivial, hence no real
> opposition from me. Also, I do see this as a valid use case that other
> drivers or even perhaps Xe may use someday.

Yes, exactly that.

> 
>> My question is rather how exactly does Nouveau comes to have this use case?
>> Allowing the full ring size in the UAPI sounds a bit questionable.
>>
> 
> I agree allowing the user to completely fill the ring is a bit
> questionable, surely there has to be some upper limit. But let's say it
> allows 1-64 IBs; that still IMO could be used to justify finer-grained
> accounting in the DRM scheduler since, as stated above, it makes the
> difference between the min / max job quite large.

Yes, I agree that at least limiting the maximum job size to half of the 
ring size makes sense to guarantee a continuous flow.
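
Just to illustrate with made-up numbers: with a ring of 128 IB slots and 
the maximum job size capped at 64 IBs, hw_submission_limit = 128 / 64 = 2, 
so a second job can always be queued while the first one executes. With 
per-job accounting the same uAPI cap would still hold, but the ring could 
also carry, say, one 64-IB job plus a tail of small jobs, or 32 jobs of 
4 IBs each, instead of being considered full after two jobs.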

> 
> Matt
> 
>> Christian.
>>
>>>
>>> Matt
>>>
>>>> Besides all that, you said that filling up the ring just enough to not let
>>>> the HW run dry rather than filling it up entirely is desirable. Why do you
>>>> think so? I tend to think that in most cases it shouldn't make difference.
>>>>
>>>> - Danilo
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Because one really is the minimum if you want to do work at all, but
>>>>>> as you mentioned above a job limit of one can let the ring run dry.
>>>>>>
>>>>>> In the end my proposal comes down to tracking the actual size of a
>>>>>> job rather than just assuming a pre-defined maximum job size, and
>>>>>> hence a dynamic job limit.
>>>>>>
>>>>>> I don't think this would hurt the scheduler granularity. In fact, it
>>>>>> should even contribute to the desire of not letting the ring run dry
>>>>>> even better. Especially for sequences of small jobs, where the
>>>>>> current implementation might wrongly assume the ring is already full
>>>>>> although actually there would still be enough space left.
>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>>> Otherwise your scheduler might just overwrite the ring
>>>>>>>>> buffer by pushing things to fast.
>>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Given that, it seems like it would be better to let
>>>>>>>>>> the scheduler keep track of empty ring "slots"
>>>>>>>>>> instead, such that the scheduler can deceide whether
>>>>>>>>>> a subsequent job will still fit on the ring and if
>>>>>>>>>> not re-evaluate once a previous job finished. Of
>>>>>>>>>> course each submitted job would be required to carry
>>>>>>>>>> the number of slots it requires on the ring.
>>>>>>>>>>
>>>>>>>>>> What to you think of implementing this as
>>>>>>>>>> alternative flow control mechanism? Implementation
>>>>>>>>>> wise this could be a union with the existing
>>>>>>>>>> hw_submission_limit.
>>>>>>>>>>
>>>>>>>>>> - Danilo
>>>>>>>>>>
>>>>>>>>>>> A problem with this design is currently a drm_gpu_scheduler uses a
>>>>>>>>>>> kthread for submission / job cleanup. This doesn't scale if a large
>>>>>>>>>>> number of drm_gpu_scheduler are used. To work
>>>>>>>>>>> around the scaling issue,
>>>>>>>>>>> use a worker rather than kthread for submission / job cleanup.
>>>>>>>>>>>
>>>>>>>>>>> v2:
>>>>>>>>>>>      - (Rob Clark) Fix msm build
>>>>>>>>>>>      - Pass in run work queue
>>>>>>>>>>> v3:
>>>>>>>>>>>      - (Boris) don't have loop in worker
>>>>>>>>>>> v4:
>>>>>>>>>>>      - (Tvrtko) break out submit ready, stop,
>>>>>>>>>>> start helpers into own patch
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-18  5:40                   ` Christian König
@ 2023-08-18 12:49                     ` Matthew Brost
  2023-08-18 12:06                       ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-18 12:49 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, Danilo Krummrich,
	donald.robson, boris.brezillon, intel-xe, faith.ekstrand

On Fri, Aug 18, 2023 at 07:40:41AM +0200, Christian König wrote:
> > On 18.08.23 at 05:08, Matthew Brost wrote:
> > > On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote:
> > > > On 8/17/23 07:33, Christian König wrote:
> > > > > On 16.08.23 at 18:33, Danilo Krummrich wrote:
> > > > > > On 8/16/23 16:59, Christian König wrote:
> > > > > > > On 16.08.23 at 14:30, Danilo Krummrich wrote:
> > > > > > > > On 8/16/23 16:05, Christian König wrote:
> > > > > > > > > On 16.08.23 at 13:30, Danilo Krummrich wrote:
> > > > > > > > > Hi Matt,
> > > > > > > > > 
> > > > > > > > > On 8/11/23 04:31, Matthew Brost wrote:
> > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > > > > > > > > mapping between a drm_gpu_scheduler and
> > > > > > > > > > drm_sched_entity. At first this
> > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > > 
> > > > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > > > > > guaranteed to be the same completion even if
> > > > > > > > > > targeting the same hardware
> > > > > > > > > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > > > > > > > > which allowed to reorder, timeslice, and preempt
> > > > > > > > > > submissions. If a using
> > > > > > > > > > shared drm_gpu_scheduler across multiple
> > > > > > > > > > drm_sched_entity, the TDR falls
> > > > > > > > > > apart as the TDR expects submission order ==
> > > > > > > > > > completion order. Using a
> > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
> > > > > > > > > > 
> > > > > > > > > > 2. In XE submissions are done via programming a
> > > > > > > > > > ring buffer (circular
> > > > > > > > > > buffer), a drm_gpu_scheduler provides a limit on
> > > > > > > > > > number of jobs, if the
> > > > > > > > > > limit of number jobs is set to RING_SIZE /
> > > > > > > > > > MAX_SIZE_PER_JOB we get flow
> > > > > > > > > > control on the ring for free.
> > > > > > > > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
> > > > > > > > > 
> > > > > > > > > In Nouveau we currently do have such a limitation as
> > > > > > > > > well, but it is derived from the RING_SIZE, hence
> > > > > > > > > RING_SIZE / MAX_SIZE_PER_JOB would always be 1.
> > > > > > > > > However, I think most jobs won't actually utilize
> > > > > > > > > the whole ring.
> > > > > > > > Well that should probably rather be RING_SIZE /
> > > > > > > > MAX_SIZE_PER_JOB = hw_submission_limit (or even
> > > > > > > > hw_submission_limit - 1 when the hw can't distinct full
> > > > > > > > vs empty ring buffer).
> > > > > > > Not sure if I get you right, let me try to clarify what I
> > > > > > > was trying to say: I wanted to say that in Nouveau
> > > > > > > MAX_SIZE_PER_JOB isn't really limited by anything other than
> > > > > > > the RING_SIZE and hence we'd never allow more than 1 active
> > > > > > > job.
> > > > > > But that lets the hw run dry between submissions. That is
> > > > > > usually a pretty horrible idea for performance.
> > > > > Correct, that's the reason why I said it seems to be more efficient
> > > > > to base ring flow control on the actual size of each incoming job
> > > > > rather than the maximum size of a job.
> > > > > 
> > > > > > > However, it seems to be more efficient to base ring flow
> > > > > > > control on the actual size of each incoming job rather than
> > > > > > > the worst case, namely the maximum size of a job.
> > > > > > That doesn't sounds like a good idea to me. See we don't limit
> > > > > > the number of submitted jobs based on the ring size, but rather
> > > > > > we calculate the ring size based on the number of submitted
> > > > > > jobs.
> > > > > > 
> > > > > My point isn't really about whether we derive the ring size from the
> > > > > job limit or the other way around. It's more about the job size (or
> > > > > its maximum size) being arbitrary.
> > > > > 
> > > > > As mentioned in my reply to Matt:
> > > > > 
> > > > > "In Nouveau, userspace can submit an arbitrary amount of addresses
> > > > > of indirect bufferes containing the ring instructions. The ring on
> > > > > the kernel side takes the addresses of the indirect buffers rather
> > > > > than the instructions themself. Hence, technically there isn't
> > > > > really a limit on the amount of IBs submitted by a job except for
> > > > > the ring size."
> > > > > 
> > > > > So, my point is that I don't really want to limit the job size
> > > > > artificially just to be able to fit multiple jobs into the ring even
> > > > > if they're submitted at their "artificial" maximum size, but rather
> > > > > track how much of the ring the submitted job actually occupies.
> > > > > 
> > > > > > In other words the hw_submission_limit defines the ring size,
> > > > > > not the other way around. And you usually want the
> > > > > > hw_submission_limit as low as possible for good scheduler
> > > > > > granularity and to avoid extra overhead.
> > > > > I don't think you really mean "as low as possible", do you?
> > > > No, I do mean as low as possible or in other words as few as possible.
> > > > 
> > > > Ideally the scheduler would submit only the minimum amount of work to
> > > > the hardware to keep the hardware busy. >
> > > > The hardware seems to work mostly the same for all vendors, but you
> > > > somehow seem to think that filling the ring is somehow beneficial which
> > > > is really not the case as far as I can see.
> > > I think that's a misunderstanding. I'm not trying to say that it is *always*
> > > beneficial to fill up the ring as much as possible. But I think it is under
> > > certain circumstances, exactly those circumstances I described for Nouveau.
> > > 
> > > As mentioned, in Nouveau the size of a job is only really limited by the
> > > ring size, which means that one job can (but does not necessarily) fill up
> > > the whole ring. We both agree that this is inefficient, because it
> > > potentially results into the HW run dry due to hw_submission_limit == 1.
> > > 
> > > I recognize you said that one should define hw_submission_limit and adjust
> > > the other parts of the equation accordingly, the options I see are:
> > > 
> > > (1) Increase the ring size while keeping the maximum job size.
> > > (2) Decrease the maximum job size while keeping the ring size.
> > > (3) Let the scheduler track the actual job size rather than the maximum job
> > > size.
> > > 
> > > (1) results into potentially wasted ring memory, because we're not always
> > > reaching the maximum job size, but the scheduler assumes so.
> > > 
> > > (2) results into more IOCTLs from userspace for the same amount of IBs and
> > > more jobs result into more memory allocations and more work being submitted
> > > to the workqueue (with Matt's patches).
> > > 
> > > (3) doesn't seem to have any of those draw backs.
> > > 
> > > What would be your take on that?
> > > 
> > > Actually, if none of the other drivers is interested into a more precise way
> > > of keeping track of the ring utilization, I'd be totally fine to do it in a
> > > driver specific way. However, unfortunately I don't see how this would be
> > > possible.
> > > 
> > > My proposal would be to just keep the hw_submission_limit (maybe rename it
> > > to submission_unit_limit) and add a submission_units field to struct
> > > drm_sched_job. By default a jobs submission_units field would be 0 and the
> > > scheduler would behave the exact same way as it does now.
> > > 
> > > Accordingly, jobs with submission_units > 1 would contribute more than one
> > > unit to the submission_unit_limit.
> > > 
> > > What do you think about that?
> > > 
> > This seems reasonible to me and a very minimal change to the scheduler.
> 
> If you have a good use case for this then the approach sounds sane to me as
> well.
> 

Xe does not have a use case as the difference between the minimum size
of a job and the maximum is not all that large (maybe 100-192 bytes is
the range) so the accounting of a unit of 1 per job is just fine for now
even though it may waste space.

In Nouveau it seems like the min / max size of a job can vary wildly,
so it needs finer-grained units to make effective use of the ring space.
Updating the scheduler to support this is rather trivial, hence no real
opposition from me. Also, I do see this as a valid use case that other
drivers or even perhaps Xe may use someday.
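
Driver side this could presumably be as simple as something like the 
below when setting up a job (submission_units is the hypothetical field 
from this thread, num_ibs a purely illustrative variable):

  /* Illustrative only: for a ring that stores one entry per IB, a job's
   * footprint is simply its IB count; a driver with a fixed job size
   * would leave submission_units at 0 (counted as one unit). */
  job->submission_units = num_ibs;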

> My question is rather how exactly does Nouveau come to have this use case?
> Allowing the full ring size in the UAPI sounds a bit questionable.
>

I agree allowing the user to completely fill the ring is a bit
questionable, surely there has to be some upper limit. But let's say it
allows 1-64 IBs; that still IMO could be used to justify finer-grained
accounting in the DRM scheduler since, as stated above, it makes the
difference between the min / max job quite large.

Matt

> Christian.
> 
> > 
> > Matt
> > 
> > > Besides all that, you said that filling up the ring just enough to not let
> > > the HW run dry rather than filling it up entirely is desirable. Why do you
> > > think so? I tend to think that in most cases it shouldn't make difference.
> > > 
> > > - Danilo
> > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > Because one really is the minimum if you want to do work at all, but
> > > > > as you mentioned above a job limit of one can let the ring run dry.
> > > > > 
> > > > > In the end my proposal comes down to tracking the actual size of a
> > > > > job rather than just assuming a pre-defined maximum job size, and
> > > > > hence a dynamic job limit.
> > > > > 
> > > > > I don't think this would hurt the scheduler granularity. In fact, it
> > > > > should even contribute to the desire of not letting the ring run dry
> > > > > even better. Especially for sequences of small jobs, where the
> > > > > current implementation might wrongly assume the ring is already full
> > > > > although actually there would still be enough space left.
> > > > > 
> > > > > > Christian.
> > > > > > 
> > > > > > > > Otherwise your scheduler might just overwrite the ring
> > > > > > > > buffer by pushing things to fast.
> > > > > > > > 
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > > > Given that, it seems like it would be better to let
> > > > > > > > > the scheduler keep track of empty ring "slots"
> > > > > > > > > instead, such that the scheduler can deceide whether
> > > > > > > > > a subsequent job will still fit on the ring and if
> > > > > > > > > not re-evaluate once a previous job finished. Of
> > > > > > > > > course each submitted job would be required to carry
> > > > > > > > > the number of slots it requires on the ring.
> > > > > > > > > 
> > > > > > > > > What to you think of implementing this as
> > > > > > > > > alternative flow control mechanism? Implementation
> > > > > > > > > wise this could be a union with the existing
> > > > > > > > > hw_submission_limit.
> > > > > > > > > 
> > > > > > > > > - Danilo
> > > > > > > > > 
> > > > > > > > > > A problem with this design is currently a drm_gpu_scheduler uses a
> > > > > > > > > > kthread for submission / job cleanup. This doesn't scale if a large
> > > > > > > > > > number of drm_gpu_scheduler are used. To work
> > > > > > > > > > around the scaling issue,
> > > > > > > > > > use a worker rather than kthread for submission / job cleanup.
> > > > > > > > > > 
> > > > > > > > > > v2:
> > > > > > > > > >     - (Rob Clark) Fix msm build
> > > > > > > > > >     - Pass in run work queue
> > > > > > > > > > v3:
> > > > > > > > > >     - (Boris) don't have loop in worker
> > > > > > > > > > v4:
> > > > > > > > > >     - (Tvrtko) break out submit ready, stop,
> > > > > > > > > > start helpers into own patch
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-18  5:27       ` Christian König
@ 2023-08-18 13:13         ` Matthew Brost
  2023-08-21 13:17           ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-18 13:13 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen,
	Liviu.Dudau, dri-devel, luben.tuikov, lina, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On Fri, Aug 18, 2023 at 07:27:33AM +0200, Christian König wrote:
> On 17.08.23 at 19:54, Matthew Brost wrote:
> > On Thu, Aug 17, 2023 at 03:39:40PM +0200, Christian König wrote:
> > > On 11.08.23 at 04:31, Matthew Brost wrote:
> > > > Rather than call free_job and run_job in same work item have a dedicated
> > > > work item for each. This aligns with the design and intended use of work
> > > > queues.
> > > I would rather say we should get completely rid of the free_job callback.
> > > 
> > Would we still have a work item? e.g. Would we still want to call
> > drm_sched_get_cleanup_job which removes the job from the pending list
> > and adjusts the TDR? Trying to figure out what this looks like. We
> > probably can't do all of this from an IRQ context.
> > 
> > > Essentially the job is just the container which carries the information
> > > which are necessary before you push it to the hw. The real representation of
> > > the submission is actually the scheduler fence.
> > > 
> > Most of the free_jobs call plus drm_sched_job_cleanup + a put on job. In
> > Xe this cannot be called from an IRQ context either.
> > 
> > I'm just confused what exactly you are suggesting here.
> 
> To summarize in one sentence: Instead of the job we keep the scheduler and
> hardware fences around after pushing the job to the hw.
> 
> The free_job callback would then be replaced by dropping the reference on
> the scheduler and hw fence.
> 
> Would that work for you?
> 

I don't think so for a few reasons.

The job and hw fence are different structures (also different allocs)
for a reason. The job is referenced until it is complete (the hw fence is
signaled) and free_job is called. This reference is needed for the
TDR to work properly and for some reset flows too. Also, in Xe some of the
things done in free_job cannot be done from an IRQ context, hence calling
this from the scheduler worker is rather helpful.

The HW fence can live longer, as it can be installed in dma-resv
slots, syncobjs, etc. If the job and hw fence are combined, we are now
holding on to the memory for longer, and perhaps at the mercy of the
user. We also run the risk of the final put being done from an IRQ
context, which again won't work in Xe as it is currently coded. Lastly,
two jobs from the same scheduler could do the final put in parallel, so
rather than having free_job serialized by the worker, multiple jobs
would now be freeing themselves at the same time. This might not be an
issue, but it adds another level of raciness that needs to be accounted
for. None of this sounds desirable to me.
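
For reference, a simplified sketch of the lifetime split I'm describing
(not actual Xe code; struct sketch_job and sketch_hw_submit() are made-up
names):

  /* Sketch only: the job and the hw fence are separate allocations. The
   * job keeps a reference on the hw fence until free_job runs from the
   * scheduler worker; the fence itself may outlive the job via dma-resv
   * slots, syncobjs, etc. */
  struct sketch_job {
          struct drm_sched_job base;
          struct dma_fence *hw_fence;     /* separate allocation */
  };

  static struct dma_fence *sketch_run_job(struct drm_sched_job *job)
  {
          struct sketch_job *sjob = container_of(job, struct sketch_job, base);

          sjob->hw_fence = sketch_hw_submit(sjob);  /* driver specific */
          return dma_fence_get(sjob->hw_fence);     /* extra ref handed to the scheduler */
  }

  static void sketch_free_job(struct drm_sched_job *job)
  {
          struct sketch_job *sjob = container_of(job, struct sketch_job, base);

          drm_sched_job_cleanup(job);
          dma_fence_put(sjob->hw_fence);  /* job's ref; fence may live on elsewhere */
          kfree(sjob);                    /* plus teardown that can't run in IRQ context */
  }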

FWIW what you're suggesting sounds like how the i915 did things
(i915_request and hw fence in one memory alloc) and that turned out to
be a huge mess. As a rule of thumb I generally do the opposite of
whatever the i915 did.

Matt

> Christian.
> 
> > 
> > Matt
> > 
> > > All the lifetime issues we had came from ignoring this fact and I think we
> > > should push for fixing this design up again.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >    drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > >    include/drm/gpu_scheduler.h            |   8 +-
> > > >    2 files changed, 106 insertions(+), 39 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index cede47afc800..b67469eac179 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > >     * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > >     *
> > > >     * @rq: scheduler run queue to check.
> > > > + * @dequeue: dequeue selected entity
> > > >     *
> > > >     * Try to find a ready entity, returns NULL if none found.
> > > >     */
> > > >    static struct drm_sched_entity *
> > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > >    {
> > > >    	struct drm_sched_entity *entity;
> > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > >    	if (entity) {
> > > >    		list_for_each_entry_continue(entity, &rq->entities, list) {
> > > >    			if (drm_sched_entity_is_ready(entity)) {
> > > > -				rq->current_entity = entity;
> > > > -				reinit_completion(&entity->entity_idle);
> > > > +				if (dequeue) {
> > > > +					rq->current_entity = entity;
> > > > +					reinit_completion(&entity->entity_idle);
> > > > +				}
> > > >    				spin_unlock(&rq->lock);
> > > >    				return entity;
> > > >    			}
> > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > >    	list_for_each_entry(entity, &rq->entities, list) {
> > > >    		if (drm_sched_entity_is_ready(entity)) {
> > > > -			rq->current_entity = entity;
> > > > -			reinit_completion(&entity->entity_idle);
> > > > +			if (dequeue) {
> > > > +				rq->current_entity = entity;
> > > > +				reinit_completion(&entity->entity_idle);
> > > > +			}
> > > >    			spin_unlock(&rq->lock);
> > > >    			return entity;
> > > >    		}
> > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > >     * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > >     *
> > > >     * @rq: scheduler run queue to check.
> > > > + * @dequeue: dequeue selected entity
> > > >     *
> > > >     * Find oldest waiting ready entity, returns NULL if none found.
> > > >     */
> > > >    static struct drm_sched_entity *
> > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > >    {
> > > >    	struct rb_node *rb;
> > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > >    		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > >    		if (drm_sched_entity_is_ready(entity)) {
> > > > -			rq->current_entity = entity;
> > > > -			reinit_completion(&entity->entity_idle);
> > > > +			if (dequeue) {
> > > > +				rq->current_entity = entity;
> > > > +				reinit_completion(&entity->entity_idle);
> > > > +			}
> > > >    			break;
> > > >    		}
> > > >    	}
> > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > >    }
> > > >    /**
> > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > + * drm_sched_run_job_queue - queue job submission
> > > >     * @sched: scheduler instance
> > > >     */
> > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > >    {
> > > >    	if (!READ_ONCE(sched->pause_submit))
> > > > -		queue_work(sched->submit_wq, &sched->work_submit);
> > > > +		queue_work(sched->submit_wq, &sched->work_run_job);
> > > > +}
> > > > +
> > > > +static struct drm_sched_entity *
> > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > +
> > > > +/**
> > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > + * @sched: scheduler instance
> > > > + */
> > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +	if (drm_sched_select_entity(sched, false))
> > > > +		drm_sched_run_job_queue(sched);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_sched_free_job_queue - queue free job
> > > > + *
> > > > + * @sched: scheduler instance to queue free job
> > > > + */
> > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +	if (!READ_ONCE(sched->pause_submit))
> > > > +		queue_work(sched->submit_wq, &sched->work_free_job);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > + *
> > > > + * @sched: scheduler instance to queue free job
> > > > + */
> > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +	struct drm_sched_job *job;
> > > > +
> > > > +	spin_lock(&sched->job_list_lock);
> > > > +	job = list_first_entry_or_null(&sched->pending_list,
> > > > +				       struct drm_sched_job, list);
> > > > +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > +		drm_sched_free_job_queue(sched);
> > > > +	spin_unlock(&sched->job_list_lock);
> > > >    }
> > > >    /**
> > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > >    	dma_fence_get(&s_fence->finished);
> > > >    	drm_sched_fence_finished(s_fence, result);
> > > >    	dma_fence_put(&s_fence->finished);
> > > > -	drm_sched_submit_queue(sched);
> > > > +	drm_sched_free_job_queue(sched);
> > > >    }
> > > >    /**
> > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > >    void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > >    {
> > > >    	if (drm_sched_can_queue(sched))
> > > > -		drm_sched_submit_queue(sched);
> > > > +		drm_sched_run_job_queue(sched);
> > > >    }
> > > >    /**
> > > >     * drm_sched_select_entity - Select next entity to process
> > > >     *
> > > >     * @sched: scheduler instance
> > > > + * @dequeue: dequeue selected entity
> > > >     *
> > > >     * Returns the entity to process or NULL if none are found.
> > > >     */
> > > >    static struct drm_sched_entity *
> > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > >    {
> > > >    	struct drm_sched_entity *entity;
> > > >    	int i;
> > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > >    	/* Kernel run queue has higher priority than normal run queue*/
> > > >    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > >    		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > +							dequeue) :
> > > > +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > +						      dequeue);
> > > >    		if (entity)
> > > >    			break;
> > > >    	}
> > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > >    EXPORT_SYMBOL(drm_sched_pick_best);
> > > >    /**
> > > > - * drm_sched_main - main scheduler thread
> > > > + * drm_sched_free_job_work - worker to call free_job
> > > >     *
> > > > - * @param: scheduler instance
> > > > + * @w: free job work
> > > >     */
> > > > -static void drm_sched_main(struct work_struct *w)
> > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > >    {
> > > >    	struct drm_gpu_scheduler *sched =
> > > > -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > -	struct drm_sched_entity *entity;
> > > > +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > >    	struct drm_sched_job *cleanup_job;
> > > > -	int r;
> > > >    	if (READ_ONCE(sched->pause_submit))
> > > >    		return;
> > > >    	cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > -	entity = drm_sched_select_entity(sched);
> > > > +	if (cleanup_job) {
> > > > +		sched->ops->free_job(cleanup_job);
> > > > +
> > > > +		drm_sched_free_job_queue_if_ready(sched);
> > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > > +	}
> > > > +}
> > > > -	if (!entity && !cleanup_job)
> > > > -		return;	/* No more work */
> > > > +/**
> > > > + * drm_sched_run_job_work - worker to call run_job
> > > > + *
> > > > + * @w: run job work
> > > > + */
> > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > +{
> > > > +	struct drm_gpu_scheduler *sched =
> > > > +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > +	struct drm_sched_entity *entity;
> > > > +	int r;
> > > > -	if (cleanup_job)
> > > > -		sched->ops->free_job(cleanup_job);
> > > > +	if (READ_ONCE(sched->pause_submit))
> > > > +		return;
> > > > +	entity = drm_sched_select_entity(sched, true);
> > > >    	if (entity) {
> > > >    		struct dma_fence *fence;
> > > >    		struct drm_sched_fence *s_fence;
> > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > >    		sched_job = drm_sched_entity_pop_job(entity);
> > > >    		if (!sched_job) {
> > > >    			complete_all(&entity->entity_idle);
> > > > -			if (!cleanup_job)
> > > > -				return;	/* No more work */
> > > > -			goto again;
> > > > +			return;	/* No more work */
> > > >    		}
> > > >    		s_fence = sched_job->s_fence;
> > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > >    		}
> > > >    		wake_up(&sched->job_scheduled);
> > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > >    	}
> > > > -
> > > > -again:
> > > > -	drm_sched_submit_queue(sched);
> > > >    }
> > > >    /**
> > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > >    	spin_lock_init(&sched->job_list_lock);
> > > >    	atomic_set(&sched->hw_rq_count, 0);
> > > >    	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > -	INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > >    	atomic_set(&sched->_score, 0);
> > > >    	atomic64_set(&sched->job_id_count, 0);
> > > >    	sched->pause_submit = false;
> > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > >    void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > >    {
> > > >    	WRITE_ONCE(sched->pause_submit, true);
> > > > -	cancel_work_sync(&sched->work_submit);
> > > > +	cancel_work_sync(&sched->work_run_job);
> > > > +	cancel_work_sync(&sched->work_free_job);
> > > >    }
> > > >    EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > >    void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > >    {
> > > >    	WRITE_ONCE(sched->pause_submit, false);
> > > > -	queue_work(sched->submit_wq, &sched->work_submit);
> > > > +	queue_work(sched->submit_wq, &sched->work_run_job);
> > > > +	queue_work(sched->submit_wq, &sched->work_free_job);
> > > >    }
> > > >    EXPORT_SYMBOL(drm_sched_submit_start);
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index 04eec2d7635f..fbc083a92757 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > >     *                 finished.
> > > >     * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > >     * @job_id_count: used to assign unique id to the each job.
> > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > >     * @timeout_wq: workqueue used to queue @work_tdr
> > > > - * @work_submit: schedules jobs and cleans up entities
> > > > + * @work_run_job: schedules jobs
> > > > + * @work_free_job: cleans up jobs
> > > >     * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > >     *            timeout interval is over.
> > > >     * @pending_list: the list of jobs which are currently in the job queue.
> > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > >    	atomic64_t			job_id_count;
> > > >    	struct workqueue_struct		*submit_wq;
> > > >    	struct workqueue_struct		*timeout_wq;
> > > > -	struct work_struct		work_submit;
> > > > +	struct work_struct		work_run_job;
> > > > +	struct work_struct		work_free_job;
> > > >    	struct delayed_work		work_tdr;
> > > >    	struct list_head		pending_list;
> > > >    	spinlock_t			job_list_lock;
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-18 13:13         ` Matthew Brost
@ 2023-08-21 13:17           ` Christian König
  2023-08-23  3:27             ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-21 13:17 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen,
	Liviu.Dudau, dri-devel, luben.tuikov, lina, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 18.08.23 at 15:13, Matthew Brost wrote:
> On Fri, Aug 18, 2023 at 07:27:33AM +0200, Christian König wrote:
>> On 17.08.23 at 19:54, Matthew Brost wrote:
>>> On Thu, Aug 17, 2023 at 03:39:40PM +0200, Christian König wrote:
>>>> On 11.08.23 at 04:31, Matthew Brost wrote:
>>>>> Rather than call free_job and run_job in same work item have a dedicated
>>>>> work item for each. This aligns with the design and intended use of work
>>>>> queues.
>>>> I would rather say we should get completely rid of the free_job callback.
>>>>
>>> Would we still have work item? e.g. Would we still want to call
>>> drm_sched_get_cleanup_job which removes the job from the pending list
>>> and adjusts the TDR? Trying to figure out out what this looks like. We
>>> probably can't do all of this from an IRQ context.
>>>
>>>> Essentially the job is just the container which carries the information
>>>> which are necessary before you push it to the hw. The real representation of
>>>> the submission is actually the scheduler fence.
>>>>
>>> Most of the free_jobs call plus drm_sched_job_cleanup + a put on job. In
>>> Xe this cannot be called from an IRQ context either.
>>>
>>> I'm just confused what exactly you are suggesting here.
>> To summarize on one sentence: Instead of the job we keep the scheduler and
>> hardware fences around after pushing the job to the hw.
>>
>> The free_job callback would then be replaced by dropping the reference on
>> the scheduler and hw fence.
>>
>> Would that work for you?
>>
> I don't think so for a few reasons.
>
> The job and hw fence are different structures (also different allocs)
> for a reason. The job is referenced until it is complete (the hw fence is
> signaled) and free_job is called. This reference is needed for the
> TDR to work properly and for some reset flows too.

That is exactly what I want to avoid: tying the TDR to the job is what 
some AMD engineers pushed for because it looked like a simple solution 
and made the whole thing similar to what Windows does.

This turned the previous relatively clean scheduler and TDR design into 
a complete nightmare. The job contains quite a bunch of things which are 
not necessarily available after the application which submitted the job 
is torn down.

So what happens is that either you have stale pointers in the TDR, which 
can go boom extremely easily, or you somehow have to find a way to keep the 
necessary structures (which include struct thread_info and struct file 
for this driver connection) alive until all submissions are completed.

Delaying application tear down is also not an option because then you 
run into massive trouble with the OOM killer (or more generally OOM 
handling). See what we do in drm_sched_entity_flush() as well.

Since adding TDR support we have exercised this thoroughly over the 
last two or three years. And to sum it up, I would really like to 
get away from this mess again.

Compared to that what i915 does is actually rather clean I think.

>   Also, in Xe some of the
> things done in free_job cannot be done from an IRQ context, hence calling
> this from the scheduler worker is rather helpful.

Well, putting things for cleanup into a work item doesn't sound like 
something hard.
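
Something along these lines, as a rough sketch only (names invented just
for illustration, not a worked-out design): keep only the scheduler fence
and the hw fence after pushing the job, and defer the final puts to a
small work item so nothing heavyweight runs from the fence-signalled path.

  /* Rough illustration only: the work item carries nothing but the two
   * fence references and drops them from process context. */
  struct sched_done_work {
          struct work_struct work;
          struct drm_sched_fence *s_fence;
          struct dma_fence *hw_fence;
  };

  static void sched_done_worker(struct work_struct *w)
  {
          struct sched_done_work *done =
                  container_of(w, struct sched_done_work, work);

          dma_fence_put(&done->s_fence->finished);  /* scheduler fence ref */
          dma_fence_put(done->hw_fence);            /* hw fence ref */
          kfree(done);
  }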

The question is: what do you really need for the TDR which is not inside 
the hardware fence?

Regards,
Christian.

>
> The HW fence can live longer, as it can be installed in dma-resv
> slots, syncobjs, etc. If the job and hw fence are combined, we are now
> holding on to the memory for longer, and perhaps at the mercy of the
> user. We also run the risk of the final put being done from an IRQ
> context, which again won't work in Xe as it is currently coded. Lastly,
> two jobs from the same scheduler could do the final put in parallel, so
> rather than having free_job serialized by the worker, multiple jobs
> would now be freeing themselves at the same time. This might not be an
> issue, but it adds another level of raciness that needs to be accounted
> for. None of this sounds desirable to me.
>
> FWIW what you're suggesting sounds like how the i915 did things
> (i915_request and hw fence in one memory alloc) and that turned out to
> be a huge mess. As a rule of thumb I generally do the opposite of
> whatever the i915 did.
>
> Matt
>
>> Christian.
>>
>>> Matt
>>>
>>>> All the lifetime issues we had came from ignoring this fact and I think we
>>>> should push for fixing this design up again.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>>>>>     include/drm/gpu_scheduler.h            |   8 +-
>>>>>     2 files changed, 106 insertions(+), 39 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index cede47afc800..b67469eac179 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>>      * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>>>      *
>>>>>      * @rq: scheduler run queue to check.
>>>>> + * @dequeue: dequeue selected entity
>>>>>      *
>>>>>      * Try to find a ready entity, returns NULL if none found.
>>>>>      */
>>>>>     static struct drm_sched_entity *
>>>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>>>>>     {
>>>>>     	struct drm_sched_entity *entity;
>>>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>     	if (entity) {
>>>>>     		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>>>     			if (drm_sched_entity_is_ready(entity)) {
>>>>> -				rq->current_entity = entity;
>>>>> -				reinit_completion(&entity->entity_idle);
>>>>> +				if (dequeue) {
>>>>> +					rq->current_entity = entity;
>>>>> +					reinit_completion(&entity->entity_idle);
>>>>> +				}
>>>>>     				spin_unlock(&rq->lock);
>>>>>     				return entity;
>>>>>     			}
>>>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>     	list_for_each_entry(entity, &rq->entities, list) {
>>>>>     		if (drm_sched_entity_is_ready(entity)) {
>>>>> -			rq->current_entity = entity;
>>>>> -			reinit_completion(&entity->entity_idle);
>>>>> +			if (dequeue) {
>>>>> +				rq->current_entity = entity;
>>>>> +				reinit_completion(&entity->entity_idle);
>>>>> +			}
>>>>>     			spin_unlock(&rq->lock);
>>>>>     			return entity;
>>>>>     		}
>>>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>      * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>>>      *
>>>>>      * @rq: scheduler run queue to check.
>>>>> + * @dequeue: dequeue selected entity
>>>>>      *
>>>>>      * Find oldest waiting ready entity, returns NULL if none found.
>>>>>      */
>>>>>     static struct drm_sched_entity *
>>>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>>>>>     {
>>>>>     	struct rb_node *rb;
>>>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>     		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>>>     		if (drm_sched_entity_is_ready(entity)) {
>>>>> -			rq->current_entity = entity;
>>>>> -			reinit_completion(&entity->entity_idle);
>>>>> +			if (dequeue) {
>>>>> +				rq->current_entity = entity;
>>>>> +				reinit_completion(&entity->entity_idle);
>>>>> +			}
>>>>>     			break;
>>>>>     		}
>>>>>     	}
>>>>> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>     }
>>>>>     /**
>>>>> - * drm_sched_submit_queue - scheduler queue submission
>>>>> + * drm_sched_run_job_queue - queue job submission
>>>>>      * @sched: scheduler instance
>>>>>      */
>>>>> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
>>>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>>>     {
>>>>>     	if (!READ_ONCE(sched->pause_submit))
>>>>> -		queue_work(sched->submit_wq, &sched->work_submit);
>>>>> +		queue_work(sched->submit_wq, &sched->work_run_job);
>>>>> +}
>>>>> +
>>>>> +static struct drm_sched_entity *
>>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
>>>>> +
>>>>> +/**
>>>>> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
>>>>> + * @sched: scheduler instance
>>>>> + */
>>>>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> +	if (drm_sched_select_entity(sched, false))
>>>>> +		drm_sched_run_job_queue(sched);
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_sched_free_job_queue - queue free job
>>>>> + *
>>>>> + * @sched: scheduler instance to queue free job
>>>>> + */
>>>>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> +	if (!READ_ONCE(sched->pause_submit))
>>>>> +		queue_work(sched->submit_wq, &sched->work_free_job);
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_sched_free_job_queue_if_ready - queue free job if ready
>>>>> + *
>>>>> + * @sched: scheduler instance to queue free job
>>>>> + */
>>>>> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> +	struct drm_sched_job *job;
>>>>> +
>>>>> +	spin_lock(&sched->job_list_lock);
>>>>> +	job = list_first_entry_or_null(&sched->pending_list,
>>>>> +				       struct drm_sched_job, list);
>>>>> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>>>> +		drm_sched_free_job_queue(sched);
>>>>> +	spin_unlock(&sched->job_list_lock);
>>>>>     }
>>>>>     /**
>>>>> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>>>     	dma_fence_get(&s_fence->finished);
>>>>>     	drm_sched_fence_finished(s_fence, result);
>>>>>     	dma_fence_put(&s_fence->finished);
>>>>> -	drm_sched_submit_queue(sched);
>>>>> +	drm_sched_free_job_queue(sched);
>>>>>     }
>>>>>     /**
>>>>> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>>>     void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>>>>     {
>>>>>     	if (drm_sched_can_queue(sched))
>>>>> -		drm_sched_submit_queue(sched);
>>>>> +		drm_sched_run_job_queue(sched);
>>>>>     }
>>>>>     /**
>>>>>      * drm_sched_select_entity - Select next entity to process
>>>>>      *
>>>>>      * @sched: scheduler instance
>>>>> + * @dequeue: dequeue selected entity
>>>>>      *
>>>>>      * Returns the entity to process or NULL if none are found.
>>>>>      */
>>>>>     static struct drm_sched_entity *
>>>>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>>>>>     {
>>>>>     	struct drm_sched_entity *entity;
>>>>>     	int i;
>>>>> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>>     	/* Kernel run queue has higher priority than normal run queue*/
>>>>>     	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>     		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
>>>>> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
>>>>> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
>>>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
>>>>> +							dequeue) :
>>>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
>>>>> +						      dequeue);
>>>>>     		if (entity)
>>>>>     			break;
>>>>>     	}
>>>>> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>>>>>     EXPORT_SYMBOL(drm_sched_pick_best);
>>>>>     /**
>>>>> - * drm_sched_main - main scheduler thread
>>>>> + * drm_sched_free_job_work - worker to call free_job
>>>>>      *
>>>>> - * @param: scheduler instance
>>>>> + * @w: free job work
>>>>>      */
>>>>> -static void drm_sched_main(struct work_struct *w)
>>>>> +static void drm_sched_free_job_work(struct work_struct *w)
>>>>>     {
>>>>>     	struct drm_gpu_scheduler *sched =
>>>>> -		container_of(w, struct drm_gpu_scheduler, work_submit);
>>>>> -	struct drm_sched_entity *entity;
>>>>> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
>>>>>     	struct drm_sched_job *cleanup_job;
>>>>> -	int r;
>>>>>     	if (READ_ONCE(sched->pause_submit))
>>>>>     		return;
>>>>>     	cleanup_job = drm_sched_get_cleanup_job(sched);
>>>>> -	entity = drm_sched_select_entity(sched);
>>>>> +	if (cleanup_job) {
>>>>> +		sched->ops->free_job(cleanup_job);
>>>>> +
>>>>> +		drm_sched_free_job_queue_if_ready(sched);
>>>>> +		drm_sched_run_job_queue_if_ready(sched);
>>>>> +	}
>>>>> +}
>>>>> -	if (!entity && !cleanup_job)
>>>>> -		return;	/* No more work */
>>>>> +/**
>>>>> + * drm_sched_run_job_work - worker to call run_job
>>>>> + *
>>>>> + * @w: run job work
>>>>> + */
>>>>> +static void drm_sched_run_job_work(struct work_struct *w)
>>>>> +{
>>>>> +	struct drm_gpu_scheduler *sched =
>>>>> +		container_of(w, struct drm_gpu_scheduler, work_run_job);
>>>>> +	struct drm_sched_entity *entity;
>>>>> +	int r;
>>>>> -	if (cleanup_job)
>>>>> -		sched->ops->free_job(cleanup_job);
>>>>> +	if (READ_ONCE(sched->pause_submit))
>>>>> +		return;
>>>>> +	entity = drm_sched_select_entity(sched, true);
>>>>>     	if (entity) {
>>>>>     		struct dma_fence *fence;
>>>>>     		struct drm_sched_fence *s_fence;
>>>>> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>>>>>     		sched_job = drm_sched_entity_pop_job(entity);
>>>>>     		if (!sched_job) {
>>>>>     			complete_all(&entity->entity_idle);
>>>>> -			if (!cleanup_job)
>>>>> -				return;	/* No more work */
>>>>> -			goto again;
>>>>> +			return;	/* No more work */
>>>>>     		}
>>>>>     		s_fence = sched_job->s_fence;
>>>>> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>>>>>     		}
>>>>>     		wake_up(&sched->job_scheduled);
>>>>> +		drm_sched_run_job_queue_if_ready(sched);
>>>>>     	}
>>>>> -
>>>>> -again:
>>>>> -	drm_sched_submit_queue(sched);
>>>>>     }
>>>>>     /**
>>>>> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>>     	spin_lock_init(&sched->job_list_lock);
>>>>>     	atomic_set(&sched->hw_rq_count, 0);
>>>>>     	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>>>> -	INIT_WORK(&sched->work_submit, drm_sched_main);
>>>>> +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>>>> +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>>>>     	atomic_set(&sched->_score, 0);
>>>>>     	atomic64_set(&sched->job_id_count, 0);
>>>>>     	sched->pause_submit = false;
>>>>> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>>>>>     void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>>>>>     {
>>>>>     	WRITE_ONCE(sched->pause_submit, true);
>>>>> -	cancel_work_sync(&sched->work_submit);
>>>>> +	cancel_work_sync(&sched->work_run_job);
>>>>> +	cancel_work_sync(&sched->work_free_job);
>>>>>     }
>>>>>     EXPORT_SYMBOL(drm_sched_submit_stop);
>>>>> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>>>>>     void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>>>>>     {
>>>>>     	WRITE_ONCE(sched->pause_submit, false);
>>>>> -	queue_work(sched->submit_wq, &sched->work_submit);
>>>>> +	queue_work(sched->submit_wq, &sched->work_run_job);
>>>>> +	queue_work(sched->submit_wq, &sched->work_free_job);
>>>>>     }
>>>>>     EXPORT_SYMBOL(drm_sched_submit_start);
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index 04eec2d7635f..fbc083a92757 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>>>>>      *                 finished.
>>>>>      * @hw_rq_count: the number of jobs currently in the hardware queue.
>>>>>      * @job_id_count: used to assign unique id to the each job.
>>>>> - * @submit_wq: workqueue used to queue @work_submit
>>>>> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>>>>      * @timeout_wq: workqueue used to queue @work_tdr
>>>>> - * @work_submit: schedules jobs and cleans up entities
>>>>> + * @work_run_job: schedules jobs
>>>>> + * @work_free_job: cleans up jobs
>>>>>      * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>>>>>      *            timeout interval is over.
>>>>>      * @pending_list: the list of jobs which are currently in the job queue.
>>>>> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>>>>>     	atomic64_t			job_id_count;
>>>>>     	struct workqueue_struct		*submit_wq;
>>>>>     	struct workqueue_struct		*timeout_wq;
>>>>> -	struct work_struct		work_submit;
>>>>> +	struct work_struct		work_run_job;
>>>>> +	struct work_struct		work_free_job;
>>>>>     	struct delayed_work		work_tdr;
>>>>>     	struct list_head		pending_list;
>>>>>     	spinlock_t			job_list_lock;


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-18 11:58                       ` Danilo Krummrich
@ 2023-08-21 14:07                         ` Christian König
  2023-08-21 18:01                           ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-21 14:07 UTC (permalink / raw)
  To: Danilo Krummrich, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 18.08.23 um 13:58 schrieb Danilo Krummrich:
> [SNIP]
>> I only see two possible outcomes:
>> 1. You return -EBUSY (or similar) error code indicating the the hw 
>> can't receive more commands.
>> 2. Wait on previously pushed commands to be executed.
>> (3. Your driver crash because you accidentally overwrite stuff in the 
>> ring buffer which is still executed. I just assume that's prevented).
>>
>> Resolution #1 with -EBUSY is actually something the UAPI should not 
>> do, because your UAPI then depends on the specific timing of 
>> submissions which is a really bad idea.
>>
>> Resolution #2 is usually bad because it forces the hw to run dry 
>> between submission and so degrade performance.
>
> I agree, that is a good reason for at least limiting the maximum job 
> size to half of the ring size.
>
> However, there could still be cases where two subsequent jobs are 
> submitted with just a single IB, which as is would still block 
> subsequent jobs to be pushed to the ring although there is still 
> plenty of space. Depending on the (CPU) scheduler latency, such a case 
> can let the HW run dry as well.

Yeah, that was intentionally not done as well. The crux here is that the 
more you push to the hw the worse the scheduling granularity becomes. 
It's just that neither Xe nor Nouveau relies that much on the scheduling 
granularity at all (because of hw queues).

But Xe doesn't seem to need that feature and I would still try to avoid 
it because the more you have pushed to the hw the harder it is to get 
going again after a reset.

>
> Surely, we could just continue decrease the maximum job size even 
> further, but this would result in further overhead on user and kernel 
> for larger IB counts. Tracking the actual job size seems to be the 
> better solution for drivers where the job size can vary over a rather 
> huge range.

I strongly disagree on that. A larger ring buffer is trivial to allocate 
and if userspace submissions are so small that the scheduler can't keep 
up submitting them then your ring buffer size is your smallest problem.

In other words the submission overhead will completely kill your 
performance and you should probably consider stuffing more into a single 
submission.

Regards,
Christian.

>
> - Danilo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-21 14:07                         ` Christian König
@ 2023-08-21 18:01                           ` Danilo Krummrich
  2023-08-21 18:12                             ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-21 18:01 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 8/21/23 16:07, Christian König wrote:
> Am 18.08.23 um 13:58 schrieb Danilo Krummrich:
>> [SNIP]
>>> I only see two possible outcomes:
>>> 1. You return -EBUSY (or similar) error code indicating the the hw 
>>> can't receive more commands.
>>> 2. Wait on previously pushed commands to be executed.
>>> (3. Your driver crash because you accidentally overwrite stuff in the 
>>> ring buffer which is still executed. I just assume that's prevented).
>>>
>>> Resolution #1 with -EBUSY is actually something the UAPI should not 
>>> do, because your UAPI then depends on the specific timing of 
>>> submissions which is a really bad idea.
>>>
>>> Resolution #2 is usually bad because it forces the hw to run dry 
>>> between submission and so degrade performance.
>>
>> I agree, that is a good reason for at least limiting the maximum job 
>> size to half of the ring size.
>>
>> However, there could still be cases where two subsequent jobs are 
>> submitted with just a single IB, which as is would still block 
>> subsequent jobs to be pushed to the ring although there is still 
>> plenty of space. Depending on the (CPU) scheduler latency, such a case 
>> can let the HW run dry as well.
> 
> Yeah, that was intentionally not done as well. The crux here is that the 
> more you push to the hw the worse the scheduling granularity becomes. 
> It's just that neither Xe nor Nouveau relies that much on the scheduling 
> granularity at all (because of hw queues).
> 
> But Xe doesn't seem to need that feature and I would still try to avoid 
> it because the more you have pushed to the hw the harder it is to get 
> going again after a reset.
> 
>>
>> Surely, we could just continue decrease the maximum job size even 
>> further, but this would result in further overhead on user and kernel 
>> for larger IB counts. Tracking the actual job size seems to be the 
>> better solution for drivers where the job size can vary over a rather 
>> huge range.
> 
> I strongly disagree on that. A larger ring buffer is trivial to allocate 

That sounds like a workaround to me. The problem, in the case above, 
isn't that the ring buffer does not have enough space; the problem is 
that we account for the maximum job size although the actual job size is 
much smaller. And enabling the scheduler to track the actual job size is 
trivial as well.
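
As a rough sketch of what such accounting could look like (all names
below are invented for illustration, this is not a concrete patch):

#include <linux/types.h>

/*
 * Invented sketch: a job carries how much ring space it actually
 * consumes, and the scheduler accounts that against the ring capacity
 * instead of assuming the worst case for every job.
 */
struct sketch_job {
	/* ... the usual drm_sched_job fields ... */
	u32 ring_units;		/* actual ring space this job needs */
};

struct sketch_ring_budget {
	u32 capacity;		/* total usable ring space */
	u32 in_flight;		/* sum of ring_units of unsignaled jobs */
};

static bool sketch_job_fits(const struct sketch_ring_budget *b,
			    const struct sketch_job *job)
{
	/* replaces "number of in-flight jobs < hw_submission_limit" */
	return b->in_flight + job->ring_units <= b->capacity;
}

in_flight would go up when a job is pushed to the ring and back down
when its hw fence signals; a driver that doesn't care could set
ring_units to a constant and keep today's behavior.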

> and if userspace submissions are so small that the scheduler can't keep 
> up submitting them then your ring buffer size is your smallest problem.
> 
> In other words the submission overhead will completely kill your 
> performance and you should probably consider stuffing more into a single 
> submission.

I fully agree and that is also the reason why I want to keep the maximum 
job size as large as possible.

However, afaik with Vulkan it's the applications themselves deciding 
when and with how many command buffers a queue is submitted (@Faith: 
please correct me if I'm wrong). Hence, why not optimize for this case 
as well? It's not that it would make another case worse, right?

- Danilo

> 
> Regards,
> Christian.
> 
>>
>> - Danilo
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-21 18:01                           ` Danilo Krummrich
@ 2023-08-21 18:12                             ` Christian König
  2023-08-21 19:07                               ` Danilo Krummrich
  2023-08-21 19:46                               ` Faith Ekstrand
  0 siblings, 2 replies; 80+ messages in thread
From: Christian König @ 2023-08-21 18:12 UTC (permalink / raw)
  To: Danilo Krummrich, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 21.08.23 um 20:01 schrieb Danilo Krummrich:
> On 8/21/23 16:07, Christian König wrote:
>> Am 18.08.23 um 13:58 schrieb Danilo Krummrich:
>>> [SNIP]
>>>> I only see two possible outcomes:
>>>> 1. You return -EBUSY (or similar) error code indicating the the hw 
>>>> can't receive more commands.
>>>> 2. Wait on previously pushed commands to be executed.
>>>> (3. Your driver crash because you accidentally overwrite stuff in 
>>>> the ring buffer which is still executed. I just assume that's 
>>>> prevented).
>>>>
>>>> Resolution #1 with -EBUSY is actually something the UAPI should not 
>>>> do, because your UAPI then depends on the specific timing of 
>>>> submissions which is a really bad idea.
>>>>
>>>> Resolution #2 is usually bad because it forces the hw to run dry 
>>>> between submission and so degrade performance.
>>>
>>> I agree, that is a good reason for at least limiting the maximum job 
>>> size to half of the ring size.
>>>
>>> However, there could still be cases where two subsequent jobs are 
>>> submitted with just a single IB, which as is would still block 
>>> subsequent jobs to be pushed to the ring although there is still 
>>> plenty of space. Depending on the (CPU) scheduler latency, such a 
>>> case can let the HW run dry as well.
>>
>> Yeah, that was intentionally not done as well. The crux here is that 
>> the more you push to the hw the worse the scheduling granularity 
>> becomes. It's just that neither Xe nor Nouveau relies that much on 
>> the scheduling granularity at all (because of hw queues).
>>
>> But Xe doesn't seem to need that feature and I would still try to 
>> avoid it because the more you have pushed to the hw the harder it is 
>> to get going again after a reset.
>>
>>>
>>> Surely, we could just continue decrease the maximum job size even 
>>> further, but this would result in further overhead on user and 
>>> kernel for larger IB counts. Tracking the actual job size seems to 
>>> be the better solution for drivers where the job size can vary over 
>>> a rather huge range.
>>
>> I strongly disagree on that. A larger ring buffer is trivial to allocate 
>
> That sounds like a workaround to me. The problem, in the case above, 
> isn't that the ring buffer does not have enough space, the problem is 
> that we account for the maximum job size although the actual job size 
> is much smaller. And enabling the scheduler to track the actual job 
> size is trivial as well.

That's what I agree on; so far I just didn't see a reason for doing it, 
but at least a few reasons for not doing it.

>
>> and if userspace submissions are so small that the scheduler can't 
>> keep up submitting them then your ring buffer size is your smallest 
>> problem.
>>
>> In other words the submission overhead will completely kill your 
>> performance and you should probably consider stuffing more into a 
>> single submission.
>
> I fully agree and that is also the reason why I want to keep the 
> maximum job size as large as possible.
>
> However, afaik with Vulkan it's the applications themselves deciding 
> when and with how many command buffers a queue is submitted (@Faith: 
> please correct me if I'm wrong). Hence, why not optimize for this case 
> as well? It's not that it would make another case worse, right?

As I said, it does make both the scheduling granularity and the reset 
behavior worse.

In general I think we should try to push just enough work to the 
hardware to keep it busy and not as much as possible.

So as long as nobody from userspace comes and says we absolutely need to 
optimize this use case I would rather not do it.

Regards,
Christian.

>
> - Danilo
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> - Danilo
>>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-21 18:12                             ` Christian König
@ 2023-08-21 19:07                               ` Danilo Krummrich
  2023-08-22  9:35                                 ` Christian König
  2023-08-21 19:46                               ` Faith Ekstrand
  1 sibling, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-21 19:07 UTC (permalink / raw)
  To: Christian König, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On 8/21/23 20:12, Christian König wrote:
> Am 21.08.23 um 20:01 schrieb Danilo Krummrich:
>> On 8/21/23 16:07, Christian König wrote:
>>> Am 18.08.23 um 13:58 schrieb Danilo Krummrich:
>>>> [SNIP]
>>>>> I only see two possible outcomes:
>>>>> 1. You return -EBUSY (or similar) error code indicating the the hw 
>>>>> can't receive more commands.
>>>>> 2. Wait on previously pushed commands to be executed.
>>>>> (3. Your driver crash because you accidentally overwrite stuff in 
>>>>> the ring buffer which is still executed. I just assume that's 
>>>>> prevented).
>>>>>
>>>>> Resolution #1 with -EBUSY is actually something the UAPI should not 
>>>>> do, because your UAPI then depends on the specific timing of 
>>>>> submissions which is a really bad idea.
>>>>>
>>>>> Resolution #2 is usually bad because it forces the hw to run dry 
>>>>> between submission and so degrade performance.
>>>>
>>>> I agree, that is a good reason for at least limiting the maximum job 
>>>> size to half of the ring size.
>>>>
>>>> However, there could still be cases where two subsequent jobs are 
>>>> submitted with just a single IB, which as is would still block 
>>>> subsequent jobs to be pushed to the ring although there is still 
>>>> plenty of space. Depending on the (CPU) scheduler latency, such a 
>>>> case can let the HW run dry as well.
>>>
>>> Yeah, that was intentionally not done as well. The crux here is that 
>>> the more you push to the hw the worse the scheduling granularity 
>>> becomes. It's just that neither Xe nor Nouveau relies that much on 
>>> the scheduling granularity at all (because of hw queues).
>>>
>>> But Xe doesn't seem to need that feature and I would still try to 
>>> avoid it because the more you have pushed to the hw the harder it is 
>>> to get going again after a reset.
>>>
>>>>
>>>> Surely, we could just continue decrease the maximum job size even 
>>>> further, but this would result in further overhead on user and 
>>>> kernel for larger IB counts. Tracking the actual job size seems to 
>>>> be the better solution for drivers where the job size can vary over 
>>>> a rather huge range.
>>>
>>> I strongly disagree on that. A larger ring buffer is trivial to allocate 
>>
>> That sounds like a workaround to me. The problem, in the case above, 
>> isn't that the ring buffer does not have enough space, the problem is 
>> that we account for the maximum job size although the actual job size 
>> is much smaller. And enabling the scheduler to track the actual job 
>> size is trivial as well.
> 
> That's what I agree on, so far I just didn't see the reason for doing it 
> but at least a few reason for not doing it.
> 
>>
>>> and if userspace submissions are so small that the scheduler can't 
>>> keep up submitting them then your ring buffer size is your smallest 
>>> problem.
>>>
>>> In other words the submission overhead will completely kill your 
>>> performance and you should probably consider stuffing more into a 
>>> single submission.
>>
>> I fully agree and that is also the reason why I want to keep the 
>> maximum job size as large as possible.
>>
>> However, afaik with Vulkan it's the applications themselves deciding 
>> when and with how many command buffers a queue is submitted (@Faith: 
>> please correct me if I'm wrong). Hence, why not optimize for this case 
>> as well? It's not that it would make another case worse, right?
> 
> As I said it does make both the scheduling granularity as well as the 
> reset behavior worse.

As you already mentioned, Nouveau (and Xe) don't really rely much on 
scheduling granularity. For Nouveau, the same is true for the reset 
behavior; if things go south the channel is killed anyway. Userspace 
would just request a new ring in this case.

Hence, I think Nouveau would profit from accounting for the actual job 
size. And at the same time, other drivers that benefit from always 
accounting for the maximum job size would still do so, by default.

Arbitrary ratios of how much the job size contributes to the ring being 
considered as full would also be possible.

- Danilo

> 
> In general I think we should try to push just enough work to the 
> hardware to keep it busy and not as much as possible.
> 
> So as long as nobody from userspace comes and says we absolutely need to 
> optimize this use case I would rather not do it.
> 
> Regards,
> Christian.
> 
>>
>> - Danilo
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> - Danilo
>>>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-21 18:12                             ` Christian König
  2023-08-21 19:07                               ` Danilo Krummrich
@ 2023-08-21 19:46                               ` Faith Ekstrand
  2023-08-22  9:51                                 ` Christian König
  1 sibling, 1 reply; 80+ messages in thread
From: Faith Ekstrand @ 2023-08-21 19:46 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, luben.tuikov,
	Danilo Krummrich, donald.robson, boris.brezillon, intel-xe,
	faith.ekstrand

[-- Attachment #1: Type: text/plain, Size: 6038 bytes --]

On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koenig@amd.com>
wrote:

> Am 21.08.23 um 20:01 schrieb Danilo Krummrich:
> > On 8/21/23 16:07, Christian König wrote:
> >> Am 18.08.23 um 13:58 schrieb Danilo Krummrich:
> >>> [SNIP]
> >>>> I only see two possible outcomes:
> >>>> 1. You return -EBUSY (or similar) error code indicating the the hw
> >>>> can't receive more commands.
> >>>> 2. Wait on previously pushed commands to be executed.
> >>>> (3. Your driver crash because you accidentally overwrite stuff in
> >>>> the ring buffer which is still executed. I just assume that's
> >>>> prevented).
> >>>>
> >>>> Resolution #1 with -EBUSY is actually something the UAPI should not
> >>>> do, because your UAPI then depends on the specific timing of
> >>>> submissions which is a really bad idea.
> >>>>
> >>>> Resolution #2 is usually bad because it forces the hw to run dry
> >>>> between submission and so degrade performance.
> >>>
> >>> I agree, that is a good reason for at least limiting the maximum job
> >>> size to half of the ring size.
> >>>
> >>> However, there could still be cases where two subsequent jobs are
> >>> submitted with just a single IB, which as is would still block
> >>> subsequent jobs to be pushed to the ring although there is still
> >>> plenty of space. Depending on the (CPU) scheduler latency, such a
> >>> case can let the HW run dry as well.
> >>
> >> Yeah, that was intentionally not done as well. The crux here is that
> >> the more you push to the hw the worse the scheduling granularity
> >> becomes. It's just that neither Xe nor Nouveau relies that much on
> >> the scheduling granularity at all (because of hw queues).
> >>
> >> But Xe doesn't seem to need that feature and I would still try to
> >> avoid it because the more you have pushed to the hw the harder it is
> >> to get going again after a reset.
> >>
> >>>
> >>> Surely, we could just continue decrease the maximum job size even
> >>> further, but this would result in further overhead on user and
> >>> kernel for larger IB counts. Tracking the actual job size seems to
> >>> be the better solution for drivers where the job size can vary over
> >>> a rather huge range.
> >>
> >> I strongly disagree on that. A larger ring buffer is trivial to
> allocate
> >
> > That sounds like a workaround to me. The problem, in the case above,
> > isn't that the ring buffer does not have enough space, the problem is
> > that we account for the maximum job size although the actual job size
> > is much smaller. And enabling the scheduler to track the actual job
> > size is trivial as well.
>
> That's what I agree on, so far I just didn't see the reason for doing it
> but at least a few reason for not doing it.
>
> >
> >> and if userspace submissions are so small that the scheduler can't
> >> keep up submitting them then your ring buffer size is your smallest
> >> problem.
> >>
> >> In other words the submission overhead will completely kill your
> >> performance and you should probably consider stuffing more into a
> >> single submission.
> >
> > I fully agree and that is also the reason why I want to keep the
> > maximum job size as large as possible.
> >
> > However, afaik with Vulkan it's the applications themselves deciding
> > when and with how many command buffers a queue is submitted (@Faith:
> > please correct me if I'm wrong). Hence, why not optimize for this case
> > as well? It's not that it would make another case worse, right?
>
> As I said it does make both the scheduling granularity as well as the
> reset behavior worse.
>
> In general I think we should try to push just enough work to the
> hardware to keep it busy and not as much as possible.
>
> So as long as nobody from userspace comes and says we absolutely need to
> optimize this use case I would rather not do it.
>

This is a place where nouveau's needs are legitimately different from AMD
or Intel, I think.  NVIDIA's command streamer model is very different from
AMD and Intel.  On AMD and Intel, each EXEC turns into a single small
packet (on the order of 16B) which kicks off a command buffer.  There may
be a bit of cache management or something around it but that's it.  From
there, it's userspace's job to make one command buffer chain to another
until it's finally done and then do a "return", whatever that looks like.

NVIDIA's model is much more static.  Each packet in the HW/FW ring is an
address and a size and that much data is processed and then it grabs the
next packet and processes. The result is that, if we use multiple buffers
of commands, there's no way to chain them together.  We just have to pass
the whole list of buffers to the kernel.  A single EXEC ioctl / job may
have 500 such addr+size packets depending on how big the command buffer
is.  It gets worse on pre-Turing hardware where we have to split the batch
for every single DrawIndirect or DispatchIndirect.
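
Put differently, each entry in that ring is conceptually little more than
the following (only the shape of the problem, not the real hardware
layout):

#include <linux/types.h>

/* Conceptual only, not the actual hardware format. */
struct push_entry {
	u64 addr;	/* GPU address of a buffer of commands */
	u32 length;	/* how much of that buffer to process */
};

/*
 * One EXEC ioctl / job hands the kernel an array of these, potentially
 * hundreds, and every one of them has to land in the ring for that
 * single job; there is no way to chain from one buffer to the next.
 */
static void push_job(struct push_entry *ring, u32 *tail,
		     const struct push_entry *pushes, u32 count)
{
	u32 i;

	for (i = 0; i < count; i++)
		ring[(*tail)++] = pushes[i];	/* wrap handling omitted */
}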

Lest you think NVIDIA is just crazy here, it's a perfectly reasonable model
if you assume that userspace is feeding the firmware.  When that's
happening, you just have a userspace thread that sits there and feeds the
ringbuffer with whatever is next and you can marshal as much data through
as you want. Sure, it'd be nice to have a 2nd level batch thing that gets
launched from the FW ring and has all the individual launch commands but
it's not at all necessary.

What does that mean from a gpu_scheduler PoV? Basically, it means a
variable packet size.

What does this mean for implementation? IDK.  One option would be to teach
the scheduler about actual job sizes. Another would be to virtualize it and
have another layer underneath the scheduler that does the actual feeding of
the ring. Another would be to decrease the job size somewhat and then have
the front-end submit as many jobs as it needs to service userspace and only
put the out-fences on the last job. All the options kinda suck.
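
For the last of those options, a very rough sketch of what it could mean
in practice; struct push_entry is the conceptual addr+size entry from
above, and sketch_exec_job / sketch_job_create() are invented here and
assumed to allocate the job and do drm_sched_job_init() and
drm_sched_job_arm():

struct sketch_exec_job {
	struct drm_sched_job base;
	/* driver-private payload (push entries, etc.) */
};

static struct sketch_exec_job *
sketch_job_create(struct drm_sched_entity *entity,
		  const struct push_entry *pushes, u32 count);

/* Split one userspace submit into several scheduler jobs and only hand
 * back the finished fence of the last one as the out-fence. */
static struct dma_fence *
sketch_submit_in_chunks(struct drm_sched_entity *entity,
			const struct push_entry *pushes, u32 count,
			u32 max_per_job)
{
	struct dma_fence *fence = NULL;
	u32 i;

	for (i = 0; i < count; i += max_per_job) {
		u32 n = min(count - i, max_per_job);
		struct sketch_exec_job *job;

		job = sketch_job_create(entity, &pushes[i], n);
		if (IS_ERR(job)) {
			dma_fence_put(fence);
			return ERR_CAST(job);
		}

		dma_fence_put(fence);
		fence = dma_fence_get(&job->base.s_fence->finished);
		drm_sched_entity_push_job(&job->base);
	}

	return fence;
}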

~Faith

[-- Attachment #2: Type: text/html, Size: 7220 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-21 19:07                               ` Danilo Krummrich
@ 2023-08-22  9:35                                 ` Christian König
  0 siblings, 0 replies; 80+ messages in thread
From: Christian König @ 2023-08-22  9:35 UTC (permalink / raw)
  To: Danilo Krummrich, Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 21.08.23 um 21:07 schrieb Danilo Krummrich:
> On 8/21/23 20:12, Christian König wrote:
>> Am 21.08.23 um 20:01 schrieb Danilo Krummrich:
>>> On 8/21/23 16:07, Christian König wrote:
>>>> Am 18.08.23 um 13:58 schrieb Danilo Krummrich:
>>>>> [SNIP]
>>>>>> I only see two possible outcomes:
>>>>>> 1. You return -EBUSY (or similar) error code indicating the the 
>>>>>> hw can't receive more commands.
>>>>>> 2. Wait on previously pushed commands to be executed.
>>>>>> (3. Your driver crash because you accidentally overwrite stuff in 
>>>>>> the ring buffer which is still executed. I just assume that's 
>>>>>> prevented).
>>>>>>
>>>>>> Resolution #1 with -EBUSY is actually something the UAPI should 
>>>>>> not do, because your UAPI then depends on the specific timing of 
>>>>>> submissions which is a really bad idea.
>>>>>>
>>>>>> Resolution #2 is usually bad because it forces the hw to run dry 
>>>>>> between submission and so degrade performance.
>>>>>
>>>>> I agree, that is a good reason for at least limiting the maximum 
>>>>> job size to half of the ring size.
>>>>>
>>>>> However, there could still be cases where two subsequent jobs are 
>>>>> submitted with just a single IB, which as is would still block 
>>>>> subsequent jobs to be pushed to the ring although there is still 
>>>>> plenty of space. Depending on the (CPU) scheduler latency, such a 
>>>>> case can let the HW run dry as well.
>>>>
>>>> Yeah, that was intentionally not done as well. The crux here is 
>>>> that the more you push to the hw the worse the scheduling 
>>>> granularity becomes. It's just that neither Xe nor Nouveau relies 
>>>> that much on the scheduling granularity at all (because of hw queues).
>>>>
>>>> But Xe doesn't seem to need that feature and I would still try to 
>>>> avoid it because the more you have pushed to the hw the harder it 
>>>> is to get going again after a reset.
>>>>
>>>>>
>>>>> Surely, we could just continue decrease the maximum job size even 
>>>>> further, but this would result in further overhead on user and 
>>>>> kernel for larger IB counts. Tracking the actual job size seems to 
>>>>> be the better solution for drivers where the job size can vary 
>>>>> over a rather huge range.
>>>>
>>>> I strongly disagree on that. A larger ring buffer is trivial to 
>>>> allocate 
>>>
>>> That sounds like a workaround to me. The problem, in the case above, 
>>> isn't that the ring buffer does not have enough space, the problem 
>>> is that we account for the maximum job size although the actual job 
>>> size is much smaller. And enabling the scheduler to track the actual 
>>> job size is trivial as well.
>>
>> That's what I agree on, so far I just didn't see the reason for doing 
>> it but at least a few reason for not doing it.
>>
>>>
>>>> and if userspace submissions are so small that the scheduler can't 
>>>> keep up submitting them then your ring buffer size is your smallest 
>>>> problem.
>>>>
>>>> In other words the submission overhead will completely kill your 
>>>> performance and you should probably consider stuffing more into a 
>>>> single submission.
>>>
>>> I fully agree and that is also the reason why I want to keep the 
>>> maximum job size as large as possible.
>>>
>>> However, afaik with Vulkan it's the applications themselves deciding 
>>> when and with how many command buffers a queue is submitted (@Faith: 
>>> please correct me if I'm wrong). Hence, why not optimize for this 
>>> case as well? It's not that it would make another case worse, right?
>>
>> As I said it does make both the scheduling granularity as well as the 
>> reset behavior worse.
>
> As you already mentioned Nouveau (and XE) don't really rely much on 
> scheduling granularity. For Nouveau, the same is true for the reset 
> behavior; if things go south the channel is killed anyway. Userspace 
> would just request a new ring in this case.
>
> Hence, I think Nouveau would profit from accounting the actual job 
> size. And at the same time, other drivers having a benefit of always 
> accounting for the maximum job size would still do so, by default.
>
> Arbitrary ratios of how much the job size contributes to the ring 
> being considered as full would also be possible.

That would indeed be rather interesting since for a bunch of drivers the 
limiting part is not the ring buffer size, but rather the utilization of 
engines.

But no idea how to properly design that. You would have multiple values 
to check instead of just one.
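
Just to make "multiple values" concrete, an invented example (not a
design proposal):

#include <linux/types.h>

/*
 * Invented sketch of checking more than one limit before picking a
 * job: actual ring space plus a cap on in-flight jobs, so hw queue
 * depth / engine utilization stays bounded as well.
 */
struct sketch_limits {
	u32 ring_capacity;
	u32 ring_in_flight;
	u32 jobs_in_flight;
	u32 max_jobs_in_flight;
};

static bool sketch_ready_to_queue(const struct sketch_limits *l,
				  u32 job_ring_units)
{
	if (l->ring_in_flight + job_ring_units > l->ring_capacity)
		return false;	/* ring would overflow */

	if (l->jobs_in_flight >= l->max_jobs_in_flight)
		return false;	/* keep hw queue depth bounded */

	return true;
}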

Christian.

>
> - Danilo
>
>>
>> In general I think we should try to push just enough work to the 
>> hardware to keep it busy and not as much as possible.
>>
>> So as long as nobody from userspace comes and says we absolutely need 
>> to optimize this use case I would rather not do it.
>>
>> Regards,
>> Christian.
>>
>>>
>>> - Danilo
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> - Danilo
>>>>
>>>
>>
>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-21 19:46                               ` Faith Ekstrand
@ 2023-08-22  9:51                                 ` Christian König
  2023-08-22 16:55                                   ` Faith Ekstrand
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-22  9:51 UTC (permalink / raw)
  To: Faith Ekstrand
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, luben.tuikov,
	Danilo Krummrich, donald.robson, boris.brezillon, intel-xe,
	faith.ekstrand

[-- Attachment #1: Type: text/plain, Size: 3141 bytes --]

Am 21.08.23 um 21:46 schrieb Faith Ekstrand:
> On Mon, Aug 21, 2023 at 1:13 PM Christian König 
> <christian.koenig@amd.com> wrote:
>
>     [SNIP]
>     So as long as nobody from userspace comes and says we absolutely
>     need to
>     optimize this use case I would rather not do it.
>
>
> This is a place where nouveau's needs are legitimately different from 
> AMD or Intel, I think.  NVIDIA's command streamer model is very 
> different from AMD and Intel.  On AMD and Intel, each EXEC turns into 
> a single small packet (on the order of 16B) which kicks off a command 
> buffer.  There may be a bit of cache management or something around it 
> but that's it.  From there, it's userspace's job to make one command 
> buffer chain to another until it's finally done and then do a 
> "return", whatever that looks like.
>
> NVIDIA's model is much more static.  Each packet in the HW/FW ring is 
> an address and a size and that much data is processed and then it 
> grabs the next packet and processes. The result is that, if we use 
> multiple buffers of commands, there's no way to chain them together.  
> We just have to pass the whole list of buffers to the kernel.

So far that is actually completely identical to what AMD has.

> A single EXEC ioctl / job may have 500 such addr+size packets 
> depending on how big the command buffer is.

And that is what I don't understand. Why would you need hundreds of such 
addr+size packets?

This is basically identical to what AMD has (well on newer hw there is 
an extension in the CP packets to JUMP/CALL subsequent IBs, but this 
isn't widely used as far as I know).

Previously the limit was something like 4, which we extended because Bas 
came up with similar requirements for the AMD side from RADV.

But essentially those approaches with hundreds of IBs don't sound like 
a good idea to me.

> It gets worse on pre-Turing hardware where we have to split the batch 
> for every single DrawIndirect or DispatchIndirect.
>
> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable 
> model if you assume that userspace is feeding the firmware.  When 
> that's happening, you just have a userspace thread that sits there and 
> feeds the ringbuffer with whatever is next and you can marshal as much 
> data through as you want. Sure, it'd be nice to have a 2nd level batch 
> thing that gets launched from the FW ring and has all the individual 
> launch commands but it's not at all necessary.
>
> What does that mean from a gpu_scheduler PoV? Basically, it means a 
> variable packet size.
>
> What does this mean for implementation? IDK.  One option would be to 
> teach the scheduler about actual job sizes. Another would be to 
> virtualize it and have another layer underneath the scheduler that 
> does the actual feeding of the ring. Another would be to decrease the 
> job size somewhat and then have the front-end submit as many jobs as 
> it needs to service userspace and only put the out-fences on the last 
> job. All the options kinda suck.

Yeah, agree. The job size tracking Danilo suggested is still the least painful.

Christian.

>
> ~Faith

[-- Attachment #2: Type: text/html, Size: 5413 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-22  9:51                                 ` Christian König
@ 2023-08-22 16:55                                   ` Faith Ekstrand
  2023-08-24 11:50                                     ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 80+ messages in thread
From: Faith Ekstrand @ 2023-08-22 16:55 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, luben.tuikov,
	Danilo Krummrich, donald.robson, boris.brezillon, intel-xe,
	faith.ekstrand

[-- Attachment #1: Type: text/plain, Size: 4226 bytes --]

On Tue, Aug 22, 2023 at 4:51 AM Christian König <christian.koenig@amd.com>
wrote:

> Am 21.08.23 um 21:46 schrieb Faith Ekstrand:
>
> On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koenig@amd.com>
> wrote:
>
>> [SNIP]
>> So as long as nobody from userspace comes and says we absolutely need to
>> optimize this use case I would rather not do it.
>>
>
> This is a place where nouveau's needs are legitimately different from AMD
> or Intel, I think.  NVIDIA's command streamer model is very different from
> AMD and Intel.  On AMD and Intel, each EXEC turns into a single small
> packet (on the order of 16B) which kicks off a command buffer.  There may
> be a bit of cache management or something around it but that's it.  From
> there, it's userspace's job to make one command buffer chain to another
> until it's finally done and then do a "return", whatever that looks like.
>
> NVIDIA's model is much more static.  Each packet in the HW/FW ring is an
> address and a size and that much data is processed and then it grabs the
> next packet and processes. The result is that, if we use multiple buffers
> of commands, there's no way to chain them together.  We just have to pass
> the whole list of buffers to the kernel.
>
>
> So far that is actually completely identical to what AMD has.
>
> A single EXEC ioctl / job may have 500 such addr+size packets depending on
> how big the command buffer is.
>
>
> And that is what I don't understand. Why would you need 100dreds of such
> addr+size packets?
>

Well, we're not really in control of it.  We can control our base pushbuf
size and that's something we can tune but we're still limited by the
client.  We have to submit another pushbuf whenever:

 1. We run out of space (power-of-two growth is also possible but the size
is limited to a maximum of about 4MiB due to hardware limitations.)
 2. The client calls a secondary command buffer.
 3. Any usage of indirect draw or dispatch on pre-Turing hardware.

At some point we need to tune our BO size a bit to avoid (1) while also
avoiding piles of tiny BOs.  However, (2) and (3) are out of our control.

This is basically identical to what AMD has (well on newer hw there is an
> extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't
> widely used as far as I know).
>

According to Bas, RADV chains on recent hardware.


> Previously the limit was something like 4 which we extended to because Bas
> came up with similar requirements for the AMD side from RADV.
>
> But essentially those approaches with 100dreds of IBs doesn't sound like a
> good idea to me.
>

No one's arguing that they like it.  Again, the hardware isn't designed to
have a kernel in the way. It's designed to be fed by userspace. But we're
going to have the kernel in the middle for a while so we need to make it
not suck too bad.

~Faith

It gets worse on pre-Turing hardware where we have to split the batch for
> every single DrawIndirect or DispatchIndirect.
>
> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable
> model if you assume that userspace is feeding the firmware.  When that's
> happening, you just have a userspace thread that sits there and feeds the
> ringbuffer with whatever is next and you can marshal as much data through
> as you want. Sure, it'd be nice to have a 2nd level batch thing that gets
> launched from the FW ring and has all the individual launch commands but
> it's not at all necessary.
>
> What does that mean from a gpu_scheduler PoV? Basically, it means a
> variable packet size.
>
> What does this mean for implementation? IDK.  One option would be to teach
> the scheduler about actual job sizes. Another would be to virtualize it and
> have another layer underneath the scheduler that does the actual feeding of
> the ring. Another would be to decrease the job size somewhat and then have
> the front-end submit as many jobs as it needs to service userspace and only
> put the out-fences on the last job. All the options kinda suck.
>
>
> Yeah, agree. The job size Danilo suggested is still the least painful.
>
> Christian.
>
>
> ~Faith
>
>
>

[-- Attachment #2: Type: text/html, Size: 7028 bytes --]

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-21 13:17           ` Christian König
@ 2023-08-23  3:27             ` Matthew Brost
  2023-08-23  7:10               ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-23  3:27 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen,
	Liviu.Dudau, dri-devel, luben.tuikov, lina, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On Mon, Aug 21, 2023 at 03:17:29PM +0200, Christian König wrote:
> Am 18.08.23 um 15:13 schrieb Matthew Brost:
> > On Fri, Aug 18, 2023 at 07:27:33AM +0200, Christian König wrote:
> > > Am 17.08.23 um 19:54 schrieb Matthew Brost:
> > > > On Thu, Aug 17, 2023 at 03:39:40PM +0200, Christian König wrote:
> > > > > Am 11.08.23 um 04:31 schrieb Matthew Brost:
> > > > > > Rather than call free_job and run_job in same work item have a dedicated
> > > > > > work item for each. This aligns with the design and intended use of work
> > > > > > queues.
> > > > > I would rather say we should get completely rid of the free_job callback.
> > > > > 
> > > > Would we still have work item? e.g. Would we still want to call
> > > > drm_sched_get_cleanup_job which removes the job from the pending list
> > > > and adjusts the TDR? Trying to figure out out what this looks like. We
> > > > probably can't do all of this from an IRQ context.
> > > > 
> > > > > Essentially the job is just the container which carries the information
> > > > > which are necessary before you push it to the hw. The real representation of
> > > > > the submission is actually the scheduler fence.
> > > > > 
> > > > Most of the free_jobs call plus drm_sched_job_cleanup + a put on job. In
> > > > Xe this cannot be called from an IRQ context either.
> > > > 
> > > > I'm just confused what exactly you are suggesting here.
> > > To summarize on one sentence: Instead of the job we keep the scheduler and
> > > hardware fences around after pushing the job to the hw.
> > > 
> > > The free_job callback would then be replaced by dropping the reference on
> > > the scheduler and hw fence.
> > > 
> > > Would that work for you?
> > > 
> > I don't think so for a few reasons.
> > 
> > The job and hw fence are different structures (also different allocs too)
> > for a reason. The job referenced until it is complete (hw fence is
> > signaled) and the free_job is called. This reference is needed for the
> > TDR to work properly and also some reset flows too.
> 
> That is exactly what I want to avoid, tying the TDR to the job is what some
> AMD engineers pushed for because it looked like a simple solution and made
> the whole thing similar to what Windows does.
> 
> This turned the previous relatively clean scheduler and TDR design into a
> complete nightmare. The job contains quite a bunch of things which are not
> necessarily available after the application which submitted the job is torn
> down.
>

Agreed, the TDR shouldn't be accessing anything application specific,
rather just the internal job state required to tear the job down on the
hardware.
 
> So what happens is that you either have stale pointers in the TDR which can
> go boom extremely easily or we somehow find a way to keep the necessary

I have not experienced the TDR going boom in Xe.

> structures (which include struct thread_info and struct file for this driver
> connection) alive until all submissions are completed.
> 

In Xe we keep everything alive until all submissions are completed. By
everything I mean the drm job, entity, scheduler, and VM via a reference
counting scheme. All of these structures are just kernel state which can
safely be accessed even if the application has been killed.
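
Roughly, with the structure names invented here rather than being the
literal Xe code, that reference chain looks like:

#include <drm/gpu_scheduler.h>
#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct sketch_vm;				/* refcounted elsewhere */
static void sketch_queue_release(struct kref *kref);

/*
 * Invented sketch of the ownership chain: each job holds a reference
 * on its queue (entity plus its 1:1 scheduler), and the queue holds a
 * reference on the VM, so nothing the TDR touches can go away while a
 * submission is outstanding.
 */
struct sketch_queue {
	struct kref refcount;
	struct drm_gpu_scheduler sched;		/* 1:1 with the entity */
	struct drm_sched_entity entity;
	struct sketch_vm *vm;			/* held for the queue's lifetime */
};

struct sketch_queue_job {
	struct drm_sched_job base;
	struct sketch_queue *q;			/* held until free_job() */
};

static void sketch_free_job(struct drm_sched_job *drm_job)
{
	struct sketch_queue_job *job =
		container_of(drm_job, struct sketch_queue_job, base);

	drm_sched_job_cleanup(drm_job);
	/* may drop the last reference and tear down the queue, then the VM */
	kref_put(&job->q->refcount, sketch_queue_release);
	kfree(job);
}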

If we need to tear down on demand we just set the TDR to a minimum value and
it kicks the jobs off the hardware, gracefully cleans everything up and
drops all references. This is a benefit of the 1 to 1 relationship; not
sure if this works with how AMDGPU uses the scheduler.

> Delaying application tear down is also not an option because then you run
> into massive trouble with the OOM killer (or more generally OOM handling).
> See what we do in drm_sched_entity_flush() as well.
> 

Not an issue for Xe; we never call drm_sched_entity_flush as our
reference counting scheme ensures all jobs are finished before we attempt
to tear down the entity / scheduler.

> Since adding the TDR support we completely exercised this through in the
> last two or three years or so. And to sum it up I would really like to get
> away from this mess again.
> 
> Compared to that what i915 does is actually rather clean I think.
> 

Not even close; resets were a nightmare in the i915 (I spent years
trying to get this right and it probably still doesn't completely work) and
in Xe we basically got it right on the first attempt.

> >   Also in Xe some of
> > things done in free_job cannot be from an IRQ context, hence calling
> > this from the scheduler worker is rather helpful.
> 
> Well putting things for cleanup into a workitem doesn't sounds like
> something hard.
>

That is exactly what we are doing in the scheduler with the free_job
work item.

> Question is what do you really need for TDR which is not inside the hardware
> fence?
>

A reference to the entity to be able to kick the job off the hardware.
A reference to the entity, job, and VM for error capture.

We also need a reference to the job for recovery after a GPU reset so
run_job can be called again for innocent jobs.
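
In callback form, and reusing the invented sketch_queue_job /
sketch_queue shapes from the sketch a few mails up (every hypothetical_*
helper is made up as well), that boils down to something like:

static void hypothetical_kick_job_off_hw(struct sketch_queue *q);
static void hypothetical_capture_error_state(struct sketch_queue *q,
					     struct sketch_vm *vm,
					     struct sketch_queue_job *job);

static enum drm_gpu_sched_stat
sketch_timedout_job(struct drm_sched_job *drm_job)
{
	struct sketch_queue_job *job =
		container_of(drm_job, struct sketch_queue_job, base);

	/* needs the entity/queue reference: pull the job off the hardware */
	hypothetical_kick_job_off_hw(job->q);

	/* needs queue, VM and job references: snapshot state for error capture */
	hypothetical_capture_error_state(job->q, job->q->vm, job);

	/*
	 * The job itself stays referenced on the pending list, so run_job()
	 * can be called again for it after a reset if it turns out to be
	 * innocent.
	 */
	return DRM_GPU_SCHED_STAT_NOMINAL;
}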

All of this leads me to believe we need to stick with the design.

Matt

> Regards,
> Christian.
> 
> > 
> > The HW fence can live for longer as it can be installed in dma-resv
> > slots, syncobjs, etc... If the job and hw fence are combined now we
> > holding on the memory for the longer and perhaps at the mercy of the
> > user. We also run the risk of the final put being done from an IRQ
> > context which again wont work in Xe as it is currently coded. Lastly 2
> > jobs from the same scheduler could do the final put in parallel, so
> > rather than having free_job serialized by the worker now multiple jobs
> > are freeing themselves at the same time. This might not be an issue but
> > adds another level of raceyness that needs to be accounted for. None of
> > this sounds desirable to me.
> > 
> > FWIW what you suggesting sounds like how the i915 did things
> > (i915_request and hw fence in 1 memory alloc) and that turned out to be
> > a huge mess. As rule of thumb I generally do the opposite of whatever
> > the i915 did.
> > 
> > Matt
> > 
> > > Christian.
> > > 
> > > > Matt
> > > > 
> > > > > All the lifetime issues we had came from ignoring this fact and I think we
> > > > > should push for fixing this design up again.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > ---
> > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > > > >     include/drm/gpu_scheduler.h            |   8 +-
> > > > > >     2 files changed, 106 insertions(+), 39 deletions(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > index cede47afc800..b67469eac179 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > > > >      * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > > > >      *
> > > > > >      * @rq: scheduler run queue to check.
> > > > > > + * @dequeue: dequeue selected entity
> > > > > >      *
> > > > > >      * Try to find a ready entity, returns NULL if none found.
> > > > > >      */
> > > > > >     static struct drm_sched_entity *
> > > > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > > > >     {
> > > > > >     	struct drm_sched_entity *entity;
> > > > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > >     	if (entity) {
> > > > > >     		list_for_each_entry_continue(entity, &rq->entities, list) {
> > > > > >     			if (drm_sched_entity_is_ready(entity)) {
> > > > > > -				rq->current_entity = entity;
> > > > > > -				reinit_completion(&entity->entity_idle);
> > > > > > +				if (dequeue) {
> > > > > > +					rq->current_entity = entity;
> > > > > > +					reinit_completion(&entity->entity_idle);
> > > > > > +				}
> > > > > >     				spin_unlock(&rq->lock);
> > > > > >     				return entity;
> > > > > >     			}
> > > > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > >     	list_for_each_entry(entity, &rq->entities, list) {
> > > > > >     		if (drm_sched_entity_is_ready(entity)) {
> > > > > > -			rq->current_entity = entity;
> > > > > > -			reinit_completion(&entity->entity_idle);
> > > > > > +			if (dequeue) {
> > > > > > +				rq->current_entity = entity;
> > > > > > +				reinit_completion(&entity->entity_idle);
> > > > > > +			}
> > > > > >     			spin_unlock(&rq->lock);
> > > > > >     			return entity;
> > > > > >     		}
> > > > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > >      * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > > > >      *
> > > > > >      * @rq: scheduler run queue to check.
> > > > > > + * @dequeue: dequeue selected entity
> > > > > >      *
> > > > > >      * Find oldest waiting ready entity, returns NULL if none found.
> > > > > >      */
> > > > > >     static struct drm_sched_entity *
> > > > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > > > >     {
> > > > > >     	struct rb_node *rb;
> > > > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > >     		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > > >     		if (drm_sched_entity_is_ready(entity)) {
> > > > > > -			rq->current_entity = entity;
> > > > > > -			reinit_completion(&entity->entity_idle);
> > > > > > +			if (dequeue) {
> > > > > > +				rq->current_entity = entity;
> > > > > > +				reinit_completion(&entity->entity_idle);
> > > > > > +			}
> > > > > >     			break;
> > > > > >     		}
> > > > > >     	}
> > > > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > >     }
> > > > > >     /**
> > > > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > > > + * drm_sched_run_job_queue - queue job submission
> > > > > >      * @sched: scheduler instance
> > > > > >      */
> > > > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > > > >     {
> > > > > >     	if (!READ_ONCE(sched->pause_submit))
> > > > > > -		queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > +		queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > +}
> > > > > > +
> > > > > > +static struct drm_sched_entity *
> > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > > > + * @sched: scheduler instance
> > > > > > + */
> > > > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > +{
> > > > > > +	if (drm_sched_select_entity(sched, false))
> > > > > > +		drm_sched_run_job_queue(sched);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_sched_free_job_queue - queue free job
> > > > > > + *
> > > > > > + * @sched: scheduler instance to queue free job
> > > > > > + */
> > > > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > +{
> > > > > > +	if (!READ_ONCE(sched->pause_submit))
> > > > > > +		queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > > > + *
> > > > > > + * @sched: scheduler instance to queue free job
> > > > > > + */
> > > > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > +{
> > > > > > +	struct drm_sched_job *job;
> > > > > > +
> > > > > > +	spin_lock(&sched->job_list_lock);
> > > > > > +	job = list_first_entry_or_null(&sched->pending_list,
> > > > > > +				       struct drm_sched_job, list);
> > > > > > +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > > > +		drm_sched_free_job_queue(sched);
> > > > > > +	spin_unlock(&sched->job_list_lock);
> > > > > >     }
> > > > > >     /**
> > > > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > > > >     	dma_fence_get(&s_fence->finished);
> > > > > >     	drm_sched_fence_finished(s_fence, result);
> > > > > >     	dma_fence_put(&s_fence->finished);
> > > > > > -	drm_sched_submit_queue(sched);
> > > > > > +	drm_sched_free_job_queue(sched);
> > > > > >     }
> > > > > >     /**
> > > > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > > > >     void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > > > >     {
> > > > > >     	if (drm_sched_can_queue(sched))
> > > > > > -		drm_sched_submit_queue(sched);
> > > > > > +		drm_sched_run_job_queue(sched);
> > > > > >     }
> > > > > >     /**
> > > > > >      * drm_sched_select_entity - Select next entity to process
> > > > > >      *
> > > > > >      * @sched: scheduler instance
> > > > > > + * @dequeue: dequeue selected entity
> > > > > >      *
> > > > > >      * Returns the entity to process or NULL if none are found.
> > > > > >      */
> > > > > >     static struct drm_sched_entity *
> > > > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > > > >     {
> > > > > >     	struct drm_sched_entity *entity;
> > > > > >     	int i;
> > > > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > >     	/* Kernel run queue has higher priority than normal run queue*/
> > > > > >     	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > > > >     		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > > > -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > > > -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > > > +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > > > +							dequeue) :
> > > > > > +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > > > +						      dequeue);
> > > > > >     		if (entity)
> > > > > >     			break;
> > > > > >     	}
> > > > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > > > >     EXPORT_SYMBOL(drm_sched_pick_best);
> > > > > >     /**
> > > > > > - * drm_sched_main - main scheduler thread
> > > > > > + * drm_sched_free_job_work - worker to call free_job
> > > > > >      *
> > > > > > - * @param: scheduler instance
> > > > > > + * @w: free job work
> > > > > >      */
> > > > > > -static void drm_sched_main(struct work_struct *w)
> > > > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > > > >     {
> > > > > >     	struct drm_gpu_scheduler *sched =
> > > > > > -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > > > -	struct drm_sched_entity *entity;
> > > > > > +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > > > >     	struct drm_sched_job *cleanup_job;
> > > > > > -	int r;
> > > > > >     	if (READ_ONCE(sched->pause_submit))
> > > > > >     		return;
> > > > > >     	cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > > > -	entity = drm_sched_select_entity(sched);
> > > > > > +	if (cleanup_job) {
> > > > > > +		sched->ops->free_job(cleanup_job);
> > > > > > +
> > > > > > +		drm_sched_free_job_queue_if_ready(sched);
> > > > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > > > > +	}
> > > > > > +}
> > > > > > -	if (!entity && !cleanup_job)
> > > > > > -		return;	/* No more work */
> > > > > > +/**
> > > > > > + * drm_sched_run_job_work - worker to call run_job
> > > > > > + *
> > > > > > + * @w: run job work
> > > > > > + */
> > > > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > > > +{
> > > > > > +	struct drm_gpu_scheduler *sched =
> > > > > > +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > > +	struct drm_sched_entity *entity;
> > > > > > +	int r;
> > > > > > -	if (cleanup_job)
> > > > > > -		sched->ops->free_job(cleanup_job);
> > > > > > +	if (READ_ONCE(sched->pause_submit))
> > > > > > +		return;
> > > > > > +	entity = drm_sched_select_entity(sched, true);
> > > > > >     	if (entity) {
> > > > > >     		struct dma_fence *fence;
> > > > > >     		struct drm_sched_fence *s_fence;
> > > > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > >     		sched_job = drm_sched_entity_pop_job(entity);
> > > > > >     		if (!sched_job) {
> > > > > >     			complete_all(&entity->entity_idle);
> > > > > > -			if (!cleanup_job)
> > > > > > -				return;	/* No more work */
> > > > > > -			goto again;
> > > > > > +			return;	/* No more work */
> > > > > >     		}
> > > > > >     		s_fence = sched_job->s_fence;
> > > > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > > > >     		}
> > > > > >     		wake_up(&sched->job_scheduled);
> > > > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > > > >     	}
> > > > > > -
> > > > > > -again:
> > > > > > -	drm_sched_submit_queue(sched);
> > > > > >     }
> > > > > >     /**
> > > > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > > > >     	spin_lock_init(&sched->job_list_lock);
> > > > > >     	atomic_set(&sched->hw_rq_count, 0);
> > > > > >     	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > > > -	INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > > > +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > > > +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > > > >     	atomic_set(&sched->_score, 0);
> > > > > >     	atomic64_set(&sched->job_id_count, 0);
> > > > > >     	sched->pause_submit = false;
> > > > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > > > >     void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > > > >     {
> > > > > >     	WRITE_ONCE(sched->pause_submit, true);
> > > > > > -	cancel_work_sync(&sched->work_submit);
> > > > > > +	cancel_work_sync(&sched->work_run_job);
> > > > > > +	cancel_work_sync(&sched->work_free_job);
> > > > > >     }
> > > > > >     EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > >     void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > > > >     {
> > > > > >     	WRITE_ONCE(sched->pause_submit, false);
> > > > > > -	queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > +	queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > +	queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > >     }
> > > > > >     EXPORT_SYMBOL(drm_sched_submit_start);
> > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > index 04eec2d7635f..fbc083a92757 100644
> > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > > > >      *                 finished.
> > > > > >      * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > > > >      * @job_id_count: used to assign unique id to the each job.
> > > > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > > > >      * @timeout_wq: workqueue used to queue @work_tdr
> > > > > > - * @work_submit: schedules jobs and cleans up entities
> > > > > > + * @work_run_job: schedules jobs
> > > > > > + * @work_free_job: cleans up jobs
> > > > > >      * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > > > >      *            timeout interval is over.
> > > > > >      * @pending_list: the list of jobs which are currently in the job queue.
> > > > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > > > >     	atomic64_t			job_id_count;
> > > > > >     	struct workqueue_struct		*submit_wq;
> > > > > >     	struct workqueue_struct		*timeout_wq;
> > > > > > -	struct work_struct		work_submit;
> > > > > > +	struct work_struct		work_run_job;
> > > > > > +	struct work_struct		work_free_job;
> > > > > >     	struct delayed_work		work_tdr;
> > > > > >     	struct list_head		pending_list;
> > > > > >     	spinlock_t			job_list_lock;
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-23  3:27             ` Matthew Brost
@ 2023-08-23  7:10               ` Christian König
  2023-08-23 15:24                 ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-23  7:10 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen,
	Liviu.Dudau, dri-devel, luben.tuikov, lina, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 23.08.23 um 05:27 schrieb Matthew Brost:
> [SNIP]
>> That is exactly what I want to avoid, tying the TDR to the job is what some
>> AMD engineers pushed for because it looked like a simple solution and made
>> the whole thing similar to what Windows does.
>>
>> This turned the previous relatively clean scheduler and TDR design into a
>> complete nightmare. The job contains quite a bunch of things which are not
>> necessarily available after the application which submitted the job is torn
>> down.
>>
> Agree the TDR shouldn't be accessing anything application specific
> rather just internal job state required to tear the job down on the
> hardware.
>   
>> So what happens is that you either have stale pointers in the TDR which can
>> go boom extremely easily or we somehow find a way to keep the necessary
> I have not experienced the TDR going boom in Xe.
>
>> structures (which include struct thread_info and struct file for this driver
>> connection) alive until all submissions are completed.
>>
> In Xe we keep everything alive until all submissions are completed. By
> everything I mean the drm job, entity, scheduler, and VM via a reference
> counting scheme. All of these structures are just kernel state which can
> safely be accessed even if the application has been killed.

Yeah, but that might just not be such a good idea from a memory
management point of view.

When you (for example) kill a process, all resources from that process
should at least be queued to be freed more or less immediately.

What Linux is doing for other I/O operations is to keep the relevant 
pages alive until the I/O operation is completed, but for GPUs that 
usually means keeping most of the memory of the process alive and that 
in turn is really not something you can do.

You can of course do this if your driver has a reliable way of killing 
your submissions and freeing resources in a reasonable amount of time. 
This should then be done in the flush callback.
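
To be clear, by "flush callback" I mean the file_operations.flush hook of
the DRM file descriptor; amdgpu for example drains its entities from there
via drm_sched_entity_flush(). A rough sketch (the xe_* names and the entity
iterator are placeholders, not existing code):

#include <linux/fs.h>
#include <drm/drm_file.h>
#include <drm/gpu_scheduler.h>

static int xe_drm_flush(struct file *f, fl_owner_t id)
{
        struct drm_file *file_priv = f->private_data;
        struct xe_file *xef = file_priv->driver_priv;   /* placeholder */
        long timeout = MAX_WAIT_SCHED_ENTITY_Q_EMPTY;
        struct drm_sched_entity *entity;

        /* Wait a bounded amount of time for queued jobs to drain so that
         * OOM handling is not blocked indefinitely. */
        xe_for_each_entity(xef, entity)                 /* placeholder */
                timeout = drm_sched_entity_flush(entity, timeout);

        return timeout >= 0 ? 0 : timeout;
}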

> If we need to teardown on demand we just set the TDR to a minimum value and
> it kicks the jobs off the hardware, gracefully cleans everything up and
> drops all references. This is a benefit of the 1 to 1 relationship, not
> sure if this works with how AMDGPU uses the scheduler.
>
>> Delaying application tear down is also not an option because then you run
>> into massive trouble with the OOM killer (or more generally OOM handling).
>> See what we do in drm_sched_entity_flush() as well.
>>
> Not an issue for Xe, we never call drm_sched_entity_flush as our
> reference counting scheme ensures all jobs are finished before we
> attempt to tear down the entity / scheduler.

I don't think you can do that upstream. Calling drm_sched_entity_flush() 
is a must have from your flush callback for the file descriptor.

Unless you have some other method for killing your submissions, this
would give a path for a denial-of-service attack vector when the Xe
driver is in use.

>> Since adding the TDR support we completely exercised this through in the
>> last two or three years or so. And to sum it up I would really like to get
>> away from this mess again.
>>
>> Compared to that what i915 does is actually rather clean I think.
>>
> Not even close, resets were a nightmare in the i915 (I spent years
> trying to get this right and it probably still doesn't completely work)
> and in Xe we basically got it right on the first attempt.
>
>>>    Also in Xe some of
>>> things done in free_job cannot be from an IRQ context, hence calling
>>> this from the scheduler worker is rather helpful.
>> Well putting things for cleanup into a workitem doesn't sound like
>> something hard.
>>
> That is exactly what we are doing in the scheduler with the free_job
> work item.

Yeah, but I think that doing it in the scheduler and not in the driver
is problematic.

For the scheduler it shouldn't care about the job any more as soon as 
the driver takes over.

>
>> Question is what do you really need for TDR which is not inside the hardware
>> fence?
>>
> A reference to the entity to be able to kick the job off the hardware.
> A reference to the entity, job, and VM for error capture.
>
> We also need a reference to the job for recovery after a GPU reset so
> run_job can be called again for innocent jobs.

Well, exactly that is what I'm massively pushing back on. Letting the
scheduler call run_job() for the same job again is *NOT* something you
can actually do.

This pretty clearly violates some of the dma_fence constraints and has
caused massive headaches for me already.

What you can do is to do this inside your driver, e.g. take the jobs and 
push them again to the hw ring or just tell the hw to start executing 
again from a previous position.
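
Very roughly, and with made-up names, the driver-side variant would look
like the following; the scheduler never sees the jobs again, the driver
just re-emits what it already has:

#include <linux/list.h>

/* Driver-side replay after a reset: re-emit the commands of the jobs that
 * were already on the hardware instead of going through run_job() again.
 * All names here are placeholders for driver internals. */
static void xe_hw_ring_replay(struct xe_hw_ring *ring)
{
        struct xe_sched_job *job;

        /* The driver keeps its own in-flight list until the hardware fence
         * signals, so the batches can simply be written back to the ring,
         * reusing the already published hardware fences. */
        list_for_each_entry(job, &ring->in_flight, link)
                xe_ring_emit_job(ring, job);

        xe_ring_start(ring);
}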

BTW that re-submitting of jobs seems to be a no-go from userspace 
perspective as well. Take a look at the Vulkan spec for that, at least 
Marek pretty much pointed out that we should absolutely not do this 
inside the kernel.

The generally right approach seems to be to cleanly signal to userspace 
that something bad happened and that userspace then needs to submit 
things again even for innocent jobs.
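
As a rough sketch of what I mean (structure names are made up, and the
errno is just an example), on a device reset the driver completes the
in-flight fences with an error instead of replaying them:

#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/list.h>

static void xe_device_reset_cancel(struct xe_device *xe)        /* placeholder */
{
        struct xe_hw_fence *hwf;

        /* Userspace observes the error on the fence / syncobj and is
         * expected to resubmit, even for innocent contexts. */
        list_for_each_entry(hwf, &xe->in_flight_fences, link) {
                dma_fence_set_error(&hwf->base, -ECANCELED);
                dma_fence_signal(&hwf->base);
        }
}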

Regards,
Christian.

>
> All of this leads me to believe we need to stick with the design.
>
> Matt
>
>> Regards,
>> Christian.
>>
>>> The HW fence can live for longer as it can be installed in dma-resv
>>> slots, syncobjs, etc... If the job and hw fence are combined now we
>>> holding on the memory for the longer and perhaps at the mercy of the
>>> user. We also run the risk of the final put being done from an IRQ
>>> context which again wont work in Xe as it is currently coded. Lastly 2
>>> jobs from the same scheduler could do the final put in parallel, so
>>> rather than having free_job serialized by the worker now multiple jobs
>>> are freeing themselves at the same time. This might not be an issue but
>>> adds another level of raceyness that needs to be accounted for. None of
>>> this sounds desirable to me.
>>>
>>> FWIW what you suggesting sounds like how the i915 did things
>>> (i915_request and hw fence in 1 memory alloc) and that turned out to be
>>> a huge mess. As rule of thumb I generally do the opposite of whatever
>>> the i915 did.
>>>
>>> Matt
>>>
>>>> Christian.
>>>>
>>>>> Matt
>>>>>
>>>>>> All the lifetime issues we had came from ignoring this fact and I think we
>>>>>> should push for fixing this design up again.
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>>>>>>>      include/drm/gpu_scheduler.h            |   8 +-
>>>>>>>      2 files changed, 106 insertions(+), 39 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index cede47afc800..b67469eac179 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>>>>       * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>>>>>       *
>>>>>>>       * @rq: scheduler run queue to check.
>>>>>>> + * @dequeue: dequeue selected entity
>>>>>>>       *
>>>>>>>       * Try to find a ready entity, returns NULL if none found.
>>>>>>>       */
>>>>>>>      static struct drm_sched_entity *
>>>>>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>>>>>>>      {
>>>>>>>      	struct drm_sched_entity *entity;
>>>>>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>>>      	if (entity) {
>>>>>>>      		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>>>>>      			if (drm_sched_entity_is_ready(entity)) {
>>>>>>> -				rq->current_entity = entity;
>>>>>>> -				reinit_completion(&entity->entity_idle);
>>>>>>> +				if (dequeue) {
>>>>>>> +					rq->current_entity = entity;
>>>>>>> +					reinit_completion(&entity->entity_idle);
>>>>>>> +				}
>>>>>>>      				spin_unlock(&rq->lock);
>>>>>>>      				return entity;
>>>>>>>      			}
>>>>>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>>>      	list_for_each_entry(entity, &rq->entities, list) {
>>>>>>>      		if (drm_sched_entity_is_ready(entity)) {
>>>>>>> -			rq->current_entity = entity;
>>>>>>> -			reinit_completion(&entity->entity_idle);
>>>>>>> +			if (dequeue) {
>>>>>>> +				rq->current_entity = entity;
>>>>>>> +				reinit_completion(&entity->entity_idle);
>>>>>>> +			}
>>>>>>>      			spin_unlock(&rq->lock);
>>>>>>>      			return entity;
>>>>>>>      		}
>>>>>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>>>       * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>>>>>       *
>>>>>>>       * @rq: scheduler run queue to check.
>>>>>>> + * @dequeue: dequeue selected entity
>>>>>>>       *
>>>>>>>       * Find oldest waiting ready entity, returns NULL if none found.
>>>>>>>       */
>>>>>>>      static struct drm_sched_entity *
>>>>>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>>>>>>>      {
>>>>>>>      	struct rb_node *rb;
>>>>>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>>>      		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>>>>>      		if (drm_sched_entity_is_ready(entity)) {
>>>>>>> -			rq->current_entity = entity;
>>>>>>> -			reinit_completion(&entity->entity_idle);
>>>>>>> +			if (dequeue) {
>>>>>>> +				rq->current_entity = entity;
>>>>>>> +				reinit_completion(&entity->entity_idle);
>>>>>>> +			}
>>>>>>>      			break;
>>>>>>>      		}
>>>>>>>      	}
>>>>>>> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>>>      }
>>>>>>>      /**
>>>>>>> - * drm_sched_submit_queue - scheduler queue submission
>>>>>>> + * drm_sched_run_job_queue - queue job submission
>>>>>>>       * @sched: scheduler instance
>>>>>>>       */
>>>>>>> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
>>>>>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>>>>>      {
>>>>>>>      	if (!READ_ONCE(sched->pause_submit))
>>>>>>> -		queue_work(sched->submit_wq, &sched->work_submit);
>>>>>>> +		queue_work(sched->submit_wq, &sched->work_run_job);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static struct drm_sched_entity *
>>>>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
>>>>>>> + * @sched: scheduler instance
>>>>>>> + */
>>>>>>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>>>>>> +{
>>>>>>> +	if (drm_sched_select_entity(sched, false))
>>>>>>> +		drm_sched_run_job_queue(sched);
>>>>>>> +}
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * drm_sched_free_job_queue - queue free job
>>>>>>> + *
>>>>>>> + * @sched: scheduler instance to queue free job
>>>>>>> + */
>>>>>>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
>>>>>>> +{
>>>>>>> +	if (!READ_ONCE(sched->pause_submit))
>>>>>>> +		queue_work(sched->submit_wq, &sched->work_free_job);
>>>>>>> +}
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * drm_sched_free_job_queue_if_ready - queue free job if ready
>>>>>>> + *
>>>>>>> + * @sched: scheduler instance to queue free job
>>>>>>> + */
>>>>>>> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>>>>>> +{
>>>>>>> +	struct drm_sched_job *job;
>>>>>>> +
>>>>>>> +	spin_lock(&sched->job_list_lock);
>>>>>>> +	job = list_first_entry_or_null(&sched->pending_list,
>>>>>>> +				       struct drm_sched_job, list);
>>>>>>> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>>>>>> +		drm_sched_free_job_queue(sched);
>>>>>>> +	spin_unlock(&sched->job_list_lock);
>>>>>>>      }
>>>>>>>      /**
>>>>>>> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>>>>>      	dma_fence_get(&s_fence->finished);
>>>>>>>      	drm_sched_fence_finished(s_fence, result);
>>>>>>>      	dma_fence_put(&s_fence->finished);
>>>>>>> -	drm_sched_submit_queue(sched);
>>>>>>> +	drm_sched_free_job_queue(sched);
>>>>>>>      }
>>>>>>>      /**
>>>>>>> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>>>>>      void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>>>>>>      {
>>>>>>>      	if (drm_sched_can_queue(sched))
>>>>>>> -		drm_sched_submit_queue(sched);
>>>>>>> +		drm_sched_run_job_queue(sched);
>>>>>>>      }
>>>>>>>      /**
>>>>>>>       * drm_sched_select_entity - Select next entity to process
>>>>>>>       *
>>>>>>>       * @sched: scheduler instance
>>>>>>> + * @dequeue: dequeue selected entity
>>>>>>>       *
>>>>>>>       * Returns the entity to process or NULL if none are found.
>>>>>>>       */
>>>>>>>      static struct drm_sched_entity *
>>>>>>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>>>>>>>      {
>>>>>>>      	struct drm_sched_entity *entity;
>>>>>>>      	int i;
>>>>>>> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>>>>      	/* Kernel run queue has higher priority than normal run queue*/
>>>>>>>      	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>      		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
>>>>>>> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
>>>>>>> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
>>>>>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
>>>>>>> +							dequeue) :
>>>>>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
>>>>>>> +						      dequeue);
>>>>>>>      		if (entity)
>>>>>>>      			break;
>>>>>>>      	}
>>>>>>> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>>>>>>>      EXPORT_SYMBOL(drm_sched_pick_best);
>>>>>>>      /**
>>>>>>> - * drm_sched_main - main scheduler thread
>>>>>>> + * drm_sched_free_job_work - worker to call free_job
>>>>>>>       *
>>>>>>> - * @param: scheduler instance
>>>>>>> + * @w: free job work
>>>>>>>       */
>>>>>>> -static void drm_sched_main(struct work_struct *w)
>>>>>>> +static void drm_sched_free_job_work(struct work_struct *w)
>>>>>>>      {
>>>>>>>      	struct drm_gpu_scheduler *sched =
>>>>>>> -		container_of(w, struct drm_gpu_scheduler, work_submit);
>>>>>>> -	struct drm_sched_entity *entity;
>>>>>>> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
>>>>>>>      	struct drm_sched_job *cleanup_job;
>>>>>>> -	int r;
>>>>>>>      	if (READ_ONCE(sched->pause_submit))
>>>>>>>      		return;
>>>>>>>      	cleanup_job = drm_sched_get_cleanup_job(sched);
>>>>>>> -	entity = drm_sched_select_entity(sched);
>>>>>>> +	if (cleanup_job) {
>>>>>>> +		sched->ops->free_job(cleanup_job);
>>>>>>> +
>>>>>>> +		drm_sched_free_job_queue_if_ready(sched);
>>>>>>> +		drm_sched_run_job_queue_if_ready(sched);
>>>>>>> +	}
>>>>>>> +}
>>>>>>> -	if (!entity && !cleanup_job)
>>>>>>> -		return;	/* No more work */
>>>>>>> +/**
>>>>>>> + * drm_sched_run_job_work - worker to call run_job
>>>>>>> + *
>>>>>>> + * @w: run job work
>>>>>>> + */
>>>>>>> +static void drm_sched_run_job_work(struct work_struct *w)
>>>>>>> +{
>>>>>>> +	struct drm_gpu_scheduler *sched =
>>>>>>> +		container_of(w, struct drm_gpu_scheduler, work_run_job);
>>>>>>> +	struct drm_sched_entity *entity;
>>>>>>> +	int r;
>>>>>>> -	if (cleanup_job)
>>>>>>> -		sched->ops->free_job(cleanup_job);
>>>>>>> +	if (READ_ONCE(sched->pause_submit))
>>>>>>> +		return;
>>>>>>> +	entity = drm_sched_select_entity(sched, true);
>>>>>>>      	if (entity) {
>>>>>>>      		struct dma_fence *fence;
>>>>>>>      		struct drm_sched_fence *s_fence;
>>>>>>> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>>>>>>>      		sched_job = drm_sched_entity_pop_job(entity);
>>>>>>>      		if (!sched_job) {
>>>>>>>      			complete_all(&entity->entity_idle);
>>>>>>> -			if (!cleanup_job)
>>>>>>> -				return;	/* No more work */
>>>>>>> -			goto again;
>>>>>>> +			return;	/* No more work */
>>>>>>>      		}
>>>>>>>      		s_fence = sched_job->s_fence;
>>>>>>> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>>>>>>>      		}
>>>>>>>      		wake_up(&sched->job_scheduled);
>>>>>>> +		drm_sched_run_job_queue_if_ready(sched);
>>>>>>>      	}
>>>>>>> -
>>>>>>> -again:
>>>>>>> -	drm_sched_submit_queue(sched);
>>>>>>>      }
>>>>>>>      /**
>>>>>>> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>>>>      	spin_lock_init(&sched->job_list_lock);
>>>>>>>      	atomic_set(&sched->hw_rq_count, 0);
>>>>>>>      	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>>>>>> -	INIT_WORK(&sched->work_submit, drm_sched_main);
>>>>>>> +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>>>>>> +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>>>>>>      	atomic_set(&sched->_score, 0);
>>>>>>>      	atomic64_set(&sched->job_id_count, 0);
>>>>>>>      	sched->pause_submit = false;
>>>>>>> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>>>>>>>      void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>>>>>>>      {
>>>>>>>      	WRITE_ONCE(sched->pause_submit, true);
>>>>>>> -	cancel_work_sync(&sched->work_submit);
>>>>>>> +	cancel_work_sync(&sched->work_run_job);
>>>>>>> +	cancel_work_sync(&sched->work_free_job);
>>>>>>>      }
>>>>>>>      EXPORT_SYMBOL(drm_sched_submit_stop);
>>>>>>> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>>>>>>>      void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>>>>>>>      {
>>>>>>>      	WRITE_ONCE(sched->pause_submit, false);
>>>>>>> -	queue_work(sched->submit_wq, &sched->work_submit);
>>>>>>> +	queue_work(sched->submit_wq, &sched->work_run_job);
>>>>>>> +	queue_work(sched->submit_wq, &sched->work_free_job);
>>>>>>>      }
>>>>>>>      EXPORT_SYMBOL(drm_sched_submit_start);
>>>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>>>> index 04eec2d7635f..fbc083a92757 100644
>>>>>>> --- a/include/drm/gpu_scheduler.h
>>>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>>>> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>>>>>>>       *                 finished.
>>>>>>>       * @hw_rq_count: the number of jobs currently in the hardware queue.
>>>>>>>       * @job_id_count: used to assign unique id to the each job.
>>>>>>> - * @submit_wq: workqueue used to queue @work_submit
>>>>>>> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>>>>>>       * @timeout_wq: workqueue used to queue @work_tdr
>>>>>>> - * @work_submit: schedules jobs and cleans up entities
>>>>>>> + * @work_run_job: schedules jobs
>>>>>>> + * @work_free_job: cleans up jobs
>>>>>>>       * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>>>>>>>       *            timeout interval is over.
>>>>>>>       * @pending_list: the list of jobs which are currently in the job queue.
>>>>>>> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>>>>>>>      	atomic64_t			job_id_count;
>>>>>>>      	struct workqueue_struct		*submit_wq;
>>>>>>>      	struct workqueue_struct		*timeout_wq;
>>>>>>> -	struct work_struct		work_submit;
>>>>>>> +	struct work_struct		work_run_job;
>>>>>>> +	struct work_struct		work_free_job;
>>>>>>>      	struct delayed_work		work_tdr;
>>>>>>>      	struct list_head		pending_list;
>>>>>>>      	spinlock_t			job_list_lock;


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-23  7:10               ` Christian König
@ 2023-08-23 15:24                 ` Matthew Brost
  2023-08-23 15:41                   ` Alex Deucher
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-23 15:24 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen,
	Liviu.Dudau, dri-devel, luben.tuikov, lina, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
> Am 23.08.23 um 05:27 schrieb Matthew Brost:
> > [SNIP]
> > > That is exactly what I want to avoid, tying the TDR to the job is what some
> > > AMD engineers pushed for because it looked like a simple solution and made
> > > the whole thing similar to what Windows does.
> > > 
> > > This turned the previous relatively clean scheduler and TDR design into a
> > > complete nightmare. The job contains quite a bunch of things which are not
> > > necessarily available after the application which submitted the job is torn
> > > down.
> > > 
> > Agree the TDR shouldn't be accessing anything application specific
> > rather just internal job state required to tear the job down on the
> > hardware.
> > > So what happens is that you either have stale pointers in the TDR which can
> > > go boom extremely easily or we somehow find a way to keep the necessary
> > I have not experienced the TDR going boom in Xe.
> > 
> > > structures (which include struct thread_info and struct file for this driver
> > > connection) alive until all submissions are completed.
> > > 
> > In Xe we keep everything alive until all submissions are completed. By
> > everything I mean the drm job, entity, scheduler, and VM via a reference
> > counting scheme. All of these structures are just kernel state which can
> > safely be accessed even if the application has been killed.
> 
> Yeah, but that might just not be such a good idea from a memory
> management point of view.
> 
> When you (for example) kill a process, all resources from that process
> should at least be queued to be freed more or less immediately.
> 

We do this, the TDR kicks jobs off the hardware as fast as the hw
interface allows and signals all pending hw fences immediately after.
Free jobs then is immediately called and the reference count goes to
zero. I think max time for all of this to occur is a handful of ms.

> What Linux is doing for other I/O operations is to keep the relevant pages
> alive until the I/O operation is completed, but for GPUs that usually means
> keeping most of the memory of the process alive and that in turn is really
> not something you can do.
> 
> You can of course do this if your driver has a reliable way of killing your
> submissions and freeing resources in a reasonable amount of time. This
> should then be done in the flush callback.
> 

'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
and I don't think that function even does what you describe. It flushes
the spsc queue, but what about jobs already on the hardware? How do
those get killed?

As stated, we do this via the TDR, which is a rather clean design and
fits with our reference counting scheme.
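
For reference, the "set the TDR to a minimum value" trick boils down to
firing the existing timeout work immediately; drm_sched_fault() already
does essentially this today, and patch 9 of this series adds a helper for
adjusting the timeout. Open-coded it is roughly:

#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/* Sketch: trigger drm_sched_job_timedout() right away so the driver's
 * timedout_job callback kicks the jobs off the hardware, signals their
 * fences and drops the references. */
static void xe_sched_kill_now(struct drm_gpu_scheduler *sched)
{
        mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
}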

> > If we need to teardown on demand we just set the TDR to a minimum value and
> > it kicks the jobs off the hardware, gracefully cleans everything up and
> > drops all references. This is a benefit of the 1 to 1 relationship, not
> > sure if this works with how AMDGPU uses the scheduler.
> > 
> > > Delaying application tear down is also not an option because then you run
> > > into massive trouble with the OOM killer (or more generally OOM handling).
> > > See what we do in drm_sched_entity_flush() as well.
> > > 
> > Not an issue for Xe, we never call drm_sched_entity_flush as our
> > reference counting scheme ensures all jobs are finished before we
> > attempt to tear down the entity / scheduler.
> 
> I don't think you can do that upstream. Calling drm_sched_entity_flush() is
> a must have from your flush callback for the file descriptor.
> 

Again, 'flush callback'? What are you referring to?

And why does drm_sched_entity_flush need to be called? It doesn't seem
to do anything useful.

> Unless you have some other method for killing your submissions, this
> would give a path for a denial-of-service attack vector when the Xe
> driver is in use.
> 

Yes, once the TDR fires it disallows all new submissions at the exec
IOCTL plus flushes any pending submissions as fast as possible.
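
The exec IOCTL side of that ban is trivial; with placeholder names it is
roughly:

#include <linux/compiler.h>
#include <linux/errno.h>

/* Early check in the exec IOCTL; the exact errno is a driver choice. */
static int xe_exec_queue_check_banned(struct xe_exec_queue *q)
{
        if (READ_ONCE(q->banned))
                return -ECANCELED;      /* queue was killed by the TDR */

        return 0;
}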

> > > Since adding the TDR support we completely exercised this through in the
> > > last two or three years or so. And to sum it up I would really like to get
> > > away from this mess again.
> > > 
> > > Compared to that what i915 does is actually rather clean I think.
> > > 
> > Not even close, resets were a nightmare in the i915 (I spent years
> > trying to get this right and it probably still doesn't completely work)
> > and in Xe we basically got it right on the first attempt.
> > 
> > > >    Also in Xe some of
> > > > things done in free_job cannot be from an IRQ context, hence calling
> > > > this from the scheduler worker is rather helpful.
> > > Well putting things for cleanup into a workitem doesn't sound like
> > > something hard.
> > > 
> > That is exactly what we are doing in the scheduler with the free_job
> > work item.
> 
> Yeah, but I think that doing it in the scheduler and not in the driver
> is problematic.
>

Disagree, a common cleanup callback called from a non-IRQ context IMO
is a good design, rather than each driver possibly having its own
worker for cleanup.

> For the scheduler it shouldn't care about the job any more as soon as the
> driver takes over.
> 

This is a massive rewrite for all users of the DRM scheduler. I'm
saying that for Xe what you are suggesting makes little to no sense.

I'd like other users of the DRM scheduler to chime in on what you are
proposing. The scope of this change affects 8ish drivers and would
require buy-in from each of the stakeholders. I certainly can't change
all of these drivers, as I don't feel comfortable in all of those code
bases, nor do I have hardware to test all of them.

> > 
> > > Question is what do you really need for TDR which is not inside the hardware
> > > fence?
> > > 
> > A reference to the entity to be able to kick the job off the hardware.
> > A reference to the entity, job, and VM for error capture.
> > 
> > We also need a reference to the job for recovery after a GPU reset so
> > run_job can be called again for innocent jobs.
> 
> Well, exactly that is what I'm massively pushing back on. Letting the
> scheduler call run_job() for the same job again is *NOT* something you
> can actually do.
> 

But lots of drivers do this already and the DRM scheduler documents
this.

> This pretty clearly violates some of the dma_fence constraints and has
> caused massive headaches for me already.
> 

Seems to work fine in Xe.

> What you can do is to do this inside your driver, e.g. take the jobs and
> push them again to the hw ring or just tell the hw to start executing again
> from a previous position.
> 

Again, this is now a massive rewrite of many drivers.

> BTW that re-submitting of jobs seems to be a no-go from userspace
> perspective as well. Take a look at the Vulkan spec for that, at least Marek
> pretty much pointed out that we should absolutely not do this inside the
> kernel.
> 

Yes, if the job causes the hang we ban the queue. Typically only
per-entity (queue) resets are done in Xe, but occasionally device level
resets are done (due to hardware issues) and then run_job is called
again for the innocent jobs / entities.

> The generally right approach seems to be to cleanly signal to userspace that
> something bad happened and that userspace then needs to submit things again
> even for innocent jobs.
> 

I disagree that innocent jobs should be banned. What you are suggesting
is that if a device reset needs to be done we kill / ban every user
space queue. That seems like overkill. I'm not seeing where that is
stated in this doc [1]; it seems to imply that only jobs that are stuck
result in bans.

Matt

[1] https://patchwork.freedesktop.org/patch/553465/?series=119883&rev=3

> Regards,
> Christian.
> 
> > 
> > All of this leads me to believe we need to stick with the design.
> > 
> > Matt
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > The HW fence can live for longer as it can be installed in dma-resv
> > > > slots, syncobjs, etc... If the job and hw fence are combined now we
> > > > holding on the memory for the longer and perhaps at the mercy of the
> > > > user. We also run the risk of the final put being done from an IRQ
> > > > context which again wont work in Xe as it is currently coded. Lastly 2
> > > > jobs from the same scheduler could do the final put in parallel, so
> > > > rather than having free_job serialized by the worker now multiple jobs
> > > > are freeing themselves at the same time. This might not be an issue but
> > > > adds another level of raceyness that needs to be accounted for. None of
> > > > this sounds desirable to me.
> > > > 
> > > > FWIW what you suggesting sounds like how the i915 did things
> > > > (i915_request and hw fence in 1 memory alloc) and that turned out to be
> > > > a huge mess. As rule of thumb I generally do the opposite of whatever
> > > > the i915 did.
> > > > 
> > > > Matt
> > > > 
> > > > > Christian.
> > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > All the lifetime issues we had came from ignoring this fact and I think we
> > > > > > > should push for fixing this design up again.
> > > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > ---
> > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > > > > > >      include/drm/gpu_scheduler.h            |   8 +-
> > > > > > > >      2 files changed, 106 insertions(+), 39 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > index cede47afc800..b67469eac179 100644
> > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > > > > > >       * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > > > > > >       *
> > > > > > > >       * @rq: scheduler run queue to check.
> > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > >       *
> > > > > > > >       * Try to find a ready entity, returns NULL if none found.
> > > > > > > >       */
> > > > > > > >      static struct drm_sched_entity *
> > > > > > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > >      {
> > > > > > > >      	struct drm_sched_entity *entity;
> > > > > > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > >      	if (entity) {
> > > > > > > >      		list_for_each_entry_continue(entity, &rq->entities, list) {
> > > > > > > >      			if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > -				rq->current_entity = entity;
> > > > > > > > -				reinit_completion(&entity->entity_idle);
> > > > > > > > +				if (dequeue) {
> > > > > > > > +					rq->current_entity = entity;
> > > > > > > > +					reinit_completion(&entity->entity_idle);
> > > > > > > > +				}
> > > > > > > >      				spin_unlock(&rq->lock);
> > > > > > > >      				return entity;
> > > > > > > >      			}
> > > > > > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > >      	list_for_each_entry(entity, &rq->entities, list) {
> > > > > > > >      		if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > -			rq->current_entity = entity;
> > > > > > > > -			reinit_completion(&entity->entity_idle);
> > > > > > > > +			if (dequeue) {
> > > > > > > > +				rq->current_entity = entity;
> > > > > > > > +				reinit_completion(&entity->entity_idle);
> > > > > > > > +			}
> > > > > > > >      			spin_unlock(&rq->lock);
> > > > > > > >      			return entity;
> > > > > > > >      		}
> > > > > > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > >       * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > > > > > >       *
> > > > > > > >       * @rq: scheduler run queue to check.
> > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > >       *
> > > > > > > >       * Find oldest waiting ready entity, returns NULL if none found.
> > > > > > > >       */
> > > > > > > >      static struct drm_sched_entity *
> > > > > > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > >      {
> > > > > > > >      	struct rb_node *rb;
> > > > > > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > >      		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > > > > >      		if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > -			rq->current_entity = entity;
> > > > > > > > -			reinit_completion(&entity->entity_idle);
> > > > > > > > +			if (dequeue) {
> > > > > > > > +				rq->current_entity = entity;
> > > > > > > > +				reinit_completion(&entity->entity_idle);
> > > > > > > > +			}
> > > > > > > >      			break;
> > > > > > > >      		}
> > > > > > > >      	}
> > > > > > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > >      }
> > > > > > > >      /**
> > > > > > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > > > > > + * drm_sched_run_job_queue - queue job submission
> > > > > > > >       * @sched: scheduler instance
> > > > > > > >       */
> > > > > > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > >      {
> > > > > > > >      	if (!READ_ONCE(sched->pause_submit))
> > > > > > > > -		queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > +		queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static struct drm_sched_entity *
> > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > > > > > + * @sched: scheduler instance
> > > > > > > > + */
> > > > > > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > +{
> > > > > > > > +	if (drm_sched_select_entity(sched, false))
> > > > > > > > +		drm_sched_run_job_queue(sched);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * drm_sched_free_job_queue - queue free job
> > > > > > > > + *
> > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > + */
> > > > > > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > +{
> > > > > > > > +	if (!READ_ONCE(sched->pause_submit))
> > > > > > > > +		queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > > > > > + *
> > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > + */
> > > > > > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > +{
> > > > > > > > +	struct drm_sched_job *job;
> > > > > > > > +
> > > > > > > > +	spin_lock(&sched->job_list_lock);
> > > > > > > > +	job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > > +				       struct drm_sched_job, list);
> > > > > > > > +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > > > > > +		drm_sched_free_job_queue(sched);
> > > > > > > > +	spin_unlock(&sched->job_list_lock);
> > > > > > > >      }
> > > > > > > >      /**
> > > > > > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > > > > > >      	dma_fence_get(&s_fence->finished);
> > > > > > > >      	drm_sched_fence_finished(s_fence, result);
> > > > > > > >      	dma_fence_put(&s_fence->finished);
> > > > > > > > -	drm_sched_submit_queue(sched);
> > > > > > > > +	drm_sched_free_job_queue(sched);
> > > > > > > >      }
> > > > > > > >      /**
> > > > > > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > >      void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > >      {
> > > > > > > >      	if (drm_sched_can_queue(sched))
> > > > > > > > -		drm_sched_submit_queue(sched);
> > > > > > > > +		drm_sched_run_job_queue(sched);
> > > > > > > >      }
> > > > > > > >      /**
> > > > > > > >       * drm_sched_select_entity - Select next entity to process
> > > > > > > >       *
> > > > > > > >       * @sched: scheduler instance
> > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > >       *
> > > > > > > >       * Returns the entity to process or NULL if none are found.
> > > > > > > >       */
> > > > > > > >      static struct drm_sched_entity *
> > > > > > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > > > > > >      {
> > > > > > > >      	struct drm_sched_entity *entity;
> > > > > > > >      	int i;
> > > > > > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > >      	/* Kernel run queue has higher priority than normal run queue*/
> > > > > > > >      	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > > > > > >      		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > > > > > -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > > > > > -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > > > > > +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > > > > > +							dequeue) :
> > > > > > > > +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > > > > > +						      dequeue);
> > > > > > > >      		if (entity)
> > > > > > > >      			break;
> > > > > > > >      	}
> > > > > > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > > > > > >      EXPORT_SYMBOL(drm_sched_pick_best);
> > > > > > > >      /**
> > > > > > > > - * drm_sched_main - main scheduler thread
> > > > > > > > + * drm_sched_free_job_work - worker to call free_job
> > > > > > > >       *
> > > > > > > > - * @param: scheduler instance
> > > > > > > > + * @w: free job work
> > > > > > > >       */
> > > > > > > > -static void drm_sched_main(struct work_struct *w)
> > > > > > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > > > > > >      {
> > > > > > > >      	struct drm_gpu_scheduler *sched =
> > > > > > > > -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > > > > > -	struct drm_sched_entity *entity;
> > > > > > > > +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > > > > > >      	struct drm_sched_job *cleanup_job;
> > > > > > > > -	int r;
> > > > > > > >      	if (READ_ONCE(sched->pause_submit))
> > > > > > > >      		return;
> > > > > > > >      	cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > > > > > -	entity = drm_sched_select_entity(sched);
> > > > > > > > +	if (cleanup_job) {
> > > > > > > > +		sched->ops->free_job(cleanup_job);
> > > > > > > > +
> > > > > > > > +		drm_sched_free_job_queue_if_ready(sched);
> > > > > > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > +	}
> > > > > > > > +}
> > > > > > > > -	if (!entity && !cleanup_job)
> > > > > > > > -		return;	/* No more work */
> > > > > > > > +/**
> > > > > > > > + * drm_sched_run_job_work - worker to call run_job
> > > > > > > > + *
> > > > > > > > + * @w: run job work
> > > > > > > > + */
> > > > > > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > > > > > +{
> > > > > > > > +	struct drm_gpu_scheduler *sched =
> > > > > > > > +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > > > > +	struct drm_sched_entity *entity;
> > > > > > > > +	int r;
> > > > > > > > -	if (cleanup_job)
> > > > > > > > -		sched->ops->free_job(cleanup_job);
> > > > > > > > +	if (READ_ONCE(sched->pause_submit))
> > > > > > > > +		return;
> > > > > > > > +	entity = drm_sched_select_entity(sched, true);
> > > > > > > >      	if (entity) {
> > > > > > > >      		struct dma_fence *fence;
> > > > > > > >      		struct drm_sched_fence *s_fence;
> > > > > > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > >      		sched_job = drm_sched_entity_pop_job(entity);
> > > > > > > >      		if (!sched_job) {
> > > > > > > >      			complete_all(&entity->entity_idle);
> > > > > > > > -			if (!cleanup_job)
> > > > > > > > -				return;	/* No more work */
> > > > > > > > -			goto again;
> > > > > > > > +			return;	/* No more work */
> > > > > > > >      		}
> > > > > > > >      		s_fence = sched_job->s_fence;
> > > > > > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > >      		}
> > > > > > > >      		wake_up(&sched->job_scheduled);
> > > > > > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > > > > > >      	}
> > > > > > > > -
> > > > > > > > -again:
> > > > > > > > -	drm_sched_submit_queue(sched);
> > > > > > > >      }
> > > > > > > >      /**
> > > > > > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > > > > > >      	spin_lock_init(&sched->job_list_lock);
> > > > > > > >      	atomic_set(&sched->hw_rq_count, 0);
> > > > > > > >      	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > > > > > -	INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > > > > > +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > > > > > +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > > > > > >      	atomic_set(&sched->_score, 0);
> > > > > > > >      	atomic64_set(&sched->job_id_count, 0);
> > > > > > > >      	sched->pause_submit = false;
> > > > > > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > > > > > >      void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > > > > > >      {
> > > > > > > >      	WRITE_ONCE(sched->pause_submit, true);
> > > > > > > > -	cancel_work_sync(&sched->work_submit);
> > > > > > > > +	cancel_work_sync(&sched->work_run_job);
> > > > > > > > +	cancel_work_sync(&sched->work_free_job);
> > > > > > > >      }
> > > > > > > >      EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > >      void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > > > > > >      {
> > > > > > > >      	WRITE_ONCE(sched->pause_submit, false);
> > > > > > > > -	queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > +	queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > +	queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > >      }
> > > > > > > >      EXPORT_SYMBOL(drm_sched_submit_start);
> > > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > > > index 04eec2d7635f..fbc083a92757 100644
> > > > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > > > > > >       *                 finished.
> > > > > > > >       * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > > > > > >       * @job_id_count: used to assign unique id to the each job.
> > > > > > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > > > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > > > > > >       * @timeout_wq: workqueue used to queue @work_tdr
> > > > > > > > - * @work_submit: schedules jobs and cleans up entities
> > > > > > > > + * @work_run_job: schedules jobs
> > > > > > > > + * @work_free_job: cleans up jobs
> > > > > > > >       * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > > > > > >       *            timeout interval is over.
> > > > > > > >       * @pending_list: the list of jobs which are currently in the job queue.
> > > > > > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > > > > > >      	atomic64_t			job_id_count;
> > > > > > > >      	struct workqueue_struct		*submit_wq;
> > > > > > > >      	struct workqueue_struct		*timeout_wq;
> > > > > > > > -	struct work_struct		work_submit;
> > > > > > > > +	struct work_struct		work_run_job;
> > > > > > > > +	struct work_struct		work_free_job;
> > > > > > > >      	struct delayed_work		work_tdr;
> > > > > > > >      	struct list_head		pending_list;
> > > > > > > >      	spinlock_t			job_list_lock;
> 
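For illustration, a minimal, self-contained sketch (not from the series; the toy_* names are hypothetical) of the workqueue property this split depends on: on a plain workqueue the run_job and free_job items may execute concurrently, while an ordered workqueue keeps them serialized as before.

/*
 * Editor's sketch, not part of the series: shows how the two split work
 * items behave depending on the submit workqueue that is used. With
 * alloc_workqueue() the run_job and free_job items may run in parallel;
 * an ordered workqueue executes at most one item at a time and therefore
 * preserves the old single-item serialization.
 */
#include <linux/errno.h>
#include <linux/workqueue.h>

struct toy_sched {
	struct workqueue_struct *submit_wq;
	struct work_struct work_run_job;
	struct work_struct work_free_job;
};

static void toy_run_job(struct work_struct *w)
{
	/* would call the run_job backend op here */
}

static void toy_free_job(struct work_struct *w)
{
	/* would call the free_job backend op here */
}

static int toy_sched_init(struct toy_sched *s, bool ordered)
{
	s->submit_wq = ordered ?
		alloc_ordered_workqueue("toy-submit", 0) :
		alloc_workqueue("toy-submit", 0, 0);
	if (!s->submit_wq)
		return -ENOMEM;

	INIT_WORK(&s->work_run_job, toy_run_job);
	INIT_WORK(&s->work_free_job, toy_free_job);

	/* Both items target the same workqueue, as in the patch. */
	queue_work(s->submit_wq, &s->work_run_job);
	queue_work(s->submit_wq, &s->work_free_job);
	return 0;
}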

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-23 15:24                 ` Matthew Brost
@ 2023-08-23 15:41                   ` Alex Deucher
  2023-08-23 17:26                     ` [Intel-xe] " Rodrigo Vivi
  0 siblings, 1 reply; 80+ messages in thread
From: Alex Deucher @ 2023-08-23 15:41 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, Christian König, faith.ekstrand

On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost <matthew.brost@intel.com> wrote:
>
> On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
> > On 23.08.23 at 05:27, Matthew Brost wrote:
> > > [SNIP]
> > > > That is exactly what I want to avoid, tying the TDR to the job is what some
> > > > AMD engineers pushed for because it looked like a simple solution and made
> > > > the whole thing similar to what Windows does.
> > > >
> > > > This turned the previous relatively clean scheduler and TDR design into a
> > > > complete nightmare. The job contains quite a bunch of things which are not
> > > > necessarily available after the application which submitted the job is torn
> > > > down.
> > > >
> > > Agree, the TDR shouldn't be accessing anything application specific,
> > > rather just the internal job state required to tear the job down on the
> > > hardware.
> > > > So what happens is that you either have stale pointers in the TDR which can
> > > > go boom extremely easily or we somehow find a way to keep the necessary
> > > I have not experienced the TDR going boom in Xe.
> > >
> > > > structures (which include struct thread_info and struct file for this driver
> > > > connection) alive until all submissions are completed.
> > > >
> > > In Xe we keep everything alive until all submissions are completed. By
> > > everything I mean the drm job, entity, scheduler, and VM via a reference
> > > counting scheme. All of these structures are just kernel state which can
> > > safely be accessed even if the application has been killed.
> >
> > Yeah, but that might just not be such a good idea from memory management
> > point of view.
> >
> > When you (for example) kill a process, all resources from that process should
> > at least be queued to be freed more or less immediately.
> >
>
> We do this, the TDR kicks jobs off the hardware as fast as the hw
> interface allows and signals all pending hw fences immediately after.
> free_job is then immediately called and the reference count goes to
> zero. I think max time for all of this to occur is a handful of ms.
>
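For illustration, a minimal sketch of the kind of reference-counting scheme described above, assuming hypothetical toy_* types (this is not the Xe code): the job pins its queue and VM at creation and the final kref_put from free_job releases them.

/*
 * Editor's sketch of a job-lifetime reference counting scheme; all names
 * are hypothetical and simplified, not taken from Xe.
 */
#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct toy_exec_queue;			/* entity + scheduler wrapper */
struct toy_vm;				/* address space object */

void toy_exec_queue_put(struct toy_exec_queue *q);
void toy_vm_put(struct toy_vm *vm);

struct toy_job {
	struct kref refcount;
	struct toy_exec_queue *q;	/* reference taken at job creation */
	struct toy_vm *vm;		/* reference taken at job creation */
};

static void toy_job_release(struct kref *ref)
{
	struct toy_job *job = container_of(ref, struct toy_job, refcount);

	/* Drop the references pinned for the job's lifetime, then free it. */
	toy_exec_queue_put(job->q);
	toy_vm_put(job->vm);
	kfree(job);
}

/* Called from the scheduler's free_job callback once the job is done. */
void toy_job_put(struct toy_job *job)
{
	kref_put(&job->refcount, toy_job_release);
}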
> > What Linux is doing for other I/O operations is to keep the relevant pages
> > alive until the I/O operation is completed, but for GPUs that usually means
> > keeping most of the memory of the process alive and that in turn is really
> > not something you can do.
> >
> > You can of course do this if your driver has a reliable way of killing your
> > submissions and freeing resources in a reasonable amount of time. This
> > should then be done in the flush callback.
> >
>
> 'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
> and think that function doesn't even work for what you describe. It flushes
> the spsc queue, but what about jobs on the hardware, how do those get
> killed?
>
> As stated, we do it via the TDR, which is a rather clean design and fits with
> our reference counting scheme.
>
> > > If we need to tear down on demand we just set the TDR to a minimum value and
> > > it kicks the jobs off the hardware, gracefully cleans everything up and
> > > drops all references. This is a benefit of the 1 to 1 relationship, not
> > > sure if this works with how AMDGPU uses the scheduler.
> > >
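For illustration, a rough sketch of what "set the TDR to a minimum value" could look like (a hypothetical helper, not the one added later in this series), assuming it is acceptable to poke the scheduler's timeout work directly:

/*
 * Editor's sketch, hypothetical helper: force the scheduler's TDR to run
 * as soon as possible so pending jobs are kicked off the hardware and the
 * normal timeout/cleanup path tears everything down.
 */
#include <drm/gpu_scheduler.h>
#include <linux/workqueue.h>

static void toy_sched_fire_tdr_now(struct drm_gpu_scheduler *sched)
{
	/* Re-arm the delayed timeout work with a zero delay. */
	mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
}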
> > > > Delaying application tear down is also not an option because then you run
> > > > into massive trouble with the OOM killer (or more generally OOM handling).
> > > > See what we do in drm_sched_entity_flush() as well.
> > > >
> > > Not an issue for Xe, we never call drm_sched_entity_flush as our
> > > referencing counting scheme is all jobs are finished before we attempt
> > > to tear down entity / scheduler.
> >
> > I don't think you can do that upstream. Calling drm_sched_entity_flush() is
> > a must have from your flush callback for the file descriptor.
> >
>
> Again, 'flush callback'? What are you referring to?
>
> And why does drm_sched_entity_flush need to be called? It doesn't seem to
> do anything useful.
>
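For context, the "flush callback" being referred to here is the file_operations .flush hook of the driver's DRM file descriptor. A minimal sketch, assuming a hypothetical per-file toy_file_priv with a single entity, of calling drm_sched_entity_flush() from there:

/*
 * Editor's sketch of the "flush callback": the driver's file_operations
 * .flush hook draining the scheduler entities owned by the file. The
 * toy_* names and the one-entity-per-file layout are hypothetical.
 */
#include <drm/drm_file.h>
#include <drm/gpu_scheduler.h>
#include <linux/fs.h>
#include <linux/jiffies.h>

struct toy_file_priv {
	struct drm_sched_entity entity;
};

static int toy_drm_flush(struct file *f, fl_owner_t id)
{
	struct drm_file *file_priv = f->private_data;
	struct toy_file_priv *fpriv = file_priv->driver_priv;

	/* Wait, with an arbitrary timeout, for queued jobs to drain. */
	drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(3000));
	return 0;
}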
> > Unless you have some other method for killing your submissions this would
> > give a path for a denial of service attack vector when the Xe driver is in
> > use.
> >
>
> Yes, once the TDR fires it disallows all new submissions at the exec
> IOCTL and flushes any pending submissions as fast as possible.
>
> > > > Since adding the TDR support we have exercised this thoroughly over the
> > > > last two or three years or so. And to sum it up I would really like to get
> > > > away from this mess again.
> > > >
> > > > Compared to that what i915 does is actually rather clean I think.
> > > >
> > > Not even close, resets were a nightmare in the i915 (I spent years
> > > trying to get this right and it probably still doesn't completely work) and
> > > in Xe we basically got it right on the first attempt.
> > >
> > > > >    Also in Xe some of the
> > > > > things done in free_job cannot be done from an IRQ context, hence calling
> > > > > this from the scheduler worker is rather helpful.
> > > > Well putting things for cleanup into a workitem doesn't sound like
> > > > something hard.
> > > >
> > > That is exactly what we doing in the scheduler with the free_job
> > > workitem.
> >
> > Yeah, but I think that doing it in the scheduler and not the driver is
> > problematic.
> >
>
> Disagree, a common cleanup callback from a non-IRQ context IMO is a good
> design rather than each driver possibly having its own worker for
> cleanup.
>
> > For the scheduler it shouldn't care about the job any more as soon as the
> > driver takes over.
> >
>
> This is a massive rewrite for all users of the DRM scheduler, I'm saying
> for Xe what you are suggesting makes little to no sense.
>
> I'd like other users of the DRM scheduler to chime in on what you are
> proposing. The scope of this change affects 8-ish drivers and would
> require buy-in from each of the stakeholders. I certainly can't change all of
> these drivers as I don't feel comfortable in all of those code bases, nor
> do I have hardware to test all of these drivers.
>
> > >
> > > > Question is what do you really need for TDR which is not inside the hardware
> > > > fence?
> > > >
> > > A reference to the entity to be able to kick the job off the hardware.
> > > A reference to the entity, job, and VM for error capture.
> > >
> > > We also need a reference to the job for recovery after a GPU reset so
> > > run_job can be called again for innocent jobs.
> >
> > Well exactly that's what I'm massively pushing back on. Letting the scheduler
> > call run_job() for the same job again is *NOT* something you can actually
> > do.
> >
>
> But lots of drivers do this already and the DRM scheduler documents
> this.
>
> > This pretty clearly violates some of the dma_fence constraints and has caused
> > massive headaches for me already.
> >
>
> Seems to work fine in Xe.
>
> > What you can do is to do this inside your driver, e.g. take the jobs and
> > push them again to the hw ring or just tell the hw to start executing again
> > from a previous position.
> >
>
> Again, this is now a massive rewrite of many drivers.
>
> > BTW that re-submitting of jobs seems to be a no-go from userspace
> > perspective as well. Take a look at the Vulkan spec for that, at least Marek
> > pretty much pointed out that we should absolutely not do this inside the
> > kernel.
> >
>
> Yes, if the job causes the hang, we ban the queue. Typically only
> per-entity (queue) resets are done in Xe, but occasionally device-level
> resets are done (issues with hardware), in which case run_job is called
> again for innocent jobs / entities.

If the engine is reset and the job was already executing, how can you
determine that it's in a good state to resubmit?  What if some
internal fence or semaphore in memory used by the logic in the command
buffer has been signaled already and then you resubmit the job and it
now starts executing with different input state?

Alex
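For illustration, a tiny userspace-style model (hypothetical, not GPU code) of the hazard Alex describes: a command stream that waits on an in-memory semaphore and then consumes a value produces a different result on a naive replay if memory has moved on since the interrupted first run.

/*
 * Editor's sketch modelling the resubmission hazard in plain C: replaying
 * the same "commands" after the shared memory has changed yields a
 * different result than the interrupted first execution would have.
 */
#include <stdint.h>
#include <stdio.h>

struct mem_state {
	uint32_t sem;	/* signalled by another queue/engine */
	uint32_t input;	/* payload the job consumes */
};

/* Stand-in for a command buffer: wait on the semaphore, consume input. */
static void toy_cmd_buffer(const struct mem_state *m)
{
	if (!m->sem)
		return;	/* would spin/wait on real hardware */
	printf("consumed input %u\n", m->input);
}

int main(void)
{
	struct mem_state m = { .sem = 1, .input = 42 };

	toy_cmd_buffer(&m);	/* first (interrupted) run saw input 42 */

	/* reset happens here; the signalling side reuses the slots */
	m.input = 7;

	toy_cmd_buffer(&m);	/* naive replay: same commands, input is now 7 */
	return 0;
}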

>
> > The generally right approach seems to be to cleanly signal to userspace that
> > something bad happened and that userspace then needs to submit things again
> > even for innocent jobs.
> >
>
> I disagree that innocent jobs should be banned. What you are suggesting
> is that if a device reset needs to be done we kill / ban every user space
> queue. That seems like overkill. I'm not seeing where that is stated in this
> doc [1]; it seems to imply that only jobs that are stuck result in bans.
>
> Matt
>
> [1] https://patchwork.freedesktop.org/patch/553465/?series=119883&rev=3
>
> > Regards,
> > Christian.
> >
> > >
> > > All of this leads to believe we need to stick with the design.
> > >
> > > Matt
> > >
> > > > Regards,
> > > > Christian.
> > > >
> > > > > The HW fence can live for longer as it can be installed in dma-resv
> > > > > slots, syncobjs, etc... If the job and hw fence are combined we are now
> > > > > holding on to the memory for longer and perhaps at the mercy of the
> > > > > user. We also run the risk of the final put being done from an IRQ
> > > > > context, which again won't work in Xe as it is currently coded. Lastly 2
> > > > > jobs from the same scheduler could do the final put in parallel, so
> > > > > rather than having free_job serialized by the worker now multiple jobs
> > > > > are freeing themselves at the same time. This might not be an issue but
> > > > > adds another level of raciness that needs to be accounted for. None of
> > > > > this sounds desirable to me.
> > > > >
> > > > > FWIW what you are suggesting sounds like how the i915 did things
> > > > > (i915_request and hw fence in 1 memory alloc) and that turned out to be
> > > > > a huge mess. As a rule of thumb I generally do the opposite of whatever
> > > > > the i915 did.
> > > > >
> > > > > Matt
> > > > >
> > > > > > Christian.
> > > > > >
> > > > > > > Matt
> > > > > > >
> > > > > > > > All the lifetime issues we had came from ignoring this fact and I think we
> > > > > > > > should push for fixing this design up again.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > >
> > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > ---
> > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > > > > > > >      include/drm/gpu_scheduler.h            |   8 +-
> > > > > > > > >      2 files changed, 106 insertions(+), 39 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > index cede47afc800..b67469eac179 100644
> > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > > > > > > >       * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > > > > > > >       *
> > > > > > > > >       * @rq: scheduler run queue to check.
> > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > >       *
> > > > > > > > >       * Try to find a ready entity, returns NULL if none found.
> > > > > > > > >       */
> > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > > >      {
> > > > > > > > >         struct drm_sched_entity *entity;
> > > > > > > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > >         if (entity) {
> > > > > > > > >                 list_for_each_entry_continue(entity, &rq->entities, list) {
> > > > > > > > >                         if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > -                               rq->current_entity = entity;
> > > > > > > > > -                               reinit_completion(&entity->entity_idle);
> > > > > > > > > +                               if (dequeue) {
> > > > > > > > > +                                       rq->current_entity = entity;
> > > > > > > > > +                                       reinit_completion(&entity->entity_idle);
> > > > > > > > > +                               }
> > > > > > > > >                                 spin_unlock(&rq->lock);
> > > > > > > > >                                 return entity;
> > > > > > > > >                         }
> > > > > > > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > >         list_for_each_entry(entity, &rq->entities, list) {
> > > > > > > > >                 if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > -                       rq->current_entity = entity;
> > > > > > > > > -                       reinit_completion(&entity->entity_idle);
> > > > > > > > > +                       if (dequeue) {
> > > > > > > > > +                               rq->current_entity = entity;
> > > > > > > > > +                               reinit_completion(&entity->entity_idle);
> > > > > > > > > +                       }
> > > > > > > > >                         spin_unlock(&rq->lock);
> > > > > > > > >                         return entity;
> > > > > > > > >                 }
> > > > > > > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > >       * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > > > > > > >       *
> > > > > > > > >       * @rq: scheduler run queue to check.
> > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > >       *
> > > > > > > > >       * Find oldest waiting ready entity, returns NULL if none found.
> > > > > > > > >       */
> > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > > >      {
> > > > > > > > >         struct rb_node *rb;
> > > > > > > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > >                 entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > > > > > >                 if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > -                       rq->current_entity = entity;
> > > > > > > > > -                       reinit_completion(&entity->entity_idle);
> > > > > > > > > +                       if (dequeue) {
> > > > > > > > > +                               rq->current_entity = entity;
> > > > > > > > > +                               reinit_completion(&entity->entity_idle);
> > > > > > > > > +                       }
> > > > > > > > >                         break;
> > > > > > > > >                 }
> > > > > > > > >         }
> > > > > > > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > >      }
> > > > > > > > >      /**
> > > > > > > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > > > > > > + * drm_sched_run_job_queue - queue job submission
> > > > > > > > >       * @sched: scheduler instance
> > > > > > > > >       */
> > > > > > > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > >      {
> > > > > > > > >         if (!READ_ONCE(sched->pause_submit))
> > > > > > > > > -               queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > > +               queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +static struct drm_sched_entity *
> > > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > > > > > > +
> > > > > > > > > +/**
> > > > > > > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > > > > > > + * @sched: scheduler instance
> > > > > > > > > + */
> > > > > > > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > > +{
> > > > > > > > > +       if (drm_sched_select_entity(sched, false))
> > > > > > > > > +               drm_sched_run_job_queue(sched);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +/**
> > > > > > > > > + * drm_sched_free_job_queue - queue free job
> > > > > > > > > + *
> > > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > > + */
> > > > > > > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > +{
> > > > > > > > > +       if (!READ_ONCE(sched->pause_submit))
> > > > > > > > > +               queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +/**
> > > > > > > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > > > > > > + *
> > > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > > + */
> > > > > > > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > > +{
> > > > > > > > > +       struct drm_sched_job *job;
> > > > > > > > > +
> > > > > > > > > +       spin_lock(&sched->job_list_lock);
> > > > > > > > > +       job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > > > +                                      struct drm_sched_job, list);
> > > > > > > > > +       if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > > > > > > +               drm_sched_free_job_queue(sched);
> > > > > > > > > +       spin_unlock(&sched->job_list_lock);
> > > > > > > > >      }
> > > > > > > > >      /**
> > > > > > > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > > > > > > >         dma_fence_get(&s_fence->finished);
> > > > > > > > >         drm_sched_fence_finished(s_fence, result);
> > > > > > > > >         dma_fence_put(&s_fence->finished);
> > > > > > > > > -       drm_sched_submit_queue(sched);
> > > > > > > > > +       drm_sched_free_job_queue(sched);
> > > > > > > > >      }
> > > > > > > > >      /**
> > > > > > > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > >      void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > >      {
> > > > > > > > >         if (drm_sched_can_queue(sched))
> > > > > > > > > -               drm_sched_submit_queue(sched);
> > > > > > > > > +               drm_sched_run_job_queue(sched);
> > > > > > > > >      }
> > > > > > > > >      /**
> > > > > > > > >       * drm_sched_select_entity - Select next entity to process
> > > > > > > > >       *
> > > > > > > > >       * @sched: scheduler instance
> > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > >       *
> > > > > > > > >       * Returns the entity to process or NULL if none are found.
> > > > > > > > >       */
> > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > > > > > > >      {
> > > > > > > > >         struct drm_sched_entity *entity;
> > > > > > > > >         int i;
> > > > > > > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > >         /* Kernel run queue has higher priority than normal run queue*/
> > > > > > > > >         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > > > > > > >                 entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > > > > > > -                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > > > > > > -                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > > > > > > +                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > > > > > > +                                                       dequeue) :
> > > > > > > > > +                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > > > > > > +                                                     dequeue);
> > > > > > > > >                 if (entity)
> > > > > > > > >                         break;
> > > > > > > > >         }
> > > > > > > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > > > > > > >      EXPORT_SYMBOL(drm_sched_pick_best);
> > > > > > > > >      /**
> > > > > > > > > - * drm_sched_main - main scheduler thread
> > > > > > > > > + * drm_sched_free_job_work - worker to call free_job
> > > > > > > > >       *
> > > > > > > > > - * @param: scheduler instance
> > > > > > > > > + * @w: free job work
> > > > > > > > >       */
> > > > > > > > > -static void drm_sched_main(struct work_struct *w)
> > > > > > > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > > > > > > >      {
> > > > > > > > >         struct drm_gpu_scheduler *sched =
> > > > > > > > > -               container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > > > > > > -       struct drm_sched_entity *entity;
> > > > > > > > > +               container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > > > > > > >         struct drm_sched_job *cleanup_job;
> > > > > > > > > -       int r;
> > > > > > > > >         if (READ_ONCE(sched->pause_submit))
> > > > > > > > >                 return;
> > > > > > > > >         cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > > > > > > -       entity = drm_sched_select_entity(sched);
> > > > > > > > > +       if (cleanup_job) {
> > > > > > > > > +               sched->ops->free_job(cleanup_job);
> > > > > > > > > +
> > > > > > > > > +               drm_sched_free_job_queue_if_ready(sched);
> > > > > > > > > +               drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > > +       }
> > > > > > > > > +}
> > > > > > > > > -       if (!entity && !cleanup_job)
> > > > > > > > > -               return; /* No more work */
> > > > > > > > > +/**
> > > > > > > > > + * drm_sched_run_job_work - worker to call run_job
> > > > > > > > > + *
> > > > > > > > > + * @w: run job work
> > > > > > > > > + */
> > > > > > > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > > > > > > +{
> > > > > > > > > +       struct drm_gpu_scheduler *sched =
> > > > > > > > > +               container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > > > > > +       struct drm_sched_entity *entity;
> > > > > > > > > +       int r;
> > > > > > > > > -       if (cleanup_job)
> > > > > > > > > -               sched->ops->free_job(cleanup_job);
> > > > > > > > > +       if (READ_ONCE(sched->pause_submit))
> > > > > > > > > +               return;
> > > > > > > > > +       entity = drm_sched_select_entity(sched, true);
> > > > > > > > >         if (entity) {
> > > > > > > > >                 struct dma_fence *fence;
> > > > > > > > >                 struct drm_sched_fence *s_fence;
> > > > > > > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > > >                 sched_job = drm_sched_entity_pop_job(entity);
> > > > > > > > >                 if (!sched_job) {
> > > > > > > > >                         complete_all(&entity->entity_idle);
> > > > > > > > > -                       if (!cleanup_job)
> > > > > > > > > -                               return; /* No more work */
> > > > > > > > > -                       goto again;
> > > > > > > > > +                       return; /* No more work */
> > > > > > > > >                 }
> > > > > > > > >                 s_fence = sched_job->s_fence;
> > > > > > > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > > >                 }
> > > > > > > > >                 wake_up(&sched->job_scheduled);
> > > > > > > > > +               drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > >         }
> > > > > > > > > -
> > > > > > > > > -again:
> > > > > > > > > -       drm_sched_submit_queue(sched);
> > > > > > > > >      }
> > > > > > > > >      /**
> > > > > > > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > > > > > > >         spin_lock_init(&sched->job_list_lock);
> > > > > > > > >         atomic_set(&sched->hw_rq_count, 0);
> > > > > > > > >         INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > > > > > > -       INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > > > > > > +       INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > > > > > > +       INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > > > > > > >         atomic_set(&sched->_score, 0);
> > > > > > > > >         atomic64_set(&sched->job_id_count, 0);
> > > > > > > > >         sched->pause_submit = false;
> > > > > > > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > > > > > > >      void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > > > > > > >      {
> > > > > > > > >         WRITE_ONCE(sched->pause_submit, true);
> > > > > > > > > -       cancel_work_sync(&sched->work_submit);
> > > > > > > > > +       cancel_work_sync(&sched->work_run_job);
> > > > > > > > > +       cancel_work_sync(&sched->work_free_job);
> > > > > > > > >      }
> > > > > > > > >      EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > >      void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > > > > > > >      {
> > > > > > > > >         WRITE_ONCE(sched->pause_submit, false);
> > > > > > > > > -       queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > > +       queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > > +       queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > >      }
> > > > > > > > >      EXPORT_SYMBOL(drm_sched_submit_start);
> > > > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > > > > index 04eec2d7635f..fbc083a92757 100644
> > > > > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > > > > > > >       *                 finished.
> > > > > > > > >       * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > > > > > > >       * @job_id_count: used to assign unique id to the each job.
> > > > > > > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > > > > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > > > > > > >       * @timeout_wq: workqueue used to queue @work_tdr
> > > > > > > > > - * @work_submit: schedules jobs and cleans up entities
> > > > > > > > > + * @work_run_job: schedules jobs
> > > > > > > > > + * @work_free_job: cleans up jobs
> > > > > > > > >       * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > > > > > > >       *            timeout interval is over.
> > > > > > > > >       * @pending_list: the list of jobs which are currently in the job queue.
> > > > > > > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > > > > > > >         atomic64_t                      job_id_count;
> > > > > > > > >         struct workqueue_struct         *submit_wq;
> > > > > > > > >         struct workqueue_struct         *timeout_wq;
> > > > > > > > > -       struct work_struct              work_submit;
> > > > > > > > > +       struct work_struct              work_run_job;
> > > > > > > > > +       struct work_struct              work_free_job;
> > > > > > > > >         struct delayed_work             work_tdr;
> > > > > > > > >         struct list_head                pending_list;
> > > > > > > > >         spinlock_t                      job_list_lock;
> >

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Intel-xe] [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-23 15:41                   ` Alex Deucher
@ 2023-08-23 17:26                     ` Rodrigo Vivi
  2023-08-23 23:12                       ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Rodrigo Vivi @ 2023-08-23 17:26 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Matthew Brost, robdclark, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, Christian König, luben.tuikov,
	donald.robson, boris.brezillon, intel-xe, faith.ekstrand

On Wed, Aug 23, 2023 at 11:41:19AM -0400, Alex Deucher wrote:
> On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost <matthew.brost@intel.com> wrote:
> >
> > On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
> > > On 23.08.23 at 05:27, Matthew Brost wrote:
> > > > [SNIP]
> > > > > That is exactly what I want to avoid, tying the TDR to the job is what some
> > > > > AMD engineers pushed for because it looked like a simple solution and made
> > > > > the whole thing similar to what Windows does.
> > > > >
> > > > > This turned the previous relatively clean scheduler and TDR design into a
> > > > > complete nightmare. The job contains quite a bunch of things which are not
> > > > > necessarily available after the application which submitted the job is torn
> > > > > down.
> > > > >
> > > > Agree, the TDR shouldn't be accessing anything application specific,
> > > > rather just the internal job state required to tear the job down on the
> > > > hardware.
> > > > > So what happens is that you either have stale pointers in the TDR which can
> > > > > go boom extremely easily or we somehow find a way to keep the necessary
> > > > I have not experienced the TDR going boom in Xe.
> > > >
> > > > > structures (which include struct thread_info and struct file for this driver
> > > > > connection) alive until all submissions are completed.
> > > > >
> > > > In Xe we keep everything alive until all submissions are completed. By
> > > > everything I mean the drm job, entity, scheduler, and VM via a reference
> > > > counting scheme. All of these structures are just kernel state which can
> > > > safely be accessed even if the application has been killed.
> > >
> > > Yeah, but that might just not be such a good idea from memory management
> > > point of view.
> > >
> > > When you (for example) kill a process, all resources from that process should
> > > at least be queued to be freed more or less immediately.
> > >
> >
> > We do this, the TDR kicks jobs off the hardware as fast as the hw
> > interface allows and signals all pending hw fences immediately after.
> > free_job is then immediately called and the reference count goes to
> > zero. I think max time for all of this to occur is a handful of ms.
> >
> > > What Linux is doing for other I/O operations is to keep the relevant pages
> > > alive until the I/O operation is completed, but for GPUs that usually means
> > > keeping most of the memory of the process alive and that in turn is really
> > > not something you can do.
> > >
> > > You can of course do this if your driver has a reliable way of killing your
> > > submissions and freeing resources in a reasonable amount of time. This
> > > should then be done in the flush callback.
> > >
> >
> > 'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
> > and think that function doesn't even work for what you describe. It flushes
> > the spsc queue, but what about jobs on the hardware, how do those get
> > killed?
> >
> > As stated, we do it via the TDR, which is a rather clean design and fits with
> > our reference counting scheme.
> >
> > > > If we need to tear down on demand we just set the TDR to a minimum value and
> > > > it kicks the jobs off the hardware, gracefully cleans everything up and
> > > > drops all references. This is a benefit of the 1 to 1 relationship, not
> > > > sure if this works with how AMDGPU uses the scheduler.
> > > >
> > > > > Delaying application tear down is also not an option because then you run
> > > > > into massive trouble with the OOM killer (or more generally OOM handling).
> > > > > See what we do in drm_sched_entity_flush() as well.
> > > > >
> > > > Not an issue for Xe, we never call drm_sched_entity_flush as our
> > > > referencing counting scheme is all jobs are finished before we attempt
> > > > to tear down entity / scheduler.
> > >
> > > I don't think you can do that upstream. Calling drm_sched_entity_flush() is
> > > a must have from your flush callback for the file descriptor.
> > >
> >
> > Again, 'flush callback'? What are you referring to?
> >
> > And why does drm_sched_entity_flush need to be called? It doesn't seem to
> > do anything useful.
> >
> > > Unless you have some other method for killing your submissions this would
> > > give a path for a denial of service attack vector when the Xe driver is in
> > > use.
> > >
> >
> > Yes, once the TDR fires it disallows all new submissions at the exec
> > IOCTL and flushes any pending submissions as fast as possible.
> >
> > > > > Since adding the TDR support we have exercised this thoroughly over the
> > > > > last two or three years or so. And to sum it up I would really like to get
> > > > > away from this mess again.
> > > > >
> > > > > Compared to that what i915 does is actually rather clean I think.
> > > > >
> > > > Not even close, resets were a nightmare in the i915 (I spent years
> > > > trying to get this right and it probably still doesn't completely work) and
> > > > in Xe we basically got it right on the first attempt.
> > > >
> > > > > >    Also in Xe some of the
> > > > > > things done in free_job cannot be done from an IRQ context, hence calling
> > > > > > this from the scheduler worker is rather helpful.
> > > > > Well putting things for cleanup into a workitem doesn't sound like
> > > > > something hard.
> > > > >
> > > > That is exactly what we doing in the scheduler with the free_job
> > > > workitem.
> > >
> > > Yeah, but I think that doing it in the scheduler and not the driver is
> > > problematic.

Christian, I do see your point on simply getting rid of the free_job callback here
and then using the fence with a driver-owned workqueue for the housekeeping. But I
wonder if starting with this patch, as a clear separation of that, is not a step
forward that could be cleaned up in a follow-up!?

Matt, why exactly do we need the separation in this patch? The commit message tells
what it is doing and that it is aligned with the design, but it is not clear on why
exactly we need this right now. Especially if, in the end, what we want is exactly
to keep the submit_wq to ensure the serialization of the operations you mentioned.
I mean, could we simply drop this patch, then work on a follow-up later and
investigate Christian's suggestion when we are in-tree?
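For illustration, a minimal sketch of the driver-side alternative under discussion, assuming hypothetical toy_* names: the driver frees its job from its own work item, kicked by a dma_fence callback, instead of relying on the scheduler's free_job worker.

/*
 * Editor's sketch of the driver-side alternative being debated: cleanup
 * runs from a driver-owned work item kicked by a dma_fence callback, so
 * the scheduler never has to call back into the driver to free the job.
 * All toy_* names are hypothetical.
 */
#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct toy_job {
	struct dma_fence_cb cb;
	struct work_struct free_work;
	/* driver state to tear down would live here */
};

static void toy_job_free_work(struct work_struct *w)
{
	struct toy_job *job = container_of(w, struct toy_job, free_work);

	/* Non-IRQ context: safe to take mutexes, put refs, etc. */
	kfree(job);
}

static void toy_job_fence_cb(struct dma_fence *f, struct dma_fence_cb *cb)
{
	struct toy_job *job = container_of(cb, struct toy_job, cb);

	/* Fence callbacks may run in IRQ context, so defer to a worker. */
	queue_work(system_wq, &job->free_work);
}

static void toy_job_arm_cleanup(struct toy_job *job, struct dma_fence *hw_fence)
{
	INIT_WORK(&job->free_work, toy_job_free_work);
	/* If the fence has already signalled, queue the cleanup ourselves. */
	if (dma_fence_add_callback(hw_fence, &job->cb, toy_job_fence_cb))
		queue_work(system_wq, &job->free_work);
}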

> > >
> >
> > Disagree, a common cleanup callback from a non-IRQ context IMO is a good
> > design rather than each driver possibly having its own worker for
> > cleanup.
> >
> > > For the scheduler it shouldn't care about the job any more as soon as the
> > > driver takes over.
> > >
> >
> > This is a massive rewrite for all users of the DRM scheduler, I'm saying
> > for Xe what you are suggesting makes little to no sense.
> >
> > I'd like other users of the DRM scheduler to chime in on what you are
> > proposing. The scope of this change affects 8-ish drivers and would
> > require buy-in from each of the stakeholders. I certainly can't change all of
> > these drivers as I don't feel comfortable in all of those code bases, nor
> > do I have hardware to test all of these drivers.
> >
> > > >
> > > > > Question is what do you really need for TDR which is not inside the hardware
> > > > > fence?
> > > > >
> > > > A reference to the entity to be able to kick the job off the hardware.
> > > > A reference to the entity, job, and VM for error capture.
> > > >
> > > > We also need a reference to the job for recovery after a GPU reset so
> > > > run_job can be called again for innocent jobs.
> > >
> > > Well exactly that's what I'm massively pushing back on. Letting the scheduler
> > > call run_job() for the same job again is *NOT* something you can actually
> > > do.
> > >
> >
> > But lots of drivers do this already and the DRM scheduler documents
> > this.
> >
> > > This pretty clearly violates some of the dma_fence constraints and has caused
> > > massive headaches for me already.
> > >
> >
> > Seems to work fine in Xe.
> >
> > > What you can do is to do this inside your driver, e.g. take the jobs and
> > > push them again to the hw ring or just tell the hw to start executing again
> > > from a previous position.
> > >
> >
> > Again, this is now a massive rewrite of many drivers.
> >
> > > BTW that re-submitting of jobs seems to be a no-go from userspace
> > > perspective as well. Take a look at the Vulkan spec for that, at least Marek
> > > pretty much pointed out that we should absolutely not do this inside the
> > > kernel.
> > >
> >
> > Yes, if the job causes the hang, we ban the queue. Typically only
> > per-entity (queue) resets are done in Xe, but occasionally device-level
> > resets are done (issues with hardware), in which case run_job is called
> > again for innocent jobs / entities.
> 
> If the engine is reset and the job was already executing, how can you
> determine that it's in a good state to resubmit?  What if some
> internal fence or semaphore in memory used by the logic in the command
> buffer has been signaled already and then you resubmit the job and it
> now starts executing with different input state?

I believe we could set some more rules in the new robustness documentation:
https://lore.kernel.org/all/20230818200642.276735-1-andrealmeid@igalia.com/

For this robustness implementation i915 pinpoints the exact context that
was in execution when the GPU hung and only blames that one, although the
resubmission is up to user space. While on Xe we are blaming every
single context that was in the queue. So I'm actually confused about what
the innocent jobs are and who is calling for resubmission, if all of
them got banned and blamed.

> 
> Alex
> 
> >
> > > The generally right approach seems to be to cleanly signal to userspace that
> > > something bad happened and that userspace then needs to submit things again
> > > even for innocent jobs.
> > >
> >
> > I disagree that innocent jobs should be banned. What you are suggesting
> > is that if a device reset needs to be done we kill / ban every user space
> > queue. That seems like overkill. I'm not seeing where that is stated in this
> > doc [1]; it seems to imply that only jobs that are stuck result in bans.
> >
> > Matt
> >
> > [1] https://patchwork.freedesktop.org/patch/553465/?series=119883&rev=3
> >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > > All of this leads to believe we need to stick with the design.
> > > >
> > > > Matt
> > > >
> > > > > Regards,
> > > > > Christian.
> > > > >
> > > > > > The HW fence can live for longer as it can be installed in dma-resv
> > > > > > slots, syncobjs, etc... If the job and hw fence are combined we are now
> > > > > > holding on to the memory for longer and perhaps at the mercy of the
> > > > > > user. We also run the risk of the final put being done from an IRQ
> > > > > > context, which again won't work in Xe as it is currently coded. Lastly 2
> > > > > > jobs from the same scheduler could do the final put in parallel, so
> > > > > > rather than having free_job serialized by the worker now multiple jobs
> > > > > > are freeing themselves at the same time. This might not be an issue but
> > > > > > adds another level of raciness that needs to be accounted for. None of
> > > > > > this sounds desirable to me.
> > > > > >
> > > > > > FWIW what you are suggesting sounds like how the i915 did things
> > > > > > (i915_request and hw fence in 1 memory alloc) and that turned out to be
> > > > > > a huge mess. As a rule of thumb I generally do the opposite of whatever
> > > > > > the i915 did.
> > > > > >
> > > > > > Matt
> > > > > >
> > > > > > > Christian.
> > > > > > >
> > > > > > > > Matt
> > > > > > > >
> > > > > > > > > All the lifetime issues we had came from ignoring this fact and I think we
> > > > > > > > > should push for fixing this design up again.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Christian.
> > > > > > > > >
> > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > ---
> > > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > > > > > > > >      include/drm/gpu_scheduler.h            |   8 +-
> > > > > > > > > >      2 files changed, 106 insertions(+), 39 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > index cede47afc800..b67469eac179 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > > > > > > > >       * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > > > > > > > >       *
> > > > > > > > > >       * @rq: scheduler run queue to check.
> > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > >       *
> > > > > > > > > >       * Try to find a ready entity, returns NULL if none found.
> > > > > > > > > >       */
> > > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > > > >      {
> > > > > > > > > >         struct drm_sched_entity *entity;
> > > > > > > > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > >         if (entity) {
> > > > > > > > > >                 list_for_each_entry_continue(entity, &rq->entities, list) {
> > > > > > > > > >                         if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > -                               rq->current_entity = entity;
> > > > > > > > > > -                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > +                               if (dequeue) {
> > > > > > > > > > +                                       rq->current_entity = entity;
> > > > > > > > > > +                                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > +                               }
> > > > > > > > > >                                 spin_unlock(&rq->lock);
> > > > > > > > > >                                 return entity;
> > > > > > > > > >                         }
> > > > > > > > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > >         list_for_each_entry(entity, &rq->entities, list) {
> > > > > > > > > >                 if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > -                       rq->current_entity = entity;
> > > > > > > > > > -                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > +                       if (dequeue) {
> > > > > > > > > > +                               rq->current_entity = entity;
> > > > > > > > > > +                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > +                       }
> > > > > > > > > >                         spin_unlock(&rq->lock);
> > > > > > > > > >                         return entity;
> > > > > > > > > >                 }
> > > > > > > > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > >       * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > > > > > > > >       *
> > > > > > > > > >       * @rq: scheduler run queue to check.
> > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > >       *
> > > > > > > > > >       * Find oldest waiting ready entity, returns NULL if none found.
> > > > > > > > > >       */
> > > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > > > >      {
> > > > > > > > > >         struct rb_node *rb;
> > > > > > > > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > >                 entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > > > > > > >                 if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > -                       rq->current_entity = entity;
> > > > > > > > > > -                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > +                       if (dequeue) {
> > > > > > > > > > +                               rq->current_entity = entity;
> > > > > > > > > > +                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > +                       }
> > > > > > > > > >                         break;
> > > > > > > > > >                 }
> > > > > > > > > >         }
> > > > > > > > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > >      }
> > > > > > > > > >      /**
> > > > > > > > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > > > > > > > + * drm_sched_run_job_queue - queue job submission
> > > > > > > > > >       * @sched: scheduler instance
> > > > > > > > > >       */
> > > > > > > > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > >      {
> > > > > > > > > >         if (!READ_ONCE(sched->pause_submit))
> > > > > > > > > > -               queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > > > +               queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +static struct drm_sched_entity *
> > > > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > > > > > > > +
> > > > > > > > > > +/**
> > > > > > > > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > > > > > > > + * @sched: scheduler instance
> > > > > > > > > > + */
> > > > > > > > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > > > +{
> > > > > > > > > > +       if (drm_sched_select_entity(sched, false))
> > > > > > > > > > +               drm_sched_run_job_queue(sched);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +/**
> > > > > > > > > > + * drm_sched_free_job_queue - queue free job
> > > > > > > > > > + *
> > > > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > > > + */
> > > > > > > > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > +{
> > > > > > > > > > +       if (!READ_ONCE(sched->pause_submit))
> > > > > > > > > > +               queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +/**
> > > > > > > > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > > > > > > > + *
> > > > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > > > + */
> > > > > > > > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > > > +{
> > > > > > > > > > +       struct drm_sched_job *job;
> > > > > > > > > > +
> > > > > > > > > > +       spin_lock(&sched->job_list_lock);
> > > > > > > > > > +       job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > > > > +                                      struct drm_sched_job, list);
> > > > > > > > > > +       if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > > > > > > > +               drm_sched_free_job_queue(sched);
> > > > > > > > > > +       spin_unlock(&sched->job_list_lock);
> > > > > > > > > >      }
> > > > > > > > > >      /**
> > > > > > > > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > > > > > > > >         dma_fence_get(&s_fence->finished);
> > > > > > > > > >         drm_sched_fence_finished(s_fence, result);
> > > > > > > > > >         dma_fence_put(&s_fence->finished);
> > > > > > > > > > -       drm_sched_submit_queue(sched);
> > > > > > > > > > +       drm_sched_free_job_queue(sched);
> > > > > > > > > >      }
> > > > > > > > > >      /**
> > > > > > > > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > >      void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > >      {
> > > > > > > > > >         if (drm_sched_can_queue(sched))
> > > > > > > > > > -               drm_sched_submit_queue(sched);
> > > > > > > > > > +               drm_sched_run_job_queue(sched);
> > > > > > > > > >      }
> > > > > > > > > >      /**
> > > > > > > > > >       * drm_sched_select_entity - Select next entity to process
> > > > > > > > > >       *
> > > > > > > > > >       * @sched: scheduler instance
> > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > >       *
> > > > > > > > > >       * Returns the entity to process or NULL if none are found.
> > > > > > > > > >       */
> > > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > > > > > > > >      {
> > > > > > > > > >         struct drm_sched_entity *entity;
> > > > > > > > > >         int i;
> > > > > > > > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > > >         /* Kernel run queue has higher priority than normal run queue*/
> > > > > > > > > >         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > > > > > > > >                 entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > > > > > > > -                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > > > > > > > -                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > > > > > > > +                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > > > > > > > +                                                       dequeue) :
> > > > > > > > > > +                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > > > > > > > +                                                     dequeue);
> > > > > > > > > >                 if (entity)
> > > > > > > > > >                         break;
> > > > > > > > > >         }
> > > > > > > > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > > > > > > > >      EXPORT_SYMBOL(drm_sched_pick_best);
> > > > > > > > > >      /**
> > > > > > > > > > - * drm_sched_main - main scheduler thread
> > > > > > > > > > + * drm_sched_free_job_work - worker to call free_job
> > > > > > > > > >       *
> > > > > > > > > > - * @param: scheduler instance
> > > > > > > > > > + * @w: free job work
> > > > > > > > > >       */
> > > > > > > > > > -static void drm_sched_main(struct work_struct *w)
> > > > > > > > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > > > > > > > >      {
> > > > > > > > > >         struct drm_gpu_scheduler *sched =
> > > > > > > > > > -               container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > > > > > > > -       struct drm_sched_entity *entity;
> > > > > > > > > > +               container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > > > > > > > >         struct drm_sched_job *cleanup_job;
> > > > > > > > > > -       int r;
> > > > > > > > > >         if (READ_ONCE(sched->pause_submit))
> > > > > > > > > >                 return;
> > > > > > > > > >         cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > > > > > > > -       entity = drm_sched_select_entity(sched);
> > > > > > > > > > +       if (cleanup_job) {
> > > > > > > > > > +               sched->ops->free_job(cleanup_job);
> > > > > > > > > > +
> > > > > > > > > > +               drm_sched_free_job_queue_if_ready(sched);
> > > > > > > > > > +               drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > > > +       }
> > > > > > > > > > +}
> > > > > > > > > > -       if (!entity && !cleanup_job)
> > > > > > > > > > -               return; /* No more work */
> > > > > > > > > > +/**
> > > > > > > > > > + * drm_sched_run_job_work - worker to call run_job
> > > > > > > > > > + *
> > > > > > > > > > + * @w: run job work
> > > > > > > > > > + */
> > > > > > > > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > > > > > > > +{
> > > > > > > > > > +       struct drm_gpu_scheduler *sched =
> > > > > > > > > > +               container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > > > > > > +       struct drm_sched_entity *entity;
> > > > > > > > > > +       int r;
> > > > > > > > > > -       if (cleanup_job)
> > > > > > > > > > -               sched->ops->free_job(cleanup_job);
> > > > > > > > > > +       if (READ_ONCE(sched->pause_submit))
> > > > > > > > > > +               return;
> > > > > > > > > > +       entity = drm_sched_select_entity(sched, true);
> > > > > > > > > >         if (entity) {
> > > > > > > > > >                 struct dma_fence *fence;
> > > > > > > > > >                 struct drm_sched_fence *s_fence;
> > > > > > > > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > > > >                 sched_job = drm_sched_entity_pop_job(entity);
> > > > > > > > > >                 if (!sched_job) {
> > > > > > > > > >                         complete_all(&entity->entity_idle);
> > > > > > > > > > -                       if (!cleanup_job)
> > > > > > > > > > -                               return; /* No more work */
> > > > > > > > > > -                       goto again;
> > > > > > > > > > +                       return; /* No more work */
> > > > > > > > > >                 }
> > > > > > > > > >                 s_fence = sched_job->s_fence;
> > > > > > > > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > > > >                 }
> > > > > > > > > >                 wake_up(&sched->job_scheduled);
> > > > > > > > > > +               drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > > >         }
> > > > > > > > > > -
> > > > > > > > > > -again:
> > > > > > > > > > -       drm_sched_submit_queue(sched);
> > > > > > > > > >      }
> > > > > > > > > >      /**
> > > > > > > > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > > > > > > > >         spin_lock_init(&sched->job_list_lock);
> > > > > > > > > >         atomic_set(&sched->hw_rq_count, 0);
> > > > > > > > > >         INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > > > > > > > -       INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > > > > > > > +       INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > > > > > > > +       INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > > > > > > > >         atomic_set(&sched->_score, 0);
> > > > > > > > > >         atomic64_set(&sched->job_id_count, 0);
> > > > > > > > > >         sched->pause_submit = false;
> > > > > > > > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > > > > > > > >      void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > > > > > > > >      {
> > > > > > > > > >         WRITE_ONCE(sched->pause_submit, true);
> > > > > > > > > > -       cancel_work_sync(&sched->work_submit);
> > > > > > > > > > +       cancel_work_sync(&sched->work_run_job);
> > > > > > > > > > +       cancel_work_sync(&sched->work_free_job);
> > > > > > > > > >      }
> > > > > > > > > >      EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > > >      void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > > > > > > > >      {
> > > > > > > > > >         WRITE_ONCE(sched->pause_submit, false);
> > > > > > > > > > -       queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > > > +       queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > > > +       queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > > >      }
> > > > > > > > > >      EXPORT_SYMBOL(drm_sched_submit_start);
> > > > > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > > > > > index 04eec2d7635f..fbc083a92757 100644
> > > > > > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > > > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > > > > > > > >       *                 finished.
> > > > > > > > > >       * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > > > > > > > >       * @job_id_count: used to assign unique id to the each job.
> > > > > > > > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > > > > > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > > > > > > > >       * @timeout_wq: workqueue used to queue @work_tdr
> > > > > > > > > > - * @work_submit: schedules jobs and cleans up entities
> > > > > > > > > > + * @work_run_job: schedules jobs
> > > > > > > > > > + * @work_free_job: cleans up jobs
> > > > > > > > > >       * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > > > > > > > >       *            timeout interval is over.
> > > > > > > > > >       * @pending_list: the list of jobs which are currently in the job queue.
> > > > > > > > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > > > > > > > >         atomic64_t                      job_id_count;
> > > > > > > > > >         struct workqueue_struct         *submit_wq;
> > > > > > > > > >         struct workqueue_struct         *timeout_wq;
> > > > > > > > > > -       struct work_struct              work_submit;
> > > > > > > > > > +       struct work_struct              work_run_job;
> > > > > > > > > > +       struct work_struct              work_free_job;
> > > > > > > > > >         struct delayed_work             work_tdr;
> > > > > > > > > >         struct list_head                pending_list;
> > > > > > > > > >         spinlock_t                      job_list_lock;
> > >

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Intel-xe] [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-23 17:26                     ` [Intel-xe] " Rodrigo Vivi
@ 2023-08-23 23:12                       ` Matthew Brost
  2023-08-24 11:44                         ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-23 23:12 UTC (permalink / raw)
  To: Rodrigo Vivi
  Cc: robdclark, sarah.walker, ketil.johnsen, lina, Liviu.Dudau,
	dri-devel, Christian König, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

On Wed, Aug 23, 2023 at 01:26:09PM -0400, Rodrigo Vivi wrote:
> On Wed, Aug 23, 2023 at 11:41:19AM -0400, Alex Deucher wrote:
> > On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost <matthew.brost@intel.com> wrote:
> > >
> > > On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
> > > > Am 23.08.23 um 05:27 schrieb Matthew Brost:
> > > > > [SNIP]
> > > > > > That is exactly what I want to avoid, tying the TDR to the job is what some
> > > > > > AMD engineers pushed for because it looked like a simple solution and made
> > > > > > the whole thing similar to what Windows does.
> > > > > >
> > > > > > This turned the previous relatively clean scheduler and TDR design into a
> > > > > > complete nightmare. The job contains quite a bunch of things which are not
> > > > > > necessarily available after the application which submitted the job is torn
> > > > > > down.
> > > > > >
> > > > > Agree the TDR shouldn't be accessing anything application specific
> > > > > rather just internal job state required to tear the job down on the
> > > > > hardware.
> > > > > > So what happens is that you either have stale pointers in the TDR which can
> > > > > > go boom extremely easily or we somehow find a way to keep the necessary
> > > > > I have not experienced the TDR going boom in Xe.
> > > > >
> > > > > > structures (which include struct thread_info and struct file for this driver
> > > > > > connection) alive until all submissions are completed.
> > > > > >
> > > > > In Xe we keep everything alive until all submissions are completed. By
> > > > > everything I mean the drm job, entity, scheduler, and VM via a reference
> > > > > counting scheme. All of these structures are just kernel state which can
> > > > > safely be accessed even if the application has been killed.
> > > >
> > > > Yeah, but that might just not be such a good idea from memory management
> > > > point of view.
> > > >
> > > > When you (for example) kill a process all resources from that process should
> > > > at least be queued to be freed more or less immediately.
> > > >
> > >
> > > We do this, the TDR kicks jobs off the hardware as fast as the hw
> > > interface allows and signals all pending hw fences immediately after.
> > > Free job is then immediately called and the reference count goes to
> > > zero. I think max time for all of this to occur is a handful of ms.
> > >
> > > > What Linux is doing for other I/O operations is to keep the relevant pages
> > > > alive until the I/O operation is completed, but for GPUs that usually means
> > > > keeping most of the memory of the process alive and that in turn is really
> > > > not something you can do.
> > > >
> > > > You can of course do this if your driver has a reliable way of killing your
> > > > submissions and freeing resources in a reasonable amount of time. This
> > > > should then be done in the flush callback.
> > > >
> > >
> > > 'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
> > > and think that function doesn't even work for what I tell. It flushes
> > > the spsc queue but what about jobs on the hardware, how do those get
> > > killed?
> > >
> > > As stated we do via the TDR which is a rather clean design and fits with
> > > our reference counting scheme.
> > >
> > > > > If we need to teardown on demand we just set the TDR to a minimum value and
> > > > > it kicks the jobs off the hardware, gracefully cleans everything up and
> > > > > drops all references. This is a benefit of the 1 to 1 relationship, not
> > > > > sure if this works with how AMDGPU uses the scheduler.
> > > > >
> > > > > > Delaying application tear down is also not an option because then you run
> > > > > > into massive trouble with the OOM killer (or more generally OOM handling).
> > > > > > See what we do in drm_sched_entity_flush() as well.
> > > > > >
> > > > > Not an issue for Xe, we never call drm_sched_entity_flush as our
> > > > > reference counting scheme ensures all jobs are finished before we attempt
> > > > > to tear down the entity / scheduler.
> > > >
> > > > I don't think you can do that upstream. Calling drm_sched_entity_flush() is
> > > > a must have from your flush callback for the file descriptor.
> > > >
> > >
> > > Again 'flush callback'? What are you referring to?
> > >
> > > And why does drm_sched_entity_flush need to be called, doesn't seem to
> > > do anything useful.
> > >
> > > > Unless you have some other method for killing your submissions this would
> > > > give a path for a denial of service attack vector when the Xe driver is in
> > > > use.
> > > >
> > >
> > > Yes, once the TDR fires it disallows all new submissions at the exec
> > > IOCTL plus flushes any pending submissions as fast as possible.
> > >
> > > > > > Since adding the TDR support we completely exercised this through in the
> > > > > > last two or three years or so. And to sum it up I would really like to get
> > > > > > away from this mess again.
> > > > > >
> > > > > > Compared to that what i915 does is actually rather clean I think.
> > > > > >
> > > > > Not even close, resets were a nightmare in the i915 (I spent years
> > > > > trying to get this right and it probably still doesn't completely work) and in Xe
> > > > > we basically got it right on the first attempt.
> > > > >
> > > > > > >    Also in Xe some of
> > > > > > > things done in free_job cannot be from an IRQ context, hence calling
> > > > > > > this from the scheduler worker is rather helpful.
> > > > > > Well putting things for cleanup into a workitem doesn't sound like
> > > > > > something hard.
> > > > > >
> > > > > That is exactly what we doing in the scheduler with the free_job
> > > > > workitem.
> > > >
> > > > Yeah, but I think that doing it in the scheduler and not the driver is
> > > > problematic.
> 
> Christian, I do see your point on simply getting rid of the free job callbacks here
> and then using a fence with an own-driver workqueue and house cleaning. But I wonder if
> starting with this patch as a clear separation of that is not a step forward
> and that could be cleaned up in a follow-up!?
> 
> Matt, why exactly do we need the separation in this patch? The commit message tells
> what it is doing and that it is aligned with the design, but it is not clear on why
> exactly we need this right now. Especially if in the end what we want is exactly
> keeping the submit_wq to ensure the serialization of the operations you mentioned.
> I mean, could we simply drop this patch and then work on a follow-up later and
> investigate the Christian suggestion when we are in-tree?
> 

I believe Christian suggested this change in a previous rev (free_job,
process_msg) in their own work items [1].

Dropping free_job / calling run_job again is really a completely
different topic than this patch.

[1] https://patchwork.freedesktop.org/patch/550722/?series=121745&rev=1
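
For anyone following along, a minimal standalone sketch (not the drm/sched or
Xe code; all toy_* names below are made up) of the serialization property
being discussed: two separate work items queued on a single ordered workqueue
still execute one at a time, so splitting run_job and free_job into their own
work items does not by itself mean they run in parallel if the driver passes
an ordered submit_wq.

#include <linux/errno.h>
#include <linux/workqueue.h>

/* Sketch only, using nothing beyond the core workqueue API; the toy_sched
 * structure and its callbacks are hypothetical. */
struct toy_sched {
        struct workqueue_struct *submit_wq;
        struct work_struct work_run_job;
        struct work_struct work_free_job;
};

static void toy_run_job_work(struct work_struct *w)
{
        /* would pop an entity / job and call the driver's run_job here */
}

static void toy_free_job_work(struct work_struct *w)
{
        /* would pop a finished job and call the driver's free_job here */
}

static int toy_sched_init(struct toy_sched *s)
{
        /* An ordered workqueue executes at most one work item at a time, so
         * the run_job and free_job items never execute concurrently. */
        s->submit_wq = alloc_ordered_workqueue("toy-submit", 0);
        if (!s->submit_wq)
                return -ENOMEM;

        INIT_WORK(&s->work_run_job, toy_run_job_work);
        INIT_WORK(&s->work_free_job, toy_free_job_work);

        /* both can be queued; they run back to back, never in parallel */
        queue_work(s->submit_wq, &s->work_run_job);
        queue_work(s->submit_wq, &s->work_free_job);
        return 0;
}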

> > > >
> > >
> > > Disagree, a common cleanup callback from a non-IRQ context IMO is a good
> > > design rather than each driver possibly having its own worker for
> > > cleanup.
> > >
> > > > For the scheduler it shouldn't care about the job any more as soon as the
> > > > driver takes over.
> > > >
> > >
> > > This is a massive rewrite for all users of the DRM scheduler, I'm saying
> > > for Xe what you are suggesting makes little to no sense.
> > >
> > > I'd like other users of the DRM scheduler to chime in on what you are
> > > proposing. The scope of this change affects 8ish drivers that would
> > > require buy-in from each of the stakeholders. I certainly can't change all of
> > > these drivers as I don't feel comfortable in all of those code bases nor
> > > do I have hardware to test all of these drivers.
> > >
> > > > >
> > > > > > Question is what do you really need for TDR which is not inside the hardware
> > > > > > fence?
> > > > > >
> > > > > A reference to the entity to be able to kick the job off the hardware.
> > > > > A reference to the entity, job, and VM for error capture.
> > > > >
> > > > > We also need a reference to the job for recovery after a GPU reset so
> > > > > run_job can be called again for innocent jobs.
> > > >
> > > > Well exactly that's what I'm massively pushing back. Letting the scheduler
> > > > call run_job() for the same job again is *NOT* something you can actually
> > > > do.
> > > >
> > >
> > > But lots of drivers do this already and the DRM scheduler documents
> > > this.
> > >
> > > > This pretty clearly violates some of the dma_fence constraints and has caused
> > > > massive headaches for me already.
> > > >
> > >
> > > Seems to work fine in Xe.
> > >
> > > > What you can do is to do this inside your driver, e.g. take the jobs and
> > > > push them again to the hw ring or just tell the hw to start executing again
> > > > from a previous position.
> > > >
> > >
> > > Again this now is a massive rewrite of many drivers.
> > >
> > > > BTW that re-submitting of jobs seems to be a no-go from userspace
> > > > perspective as well. Take a look at the Vulkan spec for that, at least Marek
> > > > pretty much pointed out that we should absolutely not do this inside the
> > > > kernel.
> > > >
> > >
> > > Yes, if the job causes the hang, we ban the queue. Typically only per
> > > entity (queue) resets are done in Xe but occasionally device level
> > > resets are done (due to issues with the hardware) and innocent jobs / entities
> > > have run_job called again.
> > 
> > If the engine is reset and the job was already executing, how can you
> > determine that it's in a good state to resubmit?  What if some

If a job has started but not completed we ban the queue during a device
reset. If a queue has jobs submitted but not started we resubmit all
jobs on the queue during the device reset.

The started / completed state can be determined by looking at a seqno in
memory.
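
To illustrate (a sketch only, with made-up toy_* names and fields, not the
actual Xe bookkeeping): the FW/HW writes a couple of seqno values to memory
and the driver compares them against the seqno assigned to the job.

#include <linux/compiler.h>
#include <linux/list.h>
#include <linux/types.h>

/* hypothetical job layout */
struct toy_job {
        u32 seqno;               /* seqno assigned to this job at submission */
        u32 *hw_started_seqno;   /* memory the FW/HW bumps when a job starts */
        u32 *hw_completed_seqno; /* memory the FW/HW bumps when a job completes */
        struct list_head link;   /* pending-list link, used in the sketch below */
};

/* signed comparison so seqno wrap-around behaves */
static bool toy_job_started(const struct toy_job *job)
{
        return (s32)(READ_ONCE(*job->hw_started_seqno) - job->seqno) >= 0;
}

static bool toy_job_completed(const struct toy_job *job)
{
        return (s32)(READ_ONCE(*job->hw_completed_seqno) - job->seqno) >= 0;
}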

> > internal fence or semaphore in memory used by the logic in the command
> > buffer has been signaled already and then you resubmit the job and it
> > now starts executing with different input state?
> 
> I believe we could set some more rules in the new robustness documentation:
> https://lore.kernel.org/all/20230818200642.276735-1-andrealmeid@igalia.com/
> 
> For this robustness implementation the i915 pinpoints the exact context that
> was in execution when the GPU hung and only blames that, although the
> resubmission is up to the user space. While on Xe we are blaming every
> single context that was in the queue. So I'm actually confused about what
> the innocent jobs are and who is calling for resubmission, if all of
> them got banned and blamed.

See above, innocent job == submitted job but not started (i.e. a job
stuck in the FW queue that has not yet been put on the hardware). Because we have
a FW scheduler we could have 1000s of innocent jobs that don't need to
get banned. This is very different from drivers without FW schedulers as
typically when run_job is called the job hits the hardware immediately.
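
Again purely as a sketch of that policy (toy_* names are hypothetical, the
ban / resubmit helpers are stubs, and this builds on the toy_job helpers in
the sketch above):

#include <linux/list.h>

struct toy_queue {
        struct list_head pending_jobs;  /* struct toy_job entries, via job->link */
};

static void toy_queue_ban(struct toy_queue *q)
{
        /* would mark the queue banned and signal its fences with an error */
}

static void toy_queue_resubmit(struct toy_queue *q, struct toy_job *job)
{
        /* would re-emit the job to the FW queue after the reset */
}

static void toy_queue_handle_device_reset(struct toy_queue *q)
{
        struct toy_job *job;

        /* any job that started but did not complete: ban this queue only */
        list_for_each_entry(job, &q->pending_jobs, link) {
                if (toy_job_started(job) && !toy_job_completed(job)) {
                        toy_queue_ban(q);
                        return;
                }
        }

        /* every job here is innocent (never reached the hardware): resubmit */
        list_for_each_entry(job, &q->pending_jobs, link)
                toy_queue_resubmit(q, job);
}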

Matt

> 
> > 
> > Alex
> > 
> > >
> > > > The generally right approach seems to be to cleanly signal to userspace that
> > > > something bad happened and that userspace then needs to submit things again
> > > > even for innocent jobs.
> > > >
> > >
> > > I disagree that innocent jobs should be banned. What you are suggesting
> > > is if a device reset needs to be done we kill / ban every user space queue.
> > > That seems like overkill. I'm not seeing where that is stated in this doc
> > > [1]; it seems to imply that only jobs that are stuck result in bans.
> > >
> > > Matt
> > >
> > > [1] https://patchwork.freedesktop.org/patch/553465/?series=119883&rev=3
> > >
> > > > Regards,
> > > > Christian.
> > > >
> > > > >
> > > > > All of this leads to believe we need to stick with the design.
> > > > >
> > > > > Matt
> > > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > > > The HW fence can live for longer as it can be installed in dma-resv
> > > > > > > > slots, syncobjs, etc... If the job and hw fence are combined we are now
> > > > > > > > holding on to the memory for longer and perhaps at the mercy of the
> > > > > > > > user. We also run the risk of the final put being done from an IRQ
> > > > > > > > context which again won't work in Xe as it is currently coded. Lastly 2
> > > > > > > > jobs from the same scheduler could do the final put in parallel, so
> > > > > > > > rather than having free_job serialized by the worker now multiple jobs
> > > > > > > > are freeing themselves at the same time. This might not be an issue but
> > > > > > > > adds another level of raciness that needs to be accounted for. None of
> > > > > > > this sounds desirable to me.
> > > > > > >
> > > > > > > > FWIW what you're suggesting sounds like how the i915 did things
> > > > > > > > (i915_request and hw fence in 1 memory alloc) and that turned out to be
> > > > > > > > a huge mess. As a rule of thumb I generally do the opposite of whatever
> > > > > > > the i915 did.
> > > > > > >
> > > > > > > Matt
> > > > > > >
> > > > > > > > Christian.
> > > > > > > >
> > > > > > > > > Matt
> > > > > > > > >
> > > > > > > > > > All the lifetime issues we had came from ignoring this fact and I think we
> > > > > > > > > > should push for fixing this design up again.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Christian.
> > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > > ---
> > > > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > > > > > > > > >      include/drm/gpu_scheduler.h            |   8 +-
> > > > > > > > > > >      2 files changed, 106 insertions(+), 39 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > index cede47afc800..b67469eac179 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > > > > > > > > >       * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > > > > > > > > >       *
> > > > > > > > > > >       * @rq: scheduler run queue to check.
> > > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > > >       *
> > > > > > > > > > >       * Try to find a ready entity, returns NULL if none found.
> > > > > > > > > > >       */
> > > > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > > > > >      {
> > > > > > > > > > >         struct drm_sched_entity *entity;
> > > > > > > > > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > >         if (entity) {
> > > > > > > > > > >                 list_for_each_entry_continue(entity, &rq->entities, list) {
> > > > > > > > > > >                         if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > > -                               rq->current_entity = entity;
> > > > > > > > > > > -                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > > +                               if (dequeue) {
> > > > > > > > > > > +                                       rq->current_entity = entity;
> > > > > > > > > > > +                                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > > +                               }
> > > > > > > > > > >                                 spin_unlock(&rq->lock);
> > > > > > > > > > >                                 return entity;
> > > > > > > > > > >                         }
> > > > > > > > > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > >         list_for_each_entry(entity, &rq->entities, list) {
> > > > > > > > > > >                 if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > > -                       rq->current_entity = entity;
> > > > > > > > > > > -                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > > +                       if (dequeue) {
> > > > > > > > > > > +                               rq->current_entity = entity;
> > > > > > > > > > > +                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > > +                       }
> > > > > > > > > > >                         spin_unlock(&rq->lock);
> > > > > > > > > > >                         return entity;
> > > > > > > > > > >                 }
> > > > > > > > > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > >       * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > > > > > > > > >       *
> > > > > > > > > > >       * @rq: scheduler run queue to check.
> > > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > > >       *
> > > > > > > > > > >       * Find oldest waiting ready entity, returns NULL if none found.
> > > > > > > > > > >       */
> > > > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > > > > >      {
> > > > > > > > > > >         struct rb_node *rb;
> > > > > > > > > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > > >                 entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > > > > > > > >                 if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > > -                       rq->current_entity = entity;
> > > > > > > > > > > -                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > > +                       if (dequeue) {
> > > > > > > > > > > +                               rq->current_entity = entity;
> > > > > > > > > > > +                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > > +                       }
> > > > > > > > > > >                         break;
> > > > > > > > > > >                 }
> > > > > > > > > > >         }
> > > > > > > > > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > > >      }
> > > > > > > > > > >      /**
> > > > > > > > > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > > > > > > > > + * drm_sched_run_job_queue - queue job submission
> > > > > > > > > > >       * @sched: scheduler instance
> > > > > > > > > > >       */
> > > > > > > > > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > >      {
> > > > > > > > > > >         if (!READ_ONCE(sched->pause_submit))
> > > > > > > > > > > -               queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > > > > +               queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +static struct drm_sched_entity *
> > > > > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > > > > > > > > +
> > > > > > > > > > > +/**
> > > > > > > > > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > > > > > > > > + * @sched: scheduler instance
> > > > > > > > > > > + */
> > > > > > > > > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > +{
> > > > > > > > > > > +       if (drm_sched_select_entity(sched, false))
> > > > > > > > > > > +               drm_sched_run_job_queue(sched);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +/**
> > > > > > > > > > > + * drm_sched_free_job_queue - queue free job
> > > > > > > > > > > + *
> > > > > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > > > > + */
> > > > > > > > > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > +{
> > > > > > > > > > > +       if (!READ_ONCE(sched->pause_submit))
> > > > > > > > > > > +               queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +/**
> > > > > > > > > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > > > > > > > > + *
> > > > > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > > > > + */
> > > > > > > > > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > +{
> > > > > > > > > > > +       struct drm_sched_job *job;
> > > > > > > > > > > +
> > > > > > > > > > > +       spin_lock(&sched->job_list_lock);
> > > > > > > > > > > +       job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > > > > > +                                      struct drm_sched_job, list);
> > > > > > > > > > > +       if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > > > > > > > > +               drm_sched_free_job_queue(sched);
> > > > > > > > > > > +       spin_unlock(&sched->job_list_lock);
> > > > > > > > > > >      }
> > > > > > > > > > >      /**
> > > > > > > > > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > > > > > > > > >         dma_fence_get(&s_fence->finished);
> > > > > > > > > > >         drm_sched_fence_finished(s_fence, result);
> > > > > > > > > > >         dma_fence_put(&s_fence->finished);
> > > > > > > > > > > -       drm_sched_submit_queue(sched);
> > > > > > > > > > > +       drm_sched_free_job_queue(sched);
> > > > > > > > > > >      }
> > > > > > > > > > >      /**
> > > > > > > > > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > >      void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > >      {
> > > > > > > > > > >         if (drm_sched_can_queue(sched))
> > > > > > > > > > > -               drm_sched_submit_queue(sched);
> > > > > > > > > > > +               drm_sched_run_job_queue(sched);
> > > > > > > > > > >      }
> > > > > > > > > > >      /**
> > > > > > > > > > >       * drm_sched_select_entity - Select next entity to process
> > > > > > > > > > >       *
> > > > > > > > > > >       * @sched: scheduler instance
> > > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > > >       *
> > > > > > > > > > >       * Returns the entity to process or NULL if none are found.
> > > > > > > > > > >       */
> > > > > > > > > > >      static struct drm_sched_entity *
> > > > > > > > > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > > > > > > > > >      {
> > > > > > > > > > >         struct drm_sched_entity *entity;
> > > > > > > > > > >         int i;
> > > > > > > > > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > > > >         /* Kernel run queue has higher priority than normal run queue*/
> > > > > > > > > > >         for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > > > > > > > > >                 entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > > > > > > > > -                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > > > > > > > > -                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > > > > > > > > +                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > > > > > > > > +                                                       dequeue) :
> > > > > > > > > > > +                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > > > > > > > > +                                                     dequeue);
> > > > > > > > > > >                 if (entity)
> > > > > > > > > > >                         break;
> > > > > > > > > > >         }
> > > > > > > > > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > > > > > > > > >      EXPORT_SYMBOL(drm_sched_pick_best);
> > > > > > > > > > >      /**
> > > > > > > > > > > - * drm_sched_main - main scheduler thread
> > > > > > > > > > > + * drm_sched_free_job_work - worker to call free_job
> > > > > > > > > > >       *
> > > > > > > > > > > - * @param: scheduler instance
> > > > > > > > > > > + * @w: free job work
> > > > > > > > > > >       */
> > > > > > > > > > > -static void drm_sched_main(struct work_struct *w)
> > > > > > > > > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > > > > > > > > >      {
> > > > > > > > > > >         struct drm_gpu_scheduler *sched =
> > > > > > > > > > > -               container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > > > > > > > > -       struct drm_sched_entity *entity;
> > > > > > > > > > > +               container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > > > > > > > > >         struct drm_sched_job *cleanup_job;
> > > > > > > > > > > -       int r;
> > > > > > > > > > >         if (READ_ONCE(sched->pause_submit))
> > > > > > > > > > >                 return;
> > > > > > > > > > >         cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > > > > > > > > -       entity = drm_sched_select_entity(sched);
> > > > > > > > > > > +       if (cleanup_job) {
> > > > > > > > > > > +               sched->ops->free_job(cleanup_job);
> > > > > > > > > > > +
> > > > > > > > > > > +               drm_sched_free_job_queue_if_ready(sched);
> > > > > > > > > > > +               drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > > > > +       }
> > > > > > > > > > > +}
> > > > > > > > > > > -       if (!entity && !cleanup_job)
> > > > > > > > > > > -               return; /* No more work */
> > > > > > > > > > > +/**
> > > > > > > > > > > + * drm_sched_run_job_work - worker to call run_job
> > > > > > > > > > > + *
> > > > > > > > > > > + * @w: run job work
> > > > > > > > > > > + */
> > > > > > > > > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > > > > > > > > +{
> > > > > > > > > > > +       struct drm_gpu_scheduler *sched =
> > > > > > > > > > > +               container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > > > > > > > +       struct drm_sched_entity *entity;
> > > > > > > > > > > +       int r;
> > > > > > > > > > > -       if (cleanup_job)
> > > > > > > > > > > -               sched->ops->free_job(cleanup_job);
> > > > > > > > > > > +       if (READ_ONCE(sched->pause_submit))
> > > > > > > > > > > +               return;
> > > > > > > > > > > +       entity = drm_sched_select_entity(sched, true);
> > > > > > > > > > >         if (entity) {
> > > > > > > > > > >                 struct dma_fence *fence;
> > > > > > > > > > >                 struct drm_sched_fence *s_fence;
> > > > > > > > > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > > > > >                 sched_job = drm_sched_entity_pop_job(entity);
> > > > > > > > > > >                 if (!sched_job) {
> > > > > > > > > > >                         complete_all(&entity->entity_idle);
> > > > > > > > > > > -                       if (!cleanup_job)
> > > > > > > > > > > -                               return; /* No more work */
> > > > > > > > > > > -                       goto again;
> > > > > > > > > > > +                       return; /* No more work */
> > > > > > > > > > >                 }
> > > > > > > > > > >                 s_fence = sched_job->s_fence;
> > > > > > > > > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > > > > >                 }
> > > > > > > > > > >                 wake_up(&sched->job_scheduled);
> > > > > > > > > > > +               drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > > > >         }
> > > > > > > > > > > -
> > > > > > > > > > > -again:
> > > > > > > > > > > -       drm_sched_submit_queue(sched);
> > > > > > > > > > >      }
> > > > > > > > > > >      /**
> > > > > > > > > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > > > > > > > > >         spin_lock_init(&sched->job_list_lock);
> > > > > > > > > > >         atomic_set(&sched->hw_rq_count, 0);
> > > > > > > > > > >         INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > > > > > > > > -       INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > > > > > > > > +       INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > > > > > > > > +       INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > > > > > > > > >         atomic_set(&sched->_score, 0);
> > > > > > > > > > >         atomic64_set(&sched->job_id_count, 0);
> > > > > > > > > > >         sched->pause_submit = false;
> > > > > > > > > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > > > > > > > > >      void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > > > > > > > > >      {
> > > > > > > > > > >         WRITE_ONCE(sched->pause_submit, true);
> > > > > > > > > > > -       cancel_work_sync(&sched->work_submit);
> > > > > > > > > > > +       cancel_work_sync(&sched->work_run_job);
> > > > > > > > > > > +       cancel_work_sync(&sched->work_free_job);
> > > > > > > > > > >      }
> > > > > > > > > > >      EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > > > >      void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > > > > > > > > >      {
> > > > > > > > > > >         WRITE_ONCE(sched->pause_submit, false);
> > > > > > > > > > > -       queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > > > > +       queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > > > > +       queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > > > >      }
> > > > > > > > > > >      EXPORT_SYMBOL(drm_sched_submit_start);
> > > > > > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > > > > > > index 04eec2d7635f..fbc083a92757 100644
> > > > > > > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > > > > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > > > > > > > > >       *                 finished.
> > > > > > > > > > >       * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > > > > > > > > >       * @job_id_count: used to assign unique id to the each job.
> > > > > > > > > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > > > > > > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > > > > > > > > >       * @timeout_wq: workqueue used to queue @work_tdr
> > > > > > > > > > > - * @work_submit: schedules jobs and cleans up entities
> > > > > > > > > > > + * @work_run_job: schedules jobs
> > > > > > > > > > > + * @work_free_job: cleans up jobs
> > > > > > > > > > >       * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > > > > > > > > >       *            timeout interval is over.
> > > > > > > > > > >       * @pending_list: the list of jobs which are currently in the job queue.
> > > > > > > > > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > > > > > > > > >         atomic64_t                      job_id_count;
> > > > > > > > > > >         struct workqueue_struct         *submit_wq;
> > > > > > > > > > >         struct workqueue_struct         *timeout_wq;
> > > > > > > > > > > -       struct work_struct              work_submit;
> > > > > > > > > > > +       struct work_struct              work_run_job;
> > > > > > > > > > > +       struct work_struct              work_free_job;
> > > > > > > > > > >         struct delayed_work             work_tdr;
> > > > > > > > > > >         struct list_head                pending_list;
> > > > > > > > > > >         spinlock_t                      job_list_lock;
> > > >

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/9] DRM scheduler changes for Xe
  2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
                   ` (8 preceding siblings ...)
  2023-08-11  2:31 ` [PATCH v2 9/9] drm/sched: Add helper to set TDR timeout Matthew Brost
@ 2023-08-24  0:08 ` Danilo Krummrich
  2023-08-24  3:23   ` Matthew Brost
  9 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-24  0:08 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

Hi Matt,

On 8/11/23 04:31, Matthew Brost wrote:
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first.
> 
> This a continuation of a RFC [3] with all comments addressed, ready for
> a full review, and hopefully in state which can merged in the near
> future. More details of this series can found in the cover letter of the
> RFC [3].
> 
> These changes have been tested with the Xe driver.

Do you keep a branch with these patches somewhere?

- Danilo

> 
> v2:
>   - Break run job, free job, and process message in own work items
>   - This might break other drivers as run job and free job now can run in
>     parallel, can fix up if needed
> 
> Matt
> 
> Matthew Brost (9):
>    drm/sched: Convert drm scheduler to use a work queue  rather than
>      kthread
>    drm/sched: Move schedule policy to scheduler / entity
>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>    drm/sched: Split free_job into own work item
>    drm/sched: Add generic scheduler message interface
>    drm/sched: Add drm_sched_start_timeout_unlocked helper
>    drm/sched: Start run wq before TDR in drm_sched_start
>    drm/sched: Submit job before starting TDR
>    drm/sched: Add helper to set TDR timeout
> 
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   3 +-
>   drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   5 +-
>   drivers/gpu/drm/lima/lima_sched.c          |   5 +-
>   drivers/gpu/drm/msm/msm_ringbuffer.c       |   5 +-
>   drivers/gpu/drm/nouveau/nouveau_sched.c    |   5 +-
>   drivers/gpu/drm/panfrost/panfrost_job.c    |   5 +-
>   drivers/gpu/drm/scheduler/sched_entity.c   |  85 ++++-
>   drivers/gpu/drm/scheduler/sched_fence.c    |   2 +-
>   drivers/gpu/drm/scheduler/sched_main.c     | 408 ++++++++++++++++-----
>   drivers/gpu/drm/v3d/v3d_sched.c            |  25 +-
>   include/drm/gpu_scheduler.h                |  75 +++-
>   11 files changed, 487 insertions(+), 136 deletions(-)
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/9] DRM scheduler changes for Xe
  2023-08-24  0:08 ` [PATCH v2 0/9] DRM scheduler changes for Xe Danilo Krummrich
@ 2023-08-24  3:23   ` Matthew Brost
  2023-08-24 14:51     ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-24  3:23 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

On Thu, Aug 24, 2023 at 02:08:59AM +0200, Danilo Krummrich wrote:
> Hi Matt,
> 
> On 8/11/23 04:31, Matthew Brost wrote:
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first.
> > 
> > This a continuation of a RFC [3] with all comments addressed, ready for
> > a full review, and hopefully in state which can merged in the near
> > future. More details of this series can found in the cover letter of the
> > RFC [3].
> > 
> > These changes have been tested with the Xe driver.
> 
> Do you keep a branch with these patches somewhere?
> 

Pushed a branch for you:
https://gitlab.freedesktop.org/mbrost/nouveau-drm-scheduler/-/tree/xe-sched-changes?ref_type=heads

Matt

> - Danilo
> 
> > 
> > v2:
> >   - Break run job, free job, and process message in own work items
> >   - This might break other drivers as run job and free job now can run in
> >     parallel, can fix up if needed
> > 
> > Matt
> > 
> > Matthew Brost (9):
> >    drm/sched: Convert drm scheduler to use a work queue  rather than
> >      kthread
> >    drm/sched: Move schedule policy to scheduler / entity
> >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >    drm/sched: Split free_job into own work item
> >    drm/sched: Add generic scheduler message interface
> >    drm/sched: Add drm_sched_start_timeout_unlocked helper
> >    drm/sched: Start run wq before TDR in drm_sched_start
> >    drm/sched: Submit job before starting TDR
> >    drm/sched: Add helper to set TDR timeout
> > 
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   3 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   5 +-
> >   drivers/gpu/drm/lima/lima_sched.c          |   5 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c       |   5 +-
> >   drivers/gpu/drm/nouveau/nouveau_sched.c    |   5 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c    |   5 +-
> >   drivers/gpu/drm/scheduler/sched_entity.c   |  85 ++++-
> >   drivers/gpu/drm/scheduler/sched_fence.c    |   2 +-
> >   drivers/gpu/drm/scheduler/sched_main.c     | 408 ++++++++++++++++-----
> >   drivers/gpu/drm/v3d/v3d_sched.c            |  25 +-
> >   include/drm/gpu_scheduler.h                |  75 +++-
> >   11 files changed, 487 insertions(+), 136 deletions(-)
> > 
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Intel-xe] [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-23 23:12                       ` Matthew Brost
@ 2023-08-24 11:44                         ` Christian König
  2023-08-24 14:30                           ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-24 11:44 UTC (permalink / raw)
  To: Matthew Brost, Rodrigo Vivi
  Cc: robdclark, sarah.walker, ketil.johnsen, lina, Liviu.Dudau,
	dri-devel, luben.tuikov, donald.robson, boris.brezillon, intel-xe,
	faith.ekstrand

Am 24.08.23 um 01:12 schrieb Matthew Brost:
> On Wed, Aug 23, 2023 at 01:26:09PM -0400, Rodrigo Vivi wrote:
>> On Wed, Aug 23, 2023 at 11:41:19AM -0400, Alex Deucher wrote:
>>> On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost <matthew.brost@intel.com> wrote:
>>>> On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
>>>>> Am 23.08.23 um 05:27 schrieb Matthew Brost:
>>>>>> [SNIP]
>>>>>>> That is exactly what I want to avoid, tying the TDR to the job is what some
>>>>>>> AMD engineers pushed for because it looked like a simple solution and made
>>>>>>> the whole thing similar to what Windows does.
>>>>>>>
>>>>>>> This turned the previous relatively clean scheduler and TDR design into a
>>>>>>> complete nightmare. The job contains quite a bunch of things which are not
>>>>>>> necessarily available after the application which submitted the job is torn
>>>>>>> down.
>>>>>>>
>>>>>> Agree the TDR shouldn't be accessing anything application specific
>>>>>> rather just internal job state required to tear the job down on the
>>>>>> hardware.
>>>>>>> So what happens is that you either have stale pointers in the TDR which can
>>>>>>> go boom extremely easily or we somehow find a way to keep the necessary
>>>>>> I have not experienced the TDR going boom in Xe.
>>>>>>
>>>>>>> structures (which include struct thread_info and struct file for this driver
>>>>>>> connection) alive until all submissions are completed.
>>>>>>>
>>>>>> In Xe we keep everything alive until all submissions are completed. By
>>>>>> everything I mean the drm job, entity, scheduler, and VM via a reference
>>>>>> counting scheme. All of these structures are just kernel state which can
>>>>>> safely be accessed even if the application has been killed.
>>>>> Yeah, but that might just not be such a good idea from memory management
>>>>> point of view.
>>>>>
>>>>> When you (for example) kill a process all resources from that process should
>>>>> at least be queued to be freed more or less immediately.
>>>>>
>>>> We do this, the TDR kicks jobs off the hardware as fast as the hw
>>>> interface allows and signals all pending hw fences immediately after.
>>>> Free job is then immediately called and the reference count goes to
>>>> zero. I think max time for all of this to occur is a handful of ms.
>>>>
>>>>> What Linux is doing for other I/O operations is to keep the relevant pages
>>>>> alive until the I/O operation is completed, but for GPUs that usually means
>>>>> keeping most of the memory of the process alive and that in turn is really
>>>>> not something you can do.
>>>>>
>>>>> You can of course do this if your driver has a reliable way of killing your
>>>>> submissions and freeing resources in a reasonable amount of time. This
>>>>> should then be done in the flush callback.
>>>>>
>>>> 'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
>>>> and think that function doesn't even work for what I tell. It flushes
>>>> the spsc queue but what about jobs on the hardware, how do those get
>>>> killed?
>>>>
>>>> As stated we do via the TDR which is a rather clean design and fits with
>>>> our reference counting scheme.
>>>>
>>>>>> If we need to teardown on demand we just set the TDR to a minimum value and
>>>>>> it kicks the jobs off the hardware, gracefully cleans everything up and
>>>>>> drops all references. This is a benefit of the 1 to 1 relationship, not
>>>>>> sure if this works with how AMDGPU uses the scheduler.
>>>>>>
>>>>>>> Delaying application tear down is also not an option because then you run
>>>>>>> into massive trouble with the OOM killer (or more generally OOM handling).
>>>>>>> See what we do in drm_sched_entity_flush() as well.
>>>>>>>
>>>>>> Not an issue for Xe, we never call drm_sched_entity_flush as our
>>>>>> reference counting scheme ensures all jobs are finished before we attempt
>>>>>> to tear down the entity / scheduler.
>>>>> I don't think you can do that upstream. Calling drm_sched_entity_flush() is
>>>>> a must have from your flush callback for the file descriptor.
>>>>>
>>>> Again 'flush callback'? What are you referring to?
>>>>
>>>> And why does drm_sched_entity_flush need to be called, doesn't seem to
>>>> do anything useful.
>>>>
>>>>> Unless you have some other method for killing your submissions this would
>>>>> give a path for a denial of service attack vector when the Xe driver is in
>>>>> use.
>>>>>
>>>> Yes, once the TDR fires it disallows all new submissions at the exec
>>>> IOCTL plus flushes any pending submissions as fast as possible.
>>>>
>>>>>>> Since adding the TDR support we completely exercised this through in the
>>>>>>> last two or three years or so. And to sum it up I would really like to get
>>>>>>> away from this mess again.
>>>>>>>
>>>>>>> Compared to that what i915 does is actually rather clean I think.
>>>>>>>
>>>>>> Not even close, resets were a nightmare in the i915 (I spent years
>>>>>> trying to get this right and it probably still doesn't completely work) and in Xe
>>>>>> we basically got it right on the first attempt.
>>>>>>
>>>>>>>>     Also in Xe some of
>>>>>>>> things done in free_job cannot be from an IRQ context, hence calling
>>>>>>>> this from the scheduler worker is rather helpful.
>>>>>>> Well putting things for cleanup into a workitem doesn't sound like
>>>>>>> something hard.
>>>>>>>
>>>>>> That is exactly what we doing in the scheduler with the free_job
>>>>>> workitem.
>>>>> Yeah, but I think that doing it in the scheduler and not the driver is
>>>>> problematic.
>> Christian, I do see your point on simply getting rid of the free job callbacks here
>> and then using a fence with an own-driver workqueue and house cleaning. But I wonder if
>> starting with this patch as a clear separation of that is not a step forward
>> and that could be cleaned up in a follow-up!?
>>
>> Matt, why exactly do we need the separation in this patch? The commit message tells
>> what it is doing and that it is aligned with the design, but it is not clear on why
>> exactly we need this right now. Especially if in the end what we want is exactly
>> keeping the submit_wq to ensure the serialization of the operations you mentioned.
>> I mean, could we simply drop this patch and then work on a follow-up later and
>> investigate the Christian suggestion when we are in-tree?
>>
> I believe Christian suggested this change in a previous rev (free_job,
> process_msg) in their own work items [1].
>
> Dropping free_job / calling run_job again is really a completely
> different topic than this patch.

Yeah, agree. I just wanted to bring this up before we put even more 
effort into the free_job-based approach.

Rodrigo's point is a really good one, no matter whether the driver or the
scheduler frees the job. Doing that in a separate work item sounds like 
the right thing to do.
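
Something along these lines would work (a rough sketch with made-up toy_*
names, not amdgpu or Xe code): the driver queues its own work item for the
cleanup instead of relying on the scheduler's free_job callback.

#include <linux/container_of.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct toy_drv_job {
        struct work_struct free_work;
        /* ... driver-side job state ... */
};

static void toy_drv_job_free_work(struct work_struct *w)
{
        struct toy_drv_job *job = container_of(w, struct toy_drv_job, free_work);

        /* process context here, so heavier teardown is fine */
        kfree(job);
}

/* e.g. called from the hw fence callback once the job is done */
static void toy_drv_job_put_async(struct toy_drv_job *job)
{
        INIT_WORK(&job->free_work, toy_drv_job_free_work);
        schedule_work(&job->free_work);
}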

Regards,
Christian.

>
> [1] https://patchwork.freedesktop.org/patch/550722/?series=121745&rev=1
>
>>>> Disagree, a common cleanup callback from a non-IRQ context IMO is a good
>>>> design rather than each driver possibly having its own worker for
>>>> cleanup.
>>>>
>>>>> For the scheduler it shouldn't care about the job any more as soon as the
>>>>> driver takes over.
>>>>>
>>>> This is a massive rewrite for all users of the DRM scheduler, I'm saying
>>>> for Xe what you are suggesting makes little to no sense.
>>>>
>>>> I'd like other users of the DRM scheduler to chime in on what you are
>>>> proposing. The scope of this change affects 8ish drivers that would
>>>> require buy-in from each of the stakeholders. I certainly can't change all of
>>>> these drivers as I don't feel comfortable in all of those code bases nor
>>>> do I have hardware to test all of these drivers.
>>>>
>>>>>>> Question is what do you really need for TDR which is not inside the hardware
>>>>>>> fence?
>>>>>>>
>>>>>> A reference to the entity to be able to kick the job off the hardware.
>>>>>> A reference to the entity, job, and VM for error capture.
>>>>>>
>>>>>> We also need a reference to the job for recovery after a GPU reset so
>>>>>> run_job can be called again for innocent jobs.
>>>>> Well exactly that's what I'm massively pushing back. Letting the scheduler
>>>>> call run_job() for the same job again is *NOT* something you can actually
>>>>> do.
>>>>>
>>>> But lots of drivers do this already and the DRM scheduler documents
>>>> this.
>>>>
>>>>> This pretty clearly violates some of the dma_fence constraints and has caused
>>>>> massive headaches for me already.
>>>>>
>>>> Seems to work fine in Xe.
>>>>
>>>>> What you can do is to do this inside your driver, e.g. take the jobs and
>>>>> push them again to the hw ring or just tell the hw to start executing again
>>>>> from a previous position.
>>>>>
>>>> Again this now is a massive rewrite of many drivers.
>>>>
>>>>> BTW that re-submitting of jobs seems to be a no-go from userspace
>>>>> perspective as well. Take a look at the Vulkan spec for that, at least Marek
>>>>> pretty much pointed out that we should absolutely not do this inside the
>>>>> kernel.
>>>>>
>>>> Yes, if the job causes the hang, we ban the queue. Typically only per
>>>> entity (queue) resets are done in Xe but occasionally device level
>>>> resets are done (due to issues with the hardware) and innocent jobs / entities
>>>> have run_job called again.
>>> If the engine is reset and the job was already executing, how can you
>>> determine that it's in a good state to resubmit?  What if some
> If a job has started but not completed we ban the queue during a device
> reset. If a queue has jobs submitted but not started we resubmit all
> jobs on the queue during the device reset.
>
> The started / completed state can be determined by looking at a seqno in
> memory.
>
>>> internal fence or semaphore in memory used by the logic in the command
>>> buffer has been signaled already and then you resubmit the job and it
>>> now starts executing with different input state?
>> I believe we could set some more rules in the new robustness documentation:
>> https://lore.kernel.org/all/20230818200642.276735-1-andrealmeid@igalia.com/
>>
>> For this robustness implementation the i915 pinpoints the exact context that
>> was in execution when the GPU hung and only blames that, although the
>> resubmission is up to the user space. While on Xe we are blaming every
>> single context that was in the queue. So I'm actually confused about what
>> the innocent jobs are and who is calling for resubmission, if all of
>> them got banned and blamed.
> See above, innocent job == submitted job but not started (i.e. a job
> stuck in the FW queue that has not yet been put on the hardware). Because we have
> a FW scheduler we could have 1000s of innocent jobs that don't need to
> get banned. This is very different from drivers without FW schedulers as
> typically when run_job is called the job hits the hardware immediately.
>
> Matt
>
>>> Alex
>>>
>>>>> The generally right approach seems to be to cleanly signal to userspace that
>>>>> something bad happened and that userspace then needs to submit things again
>>>>> even for innocent jobs.
>>>>>
>>>> I disagree that innocent jobs should be banned. What you are suggesting
>>>> is if a device reset needs to be done we kill / ban every user space queue.
>>>> That seems like overkill. I'm not seeing where that is stated in this doc
>>>> [1]; it seems to imply that only jobs that are stuck result in bans.
>>>>
>>>> Matt
>>>>
>>>> [1] https://patchwork.freedesktop.org/patch/553465/?series=119883&rev=3
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> All of this leads to believe we need to stick with the design.
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> The HW fence can live for longer as it can be installed in dma-resv
>>>>>>>> slots, syncobjs, etc... If the job and hw fence are combined we are now
>>>>>>>> holding on to the memory for longer and perhaps at the mercy of the
>>>>>>>> user. We also run the risk of the final put being done from an IRQ
>>>>>>>> context which again won't work in Xe as it is currently coded. Lastly 2
>>>>>>>> jobs from the same scheduler could do the final put in parallel, so
>>>>>>>> rather than having free_job serialized by the worker now multiple jobs
>>>>>>>> are freeing themselves at the same time. This might not be an issue but
>>>>>>>> adds another level of raciness that needs to be accounted for. None of
>>>>>>>> this sounds desirable to me.
>>>>>>>>
>>>>>>>> FWIW what you're suggesting sounds like how the i915 did things
>>>>>>>> (i915_request and hw fence in 1 memory alloc) and that turned out to be
>>>>>>>> a huge mess. As a rule of thumb I generally do the opposite of whatever
>>>>>>>> the i915 did.
>>>>>>>>
>>>>>>>> Matt
>>>>>>>>
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>> All the lifetime issues we had came from ignoring this fact and I think we
>>>>>>>>>>> should push for fixing this design up again.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Christian.
>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>       drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>>>>>>>>>>>>       include/drm/gpu_scheduler.h            |   8 +-
>>>>>>>>>>>>       2 files changed, 106 insertions(+), 39 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> index cede47afc800..b67469eac179 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>>>>>>>>>        * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>>>>>>>>>>        *
>>>>>>>>>>>>        * @rq: scheduler run queue to check.
>>>>>>>>>>>> + * @dequeue: dequeue selected entity
>>>>>>>>>>>>        *
>>>>>>>>>>>>        * Try to find a ready entity, returns NULL if none found.
>>>>>>>>>>>>        */
>>>>>>>>>>>>       static struct drm_sched_entity *
>>>>>>>>>>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>>>>>>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>>>>>>>>>>>>       {
>>>>>>>>>>>>          struct drm_sched_entity *entity;
>>>>>>>>>>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>>>>>>>>          if (entity) {
>>>>>>>>>>>>                  list_for_each_entry_continue(entity, &rq->entities, list) {
>>>>>>>>>>>>                          if (drm_sched_entity_is_ready(entity)) {
>>>>>>>>>>>> -                               rq->current_entity = entity;
>>>>>>>>>>>> -                               reinit_completion(&entity->entity_idle);
>>>>>>>>>>>> +                               if (dequeue) {
>>>>>>>>>>>> +                                       rq->current_entity = entity;
>>>>>>>>>>>> +                                       reinit_completion(&entity->entity_idle);
>>>>>>>>>>>> +                               }
>>>>>>>>>>>>                                  spin_unlock(&rq->lock);
>>>>>>>>>>>>                                  return entity;
>>>>>>>>>>>>                          }
>>>>>>>>>>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>>>>>>>>          list_for_each_entry(entity, &rq->entities, list) {
>>>>>>>>>>>>                  if (drm_sched_entity_is_ready(entity)) {
>>>>>>>>>>>> -                       rq->current_entity = entity;
>>>>>>>>>>>> -                       reinit_completion(&entity->entity_idle);
>>>>>>>>>>>> +                       if (dequeue) {
>>>>>>>>>>>> +                               rq->current_entity = entity;
>>>>>>>>>>>> +                               reinit_completion(&entity->entity_idle);
>>>>>>>>>>>> +                       }
>>>>>>>>>>>>                          spin_unlock(&rq->lock);
>>>>>>>>>>>>                          return entity;
>>>>>>>>>>>>                  }
>>>>>>>>>>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>>>>>>>>        * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>>>>>>>>>>        *
>>>>>>>>>>>>        * @rq: scheduler run queue to check.
>>>>>>>>>>>> + * @dequeue: dequeue selected entity
>>>>>>>>>>>>        *
>>>>>>>>>>>>        * Find oldest waiting ready entity, returns NULL if none found.
>>>>>>>>>>>>        */
>>>>>>>>>>>>       static struct drm_sched_entity *
>>>>>>>>>>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>>>>>>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>>>>>>>>>>>>       {
>>>>>>>>>>>>          struct rb_node *rb;
>>>>>>>>>>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>>>>>>>>                  entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>>>>>>>>>>                  if (drm_sched_entity_is_ready(entity)) {
>>>>>>>>>>>> -                       rq->current_entity = entity;
>>>>>>>>>>>> -                       reinit_completion(&entity->entity_idle);
>>>>>>>>>>>> +                       if (dequeue) {
>>>>>>>>>>>> +                               rq->current_entity = entity;
>>>>>>>>>>>> +                               reinit_completion(&entity->entity_idle);
>>>>>>>>>>>> +                       }
>>>>>>>>>>>>                          break;
>>>>>>>>>>>>                  }
>>>>>>>>>>>>          }
>>>>>>>>>>>> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>>>>>>>>       }
>>>>>>>>>>>>       /**
>>>>>>>>>>>> - * drm_sched_submit_queue - scheduler queue submission
>>>>>>>>>>>> + * drm_sched_run_job_queue - queue job submission
>>>>>>>>>>>>        * @sched: scheduler instance
>>>>>>>>>>>>        */
>>>>>>>>>>>> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>>       {
>>>>>>>>>>>>          if (!READ_ONCE(sched->pause_submit))
>>>>>>>>>>>> -               queue_work(sched->submit_wq, &sched->work_submit);
>>>>>>>>>>>> +               queue_work(sched->submit_wq, &sched->work_run_job);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +static struct drm_sched_entity *
>>>>>>>>>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
>>>>>>>>>>>> + * @sched: scheduler instance
>>>>>>>>>>>> + */
>>>>>>>>>>>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +       if (drm_sched_select_entity(sched, false))
>>>>>>>>>>>> +               drm_sched_run_job_queue(sched);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * drm_sched_free_job_queue - queue free job
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * @sched: scheduler instance to queue free job
>>>>>>>>>>>> + */
>>>>>>>>>>>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +       if (!READ_ONCE(sched->pause_submit))
>>>>>>>>>>>> +               queue_work(sched->submit_wq, &sched->work_free_job);
>>>>>>>>>>>> +}
>>>>>>>>>>>> +
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * drm_sched_free_job_queue_if_ready - queue free job if ready
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * @sched: scheduler instance to queue free job
>>>>>>>>>>>> + */
>>>>>>>>>>>> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +       struct drm_sched_job *job;
>>>>>>>>>>>> +
>>>>>>>>>>>> +       spin_lock(&sched->job_list_lock);
>>>>>>>>>>>> +       job = list_first_entry_or_null(&sched->pending_list,
>>>>>>>>>>>> +                                      struct drm_sched_job, list);
>>>>>>>>>>>> +       if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>>>>>>>>>>> +               drm_sched_free_job_queue(sched);
>>>>>>>>>>>> +       spin_unlock(&sched->job_list_lock);
>>>>>>>>>>>>       }
>>>>>>>>>>>>       /**
>>>>>>>>>>>> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>>>>>>>>>>          dma_fence_get(&s_fence->finished);
>>>>>>>>>>>>          drm_sched_fence_finished(s_fence, result);
>>>>>>>>>>>>          dma_fence_put(&s_fence->finished);
>>>>>>>>>>>> -       drm_sched_submit_queue(sched);
>>>>>>>>>>>> +       drm_sched_free_job_queue(sched);
>>>>>>>>>>>>       }
>>>>>>>>>>>>       /**
>>>>>>>>>>>> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>>       void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>>       {
>>>>>>>>>>>>          if (drm_sched_can_queue(sched))
>>>>>>>>>>>> -               drm_sched_submit_queue(sched);
>>>>>>>>>>>> +               drm_sched_run_job_queue(sched);
>>>>>>>>>>>>       }
>>>>>>>>>>>>       /**
>>>>>>>>>>>>        * drm_sched_select_entity - Select next entity to process
>>>>>>>>>>>>        *
>>>>>>>>>>>>        * @sched: scheduler instance
>>>>>>>>>>>> + * @dequeue: dequeue selected entity
>>>>>>>>>>>>        *
>>>>>>>>>>>>        * Returns the entity to process or NULL if none are found.
>>>>>>>>>>>>        */
>>>>>>>>>>>>       static struct drm_sched_entity *
>>>>>>>>>>>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>>>>>>>>>>>>       {
>>>>>>>>>>>>          struct drm_sched_entity *entity;
>>>>>>>>>>>>          int i;
>>>>>>>>>>>> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>>          /* Kernel run queue has higher priority than normal run queue*/
>>>>>>>>>>>>          for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>>>>>>>>                  entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
>>>>>>>>>>>> -                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
>>>>>>>>>>>> -                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
>>>>>>>>>>>> +                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
>>>>>>>>>>>> +                                                       dequeue) :
>>>>>>>>>>>> +                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
>>>>>>>>>>>> +                                                     dequeue);
>>>>>>>>>>>>                  if (entity)
>>>>>>>>>>>>                          break;
>>>>>>>>>>>>          }
>>>>>>>>>>>> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>>>>>>>>>>>>       EXPORT_SYMBOL(drm_sched_pick_best);
>>>>>>>>>>>>       /**
>>>>>>>>>>>> - * drm_sched_main - main scheduler thread
>>>>>>>>>>>> + * drm_sched_free_job_work - worker to call free_job
>>>>>>>>>>>>        *
>>>>>>>>>>>> - * @param: scheduler instance
>>>>>>>>>>>> + * @w: free job work
>>>>>>>>>>>>        */
>>>>>>>>>>>> -static void drm_sched_main(struct work_struct *w)
>>>>>>>>>>>> +static void drm_sched_free_job_work(struct work_struct *w)
>>>>>>>>>>>>       {
>>>>>>>>>>>>          struct drm_gpu_scheduler *sched =
>>>>>>>>>>>> -               container_of(w, struct drm_gpu_scheduler, work_submit);
>>>>>>>>>>>> -       struct drm_sched_entity *entity;
>>>>>>>>>>>> +               container_of(w, struct drm_gpu_scheduler, work_free_job);
>>>>>>>>>>>>          struct drm_sched_job *cleanup_job;
>>>>>>>>>>>> -       int r;
>>>>>>>>>>>>          if (READ_ONCE(sched->pause_submit))
>>>>>>>>>>>>                  return;
>>>>>>>>>>>>          cleanup_job = drm_sched_get_cleanup_job(sched);
>>>>>>>>>>>> -       entity = drm_sched_select_entity(sched);
>>>>>>>>>>>> +       if (cleanup_job) {
>>>>>>>>>>>> +               sched->ops->free_job(cleanup_job);
>>>>>>>>>>>> +
>>>>>>>>>>>> +               drm_sched_free_job_queue_if_ready(sched);
>>>>>>>>>>>> +               drm_sched_run_job_queue_if_ready(sched);
>>>>>>>>>>>> +       }
>>>>>>>>>>>> +}
>>>>>>>>>>>> -       if (!entity && !cleanup_job)
>>>>>>>>>>>> -               return; /* No more work */
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * drm_sched_run_job_work - worker to call run_job
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * @w: run job work
>>>>>>>>>>>> + */
>>>>>>>>>>>> +static void drm_sched_run_job_work(struct work_struct *w)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +       struct drm_gpu_scheduler *sched =
>>>>>>>>>>>> +               container_of(w, struct drm_gpu_scheduler, work_run_job);
>>>>>>>>>>>> +       struct drm_sched_entity *entity;
>>>>>>>>>>>> +       int r;
>>>>>>>>>>>> -       if (cleanup_job)
>>>>>>>>>>>> -               sched->ops->free_job(cleanup_job);
>>>>>>>>>>>> +       if (READ_ONCE(sched->pause_submit))
>>>>>>>>>>>> +               return;
>>>>>>>>>>>> +       entity = drm_sched_select_entity(sched, true);
>>>>>>>>>>>>          if (entity) {
>>>>>>>>>>>>                  struct dma_fence *fence;
>>>>>>>>>>>>                  struct drm_sched_fence *s_fence;
>>>>>>>>>>>> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>>>>>>>>>>>>                  sched_job = drm_sched_entity_pop_job(entity);
>>>>>>>>>>>>                  if (!sched_job) {
>>>>>>>>>>>>                          complete_all(&entity->entity_idle);
>>>>>>>>>>>> -                       if (!cleanup_job)
>>>>>>>>>>>> -                               return; /* No more work */
>>>>>>>>>>>> -                       goto again;
>>>>>>>>>>>> +                       return; /* No more work */
>>>>>>>>>>>>                  }
>>>>>>>>>>>>                  s_fence = sched_job->s_fence;
>>>>>>>>>>>> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>>>>>>>>>>>>                  }
>>>>>>>>>>>>                  wake_up(&sched->job_scheduled);
>>>>>>>>>>>> +               drm_sched_run_job_queue_if_ready(sched);
>>>>>>>>>>>>          }
>>>>>>>>>>>> -
>>>>>>>>>>>> -again:
>>>>>>>>>>>> -       drm_sched_submit_queue(sched);
>>>>>>>>>>>>       }
>>>>>>>>>>>>       /**
>>>>>>>>>>>> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>>>>>>>>>          spin_lock_init(&sched->job_list_lock);
>>>>>>>>>>>>          atomic_set(&sched->hw_rq_count, 0);
>>>>>>>>>>>>          INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>>>>>>>>>>> -       INIT_WORK(&sched->work_submit, drm_sched_main);
>>>>>>>>>>>> +       INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>>>>>>>>>>> +       INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>>>>>>>>>>>          atomic_set(&sched->_score, 0);
>>>>>>>>>>>>          atomic64_set(&sched->job_id_count, 0);
>>>>>>>>>>>>          sched->pause_submit = false;
>>>>>>>>>>>> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>>>>>>>>>>>>       void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>>       {
>>>>>>>>>>>>          WRITE_ONCE(sched->pause_submit, true);
>>>>>>>>>>>> -       cancel_work_sync(&sched->work_submit);
>>>>>>>>>>>> +       cancel_work_sync(&sched->work_run_job);
>>>>>>>>>>>> +       cancel_work_sync(&sched->work_free_job);
>>>>>>>>>>>>       }
>>>>>>>>>>>>       EXPORT_SYMBOL(drm_sched_submit_stop);
>>>>>>>>>>>> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>>>>>>>>>>>>       void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>>>>>>>>>>>>       {
>>>>>>>>>>>>          WRITE_ONCE(sched->pause_submit, false);
>>>>>>>>>>>> -       queue_work(sched->submit_wq, &sched->work_submit);
>>>>>>>>>>>> +       queue_work(sched->submit_wq, &sched->work_run_job);
>>>>>>>>>>>> +       queue_work(sched->submit_wq, &sched->work_free_job);
>>>>>>>>>>>>       }
>>>>>>>>>>>>       EXPORT_SYMBOL(drm_sched_submit_start);
>>>>>>>>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>>>>>>>>> index 04eec2d7635f..fbc083a92757 100644
>>>>>>>>>>>> --- a/include/drm/gpu_scheduler.h
>>>>>>>>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>>>>>>>>> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>>>>>>>>>>>>        *                 finished.
>>>>>>>>>>>>        * @hw_rq_count: the number of jobs currently in the hardware queue.
>>>>>>>>>>>>        * @job_id_count: used to assign unique id to the each job.
>>>>>>>>>>>> - * @submit_wq: workqueue used to queue @work_submit
>>>>>>>>>>>> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>>>>>>>>>>>        * @timeout_wq: workqueue used to queue @work_tdr
>>>>>>>>>>>> - * @work_submit: schedules jobs and cleans up entities
>>>>>>>>>>>> + * @work_run_job: schedules jobs
>>>>>>>>>>>> + * @work_free_job: cleans up jobs
>>>>>>>>>>>>        * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>>>>>>>>>>>>        *            timeout interval is over.
>>>>>>>>>>>>        * @pending_list: the list of jobs which are currently in the job queue.
>>>>>>>>>>>> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>>>>>>>>>>>>          atomic64_t                      job_id_count;
>>>>>>>>>>>>          struct workqueue_struct         *submit_wq;
>>>>>>>>>>>>          struct workqueue_struct         *timeout_wq;
>>>>>>>>>>>> -       struct work_struct              work_submit;
>>>>>>>>>>>> +       struct work_struct              work_run_job;
>>>>>>>>>>>> +       struct work_struct              work_free_job;
>>>>>>>>>>>>          struct delayed_work             work_tdr;
>>>>>>>>>>>>          struct list_head                pending_list;
>>>>>>>>>>>>          spinlock_t                      job_list_lock;


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-22 16:55                                   ` Faith Ekstrand
@ 2023-08-24 11:50                                     ` Bas Nieuwenhuizen
  0 siblings, 0 replies; 80+ messages in thread
From: Bas Nieuwenhuizen @ 2023-08-24 11:50 UTC (permalink / raw)
  To: Faith Ekstrand
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, Danilo Krummrich, donald.robson, boris.brezillon,
	Christian König, faith.ekstrand

[-- Attachment #1: Type: text/plain, Size: 4735 bytes --]

On Tue, Aug 22, 2023 at 6:55 PM Faith Ekstrand <faith@gfxstrand.net> wrote:

> On Tue, Aug 22, 2023 at 4:51 AM Christian König <christian.koenig@amd.com>
> wrote:
>
>> Am 21.08.23 um 21:46 schrieb Faith Ekstrand:
>>
>> On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koenig@amd.com>
>> wrote:
>>
>>> [SNIP]
>>> So as long as nobody from userspace comes and says we absolutely need to
>>> optimize this use case I would rather not do it.
>>>
>>
>> This is a place where nouveau's needs are legitimately different from AMD
>> or Intel, I think.  NVIDIA's command streamer model is very different from
>> AMD and Intel.  On AMD and Intel, each EXEC turns into a single small
>> packet (on the order of 16B) which kicks off a command buffer.  There may
>> be a bit of cache management or something around it but that's it.  From
>> there, it's userspace's job to make one command buffer chain to another
>> until it's finally done and then do a "return", whatever that looks like.
>>
>> NVIDIA's model is much more static.  Each packet in the HW/FW ring is an
>> address and a size and that much data is processed and then it grabs the
>> next packet and processes it. The result is that, if we use multiple buffers
>> of commands, there's no way to chain them together.  We just have to pass
>> the whole list of buffers to the kernel.
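As a purely illustrative sketch of what that submission shape implies
(hypothetical structures, not nouveau's actual UAPI):

    /* One ring entry: "execute this much command data from this address". */
    struct example_push_entry {
            u64 addr;
            u32 size;
    };

    /* A single EXEC/job has to carry the whole, un-chainable list of them. */
    struct example_exec_args {
            u32 num_pushes;                        /* may be in the hundreds */
            struct example_push_entry pushes[];    /* flexible array member */
    };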
>>
>>
>> So far that is actually completely identical to what AMD has.
>>
>> A single EXEC ioctl / job may have 500 such addr+size packets depending
>> on how big the command buffer is.
>>
>>
>> And that is what I don't understand. Why would you need hundreds of such
>> addr+size packets?
>>
>
> Well, we're not really in control of it.  We can control our base pushbuf
> size and that's something we can tune but we're still limited by the
> client.  We have to submit another pushbuf whenever:
>
>  1. We run out of space (power-of-two growth is also possible but the size
> is limited to a maximum of about 4MiB due to hardware limitations.)
>  2. The client calls a secondary command buffer.
>  3. Any usage of indirect draw or dispatch on pre-Turing hardware.
>
> At some point we need to tune our BO size a bit to avoid (1) while also
> avoiding piles of tiny BOs.  However, (2) and (3) are out of our control.
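A tiny sketch of the power-of-two growth mentioned in (1), with the ~4MiB
hardware cap (illustrative only, hypothetical names):

    #define EXAMPLE_MAX_PUSHBUF_SIZE    (4u << 20)    /* ~4 MiB HW limit */

    static u32 example_next_pushbuf_size(u32 cur_size)
    {
            return min(cur_size * 2, EXAMPLE_MAX_PUSHBUF_SIZE);
    }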
>
> This is basically identical to what AMD has (well on newer hw there is an
>> extension in the CP packets to JUMP/CALL subsequent IBs, but this isn't
>> widely used as far as I know).
>>
>
> According to Bas, RADV chains on recent hardware.
>

well:

1) on GFX6 and older we can't chain at all
2) on Compute/DMA we can't chain at all
3) with VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT we can't chain between
cmdbuffers
4) for some secondary use cases we can't chain.

so we have to do the "submit multiple" dance in many cases.
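A rough sketch of that decision as a single predicate over the constraints
listed above (hypothetical names and flags; the real RADV logic is userspace
code and more involved):

    /* Can cmdbuffer "next" be chained onto the end of "prev"? */
    static bool example_can_chain(const struct example_cmdbuf *prev,
                                  const struct example_cmdbuf *next)
    {
            if (prev->gfx_level <= EXAMPLE_GFX6)            /* 1) too old to chain */
                    return false;
            if (prev->queue_type != EXAMPLE_QUEUE_GFX)      /* 2) compute/DMA */
                    return false;
            if (next->simultaneous_use)                     /* 3) SIMULTANEOUS_USE_BIT */
                    return false;
            if (next->secondary_no_chain)                   /* 4) some secondary cases */
                    return false;
            return true;
    }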

>
>
>> Previously the limit was something like 4, which we later extended because
>> Bas came up with similar requirements for the AMD side from RADV.
>>
>> But essentially those approaches with hundreds of IBs don't sound like
>> a good idea to me.
>>
>
> No one's arguing that they like it.  Again, the hardware isn't designed to
> have a kernel in the way. It's designed to be fed by userspace. But we're
> going to have the kernel in the middle for a while so we need to make it
> not suck too bad.
>
> ~Faith
>
> It gets worse on pre-Turing hardware where we have to split the batch for
>> every single DrawIndirect or DispatchIndirect.
>>
>> Lest you think NVIDIA is just crazy here, it's a perfectly reasonable
>> model if you assume that userspace is feeding the firmware.  When that's
>> happening, you just have a userspace thread that sits there and feeds the
>> ringbuffer with whatever is next and you can marshal as much data through
>> as you want. Sure, it'd be nice to have a 2nd level batch thing that gets
>> launched from the FW ring and has all the individual launch commands but
>> it's not at all necessary.
>>
>> What does that mean from a gpu_scheduler PoV? Basically, it means a
>> variable packet size.
>>
>> What does this mean for implementation? IDK.  One option would be to
>> teach the scheduler about actual job sizes. Another would be to virtualize
>> it and have another layer underneath the scheduler that does the actual
>> feeding of the ring. Another would be to decrease the job size somewhat and
>> then have the front-end submit as many jobs as it needs to service
>> userspace and only put the out-fences on the last job. All the options
>> kinda suck.
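For the first option ("teach the scheduler about actual job sizes"), a sketch
of what size-aware admission could look like, using hypothetical fields rather
than anything that exists in drm_sched today:

    /* Account ring space in units instead of counting jobs. */
    static bool example_sched_can_queue(struct example_sched *sched,
                                        const struct example_job *job)
    {
            return atomic_read(&sched->units_in_flight) + job->ring_units <=
                   sched->units_limit;
    }

In such a scheme run_job would add job->ring_units to units_in_flight and the
hardware fence callback would subtract it again.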
>>
>>
>> Yeah, agree. The job size Danilo suggested is still the least painful.
>>
>> Christian.
>>
>>
>> ~Faith
>>
>>
>>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [Intel-xe] [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-24 11:44                         ` Christian König
@ 2023-08-24 14:30                           ` Matthew Brost
  0 siblings, 0 replies; 80+ messages in thread
From: Matthew Brost @ 2023-08-24 14:30 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, sarah.walker, ketil.johnsen, lina, Liviu.Dudau,
	dri-devel, luben.tuikov, boris.brezillon, donald.robson,
	Rodrigo Vivi, intel-xe, faith.ekstrand

On Thu, Aug 24, 2023 at 01:44:41PM +0200, Christian König wrote:
> Am 24.08.23 um 01:12 schrieb Matthew Brost:
> > On Wed, Aug 23, 2023 at 01:26:09PM -0400, Rodrigo Vivi wrote:
> > > On Wed, Aug 23, 2023 at 11:41:19AM -0400, Alex Deucher wrote:
> > > > On Wed, Aug 23, 2023 at 11:26 AM Matthew Brost <matthew.brost@intel.com> wrote:
> > > > > On Wed, Aug 23, 2023 at 09:10:51AM +0200, Christian König wrote:
> > > > > > Am 23.08.23 um 05:27 schrieb Matthew Brost:
> > > > > > > [SNIP]
> > > > > > > > That is exactly what I want to avoid, tying the TDR to the job is what some
> > > > > > > > AMD engineers pushed for because it looked like a simple solution and made
> > > > > > > > the whole thing similar to what Windows does.
> > > > > > > > 
> > > > > > > > This turned the previous relatively clean scheduler and TDR design into a
> > > > > > > > complete nightmare. The job contains quite a bunch of things which are not
> > > > > > > > necessarily available after the application which submitted the job is torn
> > > > > > > > down.
> > > > > > > > 
> > > > > > > Agree, the TDR shouldn't be accessing anything application specific,
> > > > > > > rather just the internal job state required to tear the job down on the
> > > > > > > hardware.
> > > > > > > > So what happens is that you either have stale pointers in the TDR which can
> > > > > > > > go boom extremely easily or we somehow find a way to keep the necessary
> > > > > > > I have not experienced the TDR going boom in Xe.
> > > > > > > 
> > > > > > > > structures (which include struct thread_info and struct file for this driver
> > > > > > > > connection) alive until all submissions are completed.
> > > > > > > > 
> > > > > > > In Xe we keep everything alive until all submissions are completed. By
> > > > > > > everything I mean the drm job, entity, scheduler, and VM via a reference
> > > > > > > counting scheme. All of these structures are just kernel state which can
> > > > > > > safely be accessed even if the application has been killed.
> > > > > > Yeah, but that might just not be such a good idea from a memory management
> > > > > > point of view.
> > > > > > 
> > > > > > When you (for example) kill a process all resources from that process should
> > > > > > at least be queued to be freed more or less immediately.
> > > > > > 
> > > > > We do this, the TDR kicks jobs off the hardware as fast as the hw
> > > > > interface allows and signals all pending hw fences immediately after.
> > > > > The free_job callback is then called immediately and the reference count goes to
> > > > > zero. I think max time for all of this to occur is a handful of ms.
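A compressed sketch of the reference chain being described, with hypothetical
Xe-ish names (drm_sched_job_cleanup() and the kref helpers are real kernel
APIs, the rest is made up for illustration and assumes the driver job embeds
a struct drm_sched_job base and a struct kref refcount):

    static void example_job_destroy(struct kref *kref)
    {
            struct example_job *job = container_of(kref, struct example_job, refcount);

            /* Dropping the queue ref may in turn release entity, scheduler and VM. */
            example_queue_put(job->queue);
            kfree(job);
    }

    /* ->free_job() backend op: the last reference usually goes away here. */
    static void example_free_job(struct drm_sched_job *sched_job)
    {
            struct example_job *job = container_of(sched_job, struct example_job, base);

            drm_sched_job_cleanup(sched_job);
            kref_put(&job->refcount, example_job_destroy);
    }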
> > > > > 
> > > > > > What Linux is doing for other I/O operations is to keep the relevant pages
> > > > > > alive until the I/O operation is completed, but for GPUs that usually means
> > > > > > keeping most of the memory of the process alive and that in turn is really
> > > > > > not something you can do.
> > > > > > 
> > > > > > You can of course do this if your driver has a reliable way of killing your
> > > > > > submissions and freeing resources in a reasonable amount of time. This
> > > > > > should then be done in the flush callback.
> > > > > > 
> > > > > 'flush callback' - Do you mean drm_sched_entity_flush? I looked at that
> > > > > and from what I can tell that function doesn't even work for this. It flushes
> > > > > the spsc queue but what about jobs on the hardware, how do those get
> > > > > killed?
> > > > > 
> > > > > As stated, we do this via the TDR, which is a rather clean design and fits with
> > > > > our reference counting scheme.
> > > > > 
> > > > > > > If we need to tear down on demand we just set the TDR to a minimum value and
> > > > > > > it kicks the jobs off the hardware, gracefully cleans everything up and
> > > > > > > drops all references. This is a benefit of the 1 to 1 relationship, not
> > > > > > > sure if this works with how AMDGPU uses the scheduler.
> > > > > > > 
> > > > > > > > Delaying application tear down is also not an option because then you run
> > > > > > > > into massive trouble with the OOM killer (or more generally OOM handling).
> > > > > > > > See what we do in drm_sched_entity_flush() as well.
> > > > > > > > 
> > > > > > > Not an issue for Xe, we never call drm_sched_entity_flush as our
> > > > > > > reference counting scheme ensures all jobs are finished before we attempt
> > > > > > > to tear down the entity / scheduler.
> > > > > > I don't think you can do that upstream. Calling drm_sched_entity_flush() is
> > > > > > a must have from your flush callback for the file descriptor.
> > > > > > 
> > > > > Again, 'flush callback'? What are you referring to?
> > > > > 
> > > > > And why does drm_sched_entity_flush need to be called? It doesn't seem to
> > > > > do anything useful.
> > > > > 
> > > > > > Unless you have some other method for killing your submissions this would
> > > > > > give a path for a denial-of-service attack vector when the Xe driver is in
> > > > > > use.
> > > > > > 
> > > > > Yes, once the TDR fires it disallows all new submissions at the exec
> > > > > IOCTL plus flushes any pending submissions as fast as possible.
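For illustration, the exec-side gate can be as simple as a banned flag checked
early in the IOCTL handler (hypothetical names; the actual error code is a
UAPI choice):

    static int example_exec_check(const struct example_queue *queue)
    {
            /* Once the TDR has banned the queue, refuse all new submissions. */
            if (READ_ONCE(queue->banned))
                    return -ECANCELED;      /* hypothetical choice of errno */
            return 0;
    }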
> > > > > 
> > > > > > > > Since adding the TDR support we have exercised this thoroughly over the
> > > > > > > > last two or three years or so. And to sum it up I would really like to get
> > > > > > > > away from this mess again.
> > > > > > > > 
> > > > > > > > Compared to that what i915 does is actually rather clean I think.
> > > > > > > > 
> > > > > > > Not even close, resets were a nightmare in the i915 (I spent years
> > > > > > > trying to get this right and it probably still doesn't completely work) and in Xe
> > > > > > > we basically got it right on the first attempt.
> > > > > > > 
> > > > > > > > >     Also in Xe some of
> > > > > > > > > things done in free_job cannot be done from an IRQ context, hence calling
> > > > > > > > > this from the scheduler worker is rather helpful.
> > > > > > > > Well putting things for cleanup into a workitem doesn't sound like
> > > > > > > > something hard.
> > > > > > > > 
> > > > > > > That is exactly what we are doing in the scheduler with the free_job
> > > > > > > workitem.
> > > > > > Yeah, but I think that doing it in the scheduler and not the driver is
> > > > > > problematic.
> > > Christian, I do see your point on simply getting rid of the free job callbacks here
> > > and then using the fence with a driver-owned workqueue for house cleaning. But I wonder
> > > whether starting with this patch as a clear separation of that is not a step forward,
> > > and whether that could be cleaned up in a follow-up!?
> > > 
> > > Matt, why exactly do we need the separation in this patch? The commit message tells
> > > what it is doing and that it is aligned with the design, but it is not clear on why
> > > exactly we need this right now. Especially if in the end what we want is exactly
> > > keeping the submit_wq to ensure the serialization of the operations you mentioned.
> > > I mean, could we simply drop this patch, work on a follow-up later and
> > > investigate Christian's suggestion when we are in-tree?
> > > 
> > I believe Christian suggested this change in a previous rev (free_job and
> > process_msg each in their own work item) [1].
> > 
> > Dropping free_job / calling run_job again is really a completely
> > different topic than this patch.
> 
> Yeah, agree. I just wanted to bring this up before we put even more effort
> in the free_job based approach.
> 
> Rodrigo's point is a really good one, no matter if the driver or the
> scheduler frees the job. Doing that in a separate work item sounds like the
> right thing to do.
> 

Ok, so this patch for now is ok but as a follow up we should explore
dropping free_job / scheduler refs to jobs with a wider audience as this
change affects all drivers.

Matt

> Regards,
> Christian.
> 
> > 
> > [1] https://patchwork.freedesktop.org/patch/550722/?series=121745&rev=1
> > 
> > > > > Disagree, a common cleanup callback from a non-IRQ context IMO is a good
> > > > > design rather than each driver possibly having its own worker for
> > > > > cleanup.
> > > > > 
> > > > > > The scheduler shouldn't care about the job any more as soon as the
> > > > > > driver takes over.
> > > > > > 
> > > > > This is a massive rewrite for all users of the DRM scheduler, and I'm saying
> > > > > for Xe what you are suggesting makes little to no sense.
> > > > > 
> > > > > I'd like other users of the DRM scheduler to chime in on what you are
> > > > > proposing. The scope of this change affects 8ish drivers and would
> > > > > require buy-in from each of the stakeholders. I certainly can't change all of
> > > > > these drivers as I don't feel comfortable in all of those code bases, nor
> > > > > do I have hardware to test all of these drivers.
> > > > > 
> > > > > > > > Question is what do you really need for TDR which is not inside the hardware
> > > > > > > > fence?
> > > > > > > > 
> > > > > > > A reference to the entity to be able to kick the job off the hardware.
> > > > > > > A reference to the entity, job, and VM for error capture.
> > > > > > > 
> > > > > > > We also need a reference to the job for recovery after a GPU reset so
> > > > > > > run_job can be called again for innocent jobs.
> > > > > > Well, exactly that's what I'm massively pushing back on. Letting the scheduler
> > > > > > call run_job() for the same job again is *NOT* something you can actually
> > > > > > do.
> > > > > > 
> > > > > But lots of drivers do this already and the DRM scheduler documents
> > > > > this.
> > > > > 
> > > > > > This pretty clearly violates some of the dma_fence constraints and has caused
> > > > > > massive headaches for me already.
> > > > > > 
> > > > > Seems to work fine in Xe.
> > > > > 
> > > > > > What you can do is to do this inside your driver, e.g. take the jobs and
> > > > > > push them again to the hw ring or just tell the hw to start executing again
> > > > > > from a previous position.
> > > > > > 
> > > > > Again, this is now a massive rewrite of many drivers.
> > > > > 
> > > > > > BTW that re-submitting of jobs seems to be a no-go from a userspace
> > > > > > perspective as well. Take a look at the Vulkan spec for that, at least Marek
> > > > > > pretty much pointed out that we should absolutely not do this inside the
> > > > > > kernel.
> > > > > > 
> > > > > Yes, if the job causes the hang, we ban the queue. Typically only
> > > > > per-entity (queue) resets are done in Xe, but occasionally device-level
> > > > > resets are done (issues with hardware) and innocent jobs / entities call
> > > > > run_job again.
> > > > If the engine is reset and the job was already executing, how can you
> > > > determine that it's in a good state to resubmit?  What if some
> > If a job has started but not completed we ban the queue during device
> > reset. If a queue has jobs submitted but not started we resubmit all
> > jobs on the queue during device reset.
> > 
> > The started / completed state can be determined by looking at a seqno in
> > memory.
> > 
> > > > internal fence or semaphore in memory used by the logic in the command
> > > > buffer has been signaled already and then you resubmit the job and it
> > > > now starts executing with different input state?
> > > I believe we could set some more rules in the new robustness documentation:
> > > https://lore.kernel.org/all/20230818200642.276735-1-andrealmeid@igalia.com/
> > > 
> > > For this robustness implementation i915 pinpoints the exact context that
> > > was in execution when the GPU hung and only blames that one, although the
> > > resubmission is up to the user space. While on Xe we are blaming every
> > > single context that was in the queue. So I'm actually confused about what
> > > the innocent jobs are and who is calling for resubmission, if all of
> > > them got banned and blamed.
> > See above, innocent job == submitted job but not started (i.e. a job
> > stuck in the FW queue that has not yet been put on the hardware). Because we have
> > a FW scheduler we could have 1000s of innocent jobs that don't need to
> > get banned. This is very different from drivers without FW schedulers as
> > typically when run_job is called the job hits the hardware immediately.
> > 
> > Matt
> > 
> > > > Alex
> > > > 
> > > > > > The generally right approach seems to be to cleanly signal to userspace that
> > > > > > something bad happened and that userspace then needs to submit things again
> > > > > > even for innocent jobs.
> > > > > > 
> > > > > I disagree that innocent jobs should be banned. What you are suggesting
> > > > > is if a device reset needs to be done we kill / ban every user space queue.
> > > > > That seems like overkill. I'm not seeing where that is stated in this doc
> > > > > [1]; it seems to imply that only jobs that are stuck result in bans.
> > > > > 
> > > > > Matt
> > > > > 
> > > > > [1] https://patchwork.freedesktop.org/patch/553465/?series=119883&rev=3
> > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > > 
> > > > > > > All of this leads me to believe we need to stick with the design.
> > > > > > > 
> > > > > > > Matt
> > > > > > > 
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > > > The HW fence can live for longer as it can be installed in dma-resv
> > > > > > > > > slots, syncobjs, etc... If the job and hw fence are combined we are
> > > > > > > > > now holding on to the memory for longer and perhaps at the mercy of the
> > > > > > > > > user. We also run the risk of the final put being done from an IRQ
> > > > > > > > > context which again won't work in Xe as it is currently coded. Lastly 2
> > > > > > > > > jobs from the same scheduler could do the final put in parallel, so
> > > > > > > > > rather than having free_job serialized by the worker now multiple jobs
> > > > > > > > > are freeing themselves at the same time. This might not be an issue but
> > > > > > > > > adds another level of raciness that needs to be accounted for. None of
> > > > > > > > > this sounds desirable to me.
> > > > > > > > > 
> > > > > > > > > FWIW what you're suggesting sounds like how the i915 did things
> > > > > > > > > (i915_request and hw fence in 1 memory alloc) and that turned out to be
> > > > > > > > > a huge mess. As a rule of thumb I generally do the opposite of whatever
> > > > > > > > > the i915 did.
> > > > > > > > > 
> > > > > > > > > Matt
> > > > > > > > > 
> > > > > > > > > > Christian.
> > > > > > > > > > 
> > > > > > > > > > > Matt
> > > > > > > > > > > 
> > > > > > > > > > > > All the lifetime issues we had came from ignoring this fact and I think we
> > > > > > > > > > > > should push for fixing this design up again.
> > > > > > > > > > > > 
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Christian.
> > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >       drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > > > > > > > > > > >       include/drm/gpu_scheduler.h            |   8 +-
> > > > > > > > > > > > >       2 files changed, 106 insertions(+), 39 deletions(-)
> > > > > > > > > > > > > 
> > > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > > index cede47afc800..b67469eac179 100644
> > > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > > > > > > > > > > >        * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > > > > > > > > > > >        *
> > > > > > > > > > > > >        * @rq: scheduler run queue to check.
> > > > > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > > > > >        *
> > > > > > > > > > > > >        * Try to find a ready entity, returns NULL if none found.
> > > > > > > > > > > > >        */
> > > > > > > > > > > > >       static struct drm_sched_entity *
> > > > > > > > > > > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > > > > > > >       {
> > > > > > > > > > > > >          struct drm_sched_entity *entity;
> > > > > > > > > > > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > > > >          if (entity) {
> > > > > > > > > > > > >                  list_for_each_entry_continue(entity, &rq->entities, list) {
> > > > > > > > > > > > >                          if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > > > > -                               rq->current_entity = entity;
> > > > > > > > > > > > > -                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > > > > +                               if (dequeue) {
> > > > > > > > > > > > > +                                       rq->current_entity = entity;
> > > > > > > > > > > > > +                                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > > > > +                               }
> > > > > > > > > > > > >                                  spin_unlock(&rq->lock);
> > > > > > > > > > > > >                                  return entity;
> > > > > > > > > > > > >                          }
> > > > > > > > > > > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > > > >          list_for_each_entry(entity, &rq->entities, list) {
> > > > > > > > > > > > >                  if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > > > > -                       rq->current_entity = entity;
> > > > > > > > > > > > > -                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > > > > +                       if (dequeue) {
> > > > > > > > > > > > > +                               rq->current_entity = entity;
> > > > > > > > > > > > > +                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > > > > +                       }
> > > > > > > > > > > > >                          spin_unlock(&rq->lock);
> > > > > > > > > > > > >                          return entity;
> > > > > > > > > > > > >                  }
> > > > > > > > > > > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > > > > > > > > > >        * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > > > > > > > > > > >        *
> > > > > > > > > > > > >        * @rq: scheduler run queue to check.
> > > > > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > > > > >        *
> > > > > > > > > > > > >        * Find oldest waiting ready entity, returns NULL if none found.
> > > > > > > > > > > > >        */
> > > > > > > > > > > > >       static struct drm_sched_entity *
> > > > > > > > > > > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > > > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > > > > > > > > > > >       {
> > > > > > > > > > > > >          struct rb_node *rb;
> > > > > > > > > > > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > > > > >                  entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > > > > > > > > > > >                  if (drm_sched_entity_is_ready(entity)) {
> > > > > > > > > > > > > -                       rq->current_entity = entity;
> > > > > > > > > > > > > -                       reinit_completion(&entity->entity_idle);
> > > > > > > > > > > > > +                       if (dequeue) {
> > > > > > > > > > > > > +                               rq->current_entity = entity;
> > > > > > > > > > > > > +                               reinit_completion(&entity->entity_idle);
> > > > > > > > > > > > > +                       }
> > > > > > > > > > > > >                          break;
> > > > > > > > > > > > >                  }
> > > > > > > > > > > > >          }
> > > > > > > > > > > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > > > > > > > > > >       }
> > > > > > > > > > > > >       /**
> > > > > > > > > > > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > > > > > > > > > > + * drm_sched_run_job_queue - queue job submission
> > > > > > > > > > > > >        * @sched: scheduler instance
> > > > > > > > > > > > >        */
> > > > > > > > > > > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > >       {
> > > > > > > > > > > > >          if (!READ_ONCE(sched->pause_submit))
> > > > > > > > > > > > > -               queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > > > > > > +               queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +static struct drm_sched_entity *
> > > > > > > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > > > > > > > > > > + * @sched: scheduler instance
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +       if (drm_sched_select_entity(sched, false))
> > > > > > > > > > > > > +               drm_sched_run_job_queue(sched);
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * drm_sched_free_job_queue - queue free job
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +       if (!READ_ONCE(sched->pause_submit))
> > > > > > > > > > > > > +               queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * @sched: scheduler instance to queue free job
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +       struct drm_sched_job *job;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +       spin_lock(&sched->job_list_lock);
> > > > > > > > > > > > > +       job = list_first_entry_or_null(&sched->pending_list,
> > > > > > > > > > > > > +                                      struct drm_sched_job, list);
> > > > > > > > > > > > > +       if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > > > > > > > > > > +               drm_sched_free_job_queue(sched);
> > > > > > > > > > > > > +       spin_unlock(&sched->job_list_lock);
> > > > > > > > > > > > >       }
> > > > > > > > > > > > >       /**
> > > > > > > > > > > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > > > > > > > > > > >          dma_fence_get(&s_fence->finished);
> > > > > > > > > > > > >          drm_sched_fence_finished(s_fence, result);
> > > > > > > > > > > > >          dma_fence_put(&s_fence->finished);
> > > > > > > > > > > > > -       drm_sched_submit_queue(sched);
> > > > > > > > > > > > > +       drm_sched_free_job_queue(sched);
> > > > > > > > > > > > >       }
> > > > > > > > > > > > >       /**
> > > > > > > > > > > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > >       void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > >       {
> > > > > > > > > > > > >          if (drm_sched_can_queue(sched))
> > > > > > > > > > > > > -               drm_sched_submit_queue(sched);
> > > > > > > > > > > > > +               drm_sched_run_job_queue(sched);
> > > > > > > > > > > > >       }
> > > > > > > > > > > > >       /**
> > > > > > > > > > > > >        * drm_sched_select_entity - Select next entity to process
> > > > > > > > > > > > >        *
> > > > > > > > > > > > >        * @sched: scheduler instance
> > > > > > > > > > > > > + * @dequeue: dequeue selected entity
> > > > > > > > > > > > >        *
> > > > > > > > > > > > >        * Returns the entity to process or NULL if none are found.
> > > > > > > > > > > > >        */
> > > > > > > > > > > > >       static struct drm_sched_entity *
> > > > > > > > > > > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > > > > > > > > > > >       {
> > > > > > > > > > > > >          struct drm_sched_entity *entity;
> > > > > > > > > > > > >          int i;
> > > > > > > > > > > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > >          /* Kernel run queue has higher priority than normal run queue*/
> > > > > > > > > > > > >          for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > > > > > > > > > > >                  entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > > > > > > > > > > -                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > > > > > > > > > > -                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > > > > > > > > > > +                       drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > > > > > > > > > > +                                                       dequeue) :
> > > > > > > > > > > > > +                       drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > > > > > > > > > > +                                                     dequeue);
> > > > > > > > > > > > >                  if (entity)
> > > > > > > > > > > > >                          break;
> > > > > > > > > > > > >          }
> > > > > > > > > > > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > > > > > > > > > > >       EXPORT_SYMBOL(drm_sched_pick_best);
> > > > > > > > > > > > >       /**
> > > > > > > > > > > > > - * drm_sched_main - main scheduler thread
> > > > > > > > > > > > > + * drm_sched_free_job_work - worker to call free_job
> > > > > > > > > > > > >        *
> > > > > > > > > > > > > - * @param: scheduler instance
> > > > > > > > > > > > > + * @w: free job work
> > > > > > > > > > > > >        */
> > > > > > > > > > > > > -static void drm_sched_main(struct work_struct *w)
> > > > > > > > > > > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > > > > > > > > > > >       {
> > > > > > > > > > > > >          struct drm_gpu_scheduler *sched =
> > > > > > > > > > > > > -               container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > > > > > > > > > > -       struct drm_sched_entity *entity;
> > > > > > > > > > > > > +               container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > > > > > > > > > > >          struct drm_sched_job *cleanup_job;
> > > > > > > > > > > > > -       int r;
> > > > > > > > > > > > >          if (READ_ONCE(sched->pause_submit))
> > > > > > > > > > > > >                  return;
> > > > > > > > > > > > >          cleanup_job = drm_sched_get_cleanup_job(sched);
> > > > > > > > > > > > > -       entity = drm_sched_select_entity(sched);
> > > > > > > > > > > > > +       if (cleanup_job) {
> > > > > > > > > > > > > +               sched->ops->free_job(cleanup_job);
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +               drm_sched_free_job_queue_if_ready(sched);
> > > > > > > > > > > > > +               drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > > > > > > +       }
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > -       if (!entity && !cleanup_job)
> > > > > > > > > > > > > -               return; /* No more work */
> > > > > > > > > > > > > +/**
> > > > > > > > > > > > > + * drm_sched_run_job_work - worker to call run_job
> > > > > > > > > > > > > + *
> > > > > > > > > > > > > + * @w: run job work
> > > > > > > > > > > > > + */
> > > > > > > > > > > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +       struct drm_gpu_scheduler *sched =
> > > > > > > > > > > > > +               container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > > > > > > > > > > +       struct drm_sched_entity *entity;
> > > > > > > > > > > > > +       int r;
> > > > > > > > > > > > > -       if (cleanup_job)
> > > > > > > > > > > > > -               sched->ops->free_job(cleanup_job);
> > > > > > > > > > > > > +       if (READ_ONCE(sched->pause_submit))
> > > > > > > > > > > > > +               return;
> > > > > > > > > > > > > +       entity = drm_sched_select_entity(sched, true);
> > > > > > > > > > > > >          if (entity) {
> > > > > > > > > > > > >                  struct dma_fence *fence;
> > > > > > > > > > > > >                  struct drm_sched_fence *s_fence;
> > > > > > > > > > > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > > > > > > >                  sched_job = drm_sched_entity_pop_job(entity);
> > > > > > > > > > > > >                  if (!sched_job) {
> > > > > > > > > > > > >                          complete_all(&entity->entity_idle);
> > > > > > > > > > > > > -                       if (!cleanup_job)
> > > > > > > > > > > > > -                               return; /* No more work */
> > > > > > > > > > > > > -                       goto again;
> > > > > > > > > > > > > +                       return; /* No more work */
> > > > > > > > > > > > >                  }
> > > > > > > > > > > > >                  s_fence = sched_job->s_fence;
> > > > > > > > > > > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > > > > > > > > > > >                  }
> > > > > > > > > > > > >                  wake_up(&sched->job_scheduled);
> > > > > > > > > > > > > +               drm_sched_run_job_queue_if_ready(sched);
> > > > > > > > > > > > >          }
> > > > > > > > > > > > > -
> > > > > > > > > > > > > -again:
> > > > > > > > > > > > > -       drm_sched_submit_queue(sched);
> > > > > > > > > > > > >       }
> > > > > > > > > > > > >       /**
> > > > > > > > > > > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > > > > > > > > > > >          spin_lock_init(&sched->job_list_lock);
> > > > > > > > > > > > >          atomic_set(&sched->hw_rq_count, 0);
> > > > > > > > > > > > >          INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > > > > > > > > > > -       INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > > > > > > > > > > +       INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > > > > > > > > > > +       INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > > > > > > > > > > >          atomic_set(&sched->_score, 0);
> > > > > > > > > > > > >          atomic64_set(&sched->job_id_count, 0);
> > > > > > > > > > > > >          sched->pause_submit = false;
> > > > > > > > > > > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > > > > > > > > > > >       void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > >       {
> > > > > > > > > > > > >          WRITE_ONCE(sched->pause_submit, true);
> > > > > > > > > > > > > -       cancel_work_sync(&sched->work_submit);
> > > > > > > > > > > > > +       cancel_work_sync(&sched->work_run_job);
> > > > > > > > > > > > > +       cancel_work_sync(&sched->work_free_job);
> > > > > > > > > > > > >       }
> > > > > > > > > > > > >       EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > > > > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > > > > > > > > > >       void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > > > > > > > > > > >       {
> > > > > > > > > > > > >          WRITE_ONCE(sched->pause_submit, false);
> > > > > > > > > > > > > -       queue_work(sched->submit_wq, &sched->work_submit);
> > > > > > > > > > > > > +       queue_work(sched->submit_wq, &sched->work_run_job);
> > > > > > > > > > > > > +       queue_work(sched->submit_wq, &sched->work_free_job);
> > > > > > > > > > > > >       }
> > > > > > > > > > > > >       EXPORT_SYMBOL(drm_sched_submit_start);
> > > > > > > > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > > > > > > > > > > index 04eec2d7635f..fbc083a92757 100644
> > > > > > > > > > > > > --- a/include/drm/gpu_scheduler.h
> > > > > > > > > > > > > +++ b/include/drm/gpu_scheduler.h
> > > > > > > > > > > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > > > > > > > > > > >        *                 finished.
> > > > > > > > > > > > >        * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > > > > > > > > > > >        * @job_id_count: used to assign unique id to the each job.
> > > > > > > > > > > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > > > > > > > > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > > > > > > > > > > >        * @timeout_wq: workqueue used to queue @work_tdr
> > > > > > > > > > > > > - * @work_submit: schedules jobs and cleans up entities
> > > > > > > > > > > > > + * @work_run_job: schedules jobs
> > > > > > > > > > > > > + * @work_free_job: cleans up jobs
> > > > > > > > > > > > >        * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > > > > > > > > > > >        *            timeout interval is over.
> > > > > > > > > > > > >        * @pending_list: the list of jobs which are currently in the job queue.
> > > > > > > > > > > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > > > > > > > > > > >          atomic64_t                      job_id_count;
> > > > > > > > > > > > >          struct workqueue_struct         *submit_wq;
> > > > > > > > > > > > >          struct workqueue_struct         *timeout_wq;
> > > > > > > > > > > > > -       struct work_struct              work_submit;
> > > > > > > > > > > > > +       struct work_struct              work_run_job;
> > > > > > > > > > > > > +       struct work_struct              work_free_job;
> > > > > > > > > > > > >          struct delayed_work             work_tdr;
> > > > > > > > > > > > >          struct list_head                pending_list;
> > > > > > > > > > > > >          spinlock_t                      job_list_lock;
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 0/9] DRM scheduler changes for Xe
  2023-08-24  3:23   ` Matthew Brost
@ 2023-08-24 14:51     ` Danilo Krummrich
  0 siblings, 0 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-24 14:51 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

On 8/24/23 05:23, Matthew Brost wrote:
> On Thu, Aug 24, 2023 at 02:08:59AM +0200, Danilo Krummrich wrote:
>> Hi Matt,
>>
>> On 8/11/23 04:31, Matthew Brost wrote:
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first.
>>>
>>> This a continuation of a RFC [3] with all comments addressed, ready for
>>> a full review, and hopefully in state which can merged in the near
>>> future. More details of this series can found in the cover letter of the
>>> RFC [3].
>>>
>>> These changes have been tested with the Xe driver.
>>
>> Do you keep a branch with these patches somewhere?
>>
> 
> Pushed a branch for you:
> https://gitlab.freedesktop.org/mbrost/nouveau-drm-scheduler/-/tree/xe-sched-changes?ref_type=heads

Great - gonna pick this up to work on making use of DRM_SCHED_POLICY_SINGLE_ENTITY in Nouveau.

- Danilo

> 
> Matt
> 
>> - Danilo
>>
>>>
>>> v2:
>>>    - Break run job, free job, and process message in own work items
>>>    - This might break other drivers as run job and free job now can run in
>>>      parallel, can fix up if needed
>>>
>>> Matt
>>>
>>> Matthew Brost (9):
>>>     drm/sched: Convert drm scheduler to use a work queue  rather than
>>>       kthread
>>>     drm/sched: Move schedule policy to scheduler / entity
>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>     drm/sched: Split free_job into own work item
>>>     drm/sched: Add generic scheduler message interface
>>>     drm/sched: Add drm_sched_start_timeout_unlocked helper
>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>     drm/sched: Submit job before starting TDR
>>>     drm/sched: Add helper to set TDR timeout
>>>
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   3 +-
>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   5 +-
>>>    drivers/gpu/drm/lima/lima_sched.c          |   5 +-
>>>    drivers/gpu/drm/msm/msm_ringbuffer.c       |   5 +-
>>>    drivers/gpu/drm/nouveau/nouveau_sched.c    |   5 +-
>>>    drivers/gpu/drm/panfrost/panfrost_job.c    |   5 +-
>>>    drivers/gpu/drm/scheduler/sched_entity.c   |  85 ++++-
>>>    drivers/gpu/drm/scheduler/sched_fence.c    |   2 +-
>>>    drivers/gpu/drm/scheduler/sched_main.c     | 408 ++++++++++++++++-----
>>>    drivers/gpu/drm/v3d/v3d_sched.c            |  25 +-
>>>    include/drm/gpu_scheduler.h                |  75 +++-
>>>    11 files changed, 487 insertions(+), 136 deletions(-)
>>>
>>
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-11  2:31 ` [PATCH v2 4/9] drm/sched: Split free_job into own work item Matthew Brost
  2023-08-17 13:39   ` Christian König
@ 2023-08-24 23:04   ` Danilo Krummrich
  2023-08-25  2:58     ` Matthew Brost
  2023-08-28 18:04   ` Danilo Krummrich
  2 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-24 23:04 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, christian.koenig, luben.tuikov,
	donald.robson, boris.brezillon, intel-xe, faith.ekstrand

On Thu, Aug 10, 2023 at 07:31:32PM -0700, Matthew Brost wrote:
> Rather than call free_job and run_job in same work item have a dedicated
> work item for each. This aligns with the design and intended use of work
> queues.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>  include/drm/gpu_scheduler.h            |   8 +-
>  2 files changed, 106 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index cede47afc800..b67469eac179 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>   *
>   * @rq: scheduler run queue to check.
> + * @dequeue: dequeue selected entity
>   *
>   * Try to find a ready entity, returns NULL if none found.
>   */
>  static struct drm_sched_entity *
> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>  {
>  	struct drm_sched_entity *entity;
>  
> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>  	if (entity) {
>  		list_for_each_entry_continue(entity, &rq->entities, list) {
>  			if (drm_sched_entity_is_ready(entity)) {
> -				rq->current_entity = entity;
> -				reinit_completion(&entity->entity_idle);
> +				if (dequeue) {
> +					rq->current_entity = entity;
> +					reinit_completion(&entity->entity_idle);
> +				}
>  				spin_unlock(&rq->lock);
>  				return entity;
>  			}
> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>  	list_for_each_entry(entity, &rq->entities, list) {
>  
>  		if (drm_sched_entity_is_ready(entity)) {
> -			rq->current_entity = entity;
> -			reinit_completion(&entity->entity_idle);
> +			if (dequeue) {
> +				rq->current_entity = entity;
> +				reinit_completion(&entity->entity_idle);
> +			}
>  			spin_unlock(&rq->lock);
>  			return entity;
>  		}
> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>   *
>   * @rq: scheduler run queue to check.
> + * @dequeue: dequeue selected entity
>   *
>   * Find oldest waiting ready entity, returns NULL if none found.
>   */
>  static struct drm_sched_entity *
> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>  {
>  	struct rb_node *rb;
>  
> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>  
>  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>  		if (drm_sched_entity_is_ready(entity)) {
> -			rq->current_entity = entity;
> -			reinit_completion(&entity->entity_idle);
> +			if (dequeue) {
> +				rq->current_entity = entity;
> +				reinit_completion(&entity->entity_idle);
> +			}
>  			break;
>  		}
>  	}
> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>  }
>  
>  /**
> - * drm_sched_submit_queue - scheduler queue submission
> + * drm_sched_run_job_queue - queue job submission
>   * @sched: scheduler instance
>   */
> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>  {
>  	if (!READ_ONCE(sched->pause_submit))
> -		queue_work(sched->submit_wq, &sched->work_submit);
> +		queue_work(sched->submit_wq, &sched->work_run_job);
> +}
> +
> +static struct drm_sched_entity *
> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> +
> +/**
> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> +	if (drm_sched_select_entity(sched, false))
> +		drm_sched_run_job_queue(sched);
> +}
> +
> +/**
> + * drm_sched_free_job_queue - queue free job
> + *
> + * @sched: scheduler instance to queue free job
> + */
> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> +{
> +	if (!READ_ONCE(sched->pause_submit))
> +		queue_work(sched->submit_wq, &sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_free_job_queue_if_ready - queue free job if ready
> + *
> + * @sched: scheduler instance to queue free job
> + */
> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> +	struct drm_sched_job *job;
> +
> +	spin_lock(&sched->job_list_lock);
> +	job = list_first_entry_or_null(&sched->pending_list,
> +				       struct drm_sched_job, list);
> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> +		drm_sched_free_job_queue(sched);
> +	spin_unlock(&sched->job_list_lock);
>  }
>  
>  /**
> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>  	dma_fence_get(&s_fence->finished);
>  	drm_sched_fence_finished(s_fence, result);
>  	dma_fence_put(&s_fence->finished);
> -	drm_sched_submit_queue(sched);
> +	drm_sched_free_job_queue(sched);
>  }
>  
>  /**
> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>  void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>  {
>  	if (drm_sched_can_queue(sched))
> -		drm_sched_submit_queue(sched);
> +		drm_sched_run_job_queue(sched);
>  }
>  
>  /**
>   * drm_sched_select_entity - Select next entity to process
>   *
>   * @sched: scheduler instance
> + * @dequeue: dequeue selected entity
>   *
>   * Returns the entity to process or NULL if none are found.
>   */
>  static struct drm_sched_entity *
> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>  {
>  	struct drm_sched_entity *entity;
>  	int i;
> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>  	/* Kernel run queue has higher priority than normal run queue*/
>  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>  		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> +							dequeue) :
> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> +						      dequeue);
>  		if (entity)
>  			break;
>  	}
> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>  EXPORT_SYMBOL(drm_sched_pick_best);
>  
>  /**
> - * drm_sched_main - main scheduler thread
> + * drm_sched_free_job_work - worker to call free_job
>   *
> - * @param: scheduler instance
> + * @w: free job work
>   */
> -static void drm_sched_main(struct work_struct *w)
> +static void drm_sched_free_job_work(struct work_struct *w)
>  {
>  	struct drm_gpu_scheduler *sched =
> -		container_of(w, struct drm_gpu_scheduler, work_submit);
> -	struct drm_sched_entity *entity;
> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
>  	struct drm_sched_job *cleanup_job;
> -	int r;
>  
>  	if (READ_ONCE(sched->pause_submit))
>  		return;
>  
>  	cleanup_job = drm_sched_get_cleanup_job(sched);

I tried this patch with Nouveau and found a race condition:

In drm_sched_run_job_work() the job is added to the pending_list via
drm_sched_job_begin(), then the run_job() callback is called and the scheduled
fence is signaled.

However, in parallel drm_sched_get_cleanup_job() might be called from
drm_sched_free_job_work(), which picks the first job from the pending_list and
for the next job on the pending_list sets the scheduled fence's timestamp field.

The job can be on the pending_list, but the scheduled fence might not yet be
signaled. The call to actually signal the fence will subsequently fault because
it will try to dereference the timestamp.
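
For reference, the interleaving I'm seeing is roughly this (heavily
simplified sketch, not the exact code from the series):

/* run_job worker (simplified) */
drm_sched_job_begin(sched_job);          /* job becomes visible on pending_list */
s_fence = sched_job->s_fence;
fence = sched->ops->run_job(sched_job);
dma_fence_signal(&s_fence->scheduled);   /* via drm_sched_fence_scheduled() */

/* free_job worker, running in parallel (simplified) */
job = list_first_entry_or_null(&sched->pending_list,
			       struct drm_sched_job, list);
/* job's finished fence has signaled, so it is removed from the list and
 * 'next' is now the first entry, possibly the job run_job just began: */
next->s_fence->scheduled.timestamp = job->s_fence->finished.timestamp;
/* If next's scheduled fence has not been signaled yet, this write lands
 * in the cb_list/timestamp union of struct dma_fence and corrupts the
 * callback list that dma_fence_signal() will walk later. */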

I'm not sure what's the best way to fix this, maybe it's enough to re-order
signalling the scheduled fence and adding the job to the pending_list. Not sure
if this has other implications though.

- Danilo

> -	entity = drm_sched_select_entity(sched);
> +	if (cleanup_job) {
> +		sched->ops->free_job(cleanup_job);
> +
> +		drm_sched_free_job_queue_if_ready(sched);
> +		drm_sched_run_job_queue_if_ready(sched);
> +	}
> +}
>  
> -	if (!entity && !cleanup_job)
> -		return;	/* No more work */
> +/**
> + * drm_sched_run_job_work - worker to call run_job
> + *
> + * @w: run job work
> + */
> +static void drm_sched_run_job_work(struct work_struct *w)
> +{
> +	struct drm_gpu_scheduler *sched =
> +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> +	struct drm_sched_entity *entity;
> +	int r;
>  
> -	if (cleanup_job)
> -		sched->ops->free_job(cleanup_job);
> +	if (READ_ONCE(sched->pause_submit))
> +		return;
>  
> +	entity = drm_sched_select_entity(sched, true);
>  	if (entity) {
>  		struct dma_fence *fence;
>  		struct drm_sched_fence *s_fence;
> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>  		sched_job = drm_sched_entity_pop_job(entity);
>  		if (!sched_job) {
>  			complete_all(&entity->entity_idle);
> -			if (!cleanup_job)
> -				return;	/* No more work */
> -			goto again;
> +			return;	/* No more work */
>  		}
>  
>  		s_fence = sched_job->s_fence;
> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>  		}
>  
>  		wake_up(&sched->job_scheduled);
> +		drm_sched_run_job_queue_if_ready(sched);
>  	}
> -
> -again:
> -	drm_sched_submit_queue(sched);
>  }
>  
>  /**
> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  	spin_lock_init(&sched->job_list_lock);
>  	atomic_set(&sched->hw_rq_count, 0);
>  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> -	INIT_WORK(&sched->work_submit, drm_sched_main);
> +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>  	atomic_set(&sched->_score, 0);
>  	atomic64_set(&sched->job_id_count, 0);
>  	sched->pause_submit = false;
> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>  void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>  {
>  	WRITE_ONCE(sched->pause_submit, true);
> -	cancel_work_sync(&sched->work_submit);
> +	cancel_work_sync(&sched->work_run_job);
> +	cancel_work_sync(&sched->work_free_job);
>  }
>  EXPORT_SYMBOL(drm_sched_submit_stop);
>  
> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>  void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>  {
>  	WRITE_ONCE(sched->pause_submit, false);
> -	queue_work(sched->submit_wq, &sched->work_submit);
> +	queue_work(sched->submit_wq, &sched->work_run_job);
> +	queue_work(sched->submit_wq, &sched->work_free_job);
>  }
>  EXPORT_SYMBOL(drm_sched_submit_start);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 04eec2d7635f..fbc083a92757 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>   *                 finished.
>   * @hw_rq_count: the number of jobs currently in the hardware queue.
>   * @job_id_count: used to assign unique id to the each job.
> - * @submit_wq: workqueue used to queue @work_submit
> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>   * @timeout_wq: workqueue used to queue @work_tdr
> - * @work_submit: schedules jobs and cleans up entities
> + * @work_run_job: schedules jobs
> + * @work_free_job: cleans up jobs
>   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>   *            timeout interval is over.
>   * @pending_list: the list of jobs which are currently in the job queue.
> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>  	atomic64_t			job_id_count;
>  	struct workqueue_struct		*submit_wq;
>  	struct workqueue_struct		*timeout_wq;
> -	struct work_struct		work_submit;
> +	struct work_struct		work_run_job;
> +	struct work_struct		work_free_job;
>  	struct delayed_work		work_tdr;
>  	struct list_head		pending_list;
>  	spinlock_t			job_list_lock;
> -- 
> 2.34.1
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-24 23:04   ` Danilo Krummrich
@ 2023-08-25  2:58     ` Matthew Brost
  2023-08-25  8:02       ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-25  2:58 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, christian.koenig, luben.tuikov,
	donald.robson, boris.brezillon, intel-xe, faith.ekstrand

On Fri, Aug 25, 2023 at 01:04:10AM +0200, Danilo Krummrich wrote:
> On Thu, Aug 10, 2023 at 07:31:32PM -0700, Matthew Brost wrote:
> > Rather than call free_job and run_job in same work item have a dedicated
> > work item for each. This aligns with the design and intended use of work
> > queues.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> >  include/drm/gpu_scheduler.h            |   8 +-
> >  2 files changed, 106 insertions(+), 39 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index cede47afc800..b67469eac179 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> >   * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> >   *
> >   * @rq: scheduler run queue to check.
> > + * @dequeue: dequeue selected entity
> >   *
> >   * Try to find a ready entity, returns NULL if none found.
> >   */
> >  static struct drm_sched_entity *
> > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> >  {
> >  	struct drm_sched_entity *entity;
> >  
> > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >  	if (entity) {
> >  		list_for_each_entry_continue(entity, &rq->entities, list) {
> >  			if (drm_sched_entity_is_ready(entity)) {
> > -				rq->current_entity = entity;
> > -				reinit_completion(&entity->entity_idle);
> > +				if (dequeue) {
> > +					rq->current_entity = entity;
> > +					reinit_completion(&entity->entity_idle);
> > +				}
> >  				spin_unlock(&rq->lock);
> >  				return entity;
> >  			}
> > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >  	list_for_each_entry(entity, &rq->entities, list) {
> >  
> >  		if (drm_sched_entity_is_ready(entity)) {
> > -			rq->current_entity = entity;
> > -			reinit_completion(&entity->entity_idle);
> > +			if (dequeue) {
> > +				rq->current_entity = entity;
> > +				reinit_completion(&entity->entity_idle);
> > +			}
> >  			spin_unlock(&rq->lock);
> >  			return entity;
> >  		}
> > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >   * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> >   *
> >   * @rq: scheduler run queue to check.
> > + * @dequeue: dequeue selected entity
> >   *
> >   * Find oldest waiting ready entity, returns NULL if none found.
> >   */
> >  static struct drm_sched_entity *
> > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> >  {
> >  	struct rb_node *rb;
> >  
> > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >  
> >  		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> >  		if (drm_sched_entity_is_ready(entity)) {
> > -			rq->current_entity = entity;
> > -			reinit_completion(&entity->entity_idle);
> > +			if (dequeue) {
> > +				rq->current_entity = entity;
> > +				reinit_completion(&entity->entity_idle);
> > +			}
> >  			break;
> >  		}
> >  	}
> > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >  }
> >  
> >  /**
> > - * drm_sched_submit_queue - scheduler queue submission
> > + * drm_sched_run_job_queue - queue job submission
> >   * @sched: scheduler instance
> >   */
> > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> >  {
> >  	if (!READ_ONCE(sched->pause_submit))
> > -		queue_work(sched->submit_wq, &sched->work_submit);
> > +		queue_work(sched->submit_wq, &sched->work_run_job);
> > +}
> > +
> > +static struct drm_sched_entity *
> > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > +
> > +/**
> > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > + * @sched: scheduler instance
> > + */
> > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > +{
> > +	if (drm_sched_select_entity(sched, false))
> > +		drm_sched_run_job_queue(sched);
> > +}
> > +
> > +/**
> > + * drm_sched_free_job_queue - queue free job
> > + *
> > + * @sched: scheduler instance to queue free job
> > + */
> > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > +{
> > +	if (!READ_ONCE(sched->pause_submit))
> > +		queue_work(sched->submit_wq, &sched->work_free_job);
> > +}
> > +
> > +/**
> > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > + *
> > + * @sched: scheduler instance to queue free job
> > + */
> > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > +{
> > +	struct drm_sched_job *job;
> > +
> > +	spin_lock(&sched->job_list_lock);
> > +	job = list_first_entry_or_null(&sched->pending_list,
> > +				       struct drm_sched_job, list);
> > +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > +		drm_sched_free_job_queue(sched);
> > +	spin_unlock(&sched->job_list_lock);
> >  }
> >  
> >  /**
> > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> >  	dma_fence_get(&s_fence->finished);
> >  	drm_sched_fence_finished(s_fence, result);
> >  	dma_fence_put(&s_fence->finished);
> > -	drm_sched_submit_queue(sched);
> > +	drm_sched_free_job_queue(sched);
> >  }
> >  
> >  /**
> > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> >  void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> >  {
> >  	if (drm_sched_can_queue(sched))
> > -		drm_sched_submit_queue(sched);
> > +		drm_sched_run_job_queue(sched);
> >  }
> >  
> >  /**
> >   * drm_sched_select_entity - Select next entity to process
> >   *
> >   * @sched: scheduler instance
> > + * @dequeue: dequeue selected entity
> >   *
> >   * Returns the entity to process or NULL if none are found.
> >   */
> >  static struct drm_sched_entity *
> > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> >  {
> >  	struct drm_sched_entity *entity;
> >  	int i;
> > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> >  	/* Kernel run queue has higher priority than normal run queue*/
> >  	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >  		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > +							dequeue) :
> > +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > +						      dequeue);
> >  		if (entity)
> >  			break;
> >  	}
> > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> >  EXPORT_SYMBOL(drm_sched_pick_best);
> >  
> >  /**
> > - * drm_sched_main - main scheduler thread
> > + * drm_sched_free_job_work - worker to call free_job
> >   *
> > - * @param: scheduler instance
> > + * @w: free job work
> >   */
> > -static void drm_sched_main(struct work_struct *w)
> > +static void drm_sched_free_job_work(struct work_struct *w)
> >  {
> >  	struct drm_gpu_scheduler *sched =
> > -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > -	struct drm_sched_entity *entity;
> > +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> >  	struct drm_sched_job *cleanup_job;
> > -	int r;
> >  
> >  	if (READ_ONCE(sched->pause_submit))
> >  		return;
> >  
> >  	cleanup_job = drm_sched_get_cleanup_job(sched);
> 
> I tried this patch with Nouveau and found a race condition:
> 
> In drm_sched_run_job_work() the job is added to the pending_list via
> drm_sched_job_begin(), then the run_job() callback is called and the scheduled
> fence is signaled.
> 
> However, in parallel drm_sched_get_cleanup_job() might be called from
> drm_sched_free_job_work(), which picks the first job from the pending_list and
> for the next job on the pending_list sets the scheduled fence's timestamp field.
> 
> The job can be on the pending_list, but the scheduled fence might not yet be
> signaled. The call to actually signal the fence will subsequently fault because
> it will try to dereference the timestamp.
> 
> I'm not sure what's the best way to fix this, maybe it's enough to re-order
> signalling the scheduled fence and adding the job to the pending_list. Not sure
> if this has other implications though.
> 

We really want the job on the pending list before calling run_job.

I'm thinking we just delete the updating of the timestamp; I'm not sure
why this is useful.

Or we could do something like this where we try to update the timestamp;
if we can't update the timestamp, the run_job worker will do it anyways.

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 67e0fb6e7d18..54bd3e88f139 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1074,8 +1074,10 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
                                                typeof(*next), list);

                if (next) {
-                       next->s_fence->scheduled.timestamp =
-                               job->s_fence->finished.timestamp;
+                       if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
+                                    &next->s_fence->scheduled.flags))
+                               next->s_fence->scheduled.timestamp =
+                                       job->s_fence->finished.timestamp;
                        /* start TO timer for next job */
                        drm_sched_start_timeout(sched);
                }

I guess I'm leaning towards the latter option.

Matt

> - Danilo
> 
> > -	entity = drm_sched_select_entity(sched);
> > +	if (cleanup_job) {
> > +		sched->ops->free_job(cleanup_job);
> > +
> > +		drm_sched_free_job_queue_if_ready(sched);
> > +		drm_sched_run_job_queue_if_ready(sched);
> > +	}
> > +}
> >  
> > -	if (!entity && !cleanup_job)
> > -		return;	/* No more work */
> > +/**
> > + * drm_sched_run_job_work - worker to call run_job
> > + *
> > + * @w: run job work
> > + */
> > +static void drm_sched_run_job_work(struct work_struct *w)
> > +{
> > +	struct drm_gpu_scheduler *sched =
> > +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > +	struct drm_sched_entity *entity;
> > +	int r;
> >  
> > -	if (cleanup_job)
> > -		sched->ops->free_job(cleanup_job);
> > +	if (READ_ONCE(sched->pause_submit))
> > +		return;
> >  
> > +	entity = drm_sched_select_entity(sched, true);
> >  	if (entity) {
> >  		struct dma_fence *fence;
> >  		struct drm_sched_fence *s_fence;
> > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> >  		sched_job = drm_sched_entity_pop_job(entity);
> >  		if (!sched_job) {
> >  			complete_all(&entity->entity_idle);
> > -			if (!cleanup_job)
> > -				return;	/* No more work */
> > -			goto again;
> > +			return;	/* No more work */
> >  		}
> >  
> >  		s_fence = sched_job->s_fence;
> > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> >  		}
> >  
> >  		wake_up(&sched->job_scheduled);
> > +		drm_sched_run_job_queue_if_ready(sched);
> >  	}
> > -
> > -again:
> > -	drm_sched_submit_queue(sched);
> >  }
> >  
> >  /**
> > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >  	spin_lock_init(&sched->job_list_lock);
> >  	atomic_set(&sched->hw_rq_count, 0);
> >  	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > -	INIT_WORK(&sched->work_submit, drm_sched_main);
> > +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> >  	atomic_set(&sched->_score, 0);
> >  	atomic64_set(&sched->job_id_count, 0);
> >  	sched->pause_submit = false;
> > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> >  void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> >  {
> >  	WRITE_ONCE(sched->pause_submit, true);
> > -	cancel_work_sync(&sched->work_submit);
> > +	cancel_work_sync(&sched->work_run_job);
> > +	cancel_work_sync(&sched->work_free_job);
> >  }
> >  EXPORT_SYMBOL(drm_sched_submit_stop);
> >  
> > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> >  void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> >  {
> >  	WRITE_ONCE(sched->pause_submit, false);
> > -	queue_work(sched->submit_wq, &sched->work_submit);
> > +	queue_work(sched->submit_wq, &sched->work_run_job);
> > +	queue_work(sched->submit_wq, &sched->work_free_job);
> >  }
> >  EXPORT_SYMBOL(drm_sched_submit_start);
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 04eec2d7635f..fbc083a92757 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> >   *                 finished.
> >   * @hw_rq_count: the number of jobs currently in the hardware queue.
> >   * @job_id_count: used to assign unique id to the each job.
> > - * @submit_wq: workqueue used to queue @work_submit
> > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> >   * @timeout_wq: workqueue used to queue @work_tdr
> > - * @work_submit: schedules jobs and cleans up entities
> > + * @work_run_job: schedules jobs
> > + * @work_free_job: cleans up jobs
> >   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> >   *            timeout interval is over.
> >   * @pending_list: the list of jobs which are currently in the job queue.
> > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> >  	atomic64_t			job_id_count;
> >  	struct workqueue_struct		*submit_wq;
> >  	struct workqueue_struct		*timeout_wq;
> > -	struct work_struct		work_submit;
> > +	struct work_struct		work_run_job;
> > +	struct work_struct		work_free_job;
> >  	struct delayed_work		work_tdr;
> >  	struct list_head		pending_list;
> >  	spinlock_t			job_list_lock;
> > -- 
> > 2.34.1
> > 
> 

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-25  2:58     ` Matthew Brost
@ 2023-08-25  8:02       ` Christian König
  2023-08-25 13:36         ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Christian König @ 2023-08-25  8:02 UTC (permalink / raw)
  To: Matthew Brost, Danilo Krummrich
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, donald.robson,
	boris.brezillon, intel-xe, faith.ekstrand

Am 25.08.23 um 04:58 schrieb Matthew Brost:
> On Fri, Aug 25, 2023 at 01:04:10AM +0200, Danilo Krummrich wrote:
>> On Thu, Aug 10, 2023 at 07:31:32PM -0700, Matthew Brost wrote:
>>> Rather than call free_job and run_job in same work item have a dedicated
>>> work item for each. This aligns with the design and intended use of work
>>> queues.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>>>   include/drm/gpu_scheduler.h            |   8 +-
>>>   2 files changed, 106 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index cede47afc800..b67469eac179 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>    *
>>>    * @rq: scheduler run queue to check.
>>> + * @dequeue: dequeue selected entity
>>>    *
>>>    * Try to find a ready entity, returns NULL if none found.
>>>    */
>>>   static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>>>   {
>>>   	struct drm_sched_entity *entity;
>>>   
>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   	if (entity) {
>>>   		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>   			if (drm_sched_entity_is_ready(entity)) {
>>> -				rq->current_entity = entity;
>>> -				reinit_completion(&entity->entity_idle);
>>> +				if (dequeue) {
>>> +					rq->current_entity = entity;
>>> +					reinit_completion(&entity->entity_idle);
>>> +				}
>>>   				spin_unlock(&rq->lock);
>>>   				return entity;
>>>   			}
>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>   	list_for_each_entry(entity, &rq->entities, list) {
>>>   
>>>   		if (drm_sched_entity_is_ready(entity)) {
>>> -			rq->current_entity = entity;
>>> -			reinit_completion(&entity->entity_idle);
>>> +			if (dequeue) {
>>> +				rq->current_entity = entity;
>>> +				reinit_completion(&entity->entity_idle);
>>> +			}
>>>   			spin_unlock(&rq->lock);
>>>   			return entity;
>>>   		}
>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>    *
>>>    * @rq: scheduler run queue to check.
>>> + * @dequeue: dequeue selected entity
>>>    *
>>>    * Find oldest waiting ready entity, returns NULL if none found.
>>>    */
>>>   static struct drm_sched_entity *
>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>>>   {
>>>   	struct rb_node *rb;
>>>   
>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>   
>>>   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>   		if (drm_sched_entity_is_ready(entity)) {
>>> -			rq->current_entity = entity;
>>> -			reinit_completion(&entity->entity_idle);
>>> +			if (dequeue) {
>>> +				rq->current_entity = entity;
>>> +				reinit_completion(&entity->entity_idle);
>>> +			}
>>>   			break;
>>>   		}
>>>   	}
>>> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>   }
>>>   
>>>   /**
>>> - * drm_sched_submit_queue - scheduler queue submission
>>> + * drm_sched_run_job_queue - queue job submission
>>>    * @sched: scheduler instance
>>>    */
>>> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>   {
>>>   	if (!READ_ONCE(sched->pause_submit))
>>> -		queue_work(sched->submit_wq, &sched->work_submit);
>>> +		queue_work(sched->submit_wq, &sched->work_run_job);
>>> +}
>>> +
>>> +static struct drm_sched_entity *
>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
>>> +
>>> +/**
>>> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
>>> + * @sched: scheduler instance
>>> + */
>>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	if (drm_sched_select_entity(sched, false))
>>> +		drm_sched_run_job_queue(sched);
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_free_job_queue - queue free job
>>> + *
>>> + * @sched: scheduler instance to queue free job
>>> + */
>>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	if (!READ_ONCE(sched->pause_submit))
>>> +		queue_work(sched->submit_wq, &sched->work_free_job);
>>> +}
>>> +
>>> +/**
>>> + * drm_sched_free_job_queue_if_ready - queue free job if ready
>>> + *
>>> + * @sched: scheduler instance to queue free job
>>> + */
>>> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	struct drm_sched_job *job;
>>> +
>>> +	spin_lock(&sched->job_list_lock);
>>> +	job = list_first_entry_or_null(&sched->pending_list,
>>> +				       struct drm_sched_job, list);
>>> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>> +		drm_sched_free_job_queue(sched);
>>> +	spin_unlock(&sched->job_list_lock);
>>>   }
>>>   
>>>   /**
>>> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>   	dma_fence_get(&s_fence->finished);
>>>   	drm_sched_fence_finished(s_fence, result);
>>>   	dma_fence_put(&s_fence->finished);
>>> -	drm_sched_submit_queue(sched);
>>> +	drm_sched_free_job_queue(sched);
>>>   }
>>>   
>>>   /**
>>> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>   void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>>   {
>>>   	if (drm_sched_can_queue(sched))
>>> -		drm_sched_submit_queue(sched);
>>> +		drm_sched_run_job_queue(sched);
>>>   }
>>>   
>>>   /**
>>>    * drm_sched_select_entity - Select next entity to process
>>>    *
>>>    * @sched: scheduler instance
>>> + * @dequeue: dequeue selected entity
>>>    *
>>>    * Returns the entity to process or NULL if none are found.
>>>    */
>>>   static struct drm_sched_entity *
>>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>>>   {
>>>   	struct drm_sched_entity *entity;
>>>   	int i;
>>> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>   	/* Kernel run queue has higher priority than normal run queue*/
>>>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>   		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
>>> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
>>> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
>>> +							dequeue) :
>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
>>> +						      dequeue);
>>>   		if (entity)
>>>   			break;
>>>   	}
>>> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>>>   EXPORT_SYMBOL(drm_sched_pick_best);
>>>   
>>>   /**
>>> - * drm_sched_main - main scheduler thread
>>> + * drm_sched_free_job_work - worker to call free_job
>>>    *
>>> - * @param: scheduler instance
>>> + * @w: free job work
>>>    */
>>> -static void drm_sched_main(struct work_struct *w)
>>> +static void drm_sched_free_job_work(struct work_struct *w)
>>>   {
>>>   	struct drm_gpu_scheduler *sched =
>>> -		container_of(w, struct drm_gpu_scheduler, work_submit);
>>> -	struct drm_sched_entity *entity;
>>> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
>>>   	struct drm_sched_job *cleanup_job;
>>> -	int r;
>>>   
>>>   	if (READ_ONCE(sched->pause_submit))
>>>   		return;
>>>   
>>>   	cleanup_job = drm_sched_get_cleanup_job(sched);
>> I tried this patch with Nouveau and found a race condition:
>>
>> In drm_sched_run_job_work() the job is added to the pending_list via
>> drm_sched_job_begin(), then the run_job() callback is called and the scheduled
>> fence is signaled.
>>
>> However, in parallel drm_sched_get_cleanup_job() might be called from
>> drm_sched_free_job_work(), which picks the first job from the pending_list and
>> for the next job on the pending_list sets the scheduled fence's timestamp field.

Well why can this happen in parallel? Either the work items are 
scheduled to a single threaded work queue or you have protected the 
pending list with some locks.
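
E.g. a driver can get the single threaded behaviour with an ordered
workqueue (just a sketch with a hypothetical helper name, assuming the
driver hands this wq to drm_sched_init() as the series allows, and
destroys it again after drm_sched_fini()):

#include <linux/workqueue.h>

/*
 * An ordered workqueue executes at most one work item at a time, so
 * work_run_job and work_free_job queued on it can never run concurrently.
 */
static struct workqueue_struct *example_sched_submit_wq_create(void)
{
	return alloc_ordered_workqueue("drm-sched-submit", 0);
}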

Just moving the free_job into a separate work item without such 
precautions won't work because of quite a bunch of other reasons as well.

>>
>> The job can be on the pending_list, but the scheduled fence might not yet be
>> signaled. The call to actually signal the fence will subsequently fault because
>> it will try to dereference the timestamp.
>>
>> I'm not sure what's the best way to fix this, maybe it's enough to re-order
>> signalling the scheduled fence and adding the job to the pending_list. Not sure
>> if this has other implications though.
>>
> We really want the job on the pending list before calling run_job.
>
> I'm thinking we just delete the updating of the timestamp, not sure why
> this is useful.

This is used for calculating how long each job has spent on the hw, so
big NAK to deleting this.
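
The per-job hw time then basically falls out as (just a sketch of the
idea, not any particular driver's code):

	/* only meaningful once both fences have signaled */
	ktime_t hw_time = ktime_sub(job->s_fence->finished.timestamp,
				    job->s_fence->scheduled.timestamp);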

Regards,
Christian.

>
> Or we could do something like this where we try to update the timestamp;
> if we can't update the timestamp, the run_job worker will do it anyways.
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 67e0fb6e7d18..54bd3e88f139 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1074,8 +1074,10 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>                                                  typeof(*next), list);
>
>                  if (next) {
> -                       next->s_fence->scheduled.timestamp =
> -                               job->s_fence->finished.timestamp;
> +                       if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
> +                                    &next->s_fence->scheduled.flags))
> +                               next->s_fence->scheduled.timestamp =
> +                                       job->s_fence->finished.timestamp;
>                          /* start TO timer for next job */
>                          drm_sched_start_timeout(sched);
>                  }
>
> I guess I'm leaning towards the latter option.
>
> Matt
>
>> - Danilo
>>
>>> -	entity = drm_sched_select_entity(sched);
>>> +	if (cleanup_job) {
>>> +		sched->ops->free_job(cleanup_job);
>>> +
>>> +		drm_sched_free_job_queue_if_ready(sched);
>>> +		drm_sched_run_job_queue_if_ready(sched);
>>> +	}
>>> +}
>>>   
>>> -	if (!entity && !cleanup_job)
>>> -		return;	/* No more work */
>>> +/**
>>> + * drm_sched_run_job_work - worker to call run_job
>>> + *
>>> + * @w: run job work
>>> + */
>>> +static void drm_sched_run_job_work(struct work_struct *w)
>>> +{
>>> +	struct drm_gpu_scheduler *sched =
>>> +		container_of(w, struct drm_gpu_scheduler, work_run_job);
>>> +	struct drm_sched_entity *entity;
>>> +	int r;
>>>   
>>> -	if (cleanup_job)
>>> -		sched->ops->free_job(cleanup_job);
>>> +	if (READ_ONCE(sched->pause_submit))
>>> +		return;
>>>   
>>> +	entity = drm_sched_select_entity(sched, true);
>>>   	if (entity) {
>>>   		struct dma_fence *fence;
>>>   		struct drm_sched_fence *s_fence;
>>> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>>>   		sched_job = drm_sched_entity_pop_job(entity);
>>>   		if (!sched_job) {
>>>   			complete_all(&entity->entity_idle);
>>> -			if (!cleanup_job)
>>> -				return;	/* No more work */
>>> -			goto again;
>>> +			return;	/* No more work */
>>>   		}
>>>   
>>>   		s_fence = sched_job->s_fence;
>>> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>>>   		}
>>>   
>>>   		wake_up(&sched->job_scheduled);
>>> +		drm_sched_run_job_queue_if_ready(sched);
>>>   	}
>>> -
>>> -again:
>>> -	drm_sched_submit_queue(sched);
>>>   }
>>>   
>>>   /**
>>> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>   	spin_lock_init(&sched->job_list_lock);
>>>   	atomic_set(&sched->hw_rq_count, 0);
>>>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>> -	INIT_WORK(&sched->work_submit, drm_sched_main);
>>> +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>> +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>>   	atomic_set(&sched->_score, 0);
>>>   	atomic64_set(&sched->job_id_count, 0);
>>>   	sched->pause_submit = false;
>>> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>>>   void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>>>   {
>>>   	WRITE_ONCE(sched->pause_submit, true);
>>> -	cancel_work_sync(&sched->work_submit);
>>> +	cancel_work_sync(&sched->work_run_job);
>>> +	cancel_work_sync(&sched->work_free_job);
>>>   }
>>>   EXPORT_SYMBOL(drm_sched_submit_stop);
>>>   
>>> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>>>   void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>>>   {
>>>   	WRITE_ONCE(sched->pause_submit, false);
>>> -	queue_work(sched->submit_wq, &sched->work_submit);
>>> +	queue_work(sched->submit_wq, &sched->work_run_job);
>>> +	queue_work(sched->submit_wq, &sched->work_free_job);
>>>   }
>>>   EXPORT_SYMBOL(drm_sched_submit_start);
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index 04eec2d7635f..fbc083a92757 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>>>    *                 finished.
>>>    * @hw_rq_count: the number of jobs currently in the hardware queue.
>>>    * @job_id_count: used to assign unique id to the each job.
>>> - * @submit_wq: workqueue used to queue @work_submit
>>> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>>    * @timeout_wq: workqueue used to queue @work_tdr
>>> - * @work_submit: schedules jobs and cleans up entities
>>> + * @work_run_job: schedules jobs
>>> + * @work_free_job: cleans up jobs
>>>    * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>>>    *            timeout interval is over.
>>>    * @pending_list: the list of jobs which are currently in the job queue.
>>> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>>>   	atomic64_t			job_id_count;
>>>   	struct workqueue_struct		*submit_wq;
>>>   	struct workqueue_struct		*timeout_wq;
>>> -	struct work_struct		work_submit;
>>> +	struct work_struct		work_run_job;
>>> +	struct work_struct		work_free_job;
>>>   	struct delayed_work		work_tdr;
>>>   	struct list_head		pending_list;
>>>   	spinlock_t			job_list_lock;
>>> -- 
>>> 2.34.1
>>>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-25  8:02       ` Christian König
@ 2023-08-25 13:36         ` Matthew Brost
  2023-08-25 13:45           ` Christian König
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-25 13:36 UTC (permalink / raw)
  To: Christian König
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, Danilo Krummrich,
	donald.robson, boris.brezillon, intel-xe, faith.ekstrand

On Fri, Aug 25, 2023 at 10:02:32AM +0200, Christian König wrote:
> Am 25.08.23 um 04:58 schrieb Matthew Brost:
> > On Fri, Aug 25, 2023 at 01:04:10AM +0200, Danilo Krummrich wrote:
> > > On Thu, Aug 10, 2023 at 07:31:32PM -0700, Matthew Brost wrote:
> > > > Rather than call free_job and run_job in same work item have a dedicated
> > > > work item for each. This aligns with the design and intended use of work
> > > > queues.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >   drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > > >   include/drm/gpu_scheduler.h            |   8 +-
> > > >   2 files changed, 106 insertions(+), 39 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index cede47afc800..b67469eac179 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > > >    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > > >    *
> > > >    * @rq: scheduler run queue to check.
> > > > + * @dequeue: dequeue selected entity
> > > >    *
> > > >    * Try to find a ready entity, returns NULL if none found.
> > > >    */
> > > >   static struct drm_sched_entity *
> > > > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > > >   {
> > > >   	struct drm_sched_entity *entity;
> > > > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > >   	if (entity) {
> > > >   		list_for_each_entry_continue(entity, &rq->entities, list) {
> > > >   			if (drm_sched_entity_is_ready(entity)) {
> > > > -				rq->current_entity = entity;
> > > > -				reinit_completion(&entity->entity_idle);
> > > > +				if (dequeue) {
> > > > +					rq->current_entity = entity;
> > > > +					reinit_completion(&entity->entity_idle);
> > > > +				}
> > > >   				spin_unlock(&rq->lock);
> > > >   				return entity;
> > > >   			}
> > > > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > >   	list_for_each_entry(entity, &rq->entities, list) {
> > > >   		if (drm_sched_entity_is_ready(entity)) {
> > > > -			rq->current_entity = entity;
> > > > -			reinit_completion(&entity->entity_idle);
> > > > +			if (dequeue) {
> > > > +				rq->current_entity = entity;
> > > > +				reinit_completion(&entity->entity_idle);
> > > > +			}
> > > >   			spin_unlock(&rq->lock);
> > > >   			return entity;
> > > >   		}
> > > > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > > >    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > > >    *
> > > >    * @rq: scheduler run queue to check.
> > > > + * @dequeue: dequeue selected entity
> > > >    *
> > > >    * Find oldest waiting ready entity, returns NULL if none found.
> > > >    */
> > > >   static struct drm_sched_entity *
> > > > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > > >   {
> > > >   	struct rb_node *rb;
> > > > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > >   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > > >   		if (drm_sched_entity_is_ready(entity)) {
> > > > -			rq->current_entity = entity;
> > > > -			reinit_completion(&entity->entity_idle);
> > > > +			if (dequeue) {
> > > > +				rq->current_entity = entity;
> > > > +				reinit_completion(&entity->entity_idle);
> > > > +			}
> > > >   			break;
> > > >   		}
> > > >   	}
> > > > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > > >   }
> > > >   /**
> > > > - * drm_sched_submit_queue - scheduler queue submission
> > > > + * drm_sched_run_job_queue - queue job submission
> > > >    * @sched: scheduler instance
> > > >    */
> > > > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > > > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > > >   {
> > > >   	if (!READ_ONCE(sched->pause_submit))
> > > > -		queue_work(sched->submit_wq, &sched->work_submit);
> > > > +		queue_work(sched->submit_wq, &sched->work_run_job);
> > > > +}
> > > > +
> > > > +static struct drm_sched_entity *
> > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > > > +
> > > > +/**
> > > > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > > > + * @sched: scheduler instance
> > > > + */
> > > > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +	if (drm_sched_select_entity(sched, false))
> > > > +		drm_sched_run_job_queue(sched);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_sched_free_job_queue - queue free job
> > > > + *
> > > > + * @sched: scheduler instance to queue free job
> > > > + */
> > > > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +	if (!READ_ONCE(sched->pause_submit))
> > > > +		queue_work(sched->submit_wq, &sched->work_free_job);
> > > > +}
> > > > +
> > > > +/**
> > > > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > > > + *
> > > > + * @sched: scheduler instance to queue free job
> > > > + */
> > > > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > > > +{
> > > > +	struct drm_sched_job *job;
> > > > +
> > > > +	spin_lock(&sched->job_list_lock);
> > > > +	job = list_first_entry_or_null(&sched->pending_list,
> > > > +				       struct drm_sched_job, list);
> > > > +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > > > +		drm_sched_free_job_queue(sched);
> > > > +	spin_unlock(&sched->job_list_lock);
> > > >   }
> > > >   /**
> > > > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > > >   	dma_fence_get(&s_fence->finished);
> > > >   	drm_sched_fence_finished(s_fence, result);
> > > >   	dma_fence_put(&s_fence->finished);
> > > > -	drm_sched_submit_queue(sched);
> > > > +	drm_sched_free_job_queue(sched);
> > > >   }
> > > >   /**
> > > > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > > >   void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > > >   {
> > > >   	if (drm_sched_can_queue(sched))
> > > > -		drm_sched_submit_queue(sched);
> > > > +		drm_sched_run_job_queue(sched);
> > > >   }
> > > >   /**
> > > >    * drm_sched_select_entity - Select next entity to process
> > > >    *
> > > >    * @sched: scheduler instance
> > > > + * @dequeue: dequeue selected entity
> > > >    *
> > > >    * Returns the entity to process or NULL if none are found.
> > > >    */
> > > >   static struct drm_sched_entity *
> > > > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > > >   {
> > > >   	struct drm_sched_entity *entity;
> > > >   	int i;
> > > > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > > >   	/* Kernel run queue has higher priority than normal run queue*/
> > > >   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > >   		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > > -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > > > -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > > > +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > > > +							dequeue) :
> > > > +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > > > +						      dequeue);
> > > >   		if (entity)
> > > >   			break;
> > > >   	}
> > > > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > > >   EXPORT_SYMBOL(drm_sched_pick_best);
> > > >   /**
> > > > - * drm_sched_main - main scheduler thread
> > > > + * drm_sched_free_job_work - worker to call free_job
> > > >    *
> > > > - * @param: scheduler instance
> > > > + * @w: free job work
> > > >    */
> > > > -static void drm_sched_main(struct work_struct *w)
> > > > +static void drm_sched_free_job_work(struct work_struct *w)
> > > >   {
> > > >   	struct drm_gpu_scheduler *sched =
> > > > -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > > > -	struct drm_sched_entity *entity;
> > > > +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> > > >   	struct drm_sched_job *cleanup_job;
> > > > -	int r;
> > > >   	if (READ_ONCE(sched->pause_submit))
> > > >   		return;
> > > >   	cleanup_job = drm_sched_get_cleanup_job(sched);
> > > I tried this patch with Nouveau and found a race condition:
> > > 
> > > In drm_sched_run_job_work() the job is added to the pending_list via
> > > drm_sched_job_begin(), then the run_job() callback is called and the scheduled
> > > fence is signaled.
> > > 
> > > However, in parallel drm_sched_get_cleanup_job() might be called from
> > > drm_sched_free_job_work(), which picks the first job from the pending_list and
> > > for the next job on the pending_list sets the scheduled fence's timestamp field.
> 
> Well why can this happen in parallel? Either the work items are scheduled to
> a single threaded work queue or you have protected the pending list with
> some locks.
> 

Xe uses a single-threaded work queue, Nouveau does not (and that is the
desired behavior for Nouveau).

The list of pending jobs is protected by a lock, so that part is safe; the
race is:

run_job worker:
  add job to pending list
  run_job
  signal scheduled fence

free_job worker:
  dequeue job from pending list
  free_job
  update the next pending job's timestamp

Once a job is on the pending list its timestamp can be accessed, which can
blow up if the scheduled fence isn't signaled yet, or more specifically
unless DMA_FENCE_FLAG_TIMESTAMP_BIT is set. Logically it makes sense for
the job to be on the pending list before run_job and for the scheduled
fence to be signaled after run_job, so I think we need to live with this
race.
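
For anyone wondering why this blows up rather than just recording a bogus
value: if I remember the dma_fence layout correctly, the timestamp shares a
union with the fence callback list, roughly like below (sketch from memory,
not the verbatim upstream definition):

struct dma_fence {
	spinlock_t *lock;
	const struct dma_fence_ops *ops;
	union {
		struct list_head cb_list;	/* in use until the fence signals */
		ktime_t timestamp;		/* valid once DMA_FENCE_FLAG_TIMESTAMP_BIT is set */
		struct rcu_head rcu;		/* used on release */
	};
	u64 context;
	u64 seqno;
	unsigned long flags;
	struct kref refcount;
	int error;
};

So writing the timestamp before the fence has signaled stomps on the
callback list, and the signal path then faults walking it. That is why
checking DMA_FENCE_FLAG_TIMESTAMP_BIT first is the safe thing to do.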

> Just moving the free_job into a separate work item without such precautions
> won't work because of quite a bunch of other reasons as well.
>

Yes, free_job might not be safe to run in parallel with run_job,
depending on the driver vfuncs. This is mentioned in the cover letter.

The scheduler code itself should certainly be safe though, and I think it
will be after fixing this.

Matt

> > > 
> > > The job can be on the pending_list, but the scheduled fence might not yet be
> > > signaled. The call to actually signal the fence will subsequently fault because
> > > it will try to dereference the timestamp.
> > > 
> > > I'm not sure what's the best way to fix this, maybe it's enough to re-order
> > > signalling the scheduled fence and adding the job to the pending_list. Not sure
> > > if this has other implications though.
> > > 
> > We really want the job on the pending list before calling run_job.
> > 
> > I'm thinking we just delete the updating of the timestamp, not sure why
> > this is useful.
> 
> This is used for calculating how long each job has spend on the hw, so big
> NAK to deleting this.
>

Ah, I see that AMDGPU uses this. Previously I had only checked the
scheduler code.

The below patch should work just fine then.

Matt

> Regards,
> Christian.
> 
> > 
> > Or we could do something like this where we try to update the timestamp,
> > if we can't update the timestamp run_job worker will do it anyways.
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 67e0fb6e7d18..54bd3e88f139 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1074,8 +1074,10 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
> >                                                  typeof(*next), list);
> > 
> >                  if (next) {
> > -                       next->s_fence->scheduled.timestamp =
> > -                               job->s_fence->finished.timestamp;
> > +                       if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
> > +                                    &next->s_fence->scheduled.flags))
> > +                               next->s_fence->scheduled.timestamp =
> > +                                       job->s_fence->finished.timestamp;
> >                          /* start TO timer for next job */
> >                          drm_sched_start_timeout(sched);
> >                  }
> > 
> > I guess I'm leaning towards the latter option.
> > 
> > Matt
> > 
> > > - Danilo
> > > 
> > > > -	entity = drm_sched_select_entity(sched);
> > > > +	if (cleanup_job) {
> > > > +		sched->ops->free_job(cleanup_job);
> > > > +
> > > > +		drm_sched_free_job_queue_if_ready(sched);
> > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > > +	}
> > > > +}
> > > > -	if (!entity && !cleanup_job)
> > > > -		return;	/* No more work */
> > > > +/**
> > > > + * drm_sched_run_job_work - worker to call run_job
> > > > + *
> > > > + * @w: run job work
> > > > + */
> > > > +static void drm_sched_run_job_work(struct work_struct *w)
> > > > +{
> > > > +	struct drm_gpu_scheduler *sched =
> > > > +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > > > +	struct drm_sched_entity *entity;
> > > > +	int r;
> > > > -	if (cleanup_job)
> > > > -		sched->ops->free_job(cleanup_job);
> > > > +	if (READ_ONCE(sched->pause_submit))
> > > > +		return;
> > > > +	entity = drm_sched_select_entity(sched, true);
> > > >   	if (entity) {
> > > >   		struct dma_fence *fence;
> > > >   		struct drm_sched_fence *s_fence;
> > > > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> > > >   		sched_job = drm_sched_entity_pop_job(entity);
> > > >   		if (!sched_job) {
> > > >   			complete_all(&entity->entity_idle);
> > > > -			if (!cleanup_job)
> > > > -				return;	/* No more work */
> > > > -			goto again;
> > > > +			return;	/* No more work */
> > > >   		}
> > > >   		s_fence = sched_job->s_fence;
> > > > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> > > >   		}
> > > >   		wake_up(&sched->job_scheduled);
> > > > +		drm_sched_run_job_queue_if_ready(sched);
> > > >   	}
> > > > -
> > > > -again:
> > > > -	drm_sched_submit_queue(sched);
> > > >   }
> > > >   /**
> > > > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > > >   	spin_lock_init(&sched->job_list_lock);
> > > >   	atomic_set(&sched->hw_rq_count, 0);
> > > >   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > > > -	INIT_WORK(&sched->work_submit, drm_sched_main);
> > > > +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > > > +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > >   	atomic_set(&sched->_score, 0);
> > > >   	atomic64_set(&sched->job_id_count, 0);
> > > >   	sched->pause_submit = false;
> > > > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> > > >   void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> > > >   {
> > > >   	WRITE_ONCE(sched->pause_submit, true);
> > > > -	cancel_work_sync(&sched->work_submit);
> > > > +	cancel_work_sync(&sched->work_run_job);
> > > > +	cancel_work_sync(&sched->work_free_job);
> > > >   }
> > > >   EXPORT_SYMBOL(drm_sched_submit_stop);
> > > > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> > > >   void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> > > >   {
> > > >   	WRITE_ONCE(sched->pause_submit, false);
> > > > -	queue_work(sched->submit_wq, &sched->work_submit);
> > > > +	queue_work(sched->submit_wq, &sched->work_run_job);
> > > > +	queue_work(sched->submit_wq, &sched->work_free_job);
> > > >   }
> > > >   EXPORT_SYMBOL(drm_sched_submit_start);
> > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > > index 04eec2d7635f..fbc083a92757 100644
> > > > --- a/include/drm/gpu_scheduler.h
> > > > +++ b/include/drm/gpu_scheduler.h
> > > > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> > > >    *                 finished.
> > > >    * @hw_rq_count: the number of jobs currently in the hardware queue.
> > > >    * @job_id_count: used to assign unique id to the each job.
> > > > - * @submit_wq: workqueue used to queue @work_submit
> > > > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> > > >    * @timeout_wq: workqueue used to queue @work_tdr
> > > > - * @work_submit: schedules jobs and cleans up entities
> > > > + * @work_run_job: schedules jobs
> > > > + * @work_free_job: cleans up jobs
> > > >    * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> > > >    *            timeout interval is over.
> > > >    * @pending_list: the list of jobs which are currently in the job queue.
> > > > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> > > >   	atomic64_t			job_id_count;
> > > >   	struct workqueue_struct		*submit_wq;
> > > >   	struct workqueue_struct		*timeout_wq;
> > > > -	struct work_struct		work_submit;
> > > > +	struct work_struct		work_run_job;
> > > > +	struct work_struct		work_free_job;
> > > >   	struct delayed_work		work_tdr;
> > > >   	struct list_head		pending_list;
> > > >   	spinlock_t			job_list_lock;
> > > > -- 
> > > > 2.34.1
> > > > 
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-25 13:36         ` Matthew Brost
@ 2023-08-25 13:45           ` Christian König
  2023-09-12 10:13             ` Boris Brezillon
  2023-09-12 13:27             ` Boris Brezillon
  0 siblings, 2 replies; 80+ messages in thread
From: Christian König @ 2023-08-25 13:45 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, luben.tuikov, Danilo Krummrich,
	donald.robson, boris.brezillon, intel-xe, faith.ekstrand

Am 25.08.23 um 15:36 schrieb Matthew Brost:
> On Fri, Aug 25, 2023 at 10:02:32AM +0200, Christian König wrote:
>> Am 25.08.23 um 04:58 schrieb Matthew Brost:
>>> On Fri, Aug 25, 2023 at 01:04:10AM +0200, Danilo Krummrich wrote:
>>>> On Thu, Aug 10, 2023 at 07:31:32PM -0700, Matthew Brost wrote:
>>>>> Rather than call free_job and run_job in same work item have a dedicated
>>>>> work item for each. This aligns with the design and intended use of work
>>>>> queues.
>>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>>>>>    include/drm/gpu_scheduler.h            |   8 +-
>>>>>    2 files changed, 106 insertions(+), 39 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index cede47afc800..b67469eac179 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>>>>     * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>>>>>     *
>>>>>     * @rq: scheduler run queue to check.
>>>>> + * @dequeue: dequeue selected entity
>>>>>     *
>>>>>     * Try to find a ready entity, returns NULL if none found.
>>>>>     */
>>>>>    static struct drm_sched_entity *
>>>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>>>>>    {
>>>>>    	struct drm_sched_entity *entity;
>>>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>    	if (entity) {
>>>>>    		list_for_each_entry_continue(entity, &rq->entities, list) {
>>>>>    			if (drm_sched_entity_is_ready(entity)) {
>>>>> -				rq->current_entity = entity;
>>>>> -				reinit_completion(&entity->entity_idle);
>>>>> +				if (dequeue) {
>>>>> +					rq->current_entity = entity;
>>>>> +					reinit_completion(&entity->entity_idle);
>>>>> +				}
>>>>>    				spin_unlock(&rq->lock);
>>>>>    				return entity;
>>>>>    			}
>>>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>    	list_for_each_entry(entity, &rq->entities, list) {
>>>>>    		if (drm_sched_entity_is_ready(entity)) {
>>>>> -			rq->current_entity = entity;
>>>>> -			reinit_completion(&entity->entity_idle);
>>>>> +			if (dequeue) {
>>>>> +				rq->current_entity = entity;
>>>>> +				reinit_completion(&entity->entity_idle);
>>>>> +			}
>>>>>    			spin_unlock(&rq->lock);
>>>>>    			return entity;
>>>>>    		}
>>>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>>>>>     * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>>>>>     *
>>>>>     * @rq: scheduler run queue to check.
>>>>> + * @dequeue: dequeue selected entity
>>>>>     *
>>>>>     * Find oldest waiting ready entity, returns NULL if none found.
>>>>>     */
>>>>>    static struct drm_sched_entity *
>>>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>>>>>    {
>>>>>    	struct rb_node *rb;
>>>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>    		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>>>>>    		if (drm_sched_entity_is_ready(entity)) {
>>>>> -			rq->current_entity = entity;
>>>>> -			reinit_completion(&entity->entity_idle);
>>>>> +			if (dequeue) {
>>>>> +				rq->current_entity = entity;
>>>>> +				reinit_completion(&entity->entity_idle);
>>>>> +			}
>>>>>    			break;
>>>>>    		}
>>>>>    	}
>>>>> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>>>>>    }
>>>>>    /**
>>>>> - * drm_sched_submit_queue - scheduler queue submission
>>>>> + * drm_sched_run_job_queue - queue job submission
>>>>>     * @sched: scheduler instance
>>>>>     */
>>>>> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
>>>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>    	if (!READ_ONCE(sched->pause_submit))
>>>>> -		queue_work(sched->submit_wq, &sched->work_submit);
>>>>> +		queue_work(sched->submit_wq, &sched->work_run_job);
>>>>> +}
>>>>> +
>>>>> +static struct drm_sched_entity *
>>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
>>>>> +
>>>>> +/**
>>>>> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
>>>>> + * @sched: scheduler instance
>>>>> + */
>>>>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> +	if (drm_sched_select_entity(sched, false))
>>>>> +		drm_sched_run_job_queue(sched);
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_sched_free_job_queue - queue free job
>>>>> + *
>>>>> + * @sched: scheduler instance to queue free job
>>>>> + */
>>>>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> +	if (!READ_ONCE(sched->pause_submit))
>>>>> +		queue_work(sched->submit_wq, &sched->work_free_job);
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * drm_sched_free_job_queue_if_ready - queue free job if ready
>>>>> + *
>>>>> + * @sched: scheduler instance to queue free job
>>>>> + */
>>>>> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
>>>>> +{
>>>>> +	struct drm_sched_job *job;
>>>>> +
>>>>> +	spin_lock(&sched->job_list_lock);
>>>>> +	job = list_first_entry_or_null(&sched->pending_list,
>>>>> +				       struct drm_sched_job, list);
>>>>> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
>>>>> +		drm_sched_free_job_queue(sched);
>>>>> +	spin_unlock(&sched->job_list_lock);
>>>>>    }
>>>>>    /**
>>>>> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>>>>>    	dma_fence_get(&s_fence->finished);
>>>>>    	drm_sched_fence_finished(s_fence, result);
>>>>>    	dma_fence_put(&s_fence->finished);
>>>>> -	drm_sched_submit_queue(sched);
>>>>> +	drm_sched_free_job_queue(sched);
>>>>>    }
>>>>>    /**
>>>>> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>>>>>    void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>    	if (drm_sched_can_queue(sched))
>>>>> -		drm_sched_submit_queue(sched);
>>>>> +		drm_sched_run_job_queue(sched);
>>>>>    }
>>>>>    /**
>>>>>     * drm_sched_select_entity - Select next entity to process
>>>>>     *
>>>>>     * @sched: scheduler instance
>>>>> + * @dequeue: dequeue selected entity
>>>>>     *
>>>>>     * Returns the entity to process or NULL if none are found.
>>>>>     */
>>>>>    static struct drm_sched_entity *
>>>>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>>>>>    {
>>>>>    	struct drm_sched_entity *entity;
>>>>>    	int i;
>>>>> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>>>>    	/* Kernel run queue has higher priority than normal run queue*/
>>>>>    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>>>>    		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
>>>>> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
>>>>> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
>>>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
>>>>> +							dequeue) :
>>>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
>>>>> +						      dequeue);
>>>>>    		if (entity)
>>>>>    			break;
>>>>>    	}
>>>>> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>>>>>    EXPORT_SYMBOL(drm_sched_pick_best);
>>>>>    /**
>>>>> - * drm_sched_main - main scheduler thread
>>>>> + * drm_sched_free_job_work - worker to call free_job
>>>>>     *
>>>>> - * @param: scheduler instance
>>>>> + * @w: free job work
>>>>>     */
>>>>> -static void drm_sched_main(struct work_struct *w)
>>>>> +static void drm_sched_free_job_work(struct work_struct *w)
>>>>>    {
>>>>>    	struct drm_gpu_scheduler *sched =
>>>>> -		container_of(w, struct drm_gpu_scheduler, work_submit);
>>>>> -	struct drm_sched_entity *entity;
>>>>> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
>>>>>    	struct drm_sched_job *cleanup_job;
>>>>> -	int r;
>>>>>    	if (READ_ONCE(sched->pause_submit))
>>>>>    		return;
>>>>>    	cleanup_job = drm_sched_get_cleanup_job(sched);
>>>> I tried this patch with Nouveau and found a race condition:
>>>>
>>>> In drm_sched_run_job_work() the job is added to the pending_list via
>>>> drm_sched_job_begin(), then the run_job() callback is called and the scheduled
>>>> fence is signaled.
>>>>
>>>> However, in parallel drm_sched_get_cleanup_job() might be called from
>>>> drm_sched_free_job_work(), which picks the first job from the pending_list and
>>>> for the next job on the pending_list sets the scheduled fence' timestamp field.
>> Well why can this happen in parallel? Either the work items are scheduled to
>> a single threaded work queue or you have protected the pending list with
>> some locks.
>>
> Xe uses a single-threaded work queue, Nouveau does not (desired
> behavior).
>
> The list of pending jobs is protected by a lock (safe), the race is:
>
> add job to pending list
> run_job
> signal scheduled fence
>
> dequeue from pending list
> free_job
> update timestamp
>
> Once a job is on the pending list its timestamp can be accessed which
> can blow up if scheduled fence isn't signaled or more specifically unless
> DMA_FENCE_FLAG_TIMESTAMP_BIT is set.

Ah, that problem again. No, that is actually quite harmless.

You just need to double-check whether DMA_FENCE_FLAG_TIMESTAMP_BIT is
already set, and if it isn't, don't do anything.
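
(Which is exactly what the test_bit() hunk earlier in the thread does,
i.e. roughly:)

	if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
		     &next->s_fence->scheduled.flags))
		next->s_fence->scheduled.timestamp =
			job->s_fence->finished.timestamp;
	/* else: the scheduled fence has not signaled yet, so its timestamp
	 * slot is not valid; leave it alone. */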

Regards,
Christian.


>   Logically it makes sense for the
> job to be in the pending list before run_job and signal the scheduled
> fence after run_job so I think we need to live with this race.
>
>> Just moving the free_job into a separate work item without such precautions
>> won't work because of quite a bunch of other reasons as well.
>>
> Yes, free_job might not be safe to run in parallel with run_job
> depending on the driver vfuncs. Mention this in the cover letter.
>
> Certainly this should be safe in the scheduler code though and I think
> it will be after fixing this.
>
> Matt
>
>>>> The job can be on the pending_list, but the scheduled fence might not yet be
>>>> signaled. The call to actually signal the fence will subsequently fault because
>>>> it will try to dereference the timestamp.
>>>>
>>>> I'm not sure what's the best way to fix this, maybe it's enough to re-order
>>>> signalling the scheduled fence and adding the job to the pending_list. Not sure
>>>> if this has other implications though.
>>>>
>>> We really want the job on the pending list before calling run_job.
>>>
>>> I'm thinking we just delete the updating of the timestamp, not sure why
>>> this is useful.
>> This is used for calculating how long each job has spend on the hw, so big
>> NAK to deleting this.
>>
> Ah, I see that AMDGPU uses this. Previously just checked the scheduler
> code.
>
> The below patch should work just fine then.
>
> Matt
>
>> Regards,
>> Christian.
>>
>>> Or we could do something like this where we try to update the timestamp,
>>> if we can't update the timestamp run_job worker will do it anyways.
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 67e0fb6e7d18..54bd3e88f139 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1074,8 +1074,10 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>>>                                                   typeof(*next), list);
>>>
>>>                   if (next) {
>>> -                       next->s_fence->scheduled.timestamp =
>>> -                               job->s_fence->finished.timestamp;
>>> +                       if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>> +                                    &next->s_fence->scheduled.flags))
>>> +                               next->s_fence->scheduled.timestamp =
>>> +                                       job->s_fence->finished.timestamp;
>>>                           /* start TO timer for next job */
>>>                           drm_sched_start_timeout(sched);
>>>                   }
>>>
>>> I guess I'm leaning towards the latter option.
>>>
>>> Matt
>>>
>>>> - Danilo
>>>>
>>>>> -	entity = drm_sched_select_entity(sched);
>>>>> +	if (cleanup_job) {
>>>>> +		sched->ops->free_job(cleanup_job);
>>>>> +
>>>>> +		drm_sched_free_job_queue_if_ready(sched);
>>>>> +		drm_sched_run_job_queue_if_ready(sched);
>>>>> +	}
>>>>> +}
>>>>> -	if (!entity && !cleanup_job)
>>>>> -		return;	/* No more work */
>>>>> +/**
>>>>> + * drm_sched_run_job_work - worker to call run_job
>>>>> + *
>>>>> + * @w: run job work
>>>>> + */
>>>>> +static void drm_sched_run_job_work(struct work_struct *w)
>>>>> +{
>>>>> +	struct drm_gpu_scheduler *sched =
>>>>> +		container_of(w, struct drm_gpu_scheduler, work_run_job);
>>>>> +	struct drm_sched_entity *entity;
>>>>> +	int r;
>>>>> -	if (cleanup_job)
>>>>> -		sched->ops->free_job(cleanup_job);
>>>>> +	if (READ_ONCE(sched->pause_submit))
>>>>> +		return;
>>>>> +	entity = drm_sched_select_entity(sched, true);
>>>>>    	if (entity) {
>>>>>    		struct dma_fence *fence;
>>>>>    		struct drm_sched_fence *s_fence;
>>>>> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>>>>>    		sched_job = drm_sched_entity_pop_job(entity);
>>>>>    		if (!sched_job) {
>>>>>    			complete_all(&entity->entity_idle);
>>>>> -			if (!cleanup_job)
>>>>> -				return;	/* No more work */
>>>>> -			goto again;
>>>>> +			return;	/* No more work */
>>>>>    		}
>>>>>    		s_fence = sched_job->s_fence;
>>>>> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>>>>>    		}
>>>>>    		wake_up(&sched->job_scheduled);
>>>>> +		drm_sched_run_job_queue_if_ready(sched);
>>>>>    	}
>>>>> -
>>>>> -again:
>>>>> -	drm_sched_submit_queue(sched);
>>>>>    }
>>>>>    /**
>>>>> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>>>>    	spin_lock_init(&sched->job_list_lock);
>>>>>    	atomic_set(&sched->hw_rq_count, 0);
>>>>>    	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
>>>>> -	INIT_WORK(&sched->work_submit, drm_sched_main);
>>>>> +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
>>>>> +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>>>>>    	atomic_set(&sched->_score, 0);
>>>>>    	atomic64_set(&sched->job_id_count, 0);
>>>>>    	sched->pause_submit = false;
>>>>> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>>>>>    void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>    	WRITE_ONCE(sched->pause_submit, true);
>>>>> -	cancel_work_sync(&sched->work_submit);
>>>>> +	cancel_work_sync(&sched->work_run_job);
>>>>> +	cancel_work_sync(&sched->work_free_job);
>>>>>    }
>>>>>    EXPORT_SYMBOL(drm_sched_submit_stop);
>>>>> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>>>>>    void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>>>>>    {
>>>>>    	WRITE_ONCE(sched->pause_submit, false);
>>>>> -	queue_work(sched->submit_wq, &sched->work_submit);
>>>>> +	queue_work(sched->submit_wq, &sched->work_run_job);
>>>>> +	queue_work(sched->submit_wq, &sched->work_free_job);
>>>>>    }
>>>>>    EXPORT_SYMBOL(drm_sched_submit_start);
>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>>>> index 04eec2d7635f..fbc083a92757 100644
>>>>> --- a/include/drm/gpu_scheduler.h
>>>>> +++ b/include/drm/gpu_scheduler.h
>>>>> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>>>>>     *                 finished.
>>>>>     * @hw_rq_count: the number of jobs currently in the hardware queue.
>>>>>     * @job_id_count: used to assign unique id to the each job.
>>>>> - * @submit_wq: workqueue used to queue @work_submit
>>>>> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>>>>>     * @timeout_wq: workqueue used to queue @work_tdr
>>>>> - * @work_submit: schedules jobs and cleans up entities
>>>>> + * @work_run_job: schedules jobs
>>>>> + * @work_free_job: cleans up jobs
>>>>>     * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>>>>>     *            timeout interval is over.
>>>>>     * @pending_list: the list of jobs which are currently in the job queue.
>>>>> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>>>>>    	atomic64_t			job_id_count;
>>>>>    	struct workqueue_struct		*submit_wq;
>>>>>    	struct workqueue_struct		*timeout_wq;
>>>>> -	struct work_struct		work_submit;
>>>>> +	struct work_struct		work_run_job;
>>>>> +	struct work_struct		work_free_job;
>>>>>    	struct delayed_work		work_tdr;
>>>>>    	struct list_head		pending_list;
>>>>>    	spinlock_t			job_list_lock;
>>>>> -- 
>>>>> 2.34.1
>>>>>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-11  2:31 ` [PATCH v2 4/9] drm/sched: Split free_job into own work item Matthew Brost
  2023-08-17 13:39   ` Christian König
  2023-08-24 23:04   ` Danilo Krummrich
@ 2023-08-28 18:04   ` Danilo Krummrich
  2023-08-28 18:41     ` Matthew Brost
  2 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-28 18:04 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

On 8/11/23 04:31, Matthew Brost wrote:
> Rather than call free_job and run_job in same work item have a dedicated
> work item for each. This aligns with the design and intended use of work
> queues.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>   include/drm/gpu_scheduler.h            |   8 +-
>   2 files changed, 106 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index cede47afc800..b67469eac179 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
>    *
>    * @rq: scheduler run queue to check.
> + * @dequeue: dequeue selected entity
>    *
>    * Try to find a ready entity, returns NULL if none found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
>   {
>   	struct drm_sched_entity *entity;
>   
> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   	if (entity) {
>   		list_for_each_entry_continue(entity, &rq->entities, list) {
>   			if (drm_sched_entity_is_ready(entity)) {
> -				rq->current_entity = entity;
> -				reinit_completion(&entity->entity_idle);
> +				if (dequeue) {
> +					rq->current_entity = entity;
> +					reinit_completion(&entity->entity_idle);
> +				}
>   				spin_unlock(&rq->lock);
>   				return entity;
>   			}
> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>   	list_for_each_entry(entity, &rq->entities, list) {
>   
>   		if (drm_sched_entity_is_ready(entity)) {
> -			rq->current_entity = entity;
> -			reinit_completion(&entity->entity_idle);
> +			if (dequeue) {
> +				rq->current_entity = entity;
> +				reinit_completion(&entity->entity_idle);
> +			}
>   			spin_unlock(&rq->lock);
>   			return entity;
>   		}
> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
>    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
>    *
>    * @rq: scheduler run queue to check.
> + * @dequeue: dequeue selected entity
>    *
>    * Find oldest waiting ready entity, returns NULL if none found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
>   {
>   	struct rb_node *rb;
>   
> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>   
>   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
>   		if (drm_sched_entity_is_ready(entity)) {
> -			rq->current_entity = entity;
> -			reinit_completion(&entity->entity_idle);
> +			if (dequeue) {
> +				rq->current_entity = entity;
> +				reinit_completion(&entity->entity_idle);
> +			}
>   			break;
>   		}
>   	}
> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
>   }
>   
>   /**
> - * drm_sched_submit_queue - scheduler queue submission
> + * drm_sched_run_job_queue - queue job submission
>    * @sched: scheduler instance
>    */
> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
>   {
>   	if (!READ_ONCE(sched->pause_submit))
> -		queue_work(sched->submit_wq, &sched->work_submit);
> +		queue_work(sched->submit_wq, &sched->work_run_job);
> +}
> +
> +static struct drm_sched_entity *
> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> +
> +/**
> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> + * @sched: scheduler instance
> + */
> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> +	if (drm_sched_select_entity(sched, false))
> +		drm_sched_run_job_queue(sched);
> +}
> +
> +/**
> + * drm_sched_free_job_queue - queue free job
> + *
> + * @sched: scheduler instance to queue free job
> + */
> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> +{
> +	if (!READ_ONCE(sched->pause_submit))
> +		queue_work(sched->submit_wq, &sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_free_job_queue_if_ready - queue free job if ready
> + *
> + * @sched: scheduler instance to queue free job
> + */
> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> +{
> +	struct drm_sched_job *job;
> +
> +	spin_lock(&sched->job_list_lock);
> +	job = list_first_entry_or_null(&sched->pending_list,
> +				       struct drm_sched_job, list);
> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> +		drm_sched_free_job_queue(sched);
> +	spin_unlock(&sched->job_list_lock);
>   }
>   
>   /**
> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
>   	dma_fence_get(&s_fence->finished);
>   	drm_sched_fence_finished(s_fence, result);
>   	dma_fence_put(&s_fence->finished);
> -	drm_sched_submit_queue(sched);
> +	drm_sched_free_job_queue(sched);
>   }
>   
>   /**
> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
>   void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
>   {
>   	if (drm_sched_can_queue(sched))
> -		drm_sched_submit_queue(sched);
> +		drm_sched_run_job_queue(sched);
>   }
>   
>   /**
>    * drm_sched_select_entity - Select next entity to process
>    *
>    * @sched: scheduler instance
> + * @dequeue: dequeue selected entity
>    *
>    * Returns the entity to process or NULL if none are found.
>    */
>   static struct drm_sched_entity *
> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>   {
>   	struct drm_sched_entity *entity;
>   	int i;
> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>   	/* Kernel run queue has higher priority than normal run queue*/
>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>   		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> +							dequeue) :
> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> +						      dequeue);
>   		if (entity)
>   			break;
>   	}
> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
>   EXPORT_SYMBOL(drm_sched_pick_best);
>   
>   /**
> - * drm_sched_main - main scheduler thread
> + * drm_sched_free_job_work - worker to call free_job
>    *
> - * @param: scheduler instance
> + * @w: free job work
>    */
> -static void drm_sched_main(struct work_struct *w)
> +static void drm_sched_free_job_work(struct work_struct *w)
>   {
>   	struct drm_gpu_scheduler *sched =
> -		container_of(w, struct drm_gpu_scheduler, work_submit);
> -	struct drm_sched_entity *entity;
> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
>   	struct drm_sched_job *cleanup_job;
> -	int r;
>   
>   	if (READ_ONCE(sched->pause_submit))
>   		return;
>   
>   	cleanup_job = drm_sched_get_cleanup_job(sched);
> -	entity = drm_sched_select_entity(sched);
> +	if (cleanup_job) {
> +		sched->ops->free_job(cleanup_job);
> +
> +		drm_sched_free_job_queue_if_ready(sched);
> +		drm_sched_run_job_queue_if_ready(sched);
> +	}
> +}
>   
> -	if (!entity && !cleanup_job)
> -		return;	/* No more work */
> +/**
> + * drm_sched_run_job_work - worker to call run_job
> + *
> + * @w: run job work
> + */
> +static void drm_sched_run_job_work(struct work_struct *w)
> +{
> +	struct drm_gpu_scheduler *sched =
> +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> +	struct drm_sched_entity *entity;
> +	int r;
>   
> -	if (cleanup_job)
> -		sched->ops->free_job(cleanup_job);
> +	if (READ_ONCE(sched->pause_submit))
> +		return;
>   
> +	entity = drm_sched_select_entity(sched, true);
>   	if (entity) {
>   		struct dma_fence *fence;
>   		struct drm_sched_fence *s_fence;
> @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
>   		sched_job = drm_sched_entity_pop_job(entity);
>   		if (!sched_job) {
>   			complete_all(&entity->entity_idle);
> -			if (!cleanup_job)
> -				return;	/* No more work */
> -			goto again;
> +			return;	/* No more work */
>   		}
>   
>   		s_fence = sched_job->s_fence;
> @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
>   		}
>   
>   		wake_up(&sched->job_scheduled);
> +		drm_sched_run_job_queue_if_ready(sched);
>   	}
> -
> -again:
> -	drm_sched_submit_queue(sched);
>   }
>   
>   /**
> @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   	spin_lock_init(&sched->job_list_lock);
>   	atomic_set(&sched->hw_rq_count, 0);
>   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> -	INIT_WORK(&sched->work_submit, drm_sched_main);
> +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
>   	atomic_set(&sched->_score, 0);
>   	atomic64_set(&sched->job_id_count, 0);
>   	sched->pause_submit = false;
> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>   void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)

I was wondering what the scheduler teardown sequence looks like for
DRM_SCHED_POLICY_SINGLE_ENTITY and how XE does that.

In Nouveau, userspace can ask the kernel to create a channel (or multiple),
where each channel represents a ring feeding the firmware scheduler. Userspace
can forcefully close channels via either a dedicated IOCTL or by just closing
the FD which subsequently closes all channels opened through this FD.

When this happens the scheduler needs to be torn down. Without keeping track of
things in a driver-specific way, the only thing I could really come up with is the
following.

/* Make sure no more jobs are fetched from the entity. */
drm_sched_submit_stop();

/* Wait for the channel to be idle, namely jobs in flight to complete. */
nouveau_channel_idle();

/* Stop the scheduler to free jobs from the pending_list. The ring must be idle
 * at this point, otherwise we might leak jobs. Feels more like a workaround to
 * free finished jobs.
 */
drm_sched_stop();

/* Free jobs from the entity queue. */
drm_sched_entity_fini();

/* Probably not even needed in this case. */
drm_sched_fini();

This doesn't look very straightforward though. I wonder if other drivers feeding
firmware schedulers have similar cases. Maybe something like drm_sched_teardown(),
which would stop job submission, wait for pending jobs to finish and subsequently
free them up, would make sense?
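
Something along these lines, maybe (a rough sketch only; it assumes the driver
has already waited for in-flight jobs, and the name and the exact signalling /
locking details are made up):

static void drm_sched_teardown(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job, *tmp;
	LIST_HEAD(free_list);

	/* Stop both workers so nothing new is dequeued or freed. */
	drm_sched_submit_stop(sched);

	/* Detach whatever is still on the pending_list. */
	spin_lock(&sched->job_list_lock);
	list_splice_init(&sched->pending_list, &free_list);
	spin_unlock(&sched->job_list_lock);

	/* Signal the leftover jobs with -ECANCELED and free them. */
	list_for_each_entry_safe(job, tmp, &free_list, list) {
		list_del_init(&job->list);
		drm_sched_fence_finished(job->s_fence, -ECANCELED);
		sched->ops->free_job(job);
	}

	/* Jobs still queued in the entity would be handled by
	 * drm_sched_entity_fini(). */
	drm_sched_fini(sched);
}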

- Danilo

>   {
>   	WRITE_ONCE(sched->pause_submit, true);
> -	cancel_work_sync(&sched->work_submit);
> +	cancel_work_sync(&sched->work_run_job);
> +	cancel_work_sync(&sched->work_free_job);
>   }
>   EXPORT_SYMBOL(drm_sched_submit_stop);
>   
> @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
>   void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
>   {
>   	WRITE_ONCE(sched->pause_submit, false);
> -	queue_work(sched->submit_wq, &sched->work_submit);
> +	queue_work(sched->submit_wq, &sched->work_run_job);
> +	queue_work(sched->submit_wq, &sched->work_free_job);
>   }
>   EXPORT_SYMBOL(drm_sched_submit_start);
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 04eec2d7635f..fbc083a92757 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
>    *                 finished.
>    * @hw_rq_count: the number of jobs currently in the hardware queue.
>    * @job_id_count: used to assign unique id to the each job.
> - * @submit_wq: workqueue used to queue @work_submit
> + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
>    * @timeout_wq: workqueue used to queue @work_tdr
> - * @work_submit: schedules jobs and cleans up entities
> + * @work_run_job: schedules jobs
> + * @work_free_job: cleans up jobs
>    * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>    *            timeout interval is over.
>    * @pending_list: the list of jobs which are currently in the job queue.
> @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
>   	atomic64_t			job_id_count;
>   	struct workqueue_struct		*submit_wq;
>   	struct workqueue_struct		*timeout_wq;
> -	struct work_struct		work_submit;
> +	struct work_struct		work_run_job;
> +	struct work_struct		work_free_job;
>   	struct delayed_work		work_tdr;
>   	struct list_head		pending_list;
>   	spinlock_t			job_list_lock;


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-28 18:04   ` Danilo Krummrich
@ 2023-08-28 18:41     ` Matthew Brost
  2023-08-29  1:20       ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Matthew Brost @ 2023-08-28 18:41 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

On Mon, Aug 28, 2023 at 08:04:31PM +0200, Danilo Krummrich wrote:
> On 8/11/23 04:31, Matthew Brost wrote:
> > Rather than call free_job and run_job in same work item have a dedicated
> > work item for each. This aligns with the design and intended use of work
> > queues.
> > 
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> >   include/drm/gpu_scheduler.h            |   8 +-
> >   2 files changed, 106 insertions(+), 39 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index cede47afc800..b67469eac179 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> >    * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> >    *
> >    * @rq: scheduler run queue to check.
> > + * @dequeue: dequeue selected entity
> >    *
> >    * Try to find a ready entity, returns NULL if none found.
> >    */
> >   static struct drm_sched_entity *
> > -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> >   {
> >   	struct drm_sched_entity *entity;
> > @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >   	if (entity) {
> >   		list_for_each_entry_continue(entity, &rq->entities, list) {
> >   			if (drm_sched_entity_is_ready(entity)) {
> > -				rq->current_entity = entity;
> > -				reinit_completion(&entity->entity_idle);
> > +				if (dequeue) {
> > +					rq->current_entity = entity;
> > +					reinit_completion(&entity->entity_idle);
> > +				}
> >   				spin_unlock(&rq->lock);
> >   				return entity;
> >   			}
> > @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >   	list_for_each_entry(entity, &rq->entities, list) {
> >   		if (drm_sched_entity_is_ready(entity)) {
> > -			rq->current_entity = entity;
> > -			reinit_completion(&entity->entity_idle);
> > +			if (dequeue) {
> > +				rq->current_entity = entity;
> > +				reinit_completion(&entity->entity_idle);
> > +			}
> >   			spin_unlock(&rq->lock);
> >   			return entity;
> >   		}
> > @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >    * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> >    *
> >    * @rq: scheduler run queue to check.
> > + * @dequeue: dequeue selected entity
> >    *
> >    * Find oldest waiting ready entity, returns NULL if none found.
> >    */
> >   static struct drm_sched_entity *
> > -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> >   {
> >   	struct rb_node *rb;
> > @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >   		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> >   		if (drm_sched_entity_is_ready(entity)) {
> > -			rq->current_entity = entity;
> > -			reinit_completion(&entity->entity_idle);
> > +			if (dequeue) {
> > +				rq->current_entity = entity;
> > +				reinit_completion(&entity->entity_idle);
> > +			}
> >   			break;
> >   		}
> >   	}
> > @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >   }
> >   /**
> > - * drm_sched_submit_queue - scheduler queue submission
> > + * drm_sched_run_job_queue - queue job submission
> >    * @sched: scheduler instance
> >    */
> > -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> >   {
> >   	if (!READ_ONCE(sched->pause_submit))
> > -		queue_work(sched->submit_wq, &sched->work_submit);
> > +		queue_work(sched->submit_wq, &sched->work_run_job);
> > +}
> > +
> > +static struct drm_sched_entity *
> > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > +
> > +/**
> > + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > + * @sched: scheduler instance
> > + */
> > +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > +{
> > +	if (drm_sched_select_entity(sched, false))
> > +		drm_sched_run_job_queue(sched);
> > +}
> > +
> > +/**
> > + * drm_sched_free_job_queue - queue free job
> > + *
> > + * @sched: scheduler instance to queue free job
> > + */
> > +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > +{
> > +	if (!READ_ONCE(sched->pause_submit))
> > +		queue_work(sched->submit_wq, &sched->work_free_job);
> > +}
> > +
> > +/**
> > + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > + *
> > + * @sched: scheduler instance to queue free job
> > + */
> > +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > +{
> > +	struct drm_sched_job *job;
> > +
> > +	spin_lock(&sched->job_list_lock);
> > +	job = list_first_entry_or_null(&sched->pending_list,
> > +				       struct drm_sched_job, list);
> > +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > +		drm_sched_free_job_queue(sched);
> > +	spin_unlock(&sched->job_list_lock);
> >   }
> >   /**
> > @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> >   	dma_fence_get(&s_fence->finished);
> >   	drm_sched_fence_finished(s_fence, result);
> >   	dma_fence_put(&s_fence->finished);
> > -	drm_sched_submit_queue(sched);
> > +	drm_sched_free_job_queue(sched);
> >   }
> >   /**
> > @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> >   void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> >   {
> >   	if (drm_sched_can_queue(sched))
> > -		drm_sched_submit_queue(sched);
> > +		drm_sched_run_job_queue(sched);
> >   }
> >   /**
> >    * drm_sched_select_entity - Select next entity to process
> >    *
> >    * @sched: scheduler instance
> > + * @dequeue: dequeue selected entity
> >    *
> >    * Returns the entity to process or NULL if none are found.
> >    */
> >   static struct drm_sched_entity *
> > -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> >   {
> >   	struct drm_sched_entity *entity;
> >   	int i;
> > @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> >   	/* Kernel run queue has higher priority than normal run queue*/
> >   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >   		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > +							dequeue) :
> > +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > +						      dequeue);
> >   		if (entity)
> >   			break;
> >   	}
> > @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> >   EXPORT_SYMBOL(drm_sched_pick_best);
> >   /**
> > - * drm_sched_main - main scheduler thread
> > + * drm_sched_free_job_work - worker to call free_job
> >    *
> > - * @param: scheduler instance
> > + * @w: free job work
> >    */
> > -static void drm_sched_main(struct work_struct *w)
> > +static void drm_sched_free_job_work(struct work_struct *w)
> >   {
> >   	struct drm_gpu_scheduler *sched =
> > -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > -	struct drm_sched_entity *entity;
> > +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> >   	struct drm_sched_job *cleanup_job;
> > -	int r;
> >   	if (READ_ONCE(sched->pause_submit))
> >   		return;
> >   	cleanup_job = drm_sched_get_cleanup_job(sched);
> > -	entity = drm_sched_select_entity(sched);
> > +	if (cleanup_job) {
> > +		sched->ops->free_job(cleanup_job);
> > +
> > +		drm_sched_free_job_queue_if_ready(sched);
> > +		drm_sched_run_job_queue_if_ready(sched);
> > +	}
> > +}
> > -	if (!entity && !cleanup_job)
> > -		return;	/* No more work */
> > +/**
> > + * drm_sched_run_job_work - worker to call run_job
> > + *
> > + * @w: run job work
> > + */
> > +static void drm_sched_run_job_work(struct work_struct *w)
> > +{
> > +	struct drm_gpu_scheduler *sched =
> > +		container_of(w, struct drm_gpu_scheduler, work_run_job);
> > +	struct drm_sched_entity *entity;
> > +	int r;
> > -	if (cleanup_job)
> > -		sched->ops->free_job(cleanup_job);
> > +	if (READ_ONCE(sched->pause_submit))
> > +		return;
> > +	entity = drm_sched_select_entity(sched, true);
> >   	if (entity) {
> >   		struct dma_fence *fence;
> >   		struct drm_sched_fence *s_fence;
> > @@ -1056,9 +1122,7 @@ static void drm_sched_main(struct work_struct *w)
> >   		sched_job = drm_sched_entity_pop_job(entity);
> >   		if (!sched_job) {
> >   			complete_all(&entity->entity_idle);
> > -			if (!cleanup_job)
> > -				return;	/* No more work */
> > -			goto again;
> > +			return;	/* No more work */
> >   		}
> >   		s_fence = sched_job->s_fence;
> > @@ -1088,10 +1152,8 @@ static void drm_sched_main(struct work_struct *w)
> >   		}
> >   		wake_up(&sched->job_scheduled);
> > +		drm_sched_run_job_queue_if_ready(sched);
> >   	}
> > -
> > -again:
> > -	drm_sched_submit_queue(sched);
> >   }
> >   /**
> > @@ -1150,7 +1212,8 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> >   	spin_lock_init(&sched->job_list_lock);
> >   	atomic_set(&sched->hw_rq_count, 0);
> >   	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > -	INIT_WORK(&sched->work_submit, drm_sched_main);
> > +	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > +	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> >   	atomic_set(&sched->_score, 0);
> >   	atomic64_set(&sched->job_id_count, 0);
> >   	sched->pause_submit = false;
> > @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
> >   void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
> 
> I was wondering what the scheduler teardown sequence looks like for
> DRM_SCHED_POLICY_SINGLE_ENTITY and how XE does that.
> 
> In Nouveau, userspace can ask the kernel to create a channel (or multiple),
> where each channel represents a ring feeding the firmware scheduler. Userspace
> can forcefully close channels via either a dedicated IOCTL or by just closing
> the FD which subsequently closes all channels opened through this FD.
> 
> When this happens the scheduler needs to be teared down. Without keeping track of
> things in a driver specific way, the only thing I could really come up with is the
> following.
> 
> /* Make sure no more jobs are fetched from the entity. */
> drm_sched_submit_stop();
> 
> /* Wait for the channel to be idle, namely jobs in flight to complete. */
> nouveau_channel_idle();
> 
> /* Stop the scheduler to free jobs from the pending_list. Ring must be idle at this
>  * point, otherwise me might leak jobs. Feels more like a workaround to free
>  * finished jobs.
>  */
> drm_sched_stop();
> 
> /* Free jobs from the entity queue. */
> drm_sched_entity_fini();
> 
> /* Probably not even needed in this case. */
> drm_sched_fini();
> 
> This doesn't look very straightforward though. I wonder if other drivers feeding
> firmware schedulers have similar cases. Maybe something like drm_sched_teardown(),
> which would stop job submission, wait for pending jobs to finish and subsequently
> free them up would makes sense?
> 

exec queue == gpu scheduler + entity in Xe

We kinda invented our own flow: reference counting plus using the TDR for
cleanup.

We have a creation ref for the exec queue, plus each job takes a ref to
the exec queue. On exec queue close [1][2] (whether that be IOCTL or FD
close) we drop the creation reference and call a vfunc for killing the
exec queue. The firmware implementation is here [3].

If you read through it, you'll see the kill vfunc just sets the TDR to the
minimum value [4]. The TDR then kicks any running jobs off the hardware and
signals the jobs' fences; any jobs still waiting on dependencies eventually
flush out via run_job + the TDR for cleanup without ever going onto the
hardware. Once all jobs are flushed out the exec queue reference count goes
to zero, we trigger the exec queue cleanup flow and finally free all memory
for the exec queue.

Using the TDR in this way is how we tear down an exec queue for other
reasons too (user page fault, user job times out, user job hang detected
by firmware, device reset, etc.).

This all works rather nicely and is a single code path for all of these
cases. I'm not sure if this can be made any more generic, nor do I really
see the need to (at least I don't see Xe needing a generic solution).
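
In pseudo-code the lifecycle is roughly the following (illustrative only;
the names are hypothetical and this is not the actual Xe code):

/* Illustrative sketch of the refcount + TDR cleanup flow described above. */
struct exec_queue {
	struct kref refcount;			/* creation ref + one ref per job */
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
};

static void exec_queue_release(struct kref *ref)
{
	struct exec_queue *q = container_of(ref, struct exec_queue, refcount);

	drm_sched_entity_fini(&q->entity);
	drm_sched_fini(&q->sched);
	kfree(q);
}

static void exec_queue_put(struct exec_queue *q)
{
	kref_put(&q->refcount, exec_queue_release);
}

static void exec_queue_kill(struct exec_queue *q)
{
	/* Set the TDR to the minimum value so it fires right away; it kicks
	 * running jobs off the hardware, signals their fences and lets queued
	 * jobs flush out without ever hitting the hardware. */
	mod_delayed_work(q->sched.timeout_wq, &q->sched.work_tdr, 0);
}

/* IOCTL or FD close path. */
static void exec_queue_close(struct exec_queue *q)
{
	exec_queue_kill(q);
	exec_queue_put(q);			/* drop the creation reference */
}

/* free_job() callback: dropping the last job reference triggers full cleanup. */
static void exec_queue_free_job(struct drm_sched_job *job)
{
	struct exec_queue *q = container_of(job->sched, struct exec_queue, sched);

	drm_sched_job_cleanup(job);
	exec_queue_put(q);
}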

Hope this helps,
Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_exec_queue.c#L911
[2] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_device.c#L77
[3] https://gitlab.freedesktop.org/drm/xe/kernel/-/tree/drm-xe-next/drivers/gpu/drm/xe#L1184
[4] https://gitlab.freedesktop.org/drm/xe/kernel/-/tree/drm-xe-next/drivers/gpu/drm/xe#L789

> - Danilo
> 
> >   {
> >   	WRITE_ONCE(sched->pause_submit, true);
> > -	cancel_work_sync(&sched->work_submit);
> > +	cancel_work_sync(&sched->work_run_job);
> > +	cancel_work_sync(&sched->work_free_job);
> >   }
> >   EXPORT_SYMBOL(drm_sched_submit_stop);
> > @@ -1287,6 +1351,7 @@ EXPORT_SYMBOL(drm_sched_submit_stop);
> >   void drm_sched_submit_start(struct drm_gpu_scheduler *sched)
> >   {
> >   	WRITE_ONCE(sched->pause_submit, false);
> > -	queue_work(sched->submit_wq, &sched->work_submit);
> > +	queue_work(sched->submit_wq, &sched->work_run_job);
> > +	queue_work(sched->submit_wq, &sched->work_free_job);
> >   }
> >   EXPORT_SYMBOL(drm_sched_submit_start);
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 04eec2d7635f..fbc083a92757 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -487,9 +487,10 @@ struct drm_sched_backend_ops {
> >    *                 finished.
> >    * @hw_rq_count: the number of jobs currently in the hardware queue.
> >    * @job_id_count: used to assign unique id to the each job.
> > - * @submit_wq: workqueue used to queue @work_submit
> > + * @submit_wq: workqueue used to queue @work_run_job and @work_free_job
> >    * @timeout_wq: workqueue used to queue @work_tdr
> > - * @work_submit: schedules jobs and cleans up entities
> > + * @work_run_job: schedules jobs
> > + * @work_free_job: cleans up jobs
> >    * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
> >    *            timeout interval is over.
> >    * @pending_list: the list of jobs which are currently in the job queue.
> > @@ -518,7 +519,8 @@ struct drm_gpu_scheduler {
> >   	atomic64_t			job_id_count;
> >   	struct workqueue_struct		*submit_wq;
> >   	struct workqueue_struct		*timeout_wq;
> > -	struct work_struct		work_submit;
> > +	struct work_struct		work_run_job;
> > +	struct work_struct		work_free_job;
> >   	struct delayed_work		work_tdr;
> >   	struct list_head		pending_list;
> >   	spinlock_t			job_list_lock;
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-28 18:41     ` Matthew Brost
@ 2023-08-29  1:20       ` Danilo Krummrich
  0 siblings, 0 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-29  1:20 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

On 8/28/23 20:41, Matthew Brost wrote:
> On Mon, Aug 28, 2023 at 08:04:31PM +0200, Danilo Krummrich wrote:
>> On 8/11/23 04:31, Matthew Brost wrote:
>>> Rather than call free_job and run_job in same work item have a dedicated
>>> work item for each. This aligns with the design and intended use of work
>>> queues.
>>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>    drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
>>>    include/drm/gpu_scheduler.h            |   8 +-
>>>    2 files changed, 106 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index cede47afc800..b67469eac179 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1275,7 +1338,8 @@ EXPORT_SYMBOL(drm_sched_submit_ready);
>>>    void drm_sched_submit_stop(struct drm_gpu_scheduler *sched)
>>
>> I was wondering what the scheduler teardown sequence looks like for
>> DRM_SCHED_POLICY_SINGLE_ENTITY and how XE does that.
>>
>> In Nouveau, userspace can ask the kernel to create a channel (or multiple),
>> where each channel represents a ring feeding the firmware scheduler. Userspace
>> can forcefully close channels via either a dedicated IOCTL or by just closing
>> the FD which subsequently closes all channels opened through this FD.
>>
>> When this happens the scheduler needs to be torn down. Without keeping track of
>> things in a driver-specific way, the only thing I could really come up with is the
>> following.
>>
>> /* Make sure no more jobs are fetched from the entity. */
>> drm_sched_submit_stop();
>>
>> /* Wait for the channel to be idle, namely jobs in flight to complete. */
>> nouveau_channel_idle();
>>
>> /* Stop the scheduler to free jobs from the pending_list. Ring must be idle at this
>>  * point, otherwise we might leak jobs. Feels more like a workaround to free
>>   * finished jobs.
>>   */
>> drm_sched_stop();
>>
>> /* Free jobs from the entity queue. */
>> drm_sched_entity_fini();
>>
>> /* Probably not even needed in this case. */
>> drm_sched_fini();
>>
>> This doesn't look very straightforward though. I wonder if other drivers feeding
>> firmware schedulers have similar cases. Maybe something like drm_sched_teardown(),
>> which would stop job submission, wait for pending jobs to finish and subsequently
>> free them up, would make sense?
>>
> 
> exec queue == gpu scheduler + entity in Xe
> 
> We kinda invented our own flow with reference counting + use the TDR for
> cleanup.

Thanks for the detailed explanation. In case we make it driver-specific,
I thought about something similar: pretty much the same reference counting,
but instead of the TDR, letting jobs from the entity just return -ECANCELED
from run_job() and also signalling pending jobs with the same error code.

On the other hand, I don't really want scheduler and job structures to
potentially outlive the channel. Which is why I think it'd be nice to avoid
consuming all the queued up jobs from the entity in the first place, stop the
scheduler with drm_sched_submit_stop(), signal all pending jobs with
-ECANCELED and call the free_job() callbacks right away.

The latter I could probably do in Nouveau as well, however, it kinda feels
wrong to do all that within the driver.
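
Just to make it a bit more concrete, here is a rough sketch of what I mean
(entirely hypothetical: the function name is made up, the obvious includes are
omitted and the locking / interaction with the scheduler's work items is
hand-waved):

static void channel_sched_kill(struct drm_gpu_scheduler *sched,
                               struct drm_sched_entity *entity)
{
        struct drm_sched_job *job, *tmp;

        /* Stop the submit workqueue, so no further jobs are consumed from
         * the entity and no run_job()/free_job() work runs behind our back.
         */
        drm_sched_submit_stop(sched);

        /* Cancel everything that already made it to the pending_list. */
        spin_lock(&sched->job_list_lock);
        list_for_each_entry_safe(job, tmp, &sched->pending_list, list) {
                /* Detach the scheduler's completion callback from the
                 * hardware fence, like drm_sched_stop() does.
                 */
                if (job->s_fence->parent)
                        dma_fence_remove_callback(job->s_fence->parent,
                                                  &job->cb);
                list_del_init(&job->list);
                dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
                dma_fence_signal(&job->s_fence->finished);
                sched->ops->free_job(job);
        }
        spin_unlock(&sched->job_list_lock);

        /* Discard the jobs still sitting in the entity's queue; their fences
         * get signalled with an error as well.
         */
        drm_sched_entity_fini(entity);

        drm_sched_fini(sched);
}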

Also, I was wondering how existing drivers using the GPU scheduler handle
that. It seems like they just rely on the pending_list of the scheduler being
empty once drm_sched_fini() is called. Admittedly, a non-empty pending_list is
pretty unlikely at that point, since drm_sched_fini() is typically called on
driver remove, but I don't see how that's actually ensured. Am I missing something?
> 
> We have a creation ref for the exec queue plus each job takes a ref to
> the exec queue. On exec queue close [1][2] (whether that be IOCTL or FD
> close) we drop the creation reference and call a vfunc for killing the
> exec queue. The firmware implementation is here [3].
> 
> If you read through it, it just sets the TDR to the minimum value [4]; the
> TDR will kick any running jobs off the hardware and signal the jobs'
> fences, any jobs waiting on dependencies eventually flush out via
> run_job + TDR for cleanup without going on the hardware, the exec queue
> reference count goes to zero once all jobs are flushed out, we trigger
> the exec queue clean up flow and finally free all memory for the exec
> queue.
> 
> Using the TDR in this way is how we teardown an exec queue for other
> reasons too (user page fault, user job times out, user job hang detected
> by firmware, device reset, etc...).
> 
> This all works rather nicely and is a single code path for all of these
> cases. I'm not sure if this can be made any more generic nor do I really
> see the need to (at least I don't see Xe needing a generic solution).
> 
> Hope this helps,
> Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_exec_queue.c#L911
> [2] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_device.c#L77
> [3] https://gitlab.freedesktop.org/drm/xe/kernel/-/tree/drm-xe-next/drivers/gpu/drm/xe#L1184
> [4] https://gitlab.freedesktop.org/drm/xe/kernel/-/tree/drm-xe-next/drivers/gpu/drm/xe#L789
> 
>> - Danilo
>>


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 3/9] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
  2023-08-11  2:31 ` [PATCH v2 3/9] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
@ 2023-08-29 17:37   ` Danilo Krummrich
  2023-09-05 11:10     ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-08-29 17:37 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

On 8/11/23 04:31, Matthew Brost wrote:
> DRM_SCHED_POLICY_SINGLE_ENTITY creates a 1 to 1 relationship between
> scheduler and entity. No priorities or run queue used in this mode.
> Intended for devices with firmware schedulers.
> 
> v2:
>    - Drop sched / rq union (Luben)
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 69 ++++++++++++++++++------
>   drivers/gpu/drm/scheduler/sched_fence.c  |  2 +-
>   drivers/gpu/drm/scheduler/sched_main.c   | 63 +++++++++++++++++++---
>   include/drm/gpu_scheduler.h              |  8 +++
>   4 files changed, 119 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 65a972b52eda..1dec97caaba3 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -83,6 +83,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>   	memset(entity, 0, sizeof(struct drm_sched_entity));
>   	INIT_LIST_HEAD(&entity->list);
>   	entity->rq = NULL;
> +	entity->single_sched = NULL;
>   	entity->guilty = guilty;
>   	entity->num_sched_list = num_sched_list;
>   	entity->priority = priority;
> @@ -90,8 +91,17 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>   	RCU_INIT_POINTER(entity->last_scheduled, NULL);
>   	RB_CLEAR_NODE(&entity->rb_tree_node);
>   
> -	if(num_sched_list)
> -		entity->rq = &sched_list[0]->sched_rq[entity->priority];
> +	if (num_sched_list) {
> +		if (sched_list[0]->sched_policy !=
> +		    DRM_SCHED_POLICY_SINGLE_ENTITY) {
> +			entity->rq = &sched_list[0]->sched_rq[entity->priority];
> +		} else {
> +			if (num_sched_list != 1 || sched_list[0]->single_entity)
> +				return -EINVAL;
> +			sched_list[0]->single_entity = entity;
> +			entity->single_sched = sched_list[0];
> +		}
> +	}
>   
>   	init_completion(&entity->entity_idle);
>   
> @@ -124,7 +134,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>   				    struct drm_gpu_scheduler **sched_list,
>   				    unsigned int num_sched_list)
>   {
> -	WARN_ON(!num_sched_list || !sched_list);
> +	WARN_ON(!num_sched_list || !sched_list ||
> +		!!entity->single_sched);
>   
>   	entity->sched_list = sched_list;
>   	entity->num_sched_list = num_sched_list;
> @@ -231,13 +242,15 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>   {
>   	struct drm_sched_job *job;
>   	struct dma_fence *prev;
> +	bool single_entity = !!entity->single_sched;
>   
> -	if (!entity->rq)
> +	if (!entity->rq && !single_entity)
>   		return;
>   
>   	spin_lock(&entity->rq_lock);
>   	entity->stopped = true;
> -	drm_sched_rq_remove_entity(entity->rq, entity);
> +	if (!single_entity)
> +		drm_sched_rq_remove_entity(entity->rq, entity);

Looks like nothing prevents drm_sched_run_job_work() from fetching more jobs from the entity,
hence if this is called for an entity that still has queued up jobs and a still running
scheduler, drm_sched_entity_kill() and drm_sched_run_job_work() would race for jobs, right?

Not sure if this is intentional, because we don't expect the driver to call drm_sched_entity_fini()
as long as there are still queued up jobs. At least this is inconsistent with what
drm_sched_entity_kill() does without DRM_SCHED_POLICY_SINGLE_ENTITY and should either be fixed
or documented if we agree nothing else makes sense.

I think it also touches on my question of how to tear down the scheduler once a ring is
removed or deinitialized.

I know XE is going its own way in this respect, but I also feel like we're leaving drivers
potentially interested in DRM_SCHED_POLICY_SINGLE_ENTITY a bit alone on that. I think
we should probably give drivers a bit more guidance on how to do that.

Currently, I see two approaches.

(1) Do what XE does, which means letting the scheduler run dry, which includes both the
     entity's job queue and the scheduler's pending_list. Jobs from the entity's queue
     don't push any more work to the ring on teardown, but just "flow through" to get
     freed up eventually. (Hopefully I got that right.)

(2) Kill the entity to clean up jobs from the entity's queue, stop the scheduler and either
     just wait for pending jobs or signal them right away and finally free them up.

Actually there'd also be (3), which could be a mix of both: discard the entity's queued jobs,
but let the pending_list run dry.

I'm not saying we should provide a whole bunch of infrastructure for drivers; e.g. for (1),
as you've mentioned already, there is probably not much to generalize anyway. However,
I think we should document the options drivers have to tear things down and do enough to
enable drivers to use any option (as long as we agree it is reasonable).

For Nouveau specifically, I'd probably like to go with (3).
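
Roughly, and with the actual drain mechanism completely hand-waved, I picture
something like this (names purely illustrative, includes omitted):

static void channel_sched_drain(struct drm_gpu_scheduler *sched,
                                struct drm_sched_entity *entity)
{
        /* Discard the entity's queued jobs; nothing new reaches the ring
         * from here on and the queued jobs' fences get signalled with an
         * error.
         */
        drm_sched_entity_fini(entity);

        /* Let the pending_list run dry; the scheduler is still running, so
         * free_job() keeps being called as the hardware fences signal.
         */
        while (!list_empty(&sched->pending_list))
                msleep(20);

        drm_sched_fini(sched);
}

Whether polling the pending_list like this is acceptable, or whether we'd want
a proper wait primitive in the scheduler for it, is exactly the kind of thing
I'd like to see documented.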

- Danilo

>   	spin_unlock(&entity->rq_lock);
>   
>   	/* Make sure this entity is not used by the scheduler at the moment */
> @@ -259,6 +272,20 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>   	dma_fence_put(prev);
>   }
>   
> +/**
> + * drm_sched_entity_to_scheduler - Schedule entity to GPU scheduler
> + * @entity: scheduler entity
> + *
> + * Returns GPU scheduler for the entity
> + */
> +struct drm_gpu_scheduler *
> +drm_sched_entity_to_scheduler(struct drm_sched_entity *entity)
> +{
> +	bool single_entity = !!entity->single_sched;
> +
> +	return single_entity ? entity->single_sched : entity->rq->sched;
> +}
> +
>   /**
>    * drm_sched_entity_flush - Flush a context entity
>    *
> @@ -276,11 +303,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
>   	struct drm_gpu_scheduler *sched;
>   	struct task_struct *last_user;
>   	long ret = timeout;
> +	bool single_entity = !!entity->single_sched;
>   
> -	if (!entity->rq)
> +	if (!entity->rq && !single_entity)
>   		return 0;
>   
> -	sched = entity->rq->sched;
> +	sched = drm_sched_entity_to_scheduler(entity);
>   	/**
>   	 * The client will not queue more IBs during this fini, consume existing
>   	 * queued IBs or discard them on SIGKILL
> @@ -373,7 +401,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>   		container_of(cb, struct drm_sched_entity, cb);
>   
>   	drm_sched_entity_clear_dep(f, cb);
> -	drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +	drm_sched_wakeup_if_can_queue(drm_sched_entity_to_scheduler(entity));
>   }
>   
>   /**
> @@ -387,6 +415,8 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>   void drm_sched_entity_set_priority(struct drm_sched_entity *entity,
>   				   enum drm_sched_priority priority)
>   {
> +	WARN_ON(!!entity->single_sched);
> +
>   	spin_lock(&entity->rq_lock);
>   	entity->priority = priority;
>   	spin_unlock(&entity->rq_lock);
> @@ -399,7 +429,7 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
>    */
>   static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>   {
> -	struct drm_gpu_scheduler *sched = entity->rq->sched;
> +	struct drm_gpu_scheduler *sched = drm_sched_entity_to_scheduler(entity);
>   	struct dma_fence *fence = entity->dependency;
>   	struct drm_sched_fence *s_fence;
>   
> @@ -501,7 +531,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>   	 * Update the entity's location in the min heap according to
>   	 * the timestamp of the next job, if any.
>   	 */
> -	if (entity->rq->sched->sched_policy == DRM_SCHED_POLICY_FIFO) {
> +	if (drm_sched_entity_to_scheduler(entity)->sched_policy ==
> +	    DRM_SCHED_POLICY_FIFO) {
>   		struct drm_sched_job *next;
>   
>   		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> @@ -524,6 +555,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>   	struct drm_gpu_scheduler *sched;
>   	struct drm_sched_rq *rq;
>   
> +	WARN_ON(!!entity->single_sched);
> +
>   	/* single possible engine and already selected */
>   	if (!entity->sched_list)
>   		return;
> @@ -573,12 +606,13 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>   void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>   {
>   	struct drm_sched_entity *entity = sched_job->entity;
> -	bool first, fifo = entity->rq->sched->sched_policy ==
> -		DRM_SCHED_POLICY_FIFO;
> +	bool single_entity = !!entity->single_sched;
> +	bool first;
>   	ktime_t submit_ts;
>   
>   	trace_drm_sched_job(sched_job, entity);
> -	atomic_inc(entity->rq->sched->score);
> +	if (!single_entity)
> +		atomic_inc(entity->rq->sched->score);
>   	WRITE_ONCE(entity->last_user, current->group_leader);
>   
>   	/*
> @@ -591,6 +625,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>   
>   	/* first job wakes up scheduler */
>   	if (first) {
> +		struct drm_gpu_scheduler *sched =
> +			drm_sched_entity_to_scheduler(entity);
> +		bool fifo = sched->sched_policy == DRM_SCHED_POLICY_FIFO;
> +
>   		/* Add the entity to the run queue */
>   		spin_lock(&entity->rq_lock);
>   		if (entity->stopped) {
> @@ -600,13 +638,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>   			return;
>   		}
>   
> -		drm_sched_rq_add_entity(entity->rq, entity);
> +		if (!single_entity)
> +			drm_sched_rq_add_entity(entity->rq, entity);
>   		spin_unlock(&entity->rq_lock);
>   
>   		if (fifo)
>   			drm_sched_rq_update_fifo(entity, submit_ts);
>   
> -		drm_sched_wakeup_if_can_queue(entity->rq->sched);
> +		drm_sched_wakeup_if_can_queue(sched);
>   	}
>   }
>   EXPORT_SYMBOL(drm_sched_entity_push_job);
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> index 06cedfe4b486..f6b926f5e188 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -225,7 +225,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>   {
>   	unsigned seq;
>   
> -	fence->sched = entity->rq->sched;
> +	fence->sched = drm_sched_entity_to_scheduler(entity);
>   	seq = atomic_inc_return(&entity->fence_seq);
>   	dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>   		       &fence->lock, entity->fence_context, seq);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 545d5298c086..cede47afc800 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -32,7 +32,8 @@
>    * backend operations to the scheduler like submitting a job to hardware run queue,
>    * returning the dependencies of a job etc.
>    *
> - * The organisation of the scheduler is the following:
> + * The organisation of the scheduler is the following for scheduling policies
> + * DRM_SCHED_POLICY_RR and DRM_SCHED_POLICY_FIFO:
>    *
>    * 1. Each hw run queue has one scheduler
>    * 2. Each scheduler has multiple run queues with different priorities
> @@ -43,6 +44,23 @@
>    *
>    * The jobs in a entity are always scheduled in the order that they were pushed.
>    *
> + * The organisation of the scheduler is the following for scheduling policy
> + * DRM_SCHED_POLICY_SINGLE_ENTITY:
> + *
> + * 1. One to one relationship between scheduler and entity
> + * 2. No priorities implemented per scheduler (single job queue)
> + * 3. No run queues in scheduler rather jobs are directly dequeued from entity
> + * 4. The entity maintains a queue of jobs that will be scheduled on the
> + * hardware
> + *
> + * The jobs in a entity are always scheduled in the order that they were pushed
> + * regardless of scheduling policy.
> + *
> + * A policy of DRM_SCHED_POLICY_RR or DRM_SCHED_POLICY_FIFO is expected to used
> + * when the KMD is scheduling directly on the hardware while a scheduling policy
> + * of DRM_SCHED_POLICY_SINGLE_ENTITY is expected to be used when there is a
> + * firmware scheduler.
> + *
>    * Note that once a job was taken from the entities queue and pushed to the
>    * hardware, i.e. the pending queue, the entity must not be referenced anymore
>    * through the jobs entity pointer.
> @@ -96,6 +114,8 @@ static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
>   
>   void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
>   {
> +	WARN_ON(!!entity->single_sched);
> +
>   	/*
>   	 * Both locks need to be grabbed, one to protect from entity->rq change
>   	 * for entity from within concurrent drm_sched_entity_select_rq and the
> @@ -126,6 +146,8 @@ void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
>   static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>   			      struct drm_sched_rq *rq)
>   {
> +	WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
> +
>   	spin_lock_init(&rq->lock);
>   	INIT_LIST_HEAD(&rq->entities);
>   	rq->rb_tree_root = RB_ROOT_CACHED;
> @@ -144,6 +166,8 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>   void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>   			     struct drm_sched_entity *entity)
>   {
> +	WARN_ON(!!entity->single_sched);
> +
>   	if (!list_empty(&entity->list))
>   		return;
>   
> @@ -166,6 +190,8 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>   				struct drm_sched_entity *entity)
>   {
> +	WARN_ON(!!entity->single_sched);
> +
>   	if (list_empty(&entity->list))
>   		return;
>   
> @@ -641,7 +667,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>   		       struct drm_sched_entity *entity,
>   		       void *owner)
>   {
> -	if (!entity->rq)
> +	if (!entity->rq && !entity->single_sched)
>   		return -ENOENT;
>   
>   	job->entity = entity;
> @@ -674,13 +700,16 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>   {
>   	struct drm_gpu_scheduler *sched;
>   	struct drm_sched_entity *entity = job->entity;
> +	bool single_entity = !!entity->single_sched;
>   
>   	BUG_ON(!entity);
> -	drm_sched_entity_select_rq(entity);
> -	sched = entity->rq->sched;
> +	if (!single_entity)
> +		drm_sched_entity_select_rq(entity);
> +	sched = drm_sched_entity_to_scheduler(entity);
>   
>   	job->sched = sched;
> -	job->s_priority = entity->rq - sched->sched_rq;
> +	if (!single_entity)
> +		job->s_priority = entity->rq - sched->sched_rq;
>   	job->id = atomic64_inc_return(&sched->job_id_count);
>   
>   	drm_sched_fence_init(job->s_fence, job->entity);
> @@ -896,6 +925,13 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>   	if (!drm_sched_can_queue(sched))
>   		return NULL;
>   
> +	if (sched->single_entity) {
> +		if (drm_sched_entity_is_ready(sched->single_entity))
> +			return sched->single_entity;
> +
> +		return NULL;
> +	}
> +
>   	/* Kernel run queue has higher priority than normal run queue*/
>   	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>   		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> @@ -1091,6 +1127,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   		return -EINVAL;
>   
>   	sched->ops = ops;
> +	sched->single_entity = NULL;
>   	sched->hw_submission_limit = hw_submission;
>   	sched->name = name;
>   	sched->submit_wq = submit_wq ? : system_wq;
> @@ -1103,7 +1140,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>   		sched->sched_policy = default_drm_sched_policy;
>   	else
>   		sched->sched_policy = sched_policy;
> -	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
> +	for (i = DRM_SCHED_PRIORITY_MIN; sched_policy !=
> +	     DRM_SCHED_POLICY_SINGLE_ENTITY && i < DRM_SCHED_PRIORITY_COUNT;
> +	     i++)
>   		drm_sched_rq_init(sched, &sched->sched_rq[i]);
>   
>   	init_waitqueue_head(&sched->job_scheduled);
> @@ -1135,7 +1174,15 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>   
>   	drm_sched_submit_stop(sched);
>   
> -	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> +	if (sched->single_entity) {
> +		spin_lock(&sched->single_entity->rq_lock);
> +		sched->single_entity->stopped = true;
> +		spin_unlock(&sched->single_entity->rq_lock);
> +	}
> +
> +	for (i = DRM_SCHED_PRIORITY_COUNT - 1; sched->sched_policy !=
> +	     DRM_SCHED_POLICY_SINGLE_ENTITY && i >= DRM_SCHED_PRIORITY_MIN;
> +	     i--) {
>   		struct drm_sched_rq *rq = &sched->sched_rq[i];
>   
>   		spin_lock(&rq->lock);
> @@ -1176,6 +1223,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>   	struct drm_sched_entity *entity;
>   	struct drm_gpu_scheduler *sched = bad->sched;
>   
> +	WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
> +
>   	/* don't change @bad's karma if it's from KERNEL RQ,
>   	 * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
>   	 * corrupt but keep in mind that kernel jobs always considered good.
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 897d52a4ff4f..04eec2d7635f 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -79,6 +79,7 @@ enum drm_sched_policy {
>   	DRM_SCHED_POLICY_DEFAULT,
>   	DRM_SCHED_POLICY_RR,
>   	DRM_SCHED_POLICY_FIFO,
> +	DRM_SCHED_POLICY_SINGLE_ENTITY,
>   	DRM_SCHED_POLICY_COUNT,
>   };
>   
> @@ -112,6 +113,9 @@ struct drm_sched_entity {
>   	 */
>   	struct drm_sched_rq		*rq;
>   
> +	/** @single_sched: Single scheduler */
> +	struct drm_gpu_scheduler	*single_sched;
> +
>   	/**
>   	 * @sched_list:
>   	 *
> @@ -473,6 +477,7 @@ struct drm_sched_backend_ops {
>    * struct drm_gpu_scheduler - scheduler instance-specific data
>    *
>    * @ops: backend operations provided by the driver.
> + * @single_entity: Single entity for the scheduler
>    * @hw_submission_limit: the max size of the hardware queue.
>    * @timeout: the time after which a job is removed from the scheduler.
>    * @name: name of the ring for which this scheduler is being used.
> @@ -503,6 +508,7 @@ struct drm_sched_backend_ops {
>    */
>   struct drm_gpu_scheduler {
>   	const struct drm_sched_backend_ops	*ops;
> +	struct drm_sched_entity		*single_entity;
>   	uint32_t			hw_submission_limit;
>   	long				timeout;
>   	const char			*name;
> @@ -585,6 +591,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>   			  struct drm_gpu_scheduler **sched_list,
>   			  unsigned int num_sched_list,
>   			  atomic_t *guilty);
> +struct drm_gpu_scheduler *
> +drm_sched_entity_to_scheduler(struct drm_sched_entity *entity);
>   long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout);
>   void drm_sched_entity_fini(struct drm_sched_entity *entity);
>   void drm_sched_entity_destroy(struct drm_sched_entity *entity);


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 3/9] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
  2023-08-29 17:37   ` Danilo Krummrich
@ 2023-09-05 11:10     ` Danilo Krummrich
  2023-09-11 19:44       ` Matthew Brost
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-09-05 11:10 UTC (permalink / raw)
  To: Matthew Brost
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

On 8/29/23 19:37, Danilo Krummrich wrote:
> On 8/11/23 04:31, Matthew Brost wrote:
>> DRM_SCHED_POLICY_SINGLE_ENTITY creates a 1 to 1 relationship between
>> scheduler and entity. No priorities or run queue used in this mode.
>> Intended for devices with firmware schedulers.
>>
>> v2:
>>    - Drop sched / rq union (Luben)
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_entity.c | 69 ++++++++++++++++++------
>>   drivers/gpu/drm/scheduler/sched_fence.c  |  2 +-
>>   drivers/gpu/drm/scheduler/sched_main.c   | 63 +++++++++++++++++++---
>>   include/drm/gpu_scheduler.h              |  8 +++
>>   4 files changed, 119 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index 65a972b52eda..1dec97caaba3 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -83,6 +83,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>       memset(entity, 0, sizeof(struct drm_sched_entity));
>>       INIT_LIST_HEAD(&entity->list);
>>       entity->rq = NULL;
>> +    entity->single_sched = NULL;
>>       entity->guilty = guilty;
>>       entity->num_sched_list = num_sched_list;
>>       entity->priority = priority;
>> @@ -90,8 +91,17 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>       RCU_INIT_POINTER(entity->last_scheduled, NULL);
>>       RB_CLEAR_NODE(&entity->rb_tree_node);
>> -    if(num_sched_list)
>> -        entity->rq = &sched_list[0]->sched_rq[entity->priority];
>> +    if (num_sched_list) {
>> +        if (sched_list[0]->sched_policy !=
>> +            DRM_SCHED_POLICY_SINGLE_ENTITY) {
>> +            entity->rq = &sched_list[0]->sched_rq[entity->priority];
>> +        } else {
>> +            if (num_sched_list != 1 || sched_list[0]->single_entity)
>> +                return -EINVAL;
>> +            sched_list[0]->single_entity = entity;
>> +            entity->single_sched = sched_list[0];
>> +        }
>> +    }
>>       init_completion(&entity->entity_idle);
>> @@ -124,7 +134,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
>>                       struct drm_gpu_scheduler **sched_list,
>>                       unsigned int num_sched_list)
>>   {
>> -    WARN_ON(!num_sched_list || !sched_list);
>> +    WARN_ON(!num_sched_list || !sched_list ||
>> +        !!entity->single_sched);
>>       entity->sched_list = sched_list;
>>       entity->num_sched_list = num_sched_list;
>> @@ -231,13 +242,15 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>   {
>>       struct drm_sched_job *job;
>>       struct dma_fence *prev;
>> +    bool single_entity = !!entity->single_sched;
>> -    if (!entity->rq)
>> +    if (!entity->rq && !single_entity)
>>           return;
>>       spin_lock(&entity->rq_lock);
>>       entity->stopped = true;
>> -    drm_sched_rq_remove_entity(entity->rq, entity);
>> +    if (!single_entity)
>> +        drm_sched_rq_remove_entity(entity->rq, entity);
> 
> Looks like nothing prevents drm_sched_run_job_work() to fetch more jobs from the entity,
> hence if this is called for an entity still having queued up jobs and a still running
> scheduler, drm_sched_entity_kill() and drm_sched_run_job_work() would race for jobs, right?

I worked around this with:

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9a5e9b7032da..0687da57757d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1025,7 +1025,8 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
                 return NULL;
  
         if (sched->single_entity) {
-               if (drm_sched_entity_is_ready(sched->single_entity))
+               if (drm_sched_entity_is_ready(sched->single_entity) &&
+                   !READ_ONCE(sched->single_entity->stopped))
                         return sched->single_entity;
  
                 return NULL;

> 
> Not sure if this is by intention because we don't expect the driver to drm_sched_entity_fini()
> as long as there are still queued up jobs. At least this is inconsistant to what
> drm_sched_entity_kill() does without DRM_SCHED_POLICY_SINGLE_ENTITY and should either be fixed
> or documented if we agree nothing else makes sense.
> 
> I think it also touches my question on how to tear down the scheduler once a ring is removed
> or deinitialized.
> 
> I know XE is going its own way in this respect, but I also feel like we're leaving drivers
> potentially being interested in DRM_SCHED_POLICY_SINGLE_ENTITY a bit alone on that. I think
> we should probably give drivers a bit more guidance on how to do that.
> 
> Currently, I see two approaches.
> 
> (1) Do what XE does, which means letting the scheduler run dry, which includes both the
>      entity's job queue and the schedulers pending_list. While jobs from the entity's queue
>      aren't pushing any more work to the ring on tear down, but just "flow through" to get
>      freed up eventually. (Hopefully I got that right.)
> 
> (2) Kill the entity to cleanup jobs from the entity's queue, stop the scheduler and either
>      just wait for pending jobs or signal them right away and finally free them up.
> 
> Actually there'd also be (3), which could be a mix of both, discard the entity's queued jobs,
> but let the pending_list run dry.
> 
> I'm not saying we should provide a whole bunch of infrastructure for drivers, e.g. for (1)
> as you've mentioned already as well, there is probably not much to generalize anyway. However,
> I think we should document the options drivers have to tear things down and do enough to
> enable drivers using any option (as long as we agree it is reasonable).
> 
> For Nouveau specifically, I'd probably like to go with (3).
> 
> - Danilo
> 
>>       spin_unlock(&entity->rq_lock);
>>       /* Make sure this entity is not used by the scheduler at the moment */
>> @@ -259,6 +272,20 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
>>       dma_fence_put(prev);
>>   }
>> +/**
>> + * drm_sched_entity_to_scheduler - Schedule entity to GPU scheduler
>> + * @entity: scheduler entity
>> + *
>> + * Returns GPU scheduler for the entity
>> + */
>> +struct drm_gpu_scheduler *
>> +drm_sched_entity_to_scheduler(struct drm_sched_entity *entity)
>> +{
>> +    bool single_entity = !!entity->single_sched;
>> +
>> +    return single_entity ? entity->single_sched : entity->rq->sched;
>> +}
>> +
>>   /**
>>    * drm_sched_entity_flush - Flush a context entity
>>    *
>> @@ -276,11 +303,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
>>       struct drm_gpu_scheduler *sched;
>>       struct task_struct *last_user;
>>       long ret = timeout;
>> +    bool single_entity = !!entity->single_sched;
>> -    if (!entity->rq)
>> +    if (!entity->rq && !single_entity)
>>           return 0;
>> -    sched = entity->rq->sched;
>> +    sched = drm_sched_entity_to_scheduler(entity);
>>       /**
>>        * The client will not queue more IBs during this fini, consume existing
>>        * queued IBs or discard them on SIGKILL
>> @@ -373,7 +401,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>           container_of(cb, struct drm_sched_entity, cb);
>>       drm_sched_entity_clear_dep(f, cb);
>> -    drm_sched_wakeup_if_can_queue(entity->rq->sched);
>> +    drm_sched_wakeup_if_can_queue(drm_sched_entity_to_scheduler(entity));
>>   }
>>   /**
>> @@ -387,6 +415,8 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
>>   void drm_sched_entity_set_priority(struct drm_sched_entity *entity,
>>                      enum drm_sched_priority priority)
>>   {
>> +    WARN_ON(!!entity->single_sched);
>> +
>>       spin_lock(&entity->rq_lock);
>>       entity->priority = priority;
>>       spin_unlock(&entity->rq_lock);
>> @@ -399,7 +429,7 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
>>    */
>>   static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>   {
>> -    struct drm_gpu_scheduler *sched = entity->rq->sched;
>> +    struct drm_gpu_scheduler *sched = drm_sched_entity_to_scheduler(entity);
>>       struct dma_fence *fence = entity->dependency;
>>       struct drm_sched_fence *s_fence;
>> @@ -501,7 +531,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
>>        * Update the entity's location in the min heap according to
>>        * the timestamp of the next job, if any.
>>        */
>> -    if (entity->rq->sched->sched_policy == DRM_SCHED_POLICY_FIFO) {
>> +    if (drm_sched_entity_to_scheduler(entity)->sched_policy ==
>> +        DRM_SCHED_POLICY_FIFO) {
>>           struct drm_sched_job *next;
>>           next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
>> @@ -524,6 +555,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>       struct drm_gpu_scheduler *sched;
>>       struct drm_sched_rq *rq;
>> +    WARN_ON(!!entity->single_sched);
>> +
>>       /* single possible engine and already selected */
>>       if (!entity->sched_list)
>>           return;
>> @@ -573,12 +606,13 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>>   void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>   {
>>       struct drm_sched_entity *entity = sched_job->entity;
>> -    bool first, fifo = entity->rq->sched->sched_policy ==
>> -        DRM_SCHED_POLICY_FIFO;
>> +    bool single_entity = !!entity->single_sched;
>> +    bool first;
>>       ktime_t submit_ts;
>>       trace_drm_sched_job(sched_job, entity);
>> -    atomic_inc(entity->rq->sched->score);
>> +    if (!single_entity)
>> +        atomic_inc(entity->rq->sched->score);
>>       WRITE_ONCE(entity->last_user, current->group_leader);
>>       /*
>> @@ -591,6 +625,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>       /* first job wakes up scheduler */
>>       if (first) {
>> +        struct drm_gpu_scheduler *sched =
>> +            drm_sched_entity_to_scheduler(entity);
>> +        bool fifo = sched->sched_policy == DRM_SCHED_POLICY_FIFO;
>> +
>>           /* Add the entity to the run queue */
>>           spin_lock(&entity->rq_lock);
>>           if (entity->stopped) {
>> @@ -600,13 +638,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
>>               return;
>>           }
>> -        drm_sched_rq_add_entity(entity->rq, entity);
>> +        if (!single_entity)
>> +            drm_sched_rq_add_entity(entity->rq, entity);
>>           spin_unlock(&entity->rq_lock);
>>           if (fifo)
>>               drm_sched_rq_update_fifo(entity, submit_ts);
>> -        drm_sched_wakeup_if_can_queue(entity->rq->sched);
>> +        drm_sched_wakeup_if_can_queue(sched);
>>       }
>>   }
>>   EXPORT_SYMBOL(drm_sched_entity_push_job);
>> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
>> index 06cedfe4b486..f6b926f5e188 100644
>> --- a/drivers/gpu/drm/scheduler/sched_fence.c
>> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
>> @@ -225,7 +225,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
>>   {
>>       unsigned seq;
>> -    fence->sched = entity->rq->sched;
>> +    fence->sched = drm_sched_entity_to_scheduler(entity);
>>       seq = atomic_inc_return(&entity->fence_seq);
>>       dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
>>                  &fence->lock, entity->fence_context, seq);
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 545d5298c086..cede47afc800 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -32,7 +32,8 @@
>>    * backend operations to the scheduler like submitting a job to hardware run queue,
>>    * returning the dependencies of a job etc.
>>    *
>> - * The organisation of the scheduler is the following:
>> + * The organisation of the scheduler is the following for scheduling policies
>> + * DRM_SCHED_POLICY_RR and DRM_SCHED_POLICY_FIFO:
>>    *
>>    * 1. Each hw run queue has one scheduler
>>    * 2. Each scheduler has multiple run queues with different priorities
>> @@ -43,6 +44,23 @@
>>    *
>>    * The jobs in a entity are always scheduled in the order that they were pushed.
>>    *
>> + * The organisation of the scheduler is the following for scheduling policy
>> + * DRM_SCHED_POLICY_SINGLE_ENTITY:
>> + *
>> + * 1. One to one relationship between scheduler and entity
>> + * 2. No priorities implemented per scheduler (single job queue)
>> + * 3. No run queues in scheduler rather jobs are directly dequeued from entity
>> + * 4. The entity maintains a queue of jobs that will be scheduled on the
>> + * hardware
>> + *
>> + * The jobs in a entity are always scheduled in the order that they were pushed
>> + * regardless of scheduling policy.
>> + *
>> + * A policy of DRM_SCHED_POLICY_RR or DRM_SCHED_POLICY_FIFO is expected to used
>> + * when the KMD is scheduling directly on the hardware while a scheduling policy
>> + * of DRM_SCHED_POLICY_SINGLE_ENTITY is expected to be used when there is a
>> + * firmware scheduler.
>> + *
>>    * Note that once a job was taken from the entities queue and pushed to the
>>    * hardware, i.e. the pending queue, the entity must not be referenced anymore
>>    * through the jobs entity pointer.
>> @@ -96,6 +114,8 @@ static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
>>   void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
>>   {
>> +    WARN_ON(!!entity->single_sched);
>> +
>>       /*
>>        * Both locks need to be grabbed, one to protect from entity->rq change
>>        * for entity from within concurrent drm_sched_entity_select_rq and the
>> @@ -126,6 +146,8 @@ void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
>>   static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>>                     struct drm_sched_rq *rq)
>>   {
>> +    WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
>> +
>>       spin_lock_init(&rq->lock);
>>       INIT_LIST_HEAD(&rq->entities);
>>       rq->rb_tree_root = RB_ROOT_CACHED;
>> @@ -144,6 +166,8 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
>>   void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>>                    struct drm_sched_entity *entity)
>>   {
>> +    WARN_ON(!!entity->single_sched);
>> +
>>       if (!list_empty(&entity->list))
>>           return;
>> @@ -166,6 +190,8 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
>>   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
>>                   struct drm_sched_entity *entity)
>>   {
>> +    WARN_ON(!!entity->single_sched);
>> +
>>       if (list_empty(&entity->list))
>>           return;
>> @@ -641,7 +667,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
>>                  struct drm_sched_entity *entity,
>>                  void *owner)
>>   {
>> -    if (!entity->rq)
>> +    if (!entity->rq && !entity->single_sched)
>>           return -ENOENT;
>>       job->entity = entity;
>> @@ -674,13 +700,16 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>>   {
>>       struct drm_gpu_scheduler *sched;
>>       struct drm_sched_entity *entity = job->entity;
>> +    bool single_entity = !!entity->single_sched;
>>       BUG_ON(!entity);
>> -    drm_sched_entity_select_rq(entity);
>> -    sched = entity->rq->sched;
>> +    if (!single_entity)
>> +        drm_sched_entity_select_rq(entity);
>> +    sched = drm_sched_entity_to_scheduler(entity);
>>       job->sched = sched;
>> -    job->s_priority = entity->rq - sched->sched_rq;
>> +    if (!single_entity)
>> +        job->s_priority = entity->rq - sched->sched_rq;
>>       job->id = atomic64_inc_return(&sched->job_id_count);
>>       drm_sched_fence_init(job->s_fence, job->entity);
>> @@ -896,6 +925,13 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
>>       if (!drm_sched_can_queue(sched))
>>           return NULL;
>> +    if (sched->single_entity) {
>> +        if (drm_sched_entity_is_ready(sched->single_entity))
>> +            return sched->single_entity;
>> +
>> +        return NULL;
>> +    }
>> +
>>       /* Kernel run queue has higher priority than normal run queue*/
>>       for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>>           entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
>> @@ -1091,6 +1127,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>           return -EINVAL;
>>       sched->ops = ops;
>> +    sched->single_entity = NULL;
>>       sched->hw_submission_limit = hw_submission;
>>       sched->name = name;
>>       sched->submit_wq = submit_wq ? : system_wq;
>> @@ -1103,7 +1140,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>           sched->sched_policy = default_drm_sched_policy;
>>       else
>>           sched->sched_policy = sched_policy;
>> -    for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
>> +    for (i = DRM_SCHED_PRIORITY_MIN; sched_policy !=
>> +         DRM_SCHED_POLICY_SINGLE_ENTITY && i < DRM_SCHED_PRIORITY_COUNT;
>> +         i++)
>>           drm_sched_rq_init(sched, &sched->sched_rq[i]);
>>       init_waitqueue_head(&sched->job_scheduled);
>> @@ -1135,7 +1174,15 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>       drm_sched_submit_stop(sched);
>> -    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
>> +    if (sched->single_entity) {
>> +        spin_lock(&sched->single_entity->rq_lock);
>> +        sched->single_entity->stopped = true;
>> +        spin_unlock(&sched->single_entity->rq_lock);
>> +    }
>> +
>> +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; sched->sched_policy !=
>> +         DRM_SCHED_POLICY_SINGLE_ENTITY && i >= DRM_SCHED_PRIORITY_MIN;
>> +         i--) {
>>           struct drm_sched_rq *rq = &sched->sched_rq[i];
>>           spin_lock(&rq->lock);
>> @@ -1176,6 +1223,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
>>       struct drm_sched_entity *entity;
>>       struct drm_gpu_scheduler *sched = bad->sched;
>> +    WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
>> +
>>       /* don't change @bad's karma if it's from KERNEL RQ,
>>        * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
>>        * corrupt but keep in mind that kernel jobs always considered good.
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index 897d52a4ff4f..04eec2d7635f 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -79,6 +79,7 @@ enum drm_sched_policy {
>>       DRM_SCHED_POLICY_DEFAULT,
>>       DRM_SCHED_POLICY_RR,
>>       DRM_SCHED_POLICY_FIFO,
>> +    DRM_SCHED_POLICY_SINGLE_ENTITY,
>>       DRM_SCHED_POLICY_COUNT,
>>   };
>> @@ -112,6 +113,9 @@ struct drm_sched_entity {
>>        */
>>       struct drm_sched_rq        *rq;
>> +    /** @single_sched: Single scheduler */
>> +    struct drm_gpu_scheduler    *single_sched;
>> +
>>       /**
>>        * @sched_list:
>>        *
>> @@ -473,6 +477,7 @@ struct drm_sched_backend_ops {
>>    * struct drm_gpu_scheduler - scheduler instance-specific data
>>    *
>>    * @ops: backend operations provided by the driver.
>> + * @single_entity: Single entity for the scheduler
>>    * @hw_submission_limit: the max size of the hardware queue.
>>    * @timeout: the time after which a job is removed from the scheduler.
>>    * @name: name of the ring for which this scheduler is being used.
>> @@ -503,6 +508,7 @@ struct drm_sched_backend_ops {
>>    */
>>   struct drm_gpu_scheduler {
>>       const struct drm_sched_backend_ops    *ops;
>> +    struct drm_sched_entity        *single_entity;
>>       uint32_t            hw_submission_limit;
>>       long                timeout;
>>       const char            *name;
>> @@ -585,6 +591,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
>>                 struct drm_gpu_scheduler **sched_list,
>>                 unsigned int num_sched_list,
>>                 atomic_t *guilty);
>> +struct drm_gpu_scheduler *
>> +drm_sched_entity_to_scheduler(struct drm_sched_entity *entity);
>>   long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout);
>>   void drm_sched_entity_fini(struct drm_sched_entity *entity);
>>   void drm_sched_entity_destroy(struct drm_sched_entity *entity);


^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 3/9] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
  2023-09-05 11:10     ` Danilo Krummrich
@ 2023-09-11 19:44       ` Matthew Brost
  0 siblings, 0 replies; 80+ messages in thread
From: Matthew Brost @ 2023-09-11 19:44 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: robdclark, thomas.hellstrom, sarah.walker, ketil.johnsen, lina,
	Liviu.Dudau, dri-devel, intel-xe, luben.tuikov, donald.robson,
	boris.brezillon, christian.koenig, faith.ekstrand

On Tue, Sep 05, 2023 at 01:10:38PM +0200, Danilo Krummrich wrote:
> On 8/29/23 19:37, Danilo Krummrich wrote:
> > On 8/11/23 04:31, Matthew Brost wrote:
> > > DRM_SCHED_POLICY_SINGLE_ENTITY creates a 1 to 1 relationship between
> > > scheduler and entity. No priorities or run queue used in this mode.
> > > Intended for devices with firmware schedulers.
> > > 
> > > v2:
> > >    - Drop sched / rq union (Luben)
> > > 
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_entity.c | 69 ++++++++++++++++++------
> > >   drivers/gpu/drm/scheduler/sched_fence.c  |  2 +-
> > >   drivers/gpu/drm/scheduler/sched_main.c   | 63 +++++++++++++++++++---
> > >   include/drm/gpu_scheduler.h              |  8 +++
> > >   4 files changed, 119 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index 65a972b52eda..1dec97caaba3 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -83,6 +83,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
> > >       memset(entity, 0, sizeof(struct drm_sched_entity));
> > >       INIT_LIST_HEAD(&entity->list);
> > >       entity->rq = NULL;
> > > +    entity->single_sched = NULL;
> > >       entity->guilty = guilty;
> > >       entity->num_sched_list = num_sched_list;
> > >       entity->priority = priority;
> > > @@ -90,8 +91,17 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
> > >       RCU_INIT_POINTER(entity->last_scheduled, NULL);
> > >       RB_CLEAR_NODE(&entity->rb_tree_node);
> > > -    if(num_sched_list)
> > > -        entity->rq = &sched_list[0]->sched_rq[entity->priority];
> > > +    if (num_sched_list) {
> > > +        if (sched_list[0]->sched_policy !=
> > > +            DRM_SCHED_POLICY_SINGLE_ENTITY) {
> > > +            entity->rq = &sched_list[0]->sched_rq[entity->priority];
> > > +        } else {
> > > +            if (num_sched_list != 1 || sched_list[0]->single_entity)
> > > +                return -EINVAL;
> > > +            sched_list[0]->single_entity = entity;
> > > +            entity->single_sched = sched_list[0];
> > > +        }
> > > +    }
> > >       init_completion(&entity->entity_idle);
> > > @@ -124,7 +134,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
> > >                       struct drm_gpu_scheduler **sched_list,
> > >                       unsigned int num_sched_list)
> > >   {
> > > -    WARN_ON(!num_sched_list || !sched_list);
> > > +    WARN_ON(!num_sched_list || !sched_list ||
> > > +        !!entity->single_sched);
> > >       entity->sched_list = sched_list;
> > >       entity->num_sched_list = num_sched_list;
> > > @@ -231,13 +242,15 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> > >   {
> > >       struct drm_sched_job *job;
> > >       struct dma_fence *prev;
> > > +    bool single_entity = !!entity->single_sched;
> > > -    if (!entity->rq)
> > > +    if (!entity->rq && !single_entity)
> > >           return;
> > >       spin_lock(&entity->rq_lock);
> > >       entity->stopped = true;
> > > -    drm_sched_rq_remove_entity(entity->rq, entity);
> > > +    if (!single_entity)
> > > +        drm_sched_rq_remove_entity(entity->rq, entity);
> > 
> > Looks like nothing prevents drm_sched_run_job_work() to fetch more jobs from the entity,
> > hence if this is called for an entity still having queued up jobs and a still running
> > scheduler, drm_sched_entity_kill() and drm_sched_run_job_work() would race for jobs, right?
> 
> I worked around this with:
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 9a5e9b7032da..0687da57757d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1025,7 +1025,8 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
>                 return NULL;
>         if (sched->single_entity) {
> -               if (drm_sched_entity_is_ready(sched->single_entity))
> +               if (drm_sched_entity_is_ready(sched->single_entity) &&
> +                   !READ_ONCE(sched->single_entity->stopped))

This looks like the proper fix. Will include in next rev.

>                         return sched->single_entity;
>                 return NULL;
> 
> > 
> > Not sure if this is by intention because we don't expect the driver to drm_sched_entity_fini()
> > as long as there are still queued up jobs. At least this is inconsistant to what
> > drm_sched_entity_kill() does without DRM_SCHED_POLICY_SINGLE_ENTITY and should either be fixed
> > or documented if we agree nothing else makes sense.
> > 
> > I think it also touches my question on how to tear down the scheduler once a ring is removed
> > or deinitialized.
> > 
> > I know XE is going its own way in this respect, but I also feel like we're leaving drivers
> > potentially being interested in DRM_SCHED_POLICY_SINGLE_ENTITY a bit alone on that. I think
> > we should probably give drivers a bit more guidance on how to do that.
> > 
> > Currently, I see two approaches.
> > 
> > (1) Do what XE does, which means letting the scheduler run dry, which includes both the
> >      entity's job queue and the schedulers pending_list. While jobs from the entity's queue
> >      aren't pushing any more work to the ring on tear down, but just "flow through" to get
> >      freed up eventually. (Hopefully I got that right.)
> > 
> > (2) Kill the entity to cleanup jobs from the entity's queue, stop the scheduler and either
> >      just wait for pending jobs or signal them right away and finally free them up.
> > 
> > Actually there'd also be (3), which could be a mix of both, discard the entity's queued jobs,
> > but let the pending_list run dry.
> > 
> > I'm not saying we should provide a whole bunch of infrastructure for drivers, e.g. for (1)
> > as you've mentioned already as well, there is probably not much to generalize anyway. However,
> > I think we should document the options drivers have to tear things down and do enough to
> > enable drivers using any option (as long as we agree it is reasonable).
> > 

Agree we should document this. Agree both ways seem reasonable. Let me
include a patch that documents this in my next rev.
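
Thinking something along these lines as a starting point (wording not final,
just to give an idea of the scope):

/**
 * DOC: Scheduler teardown with DRM_SCHED_POLICY_SINGLE_ENTITY
 *
 * When the backing ring / firmware queue goes away, drivers have a few
 * reasonable options:
 *
 * 1. Let everything run dry (what Xe does): keep the scheduler running,
 *    flush the entity's queued jobs through run_job() without touching the
 *    hardware (e.g. driven by the TDR) and free pending jobs as their fences
 *    signal, only then call drm_sched_fini().
 *
 * 2. Kill the entity to discard its queued jobs, stop the scheduler and
 *    either wait for the pending jobs or signal them with an error and free
 *    them right away.
 *
 * 3. A mix of both: discard the entity's queued jobs, but let the
 *    pending_list run dry.
 */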

Matt 

> > For Nouveau specifically, I'd probably like to go with (3).
> > 
> > - Danilo
> > 
> > >       spin_unlock(&entity->rq_lock);
> > >       /* Make sure this entity is not used by the scheduler at the moment */
> > > @@ -259,6 +272,20 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
> > >       dma_fence_put(prev);
> > >   }
> > > +/**
> > > + * drm_sched_entity_to_scheduler - Schedule entity to GPU scheduler
> > > + * @entity: scheduler entity
> > > + *
> > > + * Returns GPU scheduler for the entity
> > > + */
> > > +struct drm_gpu_scheduler *
> > > +drm_sched_entity_to_scheduler(struct drm_sched_entity *entity)
> > > +{
> > > +    bool single_entity = !!entity->single_sched;
> > > +
> > > +    return single_entity ? entity->single_sched : entity->rq->sched;
> > > +}
> > > +
> > >   /**
> > >    * drm_sched_entity_flush - Flush a context entity
> > >    *
> > > @@ -276,11 +303,12 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
> > >       struct drm_gpu_scheduler *sched;
> > >       struct task_struct *last_user;
> > >       long ret = timeout;
> > > +    bool single_entity = !!entity->single_sched;
> > > -    if (!entity->rq)
> > > +    if (!entity->rq && !single_entity)
> > >           return 0;
> > > -    sched = entity->rq->sched;
> > > +    sched = drm_sched_entity_to_scheduler(entity);
> > >       /**
> > >        * The client will not queue more IBs during this fini, consume existing
> > >        * queued IBs or discard them on SIGKILL
> > > @@ -373,7 +401,7 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
> > >           container_of(cb, struct drm_sched_entity, cb);
> > >       drm_sched_entity_clear_dep(f, cb);
> > > -    drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > > +    drm_sched_wakeup_if_can_queue(drm_sched_entity_to_scheduler(entity));
> > >   }
> > >   /**
> > > @@ -387,6 +415,8 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
> > >   void drm_sched_entity_set_priority(struct drm_sched_entity *entity,
> > >                      enum drm_sched_priority priority)
> > >   {
> > > +    WARN_ON(!!entity->single_sched);
> > > +
> > >       spin_lock(&entity->rq_lock);
> > >       entity->priority = priority;
> > >       spin_unlock(&entity->rq_lock);
> > > @@ -399,7 +429,7 @@ EXPORT_SYMBOL(drm_sched_entity_set_priority);
> > >    */
> > >   static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
> > >   {
> > > -    struct drm_gpu_scheduler *sched = entity->rq->sched;
> > > +    struct drm_gpu_scheduler *sched = drm_sched_entity_to_scheduler(entity);
> > >       struct dma_fence *fence = entity->dependency;
> > >       struct drm_sched_fence *s_fence;
> > > @@ -501,7 +531,8 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
> > >        * Update the entity's location in the min heap according to
> > >        * the timestamp of the next job, if any.
> > >        */
> > > -    if (entity->rq->sched->sched_policy == DRM_SCHED_POLICY_FIFO) {
> > > +    if (drm_sched_entity_to_scheduler(entity)->sched_policy ==
> > > +        DRM_SCHED_POLICY_FIFO) {
> > >           struct drm_sched_job *next;
> > >           next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
> > > @@ -524,6 +555,8 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > >       struct drm_gpu_scheduler *sched;
> > >       struct drm_sched_rq *rq;
> > > +    WARN_ON(!!entity->single_sched);
> > > +
> > >       /* single possible engine and already selected */
> > >       if (!entity->sched_list)
> > >           return;
> > > @@ -573,12 +606,13 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > >   void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > >   {
> > >       struct drm_sched_entity *entity = sched_job->entity;
> > > -    bool first, fifo = entity->rq->sched->sched_policy ==
> > > -        DRM_SCHED_POLICY_FIFO;
> > > +    bool single_entity = !!entity->single_sched;
> > > +    bool first;
> > >       ktime_t submit_ts;
> > >       trace_drm_sched_job(sched_job, entity);
> > > -    atomic_inc(entity->rq->sched->score);
> > > +    if (!single_entity)
> > > +        atomic_inc(entity->rq->sched->score);
> > >       WRITE_ONCE(entity->last_user, current->group_leader);
> > >       /*
> > > @@ -591,6 +625,10 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > >       /* first job wakes up scheduler */
> > >       if (first) {
> > > +        struct drm_gpu_scheduler *sched =
> > > +            drm_sched_entity_to_scheduler(entity);
> > > +        bool fifo = sched->sched_policy == DRM_SCHED_POLICY_FIFO;
> > > +
> > >           /* Add the entity to the run queue */
> > >           spin_lock(&entity->rq_lock);
> > >           if (entity->stopped) {
> > > @@ -600,13 +638,14 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > >               return;
> > >           }
> > > -        drm_sched_rq_add_entity(entity->rq, entity);
> > > +        if (!single_entity)
> > > +            drm_sched_rq_add_entity(entity->rq, entity);
> > >           spin_unlock(&entity->rq_lock);
> > >           if (fifo)
> > >               drm_sched_rq_update_fifo(entity, submit_ts);
> > > -        drm_sched_wakeup_if_can_queue(entity->rq->sched);
> > > +        drm_sched_wakeup_if_can_queue(sched);
> > >       }
> > >   }
> > >   EXPORT_SYMBOL(drm_sched_entity_push_job);
> > > diff --git a/drivers/gpu/drm/scheduler/sched_fence.c b/drivers/gpu/drm/scheduler/sched_fence.c
> > > index 06cedfe4b486..f6b926f5e188 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_fence.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> > > @@ -225,7 +225,7 @@ void drm_sched_fence_init(struct drm_sched_fence *fence,
> > >   {
> > >       unsigned seq;
> > > -    fence->sched = entity->rq->sched;
> > > +    fence->sched = drm_sched_entity_to_scheduler(entity);
> > >       seq = atomic_inc_return(&entity->fence_seq);
> > >       dma_fence_init(&fence->scheduled, &drm_sched_fence_ops_scheduled,
> > >                  &fence->lock, entity->fence_context, seq);
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 545d5298c086..cede47afc800 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -32,7 +32,8 @@
> > >    * backend operations to the scheduler like submitting a job to hardware run queue,
> > >    * returning the dependencies of a job etc.
> > >    *
> > > - * The organisation of the scheduler is the following:
> > > + * The organisation of the scheduler is the following for scheduling policies
> > > + * DRM_SCHED_POLICY_RR and DRM_SCHED_POLICY_FIFO:
> > >    *
> > >    * 1. Each hw run queue has one scheduler
> > >    * 2. Each scheduler has multiple run queues with different priorities
> > > @@ -43,6 +44,23 @@
> > >    *
> > >    * The jobs in a entity are always scheduled in the order that they were pushed.
> > >    *
> > > + * The organisation of the scheduler is the following for scheduling policy
> > > + * DRM_SCHED_POLICY_SINGLE_ENTITY:
> > > + *
> > > + * 1. One to one relationship between scheduler and entity
> > > + * 2. No priorities implemented per scheduler (single job queue)
> > > + * 3. No run queues in scheduler rather jobs are directly dequeued from entity
> > > + * 4. The entity maintains a queue of jobs that will be scheduled on the
> > > + * hardware
> > > + *
> > > + * The jobs in a entity are always scheduled in the order that they were pushed
> > > + * regardless of scheduling policy.
> > > + *
> > > + * A policy of DRM_SCHED_POLICY_RR or DRM_SCHED_POLICY_FIFO is expected to used
> > > + * when the KMD is scheduling directly on the hardware while a scheduling policy
> > > + * of DRM_SCHED_POLICY_SINGLE_ENTITY is expected to be used when there is a
> > > + * firmware scheduler.
> > > + *
> > >    * Note that once a job was taken from the entities queue and pushed to the
> > >    * hardware, i.e. the pending queue, the entity must not be referenced anymore
> > >    * through the jobs entity pointer.
> > > @@ -96,6 +114,8 @@ static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
> > >   void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
> > >   {
> > > +    WARN_ON(!!entity->single_sched);
> > > +
> > >       /*
> > >        * Both locks need to be grabbed, one to protect from entity->rq change
> > >        * for entity from within concurrent drm_sched_entity_select_rq and the
> > > @@ -126,6 +146,8 @@ void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
> > >   static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> > >                     struct drm_sched_rq *rq)
> > >   {
> > > +    WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
> > > +
> > >       spin_lock_init(&rq->lock);
> > >       INIT_LIST_HEAD(&rq->entities);
> > >       rq->rb_tree_root = RB_ROOT_CACHED;
> > > @@ -144,6 +166,8 @@ static void drm_sched_rq_init(struct drm_gpu_scheduler *sched,
> > >   void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
> > >                    struct drm_sched_entity *entity)
> > >   {
> > > +    WARN_ON(!!entity->single_sched);
> > > +
> > >       if (!list_empty(&entity->list))
> > >           return;
> > > @@ -166,6 +190,8 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
> > >   void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > >                   struct drm_sched_entity *entity)
> > >   {
> > > +    WARN_ON(!!entity->single_sched);
> > > +
> > >       if (list_empty(&entity->list))
> > >           return;
> > > @@ -641,7 +667,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
> > >                  struct drm_sched_entity *entity,
> > >                  void *owner)
> > >   {
> > > -    if (!entity->rq)
> > > +    if (!entity->rq && !entity->single_sched)
> > >           return -ENOENT;
> > >       job->entity = entity;
> > > @@ -674,13 +700,16 @@ void drm_sched_job_arm(struct drm_sched_job *job)
> > >   {
> > >       struct drm_gpu_scheduler *sched;
> > >       struct drm_sched_entity *entity = job->entity;
> > > +    bool single_entity = !!entity->single_sched;
> > >       BUG_ON(!entity);
> > > -    drm_sched_entity_select_rq(entity);
> > > -    sched = entity->rq->sched;
> > > +    if (!single_entity)
> > > +        drm_sched_entity_select_rq(entity);
> > > +    sched = drm_sched_entity_to_scheduler(entity);
> > >       job->sched = sched;
> > > -    job->s_priority = entity->rq - sched->sched_rq;
> > > +    if (!single_entity)
> > > +        job->s_priority = entity->rq - sched->sched_rq;
> > >       job->id = atomic64_inc_return(&sched->job_id_count);
> > >       drm_sched_fence_init(job->s_fence, job->entity);
> > > @@ -896,6 +925,13 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > >       if (!drm_sched_can_queue(sched))
> > >           return NULL;
> > > +    if (sched->single_entity) {
> > > +        if (drm_sched_entity_is_ready(sched->single_entity))
> > > +            return sched->single_entity;
> > > +
> > > +        return NULL;
> > > +    }
> > > +
> > >       /* Kernel run queue has higher priority than normal run queue*/
> > >       for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > >           entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > > @@ -1091,6 +1127,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > >           return -EINVAL;
> > >       sched->ops = ops;
> > > +    sched->single_entity = NULL;
> > >       sched->hw_submission_limit = hw_submission;
> > >       sched->name = name;
> > >       sched->submit_wq = submit_wq ? : system_wq;
> > > @@ -1103,7 +1140,9 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > >           sched->sched_policy = default_drm_sched_policy;
> > >       else
> > >           sched->sched_policy = sched_policy;
> > > -    for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
> > > +    for (i = DRM_SCHED_PRIORITY_MIN; sched_policy !=
> > > +         DRM_SCHED_POLICY_SINGLE_ENTITY && i < DRM_SCHED_PRIORITY_COUNT;
> > > +         i++)
> > >           drm_sched_rq_init(sched, &sched->sched_rq[i]);
> > >       init_waitqueue_head(&sched->job_scheduled);
> > > @@ -1135,7 +1174,15 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > >       drm_sched_submit_stop(sched);
> > > -    for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > > +    if (sched->single_entity) {
> > > +        spin_lock(&sched->single_entity->rq_lock);
> > > +        sched->single_entity->stopped = true;
> > > +        spin_unlock(&sched->single_entity->rq_lock);
> > > +    }
> > > +
> > > +    for (i = DRM_SCHED_PRIORITY_COUNT - 1; sched->sched_policy !=
> > > +         DRM_SCHED_POLICY_SINGLE_ENTITY && i >= DRM_SCHED_PRIORITY_MIN;
> > > +         i--) {
> > >           struct drm_sched_rq *rq = &sched->sched_rq[i];
> > >           spin_lock(&rq->lock);
> > > @@ -1176,6 +1223,8 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
> > >       struct drm_sched_entity *entity;
> > >       struct drm_gpu_scheduler *sched = bad->sched;
> > > +    WARN_ON(sched->sched_policy == DRM_SCHED_POLICY_SINGLE_ENTITY);
> > > +
> > >       /* don't change @bad's karma if it's from KERNEL RQ,
> > >        * because sometimes GPU hang would cause kernel jobs (like VM updating jobs)
> > >        * corrupt but keep in mind that kernel jobs always considered good.
> > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > > index 897d52a4ff4f..04eec2d7635f 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -79,6 +79,7 @@ enum drm_sched_policy {
> > >       DRM_SCHED_POLICY_DEFAULT,
> > >       DRM_SCHED_POLICY_RR,
> > >       DRM_SCHED_POLICY_FIFO,
> > > +    DRM_SCHED_POLICY_SINGLE_ENTITY,
> > >       DRM_SCHED_POLICY_COUNT,
> > >   };
> > > @@ -112,6 +113,9 @@ struct drm_sched_entity {
> > >        */
> > >       struct drm_sched_rq        *rq;
> > > +    /** @single_sched: Single scheduler */
> > > +    struct drm_gpu_scheduler    *single_sched;
> > > +
> > >       /**
> > >        * @sched_list:
> > >        *
> > > @@ -473,6 +477,7 @@ struct drm_sched_backend_ops {
> > >    * struct drm_gpu_scheduler - scheduler instance-specific data
> > >    *
> > >    * @ops: backend operations provided by the driver.
> > > + * @single_entity: Single entity for the scheduler
> > >    * @hw_submission_limit: the max size of the hardware queue.
> > >    * @timeout: the time after which a job is removed from the scheduler.
> > >    * @name: name of the ring for which this scheduler is being used.
> > > @@ -503,6 +508,7 @@ struct drm_sched_backend_ops {
> > >    */
> > >   struct drm_gpu_scheduler {
> > >       const struct drm_sched_backend_ops    *ops;
> > > +    struct drm_sched_entity        *single_entity;
> > >       uint32_t            hw_submission_limit;
> > >       long                timeout;
> > >       const char            *name;
> > > @@ -585,6 +591,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
> > >                 struct drm_gpu_scheduler **sched_list,
> > >                 unsigned int num_sched_list,
> > >                 atomic_t *guilty);
> > > +struct drm_gpu_scheduler *
> > > +drm_sched_entity_to_scheduler(struct drm_sched_entity *entity);
> > >   long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout);
> > >   void drm_sched_entity_fini(struct drm_sched_entity *entity);
> > >   void drm_sched_entity_destroy(struct drm_sched_entity *entity);
> 

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-25 13:45           ` Christian König
@ 2023-09-12 10:13             ` Boris Brezillon
  2023-09-12 10:46               ` Danilo Krummrich
  2023-09-12 13:27             ` Boris Brezillon
  1 sibling, 1 reply; 80+ messages in thread
From: Boris Brezillon @ 2023-09-12 10:13 UTC (permalink / raw)
  To: Christian König, donald.robson
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, luben.tuikov,
	Danilo Krummrich, intel-xe, faith.ekstrand

On Fri, 25 Aug 2023 15:45:49 +0200
Christian König <christian.koenig@amd.com> wrote:

> Am 25.08.23 um 15:36 schrieb Matthew Brost:
> > On Fri, Aug 25, 2023 at 10:02:32AM +0200, Christian König wrote:  
> >> Am 25.08.23 um 04:58 schrieb Matthew Brost:  
> >>> On Fri, Aug 25, 2023 at 01:04:10AM +0200, Danilo Krummrich wrote:  
> >>>> On Thu, Aug 10, 2023 at 07:31:32PM -0700, Matthew Brost wrote:  
> >>>>> Rather than call free_job and run_job in same work item have a dedicated
> >>>>> work item for each. This aligns with the design and intended use of work
> >>>>> queues.
> >>>>>
> >>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> >>>>> ---
> >>>>>    drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> >>>>>    include/drm/gpu_scheduler.h            |   8 +-
> >>>>>    2 files changed, 106 insertions(+), 39 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> index cede47afc800..b67469eac179 100644
> >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> >>>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> >>>>>     * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> >>>>>     *
> >>>>>     * @rq: scheduler run queue to check.
> >>>>> + * @dequeue: dequeue selected entity
> >>>>>     *
> >>>>>     * Try to find a ready entity, returns NULL if none found.
> >>>>>     */
> >>>>>    static struct drm_sched_entity *
> >>>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >>>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> >>>>>    {
> >>>>>    	struct drm_sched_entity *entity;
> >>>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >>>>>    	if (entity) {
> >>>>>    		list_for_each_entry_continue(entity, &rq->entities, list) {
> >>>>>    			if (drm_sched_entity_is_ready(entity)) {
> >>>>> -				rq->current_entity = entity;
> >>>>> -				reinit_completion(&entity->entity_idle);
> >>>>> +				if (dequeue) {
> >>>>> +					rq->current_entity = entity;
> >>>>> +					reinit_completion(&entity->entity_idle);
> >>>>> +				}
> >>>>>    				spin_unlock(&rq->lock);
> >>>>>    				return entity;
> >>>>>    			}
> >>>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >>>>>    	list_for_each_entry(entity, &rq->entities, list) {
> >>>>>    		if (drm_sched_entity_is_ready(entity)) {
> >>>>> -			rq->current_entity = entity;
> >>>>> -			reinit_completion(&entity->entity_idle);
> >>>>> +			if (dequeue) {
> >>>>> +				rq->current_entity = entity;
> >>>>> +				reinit_completion(&entity->entity_idle);
> >>>>> +			}
> >>>>>    			spin_unlock(&rq->lock);
> >>>>>    			return entity;
> >>>>>    		}
> >>>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> >>>>>     * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> >>>>>     *
> >>>>>     * @rq: scheduler run queue to check.
> >>>>> + * @dequeue: dequeue selected entity
> >>>>>     *
> >>>>>     * Find oldest waiting ready entity, returns NULL if none found.
> >>>>>     */
> >>>>>    static struct drm_sched_entity *
> >>>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >>>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> >>>>>    {
> >>>>>    	struct rb_node *rb;
> >>>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >>>>>    		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> >>>>>    		if (drm_sched_entity_is_ready(entity)) {
> >>>>> -			rq->current_entity = entity;
> >>>>> -			reinit_completion(&entity->entity_idle);
> >>>>> +			if (dequeue) {
> >>>>> +				rq->current_entity = entity;
> >>>>> +				reinit_completion(&entity->entity_idle);
> >>>>> +			}
> >>>>>    			break;
> >>>>>    		}
> >>>>>    	}
> >>>>> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> >>>>>    }
> >>>>>    /**
> >>>>> - * drm_sched_submit_queue - scheduler queue submission
> >>>>> + * drm_sched_run_job_queue - queue job submission
> >>>>>     * @sched: scheduler instance
> >>>>>     */
> >>>>> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> >>>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> >>>>>    {
> >>>>>    	if (!READ_ONCE(sched->pause_submit))
> >>>>> -		queue_work(sched->submit_wq, &sched->work_submit);
> >>>>> +		queue_work(sched->submit_wq, &sched->work_run_job);
> >>>>> +}
> >>>>> +
> >>>>> +static struct drm_sched_entity *
> >>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> >>>>> +
> >>>>> +/**
> >>>>> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> >>>>> + * @sched: scheduler instance
> >>>>> + */
> >>>>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> >>>>> +{
> >>>>> +	if (drm_sched_select_entity(sched, false))
> >>>>> +		drm_sched_run_job_queue(sched);
> >>>>> +}
> >>>>> +
> >>>>> +/**
> >>>>> + * drm_sched_free_job_queue - queue free job
> >>>>> + *
> >>>>> + * @sched: scheduler instance to queue free job
> >>>>> + */
> >>>>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> >>>>> +{
> >>>>> +	if (!READ_ONCE(sched->pause_submit))
> >>>>> +		queue_work(sched->submit_wq, &sched->work_free_job);
> >>>>> +}
> >>>>> +
> >>>>> +/**
> >>>>> + * drm_sched_free_job_queue_if_ready - queue free job if ready
> >>>>> + *
> >>>>> + * @sched: scheduler instance to queue free job
> >>>>> + */
> >>>>> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> >>>>> +{
> >>>>> +	struct drm_sched_job *job;
> >>>>> +
> >>>>> +	spin_lock(&sched->job_list_lock);
> >>>>> +	job = list_first_entry_or_null(&sched->pending_list,
> >>>>> +				       struct drm_sched_job, list);
> >>>>> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> >>>>> +		drm_sched_free_job_queue(sched);
> >>>>> +	spin_unlock(&sched->job_list_lock);
> >>>>>    }
> >>>>>    /**
> >>>>> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> >>>>>    	dma_fence_get(&s_fence->finished);
> >>>>>    	drm_sched_fence_finished(s_fence, result);
> >>>>>    	dma_fence_put(&s_fence->finished);
> >>>>> -	drm_sched_submit_queue(sched);
> >>>>> +	drm_sched_free_job_queue(sched);
> >>>>>    }
> >>>>>    /**
> >>>>> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> >>>>>    void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> >>>>>    {
> >>>>>    	if (drm_sched_can_queue(sched))
> >>>>> -		drm_sched_submit_queue(sched);
> >>>>> +		drm_sched_run_job_queue(sched);
> >>>>>    }
> >>>>>    /**
> >>>>>     * drm_sched_select_entity - Select next entity to process
> >>>>>     *
> >>>>>     * @sched: scheduler instance
> >>>>> + * @dequeue: dequeue selected entity
> >>>>>     *
> >>>>>     * Returns the entity to process or NULL if none are found.
> >>>>>     */
> >>>>>    static struct drm_sched_entity *
> >>>>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> >>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> >>>>>    {
> >>>>>    	struct drm_sched_entity *entity;
> >>>>>    	int i;
> >>>>> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> >>>>>    	/* Kernel run queue has higher priority than normal run queue*/
> >>>>>    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> >>>>>    		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> >>>>> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> >>>>> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> >>>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> >>>>> +							dequeue) :
> >>>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> >>>>> +						      dequeue);
> >>>>>    		if (entity)
> >>>>>    			break;
> >>>>>    	}
> >>>>> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> >>>>>    EXPORT_SYMBOL(drm_sched_pick_best);
> >>>>>    /**
> >>>>> - * drm_sched_main - main scheduler thread
> >>>>> + * drm_sched_free_job_work - worker to call free_job
> >>>>>     *
> >>>>> - * @param: scheduler instance
> >>>>> + * @w: free job work
> >>>>>     */
> >>>>> -static void drm_sched_main(struct work_struct *w)
> >>>>> +static void drm_sched_free_job_work(struct work_struct *w)
> >>>>>    {
> >>>>>    	struct drm_gpu_scheduler *sched =
> >>>>> -		container_of(w, struct drm_gpu_scheduler, work_submit);
> >>>>> -	struct drm_sched_entity *entity;
> >>>>> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> >>>>>    	struct drm_sched_job *cleanup_job;
> >>>>> -	int r;
> >>>>>    	if (READ_ONCE(sched->pause_submit))
> >>>>>    		return;
> >>>>>    	cleanup_job = drm_sched_get_cleanup_job(sched);  
> >>>> I tried this patch with Nouveau and found a race condition:
> >>>>
> >>>> In drm_sched_run_job_work() the job is added to the pending_list via
> >>>> drm_sched_job_begin(), then the run_job() callback is called and the scheduled
> >>>> fence is signaled.
> >>>>
> >>>> However, in parallel drm_sched_get_cleanup_job() might be called from
> >>>> drm_sched_free_job_work(), which picks the first job from the pending_list and
> >>>> for the next job on the pending_list sets the scheduled fence' timestamp field.  
> >> Well why can this happen in parallel? Either the work items are scheduled to
> >> a single threaded work queue or you have protected the pending list with
> >> some locks.
> >>  
> > Xe uses a single-threaded work queue, Nouveau does not (desired
> > behavior).

I'm a bit worried that leaving this single vs multi-threaded wq
decision to drivers is going to cause unnecessary pain, because what
was previously a given in terms of run/cleanup execution order (thanks
to the kthread+static-drm_sched_main-workflow approach) is now subject
to the wq ordering guarantees, which depend on the wq type picked by
the driver.
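
Just to illustrate what I mean by wq-type-dependent guarantees (not
something this series mandates, and the wq names below are arbitrary):

	/* behaves like the old kthread: one work item at a time, in order */
	struct workqueue_struct *ordered_wq =
		alloc_ordered_workqueue("drm-sched-ordered", 0);

	/* unbound, up to 4 concurrent work items: the run_job and free_job
	 * works can execute in parallel and in no particular order */
	struct workqueue_struct *unbound_wq =
		alloc_workqueue("drm-sched-unbound", WQ_UNBOUND, 4);

Pass the first one as submit_wq and you keep the old serialized
behaviour, pass the second one and you don't.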

> >
> > The list of pending jobs is protected by a lock (safe), the race is:
> >
> > add job to pending list
> > run_job
> > signal scheduled fence
> >
> > dequeue from pending list
> > free_job
> > update timestamp
> >
> > Once a job is on the pending list its timestamp can be accessed which
> > can blow up if scheduled fence isn't signaled or more specifically unless
> > DMA_FENCE_FLAG_TIMESTAMP_BIT is set. 

Ah, so that's the reason for the TIMESTAMP test added in v3. Sorry for
the noise in my v3 review, but I still think it'd be beneficial to have
that change moved to its own commit.
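
For reference, IIRC the check that ended up in v3's
drm_sched_get_cleanup_job() looks roughly like this (quoting from
memory, so take the exact shape with a grain of salt):

	if (next) {
		/* only make the scheduled timestamp more accurate once the
		 * scheduled fence has actually been timestamped */
		if (test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
			     &next->s_fence->scheduled.flags))
			next->s_fence->scheduled.timestamp =
				job->s_fence->finished.timestamp;
	}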

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-09-12 10:13             ` Boris Brezillon
@ 2023-09-12 10:46               ` Danilo Krummrich
  2023-09-12 12:18                 ` Boris Brezillon
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-09-12 10:46 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, Christian König,
	luben.tuikov, donald.robson, intel-xe, faith.ekstrand

On Tue, Sep 12, 2023 at 12:13:57PM +0200, Boris Brezillon wrote:
> On Fri, 25 Aug 2023 15:45:49 +0200
> Christian König <christian.koenig@amd.com> wrote:
> 
> > Am 25.08.23 um 15:36 schrieb Matthew Brost:
> > > On Fri, Aug 25, 2023 at 10:02:32AM +0200, Christian König wrote:  
> > >> Am 25.08.23 um 04:58 schrieb Matthew Brost:  
> > >>> On Fri, Aug 25, 2023 at 01:04:10AM +0200, Danilo Krummrich wrote:  
> > >>>> On Thu, Aug 10, 2023 at 07:31:32PM -0700, Matthew Brost wrote:  
> > >>>>> Rather than call free_job and run_job in same work item have a dedicated
> > >>>>> work item for each. This aligns with the design and intended use of work
> > >>>>> queues.
> > >>>>>
> > >>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > >>>>> ---
> > >>>>>    drivers/gpu/drm/scheduler/sched_main.c | 137 ++++++++++++++++++-------
> > >>>>>    include/drm/gpu_scheduler.h            |   8 +-
> > >>>>>    2 files changed, 106 insertions(+), 39 deletions(-)
> > >>>>>
> > >>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > >>>>> index cede47afc800..b67469eac179 100644
> > >>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
> > >>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > >>>>> @@ -213,11 +213,12 @@ void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
> > >>>>>     * drm_sched_rq_select_entity_rr - Select an entity which could provide a job to run
> > >>>>>     *
> > >>>>>     * @rq: scheduler run queue to check.
> > >>>>> + * @dequeue: dequeue selected entity
> > >>>>>     *
> > >>>>>     * Try to find a ready entity, returns NULL if none found.
> > >>>>>     */
> > >>>>>    static struct drm_sched_entity *
> > >>>>> -drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > >>>>> +drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq, bool dequeue)
> > >>>>>    {
> > >>>>>    	struct drm_sched_entity *entity;
> > >>>>> @@ -227,8 +228,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > >>>>>    	if (entity) {
> > >>>>>    		list_for_each_entry_continue(entity, &rq->entities, list) {
> > >>>>>    			if (drm_sched_entity_is_ready(entity)) {
> > >>>>> -				rq->current_entity = entity;
> > >>>>> -				reinit_completion(&entity->entity_idle);
> > >>>>> +				if (dequeue) {
> > >>>>> +					rq->current_entity = entity;
> > >>>>> +					reinit_completion(&entity->entity_idle);
> > >>>>> +				}
> > >>>>>    				spin_unlock(&rq->lock);
> > >>>>>    				return entity;
> > >>>>>    			}
> > >>>>> @@ -238,8 +241,10 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > >>>>>    	list_for_each_entry(entity, &rq->entities, list) {
> > >>>>>    		if (drm_sched_entity_is_ready(entity)) {
> > >>>>> -			rq->current_entity = entity;
> > >>>>> -			reinit_completion(&entity->entity_idle);
> > >>>>> +			if (dequeue) {
> > >>>>> +				rq->current_entity = entity;
> > >>>>> +				reinit_completion(&entity->entity_idle);
> > >>>>> +			}
> > >>>>>    			spin_unlock(&rq->lock);
> > >>>>>    			return entity;
> > >>>>>    		}
> > >>>>> @@ -257,11 +262,12 @@ drm_sched_rq_select_entity_rr(struct drm_sched_rq *rq)
> > >>>>>     * drm_sched_rq_select_entity_fifo - Select an entity which provides a job to run
> > >>>>>     *
> > >>>>>     * @rq: scheduler run queue to check.
> > >>>>> + * @dequeue: dequeue selected entity
> > >>>>>     *
> > >>>>>     * Find oldest waiting ready entity, returns NULL if none found.
> > >>>>>     */
> > >>>>>    static struct drm_sched_entity *
> > >>>>> -drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > >>>>> +drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq, bool dequeue)
> > >>>>>    {
> > >>>>>    	struct rb_node *rb;
> > >>>>> @@ -271,8 +277,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > >>>>>    		entity = rb_entry(rb, struct drm_sched_entity, rb_tree_node);
> > >>>>>    		if (drm_sched_entity_is_ready(entity)) {
> > >>>>> -			rq->current_entity = entity;
> > >>>>> -			reinit_completion(&entity->entity_idle);
> > >>>>> +			if (dequeue) {
> > >>>>> +				rq->current_entity = entity;
> > >>>>> +				reinit_completion(&entity->entity_idle);
> > >>>>> +			}
> > >>>>>    			break;
> > >>>>>    		}
> > >>>>>    	}
> > >>>>> @@ -282,13 +290,54 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
> > >>>>>    }
> > >>>>>    /**
> > >>>>> - * drm_sched_submit_queue - scheduler queue submission
> > >>>>> + * drm_sched_run_job_queue - queue job submission
> > >>>>>     * @sched: scheduler instance
> > >>>>>     */
> > >>>>> -static void drm_sched_submit_queue(struct drm_gpu_scheduler *sched)
> > >>>>> +static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> > >>>>>    {
> > >>>>>    	if (!READ_ONCE(sched->pause_submit))
> > >>>>> -		queue_work(sched->submit_wq, &sched->work_submit);
> > >>>>> +		queue_work(sched->submit_wq, &sched->work_run_job);
> > >>>>> +}
> > >>>>> +
> > >>>>> +static struct drm_sched_entity *
> > >>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue);
> > >>>>> +
> > >>>>> +/**
> > >>>>> + * drm_sched_run_job_queue_if_ready - queue job submission if ready
> > >>>>> + * @sched: scheduler instance
> > >>>>> + */
> > >>>>> +static void drm_sched_run_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > >>>>> +{
> > >>>>> +	if (drm_sched_select_entity(sched, false))
> > >>>>> +		drm_sched_run_job_queue(sched);
> > >>>>> +}
> > >>>>> +
> > >>>>> +/**
> > >>>>> + * drm_sched_free_job_queue - queue free job
> > >>>>> + *
> > >>>>> + * @sched: scheduler instance to queue free job
> > >>>>> + */
> > >>>>> +static void drm_sched_free_job_queue(struct drm_gpu_scheduler *sched)
> > >>>>> +{
> > >>>>> +	if (!READ_ONCE(sched->pause_submit))
> > >>>>> +		queue_work(sched->submit_wq, &sched->work_free_job);
> > >>>>> +}
> > >>>>> +
> > >>>>> +/**
> > >>>>> + * drm_sched_free_job_queue_if_ready - queue free job if ready
> > >>>>> + *
> > >>>>> + * @sched: scheduler instance to queue free job
> > >>>>> + */
> > >>>>> +static void drm_sched_free_job_queue_if_ready(struct drm_gpu_scheduler *sched)
> > >>>>> +{
> > >>>>> +	struct drm_sched_job *job;
> > >>>>> +
> > >>>>> +	spin_lock(&sched->job_list_lock);
> > >>>>> +	job = list_first_entry_or_null(&sched->pending_list,
> > >>>>> +				       struct drm_sched_job, list);
> > >>>>> +	if (job && dma_fence_is_signaled(&job->s_fence->finished))
> > >>>>> +		drm_sched_free_job_queue(sched);
> > >>>>> +	spin_unlock(&sched->job_list_lock);
> > >>>>>    }
> > >>>>>    /**
> > >>>>> @@ -310,7 +359,7 @@ static void drm_sched_job_done(struct drm_sched_job *s_job, int result)
> > >>>>>    	dma_fence_get(&s_fence->finished);
> > >>>>>    	drm_sched_fence_finished(s_fence, result);
> > >>>>>    	dma_fence_put(&s_fence->finished);
> > >>>>> -	drm_sched_submit_queue(sched);
> > >>>>> +	drm_sched_free_job_queue(sched);
> > >>>>>    }
> > >>>>>    /**
> > >>>>> @@ -906,18 +955,19 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
> > >>>>>    void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched)
> > >>>>>    {
> > >>>>>    	if (drm_sched_can_queue(sched))
> > >>>>> -		drm_sched_submit_queue(sched);
> > >>>>> +		drm_sched_run_job_queue(sched);
> > >>>>>    }
> > >>>>>    /**
> > >>>>>     * drm_sched_select_entity - Select next entity to process
> > >>>>>     *
> > >>>>>     * @sched: scheduler instance
> > >>>>> + * @dequeue: dequeue selected entity
> > >>>>>     *
> > >>>>>     * Returns the entity to process or NULL if none are found.
> > >>>>>     */
> > >>>>>    static struct drm_sched_entity *
> > >>>>> -drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > >>>>> +drm_sched_select_entity(struct drm_gpu_scheduler *sched, bool dequeue)
> > >>>>>    {
> > >>>>>    	struct drm_sched_entity *entity;
> > >>>>>    	int i;
> > >>>>> @@ -935,8 +985,10 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
> > >>>>>    	/* Kernel run queue has higher priority than normal run queue*/
> > >>>>>    	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> > >>>>>    		entity = sched->sched_policy == DRM_SCHED_POLICY_FIFO ?
> > >>>>> -			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i]) :
> > >>>>> -			drm_sched_rq_select_entity_rr(&sched->sched_rq[i]);
> > >>>>> +			drm_sched_rq_select_entity_fifo(&sched->sched_rq[i],
> > >>>>> +							dequeue) :
> > >>>>> +			drm_sched_rq_select_entity_rr(&sched->sched_rq[i],
> > >>>>> +						      dequeue);
> > >>>>>    		if (entity)
> > >>>>>    			break;
> > >>>>>    	}
> > >>>>> @@ -1024,30 +1076,44 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
> > >>>>>    EXPORT_SYMBOL(drm_sched_pick_best);
> > >>>>>    /**
> > >>>>> - * drm_sched_main - main scheduler thread
> > >>>>> + * drm_sched_free_job_work - worker to call free_job
> > >>>>>     *
> > >>>>> - * @param: scheduler instance
> > >>>>> + * @w: free job work
> > >>>>>     */
> > >>>>> -static void drm_sched_main(struct work_struct *w)
> > >>>>> +static void drm_sched_free_job_work(struct work_struct *w)
> > >>>>>    {
> > >>>>>    	struct drm_gpu_scheduler *sched =
> > >>>>> -		container_of(w, struct drm_gpu_scheduler, work_submit);
> > >>>>> -	struct drm_sched_entity *entity;
> > >>>>> +		container_of(w, struct drm_gpu_scheduler, work_free_job);
> > >>>>>    	struct drm_sched_job *cleanup_job;
> > >>>>> -	int r;
> > >>>>>    	if (READ_ONCE(sched->pause_submit))
> > >>>>>    		return;
> > >>>>>    	cleanup_job = drm_sched_get_cleanup_job(sched);  
> > >>>> I tried this patch with Nouveau and found a race condition:
> > >>>>
> > >>>> In drm_sched_run_job_work() the job is added to the pending_list via
> > >>>> drm_sched_job_begin(), then the run_job() callback is called and the scheduled
> > >>>> fence is signaled.
> > >>>>
> > >>>> However, in parallel drm_sched_get_cleanup_job() might be called from
> > >>>> drm_sched_free_job_work(), which picks the first job from the pending_list and
> > >>>> for the next job on the pending_list sets the scheduled fence' timestamp field.  
> > >> Well why can this happen in parallel? Either the work items are scheduled to
> > >> a single threaded work queue or you have protected the pending list with
> > >> some locks.
> > >>  
> > > Xe uses a single-threaded work queue, Nouveau does not (desired
> > > behavior).
> 
> I'm a bit worried that leaving this single vs multi-threaded wq
> decision to drivers is going to cause unnecessary pain, because what
> was previously a granted in term of run/cleanup execution order (thanks
> to the kthread+static-drm_sched_main-workflow approach) is now subject
> to the wq ordering guarantees, which depend on the wq type picked by
> the driver.

Not sure if this ends up being much different. The only thing I could think of
is that IIRC with the kthread implementation cleanup was always preferred over
run. With a single threaded wq this should be a bit more balanced.

With a multi-threaded wq it's still the same, but run and cleanup can run
concurrently, which has the nice side effect that free_job() gets out of the
fence signaling path. At least as long as the workqueue has max_active > 1.
Which is one reason why I'm using a multi-threaded wq in Nouveau.

The latter seems a bit subtle; we probably need to document under which
conditions free_job() is or is not within the fence signaling path.

- Danilo

> 
> > >
> > > The list of pending jobs is protected by a lock (safe), the race is:
> > >
> > > add job to pending list
> > > run_job
> > > signal scheduled fence
> > >
> > > dequeue from pending list
> > > free_job
> > > update timestamp
> > >
> > > Once a job is on the pending list its timestamp can be accessed which
> > > can blow up if scheduled fence isn't signaled or more specifically unless
> > > DMA_FENCE_FLAG_TIMESTAMP_BIT is set. 
> 
> Ah, so that's the reason for the TIMESTAMP test added in v3. Sorry for
> the noise in my v3 review, but I still think it'd be beneficial to have
> that change moved to its own commit.
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-09-12 10:46               ` Danilo Krummrich
@ 2023-09-12 12:18                 ` Boris Brezillon
  2023-09-12 12:56                   ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Boris Brezillon @ 2023-09-12 12:18 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, Christian König,
	luben.tuikov, donald.robson, intel-xe, faith.ekstrand

On Tue, 12 Sep 2023 12:46:26 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> > I'm a bit worried that leaving this single vs multi-threaded wq
> > decision to drivers is going to cause unnecessary pain, because what
> > was previously a granted in term of run/cleanup execution order (thanks
> > to the kthread+static-drm_sched_main-workflow approach) is now subject
> > to the wq ordering guarantees, which depend on the wq type picked by
> > the driver.  
> 
> Not sure if this ends up to be much different. The only thing I could think of
> is that IIRC with the kthread implementation cleanup was always preferred over
> run.

Given the sequence in drm_sched_main(), I'd say that cleanup and run
operations are naturally interleaved when both are available, but I
might be wrong.

> With a single threaded wq this should be a bit more balanced.

With a single threaded wq, it's less clear, because each work
reschedules itself for further processing, but it's likely to be more
or less interleaved. Anyway, I'm not too worried about cleanup taking
precedence over run or the other way around, because the limited amount
of HW slots (size of the ring-buffer) will regulate that.
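
For clarity, by "reschedules itself" I mean something like the pattern
below (paraphrasing the patch, with the job selection hand-waved away):

	static void drm_sched_run_job_work(struct work_struct *w)
	{
		struct drm_gpu_scheduler *sched =
			container_of(w, struct drm_gpu_scheduler, work_run_job);

		/* ... select an entity, pop one job, call ops->run_job() ... */

		/* then queue ourselves again if more jobs are ready */
		drm_sched_run_job_queue_if_ready(sched);
	}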

> 
> With a multi-threaded wq it's still the same, but run and cleanup can run
> concurrently,

What I'm worried about is that ^. I'm not saying it's fundamentally
unsafe, but I'm saying drm_sched hasn't been designed with this
concurrency in mind, and I fear we'll face subtle bugs if we go from
kthread to multi-threaded-wq+run-and-cleanup-split-in-2-work-items.

> which has the nice side effect that free_job() gets out of the
> fence signaling path. At least as long as the workqueue has max_active > 1.

Oh, yeah, I don't deny using a multi-threaded workqueue has some
benefits, just saying it might be trickier than it sounds.

> Which is one reason why I'm using a multi-threaded wq in Nouveau.

Note that I'm using a multi-threaded workqueue internally at the moment
to deal with all sorts of interactions with the FW (Mali HW only has a
limited amount of scheduling slots, and we need to rotate entities
having jobs to execute so every one gets a chance to run on the GPU),
but this has been designed this way from the ground up, unlike
drm_sched_main() operations, which were mostly thought of as a fixed
sequential set of operations. That's not to say it's impossible to get
right, but I fear we'll face weird/unexpected behavior if we go from
completely-serialized to multi-threaded-with-pseudo-random-processing
order.

> 
> That latter seems a bit subtile, we probably need to document this aspect of
> under which conditions free_job() is or is not within the fence signaling path.

Well, I'm not even sure it can be clearly defined when the driver is
using the submit_wq for its own work items (which can be done since we
pass an optional submit_wq when calling drm_sched_init()). Sure, having
max_active >= 2 should be enough to guarantee that the free_job work
won't block the run_job one when these are the only 2 works being
queued, but what if you have many other work items being queued by the
driver to this wq, and some of those try to acquire resv locks? Could
this prevent execution of the run_job() callback, thus preventing
signaling of fences? I'm genuinely asking, don't know enough about the
cmwq implementation to tell what's happening when work items are
blocked (might be that the worker pool is extended to unblock the
situation).

Anyway, documenting when free_job() is in the dma signalling path should
be doable (single-threaded wq), but at this point, are we not better
off considering anything called from the submit_wq as being part of the
dma signalling path, so we can accommodate both cases? And if
there is cleanup processing that requires taking dma_resv locks, I'd be
tempted to queue that to a driver-specific wq (which is what I'm doing
right now), just to be safe.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-09-12 12:18                 ` Boris Brezillon
@ 2023-09-12 12:56                   ` Danilo Krummrich
  2023-09-12 13:52                     ` Boris Brezillon
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-09-12 12:56 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, Christian König,
	luben.tuikov, donald.robson, intel-xe, faith.ekstrand

On Tue, Sep 12, 2023 at 02:18:18PM +0200, Boris Brezillon wrote:
> On Tue, 12 Sep 2023 12:46:26 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
> > > I'm a bit worried that leaving this single vs multi-threaded wq
> > > decision to drivers is going to cause unnecessary pain, because what
> > > was previously a granted in term of run/cleanup execution order (thanks
> > > to the kthread+static-drm_sched_main-workflow approach) is now subject
> > > to the wq ordering guarantees, which depend on the wq type picked by
> > > the driver.  
> > 
> > Not sure if this ends up to be much different. The only thing I could think of
> > is that IIRC with the kthread implementation cleanup was always preferred over
> > run.
> 
> Given the sequence in drm_sched_main(), I'd say that cleanup and run
> operations are naturally interleaved when both are available, but I
> might be wrong.

From drm_sched_main():

	wait_event_interruptible(sched->wake_up_worker,
				 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
				 (!drm_sched_blocked(sched) &&
				  (entity = drm_sched_select_entity(sched))) ||
				 kthread_should_stop());

	if (cleanup_job)
		sched->ops->free_job(cleanup_job);

	if (!entity)
		continue;

If cleanup_job is not NULL, the rest shouldn't be evaluated I guess. Hence entity
would be NULL and we'd loop until there are no more cleanup_jobs, unless I'm
missing something here.

> 
> > With a single threaded wq this should be a bit more balanced.
> 
> With a single threaded wq, it's less clear, because each work
> reschedules itself for further processing, but it's likely to be more
> or less interleaved. Anyway, I'm not too worried about cleanup taking
> precedence on run or the other way around, because the limited amount
> of HW slots (size of the ring-buffer) will regulate that.

Yeah, that's what I meant, with the two work items rescheduling themselves it
starts to be interleaved. Which I'm not worried about either.

> 
> > 
> > With a multi-threaded wq it's still the same, but run and cleanup can run
> > concurrently,
> 
> What I'm worried about is that ^. I'm not saying it's fundamentally
> unsafe, but I'm saying drm_sched hasn't been designed with this
> concurrency in mind, and I fear we'll face subtle bugs if we go from
> kthread to multi-threaded-wq+run-and-cleanup-split-in-2-work-items.
> 

Yeah, so what we get with that is that run_job() of job A and free_job() of job
B can run in parallel. Unless drivers do weird things there, I'm not seeing an
issue with that either at first glance.

> > which has the nice side effect that free_job() gets out of the
> > fence signaling path. At least as long as the workqueue has max_active > 1.
> 
> Oh, yeah, I don't deny using a multi-threaded workqueue has some
> benefits, just saying it might be trickier than it sounds.
> 
> > Which is one reason why I'm using a multi-threaded wq in Nouveau.
> 
> Note that I'm using a multi-threaded workqueue internally at the moment
> to deal with all sort of interactions with the FW (Mali HW only has a
> limited amount of scheduling slots, and we need to rotate entities
> having jobs to execute so every one gets a chance to run on the GPU),
> but this has been designed this way from the ground up, unlike
> drm_sched_main() operations, which were mostly thought as a fixed
> sequential set of operations. That's not to say it's impossible to get
> right, but I fear we'll face weird/unexpected behavior if we go from
> completely-serialized to multi-threaded-with-pseudo-random-processing
> order.

From a per-job perspective it's still all sequential, and besides fence
dependencies, which are still resolved, I don't see where jobs could have cross
dependencies that make this racy. But I agree that it's probably worth thinking
through a bit more.

> 
> > 
> > That latter seems a bit subtile, we probably need to document this aspect of
> > under which conditions free_job() is or is not within the fence signaling path.
> 
> Well, I'm not even sure it can be clearly defined when the driver is
> using the submit_wq for its own work items (which can be done since we
> pass an optional submit_wq when calling drm_sched_init()). Sure, having
> max_active >= 2 should be enough to guarantee that the free_job work
> won't block the run_job one when these are the 2 only works being
> queued, but what if you have many other work items being queued by the
> driver to this wq, and some of those try to acquire resv locks? Could
> this prevent execution of the run_job() callback, thus preventing
> signaling of fences? I'm genuinely asking, don't know enough about the
> cmwq implementation to tell what's happening when work items are
> blocked (might be that the worker pool is extended to unblock the
> situation).

Yes, I think so. If max_active were 2 and you had two jobs on this workqueue
already blocked waiting on allocations, then the 3rd job, which would signal the
fence the allocations are blocked on, would be stuck and we'd have a deadlock I
guess.

But that's where I start to see the driver as being responsible for not passing
a workqueue to the scheduler on which it queues up other work, either at all, or
at least no work that interferes with fence signaling paths.

So, I guess the message here would be something like: free_job() must be
considered to be in the fence signaling path, unless the submit_wq is a
multi-threaded workqueue with max_active > 1 *dedicated* to the DRM scheduler.
Otherwise it's the driver's full responsibility to make sure it doesn't violate
the rules.

> 
> Anyway, documenting when free_job() is in the dma signalling path should
> be doable (single-threaded wq), but at this point, are we not better
> off considering anything called from the submit_wq as being part of the
> dma signalling path, so we can accommodate with both cases. And if
> there is cleanup processing that require taking dma_resv locks, I'd be
> tempted to queue that to a driver-specific wq (which is what I'm doing
> right now), just to be safe.
> 

It's not only the dma-resv lock, it's any lock under which allocations may be
performed.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-08-25 13:45           ` Christian König
  2023-09-12 10:13             ` Boris Brezillon
@ 2023-09-12 13:27             ` Boris Brezillon
  2023-09-12 13:34               ` Danilo Krummrich
  1 sibling, 1 reply; 80+ messages in thread
From: Boris Brezillon @ 2023-09-12 13:27 UTC (permalink / raw)
  To: Christian König
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, luben.tuikov,
	Danilo Krummrich, donald.robson, intel-xe, faith.ekstrand

On Fri, 25 Aug 2023 15:45:49 +0200
Christian König <christian.koenig@amd.com> wrote:

> >>>> I tried this patch with Nouveau and found a race condition:
> >>>>
> >>>> In drm_sched_run_job_work() the job is added to the pending_list via
> >>>> drm_sched_job_begin(), then the run_job() callback is called and the scheduled
> >>>> fence is signaled.
> >>>>
> >>>> However, in parallel drm_sched_get_cleanup_job() might be called from
> >>>> drm_sched_free_job_work(), which picks the first job from the pending_list and
> >>>> for the next job on the pending_list sets the scheduled fence' timestamp field.  
> >> Well why can this happen in parallel? Either the work items are scheduled to
> >> a single threaded work queue or you have protected the pending list with
> >> some locks.
> >>  
> > Xe uses a single-threaded work queue, Nouveau does not (desired
> > behavior).
> >
> > The list of pending jobs is protected by a lock (safe), the race is:
> >
> > add job to pending list
> > run_job
> > signal scheduled fence
> >
> > dequeue from pending list
> > free_job
> > update timestamp
> >
> > Once a job is on the pending list its timestamp can be accessed which
> > can blow up if scheduled fence isn't signaled or more specifically unless
> > DMA_FENCE_FLAG_TIMESTAMP_BIT is set.  

I'm a bit lost. How can this lead to a NULL deref? Timestamp is a
ktime_t embedded in dma_fence, and finished/scheduled are both
dma_fence objects embedded in drm_sched_fence. So, unless
{job,next_job}->s_fence is NULL, or {job,next_job} itself is NULL, I
don't really see where the NULL deref is. If s_fence is NULL, that means
drm_sched_job_init() wasn't called (unlikely to be detected that late),
or ->free_job()/drm_sched_job_cleanup() was called while the job was
still in the pending list. I don't really see a situation where job
could be NULL, to be honest.

While I agree that updating the timestamp before the fence has been
flagged as signaled/timestamped is broken (timestamp will be
overwritten when dma_fence_signal(scheduled) is called) I don't see a
situation where it would cause a NULL/invalid pointer deref. So I
suspect there's another race causing jobs to be cleaned up while
they're still in the pending_list.

> 
> Ah, that problem again. No that is actually quite harmless.
> 
> You just need to double check if the DMA_FENCE_FLAG_TIMESTAMP_BIT is 
> already set and if it's not set don't do anything.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-09-12 13:27             ` Boris Brezillon
@ 2023-09-12 13:34               ` Danilo Krummrich
  2023-09-12 13:53                 ` Boris Brezillon
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-09-12 13:34 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, donald.robson, Christian König, faith.ekstrand

On Tue, Sep 12, 2023 at 03:27:05PM +0200, Boris Brezillon wrote:
> On Fri, 25 Aug 2023 15:45:49 +0200
> Christian König <christian.koenig@amd.com> wrote:
> 
> > >>>> I tried this patch with Nouveau and found a race condition:
> > >>>>
> > >>>> In drm_sched_run_job_work() the job is added to the pending_list via
> > >>>> drm_sched_job_begin(), then the run_job() callback is called and the scheduled
> > >>>> fence is signaled.
> > >>>>
> > >>>> However, in parallel drm_sched_get_cleanup_job() might be called from
> > >>>> drm_sched_free_job_work(), which picks the first job from the pending_list and
> > >>>> for the next job on the pending_list sets the scheduled fence' timestamp field.  
> > >> Well why can this happen in parallel? Either the work items are scheduled to
> > >> a single threaded work queue or you have protected the pending list with
> > >> some locks.
> > >>  
> > > Xe uses a single-threaded work queue, Nouveau does not (desired
> > > behavior).
> > >
> > > The list of pending jobs is protected by a lock (safe), the race is:
> > >
> > > add job to pending list
> > > run_job
> > > signal scheduled fence
> > >
> > > dequeue from pending list
> > > free_job
> > > update timestamp
> > >
> > > Once a job is on the pending list its timestamp can be accessed which
> > > can blow up if scheduled fence isn't signaled or more specifically unless
> > > DMA_FENCE_FLAG_TIMESTAMP_BIT is set.  
> 
> I'm a bit lost. How can this lead to a NULL deref? Timestamp is a
> ktime_t embedded in dma_fence, and finished/scheduled are both
> dma_fence objects embedded in drm_sched_fence. So, unless
> {job,next_job}->s_fence is NULL, or {job,next_job} itself is NULL, I
> don't really see where the NULL deref is. If s_fence is NULL, that means
> drm_sched_job_init() wasn't called (unlikely to be detected that late),
> or ->free_job()/drm_sched_job_cleanup() was called while the job was
> still in the pending list. I don't really see a situation where job
> could NULL to be honest.

I think the problem here was that a dma_fence's timestamp field is within a union
together with its cb_list list_head [1]. If a timestamp is set before the fence
is actually signalled, dma_fence_signal_timestamp_locked() will access the
cb_list to run the particular callbacks registered to this dma_fence. However,
writing the timestamp will overwrite this list_head since it's a union, hence
we'd end up dereferencing the timestamp bits as list pointers while iterating
the list.

[1] https://elixir.bootlin.com/linux/latest/source/include/linux/dma-fence.h#L87
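
For reference, the relevant part of struct dma_fence looks roughly like
this (abridged from include/linux/dma-fence.h):

	struct dma_fence {
		spinlock_t *lock;
		const struct dma_fence_ops *ops;
		union {
			struct list_head cb_list;
			/* @cb_list replaced by @timestamp on dma_fence_signal() */
			ktime_t timestamp;
			/* @timestamp replaced by @rcu on dma_fence_release() */
			struct rcu_head rcu;
		};
		u64 context;
		u64 seqno;
		unsigned long flags;
		struct kref refcount;
		int error;
	};

So until DMA_FENCE_FLAG_TIMESTAMP_BIT is set, those bytes are still the
cb_list and must not be written as a timestamp.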

> 
> While I agree that updating the timestamp before the fence has been
> flagged as signaled/timestamped is broken (timestamp will be
> overwritten when dma_fence_signal(scheduled) is called) I don't see a
> situation where it would cause a NULL/invalid pointer deref. So I
> suspect there's another race causing jobs to be cleaned up while
> they're still in the pending_list.
> 
> > 
> > Ah, that problem again. No that is actually quite harmless.
> > 
> > You just need to double check if the DMA_FENCE_FLAG_TIMESTAMP_BIT is 
> > already set and if it's not set don't do anything.
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-09-12 12:56                   ` Danilo Krummrich
@ 2023-09-12 13:52                     ` Boris Brezillon
  2023-09-12 14:10                       ` Danilo Krummrich
  0 siblings, 1 reply; 80+ messages in thread
From: Boris Brezillon @ 2023-09-12 13:52 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, Christian König,
	luben.tuikov, donald.robson, intel-xe, faith.ekstrand

On Tue, 12 Sep 2023 14:56:06 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> On Tue, Sep 12, 2023 at 02:18:18PM +0200, Boris Brezillon wrote:
> > On Tue, 12 Sep 2023 12:46:26 +0200
> > Danilo Krummrich <dakr@redhat.com> wrote:
> >   
> > > > I'm a bit worried that leaving this single vs multi-threaded wq
> > > > decision to drivers is going to cause unnecessary pain, because what
> > > > was previously a granted in term of run/cleanup execution order (thanks
> > > > to the kthread+static-drm_sched_main-workflow approach) is now subject
> > > > to the wq ordering guarantees, which depend on the wq type picked by
> > > > the driver.    
> > > 
> > > Not sure if this ends up to be much different. The only thing I could think of
> > > is that IIRC with the kthread implementation cleanup was always preferred over
> > > run.  
> > 
> > Given the sequence in drm_sched_main(), I'd say that cleanup and run
> > operations are naturally interleaved when both are available, but I
> > might be wrong.  
> 
> From drm_sched_main():
> 
> 	wait_event_interruptible(sched->wake_up_worker,
> 				 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> 				 (!drm_sched_blocked(sched) &&
> 				  (entity = drm_sched_select_entity(sched))) ||
> 				 kthread_should_stop());
> 
> 	if (cleanup_job)
> 		sched->ops->free_job(cleanup_job);
> 
> 	if (!entity)
> 		continue;
> 
> If cleanup_job is not NULL the rest shouldn't be evaluated I guess. Hence entity
> would be NULL and we'd loop until there are no more cleanup_jobs if I don't miss
> anything here.

Indeed, I got tricked by the wait_event() expression.

> 
> >   
> > > With a single threaded wq this should be a bit more balanced.  
> > 
> > With a single threaded wq, it's less clear, because each work
> > reschedules itself for further processing, but it's likely to be more
> > or less interleaved. Anyway, I'm not too worried about cleanup taking
> > precedence on run or the other way around, because the limited amount
> > of HW slots (size of the ring-buffer) will regulate that.  
> 
> Yeah, that's what I meant, with to work items rescheduling themselves it starts
> to be interleaved. Which I'm not worried about as well.
> 
> >   
> > > 
> > > With a multi-threaded wq it's still the same, but run and cleanup can run
> > > concurrently,  
> > 
> > What I'm worried about is that ^. I'm not saying it's fundamentally
> > unsafe, but I'm saying drm_sched hasn't been designed with this
> > concurrency in mind, and I fear we'll face subtle bugs if we go from
> > kthread to multi-threaded-wq+run-and-cleanup-split-in-2-work-items.
> >   
> 
> Yeah, so what we get with that is that job_run() of job A and job_free() of job
> B can run in parallel. Unless drivers do weird things there, I'm not seeing an
> issue with that as well at a first glance.

I might be wrong of course, but I'm pretty sure the timestamp race you
reported is indirectly coming from this ST -> MT transition. Again, I'm
not saying we should never use an MT wq, but it feels a bit premature,
and I think I'd prefer if we do it in 2 steps to minimize the amount of
things that could go wrong, and avoid a late revert.

> 
> > > which has the nice side effect that free_job() gets out of the
> > > fence signaling path. At least as long as the workqueue has max_active > 1.  
> > 
> > Oh, yeah, I don't deny using a multi-threaded workqueue has some
> > benefits, just saying it might be trickier than it sounds.
> >   
> > > Which is one reason why I'm using a multi-threaded wq in Nouveau.  
> > 
> > Note that I'm using a multi-threaded workqueue internally at the moment
> > to deal with all sort of interactions with the FW (Mali HW only has a
> > limited amount of scheduling slots, and we need to rotate entities
> > having jobs to execute so every one gets a chance to run on the GPU),
> > but this has been designed this way from the ground up, unlike
> > drm_sched_main() operations, which were mostly thought as a fixed
> > sequential set of operations. That's not to say it's impossible to get
> > right, but I fear we'll face weird/unexpected behavior if we go from
> > completely-serialized to multi-threaded-with-pseudo-random-processing
> > order.  
> 
> From a per job perspective it's still all sequential and besides fence
> dependencies,

Sure, per job ops are still sequential (run, then cleanup once parent
fence is signalled).

> which are still resolved, I don't see where jobs could have cross
> dependencies that make this racy. But agree that it's probably worth to think
> through it a bit more.
> 
> >   
> > > 
> > > That latter seems a bit subtle, we probably need to document this aspect of
> > > under which conditions free_job() is or is not within the fence signaling path.  
> > 
> > Well, I'm not even sure it can be clearly defined when the driver is
> > using the submit_wq for its own work items (which can be done since we
> > pass an optional submit_wq when calling drm_sched_init()). Sure, having
> > max_active >= 2 should be enough to guarantee that the free_job work
> > won't block the run_job one when these are the 2 only works being
> > queued, but what if you have many other work items being queued by the
> > driver to this wq, and some of those try to acquire resv locks? Could
> > this prevent execution of the run_job() callback, thus preventing
> > signaling of fences? I'm genuinely asking, don't know enough about the
> > cmwq implementation to tell what's happening when work items are
> > blocked (might be that the worker pool is extended to unblock the
> > situation).  
> 
> Yes, I think so. If max_active would be 2 and you have two jobs running on this
> workqueue already waiting on allocations then the 3rd job signaling the fence
> the allocation is blocked by would be stuck and we'd have a deadlock I guess.
> 
> But that's where I start to see the driver being responsible not to pass a
> workqueue to the driver where it queues up other work, either at all, or that
> interferes with fence signaling paths.
> 
> So, I guess the message here would be something like: free_job() must be
> considered to be in the fence signaling path, unless the submit_wq is a
> multi-threaded workqueue with max_active > 1 *dedicated* to the DRM scheduler.

If it's meant to be dedicated to the drm scheduler, is there any point
passing a custom submit_wq? I mean, we could start with a dedicated
ordered-wq created by the core to replace the kthread, and then, once
enough testing has been done to make sure things work correctly in a MT
env, switch everyone to a multithreaded-wq. The fact that we let the
caller pass its own workqueue, to then restrict its usage to things
directly related to drm_sched is somewhat confusing.
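
Something like the below is what I have in mind as a first step (untested
sketch; the own_submit_wq flag and the exact drm_sched_init() plumbing of
the submit_wq parameter are assumptions about how this series could look,
not the actual patch):

	/* Fall back to a core-owned ordered workqueue when the driver doesn't
	 * pass one, so run_job/free_job stay serialized exactly like the old
	 * kthread loop did. */
	if (submit_wq) {
		sched->submit_wq = submit_wq;
		sched->own_submit_wq = false;
	} else {
		sched->submit_wq = alloc_ordered_workqueue("drm-sched", 0);
		if (!sched->submit_wq)
			return -ENOMEM;
		sched->own_submit_wq = true;
	}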

> Otherwise it's the driver's full responsibility to make sure it doesn't violate
> the rules.

Yeah, that's what I'm worried about tbh. There's so many subtle ways we
let DRM drivers shoot themselves in the foot already, using the
excuse we want drivers to be in control (for optimization/perf
concerns). I'm just not comfortable adding one more way of doing that,
especially given drm_sched has been one thread calling multiple hooks
sequentially until now, which is essentially what an ordered wq would
provide.

> 
> > 
> > Anyway, documenting when free_job() is in the dma signalling path should
> > be doable (single-threaded wq), but at this point, are we not better
> > off considering anything called from the submit_wq as being part of the
> > dma signalling path, so we can accommodate with both cases. And if
> > there is cleanup processing that require taking dma_resv locks, I'd be
> > tempted to queue that to a driver-specific wq (which is what I'm doing
> > right now), just to be safe.
> >   
> 
> It's not only the dma-resv lock, it's any lock under which allocations may be
> performed.

Sure, I was taking the resv lock in example, because that's easy to
reason about, but that's indeed any lock being taken while doing
allocations that don't have the GFP_{NOWAIT,ATOMIC} flags set.
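
I.e. the made-up pattern below is the kind of thing that must not happen
under a lock the signaling path can end up waiting on (gc_lock is purely
illustrative): a GFP_KERNEL allocation may block on reclaim, and reclaim
may in turn wait on fences.

	mutex_lock(&drv->gc_lock);			/* also taken on the signaling path */
	obj = kmalloc(sizeof(*obj), GFP_KERNEL);	/* may sleep on reclaim */
	mutex_unlock(&drv->gc_lock);

	/* in such a context, only non-blocking allocations are safe: */
	obj = kmalloc(sizeof(*obj), GFP_NOWAIT);	/* or GFP_ATOMIC */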

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-09-12 13:34               ` Danilo Krummrich
@ 2023-09-12 13:53                 ` Boris Brezillon
  0 siblings, 0 replies; 80+ messages in thread
From: Boris Brezillon @ 2023-09-12 13:53 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, donald.robson, Christian König, faith.ekstrand

On Tue, 12 Sep 2023 15:34:41 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> On Tue, Sep 12, 2023 at 03:27:05PM +0200, Boris Brezillon wrote:
> > On Fri, 25 Aug 2023 15:45:49 +0200
> > Christian König <christian.koenig@amd.com> wrote:
> >   
> > > >>>> I tried this patch with Nouveau and found a race condition:
> > > >>>>
> > > >>>> In drm_sched_run_job_work() the job is added to the pending_list via
> > > >>>> drm_sched_job_begin(), then the run_job() callback is called and the scheduled
> > > >>>> fence is signaled.
> > > >>>>
> > > >>>> However, in parallel drm_sched_get_cleanup_job() might be called from
> > > >>>> drm_sched_free_job_work(), which picks the first job from the pending_list and
> > > >>>> for the next job on the pending_list sets the scheduled fence' timestamp field.    
> > > >> Well why can this happen in parallel? Either the work items are scheduled to
> > > >> a single threaded work queue or you have protected the pending list with
> > > >> some locks.
> > > >>    
> > > > Xe uses a single-threaded work queue, Nouveau does not (desired
> > > > behavior).
> > > >
> > > > The list of pending jobs is protected by a lock (safe), the race is:
> > > >
> > > > add job to pending list
> > > > run_job
> > > > signal scheduled fence
> > > >
> > > > dequeue from pending list
> > > > free_job
> > > > update timestamp
> > > >
> > > > Once a job is on the pending list its timestamp can be accessed which
> > > > can blow up if scheduled fence isn't signaled or more specifically unless
> > > > DMA_FENCE_FLAG_TIMESTAMP_BIT is set.    
> > 
> > I'm a bit lost. How can this lead to a NULL deref? Timestamp is a
> > ktime_t embedded in dma_fence, and finished/scheduled are both
> > dma_fence objects embedded in drm_sched_fence. So, unless
> > {job,next_job}->s_fence is NULL, or {job,next_job} itself is NULL, I
> > don't really see where the NULL deref is. If s_fence is NULL, that means
> > drm_sched_job_init() wasn't called (unlikely to be detected that late),
> > or ->free_job()/drm_sched_job_cleanup() was called while the job was
> > still in the pending list. I don't really see a situation where job
> > could NULL to be honest.  
> 
> I think the problem here was that a dma_fence's timestamp field is within a union
> together with its cb_list list_head [1]. If a timestamp is set before the fence
> is actually signalled, dma_fence_signal_timestamp_locked() will access the
> cb_list to run the particular callbacks registered to this dma_fence. However,
> writing the timestamp will overwrite this list_head since it's a union, hence
> we'd try to dereference the timestamp while iterating the list.

Ah, right. I didn't notice it was a union, thought it was a struct...
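
For the record, the relevant part of struct dma_fence (condensed from the
header linked below) looks like this, which is why setting ->timestamp on
a not-yet-signaled fence corrupts its callback list:

	struct dma_fence {
		spinlock_t *lock;
		const struct dma_fence_ops *ops;
		union {
			struct list_head cb_list;
			/* @cb_list replaced by @timestamp on dma_fence_signal() */
			ktime_t timestamp;
			/* @timestamp replaced by @rcu on dma_fence_release() */
			struct rcu_head rcu;
		};
		u64 context;
		u64 seqno;
		unsigned long flags;
		struct kref refcount;
		int error;
	};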

> 
> [1] https://elixir.bootlin.com/linux/latest/source/include/linux/dma-fence.h#L87
> 
> > 
> > While I agree that updating the timestamp before the fence has been
> > flagged as signaled/timestamped is broken (timestamp will be
> > overwritten when dma_fence_signal(scheduled) is called) I don't see a
> > situation where it would cause a NULL/invalid pointer deref. So I
> > suspect there's another race causing jobs to be cleaned up while
> > they're still in the pending_list.
> >   
> > > 
> > > Ah, that problem again. No that is actually quite harmless.
> > > 
> > > You just need to double check if the DMA_FENCE_FLAG_TIMESTAMP_BIT is 
> > > already set and if it's not set don't do anything.  
> >   
> 
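
Concretely, what Christian suggests would be something like the below in
drm_sched_get_cleanup_job() (untested sketch, assuming job is the entry
we're about to free and next the following one on the pending list):

	/* Only propagate the timestamp to the next job's scheduled fence if
	 * that fence has actually been signaled and timestamped; otherwise
	 * we'd overwrite the cb_list half of the union. */
	if (next && test_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
			     &next->s_fence->scheduled.flags))
		next->s_fence->scheduled.timestamp =
			job->s_fence->finished.timestamp;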


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 4/9] drm/sched: Split free_job into own work item
  2023-09-12 13:52                     ` Boris Brezillon
@ 2023-09-12 14:10                       ` Danilo Krummrich
  0 siblings, 0 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-09-12 14:10 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, Christian König,
	luben.tuikov, donald.robson, intel-xe, faith.ekstrand

On Tue, Sep 12, 2023 at 03:52:28PM +0200, Boris Brezillon wrote:
> On Tue, 12 Sep 2023 14:56:06 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
> > On Tue, Sep 12, 2023 at 02:18:18PM +0200, Boris Brezillon wrote:
> > > On Tue, 12 Sep 2023 12:46:26 +0200
> > > Danilo Krummrich <dakr@redhat.com> wrote:
> > >   
> > > > > I'm a bit worried that leaving this single vs multi-threaded wq
> > > > > decision to drivers is going to cause unnecessary pain, because what
> > > > > was previously a granted in term of run/cleanup execution order (thanks
> > > > > to the kthread+static-drm_sched_main-workflow approach) is now subject
> > > > > to the wq ordering guarantees, which depend on the wq type picked by
> > > > > the driver.    
> > > > 
> > > > Not sure if this ends up to be much different. The only thing I could think of
> > > > is that IIRC with the kthread implementation cleanup was always preferred over
> > > > run.  
> > > 
> > > Given the sequence in drm_sched_main(), I'd say that cleanup and run
> > > operations are naturally interleaved when both are available, but I
> > > might be wrong.  
> > 
> > From drm_sched_main():
> > 
> > 	wait_event_interruptible(sched->wake_up_worker,
> > 				 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
> > 				 (!drm_sched_blocked(sched) &&
> > 				  (entity = drm_sched_select_entity(sched))) ||
> > 				 kthread_should_stop());
> > 
> > 	if (cleanup_job)
> > 		sched->ops->free_job(cleanup_job);
> > 
> > 	if (!entity)
> > 		continue;
> > 
> > If cleanup_job is not NULL the rest shouldn't be evaluated I guess. Hence entity
> > would be NULL and we'd loop until there are no more cleanup_jobs if I don't miss
> > anything here.
> 
> Indeed, I got tricked by the wait_event() expression.
> 
> > 
> > >   
> > > > With a single threaded wq this should be a bit more balanced.  
> > > 
> > > With a single threaded wq, it's less clear, because each work
> > > reschedules itself for further processing, but it's likely to be more
> > > or less interleaved. Anyway, I'm not too worried about cleanup taking
> > > precedence on run or the other way around, because the limited amount
> > > of HW slots (size of the ring-buffer) will regulate that.  
> > 
> > Yeah, that's what I meant, with two work items rescheduling themselves it starts
> > to be interleaved. Which I'm not worried about as well.
> > 
> > >   
> > > > 
> > > > With a multi-threaded wq it's still the same, but run and cleanup can run
> > > > concurrently,  
> > > 
> > > What I'm worried about is that ^. I'm not saying it's fundamentally
> > > unsafe, but I'm saying drm_sched hasn't been designed with this
> > > concurrency in mind, and I fear we'll face subtle bugs if we go from
> > > kthread to multi-threaded-wq+run-and-cleanup-split-in-2-work-items.
> > >   
> > 
> > Yeah, so what we get with that is that job_run() of job A and job_free() of job
> > B can run in parallel. Unless drivers do weird things there, I'm not seeing an
> > issue with that as well at a first glance.
> 
> I might be wrong of course, but I'm pretty sure the timestamp race you
> reported is indirectly coming from this ST -> MT transition. Again, I'm
> not saying we should never use an MT wq, but it feels a bit premature,
> and I think I'd prefer if we do it in 2 steps to minimize the amount of
> things that could go wrong, and avoid a late revert.

Indirectly, yes. I would agree with using an internal single-threaded workqueue
by default, although I'm a bit more optimistic about that. However, I'd still like
the driver to choose. Otherwise, in Nouveau I'd need to keep queueing work in
free_job() to another workqueue, which isn't very nice.

> 
> > 
> > > > which has the nice side effect that free_job() gets out of the
> > > > fence signaling path. At least as long as the workqueue has max_active > 1.  
> > > 
> > > Oh, yeah, I don't deny using a multi-threaded workqueue has some
> > > benefits, just saying it might be trickier than it sounds.
> > >   
> > > > Which is one reason why I'm using a multi-threaded wq in Nouveau.  
> > > 
> > > Note that I'm using a multi-threaded workqueue internally at the moment
> > > to deal with all sort of interactions with the FW (Mali HW only has a
> > > limited amount of scheduling slots, and we need to rotate entities
> > > having jobs to execute so every one gets a chance to run on the GPU),
> > > but this has been designed this way from the ground up, unlike
> > > drm_sched_main() operations, which were mostly thought as a fixed
> > > sequential set of operations. That's not to say it's impossible to get
> > > right, but I fear we'll face weird/unexpected behavior if we go from
> > > completely-serialized to multi-threaded-with-pseudo-random-processing
> > > order.  
> > 
> > From a per job perspective it's still all sequential and besides fence
> > dependencies,
> 
> Sure, per job ops are still sequential (run, then cleanup once parent
> fence is signalled).
> 
> > which are still resolved, I don't see where jobs could have cross
> > dependencies that make this racy. But agree that it's probably worth to think
> > through it a bit more.
> > 
> > >   
> > > > 
> > > > That latter seems a bit subtle, we probably need to document this aspect of
> > > > under which conditions free_job() is or is not within the fence signaling path.  
> > > 
> > > Well, I'm not even sure it can be clearly defined when the driver is
> > > using the submit_wq for its own work items (which can be done since we
> > > pass an optional submit_wq when calling drm_sched_init()). Sure, having
> > > max_active >= 2 should be enough to guarantee that the free_job work
> > > won't block the run_job one when these are the 2 only works being
> > > queued, but what if you have many other work items being queued by the
> > > driver to this wq, and some of those try to acquire resv locks? Could
> > > this prevent execution of the run_job() callback, thus preventing
> > > signaling of fences? I'm genuinely asking, don't know enough about the
> > > cmwq implementation to tell what's happening when work items are
> > > blocked (might be that the worker pool is extended to unblock the
> > > situation).  
> > 
> > Yes, I think so. If max_active would be 2 and you have two jobs running on this
> > workqueue already waiting on allocations then the 3rd job signaling the fence
> > the allocation is blocked by would be stuck and we'd have a deadlock I guess.
> > 
> > But that's where I start to see the driver being responsible not to pass a
> > workqueue to the driver where it queues up other work, either at all, or that
> > interferes with fence signaling paths.
> > 
> > So, I guess the message here would be something like: free_job() must be
> > considered to be in the fence signaling path, unless the submit_wq is a
> > multi-threaded workqueue with max_active > 1 *dedicated* to the DRM scheduler.
> 
> If it's meant to be dedicated to the drm scheduler, is there any point
> passing a custom submit_wq? I mean, we could start with a dedicated
> ordered-wq created by the core to replace the kthread, and then, once
> enough testing has been done to make sure things work correctly in a MT
> env, switch everyone to a multithreaded-wq. The fact that we let the
> caller pass its own workqueue, to then restrict its usage to things
> directly related to drm_sched is somewhat confusing.
> 

Well, "dedicated to the scheduler" and the other conditions are only related to
giving a guarantee about free_job() not being in the fence signaling path.

The driver could still simply not care whether free_job() is in the fence
signaling path or not.

Also, drivers can still queue their own work to the given workqueue, as long as
that work complies with the rules for fence signaling critical sections.

I'd be absolutely fine leaving this to the driver, as long as we properly
document it.

It depends on the design goal. If we want to say free_job() is always safe,
then we need to entirely restrict it.

> > Otherwise it's the driver's full responsibility to make sure it doesn't violate
> > the rules.
> 
> Yeah, that's what I'm worried about tbh. There's so many subtle ways we
> let DRM drivers shoot themselves in the foot already, using the
> excuse we want drivers to be in control (for optimization/perf
> concerns). I'm just not comfortable adding one more way of doing that,
> especially given drm_sched has been one thread calling multiple hooks
> sequentially until now, which is essentially what an ordered wq would
> provide.
> 
> > 
> > > 
> > > Anyway, documenting when free_job() is in the dma signalling path should
> > > be doable (single-threaded wq), but at this point, are we not better
> > > off considering anything called from the submit_wq as being part of the
> > > dma signalling path, so we can accommodate with both cases. And if
> > > there is cleanup processing that require taking dma_resv locks, I'd be
> > > tempted to queue that to a driver-specific wq (which is what I'm doing
> > > right now), just to be safe.
> > >   
> > 
> > It's not only the dma-resv lock, it's any lock under which allocations may be
> > performed.
> 
> Sure, I was taking the resv lock in example, because that's easy to
> reason about, but that's indeed any lock being taken while doing
> allocations that don't have the GFP_{NOWAIT,ATOMIC} flags set.
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-08-17 11:13               ` Danilo Krummrich
  2023-08-17 13:35                 ` Christian König
  2023-08-18  3:08                 ` Matthew Brost
@ 2023-09-12 14:28                 ` Boris Brezillon
  2023-09-12 14:33                   ` Danilo Krummrich
  2 siblings, 1 reply; 80+ messages in thread
From: Boris Brezillon @ 2023-09-12 14:28 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, donald.robson, Christian König, faith.ekstrand

On Thu, 17 Aug 2023 13:13:31 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> I think that's a misunderstanding. I'm not trying to say that it is 
> *always* beneficial to fill up the ring as much as possible. But I think 
> it is under certain circumstances, exactly those circumstances I 
> described for Nouveau.
> 
> As mentioned, in Nouveau the size of a job is only really limited by the 
> ring size, which means that one job can (but does not necessarily) fill 
> up the whole ring. We both agree that this is inefficient, because it 
> potentially results into the HW run dry due to hw_submission_limit == 1.
> 
> I recognize you said that one should define hw_submission_limit and 
> adjust the other parts of the equation accordingly, the options I see are:
> 
> (1) Increase the ring size while keeping the maximum job size.
> (2) Decrease the maximum job size while keeping the ring size.
> (3) Let the scheduler track the actual job size rather than the maximum 
> job size.
> 
> (1) results into potentially wasted ring memory, because we're not 
> always reaching the maximum job size, but the scheduler assumes so.
> 
> (2) results into more IOCTLs from userspace for the same amount of IBs 
> and more jobs result into more memory allocations and more work being 
> submitted to the workqueue (with Matt's patches).
> 
> (3) doesn't seem to have any of those draw backs.
> 
> What would be your take on that?
> 
> Actually, if none of the other drivers is interested into a more precise 
> way of keeping track of the ring utilization, I'd be totally fine to do 
> it in a driver specific way. However, unfortunately I don't see how this 
> would be possible.

I'm not entirely sure, but I think PowerVR is pretty close to your
description: job sizes are dynamic, and the ring buffer size is
picked by the driver at queue initialization time. What we did was to
set hw_submission_limit to an arbitrarily high value of 64k (we could
have used something like ringbuf_size/min_job_size instead), and then
have the control flow implemented with ->prepare_job() [1] (CCCB is the
PowerVR ring buffer). This allows us to maximize ring buffer utilization
while still allowing dynamic-size jobs.
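
In a nutshell the pattern is the following (heavily simplified from the
PowerVR code linked at [1]; all the my_*() helpers and types are made up,
only the ->prepare_job() hook itself is real):

	static struct dma_fence *
	my_queue_prepare_job(struct drm_sched_job *sched_job,
			     struct drm_sched_entity *s_entity)
	{
		struct my_job *job = to_my_job(sched_job);
		struct my_queue *queue = to_my_queue(s_entity);

		/* If the job doesn't currently fit in the ring buffer, return
		 * a fence that gets signaled once completed jobs have released
		 * enough space; drm_sched waits on it before calling
		 * ->run_job(). */
		if (!my_ringbuf_can_fit(queue, job->ringbuf_size))
			return my_get_ringbuf_space_fence(queue, job->ringbuf_size);

		return NULL;	/* enough space, the job can run */
	}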

> 
> My proposal would be to just keep the hw_submission_limit (maybe rename 
> it to submission_unit_limit) and add a submission_units field to struct 
> drm_sched_job. By default a jobs submission_units field would be 0 and 
> the scheduler would behave the exact same way as it does now.
> 
> Accordingly, jobs with submission_units > 1 would contribute more than 
> one unit to the submission_unit_limit.
> 
> What do you think about that?
> 
> Besides all that, you said that filling up the ring just enough to not 
> let the HW run dry rather than filling it up entirely is desirable. Why 
> do you think so? I tend to think that in most cases it shouldn't make 
> difference.

[1]https://gitlab.freedesktop.org/frankbinns/powervr/-/blob/powervr-next/drivers/gpu/drm/imagination/pvr_queue.c#L502

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-09-12 14:28                 ` Boris Brezillon
@ 2023-09-12 14:33                   ` Danilo Krummrich
  2023-09-12 14:49                     ` Boris Brezillon
  0 siblings, 1 reply; 80+ messages in thread
From: Danilo Krummrich @ 2023-09-12 14:33 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, donald.robson, Christian König, faith.ekstrand

On 9/12/23 16:28, Boris Brezillon wrote:
> On Thu, 17 Aug 2023 13:13:31 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
>> I think that's a misunderstanding. I'm not trying to say that it is
>> *always* beneficial to fill up the ring as much as possible. But I think
>> it is under certain circumstances, exactly those circumstances I
>> described for Nouveau.
>>
>> As mentioned, in Nouveau the size of a job is only really limited by the
>> ring size, which means that one job can (but does not necessarily) fill
>> up the whole ring. We both agree that this is inefficient, because it
>> potentially results into the HW run dry due to hw_submission_limit == 1.
>>
>> I recognize you said that one should define hw_submission_limit and
>> adjust the other parts of the equation accordingly, the options I see are:
>>
>> (1) Increase the ring size while keeping the maximum job size.
>> (2) Decrease the maximum job size while keeping the ring size.
>> (3) Let the scheduler track the actual job size rather than the maximum
>> job size.
>>
>> (1) results into potentially wasted ring memory, because we're not
>> always reaching the maximum job size, but the scheduler assumes so.
>>
>> (2) results into more IOCTLs from userspace for the same amount of IBs
>> and more jobs result into more memory allocations and more work being
>> submitted to the workqueue (with Matt's patches).
>>
>> (3) doesn't seem to have any of those draw backs.
>>
>> What would be your take on that?
>>
>> Actually, if none of the other drivers is interested into a more precise
>> way of keeping track of the ring utilization, I'd be totally fine to do
>> it in a driver specific way. However, unfortunately I don't see how this
>> would be possible.
> 
> I'm not entirely sure, but I think PowerVR is pretty close to your
> description: jobs size is dynamic size, and the ring buffer size is
> picked by the driver at queue initialization time. What we did was to
> set hw_submission_limit to an arbitrarily high value of 64k (we could
> have used something like ringbuf_size/min_job_size instead), and then
> have the control flow implemented with ->prepare_job() [1] (CCCB is the
> PowerVR ring buffer). This allows us to maximize ring buffer utilization
> while still allowing dynamic-size jobs.

I guess this would work, but I think it would be better to bake this in,
especially if more drivers do have this need. I already have an
implementation [1] for doing that in the scheduler. My plan was to push
that as soon as Matt sends out V3.

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95

> 
>>
>> My proposal would be to just keep the hw_submission_limit (maybe rename
>> it to submission_unit_limit) and add a submission_units field to struct
>> drm_sched_job. By default a jobs submission_units field would be 0 and
>> the scheduler would behave the exact same way as it does now.
>>
>> Accordingly, jobs with submission_units > 1 would contribute more than
>> one unit to the submission_unit_limit.
>>
>> What do you think about that?
>>
>> Besides all that, you said that filling up the ring just enough to not
>> let the HW run dry rather than filling it up entirely is desirable. Why
>> do you think so? I tend to think that in most cases it shouldn't make
>> difference.
> 
> [1]https://gitlab.freedesktop.org/frankbinns/powervr/-/blob/powervr-next/drivers/gpu/drm/imagination/pvr_queue.c#L502
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-09-12 14:33                   ` Danilo Krummrich
@ 2023-09-12 14:49                     ` Boris Brezillon
  2023-09-12 15:13                       ` Boris Brezillon
  2023-09-12 16:52                       ` Danilo Krummrich
  0 siblings, 2 replies; 80+ messages in thread
From: Boris Brezillon @ 2023-09-12 14:49 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, donald.robson, Christian König, faith.ekstrand

On Tue, 12 Sep 2023 16:33:01 +0200
Danilo Krummrich <dakr@redhat.com> wrote:

> On 9/12/23 16:28, Boris Brezillon wrote:
> > On Thu, 17 Aug 2023 13:13:31 +0200
> > Danilo Krummrich <dakr@redhat.com> wrote:
> >   
> >> I think that's a misunderstanding. I'm not trying to say that it is
> >> *always* beneficial to fill up the ring as much as possible. But I think
> >> it is under certain circumstances, exactly those circumstances I
> >> described for Nouveau.
> >>
> >> As mentioned, in Nouveau the size of a job is only really limited by the
> >> ring size, which means that one job can (but does not necessarily) fill
> >> up the whole ring. We both agree that this is inefficient, because it
> >> potentially results into the HW run dry due to hw_submission_limit == 1.
> >>
> >> I recognize you said that one should define hw_submission_limit and
> >> adjust the other parts of the equation accordingly, the options I see are:
> >>
> >> (1) Increase the ring size while keeping the maximum job size.
> >> (2) Decrease the maximum job size while keeping the ring size.
> >> (3) Let the scheduler track the actual job size rather than the maximum
> >> job size.
> >>
> >> (1) results into potentially wasted ring memory, because we're not
> >> always reaching the maximum job size, but the scheduler assumes so.
> >>
> >> (2) results into more IOCTLs from userspace for the same amount of IBs
> >> and more jobs result into more memory allocations and more work being
> >> submitted to the workqueue (with Matt's patches).
> >>
> >> (3) doesn't seem to have any of those draw backs.
> >>
> >> What would be your take on that?
> >>
> >> Actually, if none of the other drivers is interested into a more precise
> >> way of keeping track of the ring utilization, I'd be totally fine to do
> >> it in a driver specific way. However, unfortunately I don't see how this
> >> would be possible.  
> > 
> > I'm not entirely sure, but I think PowerVR is pretty close to your
> > description: jobs size is dynamic size, and the ring buffer size is
> > picked by the driver at queue initialization time. What we did was to
> > set hw_submission_limit to an arbitrarily high value of 64k (we could
> > have used something like ringbuf_size/min_job_size instead), and then
> > have the control flow implemented with ->prepare_job() [1] (CCCB is the
> > PowerVR ring buffer). This allows us to maximize ring buffer utilization
> > while still allowing dynamic-size jobs.  
> 
> I guess this would work, but I think it would be better to bake this in,
> especially if more drivers do have this need. I already have an
> implementation [1] for doing that in the scheduler. My plan was to push
> that as soon as Matt sends out V3.
> 
> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95

PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that
native fence waits are passed to the FW, and those add to the job size.
When we know our job is ready for execution (all non-native deps are
signaled), we evict already signaled native-deps (or native fences) to
shrink the job size even further, but that's something we need to
calculate late if we want the job size to be minimal. Of course, we can
always over-estimate the job size, but if we go for a full-blown
drm_sched integration, I wonder if it wouldn't be preferable to have a
->get_job_size() callback returning the number of units needed by a job,
and have the core pick 1 when the hook is not implemented.
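
On the core side that could be as simple as the below (purely hypothetical,
the hook doesn't exist today):

	/* How many submission units a job consumes; defaulting to 1 keeps the
	 * current hw_submission_limit behavior for drivers that don't care. */
	static u32 drm_sched_job_units(struct drm_sched_job *job)
	{
		const struct drm_sched_backend_ops *ops = job->sched->ops;

		return ops->get_job_size ? ops->get_job_size(job) : 1;
	}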

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-09-12 14:49                     ` Boris Brezillon
@ 2023-09-12 15:13                       ` Boris Brezillon
  2023-09-12 16:58                         ` Danilo Krummrich
  2023-09-12 16:52                       ` Danilo Krummrich
  1 sibling, 1 reply; 80+ messages in thread
From: Boris Brezillon @ 2023-09-12 15:13 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, donald.robson, Christian König, faith.ekstrand

On Tue, 12 Sep 2023 16:49:09 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:

> On Tue, 12 Sep 2023 16:33:01 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
> > On 9/12/23 16:28, Boris Brezillon wrote:  
> > > On Thu, 17 Aug 2023 13:13:31 +0200
> > > Danilo Krummrich <dakr@redhat.com> wrote:
> > >     
> > >> I think that's a misunderstanding. I'm not trying to say that it is
> > >> *always* beneficial to fill up the ring as much as possible. But I think
> > >> it is under certain circumstances, exactly those circumstances I
> > >> described for Nouveau.
> > >>
> > >> As mentioned, in Nouveau the size of a job is only really limited by the
> > >> ring size, which means that one job can (but does not necessarily) fill
> > >> up the whole ring. We both agree that this is inefficient, because it
> > >> potentially results into the HW run dry due to hw_submission_limit == 1.
> > >>
> > >> I recognize you said that one should define hw_submission_limit and
> > >> adjust the other parts of the equation accordingly, the options I see are:
> > >>
> > >> (1) Increase the ring size while keeping the maximum job size.
> > >> (2) Decrease the maximum job size while keeping the ring size.
> > >> (3) Let the scheduler track the actual job size rather than the maximum
> > >> job size.
> > >>
> > >> (1) results into potentially wasted ring memory, because we're not
> > >> always reaching the maximum job size, but the scheduler assumes so.
> > >>
> > >> (2) results into more IOCTLs from userspace for the same amount of IBs
> > >> and more jobs result into more memory allocations and more work being
> > >> submitted to the workqueue (with Matt's patches).
> > >>
> > >> (3) doesn't seem to have any of those draw backs.
> > >>
> > >> What would be your take on that?
> > >>
> > >> Actually, if none of the other drivers is interested into a more precise
> > >> way of keeping track of the ring utilization, I'd be totally fine to do
> > >> it in a driver specific way. However, unfortunately I don't see how this
> > >> would be possible.    
> > > 
> > > I'm not entirely sure, but I think PowerVR is pretty close to your
> > > description: jobs size is dynamic size, and the ring buffer size is
> > > picked by the driver at queue initialization time. What we did was to
> > > set hw_submission_limit to an arbitrarily high value of 64k (we could
> > > have used something like ringbuf_size/min_job_size instead), and then
> > > have the control flow implemented with ->prepare_job() [1] (CCCB is the
> > > PowerVR ring buffer). This allows us to maximize ring buffer utilization
> > > while still allowing dynamic-size jobs.    
> > 
> > I guess this would work, but I think it would be better to bake this in,
> > especially if more drivers do have this need. I already have an
> > implementation [1] for doing that in the scheduler. My plan was to push
> > that as soon as Matt sends out V3.
> > 
> > [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95  
> 
> PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that
> native fences waits are passed to the FW, and those add to the job size.
> When we know our job is ready for execution (all non-native deps are
> signaled), we evict already signaled native-deps (or native fences) to
> shrink the job size further more, but that's something we need to
> calculate late if we want the job size to be minimal. Of course, we can
> always over-estimate the job size, but if we go for a full-blown
> drm_sched integration, I wonder if it wouldn't be preferable to have a
> ->get_job_size() callback returning the number of units needed by job,  
> and have the core pick 1 when the hook is not implemented.

FWIW, I think the last time I asked how to do that, I was pointed to
->prepare_job() by someone (don't remember if it was Daniel or
Christian), hence the PowerVR implementation. If that's still the
preferred solution, there's some opportunity to have a generic layer to
automate ringbuf utilization tracking and some helpers to prepare
wait_for_ringbuf dma_fences that drivers could return from
->prepare_job() (those fences would then be signaled when the driver
calls drm_ringbuf_job_done() and the next job waiting for ringbuf space
now fits in the ringbuf).

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-09-12 14:49                     ` Boris Brezillon
  2023-09-12 15:13                       ` Boris Brezillon
@ 2023-09-12 16:52                       ` Danilo Krummrich
  1 sibling, 0 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-09-12 16:52 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, donald.robson, Christian König, faith.ekstrand

On 9/12/23 16:49, Boris Brezillon wrote:
> On Tue, 12 Sep 2023 16:33:01 +0200
> Danilo Krummrich <dakr@redhat.com> wrote:
> 
>> On 9/12/23 16:28, Boris Brezillon wrote:
>>> On Thu, 17 Aug 2023 13:13:31 +0200
>>> Danilo Krummrich <dakr@redhat.com> wrote:
>>>    
>>>> I think that's a misunderstanding. I'm not trying to say that it is
>>>> *always* beneficial to fill up the ring as much as possible. But I think
>>>> it is under certain circumstances, exactly those circumstances I
>>>> described for Nouveau.
>>>>
>>>> As mentioned, in Nouveau the size of a job is only really limited by the
>>>> ring size, which means that one job can (but does not necessarily) fill
>>>> up the whole ring. We both agree that this is inefficient, because it
>>>> potentially results into the HW run dry due to hw_submission_limit == 1.
>>>>
>>>> I recognize you said that one should define hw_submission_limit and
>>>> adjust the other parts of the equation accordingly, the options I see are:
>>>>
>>>> (1) Increase the ring size while keeping the maximum job size.
>>>> (2) Decrease the maximum job size while keeping the ring size.
>>>> (3) Let the scheduler track the actual job size rather than the maximum
>>>> job size.
>>>>
>>>> (1) results into potentially wasted ring memory, because we're not
>>>> always reaching the maximum job size, but the scheduler assumes so.
>>>>
>>>> (2) results into more IOCTLs from userspace for the same amount of IBs
>>>> and more jobs result into more memory allocations and more work being
>>>> submitted to the workqueue (with Matt's patches).
>>>>
>>>> (3) doesn't seem to have any of those draw backs.
>>>>
>>>> What would be your take on that?
>>>>
>>>> Actually, if none of the other drivers is interested into a more precise
>>>> way of keeping track of the ring utilization, I'd be totally fine to do
>>>> it in a driver specific way. However, unfortunately I don't see how this
>>>> would be possible.
>>>
>>> I'm not entirely sure, but I think PowerVR is pretty close to your
>>> description: jobs size is dynamic size, and the ring buffer size is
>>> picked by the driver at queue initialization time. What we did was to
>>> set hw_submission_limit to an arbitrarily high value of 64k (we could
>>> have used something like ringbuf_size/min_job_size instead), and then
>>> have the control flow implemented with ->prepare_job() [1] (CCCB is the
>>> PowerVR ring buffer). This allows us to maximize ring buffer utilization
>>> while still allowing dynamic-size jobs.
>>
>> I guess this would work, but I think it would be better to bake this in,
>> especially if more drivers do have this need. I already have an
>> implementation [1] for doing that in the scheduler. My plan was to push
>> that as soon as Matt sends out V3.
>>
>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95
> 
> PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that
> native fences waits are passed to the FW, and those add to the job size.
> When we know our job is ready for execution (all non-native deps are
> signaled), we evict already signaled native-deps (or native fences) to
> shrink the job size further more, but that's something we need to
> calculate late if we want the job size to be minimal. Of course, we can
> always over-estimate the job size, but if we go for a full-blown
> drm_sched integration, I wonder if it wouldn't be preferable to have a
> ->get_job_size() callback returning the number of units needed by job,
> and have the core pick 1 when the hook is not implemented.
> 

Sure, why not. Sounds reasonable to me.


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread
  2023-09-12 15:13                       ` Boris Brezillon
@ 2023-09-12 16:58                         ` Danilo Krummrich
  0 siblings, 0 replies; 80+ messages in thread
From: Danilo Krummrich @ 2023-09-12 16:58 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Matthew Brost, robdclark, sarah.walker, thomas.hellstrom,
	ketil.johnsen, lina, Liviu.Dudau, dri-devel, intel-xe,
	luben.tuikov, donald.robson, Christian König, faith.ekstrand

On 9/12/23 17:13, Boris Brezillon wrote:
> On Tue, 12 Sep 2023 16:49:09 +0200
> Boris Brezillon <boris.brezillon@collabora.com> wrote:
> 
>> On Tue, 12 Sep 2023 16:33:01 +0200
>> Danilo Krummrich <dakr@redhat.com> wrote:
>>
>>> On 9/12/23 16:28, Boris Brezillon wrote:
>>>> On Thu, 17 Aug 2023 13:13:31 +0200
>>>> Danilo Krummrich <dakr@redhat.com> wrote:
>>>>      
>>>>> I think that's a misunderstanding. I'm not trying to say that it is
>>>>> *always* beneficial to fill up the ring as much as possible. But I think
>>>>> it is under certain circumstances, exactly those circumstances I
>>>>> described for Nouveau.
>>>>>
>>>>> As mentioned, in Nouveau the size of a job is only really limited by the
>>>>> ring size, which means that one job can (but does not necessarily) fill
>>>>> up the whole ring. We both agree that this is inefficient, because it
>>>>> potentially results into the HW run dry due to hw_submission_limit == 1.
>>>>>
>>>>> I recognize you said that one should define hw_submission_limit and
>>>>> adjust the other parts of the equation accordingly, the options I see are:
>>>>>
>>>>> (1) Increase the ring size while keeping the maximum job size.
>>>>> (2) Decrease the maximum job size while keeping the ring size.
>>>>> (3) Let the scheduler track the actual job size rather than the maximum
>>>>> job size.
>>>>>
>>>>> (1) results into potentially wasted ring memory, because we're not
>>>>> always reaching the maximum job size, but the scheduler assumes so.
>>>>>
>>>>> (2) results into more IOCTLs from userspace for the same amount of IBs
>>>>> and more jobs result into more memory allocations and more work being
>>>>> submitted to the workqueue (with Matt's patches).
>>>>>
>>>>> (3) doesn't seem to have any of those draw backs.
>>>>>
>>>>> What would be your take on that?
>>>>>
>>>>> Actually, if none of the other drivers is interested into a more precise
>>>>> way of keeping track of the ring utilization, I'd be totally fine to do
>>>>> it in a driver specific way. However, unfortunately I don't see how this
>>>>> would be possible.
>>>>
>>>> I'm not entirely sure, but I think PowerVR is pretty close to your
>>>> description: jobs size is dynamic size, and the ring buffer size is
>>>> picked by the driver at queue initialization time. What we did was to
>>>> set hw_submission_limit to an arbitrarily high value of 64k (we could
>>>> have used something like ringbuf_size/min_job_size instead), and then
>>>> have the control flow implemented with ->prepare_job() [1] (CCCB is the
>>>> PowerVR ring buffer). This allows us to maximize ring buffer utilization
>>>> while still allowing dynamic-size jobs.
>>>
>>> I guess this would work, but I think it would be better to bake this in,
>>> especially if more drivers do have this need. I already have an
>>> implementation [1] for doing that in the scheduler. My plan was to push
>>> that as soon as Matt sends out V3.
>>>
>>> [1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commit/269f05d6a2255384badff8b008b3c32d640d2d95
>>
>> PowerVR's ->can_fit_in_ringbuf() logic is a bit more involved in that
>> native fences waits are passed to the FW, and those add to the job size.
>> When we know our job is ready for execution (all non-native deps are
>> signaled), we evict already signaled native-deps (or native fences) to
>> shrink the job size further more, but that's something we need to
>> calculate late if we want the job size to be minimal. Of course, we can
>> always over-estimate the job size, but if we go for a full-blown
>> drm_sched integration, I wonder if it wouldn't be preferable to have a
>> ->get_job_size() callback returning the number of units needed by job,
>> and have the core pick 1 when the hook is not implemented.
> 
> FWIW, I think last time I asked how to do that, I've been pointed to
> ->prepare_job() by someone  (don't remember if it was Daniel or
> Christian), hence the PowerVR implementation. If that's still the
> preferred solution, there's some opportunity to have a generic layer to
> automate ringbuf utilization tracking and some helpers to prepare
> wait_for_ringbuf dma_fences that drivers could return from
> ->prepare_job() (those fences would then be signaled when the driver
> calls drm_ringbuf_job_done() and the next job waiting for ringbuf space
> now fits in the ringbuf).
> 

Not sure I like that; it's basically a different implementation to work
around the limitations of an implementation that is supposed to cover this case
in general.


^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2023-09-12 16:58 UTC | newest]

Thread overview: 80+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-11  2:31 [PATCH v2 0/9] DRM scheduler changes for Xe Matthew Brost
2023-08-11  2:31 ` [PATCH v2 1/9] drm/sched: Convert drm scheduler to use a work queue rather than kthread Matthew Brost
2023-08-16 11:30   ` Danilo Krummrich
2023-08-16 14:05     ` Christian König
2023-08-16 12:30       ` Danilo Krummrich
2023-08-16 14:38         ` Matthew Brost
2023-08-16 15:40           ` Danilo Krummrich
2023-08-16 14:59         ` Christian König
2023-08-16 16:33           ` Danilo Krummrich
2023-08-17  5:33             ` Christian König
2023-08-17 11:13               ` Danilo Krummrich
2023-08-17 13:35                 ` Christian König
2023-08-17 12:48                   ` Danilo Krummrich
2023-08-17 16:17                     ` Christian König
2023-08-18 11:58                       ` Danilo Krummrich
2023-08-21 14:07                         ` Christian König
2023-08-21 18:01                           ` Danilo Krummrich
2023-08-21 18:12                             ` Christian König
2023-08-21 19:07                               ` Danilo Krummrich
2023-08-22  9:35                                 ` Christian König
2023-08-21 19:46                               ` Faith Ekstrand
2023-08-22  9:51                                 ` Christian König
2023-08-22 16:55                                   ` Faith Ekstrand
2023-08-24 11:50                                     ` Bas Nieuwenhuizen
2023-08-18  3:08                 ` Matthew Brost
2023-08-18  5:40                   ` Christian König
2023-08-18 12:49                     ` Matthew Brost
2023-08-18 12:06                       ` Danilo Krummrich
2023-09-12 14:28                 ` Boris Brezillon
2023-09-12 14:33                   ` Danilo Krummrich
2023-09-12 14:49                     ` Boris Brezillon
2023-09-12 15:13                       ` Boris Brezillon
2023-09-12 16:58                         ` Danilo Krummrich
2023-09-12 16:52                       ` Danilo Krummrich
2023-08-11  2:31 ` [PATCH v2 2/9] drm/sched: Move schedule policy to scheduler / entity Matthew Brost
2023-08-11 21:43   ` Maira Canal
2023-08-12  3:20     ` Matthew Brost
2023-08-11  2:31 ` [PATCH v2 3/9] drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy Matthew Brost
2023-08-29 17:37   ` Danilo Krummrich
2023-09-05 11:10     ` Danilo Krummrich
2023-09-11 19:44       ` Matthew Brost
2023-08-11  2:31 ` [PATCH v2 4/9] drm/sched: Split free_job into own work item Matthew Brost
2023-08-17 13:39   ` Christian König
2023-08-17 17:54     ` Matthew Brost
2023-08-18  5:27       ` Christian König
2023-08-18 13:13         ` Matthew Brost
2023-08-21 13:17           ` Christian König
2023-08-23  3:27             ` Matthew Brost
2023-08-23  7:10               ` Christian König
2023-08-23 15:24                 ` Matthew Brost
2023-08-23 15:41                   ` Alex Deucher
2023-08-23 17:26                     ` [Intel-xe] " Rodrigo Vivi
2023-08-23 23:12                       ` Matthew Brost
2023-08-24 11:44                         ` Christian König
2023-08-24 14:30                           ` Matthew Brost
2023-08-24 23:04   ` Danilo Krummrich
2023-08-25  2:58     ` Matthew Brost
2023-08-25  8:02       ` Christian König
2023-08-25 13:36         ` Matthew Brost
2023-08-25 13:45           ` Christian König
2023-09-12 10:13             ` Boris Brezillon
2023-09-12 10:46               ` Danilo Krummrich
2023-09-12 12:18                 ` Boris Brezillon
2023-09-12 12:56                   ` Danilo Krummrich
2023-09-12 13:52                     ` Boris Brezillon
2023-09-12 14:10                       ` Danilo Krummrich
2023-09-12 13:27             ` Boris Brezillon
2023-09-12 13:34               ` Danilo Krummrich
2023-09-12 13:53                 ` Boris Brezillon
2023-08-28 18:04   ` Danilo Krummrich
2023-08-28 18:41     ` Matthew Brost
2023-08-29  1:20       ` Danilo Krummrich
2023-08-11  2:31 ` [PATCH v2 5/9] drm/sched: Add generic scheduler message interface Matthew Brost
2023-08-11  2:31 ` [PATCH v2 6/9] drm/sched: Add drm_sched_start_timeout_unlocked helper Matthew Brost
2023-08-11  2:31 ` [PATCH v2 7/9] drm/sched: Start run wq before TDR in drm_sched_start Matthew Brost
2023-08-11  2:31 ` [PATCH v2 8/9] drm/sched: Submit job before starting TDR Matthew Brost
2023-08-11  2:31 ` [PATCH v2 9/9] drm/sched: Add helper to set TDR timeout Matthew Brost
2023-08-24  0:08 ` [PATCH v2 0/9] DRM scheduler changes for Xe Danilo Krummrich
2023-08-24  3:23   ` Matthew Brost
2023-08-24 14:51     ` Danilo Krummrich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).