* [RFC 0/5] Add capacity key to fdinfo
@ 2024-04-30 17:27 Tvrtko Ursulin
  2024-04-30 17:27 ` [RFC 1/5] drm/amdgpu: Cache number of rings per hw ip type Tvrtko Ursulin
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-04-30 17:27 UTC (permalink / raw
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

I have noticed AMD GPUs can have more than one "engine" (ring?) of the same type
but amdgpu is not reporting that in fdinfo using the capacity engine tag.

This series is therefore an attempt to improve that, but it is only an RFC since
it is quite likely I got stuff wrong on the first attempt. Or, if not wrong, it
may not be very beneficial in AMD's case.

So I tried to figure out how to count and store the number of instances of an
"engine" type and spotted that it could perhaps be used in more than one place
in the driver. I was more than a little bit confused by the ip_instance and uapi
rings, and then by how rings are assigned to context entities internally.
Anyway, hopefully it is a simple enough series to easily spot any such large
misses.

The end result should be that, assuming two "engine" instances with one fully
loaded and one idle, a client will only be reported as using 50% of that engine
type.
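
For illustration, with the capacity key added in patch 4 the relevant fdinfo
lines for such a client could look roughly like this (values made up, engine
name per amdgpu_ip_name, assuming two compute rings):

  drm-engine-compute:	1500000000 ns
  drm-engine-capacity-compute:	2

A tool like gputop can then divide the busyness delta over a sampling period
by the elapsed time multiplied by the capacity, which is where the 50% above
comes from.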

Tvrtko Ursulin (5):
  drm/amdgpu: Cache number of rings per hw ip type
  drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO
    ioctl
  drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
  drm/amdgpu: Show engine capacity in fdinfo
  drm/amdgpu: Only show VRAM in fdinfo if it exists

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62 +++-------------------
 5 files changed, 49 insertions(+), 70 deletions(-)

-- 
2.44.0


* [RFC 1/5] drm/amdgpu: Cache number of rings per hw ip type
  2024-04-30 17:27 [RFC 0/5] Add capacity key to fdinfo Tvrtko Ursulin
@ 2024-04-30 17:27 ` Tvrtko Ursulin
  2024-04-30 17:27 ` [RFC 2/5] drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO ioctl Tvrtko Ursulin
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-04-30 17:27 UTC (permalink / raw
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

With the end goal of using the number of rings to expose capacity in fdinfo,
and the observation that the number could be used elsewhere in the driver,
let's cache it during the driver init phase.

We count the number of created scheduler instances for the user visible hw ip
block types, which corresponds to the number of user visible rings.
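
For context, later patches then only need an array lookup to consume the
cached value, for example (illustrative only):

	num_rings = adev->num_ip_rings[info->query_hw_ip.type];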

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 ++++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index f87d53e183c3..4f394602bbd8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -964,6 +964,7 @@ struct amdgpu_device {
 	/* rings */
 	u64				fence_context;
 	unsigned			num_rings;
+	unsigned			num_ip_rings[AMDGPU_HW_IP_NUM];
 	struct amdgpu_ring		*rings[AMDGPU_MAX_RINGS];
 	struct dma_fence __rcu		*gang_submit;
 	bool				ib_pool_ready;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 861ccff78af9..e421f352d77e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2756,6 +2756,20 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 				  ring->name);
 			return r;
 		}
+
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_GFX	!= AMDGPU_HW_IP_GFX);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_COMPUTE	!= AMDGPU_HW_IP_COMPUTE);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_SDMA	!= AMDGPU_HW_IP_DMA);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_UVD	!= AMDGPU_HW_IP_UVD);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_VCE	!= AMDGPU_HW_IP_VCE);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_UVD_ENC	!= AMDGPU_HW_IP_UVD_ENC);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_VCN_DEC	!= AMDGPU_HW_IP_VCN_DEC);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_VCN_ENC	!= AMDGPU_HW_IP_VCN_ENC);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_VCN_JPEG	!= AMDGPU_HW_IP_VCN_JPEG);
+		BUILD_BUG_ON(AMDGPU_RING_TYPE_VPE 	!= AMDGPU_HW_IP_VPE);
+
+		if (ring->funcs->type < ARRAY_SIZE(adev->num_ip_rings))
+			adev->num_ip_rings[ring->funcs->type]++;
 	}
 
 	amdgpu_xcp_update_partition_sched_list(adev);
-- 
2.44.0



* [RFC 2/5] drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO ioctl
  2024-04-30 17:27 [RFC 0/5] Add capacity key to fdinfo Tvrtko Ursulin
  2024-04-30 17:27 ` [RFC 1/5] drm/amdgpu: Cache number of rings per hw ip type Tvrtko Ursulin
@ 2024-04-30 17:27 ` Tvrtko Ursulin
  2024-04-30 17:27 ` [RFC 3/5] drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage Tvrtko Ursulin
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-04-30 17:27 UTC (permalink / raw
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Now that we keep a cached value, it is no longer required to walk the
arrays and check the fuses.

An open question, however, is whether at runtime this query is supposed to
report when a ring has disappeared, for example due to
amdgpu_ib_ring_tests() turning it off.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 62 +++----------------------
 1 file changed, 6 insertions(+), 56 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index a0ea6fe8d060..b6f411c0801a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -376,114 +376,64 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
 	uint32_t ib_start_alignment = 0;
 	uint32_t ib_size_alignment = 0;
 	enum amd_ip_block_type type;
-	unsigned int num_rings = 0;
-	unsigned int i, j;
+	unsigned int num_rings;
+	unsigned int i;
 
 	if (info->query_hw_ip.ip_instance >= AMDGPU_HW_IP_INSTANCE_MAX_COUNT)
 		return -EINVAL;
 
+	if (info->query_hw_ip.type >= ARRAY_SIZE(adev->num_ip_rings))
+		return -EINVAL;
+
 	switch (info->query_hw_ip.type) {
 	case AMDGPU_HW_IP_GFX:
 		type = AMD_IP_BLOCK_TYPE_GFX;
-		for (i = 0; i < adev->gfx.num_gfx_rings; i++)
-			if (adev->gfx.gfx_ring[i].sched.ready)
-				++num_rings;
 		ib_start_alignment = 32;
 		ib_size_alignment = 32;
 		break;
 	case AMDGPU_HW_IP_COMPUTE:
 		type = AMD_IP_BLOCK_TYPE_GFX;
-		for (i = 0; i < adev->gfx.num_compute_rings; i++)
-			if (adev->gfx.compute_ring[i].sched.ready)
-				++num_rings;
 		ib_start_alignment = 32;
 		ib_size_alignment = 32;
 		break;
 	case AMDGPU_HW_IP_DMA:
 		type = AMD_IP_BLOCK_TYPE_SDMA;
-		for (i = 0; i < adev->sdma.num_instances; i++)
-			if (adev->sdma.instance[i].ring.sched.ready)
-				++num_rings;
 		ib_start_alignment = 256;
 		ib_size_alignment = 4;
 		break;
 	case AMDGPU_HW_IP_UVD:
 		type = AMD_IP_BLOCK_TYPE_UVD;
-		for (i = 0; i < adev->uvd.num_uvd_inst; i++) {
-			if (adev->uvd.harvest_config & (1 << i))
-				continue;
-
-			if (adev->uvd.inst[i].ring.sched.ready)
-				++num_rings;
-		}
 		ib_start_alignment = 256;
 		ib_size_alignment = 64;
 		break;
 	case AMDGPU_HW_IP_VCE:
 		type = AMD_IP_BLOCK_TYPE_VCE;
-		for (i = 0; i < adev->vce.num_rings; i++)
-			if (adev->vce.ring[i].sched.ready)
-				++num_rings;
 		ib_start_alignment = 256;
 		ib_size_alignment = 4;
 		break;
 	case AMDGPU_HW_IP_UVD_ENC:
 		type = AMD_IP_BLOCK_TYPE_UVD;
-		for (i = 0; i < adev->uvd.num_uvd_inst; i++) {
-			if (adev->uvd.harvest_config & (1 << i))
-				continue;
-
-			for (j = 0; j < adev->uvd.num_enc_rings; j++)
-				if (adev->uvd.inst[i].ring_enc[j].sched.ready)
-					++num_rings;
-		}
 		ib_start_alignment = 256;
 		ib_size_alignment = 4;
 		break;
 	case AMDGPU_HW_IP_VCN_DEC:
 		type = AMD_IP_BLOCK_TYPE_VCN;
-		for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
-			if (adev->vcn.harvest_config & (1 << i))
-				continue;
-
-			if (adev->vcn.inst[i].ring_dec.sched.ready)
-				++num_rings;
-		}
 		ib_start_alignment = 256;
 		ib_size_alignment = 64;
 		break;
 	case AMDGPU_HW_IP_VCN_ENC:
 		type = AMD_IP_BLOCK_TYPE_VCN;
-		for (i = 0; i < adev->vcn.num_vcn_inst; i++) {
-			if (adev->vcn.harvest_config & (1 << i))
-				continue;
-
-			for (j = 0; j < adev->vcn.num_enc_rings; j++)
-				if (adev->vcn.inst[i].ring_enc[j].sched.ready)
-					++num_rings;
-		}
 		ib_start_alignment = 256;
 		ib_size_alignment = 4;
 		break;
 	case AMDGPU_HW_IP_VCN_JPEG:
 		type = (amdgpu_device_ip_get_ip_block(adev, AMD_IP_BLOCK_TYPE_JPEG)) ?
 			AMD_IP_BLOCK_TYPE_JPEG : AMD_IP_BLOCK_TYPE_VCN;
-
-		for (i = 0; i < adev->jpeg.num_jpeg_inst; i++) {
-			if (adev->jpeg.harvest_config & (1 << i))
-				continue;
-
-			for (j = 0; j < adev->jpeg.num_jpeg_rings; j++)
-				if (adev->jpeg.inst[i].ring_dec[j].sched.ready)
-					++num_rings;
-		}
 		ib_start_alignment = 256;
 		ib_size_alignment = 64;
 		break;
 	case AMDGPU_HW_IP_VPE:
 		type = AMD_IP_BLOCK_TYPE_VPE;
-		if (adev->vpe.ring.sched.ready)
-			++num_rings;
 		ib_start_alignment = 256;
 		ib_size_alignment = 4;
 		break;
@@ -500,7 +450,7 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
 		return 0;
 
 	num_rings = min(amdgpu_ctx_num_entities[info->query_hw_ip.type],
-			num_rings);
+			adev->num_ip_rings[info->query_hw_ip.type]);
 
 	result->hw_ip_version_major = adev->ip_blocks[i].version->major;
 	result->hw_ip_version_minor = adev->ip_blocks[i].version->minor;
-- 
2.44.0



* [RFC 3/5] drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
  2024-04-30 17:27 [RFC 0/5] Add capacity key to fdinfo Tvrtko Ursulin
  2024-04-30 17:27 ` [RFC 1/5] drm/amdgpu: Cache number of rings per hw ip type Tvrtko Ursulin
  2024-04-30 17:27 ` [RFC 2/5] drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO ioctl Tvrtko Ursulin
@ 2024-04-30 17:27 ` Tvrtko Ursulin
  2024-04-30 17:27 ` [RFC 4/5] drm/amdgpu: Show engine capacity in fdinfo Tvrtko Ursulin
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-04-30 17:27 UTC (permalink / raw
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

This avoids a tiny bit of arithmetic and, more importantly, walking those
empty arrays, up to amdgpu_sched_jobs slots and amdgpu_ctx_num_entities, both
of which are pointless if there are no rings of the respective type.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 5cb33ac99f70..b409395f6e0c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -979,6 +979,9 @@ void amdgpu_ctx_mgr_usage(struct amdgpu_ctx_mgr *mgr,
 
 	idr_for_each_entry(&mgr->ctx_handles, ctx, id) {
 		for (hw_ip = 0; hw_ip < AMDGPU_HW_IP_NUM; ++hw_ip) {
+			if (!mgr->adev->num_ip_rings[hw_ip])
+				continue;
+
 			for (i = 0; i < amdgpu_ctx_num_entities[hw_ip]; ++i) {
 				struct amdgpu_ctx_entity *centity;
 				ktime_t spend;
-- 
2.44.0



* [RFC 4/5] drm/amdgpu: Show engine capacity in fdinfo
  2024-04-30 17:27 [RFC 0/5] Add capacity key to fdinfo Tvrtko Ursulin
                   ` (2 preceding siblings ...)
  2024-04-30 17:27 ` [RFC 3/5] drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage Tvrtko Ursulin
@ 2024-04-30 17:27 ` Tvrtko Ursulin
  2024-04-30 17:27 ` [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists Tvrtko Ursulin
  2024-04-30 18:32 ` [RFC 0/5] Add capacity key to fdinfo Alex Deucher
  5 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-04-30 17:27 UTC (permalink / raw
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

In cases where there is more than one user visible engine of the same type,
the DRM fdinfo spec allows the capacity tag to be emitted.

Start doing that so that gputop can adapt and show, for example, 50% if
only one of the two engine instances is fully loaded while the other is
idle.
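
For illustration only (not part of the patch), a monitoring tool could scale
the per-client busyness by the advertised capacity along these lines; the
struct and helper names are made up for the sketch:

	/* Illustrative sketch only; names and sampling mechanism are made up. */
	#include <stdint.h>

	struct engine_sample {
		uint64_t busy_ns;	/* drm-engine-<name> value */
		uint64_t capacity;	/* drm-engine-capacity-<name>, 1 if absent */
		uint64_t wall_ns;	/* timestamp taken when sampling fdinfo */
	};

	/* Utilisation in percent over the interval between two samples. */
	static double engine_utilisation(const struct engine_sample *prev,
					 const struct engine_sample *cur)
	{
		uint64_t busy = cur->busy_ns - prev->busy_ns;
		uint64_t wall = cur->wall_ns - prev->wall_ns;
		uint64_t cap = cur->capacity ? cur->capacity : 1;

		if (!wall)
			return 0.0;

		return 100.0 * (double)busy / ((double)wall * (double)cap);
	}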

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
index c7df7fa3459f..a09944104c41 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
@@ -57,6 +57,7 @@ static const char *amdgpu_ip_name[AMDGPU_HW_IP_NUM] = {
 void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 {
 	struct amdgpu_fpriv *fpriv = file->driver_priv;
+	struct amdgpu_device *adev = fpriv->ctx_mgr.adev;
 	struct amdgpu_vm *vm = &fpriv->vm;
 
 	struct amdgpu_mem_stats stats;
@@ -105,7 +106,12 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 		if (!usage[hw_ip])
 			continue;
 
-		drm_printf(p, "drm-engine-%s:\t%lld ns\n", amdgpu_ip_name[hw_ip],
-			   ktime_to_ns(usage[hw_ip]));
+		drm_printf(p, "drm-engine-%s:\t%lld ns\n",
+			   amdgpu_ip_name[hw_ip], ktime_to_ns(usage[hw_ip]));
+
+		if (adev->num_ip_rings[hw_ip] > 1)
+			drm_printf(p, "drm-engine-capacity-%s:\t%u\n",
+				   amdgpu_ip_name[hw_ip],
+				   adev->num_ip_rings[hw_ip]);
 	}
 }
-- 
2.44.0



* [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists
  2024-04-30 17:27 [RFC 0/5] Add capacity key to fdinfo Tvrtko Ursulin
                   ` (3 preceding siblings ...)
  2024-04-30 17:27 ` [RFC 4/5] drm/amdgpu: Show engine capacity in fdinfo Tvrtko Ursulin
@ 2024-04-30 17:27 ` Tvrtko Ursulin
  2024-05-02 13:16   ` Christian König
  2024-04-30 18:32 ` [RFC 0/5] Add capacity key to fdinfo Alex Deucher
  5 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-04-30 17:27 UTC (permalink / raw
  To: amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin

From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

Do not emit the key-value pairs if the VRAM does not exist, i.e. the VRAM
placement is not valid and accessible.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 29 +++++++++++++---------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
index a09944104c41..603a5c010f5d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
@@ -83,25 +83,30 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	 */
 
 	drm_printf(p, "pasid:\t%u\n", fpriv->vm.pasid);
-	drm_printf(p, "drm-memory-vram:\t%llu KiB\n", stats.vram/1024UL);
 	drm_printf(p, "drm-memory-gtt: \t%llu KiB\n", stats.gtt/1024UL);
 	drm_printf(p, "drm-memory-cpu: \t%llu KiB\n", stats.cpu/1024UL);
-	drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
-		   stats.visible_vram/1024UL);
-	drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
-		   stats.evicted_vram/1024UL);
-	drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
-		   stats.evicted_visible_vram/1024UL);
-	drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
-		   stats.requested_vram/1024UL);
-	drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
-		   stats.requested_visible_vram/1024UL);
 	drm_printf(p, "amd-requested-gtt:\t%llu KiB\n",
 		   stats.requested_gtt/1024UL);
-	drm_printf(p, "drm-shared-vram:\t%llu KiB\n", stats.vram_shared/1024UL);
 	drm_printf(p, "drm-shared-gtt:\t%llu KiB\n", stats.gtt_shared/1024UL);
 	drm_printf(p, "drm-shared-cpu:\t%llu KiB\n", stats.cpu_shared/1024UL);
 
+	if (!adev->gmc.is_app_apu) {
+		drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
+			   stats.vram/1024UL);
+		drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
+			   stats.visible_vram/1024UL);
+		drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
+			   stats.evicted_vram/1024UL);
+		drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
+			   stats.evicted_visible_vram/1024UL);
+		drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
+			   stats.requested_vram/1024UL);
+		drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
+			   stats.requested_visible_vram/1024UL);
+		drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
+			   stats.vram_shared/1024UL);
+	}
+
 	for (hw_ip = 0; hw_ip < AMDGPU_HW_IP_NUM; ++hw_ip) {
 		if (!usage[hw_ip])
 			continue;
-- 
2.44.0



* Re: [RFC 0/5] Add capacity key to fdinfo
  2024-04-30 17:27 [RFC 0/5] Add capacity key to fdinfo Tvrtko Ursulin
                   ` (4 preceding siblings ...)
  2024-04-30 17:27 ` [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists Tvrtko Ursulin
@ 2024-04-30 18:32 ` Alex Deucher
  2024-05-01 13:27   ` Tvrtko Ursulin
  5 siblings, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2024-04-30 18:32 UTC (permalink / raw
  To: Tvrtko Ursulin; +Cc: amd-gfx, kernel-dev, Tvrtko Ursulin

On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin@igalia.com> wrote:
>
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> I have noticed AMD GPUs can have more than one "engine" (ring?) of the same type
> but amdgpu is not reporting that in fdinfo using the capacity engine tag.
>
> This series is therefore an attempt to improve that, but only an RFC since it is
> quite likely I got stuff wrong on the first attempt. Or if not wrong it may not
> be very beneficial in AMDs case.
>
> So I tried to figure out how to count and store the number of instances of an
> "engine" type and spotted that could perhaps be used in more than one place in
> the driver. I was more than a little bit confused by the ip_instance and uapi
> rings, then how rings are selected to context entities internally. Anyway..
> hopefully it is a simple enough series to easily spot any such large misses.
>
> End result should be that, assuming two "engine" instances, one fully loaded and
> one idle will only report client using 50% of that engine type.

That would only be true if there are multiple instantiations of the IP
on the chip which in most cases is not true.  In most cases there is
one instance of the IP that can be fed from multiple rings.  E.g. for
graphics and compute, all of the rings ultimately feed into the same
compute units on the chip.  So if you have a gfx ring and a compute
ring, you can schedule work to them asynchronously, but ultimately
whether they execute serially or in parallel depends on the actual
shader code in the command buffers and the extent to which it can
utilize the available compute units in the shader cores.

As for the UAPI portion of this, we generally expose a limited number
of rings to user space and then we use the GPU scheduler to load
balance between all of the available rings of a type to try and
extract as much parallelism as we can.

Alex


>
> Tvrtko Ursulin (5):
>   drm/amdgpu: Cache number of rings per hw ip type
>   drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO
>     ioctl
>   drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>   drm/amdgpu: Show engine capacity in fdinfo
>   drm/amdgpu: Only show VRAM in fdinfo if it exists
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62 +++-------------------
>  5 files changed, 49 insertions(+), 70 deletions(-)
>
> --
> 2.44.0


* Re: [RFC 0/5] Add capacity key to fdinfo
  2024-04-30 18:32 ` [RFC 0/5] Add capacity key to fdinfo Alex Deucher
@ 2024-05-01 13:27   ` Tvrtko Ursulin
  2024-05-02 13:07     ` Christian König
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-05-01 13:27 UTC (permalink / raw
  To: Alex Deucher, Tvrtko Ursulin; +Cc: amd-gfx, kernel-dev


Hi Alex,

On 30/04/2024 19:32, Alex Deucher wrote:
> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin@igalia.com> wrote:
>>
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> I have noticed AMD GPUs can have more than one "engine" (ring?) of the same type
>> but amdgpu is not reporting that in fdinfo using the capacity engine tag.
>>
>> This series is therefore an attempt to improve that, but only an RFC since it is
>> quite likely I got stuff wrong on the first attempt. Or if not wrong it may not
>> be very beneficial in AMDs case.
>>
>> So I tried to figure out how to count and store the number of instances of an
>> "engine" type and spotted that could perhaps be used in more than one place in
>> the driver. I was more than a little bit confused by the ip_instance and uapi
>> rings, then how rings are selected to context entities internally. Anyway..
>> hopefully it is a simple enough series to easily spot any such large misses.
>>
>> End result should be that, assuming two "engine" instances, one fully loaded and
>> one idle will only report client using 50% of that engine type.
> 
> That would only be true if there are multiple instantiations of the IP
> on the chip which in most cases is not true.  In most cases there is
> one instance of the IP that can be fed from multiple rings.  E.g. for
> graphics and compute, all of the rings ultimately feed into the same
> compute units on the chip.  So if you have a gfx ring and a compute
> rings, you can schedule work to them asynchronously, but ultimately
> whether they execute serially or in parallel depends on the actual
> shader code in the command buffers and the extent to which it can
> utilize the available compute units in the shader cores.

This is the same as with Intel/i915. Fdinfo is not intended to provide 
utilisation of EUs and such, just how busy the "entities" that the kernel 
submits to are. So doing something like in this series would make the 
reporting more similar between the two drivers.

I think both the 0-800% and the 0-100% range (taking 8 compute rings as an 
example) can be misleading for different workloads. Neither does <800% in the 
former mean one can send more work, nor does <100% in the latter.
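
(Taking the 8 ring example: a client which keeps exactly one ring fully busy 
over a sampling period shows as 100% without the capacity key and as 
1/8 = 12.5% with it, yet in neither case does the number alone tell you how 
much more work the hardware could actually absorb.)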

There is also a parallel with the CPU world here and hyper threading, if 
not wider, where "What does 100% actually mean?" is also wishy-washy.

Also note that the reporting of actual time based values in fdinfo 
would not change with this series.

Or, if you can guide me towards how to distinguish real vs fake 
parallelism in HW IP blocks, I could modify the series to only add 
capacity tags where there are truly independent blocks. That would be 
different from i915 though, where I did not bother with that distinction. 
(The reason being that the assignment of, for instance, EUs to compute 
"rings" (command streamers in i915) was supposed to be re-configurable on 
the fly, so it did not make sense to try and be super smart in fdinfo.)

> As for the UAPI portion of this, we generally expose a limited number
> of rings to user space and then we use the GPU scheduler to load
> balance between all of the available rings of a type to try and
> extract as much parallelism as we can.

The part I do not understand is the purpose of the ring argument in, for 
instance, drm_amdgpu_cs_chunk_ib. It appears userspace can create up to N 
scheduling entities using different ring ids, but internally they can map 
onto the same scheduler instances (depending on the IP type, it can be that 
each userspace ring maps to the same N hw rings, or, for rings with no drm 
sched load balancing, the userspace ring also does not appear to have a 
relation to the picked drm sched instance).
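
For reference, the uapi structure in question looks roughly like this (from 
include/uapi/drm/amdgpu_drm.h):

	struct drm_amdgpu_cs_chunk_ib {
		__u32 _pad;
		__u32 flags;		/* AMDGPU_IB_FLAG_* */
		__u64 va_start;		/* virtual address of the IB */
		__u32 ib_bytes;		/* size of the IB */
		__u32 ip_type;		/* AMDGPU_HW_IP_* to submit to */
		__u32 ip_instance;	/* instance of the given HW IP */
		__u32 ring;		/* ring index within the IP/instance */
	};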

So I understand neither how this ring is useful, nor how it does not 
create a problem for IP types which use drm_sched_pick_best. It appears 
that even if userspace created two scheduling entities with different ring 
ids they could randomly map to the same drm sched, aka the same hw ring, no?

Regards,

Tvrtko

> Alex
> 
> 
>>
>> Tvrtko Ursulin (5):
>>    drm/amdgpu: Cache number of rings per hw ip type
>>    drm/amdgpu: Use cached number of rings from the AMDGPU_INFO_HW_IP_INFO
>>      ioctl
>>    drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>>    drm/amdgpu: Show engine capacity in fdinfo
>>    drm/amdgpu: Only show VRAM in fdinfo if it exists
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62 +++-------------------
>>   5 files changed, 49 insertions(+), 70 deletions(-)
>>
>> --
>> 2.44.0


* Re: [RFC 0/5] Add capacity key to fdinfo
  2024-05-01 13:27   ` Tvrtko Ursulin
@ 2024-05-02 13:07     ` Christian König
  2024-05-02 14:43       ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2024-05-02 13:07 UTC (permalink / raw
  To: Tvrtko Ursulin, Alex Deucher, Tvrtko Ursulin; +Cc: amd-gfx, kernel-dev

Am 01.05.24 um 15:27 schrieb Tvrtko Ursulin:
>
> Hi Alex,
>
> On 30/04/2024 19:32, Alex Deucher wrote:
>> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin@igalia.com> 
>> wrote:
>>>
>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>
>>> I have noticed AMD GPUs can have more than one "engine" (ring?) of 
>>> the same type
>>> but amdgpu is not reporting that in fdinfo using the capacity engine 
>>> tag.
>>>
>>> This series is therefore an attempt to improve that, but only an RFC 
>>> since it is
>>> quite likely I got stuff wrong on the first attempt. Or if not wrong 
>>> it may not
>>> be very beneficial in AMDs case.
>>>
>>> So I tried to figure out how to count and store the number of 
>>> instances of an
>>> "engine" type and spotted that could perhaps be used in more than 
>>> one place in
>>> the driver. I was more than a little bit confused by the ip_instance 
>>> and uapi
>>> rings, then how rings are selected to context entities internally. 
>>> Anyway..
>>> hopefully it is a simple enough series to easily spot any such large 
>>> misses.
>>>
>>> End result should be that, assuming two "engine" instances, one 
>>> fully loaded and
>>> one idle will only report client using 50% of that engine type.
>>
>> That would only be true if there are multiple instantiations of the IP
>> on the chip which in most cases is not true.  In most cases there is
>> one instance of the IP that can be fed from multiple rings. E.g. for
>> graphics and compute, all of the rings ultimately feed into the same
>> compute units on the chip.  So if you have a gfx ring and a compute
>> rings, you can schedule work to them asynchronously, but ultimately
>> whether they execute serially or in parallel depends on the actual
>> shader code in the command buffers and the extent to which it can
>> utilize the available compute units in the shader cores.
>
> This is the same as with Intel/i915. Fdinfo is not intended to provide 
> utilisation of EUs and such, just how busy are the "entities" kernel 
> submits to. So doing something like in this series would make the 
> reporting more similar between the two drivers.
>
> I think both the 0-800% or 0-100% range (taking 8 ring compute as an 
> example) can be misleading for different workloads. Neither <800% in 
> the former means one can send more work and same for <100% in the latter.

Yeah, I think that's what Alex is trying to describe. By using 8 compute 
rings your 800% load is actually incorrect and quite misleading.

The background is that those 8 compute rings won't be active all at the same 
time, but rather be waiting on each other for resources.

But this "waiting" is unfortunately counted as execution time, since the 
approach used is not really capable of separating waiting time from 
execution time.

>
> There is also a parallel with the CPU world here and hyper threading, 
> if not wider, where "What does 100% actually mean?" is also wishy-washy.
>
> Also note that the reporting of actual time based values in fdinfo 
> would not changing with this series.
>
> Of if you can guide me towards how to distinguish real vs fake 
> parallelism in HW IP blocks I could modify the series to only add 
> capacity tags where there are truly independent blocks. That would be 
> different from i915 though were I did not bother with that 
> distinction. (For reasons that assignment of for instance EUs to 
> compute "rings" (command streamers in i915) was supposed to be 
> possible to re-configure on the fly. So it did not make sense to try 
> and be super smart in fdinfo.)

Well, exactly, that's the point: we don't really have truly independent 
blocks on AMD hardware.

There are things like independent SDMA instances, but those are meant to 
be used like the first instance for uploads and the second for downloads 
etc. When you use both instances for the same job they will pretty much 
limit each other because of a single resource.

>> As for the UAPI portion of this, we generally expose a limited number
>> of rings to user space and then we use the GPU scheduler to load
>> balance between all of the available rings of a type to try and
>> extract as much parallelism as we can.
>
> The part I do not understand is the purpose of the ring argument in 
> for instance drm_amdgpu_cs_chunk_ib. It appears userspace can create 
> up to N scheduling entities using different ring id's, but internally 
> they can map to 1:N same scheduler instances (depending on IP type, 
> can be that each userspace ring maps to same N hw rings, or for rings 
> with no drm sched load balancing userspace ring also does not appear 
> to have a relation to the picked drm sched instance.).
>
> So I neither understand how this ring is useful, or how it does not 
> create a problem for IP types which use drm_sched_pick_best. It 
> appears even if userspace created two scheduling entities with 
> different ring ids they could randomly map to same drm sched aka same 
> hw ring, no?

Yeah, that is correct. The multimedia instances have to use "fixed" 
load balancing because of a lack of firmware support. That should have 
been fixed by now, but we never found time to actually validate it.

Regarding the "ring" parameter in CS, that is basically just for 
backward compatibility with older userspace. E.g. so that we don't map all 
SDMA jobs to the same instance when only one context is used.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>> Alex
>>
>>
>>>
>>> Tvrtko Ursulin (5):
>>>    drm/amdgpu: Cache number of rings per hw ip type
>>>    drm/amdgpu: Use cached number of rings from the 
>>> AMDGPU_INFO_HW_IP_INFO
>>>      ioctl
>>>    drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>>>    drm/amdgpu: Show engine capacity in fdinfo
>>>    drm/amdgpu: Only show VRAM in fdinfo if it exists
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62 
>>> +++-------------------
>>>   5 files changed, 49 insertions(+), 70 deletions(-)
>>>
>>> -- 
>>> 2.44.0



* Re: [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists
  2024-04-30 17:27 ` [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists Tvrtko Ursulin
@ 2024-05-02 13:16   ` Christian König
  2024-05-02 14:46     ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2024-05-02 13:16 UTC (permalink / raw
  To: Tvrtko Ursulin, amd-gfx; +Cc: kernel-dev, Tvrtko Ursulin

Am 30.04.24 um 19:27 schrieb Tvrtko Ursulin:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>
> Do not emit the key-value pairs if the VRAM does not exist ie. VRAM
> placement is not valid and accessible.

Yeah, that's unfortunately rather misleading.

Even APUs have VRAM, or rather stolen system memory, which is managed by 
the graphics driver.

We only have a single compute model which really doesn't have any VRAM at all.

Regards,
Christian.

>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 29 +++++++++++++---------
>   1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> index a09944104c41..603a5c010f5d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> @@ -83,25 +83,30 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
>   	 */
>   
>   	drm_printf(p, "pasid:\t%u\n", fpriv->vm.pasid);
> -	drm_printf(p, "drm-memory-vram:\t%llu KiB\n", stats.vram/1024UL);
>   	drm_printf(p, "drm-memory-gtt: \t%llu KiB\n", stats.gtt/1024UL);
>   	drm_printf(p, "drm-memory-cpu: \t%llu KiB\n", stats.cpu/1024UL);
> -	drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
> -		   stats.visible_vram/1024UL);
> -	drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
> -		   stats.evicted_vram/1024UL);
> -	drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
> -		   stats.evicted_visible_vram/1024UL);
> -	drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
> -		   stats.requested_vram/1024UL);
> -	drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
> -		   stats.requested_visible_vram/1024UL);
>   	drm_printf(p, "amd-requested-gtt:\t%llu KiB\n",
>   		   stats.requested_gtt/1024UL);
> -	drm_printf(p, "drm-shared-vram:\t%llu KiB\n", stats.vram_shared/1024UL);
>   	drm_printf(p, "drm-shared-gtt:\t%llu KiB\n", stats.gtt_shared/1024UL);
>   	drm_printf(p, "drm-shared-cpu:\t%llu KiB\n", stats.cpu_shared/1024UL);
>   
> +	if (!adev->gmc.is_app_apu) {
> +		drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
> +			   stats.vram/1024UL);
> +		drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
> +			   stats.visible_vram/1024UL);
> +		drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
> +			   stats.evicted_vram/1024UL);
> +		drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
> +			   stats.evicted_visible_vram/1024UL);
> +		drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
> +			   stats.requested_vram/1024UL);
> +		drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
> +			   stats.requested_visible_vram/1024UL);
> +		drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
> +			   stats.vram_shared/1024UL);
> +	}
> +
>   	for (hw_ip = 0; hw_ip < AMDGPU_HW_IP_NUM; ++hw_ip) {
>   		if (!usage[hw_ip])
>   			continue;



* Re: [RFC 0/5] Add capacity key to fdinfo
  2024-05-02 13:07     ` Christian König
@ 2024-05-02 14:43       ` Tvrtko Ursulin
  2024-05-02 15:00         ` Alex Deucher
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-05-02 14:43 UTC (permalink / raw
  To: Christian König, Alex Deucher, Tvrtko Ursulin; +Cc: amd-gfx, kernel-dev


On 02/05/2024 14:07, Christian König wrote:
> Am 01.05.24 um 15:27 schrieb Tvrtko Ursulin:
>>
>> Hi Alex,
>>
>> On 30/04/2024 19:32, Alex Deucher wrote:
>>> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin@igalia.com> 
>>> wrote:
>>>>
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>
>>>> I have noticed AMD GPUs can have more than one "engine" (ring?) of 
>>>> the same type
>>>> but amdgpu is not reporting that in fdinfo using the capacity engine 
>>>> tag.
>>>>
>>>> This series is therefore an attempt to improve that, but only an RFC 
>>>> since it is
>>>> quite likely I got stuff wrong on the first attempt. Or if not wrong 
>>>> it may not
>>>> be very beneficial in AMDs case.
>>>>
>>>> So I tried to figure out how to count and store the number of 
>>>> instances of an
>>>> "engine" type and spotted that could perhaps be used in more than 
>>>> one place in
>>>> the driver. I was more than a little bit confused by the ip_instance 
>>>> and uapi
>>>> rings, then how rings are selected to context entities internally. 
>>>> Anyway..
>>>> hopefully it is a simple enough series to easily spot any such large 
>>>> misses.
>>>>
>>>> End result should be that, assuming two "engine" instances, one 
>>>> fully loaded and
>>>> one idle will only report client using 50% of that engine type.
>>>
>>> That would only be true if there are multiple instantiations of the IP
>>> on the chip which in most cases is not true.  In most cases there is
>>> one instance of the IP that can be fed from multiple rings. E.g. for
>>> graphics and compute, all of the rings ultimately feed into the same
>>> compute units on the chip.  So if you have a gfx ring and a compute
>>> rings, you can schedule work to them asynchronously, but ultimately
>>> whether they execute serially or in parallel depends on the actual
>>> shader code in the command buffers and the extent to which it can
>>> utilize the available compute units in the shader cores.
>>
>> This is the same as with Intel/i915. Fdinfo is not intended to provide 
>> utilisation of EUs and such, just how busy are the "entities" kernel 
>> submits to. So doing something like in this series would make the 
>> reporting more similar between the two drivers.
>>
>> I think both the 0-800% or 0-100% range (taking 8 ring compute as an 
>> example) can be misleading for different workloads. Neither <800% in 
>> the former means one can send more work and same for <100% in the latter.
> 
> Yeah, I think that's what Alex tries to describe. By using 8 compute 
> rings your 800% load is actually incorrect and quite misleading.
> 
> Background is that those 8 compute rings won't be active all at the same 
> time, but rather waiting on each other for resources.
> 
> But this "waiting" is unfortunately considered execution time since the 
> used approach is actually not really capable of separating waiting and 
> execution time.

Right, so 800% is what gputop could be suggesting today, by virtue of the 
fact that 8 contexts/clients can each use 100% if they only use a subset of 
the compute units. I was proposing to expose the capacity in fdinfo so it 
can be scaled down, and then discussing how both situations have pros and cons.

>> There is also a parallel with the CPU world here and hyper threading, 
>> if not wider, where "What does 100% actually mean?" is also wishy-washy.
>>
>> Also note that the reporting of actual time based values in fdinfo 
>> would not changing with this series.
>>
>> Of if you can guide me towards how to distinguish real vs fake 
>> parallelism in HW IP blocks I could modify the series to only add 
>> capacity tags where there are truly independent blocks. That would be 
>> different from i915 though were I did not bother with that 
>> distinction. (For reasons that assignment of for instance EUs to 
>> compute "rings" (command streamers in i915) was supposed to be 
>> possible to re-configure on the fly. So it did not make sense to try 
>> and be super smart in fdinfo.)
> 
> Well exactly that's the point we don't really have truly independent 
> blocks on AMD hardware.
> 
> There are things like independent SDMA instances, but those a meant to 
> be used like the first instance for uploads and the second for downloads 
> etc.. When you use both instances for the same job they will pretty much 
> limit each other because of a single resource.

So _never_ multiple instances of the same IP block? No video decode, 
encode, anything?

>>> As for the UAPI portion of this, we generally expose a limited number
>>> of rings to user space and then we use the GPU scheduler to load
>>> balance between all of the available rings of a type to try and
>>> extract as much parallelism as we can.
>>
>> The part I do not understand is the purpose of the ring argument in 
>> for instance drm_amdgpu_cs_chunk_ib. It appears userspace can create 
>> up to N scheduling entities using different ring id's, but internally 
>> they can map to 1:N same scheduler instances (depending on IP type, 
>> can be that each userspace ring maps to same N hw rings, or for rings 
>> with no drm sched load balancing userspace ring also does not appear 
>> to have a relation to the picked drm sched instance.).
>>
>> So I neither understand how this ring is useful, or how it does not 
>> create a problem for IP types which use drm_sched_pick_best. It 
>> appears even if userspace created two scheduling entities with 
>> different ring ids they could randomly map to same drm sched aka same 
>> hw ring, no?
> 
> Yeah, that is correct. The multimedia instances have to use a "fixed" 
> load balancing because of lack of firmware support. That should have 
> been fixed by now but we never found time to actually validate it.

Gotcha.

> Regarding the "ring" parameter in CS, that is basically just for 
> backward compatibility with older userspace. E.g. that we don't map all 
> SDMA jobs to the same instance when only once context is used.

I see. In that sense "limits" for compute in amdgpu_ctx_num_entities are 
arbitrary, or related to some old userspace expectation?

Regards,

Tvrtko

> Regards,
> Christian.
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Alex
>>>
>>>
>>>>
>>>> Tvrtko Ursulin (5):
>>>>    drm/amdgpu: Cache number of rings per hw ip type
>>>>    drm/amdgpu: Use cached number of rings from the 
>>>> AMDGPU_INFO_HW_IP_INFO
>>>>      ioctl
>>>>    drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>>>>    drm/amdgpu: Show engine capacity in fdinfo
>>>>    drm/amdgpu: Only show VRAM in fdinfo if it exists
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62 
>>>> +++-------------------
>>>>   5 files changed, 49 insertions(+), 70 deletions(-)
>>>>
>>>> -- 
>>>> 2.44.0
> 


* Re: [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists
  2024-05-02 13:16   ` Christian König
@ 2024-05-02 14:46     ` Tvrtko Ursulin
  2024-05-03 13:47       ` Alex Deucher
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-05-02 14:46 UTC (permalink / raw
  To: Christian König, Tvrtko Ursulin, amd-gfx; +Cc: kernel-dev


On 02/05/2024 14:16, Christian König wrote:
> Am 30.04.24 um 19:27 schrieb Tvrtko Ursulin:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> Do not emit the key-value pairs if the VRAM does not exist ie. VRAM
>> placement is not valid and accessible.
> 
> Yeah, that's unfortunately rather misleading.
> 
> Even APUs have VRAM or rather stolen system memory which is managed by 
> the graphics driver.
> 
> We only have a single compute model which really doesn't have VRAM at all.

Hm, what is misleading, and how, more precisely? :) To put it in other words, 
if is_app_apu is not the right criterion to know when TTM_PL_VRAM is 
impossible, what is? Is the compute model you mention the only thing 
which sets is_app_apu and uses the dummy vram manager?

Regards,

Tvrtko

> Regards,
> Christian.
> 
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 29 +++++++++++++---------
>>   1 file changed, 17 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
>> index a09944104c41..603a5c010f5d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
>> @@ -83,25 +83,30 @@ void amdgpu_show_fdinfo(struct drm_printer *p, 
>> struct drm_file *file)
>>        */
>>       drm_printf(p, "pasid:\t%u\n", fpriv->vm.pasid);
>> -    drm_printf(p, "drm-memory-vram:\t%llu KiB\n", stats.vram/1024UL);
>>       drm_printf(p, "drm-memory-gtt: \t%llu KiB\n", stats.gtt/1024UL);
>>       drm_printf(p, "drm-memory-cpu: \t%llu KiB\n", stats.cpu/1024UL);
>> -    drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
>> -           stats.visible_vram/1024UL);
>> -    drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
>> -           stats.evicted_vram/1024UL);
>> -    drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
>> -           stats.evicted_visible_vram/1024UL);
>> -    drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
>> -           stats.requested_vram/1024UL);
>> -    drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
>> -           stats.requested_visible_vram/1024UL);
>>       drm_printf(p, "amd-requested-gtt:\t%llu KiB\n",
>>              stats.requested_gtt/1024UL);
>> -    drm_printf(p, "drm-shared-vram:\t%llu KiB\n", 
>> stats.vram_shared/1024UL);
>>       drm_printf(p, "drm-shared-gtt:\t%llu KiB\n", 
>> stats.gtt_shared/1024UL);
>>       drm_printf(p, "drm-shared-cpu:\t%llu KiB\n", 
>> stats.cpu_shared/1024UL);
>> +    if (!adev->gmc.is_app_apu) {
>> +        drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
>> +               stats.vram/1024UL);
>> +        drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
>> +               stats.visible_vram/1024UL);
>> +        drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
>> +               stats.evicted_vram/1024UL);
>> +        drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
>> +               stats.evicted_visible_vram/1024UL);
>> +        drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
>> +               stats.requested_vram/1024UL);
>> +        drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
>> +               stats.requested_visible_vram/1024UL);
>> +        drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
>> +               stats.vram_shared/1024UL);
>> +    }
>> +
>>       for (hw_ip = 0; hw_ip < AMDGPU_HW_IP_NUM; ++hw_ip) {
>>           if (!usage[hw_ip])
>>               continue;
> 


* Re: [RFC 0/5] Add capacity key to fdinfo
  2024-05-02 14:43       ` Tvrtko Ursulin
@ 2024-05-02 15:00         ` Alex Deucher
  2024-05-03 11:50           ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2024-05-02 15:00 UTC (permalink / raw
  To: Tvrtko Ursulin; +Cc: Christian König, Tvrtko Ursulin, amd-gfx, kernel-dev

On Thu, May 2, 2024 at 10:43 AM Tvrtko Ursulin
<tvrtko.ursulin@igalia.com> wrote:
>
>
> On 02/05/2024 14:07, Christian König wrote:
> > Am 01.05.24 um 15:27 schrieb Tvrtko Ursulin:
> >>
> >> Hi Alex,
> >>
> >> On 30/04/2024 19:32, Alex Deucher wrote:
> >>> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin@igalia.com>
> >>> wrote:
> >>>>
> >>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> >>>>
> >>>> I have noticed AMD GPUs can have more than one "engine" (ring?) of
> >>>> the same type
> >>>> but amdgpu is not reporting that in fdinfo using the capacity engine
> >>>> tag.
> >>>>
> >>>> This series is therefore an attempt to improve that, but only an RFC
> >>>> since it is
> >>>> quite likely I got stuff wrong on the first attempt. Or if not wrong
> >>>> it may not
> >>>> be very beneficial in AMDs case.
> >>>>
> >>>> So I tried to figure out how to count and store the number of
> >>>> instances of an
> >>>> "engine" type and spotted that could perhaps be used in more than
> >>>> one place in
> >>>> the driver. I was more than a little bit confused by the ip_instance
> >>>> and uapi
> >>>> rings, then how rings are selected to context entities internally.
> >>>> Anyway..
> >>>> hopefully it is a simple enough series to easily spot any such large
> >>>> misses.
> >>>>
> >>>> End result should be that, assuming two "engine" instances, one
> >>>> fully loaded and
> >>>> one idle will only report client using 50% of that engine type.
> >>>
> >>> That would only be true if there are multiple instantiations of the IP
> >>> on the chip which in most cases is not true.  In most cases there is
> >>> one instance of the IP that can be fed from multiple rings. E.g. for
> >>> graphics and compute, all of the rings ultimately feed into the same
> >>> compute units on the chip.  So if you have a gfx ring and a compute
> >>> rings, you can schedule work to them asynchronously, but ultimately
> >>> whether they execute serially or in parallel depends on the actual
> >>> shader code in the command buffers and the extent to which it can
> >>> utilize the available compute units in the shader cores.
> >>
> >> This is the same as with Intel/i915. Fdinfo is not intended to provide
> >> utilisation of EUs and such, just how busy are the "entities" kernel
> >> submits to. So doing something like in this series would make the
> >> reporting more similar between the two drivers.
> >>
> >> I think both the 0-800% or 0-100% range (taking 8 ring compute as an
> >> example) can be misleading for different workloads. Neither <800% in
> >> the former means one can send more work and same for <100% in the latter.
> >
> > Yeah, I think that's what Alex tries to describe. By using 8 compute
> > rings your 800% load is actually incorrect and quite misleading.
> >
> > Background is that those 8 compute rings won't be active all at the same
> > time, but rather waiting on each other for resources.
> >
> > But this "waiting" is unfortunately considered execution time since the
> > used approach is actually not really capable of separating waiting and
> > execution time.
>
> Right, so 800% is what gputop could be suggesting today, by the virtue 8
> context/clients can each use 100% if they only use a subset of compute
> units. I was proposing to expose the capacity in fdinfo so it can be
> scaled down and then dicussing how both situation have pros and cons.
>
> >> There is also a parallel with the CPU world here and hyper threading,
> >> if not wider, where "What does 100% actually mean?" is also wishy-washy.
> >>
> >> Also note that the reporting of actual time based values in fdinfo
> >> would not changing with this series.
> >>
> >> Of if you can guide me towards how to distinguish real vs fake
> >> parallelism in HW IP blocks I could modify the series to only add
> >> capacity tags where there are truly independent blocks. That would be
> >> different from i915 though were I did not bother with that
> >> distinction. (For reasons that assignment of for instance EUs to
> >> compute "rings" (command streamers in i915) was supposed to be
> >> possible to re-configure on the fly. So it did not make sense to try
> >> and be super smart in fdinfo.)
> >
> > Well exactly that's the point we don't really have truly independent
> > blocks on AMD hardware.
> >
> > There are things like independent SDMA instances, but those a meant to
> > be used like the first instance for uploads and the second for downloads
> > etc.. When you use both instances for the same job they will pretty much
> > limit each other because of a single resource.
>
> So _never_ multiple instances of the same IP block? No video decode,
> encode, anything?

Some chips have multiple encode/decode IP blocks that are actually
separate instances; however, we load balance between them so userspace
sees just one engine.  Also in some cases they are asymmetric (e.g.,
different sets of supported CODECs on each instance).  The driver
handles this by inspecting the command buffer and scheduling on the
appropriate instance based on the requested CODEC.  SDMA also supports
multiple IP blocks that are independent.

Alex

>
> >>> As for the UAPI portion of this, we generally expose a limited number
> >>> of rings to user space and then we use the GPU scheduler to load
> >>> balance between all of the available rings of a type to try and
> >>> extract as much parallelism as we can.
> >>
> >> The part I do not understand is the purpose of the ring argument in
> >> for instance drm_amdgpu_cs_chunk_ib. It appears userspace can create
> >> up to N scheduling entities using different ring id's, but internally
> >> they can map to 1:N same scheduler instances (depending on IP type,
> >> can be that each userspace ring maps to same N hw rings, or for rings
> >> with no drm sched load balancing userspace ring also does not appear
> >> to have a relation to the picked drm sched instance.).
> >>
> >> So I neither understand how this ring is useful, or how it does not
> >> create a problem for IP types which use drm_sched_pick_best. It
> >> appears even if userspace created two scheduling entities with
> >> different ring ids they could randomly map to same drm sched aka same
> >> hw ring, no?
> >
> > Yeah, that is correct. The multimedia instances have to use a "fixed"
> > load balancing because of lack of firmware support. That should have
> > been fixed by now but we never found time to actually validate it.
>
> Gotcha.
>
> > Regarding the "ring" parameter in CS, that is basically just for
> > backward compatibility with older userspace. E.g. that we don't map all
> > SDMA jobs to the same instance when only once context is used.
>
> I see. In that sense "limits" for compute in amdgpu_ctx_num_entities are
> arbitrary, or related to some old userspace expectation?
>
> Regards,
>
> Tvrtko
>
> > Regards,
> > Christian.
> >
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>> Alex
> >>>
> >>>
> >>>>
> >>>> Tvrtko Ursulin (5):
> >>>>    drm/amdgpu: Cache number of rings per hw ip type
> >>>>    drm/amdgpu: Use cached number of rings from the
> >>>> AMDGPU_INFO_HW_IP_INFO
> >>>>      ioctl
> >>>>    drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
> >>>>    drm/amdgpu: Show engine capacity in fdinfo
> >>>>    drm/amdgpu: Only show VRAM in fdinfo if it exists
> >>>>
> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
> >>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62
> >>>> +++-------------------
> >>>>   5 files changed, 49 insertions(+), 70 deletions(-)
> >>>>
> >>>> --
> >>>> 2.44.0
> >


* Re: [RFC 0/5] Add capacity key to fdinfo
  2024-05-02 15:00         ` Alex Deucher
@ 2024-05-03 11:50           ` Tvrtko Ursulin
  2024-05-03 14:28             ` Alex Deucher
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-05-03 11:50 UTC (permalink / raw
  To: Alex Deucher; +Cc: Christian König, Tvrtko Ursulin, amd-gfx, kernel-dev


On 02/05/2024 16:00, Alex Deucher wrote:
> On Thu, May 2, 2024 at 10:43 AM Tvrtko Ursulin
> <tvrtko.ursulin@igalia.com> wrote:
>>
>>
>> On 02/05/2024 14:07, Christian König wrote:
>>> Am 01.05.24 um 15:27 schrieb Tvrtko Ursulin:
>>>>
>>>> Hi Alex,
>>>>
>>>> On 30/04/2024 19:32, Alex Deucher wrote:
>>>>> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin@igalia.com>
>>>>> wrote:
>>>>>>
>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>
>>>>>> I have noticed AMD GPUs can have more than one "engine" (ring?) of
>>>>>> the same type
>>>>>> but amdgpu is not reporting that in fdinfo using the capacity engine
>>>>>> tag.
>>>>>>
>>>>>> This series is therefore an attempt to improve that, but only an RFC
>>>>>> since it is
>>>>>> quite likely I got stuff wrong on the first attempt. Or if not wrong
>>>>>> it may not
>>>>>> be very beneficial in AMDs case.
>>>>>>
>>>>>> So I tried to figure out how to count and store the number of
>>>>>> instances of an
>>>>>> "engine" type and spotted that could perhaps be used in more than
>>>>>> one place in
>>>>>> the driver. I was more than a little bit confused by the ip_instance
>>>>>> and uapi
>>>>>> rings, then how rings are selected to context entities internally.
>>>>>> Anyway..
>>>>>> hopefully it is a simple enough series to easily spot any such large
>>>>>> misses.
>>>>>>
>>>>>> End result should be that, assuming two "engine" instances, one
>>>>>> fully loaded and
>>>>>> one idle will only report client using 50% of that engine type.
>>>>>
>>>>> That would only be true if there are multiple instantiations of the IP
>>>>> on the chip which in most cases is not true.  In most cases there is
>>>>> one instance of the IP that can be fed from multiple rings. E.g. for
>>>>> graphics and compute, all of the rings ultimately feed into the same
>>>>> compute units on the chip.  So if you have a gfx ring and a compute
>>>>> rings, you can schedule work to them asynchronously, but ultimately
>>>>> whether they execute serially or in parallel depends on the actual
>>>>> shader code in the command buffers and the extent to which it can
>>>>> utilize the available compute units in the shader cores.
>>>>
>>>> This is the same as with Intel/i915. Fdinfo is not intended to provide
>>>> utilisation of EUs and such, just how busy are the "entities" kernel
>>>> submits to. So doing something like in this series would make the
>>>> reporting more similar between the two drivers.
>>>>
>>>> I think both the 0-800% or 0-100% range (taking 8 ring compute as an
>>>> example) can be misleading for different workloads. Neither <800% in
>>>> the former means one can send more work and same for <100% in the latter.
>>>
>>> Yeah, I think that's what Alex tries to describe. By using 8 compute
>>> rings your 800% load is actually incorrect and quite misleading.
>>>
>>> Background is that those 8 compute rings won't be active all at the same
>>> time, but rather waiting on each other for resources.
>>>
>>> But this "waiting" is unfortunately considered execution time since the
>>> used approach is actually not really capable of separating waiting and
>>> execution time.
>>
>> Right, so 800% is what gputop could be suggesting today, by virtue of 8
>> contexts/clients each being able to use 100% if they only use a subset of
>> compute units. I was proposing to expose the capacity in fdinfo so it can
>> be scaled down and then discussing how both situations have pros and cons.
>>
>>>> There is also a parallel with the CPU world here and hyper threading,
>>>> if not wider, where "What does 100% actually mean?" is also wishy-washy.
>>>>
>>>> Also note that the reporting of actual time based values in fdinfo
>>>> would not change with this series.
>>>>
>>>> Or if you can guide me towards how to distinguish real vs fake
>>>> parallelism in HW IP blocks I could modify the series to only add
>>>> capacity tags where there are truly independent blocks. That would be
>>>> different from i915 though where I did not bother with that
>>>> distinction. (For reasons that assignment of for instance EUs to
>>>> compute "rings" (command streamers in i915) was supposed to be
>>>> possible to re-configure on the fly. So it did not make sense to try
>>>> and be super smart in fdinfo.)
>>>
>>> Well exactly that's the point we don't really have truly independent
>>> blocks on AMD hardware.
>>>
>>> There are things like independent SDMA instances, but those are meant to
>>> be used like the first instance for uploads and the second for downloads
>>> etc.. When you use both instances for the same job they will pretty much
>>> limit each other because of a single resource.
>>
>> So _never_ multiple instances of the same IP block? No video decode,
>> encode, anything?
> 
> Some chips have multiple encode/decode IP blocks that are actually
> separate instances, however, we load balance between them so userspace
> sees just one engine.  Also in some cases they are asymmetric (e.g.,
> different sets of supported CODECs on each instance).  The driver
> handles this by inspecting the command buffer and scheduling on the
> appropriate instance based on the requested CODEC.  SDMA also supports
> multiple IP blocks that are independent.

Similar to i915, just that we don't inspect buffers but expose the 
instance capabilities, and userspace is responsible for setting up the 
load balancing engine with the correct physical mask.

Anyway, back to the main point - are you interested at all in me adding 
the capacity flags, at least for the IP blocks which are probed to exist 
as more than a singleton? And if yes, could you please suggest how to do it.

For instance should I use adev->sdma.num_instances, 
adev->uvd.num_uvd_inst, adev->vcn.num_vcn_inst, 
adev->jpeg.num_jpeg_inst? Or maybe adev->num_ip_blocks and count by 
hw_ip type?

Patch 3/5 interesting or not to skip all the empty array walking (in 
amdgpu_ctx_entity_time mostly)?

Regards,

Tvrtko

> 
> Alex
> 
>>
>>>>> As for the UAPI portion of this, we generally expose a limited number
>>>>> of rings to user space and then we use the GPU scheduler to load
>>>>> balance between all of the available rings of a type to try and
>>>>> extract as much parallelism as we can.
>>>>
>>>> The part I do not understand is the purpose of the ring argument in
>>>> for instance drm_amdgpu_cs_chunk_ib. It appears userspace can create
>>>> up to N scheduling entities using different ring id's, but internally
>>>> they can map to 1:N same scheduler instances (depending on IP type,
>>>> can be that each userspace ring maps to same N hw rings, or for rings
>>>> with no drm sched load balancing userspace ring also does not appear
>>>> to have a relation to the picked drm sched instance.).
>>>>
>>>> So I neither understand how this ring is useful, or how it does not
>>>> create a problem for IP types which use drm_sched_pick_best. It
>>>> appears even if userspace created two scheduling entities with
>>>> different ring ids they could randomly map to same drm sched aka same
>>>> hw ring, no?
>>>
>>> Yeah, that is correct. The multimedia instances have to use a "fixed"
>>> load balancing because of lack of firmware support. That should have
>>> been fixed by now but we never found time to actually validate it.
>>
>> Gotcha.
>>
>>> Regarding the "ring" parameter in CS, that is basically just for
>>> backward compatibility with older userspace. E.g. that we don't map all
> >>> SDMA jobs to the same instance when only one context is used.
>>
>> I see. In that sense "limits" for compute in amdgpu_ctx_num_entities are
>> arbitrary, or related to some old userspace expectation?
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>>>
>>>>>> Tvrtko Ursulin (5):
>>>>>>     drm/amdgpu: Cache number of rings per hw ip type
>>>>>>     drm/amdgpu: Use cached number of rings from the
>>>>>> AMDGPU_INFO_HW_IP_INFO
>>>>>>       ioctl
>>>>>>     drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>>>>>>     drm/amdgpu: Show engine capacity in fdinfo
>>>>>>     drm/amdgpu: Only show VRAM in fdinfo if it exists
>>>>>>
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62
>>>>>> +++-------------------
>>>>>>    5 files changed, 49 insertions(+), 70 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.44.0
>>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists
  2024-05-02 14:46     ` Tvrtko Ursulin
@ 2024-05-03 13:47       ` Alex Deucher
  2024-05-03 14:01         ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2024-05-03 13:47 UTC (permalink / raw
  To: Tvrtko Ursulin; +Cc: Christian König, Tvrtko Ursulin, amd-gfx, kernel-dev

On Fri, May 3, 2024 at 3:50 AM Tvrtko Ursulin <tvrtko.ursulin@igalia.com> wrote:
>
>
> On 02/05/2024 14:16, Christian König wrote:
> > Am 30.04.24 um 19:27 schrieb Tvrtko Ursulin:
> >> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> >>
> >> Do not emit the key-value pairs if the VRAM does not exist ie. VRAM
> >> placement is not valid and accessible.
> >
> > Yeah, that's unfortunately rather misleading.
> >
> > Even APUs have VRAM or rather stolen system memory which is managed by
> > the graphics driver.
> >
> > We only have a single compute model which really doesn't have VRAM at all.
>
> Hm what is misleading and how more precisely? :) Maybe in other words,
> if is_app_apu is not the right criterion to know when TTM_PL_VRAM is
> impossible, what is? Is the compute model you mention the only thing
> which sets is_app_apu and uses the dummy vram manager?

Probably better to check if adev->gmc.real_vram_size is non-0.
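
For illustration, a minimal sketch of what that alternative check could
look like, reusing the stats printing from the patch (untested, just to
show the idea):

        if (adev->gmc.real_vram_size) {
                drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
                           stats.vram/1024UL);
                /* ... plus the other VRAM related key-value pairs ... */
        }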

Alex

>
> Regards,
>
> Tvrtko
>
> > Regards,
> > Christian.
> >
> >>
> >> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 29 +++++++++++++---------
> >>   1 file changed, 17 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> >> index a09944104c41..603a5c010f5d 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> >> @@ -83,25 +83,30 @@ void amdgpu_show_fdinfo(struct drm_printer *p,
> >> struct drm_file *file)
> >>        */
> >>       drm_printf(p, "pasid:\t%u\n", fpriv->vm.pasid);
> >> -    drm_printf(p, "drm-memory-vram:\t%llu KiB\n", stats.vram/1024UL);
> >>       drm_printf(p, "drm-memory-gtt: \t%llu KiB\n", stats.gtt/1024UL);
> >>       drm_printf(p, "drm-memory-cpu: \t%llu KiB\n", stats.cpu/1024UL);
> >> -    drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
> >> -           stats.visible_vram/1024UL);
> >> -    drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
> >> -           stats.evicted_vram/1024UL);
> >> -    drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
> >> -           stats.evicted_visible_vram/1024UL);
> >> -    drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
> >> -           stats.requested_vram/1024UL);
> >> -    drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
> >> -           stats.requested_visible_vram/1024UL);
> >>       drm_printf(p, "amd-requested-gtt:\t%llu KiB\n",
> >>              stats.requested_gtt/1024UL);
> >> -    drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
> >> stats.vram_shared/1024UL);
> >>       drm_printf(p, "drm-shared-gtt:\t%llu KiB\n",
> >> stats.gtt_shared/1024UL);
> >>       drm_printf(p, "drm-shared-cpu:\t%llu KiB\n",
> >> stats.cpu_shared/1024UL);
> >> +    if (!adev->gmc.is_app_apu) {
> >> +        drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
> >> +               stats.vram/1024UL);
> >> +        drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
> >> +               stats.visible_vram/1024UL);
> >> +        drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
> >> +               stats.evicted_vram/1024UL);
> >> +        drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
> >> +               stats.evicted_visible_vram/1024UL);
> >> +        drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
> >> +               stats.requested_vram/1024UL);
> >> +        drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
> >> +               stats.requested_visible_vram/1024UL);
> >> +        drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
> >> +               stats.vram_shared/1024UL);
> >> +    }
> >> +
> >>       for (hw_ip = 0; hw_ip < AMDGPU_HW_IP_NUM; ++hw_ip) {
> >>           if (!usage[hw_ip])
> >>               continue;
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists
  2024-05-03 13:47       ` Alex Deucher
@ 2024-05-03 14:01         ` Tvrtko Ursulin
  2024-05-03 14:43           ` Alex Deucher
  0 siblings, 1 reply; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-05-03 14:01 UTC (permalink / raw
  To: Alex Deucher; +Cc: Christian König, Tvrtko Ursulin, amd-gfx, kernel-dev


On 03/05/2024 14:47, Alex Deucher wrote:
> On Fri, May 3, 2024 at 3:50 AM Tvrtko Ursulin <tvrtko.ursulin@igalia.com> wrote:
>>
>>
>> On 02/05/2024 14:16, Christian König wrote:
>>> Am 30.04.24 um 19:27 schrieb Tvrtko Ursulin:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>
>>>> Do not emit the key-value pairs if the VRAM does not exist ie. VRAM
>>>> placement is not valid and accessible.
>>>
>>> Yeah, that's unfortunately rather misleading.
>>>
>>> Even APUs have VRAM or rather stolen system memory which is managed by
>>> the graphics driver.
>>>
>>> We only have a single compute model which really doesn't have VRAM at all.
>>
>> Hm what is misleading and how more precisely? :) Maybe in other words,
>> if is_app_apu is not the right criterion to know when TTM_PL_VRAM is
>> impossible, what is? Is the compute model you mention the only thing
>> which sets is_app_apu and uses the dummy vram manager?
> 
> Probably better to check if adev->gmc.real_vram_size is non-0.

Hmm "real VRAM" - will that handle APUs correctly?

I am looking at this:

	if (!adev->gmc.is_app_apu) {
		man->func = &amdgpu_vram_mgr_func;

		err = drm_buddy_init(&mgr->mm, man->size, PAGE_SIZE);
		if (err)
			return err;
	} else {
		man->func = &amdgpu_dummy_vram_mgr_func;
		DRM_INFO("Setup dummy vram mgr\n");
	}

And assuming that unless the dummy manager is used, TTM_PL_VRAM will be 
valid. Wrong assumption?

Regards,

Tvrtko


> Alex
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 29 +++++++++++++---------
>>>>    1 file changed, 17 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
>>>> index a09944104c41..603a5c010f5d 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
>>>> @@ -83,25 +83,30 @@ void amdgpu_show_fdinfo(struct drm_printer *p,
>>>> struct drm_file *file)
>>>>         */
>>>>        drm_printf(p, "pasid:\t%u\n", fpriv->vm.pasid);
>>>> -    drm_printf(p, "drm-memory-vram:\t%llu KiB\n", stats.vram/1024UL);
>>>>        drm_printf(p, "drm-memory-gtt: \t%llu KiB\n", stats.gtt/1024UL);
>>>>        drm_printf(p, "drm-memory-cpu: \t%llu KiB\n", stats.cpu/1024UL);
>>>> -    drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
>>>> -           stats.visible_vram/1024UL);
>>>> -    drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
>>>> -           stats.evicted_vram/1024UL);
>>>> -    drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
>>>> -           stats.evicted_visible_vram/1024UL);
>>>> -    drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
>>>> -           stats.requested_vram/1024UL);
>>>> -    drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
>>>> -           stats.requested_visible_vram/1024UL);
>>>>        drm_printf(p, "amd-requested-gtt:\t%llu KiB\n",
>>>>               stats.requested_gtt/1024UL);
>>>> -    drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
>>>> stats.vram_shared/1024UL);
>>>>        drm_printf(p, "drm-shared-gtt:\t%llu KiB\n",
>>>> stats.gtt_shared/1024UL);
>>>>        drm_printf(p, "drm-shared-cpu:\t%llu KiB\n",
>>>> stats.cpu_shared/1024UL);
>>>> +    if (!adev->gmc.is_app_apu) {
>>>> +        drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
>>>> +               stats.vram/1024UL);
>>>> +        drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
>>>> +               stats.visible_vram/1024UL);
>>>> +        drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
>>>> +               stats.evicted_vram/1024UL);
>>>> +        drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
>>>> +               stats.evicted_visible_vram/1024UL);
>>>> +        drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
>>>> +               stats.requested_vram/1024UL);
>>>> +        drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
>>>> +               stats.requested_visible_vram/1024UL);
>>>> +        drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
>>>> +               stats.vram_shared/1024UL);
>>>> +    }
>>>> +
>>>>        for (hw_ip = 0; hw_ip < AMDGPU_HW_IP_NUM; ++hw_ip) {
>>>>            if (!usage[hw_ip])
>>>>                continue;
>>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/5] Add capacity key to fdinfo
  2024-05-03 11:50           ` Tvrtko Ursulin
@ 2024-05-03 14:28             ` Alex Deucher
  2024-05-07  8:57               ` Tvrtko Ursulin
  0 siblings, 1 reply; 19+ messages in thread
From: Alex Deucher @ 2024-05-03 14:28 UTC (permalink / raw
  To: Tvrtko Ursulin; +Cc: Christian König, Tvrtko Ursulin, amd-gfx, kernel-dev

On Fri, May 3, 2024 at 7:50 AM Tvrtko Ursulin <tvrtko.ursulin@igalia.com> wrote:
>
>
> On 02/05/2024 16:00, Alex Deucher wrote:
> > On Thu, May 2, 2024 at 10:43 AM Tvrtko Ursulin
> > <tvrtko.ursulin@igalia.com> wrote:
> >>
> >>
> >> On 02/05/2024 14:07, Christian König wrote:
> >>> Am 01.05.24 um 15:27 schrieb Tvrtko Ursulin:
> >>>>
> >>>> Hi Alex,
> >>>>
> >>>> On 30/04/2024 19:32, Alex Deucher wrote:
> >>>>> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin@igalia.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> >>>>>>
> >>>>>> I have noticed AMD GPUs can have more than one "engine" (ring?) of
> >>>>>> the same type
> >>>>>> but amdgpu is not reporting that in fdinfo using the capacity engine
> >>>>>> tag.
> >>>>>>
> >>>>>> This series is therefore an attempt to improve that, but only an RFC
> >>>>>> since it is
> >>>>>> quite likely I got stuff wrong on the first attempt. Or if not wrong
> >>>>>> it may not
> >>>>>> be very beneficial in AMDs case.
> >>>>>>
> >>>>>> So I tried to figure out how to count and store the number of
> >>>>>> instances of an
> >>>>>> "engine" type and spotted that could perhaps be used in more than
> >>>>>> one place in
> >>>>>> the driver. I was more than a little bit confused by the ip_instance
> >>>>>> and uapi
> >>>>>> rings, then how rings are selected to context entities internally.
> >>>>>> Anyway..
> >>>>>> hopefully it is a simple enough series to easily spot any such large
> >>>>>> misses.
> >>>>>>
> >>>>>> End result should be that, assuming two "engine" instances, one
> >>>>>> fully loaded and
> >>>>>> one idle will only report client using 50% of that engine type.
> >>>>>
> >>>>> That would only be true if there are multiple instantiations of the IP
> >>>>> on the chip which in most cases is not true.  In most cases there is
> >>>>> one instance of the IP that can be fed from multiple rings. E.g. for
> >>>>> graphics and compute, all of the rings ultimately feed into the same
> >>>>> compute units on the chip.  So if you have a gfx ring and a compute
> >>>>> rings, you can schedule work to them asynchronously, but ultimately
> >>>>> whether they execute serially or in parallel depends on the actual
> >>>>> shader code in the command buffers and the extent to which it can
> >>>>> utilize the available compute units in the shader cores.
> >>>>
> >>>> This is the same as with Intel/i915. Fdinfo is not intended to provide
> >>>> utilisation of EUs and such, just how busy are the "entities" kernel
> >>>> submits to. So doing something like in this series would make the
> >>>> reporting more similar between the two drivers.
> >>>>
> >>>> I think both the 0-800% or 0-100% range (taking 8 ring compute as an
> >>>> example) can be misleading for different workloads. Neither <800% in
> >>>> the former means one can send more work and same for <100% in the latter.
> >>>
> >>> Yeah, I think that's what Alex tries to describe. By using 8 compute
> >>> rings your 800% load is actually incorrect and quite misleading.
> >>>
> >>> Background is that those 8 compute rings won't be active all at the same
> >>> time, but rather waiting on each other for resources.
> >>>
> >>> But this "waiting" is unfortunately considered execution time since the
> >>> used approach is actually not really capable of separating waiting and
> >>> execution time.
> >>
> >> Right, so 800% is what gputop could be suggesting today, by virtue of 8
> >> contexts/clients each being able to use 100% if they only use a subset of
> >> compute units. I was proposing to expose the capacity in fdinfo so it can
> >> be scaled down and then discussing how both situations have pros and cons.
> >>
> >>>> There is also a parallel with the CPU world here and hyper threading,
> >>>> if not wider, where "What does 100% actually mean?" is also wishy-washy.
> >>>>
> >>>> Also note that the reporting of actual time based values in fdinfo
> >>>> would not change with this series.
> >>>>
> >>>> Or if you can guide me towards how to distinguish real vs fake
> >>>> parallelism in HW IP blocks I could modify the series to only add
> >>>> capacity tags where there are truly independent blocks. That would be
> >>>> different from i915 though where I did not bother with that
> >>>> distinction. (For reasons that assignment of for instance EUs to
> >>>> compute "rings" (command streamers in i915) was supposed to be
> >>>> possible to re-configure on the fly. So it did not make sense to try
> >>>> and be super smart in fdinfo.)
> >>>
> >>> Well exactly that's the point we don't really have truly independent
> >>> blocks on AMD hardware.
> >>>
> >>> There are things like independent SDMA instances, but those are meant to
> >>> be used like the first instance for uploads and the second for downloads
> >>> etc.. When you use both instances for the same job they will pretty much
> >>> limit each other because of a single resource.
> >>
> >> So _never_ multiple instances of the same IP block? No video decode,
> >> encode, anything?
> >
> > Some chips have multiple encode/decode IP blocks that are actually
> > separate instances, however, we load balance between them so userspace
> > sees just one engine.  Also in some cases they are asymmetric (e.g.,
> > different sets of supported CODECs on each instance).  The driver
> > handles this by inspecting the command buffer and scheduling on the
> > appropriate instance based on the requested CODEC.  SDMA also supports
> > multiple IP blocks that are independent.
>
> Similar to i915, just that we don't inspect buffers but expose the
> instance capabilities, and userspace is responsible for setting up the
> load balancing engine with the correct physical mask.

How do you handle load balancing across applications?

>
> Anyway, back to the main point - are you interested at all in me adding
> the capacity flags, at least for the IP blocks which are probed to exist
> as more than a singleton? And if yes, could you please suggest how to do it.

I don't have a particularly strong opinion.  I should note that this
does not take into account time spent on user mode queues.  Since
those are in userspace, we don't currently have a way to track the
time on the engines.  We've been working with the HW/FW teams to try
and come up with a scheme to handle them.

>
> For instance should I use adev->sdma.num_instances,
> adev->uvd.num_uvd_inst, adev->vcn.num_vcn_inst,
> adev->jpeg.num_jpeg_inst? Or maybe adev->num_ip_blocks and count by
> hw_ip type?

The former.  The latter was how we had intended to handle these
situations, but it didn't work out as well as we had hoped in practice
due to quirks of the hardware implementations.

>
> Patch 3/5 interesting or not to skip all the empty array walking (in
> amdgpu_ctx_entity_time mostly)?

The first 3 patches seem like a nice clean up to me.

Thanks!

Alex

>
> Regards,
>
> Tvrtko
>
> >
> > Alex
> >
> >>
> >>>>> As for the UAPI portion of this, we generally expose a limited number
> >>>>> of rings to user space and then we use the GPU scheduler to load
> >>>>> balance between all of the available rings of a type to try and
> >>>>> extract as much parallelism as we can.
> >>>>
> >>>> The part I do not understand is the purpose of the ring argument in
> >>>> for instance drm_amdgpu_cs_chunk_ib. It appears userspace can create
> >>>> up to N scheduling entities using different ring id's, but internally
> >>>> they can map to 1:N same scheduler instances (depending on IP type,
> >>>> can be that each userspace ring maps to same N hw rings, or for rings
> >>>> with no drm sched load balancing userspace ring also does not appear
> >>>> to have a relation to the picked drm sched instance.).
> >>>>
> >>>> So I neither understand how this ring is useful, or how it does not
> >>>> create a problem for IP types which use drm_sched_pick_best. It
> >>>> appears even if userspace created two scheduling entities with
> >>>> different ring ids they could randomly map to same drm sched aka same
> >>>> hw ring, no?
> >>>
> >>> Yeah, that is correct. The multimedia instances have to use a "fixed"
> >>> load balancing because of lack of firmware support. That should have
> >>> been fixed by now but we never found time to actually validate it.
> >>
> >> Gotcha.
> >>
> >>> Regarding the "ring" parameter in CS, that is basically just for
> >>> backward compatibility with older userspace. E.g. that we don't map all
> >>> SDMA jobs to the same instance when only one context is used.
> >>
> >> I see. In that sense "limits" for compute in amdgpu_ctx_num_entities are
> >> arbitrary, or related to some old userspace expectation?
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> Tvrtko
> >>>>
> >>>>> Alex
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Tvrtko Ursulin (5):
> >>>>>>     drm/amdgpu: Cache number of rings per hw ip type
> >>>>>>     drm/amdgpu: Use cached number of rings from the
> >>>>>> AMDGPU_INFO_HW_IP_INFO
> >>>>>>       ioctl
> >>>>>>     drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
> >>>>>>     drm/amdgpu: Show engine capacity in fdinfo
> >>>>>>     drm/amdgpu: Only show VRAM in fdinfo if it exists
> >>>>>>
> >>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
> >>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
> >>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
> >>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
> >>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62
> >>>>>> +++-------------------
> >>>>>>    5 files changed, 49 insertions(+), 70 deletions(-)
> >>>>>>
> >>>>>> --
> >>>>>> 2.44.0
> >>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 5/5] drm/amdgpu: Only show VRAM in fdinfo if it exists
  2024-05-03 14:01         ` Tvrtko Ursulin
@ 2024-05-03 14:43           ` Alex Deucher
  0 siblings, 0 replies; 19+ messages in thread
From: Alex Deucher @ 2024-05-03 14:43 UTC (permalink / raw
  To: Tvrtko Ursulin; +Cc: Christian König, Tvrtko Ursulin, amd-gfx, kernel-dev

On Fri, May 3, 2024 at 10:01 AM Tvrtko Ursulin
<tvrtko.ursulin@igalia.com> wrote:
>
>
> On 03/05/2024 14:47, Alex Deucher wrote:
> > On Fri, May 3, 2024 at 3:50 AM Tvrtko Ursulin <tvrtko.ursulin@igalia.com> wrote:
> >>
> >>
> >> On 02/05/2024 14:16, Christian König wrote:
> >>> Am 30.04.24 um 19:27 schrieb Tvrtko Ursulin:
> >>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> >>>>
> >>>> Do not emit the key-value pairs if the VRAM does not exist ie. VRAM
> >>>> placement is not valid and accessible.
> >>>
> >>> Yeah, that's unfortunately rather misleading.
> >>>
> >>> Even APUs have VRAM or rather stolen system memory which is managed by
> >>> the graphics driver.
> >>>
> >>> We only have a single compute model which really doesn't have VRAM at all.
> >>
> >> Hm what is misleading and how more precisely? :) Maybe in other words,
> >> if is_app_apu is not the right criterion to know when TTM_PL_VRAM is
> >> impossible, what is? Is the compute model you mention the only thing
> >> which sets is_app_apu and uses the dummy vram manager?
> >
> > Probably better to check if adev->gmc.real_vram_size is non-0.
>
> Hmm "real VRAM" - will that handle APUs correctly?

Yes.  In the client APU case "VRAM" will be the UMA carve out region
reserved by the sbios.

>
> I am looking at this:
>
>         if (!adev->gmc.is_app_apu) {
>                 man->func = &amdgpu_vram_mgr_func;
>
>                 err = drm_buddy_init(&mgr->mm, man->size, PAGE_SIZE);
>                 if (err)
>                         return err;
>         } else {
>                 man->func = &amdgpu_dummy_vram_mgr_func;
>                 DRM_INFO("Setup dummy vram mgr\n");
>         }
>
> And assuming that unless the dummy manager is used, TTM_PL_VRAM will be
> valid. Wrong assumption?

Today checking is_app_apu would be fine, but it's supposed to
distinguish datacenter APUs, not whether or not the device has a
"VRAM" pool, but its usage has gotten sort of overloaded.  Just
want to avoid overloading what it means in too many more places.

Alex

>
> Regards,
>
> Tvrtko
>
>
> > Alex
> >
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>>
> >>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> >>>> ---
> >>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 29 +++++++++++++---------
> >>>>    1 file changed, 17 insertions(+), 12 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> >>>> index a09944104c41..603a5c010f5d 100644
> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> >>>> @@ -83,25 +83,30 @@ void amdgpu_show_fdinfo(struct drm_printer *p,
> >>>> struct drm_file *file)
> >>>>         */
> >>>>        drm_printf(p, "pasid:\t%u\n", fpriv->vm.pasid);
> >>>> -    drm_printf(p, "drm-memory-vram:\t%llu KiB\n", stats.vram/1024UL);
> >>>>        drm_printf(p, "drm-memory-gtt: \t%llu KiB\n", stats.gtt/1024UL);
> >>>>        drm_printf(p, "drm-memory-cpu: \t%llu KiB\n", stats.cpu/1024UL);
> >>>> -    drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
> >>>> -           stats.visible_vram/1024UL);
> >>>> -    drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
> >>>> -           stats.evicted_vram/1024UL);
> >>>> -    drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
> >>>> -           stats.evicted_visible_vram/1024UL);
> >>>> -    drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
> >>>> -           stats.requested_vram/1024UL);
> >>>> -    drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
> >>>> -           stats.requested_visible_vram/1024UL);
> >>>>        drm_printf(p, "amd-requested-gtt:\t%llu KiB\n",
> >>>>               stats.requested_gtt/1024UL);
> >>>> -    drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
> >>>> stats.vram_shared/1024UL);
> >>>>        drm_printf(p, "drm-shared-gtt:\t%llu KiB\n",
> >>>> stats.gtt_shared/1024UL);
> >>>>        drm_printf(p, "drm-shared-cpu:\t%llu KiB\n",
> >>>> stats.cpu_shared/1024UL);
> >>>> +    if (!adev->gmc.is_app_apu) {
> >>>> +        drm_printf(p, "drm-memory-vram:\t%llu KiB\n",
> >>>> +               stats.vram/1024UL);
> >>>> +        drm_printf(p, "amd-memory-visible-vram:\t%llu KiB\n",
> >>>> +               stats.visible_vram/1024UL);
> >>>> +        drm_printf(p, "amd-evicted-vram:\t%llu KiB\n",
> >>>> +               stats.evicted_vram/1024UL);
> >>>> +        drm_printf(p, "amd-evicted-visible-vram:\t%llu KiB\n",
> >>>> +               stats.evicted_visible_vram/1024UL);
> >>>> +        drm_printf(p, "amd-requested-vram:\t%llu KiB\n",
> >>>> +               stats.requested_vram/1024UL);
> >>>> +        drm_printf(p, "amd-requested-visible-vram:\t%llu KiB\n",
> >>>> +               stats.requested_visible_vram/1024UL);
> >>>> +        drm_printf(p, "drm-shared-vram:\t%llu KiB\n",
> >>>> +               stats.vram_shared/1024UL);
> >>>> +    }
> >>>> +
> >>>>        for (hw_ip = 0; hw_ip < AMDGPU_HW_IP_NUM; ++hw_ip) {
> >>>>            if (!usage[hw_ip])
> >>>>                continue;
> >>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC 0/5] Add capacity key to fdinfo
  2024-05-03 14:28             ` Alex Deucher
@ 2024-05-07  8:57               ` Tvrtko Ursulin
  0 siblings, 0 replies; 19+ messages in thread
From: Tvrtko Ursulin @ 2024-05-07  8:57 UTC (permalink / raw
  To: Alex Deucher; +Cc: Christian König, Tvrtko Ursulin, amd-gfx, kernel-dev


On 03/05/2024 15:28, Alex Deucher wrote:
> On Fri, May 3, 2024 at 7:50 AM Tvrtko Ursulin <tvrtko.ursulin@igalia.com> wrote:
>> On 02/05/2024 16:00, Alex Deucher wrote:
>>> On Thu, May 2, 2024 at 10:43 AM Tvrtko Ursulin
>>> <tvrtko.ursulin@igalia.com> wrote:
>>>>
>>>>
>>>> On 02/05/2024 14:07, Christian König wrote:
>>>>> Am 01.05.24 um 15:27 schrieb Tvrtko Ursulin:
>>>>>>
>>>>>> Hi Alex,
>>>>>>
>>>>>> On 30/04/2024 19:32, Alex Deucher wrote:
>>>>>>> On Tue, Apr 30, 2024 at 1:27 PM Tvrtko Ursulin <tursulin@igalia.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>>>>>>>
>>>>>>>> I have noticed AMD GPUs can have more than one "engine" (ring?) of
>>>>>>>> the same type
>>>>>>>> but amdgpu is not reporting that in fdinfo using the capacity engine
>>>>>>>> tag.
>>>>>>>>
>>>>>>>> This series is therefore an attempt to improve that, but only an RFC
>>>>>>>> since it is
>>>>>>>> quite likely I got stuff wrong on the first attempt. Or if not wrong
>>>>>>>> it may not
>>>>>>>> be very beneficial in AMDs case.
>>>>>>>>
>>>>>>>> So I tried to figure out how to count and store the number of
>>>>>>>> instances of an
>>>>>>>> "engine" type and spotted that could perhaps be used in more than
>>>>>>>> one place in
>>>>>>>> the driver. I was more than a little bit confused by the ip_instance
>>>>>>>> and uapi
>>>>>>>> rings, then how rings are selected to context entities internally.
>>>>>>>> Anyway..
>>>>>>>> hopefully it is a simple enough series to easily spot any such large
>>>>>>>> misses.
>>>>>>>>
>>>>>>>> End result should be that, assuming two "engine" instances, one
>>>>>>>> fully loaded and
>>>>>>>> one idle will only report client using 50% of that engine type.
>>>>>>>
>>>>>>> That would only be true if there are multiple instantiations of the IP
>>>>>>> on the chip which in most cases is not true.  In most cases there is
>>>>>>> one instance of the IP that can be fed from multiple rings. E.g. for
>>>>>>> graphics and compute, all of the rings ultimately feed into the same
>>>>>>> compute units on the chip.  So if you have a gfx ring and a compute
>>>>>>> rings, you can schedule work to them asynchronously, but ultimately
>>>>>>> whether they execute serially or in parallel depends on the actual
>>>>>>> shader code in the command buffers and the extent to which it can
>>>>>>> utilize the available compute units in the shader cores.
>>>>>>
>>>>>> This is the same as with Intel/i915. Fdinfo is not intended to provide
>>>>>> utilisation of EUs and such, just how busy are the "entities" kernel
>>>>>> submits to. So doing something like in this series would make the
>>>>>> reporting more similar between the two drivers.
>>>>>>
>>>>>> I think both the 0-800% or 0-100% range (taking 8 ring compute as an
>>>>>> example) can be misleading for different workloads. Neither <800% in
>>>>>> the former means one can send more work and same for <100% in the latter.
>>>>>
>>>>> Yeah, I think that's what Alex tries to describe. By using 8 compute
>>>>> rings your 800% load is actually incorrect and quite misleading.
>>>>>
>>>>> Background is that those 8 compute rings won't be active all at the same
>>>>> time, but rather waiting on each other for resources.
>>>>>
>>>>> But this "waiting" is unfortunately considered execution time since the
>>>>> used approach is actually not really capable of separating waiting and
>>>>> execution time.
>>>>
>>>> Right, so 800% is what gputop could be suggesting today, by virtue of 8
>>>> contexts/clients each being able to use 100% if they only use a subset of
>>>> compute units. I was proposing to expose the capacity in fdinfo so it can
>>>> be scaled down and then discussing how both situations have pros and cons.
>>>>
>>>>>> There is also a parallel with the CPU world here and hyper threading,
>>>>>> if not wider, where "What does 100% actually mean?" is also wishy-washy.
>>>>>>
>>>>>> Also note that the reporting of actual time based values in fdinfo
>>>>>> would not change with this series.
>>>>>>
>>>>>> Or if you can guide me towards how to distinguish real vs fake
>>>>>> parallelism in HW IP blocks I could modify the series to only add
>>>>>> capacity tags where there are truly independent blocks. That would be
>>>>>> different from i915 though where I did not bother with that
>>>>>> distinction. (For reasons that assignment of for instance EUs to
>>>>>> compute "rings" (command streamers in i915) was supposed to be
>>>>>> possible to re-configure on the fly. So it did not make sense to try
>>>>>> and be super smart in fdinfo.)
>>>>>
>>>>> Well exactly that's the point we don't really have truly independent
>>>>> blocks on AMD hardware.
>>>>>
>>>>> There are things like independent SDMA instances, but those are meant to
>>>>> be used like the first instance for uploads and the second for downloads
>>>>> etc.. When you use both instances for the same job they will pretty much
>>>>> limit each other because of a single resource.
>>>>
>>>> So _never_ multiple instances of the same IP block? No video decode,
>>>> encode, anything?
>>>
>>> Some chips have multiple encode/decode IP blocks that are actually
>>> separate instances, however, we load balance between them so userspace
>>> sees just one engine.  Also in some cases they are asymmetric (e.g.,
>>> different sets of supported CODECs on each instance).  The driver
>>> handles this by inspecting the command buffer and scheduling on the
>>> appropriate instance based on the requested CODEC.  SDMA also supports
>>> multiple IP blocks that are independent.
>>
>> Similar to i915, just that we don't inspect buffers but expose the
>> instance capabilities, and userspace is responsible for setting up the
>> load balancing engine with the correct physical mask.
> 
> How do you handle load balancing across applications?

From the uapi side of things: enumerate the hardware engines and their 
capabilities (which include information on the number of instances per 
class/type) and then create a submission queue composed of the instances 
compatible with the intended use.

From the i915 side it forks into two - before firmware scheduling and 
with it. In both cases scheduling is done a bit later than in amdgpu, not 
in the frontend but more in the backend, after the dma_resv/etc 
dependencies have been cleared.

For the former case we only got to have the primitive load-balancing 
which could be described as "first idle physical instance picks the next 
job from the global queue".

For the latter it is up to the firmware to do its magic but essentially 
it also boils down to picking something runnable from the global queue.

>> Anyway, back to the main point - are you interested at all in me adding
>> the capacity flags, at least for the IP blocks which are probed to exist
>> as more than a singleton? And if yes, could you please suggest how to do it.
> 
> I don't have a particularly strong opinion.  I should note that this
> does not take into account time spent on user mode queues.  Since
> those are in userspace, we don't currently have a way to track the
> time on the engines.  We've been working with the HW/FW teams to try
> and come up with a scheme to handle them.
> 
>>
>> For instance should I use adev->sdma.num_instances,
>> adev->uvd.num_uvd_inst, adev->vcn.num_vcn_inst,
>> adev->jpeg.num_jpeg_inst? Or maybe adev->num_ip_blocks and count by
>> hw_ip type?
> 
> The former.  The latter was how we had intended to handle these
> situations, but it didn't work out as well as we had hoped in practice
> due to quirks of the hardware implementations.

Got it, thanks!
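
Purely as an illustration of the direction (a hypothetical sketch, not
part of the posted series), deriving the capacity from those per-IP
instance counts and emitting it next to the existing drm-engine- keys
could look roughly like the below, assuming the p, adev and hw_ip locals
from the amdgpu_show_fdinfo() loop and its amdgpu_ip_name[] table:

        unsigned int capacity;

        /* Report capacity only for IP types with more than one instance. */
        switch (hw_ip) {
        case AMDGPU_HW_IP_DMA:
                capacity = adev->sdma.num_instances;
                break;
        case AMDGPU_HW_IP_VCN_DEC:
        case AMDGPU_HW_IP_VCN_ENC:
                capacity = adev->vcn.num_vcn_inst;
                break;
        case AMDGPU_HW_IP_VCN_JPEG:
                capacity = adev->jpeg.num_jpeg_inst;
                break;
        default:
                capacity = 1;
                break;
        }

        if (capacity > 1)
                drm_printf(p, "drm-engine-capacity-%s:\t%u\n",
                           amdgpu_ip_name[hw_ip], capacity);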

>> Patch 3/5 interesting or not to skip all the empty array walking (in
>> amdgpu_ctx_entity_time mostly)?
> 
> The first 3 patches seem like a nice clean up to me.

Hah thanks! But note that 2/5 has a potential functional change due to 
the sched.ready check/games. I've noticed amdgpu can set it to false at 
runtime, if it decides to disable the ring, in which case the existing 
query ioctl would reflect it but the version from the patch would not. I 
don't know if that is relevant or not in practice.
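
For reference, roughly the kind of re-counting the existing query ioctl
does (illustrative extract, SDMA case only):

        /* Ready rings are re-counted on every query, so a ring disabled
         * at runtime (sched.ready == false) drops out of the reported
         * count, while a number cached at init time would not. */
        for (i = 0; i < adev->sdma.num_instances; i++)
                if (adev->sdma.instance[i].ring.sched.ready)
                        ++num_rings;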

Regards,

Tvrtko

> Thanks!
> 
> Alex
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Alex
>>>
>>>>
>>>>>>> As for the UAPI portion of this, we generally expose a limited number
>>>>>>> of rings to user space and then we use the GPU scheduler to load
>>>>>>> balance between all of the available rings of a type to try and
>>>>>>> extract as much parallelism as we can.
>>>>>>
>>>>>> The part I do not understand is the purpose of the ring argument in
>>>>>> for instance drm_amdgpu_cs_chunk_ib. It appears userspace can create
>>>>>> up to N scheduling entities using different ring id's, but internally
>>>>>> they can map to 1:N same scheduler instances (depending on IP type,
>>>>>> can be that each userspace ring maps to same N hw rings, or for rings
>>>>>> with no drm sched load balancing userspace ring also does not appear
>>>>>> to have a relation to the picked drm sched instance.).
>>>>>>
>>>>>> So I neither understand how this ring is useful, or how it does not
>>>>>> create a problem for IP types which use drm_sched_pick_best. It
>>>>>> appears even if userspace created two scheduling entities with
>>>>>> different ring ids they could randomly map to same drm sched aka same
>>>>>> hw ring, no?
>>>>>
>>>>> Yeah, that is correct. The multimedia instances have to use a "fixed"
>>>>> load balancing because of lack of firmware support. That should have
>>>>> been fixed by now but we never found time to actually validate it.
>>>>
>>>> Gotcha.
>>>>
>>>>> Regarding the "ring" parameter in CS, that is basically just for
>>>>> backward compatibility with older userspace. E.g. that we don't map all
>>>>> SDMA jobs to the same instance when only one context is used.
>>>>
>>>> I see. In that sense "limits" for compute in amdgpu_ctx_num_entities are
>>>> arbitrary, or related to some old userspace expectation?
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tvrtko
>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Tvrtko Ursulin (5):
>>>>>>>>      drm/amdgpu: Cache number of rings per hw ip type
>>>>>>>>      drm/amdgpu: Use cached number of rings from the
>>>>>>>> AMDGPU_INFO_HW_IP_INFO
>>>>>>>>        ioctl
>>>>>>>>      drm/amdgpu: Skip not present rings in amdgpu_ctx_mgr_usage
>>>>>>>>      drm/amdgpu: Show engine capacity in fdinfo
>>>>>>>>      drm/amdgpu: Only show VRAM in fdinfo if it exists
>>>>>>>>
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  1 +
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  3 ++
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 +++++
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c | 39 +++++++++-----
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    | 62
>>>>>>>> +++-------------------
>>>>>>>>     5 files changed, 49 insertions(+), 70 deletions(-)
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.44.0
>>>>>

^ permalink raw reply	[flat|nested] 19+ messages in thread
