* [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
@ 2024-03-19 11:59 Sergey Khimich
From: Sergey Khimich @ 2024-03-19 11:59 UTC
To: linux-kernel
Cc: linux-mmc, Ulf Hansson, Adrian Hunter, Shawn Lin, Jyan Chou,
Asutosh Das, Ritesh Harjani
Hello!
This is an implementation of SDHCI CQE support for the sdhci-of-dwcmshc driver.
To enable CQE support, just set 'supports-cqe' in your device tree for the
appropriate mmc node.
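For reference, a device-tree fragment along these lines should do it (a sketch only; the node label and the other properties are placeholders, not taken from this series):

```dts
&emmc {
	/* ... board-specific properties ... */
	bus-width = <8>;
	non-removable;
	/* Opt in to Command Queuing Engine support */
	supports-cqe;
};
```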
Also, while implementing CQE support for the driver, I ran into a problem
which I will describe below.
According to the IP block documentation, CQE works only in "ADMA2 only"
mode, which is activated only when v4 mode is enabled. I see in the
dwcmshc_probe() function that v4 mode gets enabled only for the
'sdhci_dwcmshc_bf3_pdata' platform data.
So my question is: is it correct to enable v4 mode for all platform data
if the 'SDHCI_CAN_64BIT_V4' bit is set in hardware?
I'm afraid that enabling v4 mode on some platforms could break them.
On the other hand, if the host controller says it can do v4
(caps & SDHCI_CAN_64BIT_V4), let's do v4, or disable it manually via a
quirk. Anyway - RFC.
v2:
- Added a dwcmshc-specific cqe_disable hook to prevent losing an
in-flight cmd when an ioctl is issued and cqe_disable is called;
- Added handling of the 128MB boundary for the host memory data buffer size
and the data buffer. To implement this handling, an extra
callback is added to struct 'sdhci_ops'.
- Fixed a typo.
v3:
- Fix warning reported by kernel test robot:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309270807.VoVn81m6-lkp@intel.com/
| Closes: https://lore.kernel.org/oe-kbuild-all/202309300806.dcR19kcE-lkp@intel.com/
v4:
- Data reset moved to custom driver tuning hook.
- Removed unnecessary dwcmshc_sdhci_cqe_disable() func
- Removed unnecessary dwcmshc_cqhci_set_tran_desc. Export and use
cqhci_set_tran_desc() instead.
- Provide a hook for cqhci_set_tran_desc() instead of cqhci_prep_tran_desc().
- Fix typo: int_clok_disable --> int_clock_disable
v5:
- Fix warning reported by kernel test robot:
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312301130.itEZhhI5-lkp@intel.com/
v6:
- Rebase to master branch
- Fix typo;
- Fix double blank line;
- Add cqhci_suspend() and cqhci_resume() functions
to support mmc suspend-to-ram (s2r);
- Move reading DWCMSHC_P_VENDOR_AREA2 register under "supports-cqe"
condition as not all IPs have that register;
- Remove sdhci V4 mode from the list of prerequisites to init cqhci.
v7:
- Add disabling of the MMC_CAP2_CQE and MMC_CAP2_CQE_DCMD caps
in case CQE init fails, to prevent problems in the suspend/resume
functions.
Sergey Khimich (2):
mmc: cqhci: Add cqhci set_tran_desc() callback
mmc: sdhci-of-dwcmshc: Implement SDHCI CQE support
drivers/mmc/host/Kconfig | 1 +
drivers/mmc/host/cqhci-core.c | 11 +-
drivers/mmc/host/cqhci.h | 4 +
drivers/mmc/host/sdhci-of-dwcmshc.c | 191 +++++++++++++++++++++++++++-
4 files changed, 202 insertions(+), 5 deletions(-)
--
2.30.2
* Re: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
@ 2024-03-20 10:36 Maxim Kiselev
From: Maxim Kiselev @ 2024-03-20 10:36 UTC
To: serghox
Cc: adrian.hunter, jyanchou, open list, linux-mmc, quic_asutoshd,
ritesh.list, shawn.lin, Ulf Hansson
Hi Sergey, Adrian!
First of all, I want to thank Sergey for supporting the CQE feature
on the DWC MSHC controller.
I tested this series on the LicheePi 4A board (TH1520 SoC).
It has the DWC MSHC IP too, and according to the T-Head datasheet
it also supports the CQE feature.
> Supports Command Queuing Engine (CQE) and compliant with eMMC CQ HCI.
So, to enable CQE on the LicheePi 4A, one only needs to set the property
in the DT and add an IRQ handler to th1520_ops:
> .irq = dwcmshc_cqe_irq_handler,
And then CQE works for the TH1520 SoC too.
But when I enabled CQE, I ran into a strange effect:
the fio benchmark shows that the eMMC works ~2.5x slower with CQE enabled -
219MB/s without CQE vs 87.4MB/s with CQE. I'll put the logs below.
I would very much appreciate it if you could point me to where to look
for the bottleneck.
Without CQE:
# cat /sys/kernel/debug/mmc0/ios
clock: 198000000 Hz
actual clock: 198000000 Hz
vdd: 21 (3.3 ~ 3.4 V)
bus mode: 2 (push-pull)
chip select: 0 (don't care)
power mode: 2 (on)
bus width: 3 (8 bits)
timing spec: 10 (mmc HS400 enhanced strobe)
signal voltage: 1 (1.80 V)
driver type: 0 (driver type B)
# fio --filename=/dev/mmcblk0 --direct=1 --rw=randread --bs=1M
--ioengine=sync --iodepth=256 --size=4G --numjobs=1 --group_reporting
--name=iops-test-job --eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W)
1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=256
fio-3.34
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue
depth will be capped at 1
Jobs: 1 (f=1): [r(1)][15.0%][r=209MiB/s][r=209 IOPS][eta 00m:17s]
Jobs: 1 (f=1): [r(1)][25.0%][r=208MiB/s][r=208 IOPS][eta 00m:15s]
Jobs: 1 (f=1): [r(1)][35.0%][r=207MiB/s][r=207 IOPS][eta 00m:13s]
Jobs: 1 (f=1): [r(1)][47.4%][r=208MiB/s][r=208 IOPS][eta 00m:10s]
Jobs: 1 (f=1): [r(1)][52.6%][r=209MiB/s][r=208 IOPS][eta 00m:09s]
Jobs: 1 (f=1): [r(1)][63.2%][r=208MiB/s][r=208 IOPS][eta 00m:07s]
Jobs: 1 (f=1): [r(1)][68.4%][r=208MiB/s][r=207 IOPS][eta 00m:06s]
Jobs: 1 (f=1): [r(1)][78.9%][r=207MiB/s][r=207 IOPS][eta 00m:04s]
Jobs: 1 (f=1): [r(1)][89.5%][r=209MiB/s][r=209 IOPS][eta 00m:02s]
Jobs: 1 (f=1): [r(1)][100.0%][r=209MiB/s][r=209 IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=1): err= 0: pid=132: Thu Jan 1 00:03:44 1970
read: IOPS=208, BW=208MiB/s (219MB/s)(4096MiB/19652msec)
clat (usec): min=3882, max=11557, avg=4778.37, stdev=238.26
lat (usec): min=3883, max=11559, avg=4779.93, stdev=238.26
clat percentiles (usec):
| 1.00th=[ 4359], 5.00th=[ 4555], 10.00th=[ 4555], 20.00th=[ 4621],
| 30.00th=[ 4621], 40.00th=[ 4686], 50.00th=[ 4752], 60.00th=[ 4817],
| 70.00th=[ 4883], 80.00th=[ 4948], 90.00th=[ 5014], 95.00th=[ 5145],
| 99.00th=[ 5473], 99.50th=[ 5538], 99.90th=[ 5932], 99.95th=[ 6915],
| 99.99th=[11600]
bw ( KiB/s): min=208896, max=219136, per=100.00%, avg=213630.77,
stdev=1577.33, samples=39
iops : min= 204, max= 214, avg=208.56, stdev= 1.55, samples=39
lat (msec) : 4=0.39%, 10=99.58%, 20=0.02%
cpu : usr=0.38%, sys=13.04%, ctx=4132, majf=0, minf=275
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=208MiB/s (219MB/s), 208MiB/s-208MiB/s (219MB/s-219MB/s),
io=4096MiB (4295MB), run=19652-19652msec
Disk stats (read/write):
mmcblk0: ios=8181/0, merge=0/0, ticks=25682/0, in_queue=25682, util=99.66%
With CQE:
fio --filename=/dev/mmcblk1 --direct=1 --rw=randread --bs=1M --ioengine=sync
--iodepth=256 --size=4G --numjobs=1 --group_reporting --name=iops-test-job
--eta-newline=1 --readonly
iops-test-job: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W)
1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=256
fio-3.34
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue
depth will be capped at 1
Jobs: 1 (f=1): [r(1)][5.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:49s]
Jobs: 1 (f=1): [r(1)][10.0%][r=84.0MiB/s][r=84 IOPS][eta 00m:45s]
Jobs: 1 (f=1): [r(1)][14.0%][r=83.1MiB/s][r=83 IOPS][eta 00m:43s]
Jobs: 1 (f=1): [r(1)][18.0%][r=83.1MiB/s][r=83 IOPS][eta 00m:41s]
Jobs: 1 (f=1): [r(1)][22.4%][r=84.1MiB/s][r=84 IOPS][eta 00m:38s]
Jobs: 1 (f=1): [r(1)][26.5%][r=83.1MiB/s][r=83 IOPS][eta 00m:36s]
Jobs: 1 (f=1): [r(1)][30.6%][r=83.1MiB/s][r=83 IOPS][eta 00m:34s]
Jobs: 1 (f=1): [r(1)][34.7%][r=84.1MiB/s][r=84 IOPS][eta 00m:32s]
Jobs: 1 (f=1): [r(1)][38.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:30s]
Jobs: 1 (f=1): [r(1)][42.9%][r=83.1MiB/s][r=83 IOPS][eta 00m:28s]
Jobs: 1 (f=1): [r(1)][46.9%][r=84.1MiB/s][r=84 IOPS][eta 00m:26s]
Jobs: 1 (f=1): [r(1)][51.0%][r=83.0MiB/s][r=83 IOPS][eta 00m:24s]
Jobs: 1 (f=1): [r(1)][55.1%][r=83.0MiB/s][r=83 IOPS][eta 00m:22s]
Jobs: 1 (f=1): [r(1)][59.2%][r=84.1MiB/s][r=84 IOPS][eta 00m:20s]
Jobs: 1 (f=1): [r(1)][63.3%][r=83.0MiB/s][r=83 IOPS][eta 00m:18s]
Jobs: 1 (f=1): [r(1)][67.3%][r=83.1MiB/s][r=83 IOPS][eta 00m:16s]
Jobs: 1 (f=1): [r(1)][71.4%][r=84.1MiB/s][r=84 IOPS][eta 00m:14s]
Jobs: 1 (f=1): [r(1)][75.5%][r=83.0MiB/s][r=83 IOPS][eta 00m:12s]
Jobs: 1 (f=1): [r(1)][79.6%][r=83.0MiB/s][r=83 IOPS][eta 00m:10s]
Jobs: 1 (f=1): [r(1)][83.7%][r=84.0MiB/s][r=84 IOPS][eta 00m:08s]
Jobs: 1 (f=1): [r(1)][87.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:06s]
Jobs: 1 (f=1): [r(1)][91.8%][r=83.0MiB/s][r=83 IOPS][eta 00m:04s]
Jobs: 1 (f=1): [r(1)][95.9%][r=84.0MiB/s][r=84 IOPS][eta 00m:02s]
Jobs: 1 (f=1): [r(1)][100.0%][r=83.0MiB/s][r=83 IOPS][eta 00m:00s]
iops-test-job: (groupid=0, jobs=1): err= 0: pid=134: Thu Jan 1 00:02:19 1970
read: IOPS=83, BW=83.3MiB/s (87.4MB/s)(4096MiB/49154msec)
clat (usec): min=11885, max=14840, avg=11981.37, stdev=61.89
lat (usec): min=11887, max=14843, avg=11983.00, stdev=61.92
clat percentiles (usec):
| 1.00th=[11863], 5.00th=[11994], 10.00th=[11994], 20.00th=[11994],
| 30.00th=[11994], 40.00th=[11994], 50.00th=[11994], 60.00th=[11994],
| 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
| 99.00th=[12125], 99.50th=[12256], 99.90th=[12387], 99.95th=[12387],
| 99.99th=[14877]
bw ( KiB/s): min=83800, max=86016, per=100.00%, avg=85430.61,
stdev=894.16, samples=98
iops : min= 81, max= 84, avg=83.22, stdev= 0.89, samples=98
lat (msec) : 20=100.00%
cpu : usr=0.00%, sys=5.44%, ctx=4097, majf=0, minf=274
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256
Run status group 0 (all jobs):
READ: bw=83.3MiB/s (87.4MB/s), 83.3MiB/s-83.3MiB/s (87.4MB/s-87.4MB/s),
io=4096MiB (4295MB), run=49154-49154msec
Disk stats (read/write):
mmcblk1: ios=8181/0, merge=0/0, ticks=69682/0, in_queue=69682, util=99.96%
Best regards,
Maksim
* Re: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
From: Adrian Hunter @ 2024-03-21 6:40 UTC
To: Maxim Kiselev, serghox
Cc: jyanchou, open list, linux-mmc, quic_asutoshd, ritesh.list,
shawn.lin, Ulf Hansson
On 20/03/24 12:36, Maxim Kiselev wrote:
> Subject: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
>
> Hi Sergey, Adrian!
>
> First of all, I want to thank Sergey for supporting the CQE feature
> on the DWC MSHC controller.
>
> I tested this series on the LicheePi 4A board (TH1520 SoC).
> It has the DWC MSHC IP too, and according to the T-Head datasheet
> it also supports the CQE feature.
>
>> Supports Command Queuing Engine (CQE) and compliant with eMMC CQ HCI.
>
> So, to enable CQE on the LicheePi 4A, one only needs to set the property
> in the DT and add an IRQ handler to th1520_ops:
>> .irq = dwcmshc_cqe_irq_handler,
>
> And then CQE works for the TH1520 SoC too.
>
> But when I enabled CQE, I ran into a strange effect:
> the fio benchmark shows that the eMMC works ~2.5x slower with CQE enabled -
> 219MB/s without CQE vs 87.4MB/s with CQE. I'll put the logs below.
>
> I would very much appreciate it if you could point me to where to look
> for the bottleneck.
Some things you could try:
Check for any related kernel messages.
Have a look at /sys/kernel/debug/mmc*/err_stats
See if disabling runtime PM for the host controller has any effect.
Enable mmc dynamic debug messages and see if anything looks different.
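Concretely, those checks look roughly like this (a sketch; "mmc0" and the "<mmc-host>" device name are placeholders to adjust for the board):

```
# Related kernel messages
dmesg | grep -iE 'mmc|sdhci|cqhci'

# Per-host error statistics
cat /sys/kernel/debug/mmc0/err_stats

# Disable runtime PM for the host controller (device name is board-specific)
echo on > /sys/bus/platform/devices/<mmc-host>/power/control

# Enable mmc dynamic debug messages (requires CONFIG_DYNAMIC_DEBUG)
echo 'file drivers/mmc/* +p' > /sys/kernel/debug/dynamic_debug/control
```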
>
> Without CQE:
>
> # cat /sys/kernel/debug/mmc0/ios
> clock: 198000000 Hz
> actual clock: 198000000 Hz
> vdd: 21 (3.3 ~ 3.4 V)
> bus mode: 2 (push-pull)
> chip select: 0 (don't care)
> power mode: 2 (on)
> bus width: 3 (8 bits)
> timing spec: 10 (mmc HS400 enhanced strobe)
> signal voltage: 1 (1.80 V)
> driver type: 0 (driver type B)
>
> # fio --filename=/dev/mmcblk0 --direct=1 --rw=randread --bs=1M
> --ioengine=sync --iodepth=256 --size=4G --numjobs=1 --group_reporting
> --name=iops-test-job --eta-newline=1 --readonly
> iops-test-job: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W)
> 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=256
> fio-3.34
> Starting 1 process
> note: both iodepth >= 1 and synchronous I/O engine are selected, queue
> depth will be capped at 1
> Jobs: 1 (f=1): [r(1)][15.0%][r=209MiB/s][r=209 IOPS][eta 00m:17s]
> Jobs: 1 (f=1): [r(1)][25.0%][r=208MiB/s][r=208 IOPS][eta 00m:15s]
> Jobs: 1 (f=1): [r(1)][35.0%][r=207MiB/s][r=207 IOPS][eta 00m:13s]
> Jobs: 1 (f=1): [r(1)][47.4%][r=208MiB/s][r=208 IOPS][eta 00m:10s]
> Jobs: 1 (f=1): [r(1)][52.6%][r=209MiB/s][r=208 IOPS][eta 00m:09s]
> Jobs: 1 (f=1): [r(1)][63.2%][r=208MiB/s][r=208 IOPS][eta 00m:07s]
> Jobs: 1 (f=1): [r(1)][68.4%][r=208MiB/s][r=207 IOPS][eta 00m:06s]
> Jobs: 1 (f=1): [r(1)][78.9%][r=207MiB/s][r=207 IOPS][eta 00m:04s]
> Jobs: 1 (f=1): [r(1)][89.5%][r=209MiB/s][r=209 IOPS][eta 00m:02s]
> Jobs: 1 (f=1): [r(1)][100.0%][r=209MiB/s][r=209 IOPS][eta 00m:00s]
> iops-test-job: (groupid=0, jobs=1): err= 0: pid=132: Thu Jan 1 00:03:44 1970
> read: IOPS=208, BW=208MiB/s (219MB/s)(4096MiB/19652msec)
> clat (usec): min=3882, max=11557, avg=4778.37, stdev=238.26
> lat (usec): min=3883, max=11559, avg=4779.93, stdev=238.26
> clat percentiles (usec):
> | 1.00th=[ 4359], 5.00th=[ 4555], 10.00th=[ 4555], 20.00th=[ 4621],
> | 30.00th=[ 4621], 40.00th=[ 4686], 50.00th=[ 4752], 60.00th=[ 4817],
> | 70.00th=[ 4883], 80.00th=[ 4948], 90.00th=[ 5014], 95.00th=[ 5145],
> | 99.00th=[ 5473], 99.50th=[ 5538], 99.90th=[ 5932], 99.95th=[ 6915],
> | 99.99th=[11600]
> bw ( KiB/s): min=208896, max=219136, per=100.00%, avg=213630.77,
> stdev=1577.33, samples=39
> iops : min= 204, max= 214, avg=208.56, stdev= 1.55, samples=39
> lat (msec) : 4=0.39%, 10=99.58%, 20=0.02%
> cpu : usr=0.38%, sys=13.04%, ctx=4132, majf=0, minf=275
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=256
>
> Run status group 0 (all jobs):
> READ: bw=208MiB/s (219MB/s), 208MiB/s-208MiB/s (219MB/s-219MB/s),
> io=4096MiB (4295MB), run=19652-19652msec
>
> Disk stats (read/write):
> mmcblk0: ios=8181/0, merge=0/0, ticks=25682/0, in_queue=25682, util=99.66%
>
>
> With CQE:
Was the output of "cat /sys/kernel/debug/mmc0/ios" the same?
>
> fio --filename=/dev/mmcblk1 --direct=1 --rw=randread --bs=1M --ioengine=sync
> --iodepth=256 --size=4G --numjobs=1 --group_reporting --name=iops-test-job
> --eta-newline=1 --readonly
> iops-test-job: (g=0): rw=randread, bs=(R) 1024KiB-1024KiB, (W)
> 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=256
> fio-3.34
> Starting 1 process
> note: both iodepth >= 1 and synchronous I/O engine are selected, queue
> depth will be capped at 1
> Jobs: 1 (f=1): [r(1)][5.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:49s]
> Jobs: 1 (f=1): [r(1)][10.0%][r=84.0MiB/s][r=84 IOPS][eta 00m:45s]
> Jobs: 1 (f=1): [r(1)][14.0%][r=83.1MiB/s][r=83 IOPS][eta 00m:43s]
> Jobs: 1 (f=1): [r(1)][18.0%][r=83.1MiB/s][r=83 IOPS][eta 00m:41s]
> Jobs: 1 (f=1): [r(1)][22.4%][r=84.1MiB/s][r=84 IOPS][eta 00m:38s]
> Jobs: 1 (f=1): [r(1)][26.5%][r=83.1MiB/s][r=83 IOPS][eta 00m:36s]
> Jobs: 1 (f=1): [r(1)][30.6%][r=83.1MiB/s][r=83 IOPS][eta 00m:34s]
> Jobs: 1 (f=1): [r(1)][34.7%][r=84.1MiB/s][r=84 IOPS][eta 00m:32s]
> Jobs: 1 (f=1): [r(1)][38.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:30s]
> Jobs: 1 (f=1): [r(1)][42.9%][r=83.1MiB/s][r=83 IOPS][eta 00m:28s]
> Jobs: 1 (f=1): [r(1)][46.9%][r=84.1MiB/s][r=84 IOPS][eta 00m:26s]
> Jobs: 1 (f=1): [r(1)][51.0%][r=83.0MiB/s][r=83 IOPS][eta 00m:24s]
> Jobs: 1 (f=1): [r(1)][55.1%][r=83.0MiB/s][r=83 IOPS][eta 00m:22s]
> Jobs: 1 (f=1): [r(1)][59.2%][r=84.1MiB/s][r=84 IOPS][eta 00m:20s]
> Jobs: 1 (f=1): [r(1)][63.3%][r=83.0MiB/s][r=83 IOPS][eta 00m:18s]
> Jobs: 1 (f=1): [r(1)][67.3%][r=83.1MiB/s][r=83 IOPS][eta 00m:16s]
> Jobs: 1 (f=1): [r(1)][71.4%][r=84.1MiB/s][r=84 IOPS][eta 00m:14s]
> Jobs: 1 (f=1): [r(1)][75.5%][r=83.0MiB/s][r=83 IOPS][eta 00m:12s]
> Jobs: 1 (f=1): [r(1)][79.6%][r=83.0MiB/s][r=83 IOPS][eta 00m:10s]
> Jobs: 1 (f=1): [r(1)][83.7%][r=84.0MiB/s][r=84 IOPS][eta 00m:08s]
> Jobs: 1 (f=1): [r(1)][87.8%][r=83.1MiB/s][r=83 IOPS][eta 00m:06s]
> Jobs: 1 (f=1): [r(1)][91.8%][r=83.0MiB/s][r=83 IOPS][eta 00m:04s]
> Jobs: 1 (f=1): [r(1)][95.9%][r=84.0MiB/s][r=84 IOPS][eta 00m:02s]
> Jobs: 1 (f=1): [r(1)][100.0%][r=83.0MiB/s][r=83 IOPS][eta 00m:00s]
> iops-test-job: (groupid=0, jobs=1): err= 0: pid=134: Thu Jan 1 00:02:19 1970
> read: IOPS=83, BW=83.3MiB/s (87.4MB/s)(4096MiB/49154msec)
> clat (usec): min=11885, max=14840, avg=11981.37, stdev=61.89
> lat (usec): min=11887, max=14843, avg=11983.00, stdev=61.92
> clat percentiles (usec):
> | 1.00th=[11863], 5.00th=[11994], 10.00th=[11994], 20.00th=[11994],
> | 30.00th=[11994], 40.00th=[11994], 50.00th=[11994], 60.00th=[11994],
> | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
> | 99.00th=[12125], 99.50th=[12256], 99.90th=[12387], 99.95th=[12387],
> | 99.99th=[14877]
> bw ( KiB/s): min=83800, max=86016, per=100.00%, avg=85430.61,
> stdev=894.16, samples=98
> iops : min= 81, max= 84, avg=83.22, stdev= 0.89, samples=98
> lat (msec) : 20=100.00%
> cpu : usr=0.00%, sys=5.44%, ctx=4097, majf=0, minf=274
> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=4096,0,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=256
>
> Run status group 0 (all jobs):
> READ: bw=83.3MiB/s (87.4MB/s), 83.3MiB/s-83.3MiB/s (87.4MB/s-87.4MB/s),
> io=4096MiB (4295MB), run=49154-49154msec
>
> Disk stats (read/write):
> mmcblk1: ios=8181/0, merge=0/0, ticks=69682/0, in_queue=69682, util=99.96%
>
>
> Best regards,
> Maksim
* Re: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
From: Christian Loehle @ 2024-03-22 14:07 UTC
To: Maxim Kiselev, serghox
Cc: adrian.hunter, jyanchou, open list, linux-mmc, quic_asutoshd,
ritesh.list, shawn.lin, Ulf Hansson
On 20/03/2024 10:36, Maxim Kiselev wrote:
> Subject: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
>
> Hi Sergey, Adrian!
>
> First of all, I want to thank Sergey for supporting the CQE feature
> on the DWC MSHC controller.
>
> I tested this series on the LicheePi 4A board (TH1520 SoC).
> It has the DWC MSHC IP too, and according to the T-Head datasheet
> it also supports the CQE feature.
>
>> Supports Command Queuing Engine (CQE) and compliant with eMMC CQ HCI.
>
> So, to enable CQE on the LicheePi 4A, one only needs to set the property
> in the DT and add an IRQ handler to th1520_ops:
>> .irq = dwcmshc_cqe_irq_handler,
>
> And then CQE works for the TH1520 SoC too.
>
> But when I enabled CQE, I ran into a strange effect:
> the fio benchmark shows that the eMMC works ~2.5x slower with CQE enabled -
> 219MB/s without CQE vs 87.4MB/s with CQE. I'll put the logs below.
>
> I would very much appreciate it if you could point me to where to look
> for the bottleneck.
>
> Without CQE:
I would also suspect some bus issue here; reading out ios or ext_csd
after enabling CQE could be helpful.
OTOH, the CQE could just be limiting the frequency, which you wouldn't be
able to see without a scope. Does the TRM say anything about that?
Are you limited to <100MB/s with CQE for HS400 (non-ES) and HS200, too?
What about sequential reads with a smaller bs, like 256K sequential?
FWIW, your fio invocation should at best be on par with non-CQE performance,
since you only ever have one I/O in flight, i.e. no CQE performance
improvement is possible; see your warning:
> both iodepth >= 1 and synchronous I/O engine are selected, queue
> depth will be capped at 1
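For a comparison where CQE could actually show a gain, an asynchronous engine with a real queue depth would be needed, along these lines (device path, block size, and queue depth are assumptions, not values from this thread):

```
fio --filename=/dev/mmcblk1 --direct=1 --rw=randread --bs=4k \
    --ioengine=libaio --iodepth=32 --numjobs=1 --size=1G \
    --group_reporting --name=cqe-qd32-job --readonly
```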
Kind Regards,
Christian
* Re: [PATCH v7 0/2] mmc: sdhci-of-dwcmshc: Add CQE support
From: Ulf Hansson @ 2024-03-25 13:19 UTC
To: Sergey Khimich
Cc: linux-kernel, linux-mmc, Adrian Hunter, Shawn Lin, Jyan Chou,
Asutosh Das, Ritesh Harjani
On Tue, 19 Mar 2024 at 12:59, Sergey Khimich <serghox@gmail.com> wrote:
>
> Hello!
>
> This is an implementation of SDHCI CQE support for the sdhci-of-dwcmshc driver.
> To enable CQE support, just set 'supports-cqe' in your device tree
> for the appropriate mmc node.
>
> Also, while implementing CQE support for the driver, I ran into a problem
> which I will describe below.
> According to the IP block documentation, CQE works only in "ADMA2 only"
> mode, which is activated only when v4 mode is enabled. I see in the
> dwcmshc_probe() function that v4 mode gets enabled only for the
> 'sdhci_dwcmshc_bf3_pdata' platform data.
>
> So my question is: is it correct to enable v4 mode for all platform data
> if the 'SDHCI_CAN_64BIT_V4' bit is set in hardware?
>
> I'm afraid that enabling v4 mode on some platforms could break them.
> On the other hand, if the host controller says it can do v4
> (caps & SDHCI_CAN_64BIT_V4), let's do v4, or disable it manually via a
> quirk. Anyway - RFC.
>
>
> v2:
> - Added a dwcmshc-specific cqe_disable hook to prevent losing an
> in-flight cmd when an ioctl is issued and cqe_disable is called;
>
> - Added handling of the 128MB boundary for the host memory data buffer size
> and the data buffer. To implement this handling, an extra
> callback is added to struct 'sdhci_ops'.
>
> - Fixed a typo.
>
> v3:
> - Fix warning reported by kernel test robot:
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202309270807.VoVn81m6-lkp@intel.com/
> | Closes: https://lore.kernel.org/oe-kbuild-all/202309300806.dcR19kcE-lkp@intel.com/
>
> v4:
> - Data reset moved to custom driver tuning hook.
> - Removed unnecessary dwcmshc_sdhci_cqe_disable() func
> - Removed unnecessary dwcmshc_cqhci_set_tran_desc. Export and use
> cqhci_set_tran_desc() instead.
> - Provide a hook for cqhci_set_tran_desc() instead of cqhci_prep_tran_desc().
> - Fix typo: int_clok_disable --> int_clock_disable
>
> v5:
> - Fix warning reported by kernel test robot:
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202312301130.itEZhhI5-lkp@intel.com/
>
> v6:
> - Rebase to master branch
> - Fix typo;
> - Fix double blank line;
> - Add cqhci_suspend() and cqhci_resume() functions
> to support mmc suspend-to-ram (s2r);
> - Move reading DWCMSHC_P_VENDOR_AREA2 register under "supports-cqe"
> condition as not all IPs have that register;
> - Remove sdhci V4 mode from the list of prerequisites to init cqhci.
>
> v7:
> - Add disabling of the MMC_CAP2_CQE and MMC_CAP2_CQE_DCMD caps
> in case CQE init fails, to prevent problems in the suspend/resume
> functions.
>
> Sergey Khimich (2):
> mmc: cqhci: Add cqhci set_tran_desc() callback
> mmc: sdhci-of-dwcmshc: Implement SDHCI CQE support
>
> drivers/mmc/host/Kconfig | 1 +
> drivers/mmc/host/cqhci-core.c | 11 +-
> drivers/mmc/host/cqhci.h | 4 +
> drivers/mmc/host/sdhci-of-dwcmshc.c | 191 +++++++++++++++++++++++++++-
> 4 files changed, 202 insertions(+), 5 deletions(-)
>
Applied for next, fixing a minor conflict when applying, thanks!
Kind regards
Uffe