All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: Wang Shanker <shankerwangmiao@gmail.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org, linux-raid@vger.kernel.org
Subject: Re: [Bug Report] Discard bios cannot be correctly merged in blk-mq
Date: Tue, 8 Jun 2021 23:49:04 +0800	[thread overview]
Message-ID: <C866C380-7A71-4722-957F-2CE65BDACF26@gmail.com> (raw)
In-Reply-To: <YL4Z/QJCKc0NCV5L@T590>

> 2021年06月07日 21:07,Ming Lei <ming.lei@redhat.com> 写道:
> 
> On Sun, Jun 06, 2021 at 04:54:09AM +0800, Wang Shanker wrote:
>> Hi, all
>> 
>> I'm writing to report my recent findings about the handling of discard 
>> operations. As indicated by a few tests, discard operation cannot be 
>> correctly merged, which leads to poor performance of RAID456 on discard
>> requests. I'm not quite familiar with block subsystem, so please correct
>> me if there are any mistakes in the following analysis.
>> 
>> In blk_discard_mergable(), we can see the handling of merging discard 
>> operations goes through different processes, decided by whether we have
>> more than one queue_max_discard_segments. If the device requires the 
>> sectors should be contiguous in one discard operation, the merging process
>> will be the same as that for normal read/write operations. Otherwise, 
>> bio_attempt_discard_merge will try to merge as many bios as the device
>> allows, ignoring the contiguity. Sadly, for both cases, there are problems.
>> 
>> For devices requiring contiguous sector ranges(such as scsi disks), 
>> bio_attempt_front_merge() or bio_attempt_back_merge() will be handling 
>> the merging process, and both control flows will arrive at 
>> ll_new_hw_segment(), where the following statement:
>> 
>>    req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req)
>> 
>> can never be true, since blk_rq_get_max_segments(req) will always be 1.
>> As a result, no discard operations shall be merged.
> 
> OK, that looks a bug, and the following change may fix the issue:
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 4d97fb6dd226..65210e9a8efa 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -559,10 +559,14 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
> static inline int ll_new_hw_segment(struct request *req, struct bio *bio,
> 		unsigned int nr_phys_segs)
> {
> -	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
> +	if (blk_integrity_merge_bio(req->q, req, bio) == false)
> 		goto no_merge;
> 
> -	if (blk_integrity_merge_bio(req->q, req, bio) == false)
> +	/* discard request merge won't add new segment */
> +	if (req_op(req) == REQ_OP_DISCARD)
> +		return 1;
> +
> +	if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))
> 		goto no_merge;
> 
> 	/*
> 

Many thanks for this patch. I wonder whether it will be applied soon or it will go 
through a formal [PATCH] process.

>> 
>> For devices supporting multiple segments of sector ranges, 
>> bio_attempt_discard_merge() will take over the process. Indeed it will merge
>> some bios. But how many bios can be merged into one request? In the 
>> implementation, the maximum number of bios is limited mainly by 
>> queue_max_discard_segments (also by blk_rq_get_max_sectors, but it's not where
>> the problem is). However, it is not the case, since bio_attempt_discard_merge
>> is not aware of the contiguity of bios. Suppose there are 20 contiguous bios.
>> They should be considered as only one segment instead 20 of them.
> 
> Right, so far ELEVATOR_DISCARD_MERGE doesn't merge bios actually, but it
> can be supported without much difficulty.
> 
>> 
>> You may wonder the importance of merging discard operations. In the 
>> implementation of RAID456, bios are committed in 4k trunks (they call
>> them as stripes in the code and the size is determined by DEFAULT_STRIPE_SIZE).
>> The proper merging of the bios is of vital importance for a reasonable 
>> operating performance of RAID456 devices. In fact, I met this problem
>> when attempting to create a raid5 volume on a bunch of Nvme SSDs enabling trim
>> support. Someone also reported similar issues in the linux-raid list
>> (https://www.spinics.net/lists/raid/msg62108.html). In that post, the author
>> reported that ``lots of small 4k discard requests that get merged into larger
>> 512k chunks submitted to devices". This can be explained by my above discovery
>> because nvme allows 128 segments at the maximum in a dsm instruction.
>> 
>> The above two scenarios can be reproduced utilizing latest QEMU, with emulated
>> scsi drives (for the first scenario) or nvme drives (for the second scenario)
>> and enabling the trace of scsi_disk_emulate_command_UNMAP or pci_nvme_dsm_deallocate.
>> The detailed process reproducing is as follows:
>> 
>> 1. create a rootfs (e.g. using debootstrap) under ./rootfs/ ;
>> 2. obtain a kernel image vmlinuz and generate a initramfs image initrd.img ;
>> 3. create 3 empty sparse disk images:
>>  # truncate -s 1T disk1 disk2 disk3
>> 4. using the following qemu command to start the guest vm (here 9p is used 
>> as the rootfs because we don't want the io operations on the rootfs influence
>> the debugging of the block layer of the guest vm)
>>  # qemu-system-x86_64 \
>>        -cpu kvm64 -machine pc,accel=kvm -smp cpus=2,cores=2,sockets=1 -m 2G  \
>>        -chardev stdio,mux=on,id=char0,signal=off  \
>>        -fsdev local,path=./rootfs,security_model=passthrough,id=rootfs \
>>        -device virtio-9p,fsdev=rootfs,mount_tag=rootfs \
>>        -monitor chardev:char0 \
>>        -device isa-serial,baudbase=1500000,chardev=char0,index=0,id=ttyS0  \
>>        -nographic \
>>        -kernel vmlinuz -initrd initrd.img  \
>>        -append 'root=rootfs rw rootfstype=9p rootflags=trans=virtio,msize=524288 console=ttyS0,1500000 nokaslr' \
>>        -blockdev driver=raw,node-name=nvme1,file.driver=file,file.filename=disk1 \
>>        -blockdev driver=raw,node-name=nvme2,file.driver=file,file.filename=disk2 \
>>        -blockdev driver=raw,node-name=nvme3,file.driver=file,file.filename=disk3 \
>>        -trace pci_nvme_dsm_deallocate,file=nvmetrace.log \
>>        -device nvme,drive=nvme1,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>>        -device nvme,drive=nvme2,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>>        -device nvme,drive=nvme3,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3
>> 5. enable trim support of the raid456 module:
>>  # modprobe raid456
>>  # echo Y > /sys/module/raid456/parameters/devices_handle_discard_safely
>> 6. using mdaam to create a raid5 device in the guest vm:
>>  # mdadm --create --level=5 --raid-devices=3 /dev/md/test /dev/nvme*n1
>> 7. and issue a discard request on the dm device: (limit the size of 
>> discard request because discarding all the 2T data is too slow)
>>  # blkdiscard -o 0 -l 1M -p 1M --verbose /dev/md/test
>> 8. in nvmetrace.log, there are many pci_nvme_dsm_deallocate events of 4k 
>> length (nlb 1).
> 
> 4kb should be the discard segment length, instead of discard request
> length, which should be 512k in the above test.

Actually, what are received by the nvme controller are discard requests
with 128 segments of 4k, instead of one segment of 512k.

> 
>> 
>> Similarly, the problem with scsi devices can be emulated using the following 
>> options for qemu:
>> 
>>        -device virtio-scsi,id=scsi \
>>        -device scsi-hd,drive=nvme1,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME1 \
>>        -device scsi-hd,drive=nvme2,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME2 \
>>        -device scsi-hd,drive=nvme3,bus=scsi.0,logical_block_size=4096,discard_granularity=2097152,physical_block_size=4096,serial=NVME3 \
>>        -trace scsi_disk_emulate_command_UNMAP,file=scsitrace.log
>> 
>> 
>> Despite the discovery, I cannot come up with a proper fix of this issue due
>> to my lack of familiarity of the block subsystem. I expect your kind feedback
>> on this. Thanks in advance.
> 
> In the above setting and raid456 test, I observe that rq->nr_phys_segments can
> reach 128, but queue_max_discard_segments() reports 256. So discard
> request size can be 512KB, which is the max size when you run 1MB discard on
> raid456. However, if the discard length on raid456 is increased, the
> current way will become inefficient.

Exactly. 

I suggest that bio's can be merged and be calculated as one segment if they are
contiguous and contain no data.

And I also discovered later that, even normal long write requests, e.g.
a 10m write, will be split into 4k bio's. The maximum number of bio's which can 
be merged into one request is limited by queue_max_segments, regardless
of whether those bio's are contiguous. In my test environment, for scsi devices,
queue_max_segments can be 254, which means about 1m size of requests. For nvme
devices(e.g. Intel DC P4610), queue_max_segments is only 33 since their mdts is 5,
which results in only 132k of requests. 

So, I would also suggest that raid456 should be improved to issue bigger bio's to
underlying drives.

Cheers,

Miao Wang


  reply	other threads:[~2021-06-08 15:50 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-05 20:54 [Bug Report] Discard bios cannot be correctly merged in blk-mq Wang Shanker
2021-06-05 22:38 ` antlists
2021-06-06  3:44   ` Wang Shanker
2021-06-07 13:07 ` Ming Lei
2021-06-08 15:49   ` Wang Shanker [this message]
2021-06-09  0:41     ` Ming Lei
2021-06-09  2:40       ` Wang Shanker
2021-06-09  8:44         ` Xiao Ni
2021-06-09  9:03           ` Wang Shanker
2021-06-18  6:28             ` Wang Shanker
2021-06-18 12:49               ` Xiao Ni
2021-06-21  7:49                 ` Wang Shanker
2021-06-22  1:48                   ` Xiao Ni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=C866C380-7A71-4722-957F-2CE65BDACF26@gmail.com \
    --to=shankerwangmiao@gmail.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.