From: Naohiro Aota via Linux-erofs <linux-erofs@lists.ozlabs.org>
To: Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Cc: "linux-wireless@vger.kernel.org" <linux-wireless@vger.kernel.org>,
"linux-remoteproc@vger.kernel.org"
<linux-remoteproc@vger.kernel.org>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"platform-driver-x86@vger.kernel.org"
<platform-driver-x86@vger.kernel.org>,
"gfs2@lists.linux.dev" <gfs2@lists.linux.dev>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"oss-drivers@corigine.com" <oss-drivers@corigine.com>,
"target-devel@vger.kernel.org" <target-devel@vger.kernel.org>,
"samba-technical@lists.samba.org"
<samba-technical@lists.samba.org>,
"linux-cifs@vger.kernel.org" <linux-cifs@vger.kernel.org>,
"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
"linux-bcachefs@vger.kernel.org" <linux-bcachefs@vger.kernel.org>,
"iommu@lists.linux.dev" <iommu@lists.linux.dev>,
"linux-cachefs@redhat.com" <linux-cachefs@redhat.com>,
"open-iscsi@googlegroups.com" <open-iscsi@googlegroups.com>,
	"linux-media@vger.kernel.org" <linux-media@vger.kernel.org>,
"dm-devel@lists.linux.dev" <dm-devel@lists.linux.dev>,
"coreteam@netfilter.org" <coreteam@netfilter.org>,
"intel-gfx@lists.freedesktop.org"
<intel-gfx@lists.freedesktop.org>,
"virtualization@lists.linux.dev" <virtualization@lists.linux.dev>,
"nbd@other.debian.org" <nbd@other.debian.org>,
"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"linux-fscrypt@vger.kernel.org" <linux-fscrypt@vger.kernel.org>,
"ntb@lists.linux.dev" <ntb@lists.linux.dev>,
"linux-mediatek@lists.infradead.org"
<linux-mediatek@lists.infradead.org>,
"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>,
"linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
"linux-usb@vger.kernel.org" <linux-usb@vger.kernel.org>,
"linux-mmc@vger.kernel.org" <linux-mmc@vger.kernel.org>,
"linux-f2fs-devel@lists.sourceforge.net"
<linux-f2fs-devel@lists.sourceforge.net>,
"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
"linux-trace-kernel@vger.kernel.org"
<linux-trace-kernel@vger.kernel.org>,
"linux-erofs@lists.ozlabs.org" <linux-erofs@lists.ozlabs.org>,
"wireguard@lists.zx2c4.com" <wireguard@lists.zx2c4.com>
Subject: Performance drop due to alloc_workqueue() misuse and recent change
Date: Mon, 4 Dec 2023 16:03:47 +0000
Message-ID: <dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3>

Recently, commit 636b927eba5b ("workqueue: Make unbound workqueues to use
per-cpu pool_workqueues") changed the behavior of WQ_UNBOUND workqueues. It
changed the meaning of alloc_workqueue()'s max_active from an upper limit
imposed per NUMA node to a limit per CPU. As a result, a massive number of
workers can be running at the same time, especially if the workqueue user
assumes max_active is a global limit.

To be fair, the documentation already described max_active as a per-CPU
limit before the commit. However, several callers seem to misuse
max_active, apparently assuming it is a global limit, so the commit is an
unexpected behavior change for them.
For example, the following callers set max_active = num_online_cpus(),
which is a suspicious value to apply per CPU: with this configuration, up
to nr_cpu * nr_cpu works can be active at the same time.
fs/f2fs/data.c: sbi->post_read_wq = alloc_workqueue("f2fs_post_read_wq",
fs/f2fs/data.c- WQ_UNBOUND | WQ_HIGHPRI,
fs/f2fs/data.c- num_online_cpus());
fs/crypto/crypto.c: fscrypt_read_workqueue = alloc_workqueue("fscrypt_read_queue",
fs/crypto/crypto.c- WQ_UNBOUND | WQ_HIGHPRI,
fs/crypto/crypto.c- num_online_cpus());
fs/verity/verify.c: fsverity_read_workqueue = alloc_workqueue("fsverity_read_queue",
fs/verity/verify.c- WQ_HIGHPRI,
fs/verity/verify.c- num_online_cpus());
drivers/crypto/hisilicon/qm.c: qm->wq = alloc_workqueue("%s", WQ_HIGHPRI | WQ_MEM_RECLAIM |
drivers/crypto/hisilicon/qm.c- WQ_UNBOUND, num_online_cpus(),
drivers/crypto/hisilicon/qm.c- pci_name(qm->pdev));
block/blk-crypto-fallback.c: blk_crypto_wq = alloc_workqueue("blk_crypto_wq",
block/blk-crypto-fallback.c- WQ_UNBOUND | WQ_HIGHPRI |
block/blk-crypto-fallback.c- WQ_MEM_RECLAIM, num_online_cpus());
drivers/md/dm-crypt.c: cc->crypt_queue = alloc_workqueue("kcryptd/%s",
drivers/md/dm-crypt.c- WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
drivers/md/dm-crypt.c- num_online_cpus(), devname);
Furthermore, the change hurts performance in at least one case.

Btrfs creates several WQ_UNBOUND workqueues with a default max_active =
min(NRCPUS + 2, 8). As my machine has 96 CPUs with NUMA disabled, this
max_active config now allows over 700 active works to run. Before the
commit, the limit was 8 with NUMA disabled, or 16 with 2 NUMA nodes.
I reverted the workqueue code to before the commit and ran the following
fio command on a RAID0 btrfs filesystem spanning 6 SSDs.
fio --group_reporting --eta=always --eta-interval=30s --eta-newline=30s \
--rw=write --fallocate=none \
--direct=1 --ioengine=libaio --iodepth=32 \
--filesize=100G \
--blocksize=64k \
--time_based --runtime=300s \
--end_fsync=1 \
--directory=${MNT} \
--name=writer --numjobs=32
Varying the workqueue's max_active changes the result:
- wq max_active=8 (intended limit by btrfs?)
WRITE: bw=2495MiB/s (2616MB/s), 2495MiB/s-2495MiB/s (2616MB/s-2616MB/s), io=753GiB (808GB), run=308953-308953msec
- wq max_active=16 (actual limit on 2 NUMA nodes setup)
WRITE: bw=1736MiB/s (1820MB/s), 1736MiB/s-1736MiB/s (1820MB/s-1820MB/s), io=670GiB (720GB), run=395532-395532msec
- wq max_active=768 (simulating current limit)
WRITE: bw=1276MiB/s (1338MB/s), 1276MiB/s-1276MiB/s (1338MB/s-1338MB/s), io=375GiB (403GB), run=300984-300984msec
The current behavior is 27% slower than the previous limit (max_active=16),
and 50% slower than the intended limit (max_active=8). The performance drop
is likely due to lock contention among the btrfs-endio-write works: over
700 kworker instances were created, and around 100 works sat in the 'D'
state competing for a lock.
For confirmation, I also ran the same workload directly at the relevant
commits.
- At commit 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues")
WRITE: bw=1191MiB/s (1249MB/s), 1191MiB/s-1191MiB/s (1249MB/s-1249MB/s), io=350GiB (376GB), run=300714-300714msec
- At the previous commit = 4cbfd3de73 ("workqueue: Call wq_update_unbound_numa() on all CPUs in NUMA node on CPU hotplug")
WRITE: bw=1747MiB/s (1832MB/s), 1747MiB/s-1747MiB/s (1832MB/s-1832MB/s), io=748GiB (803GB), run=438134-438134msec
So, the commit results in a 31.8% performance drop for this workload.
In summary, several callers misuse max_active by treating it as a global
limit, and the recent commit introduced a huge performance drop in some
cases. We need to review alloc_workqueue() usage to check whether each
max_active setting is appropriate.
Thread overview: 3+ messages
2023-12-04 16:03 Naohiro Aota via Linux-erofs [this message]
2023-12-04 18:07 ` Performance drop due to alloc_workqueue() misuse and recent change  Tejun Heo
2023-12-20  7:14   ` Tejun Heo