* dm-crypt parallelization patches
  2013-03-28 20:45               ` Tejun Heo
@ 2013-04-09 17:51                 ` Mikulas Patocka
  2013-04-09 17:57                   ` Tejun Heo
  2013-04-09 18:36                   ` Vivek Goyal
  0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 17:51 UTC (permalink / raw
  To: Jens Axboe
  Cc: Vivek Goyal, Tejun Heo, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

Hi

I placed the dm-crypt parallelization patches at:
http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/

The patches parallelize dm-crypt and make it possible to use all processor
cores.


The patch dm-crypt-remove-percpu.patch removes some percpu variables and 
replaces them with per-request variables.

The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the 
encryption workqueue, allowing the encryption to be distributed to all 
CPUs in the system.
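
A minimal sketch of the idea (illustrative only - the exact flags and
max_active value are in the patch itself):

	cc->crypt_queue = alloc_workqueue("kcryptd",
					  WQ_UNBOUND | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
					  num_online_cpus());
	if (!cc->crypt_queue)
		goto bad;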

The patch dm-crypt-offload-writes-to-thread.patch moves submission of all 
write requests to a single thread.

The patch dm-crypt-sort-requests.patch sorts write requests submitted by a
single thread. The requests are sorted by sector number; an rb-tree is used
for efficient sorting.
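
Roughly, the sorting tree looks like this (a sketch with made-up names; the
real code is in the patch):

struct write_entry {		/* one queued write bio, keyed by sector */
	struct rb_node node;
	struct bio *bio;
};

static void insert_sorted(struct rb_root *root, struct write_entry *we)
{
	struct rb_node **p = &root->rb_node, *parent = NULL;

	while (*p) {
		struct write_entry *e = rb_entry(*p, struct write_entry, node);
		parent = *p;
		if (we->bio->bi_sector < e->bio->bi_sector)
			p = &(*p)->rb_left;
		else
			p = &(*p)->rb_right;
	}
	rb_link_node(&we->node, parent, p);
	rb_insert_color(&we->node, root);
}

The submitting thread then walks the tree with rb_first()/rb_next() and
issues the bios in sector order.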

Some usage notes:

* turn off automatic cpu frequency scaling (or set it to the "performance"
  governor) - cpufreq doesn't recognize the encryption workload correctly;
  sometimes it underclocks all the CPU cores when there is encryption
  work to do, resulting in bad performance

* when using a filesystem on an encrypted dm-crypt device, reduce the
  maximum request size with "/sys/block/dm-2/queue/max_sectors_kb"
  (substitute "dm-2" with the real name of your dm-crypt device). Note that
  requests that are too big mean there are only a few of them, so they
  cannot be distributed to all available processors in parallel - this
  results in worse performance. Requests that are too small result in high
  per-request overhead and also reduce performance. So you must find the
  optimal request size for your system and workload. For me, when testing
  this on a ramdisk, the optimum is 8KiB.

---

Now, the problem with the I/O scheduler: when doing performance testing, it
turns out that the parallel version is sometimes worse than the previous
implementation.

When I create a 4.3GiB dm-crypt device on top of dm-loop on top of an ext2
filesystem on a 15k SCSI disk and run this command

time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt 
--direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 
--name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 
--name=job12

the results are this:
CFQ scheduler:
--------------
no patches:
21.9s
patch 1:
21.7s
patches 1,2:
2:33s
patches 1,2 (+ nr_requests = 1280000)
2:18s
patches 1,2,3:
20.7s
patches 1,2,3,4:
20.7s

deadline scheduler:
-------------------
no patches:
27.4s
patch 1:
27.4s
patches 1,2:
27.8s
patches 1,2,3:
29.6s
patches 1,2,3,4:
29.6s


We can see that CFQ performs badly with patch 2, but improves with patch 3.
All that patch 3 does is move the submission of write requests from the
encryption threads to a separate thread.

So it seems that CFQ has a deficiency: it cannot merge adjacent requests
submitted by different processes.

The problem is this:
- we have a 256k direct-I/O write request
- it is broken into 4k bios (because we run on dm-loop on a filesystem with
  a 4k block size)
- encryption of these 4k bios is distributed to 12 processes on a 12-core
  machine
- encryption finishes out of order and in different processes; the 4k bios
  with encrypted data are submitted to CFQ
- CFQ doesn't merge them
- the disk is flooded with random 4k write requests, and it performs much
  worse than with 256k requests

Increasing nr_requests to 1280000 helps a little, but not much - it is
still an order of magnitude slower.

I'd like to ask if someone who knows the CFQ scheduler (Jens?) could look 
at it and find out why it doesn't merge requests from different processes. 

Why do I have to do a seemingly senseless operation (hand over write 
requests to a separate thread) in patch 3 to improve performance?

Mikulas


* Re: dm-crypt parallelization patches
  2013-04-09 17:51                 ` dm-crypt parallelization patches Mikulas Patocka
@ 2013-04-09 17:57                   ` Tejun Heo
  2013-04-09 18:08                     ` Mikulas Patocka
  2013-04-09 18:36                   ` Vivek Goyal
  1 sibling, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2013-04-09 17:57 UTC (permalink / raw
  To: Mikulas Patocka
  Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote:
> The patch dm-crypt-sort-requests.patch sorts write requests submitted by a 
> single thread. The requests are sorted according to the sector number, 
> rb-tree is used for efficient sorting.

Hmmm? Why not just keep the issuing order along with plugging
boundaries?

> So it seems that CFQ has some deficiency that it cannot merge adjacent 
> requests done by different processes.

As I wrote before, please use bio_associate_current().  Currently,
dm-crypt is completely messing up all the context information that cfq
depends on to schedule IOs.  Of course, it doesn't perform well.

Thanks.

-- 
tejun


* Re: dm-crypt parallelization patches
  2013-04-09 17:57                   ` Tejun Heo
@ 2013-04-09 18:08                     ` Mikulas Patocka
  2013-04-09 18:10                       ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 18:08 UTC (permalink / raw
  To: Tejun Heo
  Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt



On Tue, 9 Apr 2013, Tejun Heo wrote:

> On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote:
> > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a 
> > single thread. The requests are sorted according to the sector number, 
> > rb-tree is used for efficient sorting.
> 
> Hmmm? Why not just keep the issuing order along with plugging
> boundaries?

What do you mean?

I used to have a patch that keeps order of requests as they were 
introduced, but sorting the requests according to sector number is a bit 
simpler.

> > So it seems that CFQ has some deficiency that it cannot merge adjacent 
> > requests done by different processes.
> 
> As I wrote before, please use bio_associate_current().  Currently,
> dm-crypt is completely messing up all the context information that cfq
> depends on to schedule IOs.  Of course, it doesn't perform well.

bio_associate_current() is only valid on a system with cgroups and there 
are no cgroups on the kernel where I tested it. It is an empty function:

static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }

Mikulas

> Thanks.
> 
> -- 
> tejun
> 


* Re: dm-crypt parallelization patches
  2013-04-09 18:08                     ` Mikulas Patocka
@ 2013-04-09 18:10                       ` Tejun Heo
  2013-04-09 18:42                         ` Vivek Goyal
  2013-04-09 19:42                         ` Mikulas Patocka
  0 siblings, 2 replies; 22+ messages in thread
From: Tejun Heo @ 2013-04-09 18:10 UTC (permalink / raw
  To: Mikulas Patocka
  Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

Hey,

On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote:
> > Hmmm? Why not just keep the issuing order along with plugging
> > boundaries?
> 
> What do you mean?
> 
> I used to have a patch that keeps order of requests as they were 
> introduced, but sorting the requests according to sector number is a bit 
> simpler.

You're still destroying the context information.  Please just keep the
issuing order along with plugging boundaries.

> > As I wrote before, please use bio_associate_current().  Currently,
> > dm-crypt is completely messing up all the context information that cfq
> > depends on to schedule IOs.  Of course, it doesn't perform well.
> 
> bio_associate_current() is only valid on a system with cgroups and there 
> are no cgroups on the kernel where I tested it. It is an empty function:
> 
> static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }

Yeah, because blkcg was the only user.  Please feel free to drop the
ifdefs.  It covers both iocontext and cgroup association.

Thanks.

-- 
tejun


* Re: dm-crypt parallelization patches
  2013-04-09 17:51                 ` dm-crypt parallelization patches Mikulas Patocka
  2013-04-09 17:57                   ` Tejun Heo
@ 2013-04-09 18:36                   ` Vivek Goyal
  1 sibling, 0 replies; 22+ messages in thread
From: Vivek Goyal @ 2013-04-09 18:36 UTC (permalink / raw
  To: Mikulas Patocka
  Cc: Jens Axboe, Tejun Heo, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote:
> Hi
> 
> I placed the dm-crypt parallelization patches at:
> http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/
> 
> The patches parallelize dm-crypt and make it possible to use all processor 
> cores.
> 
> 
> The patch dm-crypt-remove-percpu.patch removes some percpu variables and 
> replaces them with per-request variables.
> 
> The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the 
> encryption workqueue, allowing the encryption to be distributed to all 
> CPUs in the system.
> 
> The patch dm-crypt-offload-writes-to-thread.patch moves submission of all 
> write requests to a single thread.
> 
> The patch dm-crypt-sort-requests.patch sorts write requests submitted by a 
> single thread. The requests are sorted according to the sector number, 
> rb-tree is used for efficient sorting.
> 
> Some usage notes:
> 
> * turn off automatic cpu frequency scaling (or set it to "performance"
>   governor) - cpufreq doesn't recognize encryption workload correctly, 
>   sometimes it underclocks all the CPU cores when there is some encryption 
>   work to do, resulting in bad performance
> 
> * when using filesystem on encrypted dm-crypt device, reduce maximum 
>   request size with "/sys/block/dm-2/queue/max_sectors_kb" (substitute 
>   "dm-2" with the real name of your dm-crypt device). Note that having too 
>   big requests means that there is a small number of requests and they 
>   cannot be distributed to all available processors in parallel - it 
>   results in worse performance. Having too small requests results in high 
>   request overhead and also reduced performance. So you must find the 
>   optimal request size for your system and workload. For me, when testing 
>   this on ramdisk, the optimal is 8KiB. 
> 
> ---
> 
> Now, the problem with I/O scheduler: when doing performance testing, it 
> turns out that the parallel version is sometimes worse than the previous 
> implementation.
> 
> When I create a 4.3GiB dm-crypt device on the top of dm-loop on the top of 
> ext2 filesystem on 15k SCSI disk and run this command
> 
> time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt 
> --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 
> --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11 
> --name=job12
> 
> the results are this:
> CFQ scheduler:
> --------------
> no patches:
> 21.9s
> patch 1:
> 21.7s
> patches 1,2:
> 2:33s
> patches 1,2 (+ nr_requests = 1280000)
> 2:18s
> patches 1,2,3:
> 20.7s
> patches 1,2,3,4:
> 20.7s
> 
> deadline scheduler:
> -------------------
> no patches:
> 27.4s
> patch 1:
> 27.4s
> patches 1,2:
> 27.8s
> patches 1,2,3:
> 29.6s
> patches 1,2,3,4:
> 29.6s
> 
> 
> We can see that CFQ performs badly with the patch 2, but improves with the 
> patch 3. All that patch 3 does is that it moves write requests from 
> encryption threads to a separate thread.
> 
> So it seems that CFQ has some deficiency that it cannot merge adjacent 
> requests done by different processes.
> 

CFQ does not merge requests across different cfq queues (cfqq). Each
queue is associated with one iocontext. So in this case each worker
thread is submitting its own bio, and each 4K bio must be going into a
separate cfqq, hence no merging is taking place.
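
In other words, the check CFQ effectively makes looks roughly like this
(simplified sketch, not the literal cfq-iosched.c code; the helpers are
made up):

/* a bio may only be merged into an existing request if both map to the
   same cfq_queue, i.e. came from the same io context */
static bool cfq_would_allow_merge(struct request *rq, struct bio *bio)
{
	struct cfq_queue *rq_cfqq  = cfqq_of_request(rq);	/* made-up helper */
	struct cfq_queue *bio_cfqq = cfqq_of_bio(bio);		/* made-up helper */

	return rq_cfqq == bio_cfqq;
}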

The moment you applied patch 3, where a single thread submitted bios,
each bio went into a single queue and could possibly get merged.

So either use a single thread to submit bios or, better, use
bio_associate_current() (as Tejun suggested) on the original 256K bio.
(Hopefully the bio's iocontext association information is retained when
you split the bio into smaller pieces.)

> The problem is this:
> - we have 256k write direct-i/o request
> - it is broken to 4k bios (because we run on dm-loop on a filesystem with 
>   4k block size)
> - encryption of these 4k bios is distributed to 12 processes on a 12-core 
>   machine
> - encryption finishes out of order and in different processes, 4k bios 
>   with encrypted data are submitted to CFQ
> - CFQ doesn't merge them
> - the disk is flooded with random 4k write requests, and performs much 
>   worse than with 256k requests
> 

Thanks
Vivek


* Re: dm-crypt parallelization patches
  2013-04-09 18:10                       ` Tejun Heo
@ 2013-04-09 18:42                         ` Vivek Goyal
  2013-04-09 18:57                           ` Tejun Heo
  2013-04-09 19:42                         ` Mikulas Patocka
  1 sibling, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2013-04-09 18:42 UTC (permalink / raw
  To: Tejun Heo
  Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

On Tue, Apr 09, 2013 at 11:10:31AM -0700, Tejun Heo wrote:
> Hey,
> 
> On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote:
> > > Hmmm? Why not just keep the issuing order along with plugging
> > > boundaries?
> > 
> > What do you mean?
> > 
> > I used to have a patch that keeps order of requests as they were 
> > introduced, but sorting the requests according to sector number is a bit 
> > simpler.
> 
> You're still destroying the context information.  Please just keep the
> issuing order along with plugging boundaries.

I guess the plugging boundary is more important than the issuing order, as
the block layer should take care of merging the bios and putting them in
the right order (attempt_plug_merge()).

But to make use of the plugging boundary, one would probably still need
submission from a single thread.

And if one is using a single thread for submission, one will still get
good performance (even if you are not using bio_associate_current()), as
by default all bios will go to the submitting thread's context.

Thanks
Vivek


* Re: dm-crypt parallelization patches
  2013-04-09 18:42                         ` Vivek Goyal
@ 2013-04-09 18:57                           ` Tejun Heo
  2013-04-09 19:13                             ` Vivek Goyal
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2013-04-09 18:57 UTC (permalink / raw
  To: Vivek Goyal
  Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

Hello,

On Tue, Apr 09, 2013 at 02:42:48PM -0400, Vivek Goyal wrote:
> I guess plugging boundary is more important than issuing order as 
> block layer should take care of merging the bio and put in right
> order (attempt_plug_merge()).

Yeah, the exact order probably doesn't affect things too much but it's
just a nice design principle to follow - if you're gonna step in in
the middle and meddle with requests, preserve as much context as
reasonably possible, and it's not like preserving that order is
difficult.

> But to make use of plugging boundary, one would probably still need
> submission using single thread.

It doesn't have to be a specific task.  Whoever finishes the last bio /
segment / whatever in the plugging domain can issue all of them.  I
probably am missing details but the overall mechanism can be pretty
simple.  Just keep the bios from the same plugging domain in the
received order along with an atomic counter and issue them all when
the counter hits zero.  No need to fiddle with sorting or whatever.
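
Something along these lines, perhaps (hypothetical sketch; locking and
error handling omitted):

struct plug_domain {
	struct bio_list bios;	/* encrypted clones, kept in received order */
	atomic_t pending;	/* encryption work still outstanding */
};

/* called when a clone is queued for encryption, i.e. in receive order */
static void plug_domain_add(struct plug_domain *pd, struct bio *clone)
{
	bio_list_add(&pd->bios, clone);
	atomic_inc(&pd->pending);
}

/* called from the encryption worker when one clone is finished */
static void plug_domain_done(struct plug_domain *pd)
{
	if (atomic_dec_and_test(&pd->pending)) {
		struct bio *bio;

		while ((bio = bio_list_pop(&pd->bios)))
			generic_make_request(bio);
	}
}

(In real code the counter would have to be biased until the whole plugging
domain has been received, so it cannot hit zero early.)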

> And if one is using single thread for submission, one will still get
> good performance (even if you are not using bio_associate_current()), as
> by default all bio will go to submitting thread's context.

And destroy all per-ioc and cgroup logic in the block layer in the
process.

Thanks.

-- 
tejun


* Re: dm-crypt parallelization patches
  2013-04-09 18:57                           ` Tejun Heo
@ 2013-04-09 19:13                             ` Vivek Goyal
  0 siblings, 0 replies; 22+ messages in thread
From: Vivek Goyal @ 2013-04-09 19:13 UTC (permalink / raw
  To: Tejun Heo
  Cc: Mikulas Patocka, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

On Tue, Apr 09, 2013 at 11:57:21AM -0700, Tejun Heo wrote:

[..]
> And destroy all per-ioc and cgroup logic in the block layer in the
> process.

Oh, I am in no way suggesting that you shouldn't use bio_associate_current().
I am just trying to analyze the performance issue right now and saying that,
as far as performance is concerned, one will get it back even if one does
not use bio_associate_current().

Yes, but one needs to use bio_associate_current() to make sure bios are
attributed to the right cgroup and associated with the right task. This
should help solve the long-standing issue of a task losing its ioprio if a
dm-crypt target is in the stack.
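
Roughly what that usage looks like from a target's map function (sketch;
handle_rest() is just a stand-in for the rest of the map path):

static int my_map(struct dm_target *ti, struct bio *bio)
{
	/*
	 * Must run in the submitter's context, before any workqueue
	 * handoff; -EBUSY just means an upper layer already did it.
	 */
	bio_associate_current(bio);

	return handle_rest(ti, bio);
}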

Thanks
Vivek


* Re: dm-crypt parallelization patches
  2013-04-09 18:10                       ` Tejun Heo
  2013-04-09 18:42                         ` Vivek Goyal
@ 2013-04-09 19:42                         ` Mikulas Patocka
  2013-04-09 19:52                           ` Tejun Heo
  1 sibling, 1 reply; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 19:42 UTC (permalink / raw
  To: Tejun Heo
  Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt



On Tue, 9 Apr 2013, Tejun Heo wrote:

> Hey,
> 
> On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote:
> > > Hmmm? Why not just keep the issuing order along with plugging
> > > boundaries?
> > 
> > What do you mean?
> > 
> > I used to have a patch that keeps order of requests as they were 
> > introduced, but sorting the requests according to sector number is a bit 
> > simpler.
> 
> You're still destroying the context information.  Please just keep the
> issuing order along with plugging boundaries.
>
> > > As I wrote before, please use bio_associate_current().  Currently,
> > > dm-crypt is completely messing up all the context information that cfq
> > > depends on to schedule IOs.  Of course, it doesn't perform well.
> > 
> > bio_associate_current() is only valid on a system with cgroups and there 
> > are no cgroups on the kernel where I tested it. It is an empty function:
> > 
> > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
> 
> Yeah, because blkcg was the only user.  Please feel free to drop the
> ifdefs.  It covers both iocontext and cgroup association.
> 
> Thanks.

If I drop ifdefs, it doesn't compile (because other cgroup stuff is
missing).

So I enabled bio cgroups.

bio_associate_current can't be used, because by the time we allocate the 
outgoing write bio, we are no longer in the process that submitted the 
original bio.

Anyway, I tried to reproduce in dm-crypt what bio_associate_current does -
in the submitting process I record the "ioc" and "css" fields in the
"dm_crypt_io" structure and set these fields on all outgoing bios. It has
no effect on performance; it is as bad as if I hadn't done it.

Mikulas

---
(this is the patch that I used, to be applied after 
dm-crypt-unbound-workqueue.patch)

---
 drivers/md/dm-crypt.c |   41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

Index: linux-3.8.6-fast/drivers/md/dm-crypt.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c	2013-04-09 20:32:41.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm-crypt.c	2013-04-09 21:29:12.000000000 +0200
@@ -20,6 +20,7 @@
 #include <linux/backing-dev.h>
 #include <linux/atomic.h>
 #include <linux/scatterlist.h>
+#include <linux/cgroup.h>
 #include <asm/page.h>
 #include <asm/unaligned.h>
 #include <crypto/hash.h>
@@ -60,6 +61,9 @@ struct dm_crypt_io {
 	int error;
 	sector_t sector;
 	struct dm_crypt_io *base_io;
+
+	struct io_context *ioc;
+	struct cgroup_subsys_state *css;
 };
 
 struct dm_crypt_request {
@@ -797,6 +801,14 @@ static struct bio *crypt_alloc_buffer(st
 	if (!clone)
 		return NULL;
 
+	if (unlikely(io->base_io != NULL)) {
+		clone->bi_ioc = io->base_io->ioc;
+		clone->bi_css = io->base_io->css;
+	} else {
+		clone->bi_ioc = io->ioc;
+		clone->bi_css = io->css;
+	}
+
 	clone_init(io, clone);
 	*out_of_pages = 0;
 
@@ -859,6 +871,9 @@ static struct dm_crypt_io *crypt_io_allo
 	io->ctx.req = NULL;
 	atomic_set(&io->io_pending, 0);
 
+	io->ioc = NULL;
+	io->css = NULL;
+
 	return io;
 }
 
@@ -884,6 +899,14 @@ static void crypt_dec_pending(struct dm_
 
 	if (io->ctx.req)
 		mempool_free(io->ctx.req, cc->req_pool);
+
+	if (io->ioc) {
+		put_io_context(io->ioc);
+	}
+	if (io->css) {
+		css_put(io->css);
+	}
+
 	mempool_free(io, cc->io_pool);
 
 	if (likely(!base_io))
@@ -927,6 +950,9 @@ static void crypt_endio(struct bio *clon
 	if (rw == WRITE)
 		crypt_free_buffer_pages(cc, clone);
 
+	clone->bi_ioc = NULL;
+	clone->bi_css = NULL;
+
 	bio_put(clone);
 
 	if (rw == READ && !error) {
@@ -1658,6 +1684,21 @@ static int crypt_map(struct dm_target *t
 
 	io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector));
 
+	if (current->io_context) {
+		struct io_context *ioc = current->io_context;
+		struct cgroup_subsys_state *css;
+
+		get_io_context_active(ioc);
+		io->ioc = ioc;
+
+		/* associate blkcg if exists */
+		rcu_read_lock();
+		css = task_subsys_state(current, blkio_subsys_id);
+		if (css && css_tryget(css))
+			io->css = css;
+		rcu_read_unlock();
+	}
+
 	if (bio_data_dir(io->base_bio) == READ) {
 		if (kcryptd_io_read(io, GFP_NOWAIT))
 			kcryptd_queue_io(io);


* Re: dm-crypt parallelization patches
  2013-04-09 19:42                         ` Mikulas Patocka
@ 2013-04-09 19:52                           ` Tejun Heo
  2013-04-09 20:32                             ` Mikulas Patocka
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2013-04-09 19:52 UTC (permalink / raw
  To: Mikulas Patocka
  Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote:
> If I drop ifdefs, it doesn't compile (because other cgroup stuff is 
> missing).
> 
> So I enabled bio cgroups.
> 
> bio_associate_current can't be used, because by the time we allocate the 
> outgoing write bio, we are no longer in the process that submitted the 
> original bio.

Oh, I suppose it'd need some massaging to selectively turn off the
cgroup part.

> Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - 

and we probably need to change that to bio_associate_task().

> in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" 
> structure and set these fields on all outgoing bios. It has no effect on 
> performance, it is as bad as if I hadn't done it.

A good way to verify that the tagging is correct would be configuring
io limits in the block cgroup and seeing whether the limits are correctly
applied when going through dm-crypt (please test with direct-io or
reads, writeback is horribly broken, sorry). If the tagging is working
correctly, maybe plugging is the overriding factor?

Thanks.

-- 
tejun


* Re: dm-crypt parallelization patches
  2013-04-09 19:52                           ` Tejun Heo
@ 2013-04-09 20:32                             ` Mikulas Patocka
  2013-04-09 21:02                               ` Tejun Heo
  2013-04-09 21:07                               ` Vivek Goyal
  0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 20:32 UTC (permalink / raw
  To: Tejun Heo
  Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt



On Tue, 9 Apr 2013, Tejun Heo wrote:

> On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote:
> > If I drop ifdefs, it doesn't compile (because other cgroup stuff is 
> > missing).
> > 
> > So I enabled bio cgroups.
> > 
> > bio_associate_current can't be used, because by the time we allocate the 
> > outgoing write bio, we are no longer in the process that submitted the 
> > original bio.
> 
> Oh, I suppose it'd need some massaging to selectively turn off the
> cgroup part.
> 
> > Anyway, I tried to reproduce in dm-crypt what bio_associate_current does - 
> 
> and we probably need to change that to bio_associate_task().

Generally, we shouldn't associate bios with the "current" task in device
mapper targets. For example, suppose that we have two stacked dm-crypt
targets:

The "current" process pointer in the lower dm-crypt target's request
function always points to the workqueue of the upper dm-crypt target that
submits the bios. So if we associate the bio with "current" in the lower
target, we are associating it with a preallocated workqueue and we have
already lost the information about who submitted it.

You should associate a bio with a task when you create the bio and "md" 
and "dm" midlayers should just forward this association to lower layer 
bios.

> > in the submitting process I record "ioc" and "css" fields in "dm_crypt_io" 
> > structure and set these fields on all outgoing bios. It has no effect on 
> > performance, it is as bad as if I hadn't done it.
> 
> A good way to verify that the tagging is correct would be configuring
> io limits in block cgroup and see whether the limits are correctly
> applied when going through dm-crypt (please test with direct-io or
> reads, writeback is horribly broken, sorry). If the tagging is working
> correctly, maybe plugging is the overriding factor?
> 
> Thanks.

It doesn't work because device mapper on the underlying layers ignores 
bi_ioc and bi_css.

If I make device mapper forward bi_ioc and bi_css to outgoing bios, it 
improves performance (from 2:30 to 1:30), but it is still far from 
perfect.

Mikulas

---

dm: forward cgroup context

This patch makes dm forward associated cgroup context to cloned bios.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/md/dm.c |    9 +++++++++
 fs/bio.c        |    2 ++
 2 files changed, 11 insertions(+)

Index: linux-3.8.6-fast/drivers/md/dm.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm.c	2013-04-09 22:00:36.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm.c	2013-04-09 22:19:40.000000000 +0200
@@ -453,6 +453,10 @@ static void free_io(struct mapped_device
 
 static void free_tio(struct mapped_device *md, struct dm_target_io *tio)
 {
+#ifdef CONFIG_BLK_CGROUP
+	tio->clone.bi_ioc = NULL;
+	tio->clone.bi_css = NULL;
+#endif
 	bio_put(&tio->clone);
 }
 
@@ -1124,6 +1128,11 @@ static struct dm_target_io *alloc_tio(st
 	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs);
 	tio = container_of(clone, struct dm_target_io, clone);
 
+#ifdef CONFIG_BLK_CGROUP
+	tio->clone.bi_ioc = ci->bio->bi_ioc;
+	tio->clone.bi_css = ci->bio->bi_css;
+#endif
+
 	tio->io = ci->io;
 	tio->ti = ti;
 	memset(&tio->info, 0, sizeof(tio->info));


* Re: dm-crypt parallelization patches
  2013-04-09 20:32                             ` Mikulas Patocka
@ 2013-04-09 21:02                               ` Tejun Heo
  2013-04-09 21:03                                 ` Tejun Heo
  2013-04-09 21:07                               ` Vivek Goyal
  1 sibling, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2013-04-09 21:02 UTC (permalink / raw
  To: Mikulas Patocka
  Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

Hey,

On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote:
> > and we probably need to change that to bio_associate_task().
> 
> Generally, we shouldn't associate bios with "current" task in device 
> mapper targets. For example suppose that we have two stacked dm-crypt 
> targets:

It only follows the first association so it doesn't matter how many
layers it goes through.  That said, yeah, there could be situations
where @task is available but the bio's already in the hands of a
different task.  If that's the case, change it to
associate_task(@task).

> It doesn't work because device mapper on the underlying layers ignores 
> bi_ioc and bi_css.
> 
> If I make device mapper forward bi_ioc and bi_css to outgoing bios, it 
> improves performance (from 2:30 to 1:30), but it is still far from 
> perfect.

For testing, copying bi_ioc and bi_css directly is fine but please add
another interface to copy those for the actual code.  Say,
bio_copy_association(@to_bio, @from_bio) or whatever.

As for the performance loss, I'm somewhat confident in saying the
remaining difference would be from ignoring plugging boundaries.

Thanks.

-- 
tejun


* Re: dm-crypt parallelization patches
  2013-04-09 21:02                               ` Tejun Heo
@ 2013-04-09 21:03                                 ` Tejun Heo
  0 siblings, 0 replies; 22+ messages in thread
From: Tejun Heo @ 2013-04-09 21:03 UTC (permalink / raw
  To: Mikulas Patocka
  Cc: Jens Axboe, Vivek Goyal, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

On Tue, Apr 09, 2013 at 02:02:01PM -0700, Tejun Heo wrote:
> For testing, copying bi_ioc and bi_css directly is fine but please add
> another interface to copy those for the actual code.  Say,
> bio_copy_association(@to_bio, @from_bio) or whatever.

Another and probably better possibility is just remembering the
issuing task (you would of course need to hold an extra ref as long as
you wanna use it) and use bio_associate_task() on it when creating new
bios.
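
For example (sketch; bio_associate_task() does not exist yet and would have
to be added next to bio_associate_current()):

/* in the map function, while still in the submitting process */
static void io_remember_submitter(struct dm_crypt_io *io)
{
	io->submitter = current;		/* new task_struct * field */
	get_task_struct(io->submitter);		/* hold a ref while we use it */
}

/* when creating a clone bio in the encryption worker */
static void io_tag_clone(struct dm_crypt_io *io, struct bio *clone)
{
	bio_associate_task(clone, io->submitter);
}

/* when the io is finally freed */
static void io_forget_submitter(struct dm_crypt_io *io)
{
	put_task_struct(io->submitter);
}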

Thanks.

-- 
tejun


* Re: dm-crypt parallelization patches
  2013-04-09 20:32                             ` Mikulas Patocka
  2013-04-09 21:02                               ` Tejun Heo
@ 2013-04-09 21:07                               ` Vivek Goyal
  2013-04-09 21:18                                 ` Mikulas Patocka
  1 sibling, 1 reply; 22+ messages in thread
From: Vivek Goyal @ 2013-04-09 21:07 UTC (permalink / raw
  To: Mikulas Patocka
  Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote:

[..]
> Generally, we shouldn't associate bios with "current" task in device 
> mapper targets. For example suppose that we have two stacked dm-crypt 
> targets:
>
> In the "current" process pointer in lower dm-crypt target's request 
> function always points to the workqueue of the upper dm-crypt target that 
> submits the bios. So if we associate the bio with "current" in the lower 
> target, we are associating it with a preallocated workqueue and we already 
> lost the information who submitted it.
> 
> You should associate a bio with a task when you create the bio and "md" 
> and "dm" midlayers should just forward this association to lower layer 
> bios.

bio_associate_current() returns -EBUSY if the bio has already been
associated with an io context.

So in a stack, if every driver calls bio_associate_current() upon bio
submission, it will automatically make sure the bio gets associated with
the submitter task in the top-level device, and calls by lower-level
devices will be ignored.

Lower-level devices, I think, just need to make sure this context
info is propagated to cloned bios.


[..]
> +#ifdef CONFIG_BLK_CGROUP
> +	tio->clone.bi_ioc = ci->bio->bi_ioc;
> +	tio->clone.bi_css = ci->bio->bi_css;

You also need to take references to the ioc and css objects. I guess a
helper function would be better. Maybe something like:

bio_associate_bio_context(bio1, bio2)

And this would initialize bio2's context with bio1's context.
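
A possible shape for such a helper (sketch; the naming and argument order
are illustrative):

#ifdef CONFIG_BLK_CGROUP
static void bio_associate_bio_context(struct bio *dst, struct bio *src)
{
	if (src->bi_ioc) {
		get_io_context_active(src->bi_ioc);
		dst->bi_ioc = src->bi_ioc;
	}
	if (src->bi_css && css_tryget(src->bi_css))
		dst->bi_css = src->bi_css;
}
#endif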

Thanks
Vivek


* Re: dm-crypt parallelization patches
  2013-04-09 21:07                               ` Vivek Goyal
@ 2013-04-09 21:18                                 ` Mikulas Patocka
  2013-04-10 19:24                                   ` Vivek Goyal
  0 siblings, 1 reply; 22+ messages in thread
From: Mikulas Patocka @ 2013-04-09 21:18 UTC (permalink / raw
  To: Vivek Goyal
  Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt



On Tue, 9 Apr 2013, Vivek Goyal wrote:

> On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote:
> 
> [..]
> > Generally, we shouldn't associate bios with "current" task in device 
> > mapper targets. For example suppose that we have two stacked dm-crypt 
> > targets:
> >
> > In the "current" process pointer in lower dm-crypt target's request 
> > function always points to the workqueue of the upper dm-crypt target that 
> > submits the bios. So if we associate the bio with "current" in the lower 
> > target, we are associating it with a preallocated workqueue and we already 
> > lost the information who submitted it.
> > 
> > You should associate a bio with a task when you create the bio and "md" 
> > and "dm" midlayers should just forward this association to lower layer 
> > bios.
> 
> bio_associate_current() return -EBUSY if bio has already been associated
> with an io context.
> 
> So in a stack if every driver calls bio_associate_current(), upon bio
> submission, it will automatically make sure bio gets associated with
> submitter task in top level device and calls by lower level device
> will be ignored.

The stacking drivers do not pass the same bio to each other.

The stacking driver receives a bio, allocates zero or more new bios and
sends these new bios to the lower layer. So you need to propagate
ownership from the received bio to the newly allocated bios; you don't
want to associate a newly allocated bio with the "current" process.

> Lower level devices I think just need to make sure this context
> info is propagated to cloned bios.
> 
> 
> [..]
> > +#ifdef CONFIG_BLK_CGROUP
> > +	tio->clone.bi_ioc = ci->bio->bi_ioc;
> > +	tio->clone.bi_css = ci->bio->bi_css;
> 
> You also need to take references to ioc and css objects. I guess a helper
> function will be better. May be something like.

The lifetime of the "tio" structure is shorter than the lifetime of 
"ci->bio". So we don't need to increment reference.

We only need to increment reference if we copy ownership to a new bio that 
has could have longer lifetime than the original bio. But this situation 
is very rare - in most stacking drivers the newly allocated bio has 
shorter lifetime than the original one.

> bio_associate_bio_context(bio1, bio2)
> 
> And this initialize bio2's context with bio1's context.

Yes, that would be ok.

> Thanks
> Vivek

Mikulas


* Re: dm-crypt parallelization patches
  2013-04-09 21:18                                 ` Mikulas Patocka
@ 2013-04-10 19:24                                   ` Vivek Goyal
  0 siblings, 0 replies; 22+ messages in thread
From: Vivek Goyal @ 2013-04-10 19:24 UTC (permalink / raw
  To: Mikulas Patocka
  Cc: Tejun Heo, Jens Axboe, Mike Snitzer, Milan Broz, dm-devel,
	Andi Kleen, dm-crypt, linux-kernel, Christoph Hellwig,
	Christian Schmidt

On Tue, Apr 09, 2013 at 05:18:25PM -0400, Mikulas Patocka wrote:

[..]
> > bio_associate_current() return -EBUSY if bio has already been associated
> > with an io context.
> > 
> > So in a stack if every driver calls bio_associate_current(), upon bio
> > submission, it will automatically make sure bio gets associated with
> > submitter task in top level device and calls by lower level device
> > will be ignored.
> 
> The stacking drivers do not pass the same bio to each other.
> 
> The stacking driver receives a bio, allocates zero or more new bios and 
> sends these new bios to the lower layer. So you need to propagate 
> ownership from the received bio to newly allocated bios, you don't want to 
> associate newly allocated bio with "current" process.

Actually, I was asking you to call bio_associate_current() for the incoming
bio in the driver, and not for the bios newly created by the driver.

For any bios newly created on behalf of this incoming bio, we need to
copy the context so that this context info can be propagated down the
stack.


> 
> > Lower level devices I think just need to make sure this context
> > info is propagated to cloned bios.
> > 
> > 
> > [..]
> > > +#ifdef CONFIG_BLK_CGROUP
> > > +	tio->clone.bi_ioc = ci->bio->bi_ioc;
> > > +	tio->clone.bi_css = ci->bio->bi_css;
> > 
> > You also need to take references to ioc and css objects. I guess a helper
> > function will be better. May be something like.
> 
> The lifetime of the "tio" structure is shorter than the lifetime of 
> "ci->bio". So we don't need to increment reference.
> 
> We only need to increment reference if we copy ownership to a new bio that 
> could have a longer lifetime than the original bio. But this situation 
> is very rare - in most stacking drivers the newly allocated bio has 
> shorter lifetime than the original one.

I think it is not a good idea to rely on the fact that cloned bios or
newly created bios will have a shorter lifetime than the original bio. In
fact, when the bio completes and you free it up, bio_disassociate_task()
will try to put the io context and blkcg references. So it is important to
take references if you are copying context info into any newly created bio.
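
For reference, this is roughly what the completion path does (paraphrased
from fs/bio.c of this era, not verbatim) - every associated bio drops its
references when it completes, so each copy must own its own:

void bio_disassociate_task(struct bio *bio)
{
	if (bio->bi_ioc) {
		put_io_context(bio->bi_ioc);
		bio->bi_ioc = NULL;
	}
	if (bio->bi_css) {
		css_put(bio->bi_css);
		bio->bi_css = NULL;
	}
}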

Thanks
Vivek


* dm-crypt parallelization patches
@ 2013-09-03 19:06 Mikulas Patocka
  2013-09-03 20:07 ` Andi Kleen
  2013-09-03 20:49 ` Milan Broz
  0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-09-03 19:06 UTC (permalink / raw
  To: Andi Kleen; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

Hi Andi

You wrote the original dm-crypt parallelization patch (commit
c029772125594e31eb1a5ad9e0913724ed9891f2). That patch improved
parallelization by encrypting data on the same CPU that submitted the
request.

I improved that so that encryption is performed by all processors: it uses
an unbound workqueue, so any processor can do the encryption. My patches
improve performance mainly when doing sequential reads - your original
patch parallelized sequential reads badly because most requests were
submitted by the same CPU.

I put my patch series at 
http://people.redhat.com/mpatocka/patches/kernel/dm-crypt-paralelizace/current/series.html

I'd appreciate it if you could look at it and ack it for inclusion in the
upstream kernel.

Mikulas


* Re: dm-crypt parallelization patches
  2013-09-03 19:06 dm-crypt parallelization patches Mikulas Patocka
@ 2013-09-03 20:07 ` Andi Kleen
  2013-09-11 23:03   ` Mikulas Patocka
  2013-09-03 20:49 ` Milan Broz
  1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2013-09-03 20:07 UTC (permalink / raw
  To: Mikulas Patocka; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

On Tue, Sep 03, 2013 at 03:06:03PM -0400, Mikulas Patocka wrote:
> Hi Andi
> 
> You wrote the original dm-crypt parallelization patch (commit 
> c029772125594e31eb1a5ad9e0913724ed9891f2). This patch improves 
> parallelization, it makes data being encrypted on the same CPU that 
> submitted the request.

The motivation for my patches was to make large systems with many
cores scale.

I think the motivation for yours is to make small systems without
hardware crypto offload slightly faster.

That's fine (although I suspect it's more and more obscure), as long as
you don't impact large-system scalability.

Can you do some tests on a larger system with very parallel IO
(many CPUs submitting) and see if the IO rate is equivalent?

If that's a given, the patches are ok for me.

-Andi


* Re: dm-crypt parallelization patches
  2013-09-03 19:06 dm-crypt parallelization patches Mikulas Patocka
  2013-09-03 20:07 ` Andi Kleen
@ 2013-09-03 20:49 ` Milan Broz
  1 sibling, 0 replies; 22+ messages in thread
From: Milan Broz @ 2013-09-03 20:49 UTC (permalink / raw
  To: Mikulas Patocka; +Cc: dm-devel, Andi Kleen, Milan Broz, Alasdair G. Kergon

On 09/03/2013 09:06 PM, Mikulas Patocka wrote:
> Hi Andi
> 
> You wrote the original dm-crypt parallelization patch (commit 
> c029772125594e31eb1a5ad9e0913724ed9891f2). This patch improves 
> parallelization, it makes data being encrypted on the same CPU that 
> submitted the request.

Hi Mikulas,

Do you have some recently reported performance problems with
the current dm-crypt code?

Please post numbers from your tests. Also include systems using AES-NI.

You know I tested your patchset about a year ago and the result was not a
clear improvement... So if this has changed, it is good news.

Milan


* Re: dm-crypt parallelization patches
  2013-09-03 20:07 ` Andi Kleen
@ 2013-09-11 23:03   ` Mikulas Patocka
  2013-09-11 23:33     ` Andi Kleen
  2013-09-12  3:51     ` Milan Broz
  0 siblings, 2 replies; 22+ messages in thread
From: Mikulas Patocka @ 2013-09-11 23:03 UTC (permalink / raw
  To: Andi Kleen; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon



On Tue, 3 Sep 2013, Andi Kleen wrote:

> On Tue, Sep 03, 2013 at 03:06:03PM -0400, Mikulas Patocka wrote:
> > Hi Andi
> > 
> > You wrote the original dm-crypt parallelization patch (commit 
> > c029772125594e31eb1a5ad9e0913724ed9891f2). This patch improves 
> > parallelization, it makes data being encrypted on the same CPU that 
> > submitted the request.
> 
> The motivation for my patches was to make large systems with many
> cores scale.
> 
> I think the motivation for yours is to make small systems without
> hardware crypto offload slightly faster.
> 
> That's fine (although I suspect it's more and more obscure), as long as
> you don't impact the large systems scalability.
> 
> Can you do some tests with a larger system with very parallel IO
> (many CPUs submitting)  and see if the IO rate is equivalent?
> 
> If that's given the patches are ok for me.
> 
> -Andi

The patches mainly help with sequential I/O (sequential I/O is normally
submitted from only a few CPUs, so the existing dm-crypt doesn't
parallelize it well). There is a decrease in performance when 12 threads
are doing writes on a ramdisk:

Sequential read of two SCSI disks in RAID-0 (throughput):
	raw read rate: 262MB/s stdev 2.9
	existing dm-crypt: 111MB/s stdev 19.4 (it varies greatly)
	new dm-crypt: 254MB/s stdev 1.5

Sequential read of ramdisk (throughput):
	raw read rate: 908MB/s stdev 7.7
	existing dm-crypt: 133MB/s stdev 0.8
	my new dm-crypt: 475MB/s stdev 6

fio --rw=randread --bs=4k --direct=1
Random read (4k block) on ramdisk from 12 threads (time in seconds):
	raw device: 2.32s stdev 0.02
	existing dm-crypt: 17.95s stdev 3.40
	new dm-crypt: 15.72s stdev 1.86

fio --rw=randwrite --bs=4k --direct=1
Random write (4k block) on ramdisk from 12 threads (time in seconds):
	raw device 3.91s stdev 0.01
	existing dm-crypt: 21.16s stdev 3.67
	new dm-crypt: 27.42s stdev 0.27

fio --rw=randrw --bs=4k --direct=1
Random read+write (4k block) on ramdisk from 12 threads (time in seconds):
	raw device: 3.09s stdev 0.01
	existing dm-crypt: 20.08s stdev 3.48
	new dm-crypt: 24.78s stdev 0.13

Tests were done on two six-core Istanbul-class Opterons.

I don't have a computer with AES-NI, so I can't test that.

Mikulas


* Re: dm-crypt parallelization patches
  2013-09-11 23:03   ` Mikulas Patocka
@ 2013-09-11 23:33     ` Andi Kleen
  2013-09-12  3:51     ` Milan Broz
  1 sibling, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2013-09-11 23:33 UTC (permalink / raw
  To: Mikulas Patocka; +Cc: dm-devel, Milan Broz, Alasdair G. Kergon

> The patches mainly help with sequential I/O (sequential I/O is normally
> submitted from only a few CPUs, so the existing dm-crypt doesn't
> parallelize it well). There is a decrease in performance when 12 threads
> are doing writes on a ramdisk:

That's a bit worrying. Could you please find a larger system (4S) in RH and
try with that?

I suspect you may need to limit the load balancing to smaller groups
of CPUs. Querying all other CPUs all the time is unlikely to scale.
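
If it comes to that, one way to experiment would be restricting the unbound
workqueue to a subset of CPUs (sketch; assumes a kernel with workqueue
attrs, i.e. apply_workqueue_attrs(), which appeared around v3.10):

static int restrict_crypt_queue(struct workqueue_struct *wq,
				const struct cpumask *allowed)
{
	struct workqueue_attrs *attrs;
	int ret;

	attrs = alloc_workqueue_attrs(GFP_KERNEL);
	if (!attrs)
		return -ENOMEM;

	cpumask_copy(attrs->cpumask, allowed);
	ret = apply_workqueue_attrs(wq, attrs);

	free_workqueue_attrs(attrs);
	return ret;
}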

-Andi


* Re: dm-crypt parallelization patches
  2013-09-11 23:03   ` Mikulas Patocka
  2013-09-11 23:33     ` Andi Kleen
@ 2013-09-12  3:51     ` Milan Broz
  1 sibling, 0 replies; 22+ messages in thread
From: Milan Broz @ 2013-09-12  3:51 UTC (permalink / raw
  To: Mikulas Patocka; +Cc: Ondrej Kozina, dm-devel, Andi Kleen, Alasdair G. Kergon

On 09/12/2013 01:03 AM, Mikulas Patocka wrote:
> 
> Tests were done on two six-core Istanbul-class Opterons.
> 
> I don't have a computer with AES-NI, so I can't test that.

AES-NI is used in almost all systems where dm-crypt is currently
deployed (typically notebooks)... so I think it must be tested there.

Also, ramdisk tests are a good baseline, but they are not a real-world
scenario.

I would really like to see the same test with RAID5 (over fast rotational
disks) with various stripe sizes, and with SSDs (and RAID0 over SSDs).
(These were the problematic scenarios reported in recent years that I know
about.)

Can you also show how the sorting patch influences performance?
(IOW how it performs without it.)

Another problematic part was latency (e.g. an SQL database over dm-crypt
doesn't perform well). Isn't it even worse with these patches?
(I am not sure if anyone has an fio script to simulate such a workload.)

I would like to test it myself but I have no usable fast-enough test systems
at all currently. But there are surely some in RH.
(Added cc to Ondra, he has some test scripts I used.)

I also remember that I tried your patches (a year ago) with limited CPU
workqueues (I mean using e.g. only 3 of 6 cores) and they performed better
in some tests. This supports Andi's suggestion to limit it to smaller
groups of CPUs.

Also, I think we discussed that this should be configurable through
optional dm-crypt per-device parameters (e.g. I would like to set up some
dm-crypt devices to use only one CPU). Think about a backend for virtual
machines etc.; it should not eat all available CPU power on the host...
Do you plan to add such a patch as well?

Milan

