* [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers
From: Leo Yan @ 2021-05-19 14:03 UTC
To: Arnaldo Carvalho de Melo, Adrian Hunter, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel
Cc: Leo Yan

The AUX ring buffer's head and tail can be accessed from multiple CPUs
on SMP system, so changes to use SMP memory barriers to replace the
uniprocessor barriers.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/auxtrace.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 472c0973b1f1..8bed284ccc82 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -452,7 +452,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
 	u64 head = READ_ONCE(pc->aux_head);

 	/* Ensure all reads are done after we read the head */
-	rmb();
+	smp_rmb();
 	return head;
 }

@@ -466,7 +466,7 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 #endif

 	/* Ensure all reads are done after we read the head */
-	rmb();
+	smp_rmb();
 	return head;
 }

@@ -478,7 +478,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)
 #endif

 	/* Ensure all reads are done before we write the tail out */
-	mb();
+	smp_mb();
 #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
 	pc->aux_tail = tail;
 #else
--
2.25.1
* [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Leo Yan @ 2021-05-19 14:03 UTC
To: Arnaldo Carvalho de Melo, Adrian Hunter, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel
Cc: Leo Yan

Load-acquire and store-release are one-way permeable barriers, which can
be used to guarantee the memory ordering between accessing the buffer
data and the buffer's head / tail.

This patch optimizes the memory ordering with the load-acquire and
store-release barriers.

If load-acquire is not supported by an architecture, the existing
READ_ONCE() + smp_rmb() pair is used rather than the fallback definition
of smp_load_acquire(); the reason is that smp_rmb() is a lighter barrier
than the smp_mb() invoked by the fallback smp_load_acquire() macro.

In auxtrace_mmap__write_tail(), the open-coded tail update is removed;
if the store-release barrier is not supported, it falls back to the
smp_mb() + WRITE_ONCE() pair defined in the fallback smp_store_release()
macro (see include/asm/barrier.h).
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/auxtrace.h | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 8bed284ccc82..f18070f1993f 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -449,16 +449,27 @@ struct auxtrace_cache;
 static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->userpg;
+
+#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
+    defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
+	return smp_load_acquire(&pc->aux_head);
+#else
 	u64 head = READ_ONCE(pc->aux_head);

 	/* Ensure all reads are done after we read the head */
 	smp_rmb();
 	return head;
+#endif
 }

 static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->userpg;
+
+#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
+    defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
+	return smp_load_acquire(&pc->aux_head);
+#else
 #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
 	u64 head = READ_ONCE(pc->aux_head);
 #else
@@ -468,20 +479,20 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 	/* Ensure all reads are done after we read the head */
 	smp_rmb();
 	return head;
+#endif
 }

 static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)
 {
 	struct perf_event_mmap_page *pc = mm->userpg;
-#if BITS_PER_LONG != 64 && defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
+
+#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
+	smp_store_release(&pc->aux_tail, tail);
+#else
 	u64 old_tail;
-#endif

 	/* Ensure all reads are done before we write the tail out */
 	smp_mb();
-#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
-	pc->aux_tail = tail;
-#else
 	do {
 		old_tail = __sync_val_compare_and_swap(&pc->aux_tail, 0, 0);
 	} while (!__sync_bool_compare_and_swap(&pc->aux_tail, old_tail, tail));
--
2.25.1
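[Editorial note: the acquire/release pairing this patch relies on can be
sketched outside of perf with the GCC/Clang __atomic builtins. This is a
minimal, hypothetical single-producer/single-consumer example — the
struct ring, publish() and consume() names are invented for illustration
and are not perf code: the release store on the head publishes the data
written before it, and the acquire load on the head makes that data
visible to the reader.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical SPSC slot; the layout is illustrative only, not
 * perf's perf_event_mmap_page. */
struct ring {
	uint64_t head;	/* written by producer, read by consumer */
	uint64_t data;	/* payload guarded by head */
};

/* Producer: write the data first, then release-store the head, so a
 * consumer that observes the new head also observes the data. */
static void publish(struct ring *r, uint64_t value)
{
	r->data = value;
	__atomic_store_n(&r->head, 1, __ATOMIC_RELEASE);
}

/* Consumer: acquire-load the head; if the head is visible, the data
 * written before the matching release is guaranteed visible too. */
static int consume(struct ring *r, uint64_t *out)
{
	if (__atomic_load_n(&r->head, __ATOMIC_ACQUIRE) == 0)
		return 0;
	*out = r->data;
	return 1;
}
```

Unlike the full smp_mb(), these are one-way barriers: accesses after the
acquire cannot move before it, and accesses before the release cannot
move after it, which is exactly the ordering the head/tail protocol
needs.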
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Leo Yan @ 2021-05-31 15:10 UTC
To: Arnaldo Carvalho de Melo, Adrian Hunter, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel

Hi Peter, Adrian,

On Wed, May 19, 2021 at 10:03:19PM +0800, Leo Yan wrote:
> Load-acquire and store-release are one-way permeable barriers, which can
> be used to guarantee the memory ordering between accessing the buffer
> data and the buffer's head / tail.
>
> This patch optimizes the memory ordering with the load-acquire and
> store-release barriers.

Is this patch okay for you?

Besides this patch, I have an extra question.  You can see that for
accessing the AUX buffer's head and tail, it also supports using
compiler built-in functions for atomic accesses:

  __sync_val_compare_and_swap()
  __sync_bool_compare_and_swap()

Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still need
to support __sync_xxx_compare_and_swap() atomicity?

I checked the code for updating head and tail for the perf ring buffer
(see ring_buffer_read_head() and ring_buffer_write_tail() in the file
tools/include/linux/ring_buffer.h), which doesn't support
__sync_xxx_compare_and_swap() anymore.  This is why I wonder if we
should drop __sync_xxx_compare_and_swap() atomicity for the AUX ring
buffer as well.

Thanks,
Leo
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-05-31 15:55 UTC
To: Leo Yan
Cc: Arnaldo Carvalho de Melo, Adrian Hunter, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Mon, May 31, 2021 at 11:10:49PM +0800, Leo Yan wrote:
> Hi Peter, Adrian,
>
> On Wed, May 19, 2021 at 10:03:19PM +0800, Leo Yan wrote:
> > Load-acquire and store-release are one-way permeable barriers, which can
> > be used to guarantee the memory ordering between accessing the buffer
> > data and the buffer's head / tail.
> >
> > This patch optimizes the memory ordering with the load-acquire and
> > store-release barriers.
>
> Is this patch okay for you?

Not without actual numbers; that's some terrible ifdef soup.

> Besides this patch, I have an extra question.  You can see that for
> accessing the AUX buffer's head and tail, it also supports using
> compiler built-in functions for atomic accesses:
>
>   __sync_val_compare_and_swap()
>   __sync_bool_compare_and_swap()
>
> Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still need
> to support __sync_xxx_compare_and_swap() atomicity?
>
> I checked the code for updating head and tail for the perf ring buffer
> (see ring_buffer_read_head() and ring_buffer_write_tail() in the file
> tools/include/linux/ring_buffer.h), which doesn't support
> __sync_xxx_compare_and_swap() anymore.  This is why I wonder if we
> should drop __sync_xxx_compare_and_swap() atomicity for the AUX ring
> buffer as well.

I'm not sure wth that code is even trying to do, that's some seriously
dodgy code.
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Adrian Hunter @ 2021-05-31 19:03 UTC
To: Leo Yan, Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel

On 31/05/21 6:10 pm, Leo Yan wrote:
> Hi Peter, Adrian,
>
> On Wed, May 19, 2021 at 10:03:19PM +0800, Leo Yan wrote:
> > Load-acquire and store-release are one-way permeable barriers, which can
> > be used to guarantee the memory ordering between accessing the buffer
> > data and the buffer's head / tail.
> >
> > This patch optimizes the memory ordering with the load-acquire and
> > store-release barriers.
>
> Is this patch okay for you?
>
> Besides this patch, I have an extra question.  You can see that for
> accessing the AUX buffer's head and tail, it also supports using
> compiler built-in functions for atomic accesses:
>
>   __sync_val_compare_and_swap()
>   __sync_bool_compare_and_swap()
>
> Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still need
> to support __sync_xxx_compare_and_swap() atomicity?

I don't remember, but it seems to me atomicity is needed only for a
32-bit perf running with a 64-bit kernel.

> I checked the code for updating head and tail for the perf ring buffer
> (see ring_buffer_read_head() and ring_buffer_write_tail() in the file
> tools/include/linux/ring_buffer.h), which doesn't support
> __sync_xxx_compare_and_swap() anymore.  This is why I wonder if we
> should drop __sync_xxx_compare_and_swap() atomicity for the AUX ring
> buffer as well.
>
> Thanks,
> Leo
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Leo Yan @ 2021-06-01 6:33 UTC
To: Adrian Hunter
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Mon, May 31, 2021 at 10:03:33PM +0300, Adrian Hunter wrote:
> On 31/05/21 6:10 pm, Leo Yan wrote:
> > Besides this patch, I have an extra question.  You can see that for
> > accessing the AUX buffer's head and tail, it also supports using
> > compiler built-in functions for atomic accesses:
> >
> >   __sync_val_compare_and_swap()
> >   __sync_bool_compare_and_swap()
> >
> > Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still need
> > to support __sync_xxx_compare_and_swap() atomicity?
>
> I don't remember, but it seems to me atomicity is needed only
> for a 32-bit perf running with a 64-bit kernel.

A 32-bit perf wants to access a 64-bit value atomically; I think the
atomics try to avoid the issue caused by this scenario:

  CPU0 (64-bit kernel)        CPU1 (32-bit user)

                              read head_lo
  WRITE_ONCE(head)
                              read head_hi

I dumped the disassembly for reading a 64-bit value in a 32-bit Arm
build of perf and got the results below.

perf Arm32 for READ_ONCE():

  case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break;
   84a:	68fb      	ldr	r3, [r7, #12]
   84c:	e9d3 2300 	ldrd	r2, r3, [r3]
   850:	6939      	ldr	r1, [r7, #16]
   852:	e9c1 2300 	strd	r2, r3, [r1]
   856:	e007      	b.n	868 <auxtrace_mmap__read_head+0xb0>

It uses the instruction ldrd, which is "Load Register Dual (register)",
but this does not mean the access is atomic; based on the comment in the
kernel header include/asm-generic/rwonce.h, I think ldrd/strd will be
"atomic in some cases (namely Armv7 + LPAE), but for others we rely on
the access being split into 2x32-bit accesses".

perf Arm32 for __sync_val_compare_and_swap():

  u64 head = __sync_val_compare_and_swap(&pc->aux_head, 0, 0);
   7d6:	68fb      	ldr	r3, [r7, #12]
   7d8:	f503 6484 	add.w	r4, r3, #1056	; 0x420
   7dc:	f04f 0000 	mov.w	r0, #0
   7e0:	f04f 0100 	mov.w	r1, #0
   7e4:	f3bf 8f5b 	dmb	ish
   7e8:	e8d4 237f 	ldrexd	r2, r3, [r4]
   7ec:	ea52 0c03 	orrs.w	ip, r2, r3
   7f0:	d106      	bne.n	800 <auxtrace_mmap__read_head+0x48>
   7f2:	e8c4 017c 	strexd	ip, r0, r1, [r4]
   7f6:	f1bc 0f00 	cmp.w	ip, #0
   7fa:	f1bc 0f00 	cmp.w	ip, #0
   7fe:	d1f3      	bne.n	7e8 <auxtrace_mmap__read_head+0x30>
   800:	f3bf 8f5b 	dmb	ish
   804:	e9c7 2304 	strd	r2, r3, [r7, #16]

__sync_val_compare_and_swap() uses the instructions ldrexd/strexd, which
rely on the exclusive monitor for accessing a 64-bit value, so this
seems a more reliable way of accessing a 64-bit value in the CPU's
32-bit mode.

Conclusion: it seems to me __sync_xxx_compare_and_swap() should be kept
in this case rather than using READ_ONCE() for 32-bit builds.  Or any
other suggestions?  Thanks!

Leo
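[Editorial note: the read side Leo disassembled boils down to a one-line
idiom, sketched here for illustration — the atomic_read_u64() helper is
an invented name, not perf code. A compare-and-swap of 0 with 0 never
observably modifies the target (it only stores when the value is already
0, and then stores the same 0 back), but it always returns the old
value, which Arm32 reads atomically via ldrexd/strexd.]

```c
#include <assert.h>
#include <stdint.h>

/* Atomically read a 64-bit value, even on 32-bit targets that have
 * compare-and-swap support: CAS(p, 0, 0) leaves *p unchanged in every
 * case but returns the old value read under the exclusive monitor. */
static uint64_t atomic_read_u64(uint64_t *p)
{
	return __sync_val_compare_and_swap(p, 0, 0);
}
```

This is the trick behind the HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT path in
auxtrace.h; Peter's later criticism in this thread is about wrapping a
write loop around it, not about the atomic read itself.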
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-06-01 6:58 UTC
To: Leo Yan
Cc: Adrian Hunter, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Tue, Jun 01, 2021 at 02:33:42PM +0800, Leo Yan wrote:
>
> A 32-bit perf wants to access a 64-bit value atomically; I think the
> atomics try to avoid the issue caused by this scenario:
>
>   CPU0 (64-bit kernel)        CPU1 (32-bit user)
>
>                               read head_lo
>   WRITE_ONCE(head)
>                               read head_hi

Right; so I think Mark and me once spent a bunch of time on this for the
regular ring buffer, but my memory is vague. It was supposed to be that
the high word would always be zero on 32bit, but it turns out that that
is not in fact the case and we get to have this race that's basically
unfixable :/

Or maybe that was only the compat case.. Ah yes, so see the kernel uses
unsigned long, so on 32bit the high word is empty and we always
read/write 0s, unless you're explicitly doing daft things. But on
compat, the high word can be non-zero and we get to have 'fun'.
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Adrian Hunter @ 2021-06-01 9:07 UTC
To: Leo Yan
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On 1/06/21 9:33 am, Leo Yan wrote:
> [disassembly comparison of READ_ONCE() vs __sync_val_compare_and_swap()
>  on Arm32 snipped]
>
> __sync_val_compare_and_swap() uses the instructions ldrexd/strexd,
> which rely on the exclusive monitor for accessing a 64-bit value, so
> this seems a more reliable way of accessing a 64-bit value in the
> CPU's 32-bit mode.
>
> Conclusion: it seems to me __sync_xxx_compare_and_swap() should be
> kept in this case rather than using READ_ONCE() for 32-bit builds.
> Or any other suggestions?  Thanks!

__sync_xxx_compare_and_swap is out-of-date now.  This page:

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins

recommends '__atomic' builtins instead.

Since atomics are needed only for the "compat" case (i.e. 32-bit perf
with a 64-bit kernel), you could try to find an elegant way to check for
a 64-bit kernel, and avoid the atomics for a 32-bit perf with a 32-bit
kernel.
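[Editorial note: as a sketch of what Adrian's suggestion might look
like — purely an assumption, not proposed perf code. The
head_read()/tail_write() names are invented, and on 32-bit targets
64-bit __atomic operations may require linking against libatomic.]

```c
#include <assert.h>
#include <stdint.h>

/* Acquire-load of the head: the old __sync CAS-with-(0, 0) read,
 * expressed with the newer __atomic builtins. */
static uint64_t head_read(uint64_t *head)
{
	return __atomic_load_n(head, __ATOMIC_ACQUIRE);
}

/* Release-store of the tail: replaces the __sync CAS write loop. */
static void tail_write(uint64_t *tail, uint64_t tail_val)
{
	__atomic_store_n(tail, tail_val, __ATOMIC_RELEASE);
}
```

The __atomic builtins also fold the ordering into the access itself, so
no separate smp_rmb()/smp_mb() call is needed around them.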
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-06-01 9:17 UTC
To: Adrian Hunter
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Tue, Jun 01, 2021 at 12:07:31PM +0300, Adrian Hunter wrote:
> __sync_xxx_compare_and_swap is out-of-date now.  This page:
>
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins
>
> recommends '__atomic' builtins instead.

perf doesn't seem to use that.

> Since atomics are needed only for the "compat" case (i.e. 32-bit perf
> with a 64-bit kernel), you could try to find an elegant way to check
> for a 64-bit kernel, and avoid the atomics for a 32-bit perf with a
> 32-bit kernel.

Most 32bit archs cannot do 64bit atomics. I suppose the only reason this
doesn't explode is because the aux stuff isn't supported on many
architectures?
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Adrian Hunter @ 2021-06-01 9:45 UTC
To: Peter Zijlstra
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On 1/06/21 12:17 pm, Peter Zijlstra wrote:
> On Tue, Jun 01, 2021 at 12:07:31PM +0300, Adrian Hunter wrote:
>> __sync_xxx_compare_and_swap is out-of-date now.  This page:
>>
>> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins
>>
>> recommends '__atomic' builtins instead.
>
> perf doesn't seem to use that.

I guess we could drop support for the compat case; add validation:

  "Error, 32-bit perf cannot record AUX area traces from a 64-bit
   kernel.  Please use a 64-bit version of perf instead."

>> Since atomics are needed only for the "compat" case (i.e. 32-bit perf
>> with a 64-bit kernel), you could try to find an elegant way to check
>> for a 64-bit kernel, and avoid the atomics for a 32-bit perf with a
>> 32-bit kernel.
>
> Most 32bit archs cannot do 64bit atomics. I suppose the only reason
> this doesn't explode is because the aux stuff isn't supported on many
> architectures?

Yes, but presumably the race itself is unlikely since the upper 32 bits
change only once every 4 GiB.
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-06-01 9:48 UTC
To: Adrian Hunter
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Tue, Jun 01, 2021 at 12:45:16PM +0300, Adrian Hunter wrote:
> On 1/06/21 12:17 pm, Peter Zijlstra wrote:
> > perf doesn't seem to use that.
>
> I guess we could drop support for the compat case; add validation:
>
>   "Error, 32-bit perf cannot record AUX area traces from a 64-bit
>    kernel.  Please use a 64-bit version of perf instead."

For AUX, possibly; sadly the exact same problem exists for the normal
buffer IIRC.
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Leo Yan @ 2021-06-01 14:56 UTC
To: Adrian Hunter
Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Tue, Jun 01, 2021 at 12:45:16PM +0300, Adrian Hunter wrote:
> On 1/06/21 12:17 pm, Peter Zijlstra wrote:
> > perf doesn't seem to use that.
>
> I guess we could drop support for the compat case; add validation:
>
>   "Error, 32-bit perf cannot record AUX area traces from a 64-bit
>    kernel.  Please use a 64-bit version of perf instead."

I am not sure what a good method to detect the compat case is.  I can
think of using the conditions below:

	if ((sizeof(unsigned long) == 4) &&
	    (machine__is(machine, "x86_64") ||
	     machine__is(machine, "arm64") ||
	     machine__is(machine, "aarch64"))) {
		pr_warn("Error, 32-bit perf cannot record from a 64-bit kernel.\n"
			"Please use a 64-bit version of perf instead.\n");
		return -ENOTSUP;
	}

Just want to check: is there a better way to detect the compat case in
perf?

Thanks for the suggestions,
Leo

> > Most 32bit archs cannot do 64bit atomics. I suppose the only reason
> > this doesn't explode is because the aux stuff isn't supported on many
> > architectures?
>
> Yes, but presumably the race itself is unlikely since the upper 32 bits
> change only once every 4 GiB.
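[Editorial note: one conceivable alternative to the machine__is() checks
above — an assumption for illustration only, not perf code — is to ask
the running kernel directly via uname(2), whose machine field reports
the kernel architecture even when the calling process is 32-bit. The
kernel_is_64bit() helper and its substring heuristic are invented here.]

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>

/* Heuristic sketch: report whether the running kernel is 64-bit.
 * Most 64-bit uts.machine strings contain "64" (x86_64, aarch64,
 * ppc64, riscv64, sparc64); s390x is the notable exception. */
static int kernel_is_64bit(void)
{
	struct utsname uts;

	if (uname(&uts) < 0)
		return 0;	/* assume 32-bit if uname fails */
	return strstr(uts.machine, "64") != NULL ||
	       strcmp(uts.machine, "s390x") == 0;
}
```

A 32-bit perf (sizeof(unsigned long) == 4) could then refuse to record
whenever this helper returns non-zero, covering the compat case without
hard-coding an architecture list.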
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-06-01 6:48 UTC
To: Adrian Hunter
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Mon, May 31, 2021 at 10:03:33PM +0300, Adrian Hunter wrote:
> On 31/05/21 6:10 pm, Leo Yan wrote:
> > Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still
> > need to support __sync_xxx_compare_and_swap() atomicity?
>
> I don't remember, but it seems to me atomicity is needed only
> for a 32-bit perf running with a 64-bit kernel.

But that:

	do {
		old_tail = __sync_val_compare_and_swap(&pc->aux_tail, 0, 0);
	} while (!__sync_bool_compare_and_swap(&pc->aux_tail, old_tail, tail));

doesn't want to make sense to me. It appears to do a cmpxchg with 0
values to load old_tail, and then a further cmpxchg with old_tail to
write the new tail.

That's some really crazy code. That makes absolutely no sense whatsoever
and needs to just be deleted.

On top of that, it uses atomic intrinsics for a u64, which is not
something that'll actually work on a whole bunch of 32bit platforms.
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers
From: Adrian Hunter @ 2021-05-27 7:54 UTC
To: Leo Yan, Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel

On 19/05/21 5:03 pm, Leo Yan wrote:
> The AUX ring buffer's head and tail can be accessed from multiple CPUs
> on SMP system, so changes to use SMP memory barriers to replace the
> uniprocessor barriers.

I don't think user space should attempt to be SMP-aware.

For perf tools, on __x86_64__ it looks like smp_rmb() is only a compiler
barrier, whereas rmb() is an "lfence" memory barrier instruction, so
this patch does not seem to do what the commit message says, at least
for x86.

With regard to the AUX area, we don't know in general how data gets
there, so using memory barriers seems sensible.

> [patch diff quoted in full, snipped]
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers
From: Peter Zijlstra @ 2021-05-27 8:11 UTC
To: Adrian Hunter
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Thu, May 27, 2021 at 10:54:56AM +0300, Adrian Hunter wrote:
> On 19/05/21 5:03 pm, Leo Yan wrote:
> > The AUX ring buffer's head and tail can be accessed from multiple CPUs
> > on SMP system, so changes to use SMP memory barriers to replace the
> > uniprocessor barriers.
>
> I don't think user space should attempt to be SMP-aware.

Uhh, what? It pretty much has to. Since userspace cannot assume UP, it
must assume SMP.

> For perf tools, on __x86_64__ it looks like smp_rmb() is only a
> compiler barrier, whereas rmb() is an "lfence" memory barrier
> instruction, so this patch does not seem to do what the commit message
> says, at least for x86.

The commit message is somewhat confused; *mb() are not UP barriers
(although they are available and useful on UP). They're device/dma
barriers.

> With regard to the AUX area, we don't know in general how data gets
> there, so using memory barriers seems sensible.

IIRC (but I didn't check) the rule was that the kernel needs to ensure
the AUX area is complete before it updates the head pointer. So if
userspace can observe the head pointer, it must then also be able to
observe the data. This is not something userspace can fix up anyway.

The ordering here is between the head pointer and the data, and from a
userspace perspective that's a regular smp ordering. Similar for the
tail update: that's between our reading the data and writing the tail,
regular cache-coherent smp ordering.

So ACK on the patch, it's sane and an optimization for both x86 and ARM.
Just the Changelog needs work.

> > [patch diff quoted in full, snipped]
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 8:11 ` Peter Zijlstra @ 2021-05-27 8:25 ` Adrian Hunter 2021-05-27 9:24 ` Adrian Hunter 2021-05-27 9:45 ` Peter Zijlstra 0 siblings, 2 replies; 22+ messages in thread From: Adrian Hunter @ 2021-05-27 8:25 UTC (permalink / raw To: Peter Zijlstra Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel On 27/05/21 11:11 am, Peter Zijlstra wrote: > On Thu, May 27, 2021 at 10:54:56AM +0300, Adrian Hunter wrote: >> On 19/05/21 5:03 pm, Leo Yan wrote: >>> The AUX ring buffer's head and tail can be accessed from multiple CPUs >>> on SMP system, so changes to use SMP memory barriers to replace the >>> uniprocessor barriers. >> >> I don't think user space should attempt to be SMP-aware. > > Uhh, what? It pretty much has to. Since userspace cannot assume UP, it > must assume SMP. Yeah that is what I meant, but consequently we generally shouldn't be using functions called smp_<anything> > >> For perf tools, on __x86_64__ it looks like smp_rmb() is only a compiler barrier, whereas >> rmb() is a "lfence" memory barrier instruction, so this patch does not >> seem to do what the commit message says at least for x86. > > The commit message is somewhat confused; *mb() are not UP barriers > (although they are available and useful on UP). They're device/dma > barriers. > >> With regard to the AUX area, we don't know in general how data gets there, >> so using memory barriers seems sensible. > > IIRC (but I ddn't check) the rule was that the kernel needs to ensure > the AUX area is complete before it updates the head pointer. So if > userspace can observe the head pointer, it must then also be able to > observe the data. This is not something userspace can fix up anyway. > > The ordering here is between the head pointer and the data, and from a > userspace perspective that's a regular smp ordering. 
Similar for the > tail update, that's between our reading the data and writing the tail, > regular cache coherent smp ordering. > > So ACK on the patch, it's sane and an optimization for both x86 and ARM. > Just the Changelog needs work. If all we want is a compiler barrier, then shouldn't that be what we use? i.e. barrier() > >>> Signed-off-by: Leo Yan <leo.yan@linaro.org> >>> --- >>> tools/perf/util/auxtrace.h | 6 +++--- >>> 1 file changed, 3 insertions(+), 3 deletions(-) >>> >>> diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h >>> index 472c0973b1f1..8bed284ccc82 100644 >>> --- a/tools/perf/util/auxtrace.h >>> +++ b/tools/perf/util/auxtrace.h >>> @@ -452,7 +452,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm) >>> u64 head = READ_ONCE(pc->aux_head); >>> >>> /* Ensure all reads are done after we read the head */ >>> - rmb(); >>> + smp_rmb(); >>> return head; >>> } >>> >>> @@ -466,7 +466,7 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm) >>> #endif >>> >>> /* Ensure all reads are done after we read the head */ >>> - rmb(); >>> + smp_rmb(); >>> return head; >>> } >>> >>> @@ -478,7 +478,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail) >>> #endif >>> >>> /* Ensure all reads are done before we write the tail out */ >>> - mb(); >>> + smp_mb(); >>> #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT) >>> pc->aux_tail = tail; >>> #else >>> >> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 8:25 ` Adrian Hunter @ 2021-05-27 9:24 ` Adrian Hunter 2021-05-27 9:57 ` Peter Zijlstra 2021-05-27 9:45 ` Peter Zijlstra 1 sibling, 1 reply; 22+ messages in thread From: Adrian Hunter @ 2021-05-27 9:24 UTC (permalink / raw To: Peter Zijlstra Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel On 27/05/21 11:25 am, Adrian Hunter wrote: > On 27/05/21 11:11 am, Peter Zijlstra wrote: >> On Thu, May 27, 2021 at 10:54:56AM +0300, Adrian Hunter wrote: >>> On 19/05/21 5:03 pm, Leo Yan wrote: >>>> The AUX ring buffer's head and tail can be accessed from multiple CPUs >>>> on SMP system, so changes to use SMP memory barriers to replace the >>>> uniprocessor barriers. >>> >>> I don't think user space should attempt to be SMP-aware. >> >> Uhh, what? It pretty much has to. Since userspace cannot assume UP, it >> must assume SMP. > > Yeah that is what I meant, but consequently we generally shouldn't be > using functions called smp_<anything> > >> >>> For perf tools, on __x86_64__ it looks like smp_rmb() is only a compiler barrier, whereas >>> rmb() is a "lfence" memory barrier instruction, so this patch does not >>> seem to do what the commit message says at least for x86. >> >> The commit message is somewhat confused; *mb() are not UP barriers >> (although they are available and useful on UP). They're device/dma >> barriers. >> >>> With regard to the AUX area, we don't know in general how data gets there, >>> so using memory barriers seems sensible. >> >> IIRC (but I ddn't check) the rule was that the kernel needs to ensure >> the AUX area is complete before it updates the head pointer. So if >> userspace can observe the head pointer, it must then also be able to >> observe the data. This is not something userspace can fix up anyway. 
>> >> The ordering here is between the head pointer and the data, and from a >> userspace perspective that's a regular smp ordering. Similar for the >> tail update, that's between our reading the data and writing the tail, >> regular cache coherent smp ordering. >> >> So ACK on the patch, it's sane and an optimization for both x86 and ARM. >> Just the Changelog needs work. > > If all we want is a compiler barrier, then shouldn't that be what we use? > i.e. barrier() I guess you are saying we still need to stop potential re-ordering across CPUs, so please ignore my comments. > >> >>>> Signed-off-by: Leo Yan <leo.yan@linaro.org> >>>> --- >>>> tools/perf/util/auxtrace.h | 6 +++--- >>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h >>>> index 472c0973b1f1..8bed284ccc82 100644 >>>> --- a/tools/perf/util/auxtrace.h >>>> +++ b/tools/perf/util/auxtrace.h >>>> @@ -452,7 +452,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm) >>>> u64 head = READ_ONCE(pc->aux_head); >>>> >>>> /* Ensure all reads are done after we read the head */ >>>> - rmb(); >>>> + smp_rmb(); >>>> return head; >>>> } >>>> >>>> @@ -466,7 +466,7 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm) >>>> #endif >>>> >>>> /* Ensure all reads are done after we read the head */ >>>> - rmb(); >>>> + smp_rmb(); >>>> return head; >>>> } >>>> >>>> @@ -478,7 +478,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail) >>>> #endif >>>> >>>> /* Ensure all reads are done before we write the tail out */ >>>> - mb(); >>>> + smp_mb(); >>>> #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT) >>>> pc->aux_tail = tail; >>>> #else >>>> >>> > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 9:24 ` Adrian Hunter @ 2021-05-27 9:57 ` Peter Zijlstra 2021-05-31 14:53 ` Leo Yan 0 siblings, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2021-05-27 9:57 UTC (permalink / raw To: Adrian Hunter Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel

On Thu, May 27, 2021 at 12:24:15PM +0300, Adrian Hunter wrote:
> > If all we want is a compiler barrier, then shouldn't that be what we use?
> > i.e. barrier()
>
> I guess you are saying we still need to stop potential re-ordering across
> CPUs, so please ignore my comments.

Right; so the ordering issue is real, consider:

    CPU0 (kernel)       CPU1 (user)

    write data          read head
    smp_wmb()           smp_rmb()
    write head          read data

Without explicit ordering (on either side), we might either read data
that isn't written yet:

                      ,--(read data)
    write data        |
    smp_wmb()         |
    write head ---.   |
                  `-->|  read head
                      `- read data

Where the head load observes the new head written, but the data load is
speculated and loads data before it is written.

Or, we can write the head before the data write is visible:

      ,-- write data
      |   write head
      |              read head
      |              smp_rmb()
      |              read data
      `-> (data visible)

Where we read the new head, but still observe stale data because the
stores got committed out of order.

x86 is TSO, so neither reordering is actually possible, hence both
barriers being a compiler barrier (to ensure the compiler doesn't
reorder them for us). But weaker hardware *will* allow those orderings
and we very much need actual barriers there.

Welcome to the magical world of memory ordering and weak architectures.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 9:57 ` Peter Zijlstra @ 2021-05-31 14:53 ` Leo Yan 2021-05-31 15:48 ` Peter Zijlstra 0 siblings, 1 reply; 22+ messages in thread From: Leo Yan @ 2021-05-31 14:53 UTC (permalink / raw To: Peter Zijlstra Cc: Adrian Hunter, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel

Hi Peter, Adrian,

On Thu, May 27, 2021 at 11:57:37AM +0200, Peter Zijlstra wrote:
> On Thu, May 27, 2021 at 12:24:15PM +0300, Adrian Hunter wrote:
> >
> > > If all we want is a compiler barrier, then shouldn't that be what we use?
> > > i.e. barrier()

Sorry for being a bit late. Just bringing up one question before I
respin this patch set.

> > I guess you are saying we still need to stop potential re-ordering across
> > CPUs, so please ignore my comments.
>
> Right; so the ordering issue is real, consider:
>
>     CPU0 (kernel)       CPU1 (user)
>
>     write data          read head
>     smp_wmb()           smp_rmb()
>     write head          read data

One thing should be mentioned: the Linux kernel has _not_ used an
explicit "smp_wmb()" between writing AUX trace data and updating the
header "aux_head". Please see the function perf_aux_output_end():

  void perf_aux_output_end(..., size)
  {
	...

	if (size || (handle->aux_flags & ~(u64)PERF_AUX_FLAG_OVERWRITE))
		perf_event_aux_event(handle->event, aux_head, size,
				     handle->aux_flags);

	WRITE_ONCE(rb->user_page->aux_head, rb->aux_head);

	...
  }

But I think it's needless to add "smp_wmb()" prior to the WRITE_ONCE()
statement. This is because, before updating "aux_head", it calls
perf_event_aux_event(), so a PERF_RECORD_AUX event is filled into the
perf ring buffer, which executes smp_wmb() and then updates the header
"user_page->data_head"; so the flow is like below:

  Fill AUX trace data to AUX ring buffer
  Fill RECORD_AUX event into perf ring buffer
  smp_wmb()
  update "user_page->data_head"  -> See perf_event_aux_event()/perf_output_end()
  update "user_page->aux_head"

It is a bit weird that the two ring buffers (the AUX ring buffer and the
generic perf ring buffer) share the same memory barrier between writing
data and writing their headers.

Do you think I understand this correctly? Or should we add an explicit
"smp_wmb()" before WRITE_ONCE(rb->user_page->aux_head, ...)?

Thanks,
Leo

> Without explicit ordering (on either side), we might either read data
> that isn't written yet:
>
>                       ,--(read data)
>     write data        |
>     smp_wmb()         |
>     write head ---.   |
>                   `-->|  read head
>                       `- read data
>
> Where the head load observes the new head written, but the data load is
> speculated and loads data before it is written.
>
> Or, we can write the head before the data write is visible:
>
>       ,-- write data
>       |   write head
>       |              read head
>       |              smp_rmb()
>       |              read data
>       `-> (data visible)
>
> Where we read the new head, but still observe stale data because the
> stores got committed out of order.
>
> x86 is TSO, so neither reordering is actually possible, hence both
> barriers being a compiler barrier (to ensure the compiler doesn't
> reorder them for us). But weaker hardware *will* allow those orderings
> and we very much need actual barriers there.
>
> Welcome to the magical world of memory ordering and weak architectures.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-31 14:53 ` Leo Yan @ 2021-05-31 15:48 ` Peter Zijlstra 2021-06-01 3:21 ` Leo Yan 0 siblings, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2021-05-31 15:48 UTC (permalink / raw To: Leo Yan Cc: Adrian Hunter, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel On Mon, May 31, 2021 at 10:53:02PM +0800, Leo Yan wrote: > Hi Peter, Adrian, > > On Thu, May 27, 2021 at 11:57:37AM +0200, Peter Zijlstra wrote: > > On Thu, May 27, 2021 at 12:24:15PM +0300, Adrian Hunter wrote: > > > > > > If all we want is a compiler barrier, then shouldn't that be what we use? > > > > i.e. barrier() > > Sorry for a bit late. Just bring up one question before I respin > this patch set. > > > > I guess you are saying we still need to stop potential re-ordering across > > > CPUs, so please ignore my comments. > > > > Right; so the ordering issue is real, consider: > > > > CPU0 (kernel) CPU1 (user) > > > > write data read head > > smp_wmb() smp_rmb() > > write head read data > > One thing should be mentioned is the Linux kernel has _not_ used an > explict "smb_wmb()" between writing AUX trace data and updating header > "aux_head". Please see the function perf_aux_output_end(): I think we pushed that into the driver. There is nothing the generic code can do here. It is the drivers responsibility of ensuring the data is stable before calling perf_aux_output_end() or something along those lines. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-31 15:48 ` Peter Zijlstra @ 2021-06-01 3:21 ` Leo Yan 0 siblings, 0 replies; 22+ messages in thread From: Leo Yan @ 2021-06-01 3:21 UTC (permalink / raw To: Peter Zijlstra Cc: Adrian Hunter, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel

On Mon, May 31, 2021 at 05:48:03PM +0200, Peter Zijlstra wrote:
> On Mon, May 31, 2021 at 10:53:02PM +0800, Leo Yan wrote:
> > Hi Peter, Adrian,
> >
> > On Thu, May 27, 2021 at 11:57:37AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 27, 2021 at 12:24:15PM +0300, Adrian Hunter wrote:
> > > >
> > > > > If all we want is a compiler barrier, then shouldn't that be what we use?
> > > > > i.e. barrier()
> >
> > Sorry for a bit late. Just bring up one question before I respin
> > this patch set.
> >
> > > > I guess you are saying we still need to stop potential re-ordering across
> > > > CPUs, so please ignore my comments.
> > >
> > > Right; so the ordering issue is real, consider:
> > >
> > >     CPU0 (kernel)       CPU1 (user)
> > >
> > >     write data          read head
> > >     smp_wmb()           smp_rmb()
> > >     write head          read data
> >
> > One thing should be mentioned is the Linux kernel has _not_ used an
> > explict "smb_wmb()" between writing AUX trace data and updating header
> > "aux_head". Please see the function perf_aux_output_end():
>
> I think we pushed that into the driver. There is nothing the generic
> code can do here.
>
> It is the drivers responsibility of ensuring the data is stable before
> calling perf_aux_output_end() or something along those lines.

Thanks for the explanation.

I reviewed the drivers; some of them use memory barriers (e.g. Intel PT,
Arm SPE), but some drivers fail to use memory barriers before calling
perf_aux_output_end() (like Arm CoreSight and Intel BTS). Will address
this issue in the next patch set.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 8:25 ` Adrian Hunter 2021-05-27 9:24 ` Adrian Hunter @ 2021-05-27 9:45 ` Peter Zijlstra 1 sibling, 0 replies; 22+ messages in thread From: Peter Zijlstra @ 2021-05-27 9:45 UTC (permalink / raw To: Adrian Hunter Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel On Thu, May 27, 2021 at 11:25:40AM +0300, Adrian Hunter wrote: > On 27/05/21 11:11 am, Peter Zijlstra wrote: > > On Thu, May 27, 2021 at 10:54:56AM +0300, Adrian Hunter wrote: > >> On 19/05/21 5:03 pm, Leo Yan wrote: > >>> The AUX ring buffer's head and tail can be accessed from multiple CPUs > >>> on SMP system, so changes to use SMP memory barriers to replace the > >>> uniprocessor barriers. > >> > >> I don't think user space should attempt to be SMP-aware. > > > > Uhh, what? It pretty much has to. Since userspace cannot assume UP, it > > must assume SMP. > > Yeah that is what I meant, but consequently we generally shouldn't be > using functions called smp_<anything> Of course we should; they're the SMP class of barriers. > > So ACK on the patch, it's sane and an optimization for both x86 and ARM. > > Just the Changelog needs work. > > If all we want is a compiler barrier, then shouldn't that be what we use? > i.e. barrier() No, we want the SMP barriers, smp_rmb() happens to be a compiler barrier on x86, but it will be dmb(ishld) on Aarrgh64 for example. ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads: [~2021-06-01 14:57 UTC | newest]

Thread overview: 22+ messages:
2021-05-19 14:03 [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers Leo Yan
2021-05-19 14:03 ` [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release Leo Yan
2021-05-31 15:10   ` Leo Yan
2021-05-31 15:55     ` Peter Zijlstra
2021-05-31 19:03       ` Adrian Hunter
2021-06-01  6:33         ` Leo Yan
2021-06-01  6:58           ` Peter Zijlstra
2021-06-01  9:07             ` Adrian Hunter
2021-06-01  9:17               ` Peter Zijlstra
2021-06-01  9:45                 ` Adrian Hunter
2021-06-01  9:48                   ` Peter Zijlstra
2021-06-01 14:56                     ` Leo Yan
2021-06-01  6:48         ` Peter Zijlstra
2021-05-27  7:54 ` [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers Adrian Hunter
2021-05-27  8:11   ` Peter Zijlstra
2021-05-27  8:25     ` Adrian Hunter
2021-05-27  9:24       ` Adrian Hunter
2021-05-27  9:57         ` Peter Zijlstra
2021-05-31 14:53           ` Leo Yan
2021-05-31 15:48             ` Peter Zijlstra
2021-06-01  3:21               ` Leo Yan
2021-05-27  9:45       ` Peter Zijlstra