* [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers
From: Leo Yan @ 2021-05-19 14:03 UTC
To: Arnaldo Carvalho de Melo, Adrian Hunter, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel
Cc: Leo Yan

The AUX ring buffer's head and tail can be accessed from multiple CPUs
on SMP system, so changes to use SMP memory barriers to replace the
uniprocessor barriers.

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/auxtrace.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 472c0973b1f1..8bed284ccc82 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -452,7 +452,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
 	u64 head = READ_ONCE(pc->aux_head);

 	/* Ensure all reads are done after we read the head */
-	rmb();
+	smp_rmb();
 	return head;
 }

@@ -466,7 +466,7 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 #endif

 	/* Ensure all reads are done after we read the head */
-	rmb();
+	smp_rmb();
 	return head;
 }

@@ -478,7 +478,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)
 #endif

 	/* Ensure all reads are done before we write the tail out */
-	mb();
+	smp_mb();
 #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
 	pc->aux_tail = tail;
 #else
--
2.25.1
* [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Leo Yan @ 2021-05-19 14:03 UTC
To: Arnaldo Carvalho de Melo, Adrian Hunter, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel
Cc: Leo Yan

Load-acquire and store-release are one-way permeable barriers, which can
be used to guarantee the memory ordering between accessing the buffer
data and the buffer's head / tail.

This patch optimizes the memory ordering with the load-acquire and
store-release barriers.

If load-acquire is not supported by an architecture, the existing
READ_ONCE() + smp_rmb() pair is used rather than the fallback definition
of smp_load_acquire(); the reason is that smp_rmb() is a lighter barrier
than the smp_mb() invoked by the fallback smp_load_acquire() macro.

In auxtrace_mmap__write_tail(), the open-coded tail update is removed;
if the store-release barrier is not supported, it falls back to the
smp_mb() + WRITE_ONCE() pair defined in the fallback smp_store_release()
macro (see include/asm/barrier.h).
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 tools/perf/util/auxtrace.h | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index 8bed284ccc82..f18070f1993f 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -449,16 +449,27 @@ struct auxtrace_cache;
 static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->userpg;
+
+#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
+    defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
+	return smp_load_acquire(&pc->aux_head);
+#else
 	u64 head = READ_ONCE(pc->aux_head);

 	/* Ensure all reads are done after we read the head */
 	smp_rmb();
 	return head;
+#endif
 }

 static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->userpg;
+
+#if defined(__x86_64__) || defined(__aarch64__) || defined(__powerpc64__) || \
+    defined(__ia64__) || defined(__sparc__) && defined(__arch64__)
+	return smp_load_acquire(&pc->aux_head);
+#else
 #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
 	u64 head = READ_ONCE(pc->aux_head);
 #else
@@ -468,20 +479,20 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm)
 	/* Ensure all reads are done after we read the head */
 	smp_rmb();
 	return head;
+#endif
 }

 static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail)
 {
 	struct perf_event_mmap_page *pc = mm->userpg;
-#if BITS_PER_LONG != 64 && defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
+
+#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
+	smp_store_release(&pc->aux_tail, tail);
+#else
 	u64 old_tail;
-#endif

 	/* Ensure all reads are done before we write the tail out */
 	smp_mb();
-#if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT)
-	pc->aux_tail = tail;
-#else
 	do {
 		old_tail = __sync_val_compare_and_swap(&pc->aux_tail, 0, 0);
 	} while (!__sync_bool_compare_and_swap(&pc->aux_tail, old_tail, tail));
--
2.25.1
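[Editorial note: the acquire/release pairing this patch relies on can be
sketched outside of perf with the GCC/Clang __atomic builtins. This is a
minimal, hypothetical single-producer/single-consumer example — the
struct ring, publish() and consume() names are invented for illustration
and are not perf code: the release store on the head publishes the data
written before it, and the acquire load on the head makes that data
visible to the reader.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical SPSC slot; the layout is illustrative only, not
 * perf's perf_event_mmap_page. */
struct ring {
	uint64_t head;	/* written by producer, read by consumer */
	uint64_t data;	/* payload guarded by head */
};

/* Producer: write the data first, then release-store the head, so a
 * consumer that observes the new head also observes the data. */
static void publish(struct ring *r, uint64_t value)
{
	r->data = value;
	__atomic_store_n(&r->head, 1, __ATOMIC_RELEASE);
}

/* Consumer: acquire-load the head; if the head is visible, the data
 * written before the matching release is guaranteed visible too. */
static int consume(struct ring *r, uint64_t *out)
{
	if (__atomic_load_n(&r->head, __ATOMIC_ACQUIRE) == 0)
		return 0;
	*out = r->data;
	return 1;
}
```

Unlike the full smp_mb(), these are one-way barriers: accesses after the
acquire cannot move before it, and accesses before the release cannot
move after it, which is exactly the ordering the head/tail protocol
needs.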
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Leo Yan @ 2021-05-31 15:10 UTC
To: Arnaldo Carvalho de Melo, Adrian Hunter, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel

Hi Peter, Adrian,

On Wed, May 19, 2021 at 10:03:19PM +0800, Leo Yan wrote:
> Load-acquire and store-release are one-way permeable barriers, which can
> be used to guarantee the memory ordering between accessing the buffer
> data and the buffer's head / tail.
>
> This patch optimizes the memory ordering with the load-acquire and
> store-release barriers.

Is this patch okay for you?

Besides this patch, I have an extra question.  You can see that for
accessing the AUX buffer's head and tail, it also supports using
compiler built-in functions for atomic accesses:

  __sync_val_compare_and_swap()
  __sync_bool_compare_and_swap()

Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still need
to support __sync_xxx_compare_and_swap() atomicity?

I checked the code for updating head and tail for the perf ring buffer
(see ring_buffer_read_head() and ring_buffer_write_tail() in the file
tools/include/linux/ring_buffer.h), which doesn't support
__sync_xxx_compare_and_swap() anymore.  This is why I wonder if we
should drop __sync_xxx_compare_and_swap() atomicity for the AUX ring
buffer as well.

Thanks,
Leo
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-05-31 15:55 UTC
To: Leo Yan
Cc: Arnaldo Carvalho de Melo, Adrian Hunter, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Mon, May 31, 2021 at 11:10:49PM +0800, Leo Yan wrote:
> Hi Peter, Adrian,
>
> On Wed, May 19, 2021 at 10:03:19PM +0800, Leo Yan wrote:
> > Load-acquire and store-release are one-way permeable barriers, which can
> > be used to guarantee the memory ordering between accessing the buffer
> > data and the buffer's head / tail.
> >
> > This patch optimizes the memory ordering with the load-acquire and
> > store-release barriers.
>
> Is this patch okay for you?

Not without actual numbers; that's some terrible ifdef soup.

> Besides this patch, I have an extra question.  You can see that for
> accessing the AUX buffer's head and tail, it also supports using
> compiler built-in functions for atomic accesses:
>
>   __sync_val_compare_and_swap()
>   __sync_bool_compare_and_swap()
>
> Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still need
> to support __sync_xxx_compare_and_swap() atomicity?
>
> I checked the code for updating head and tail for the perf ring buffer
> (see ring_buffer_read_head() and ring_buffer_write_tail() in the file
> tools/include/linux/ring_buffer.h), which doesn't support
> __sync_xxx_compare_and_swap() anymore.  This is why I wonder if we
> should drop __sync_xxx_compare_and_swap() atomicity for the AUX ring
> buffer as well.

I'm not sure wth that code is even trying to do, that's some seriously
dodgy code.
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Adrian Hunter @ 2021-05-31 19:03 UTC
To: Leo Yan, Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel

On 31/05/21 6:10 pm, Leo Yan wrote:
> Hi Peter, Adrian,
>
> On Wed, May 19, 2021 at 10:03:19PM +0800, Leo Yan wrote:
> > Load-acquire and store-release are one-way permeable barriers, which can
> > be used to guarantee the memory ordering between accessing the buffer
> > data and the buffer's head / tail.
> >
> > This patch optimizes the memory ordering with the load-acquire and
> > store-release barriers.
>
> Is this patch okay for you?
>
> Besides this patch, I have an extra question.  You can see that for
> accessing the AUX buffer's head and tail, it also supports using
> compiler built-in functions for atomic accesses:
>
>   __sync_val_compare_and_swap()
>   __sync_bool_compare_and_swap()
>
> Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still need
> to support __sync_xxx_compare_and_swap() atomicity?

I don't remember, but it seems to me atomicity is needed only for a
32-bit perf running with a 64-bit kernel.

> I checked the code for updating head and tail for the perf ring buffer
> (see ring_buffer_read_head() and ring_buffer_write_tail() in the file
> tools/include/linux/ring_buffer.h), which doesn't support
> __sync_xxx_compare_and_swap() anymore.  This is why I wonder if we
> should drop __sync_xxx_compare_and_swap() atomicity for the AUX ring
> buffer as well.
>
> Thanks,
> Leo
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Leo Yan @ 2021-06-01 6:33 UTC
To: Adrian Hunter
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Mon, May 31, 2021 at 10:03:33PM +0300, Adrian Hunter wrote:
> On 31/05/21 6:10 pm, Leo Yan wrote:
> > Besides this patch, I have an extra question.  You can see that for
> > accessing the AUX buffer's head and tail, it also supports using
> > compiler built-in functions for atomic accesses:
> >
> >   __sync_val_compare_and_swap()
> >   __sync_bool_compare_and_swap()
> >
> > Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still need
> > to support __sync_xxx_compare_and_swap() atomicity?
>
> I don't remember, but it seems to me atomicity is needed only
> for a 32-bit perf running with a 64-bit kernel.

A 32-bit perf wants to access a 64-bit value atomically; I think the
atomics try to avoid the issue caused by this scenario:

  CPU0 (64-bit kernel)        CPU1 (32-bit user)

                              read head_lo
  WRITE_ONCE(head)
                              read head_hi

I dumped the disassembly for reading a 64-bit value in a 32-bit Arm
build of perf and got the results below.

perf Arm32 for READ_ONCE():

  case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break;
   84a:	68fb      	ldr	r3, [r7, #12]
   84c:	e9d3 2300 	ldrd	r2, r3, [r3]
   850:	6939      	ldr	r1, [r7, #16]
   852:	e9c1 2300 	strd	r2, r3, [r1]
   856:	e007      	b.n	868 <auxtrace_mmap__read_head+0xb0>

It uses the instruction ldrd, which is "Load Register Dual (register)",
but this does not mean the access is atomic; based on the comment in the
kernel header include/asm-generic/rwonce.h, I think ldrd/strd will be
"atomic in some cases (namely Armv7 + LPAE), but for others we rely on
the access being split into 2x32-bit accesses".

perf Arm32 for __sync_val_compare_and_swap():

  u64 head = __sync_val_compare_and_swap(&pc->aux_head, 0, 0);
   7d6:	68fb      	ldr	r3, [r7, #12]
   7d8:	f503 6484 	add.w	r4, r3, #1056	; 0x420
   7dc:	f04f 0000 	mov.w	r0, #0
   7e0:	f04f 0100 	mov.w	r1, #0
   7e4:	f3bf 8f5b 	dmb	ish
   7e8:	e8d4 237f 	ldrexd	r2, r3, [r4]
   7ec:	ea52 0c03 	orrs.w	ip, r2, r3
   7f0:	d106      	bne.n	800 <auxtrace_mmap__read_head+0x48>
   7f2:	e8c4 017c 	strexd	ip, r0, r1, [r4]
   7f6:	f1bc 0f00 	cmp.w	ip, #0
   7fa:	f1bc 0f00 	cmp.w	ip, #0
   7fe:	d1f3      	bne.n	7e8 <auxtrace_mmap__read_head+0x30>
   800:	f3bf 8f5b 	dmb	ish
   804:	e9c7 2304 	strd	r2, r3, [r7, #16]

__sync_val_compare_and_swap() uses the instructions ldrexd/strexd, which
rely on the exclusive monitor for accessing a 64-bit value, so this
seems a more reliable way of accessing a 64-bit value in the CPU's
32-bit mode.

Conclusion: it seems to me __sync_xxx_compare_and_swap() should be kept
in this case rather than using READ_ONCE() for 32-bit builds.  Or any
other suggestions?  Thanks!

Leo
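[Editorial note: the read side Leo disassembled boils down to a one-line
idiom, sketched here for illustration — the atomic_read_u64() helper is
an invented name, not perf code. A compare-and-swap of 0 with 0 never
observably modifies the target (it only stores when the value is already
0, and then stores the same 0 back), but it always returns the old
value, which Arm32 reads atomically via ldrexd/strexd.]

```c
#include <assert.h>
#include <stdint.h>

/* Atomically read a 64-bit value, even on 32-bit targets that have
 * compare-and-swap support: CAS(p, 0, 0) leaves *p unchanged in every
 * case but returns the old value read under the exclusive monitor. */
static uint64_t atomic_read_u64(uint64_t *p)
{
	return __sync_val_compare_and_swap(p, 0, 0);
}
```

This is the trick behind the HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT path in
auxtrace.h; Peter's later criticism in this thread is about wrapping a
write loop around it, not about the atomic read itself.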
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-06-01 6:58 UTC
To: Leo Yan
Cc: Adrian Hunter, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Tue, Jun 01, 2021 at 02:33:42PM +0800, Leo Yan wrote:
>
> A 32-bit perf wants to access a 64-bit value atomically; I think the
> atomics try to avoid the issue caused by this scenario:
>
>   CPU0 (64-bit kernel)        CPU1 (32-bit user)
>
>                               read head_lo
>   WRITE_ONCE(head)
>                               read head_hi

Right; so I think Mark and me once spent a bunch of time on this for the
regular ring buffer, but my memory is vague. It was supposed to be that
the high word would always be zero on 32bit, but it turns out that that
is not in fact the case and we get to have this race that's basically
unfixable :/

Or maybe that was only the compat case.. Ah yes, so see the kernel uses
unsigned long, so on 32bit the high word is empty and we always
read/write 0s, unless you're explicitly doing daft things. But on
compat, the high word can be non-zero and we get to have 'fun'.
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Adrian Hunter @ 2021-06-01 9:07 UTC
To: Leo Yan
Cc: Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On 1/06/21 9:33 am, Leo Yan wrote:
> [disassembly comparison of READ_ONCE() vs __sync_val_compare_and_swap()
>  on Arm32 snipped]
>
> __sync_val_compare_and_swap() uses the instructions ldrexd/strexd,
> which rely on the exclusive monitor for accessing a 64-bit value, so
> this seems a more reliable way of accessing a 64-bit value in the
> CPU's 32-bit mode.
>
> Conclusion: it seems to me __sync_xxx_compare_and_swap() should be
> kept in this case rather than using READ_ONCE() for 32-bit builds.
> Or any other suggestions?  Thanks!

__sync_xxx_compare_and_swap is out-of-date now.  This page:

https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins

recommends '__atomic' builtins instead.

Since atomics are needed only for the "compat" case (i.e. 32-bit perf
with a 64-bit kernel), you could try to find an elegant way to check for
a 64-bit kernel, and avoid the atomics for a 32-bit perf with a 32-bit
kernel.
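[Editorial note: as a sketch of what Adrian's suggestion might look
like — purely an assumption, not proposed perf code. The
head_read()/tail_write() names are invented, and on 32-bit targets
64-bit __atomic operations may require linking against libatomic.]

```c
#include <assert.h>
#include <stdint.h>

/* Acquire-load of the head: the old __sync CAS-with-(0, 0) read,
 * expressed with the newer __atomic builtins. */
static uint64_t head_read(uint64_t *head)
{
	return __atomic_load_n(head, __ATOMIC_ACQUIRE);
}

/* Release-store of the tail: replaces the __sync CAS write loop. */
static void tail_write(uint64_t *tail, uint64_t tail_val)
{
	__atomic_store_n(tail, tail_val, __ATOMIC_RELEASE);
}
```

The __atomic builtins also fold the ordering into the access itself, so
no separate smp_rmb()/smp_mb() call is needed around them.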
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-06-01 9:17 UTC
To: Adrian Hunter
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Tue, Jun 01, 2021 at 12:07:31PM +0300, Adrian Hunter wrote:
> __sync_xxx_compare_and_swap is out-of-date now.  This page:
>
> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins
>
> recommends '__atomic' builtins instead.

perf doesn't seem to use that.

> Since atomics are needed only for the "compat" case (i.e. 32-bit perf
> with a 64-bit kernel), you could try to find an elegant way to check
> for a 64-bit kernel, and avoid the atomics for a 32-bit perf with a
> 32-bit kernel.

Most 32bit archs cannot do 64bit atomics. I suppose the only reason this
doesn't explode is because the aux stuff isn't supported on many
architectures?
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Adrian Hunter @ 2021-06-01 9:45 UTC
To: Peter Zijlstra
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On 1/06/21 12:17 pm, Peter Zijlstra wrote:
> On Tue, Jun 01, 2021 at 12:07:31PM +0300, Adrian Hunter wrote:
>> __sync_xxx_compare_and_swap is out-of-date now.  This page:
>>
>> https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins
>>
>> recommends '__atomic' builtins instead.
>
> perf doesn't seem to use that.

I guess we could drop support for the compat case; add validation:

  "Error, 32-bit perf cannot record AUX area traces from a 64-bit
   kernel.  Please use a 64-bit version of perf instead."

>> Since atomics are needed only for the "compat" case (i.e. 32-bit perf
>> with a 64-bit kernel), you could try to find an elegant way to check
>> for a 64-bit kernel, and avoid the atomics for a 32-bit perf with a
>> 32-bit kernel.
>
> Most 32bit archs cannot do 64bit atomics. I suppose the only reason
> this doesn't explode is because the aux stuff isn't supported on many
> architectures?

Yes, but presumably the race itself is unlikely since the upper 32 bits
change only once every 4 GiB.
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-06-01 9:48 UTC
To: Adrian Hunter
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Tue, Jun 01, 2021 at 12:45:16PM +0300, Adrian Hunter wrote:
> On 1/06/21 12:17 pm, Peter Zijlstra wrote:
> > perf doesn't seem to use that.
>
> I guess we could drop support for the compat case; add validation:
>
>   "Error, 32-bit perf cannot record AUX area traces from a 64-bit
>    kernel.  Please use a 64-bit version of perf instead."

For AUX, possibly; sadly the exact same problem exists for the normal
buffer IIRC.
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Leo Yan @ 2021-06-01 14:56 UTC
To: Adrian Hunter
Cc: Peter Zijlstra, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Tue, Jun 01, 2021 at 12:45:16PM +0300, Adrian Hunter wrote:
> On 1/06/21 12:17 pm, Peter Zijlstra wrote:
> > perf doesn't seem to use that.
>
> I guess we could drop support for the compat case; add validation:
>
>   "Error, 32-bit perf cannot record AUX area traces from a 64-bit
>    kernel.  Please use a 64-bit version of perf instead."

I am not sure what a good method to detect the compat case is.  I can
think of using the conditions below:

	if ((sizeof(unsigned long) == 4) &&
	    (machine__is(machine, "x86_64") ||
	     machine__is(machine, "arm64") ||
	     machine__is(machine, "aarch64"))) {
		pr_warn("Error, 32-bit perf cannot record from a 64-bit kernel.\n"
			"Please use a 64-bit version of perf instead.\n");
		return -ENOTSUP;
	}

Just want to check: is there a better way to detect the compat case in
perf?

Thanks for the suggestions,
Leo

> > Most 32bit archs cannot do 64bit atomics. I suppose the only reason
> > this doesn't explode is because the aux stuff isn't supported on many
> > architectures?
>
> Yes, but presumably the race itself is unlikely since the upper 32 bits
> change only once every 4 GiB.
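[Editorial note: one conceivable alternative to the machine__is() checks
above — an assumption for illustration only, not perf code — is to ask
the running kernel directly via uname(2), whose machine field reports
the kernel architecture even when the calling process is 32-bit. The
kernel_is_64bit() helper and its substring heuristic are invented here.]

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>

/* Heuristic sketch: report whether the running kernel is 64-bit.
 * Most 64-bit uts.machine strings contain "64" (x86_64, aarch64,
 * ppc64, riscv64, sparc64); s390x is the notable exception. */
static int kernel_is_64bit(void)
{
	struct utsname uts;

	if (uname(&uts) < 0)
		return 0;	/* assume 32-bit if uname fails */
	return strstr(uts.machine, "64") != NULL ||
	       strcmp(uts.machine, "s390x") == 0;
}
```

A 32-bit perf (sizeof(unsigned long) == 4) could then refuse to record
whenever this helper returns non-zero, covering the compat case without
hard-coding an architecture list.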
* Re: [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release
From: Peter Zijlstra @ 2021-06-01 6:48 UTC
To: Adrian Hunter
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Mon, May 31, 2021 at 10:03:33PM +0300, Adrian Hunter wrote:
> On 31/05/21 6:10 pm, Leo Yan wrote:
> > Since we now have READ_ONCE()/WRITE_ONCE(), do you think we still
> > need to support __sync_xxx_compare_and_swap() atomicity?
>
> I don't remember, but it seems to me atomicity is needed only
> for a 32-bit perf running with a 64-bit kernel.

But that:

	do {
		old_tail = __sync_val_compare_and_swap(&pc->aux_tail, 0, 0);
	} while (!__sync_bool_compare_and_swap(&pc->aux_tail, old_tail, tail));

doesn't want to make sense to me. It appears to do a cmpxchg with 0
values to load old_tail, and then a further cmpxchg with old_tail to
write the new tail.

That's some really crazy code. That makes absolutely no sense whatsoever
and needs to just be deleted.

On top of that, it uses atomic intrinsics for a u64, which is not
something that'll actually work on a whole bunch of 32bit platforms.
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers
From: Adrian Hunter @ 2021-05-27 7:54 UTC
To: Leo Yan, Arnaldo Carvalho de Melo, Peter Zijlstra, Ingo Molnar,
    Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen,
    linux-perf-users, linux-kernel

On 19/05/21 5:03 pm, Leo Yan wrote:
> The AUX ring buffer's head and tail can be accessed from multiple CPUs
> on SMP system, so changes to use SMP memory barriers to replace the
> uniprocessor barriers.

I don't think user space should attempt to be SMP-aware.

For perf tools, on __x86_64__ it looks like smp_rmb() is only a compiler
barrier, whereas rmb() is an "lfence" memory barrier instruction, so
this patch does not seem to do what the commit message says, at least
for x86.

With regard to the AUX area, we don't know in general how data gets
there, so using memory barriers seems sensible.

> [patch diff quoted in full, snipped]
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers
From: Peter Zijlstra @ 2021-05-27 8:11 UTC
To: Adrian Hunter
Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
    Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users,
    linux-kernel

On Thu, May 27, 2021 at 10:54:56AM +0300, Adrian Hunter wrote:
> On 19/05/21 5:03 pm, Leo Yan wrote:
> > The AUX ring buffer's head and tail can be accessed from multiple CPUs
> > on SMP system, so changes to use SMP memory barriers to replace the
> > uniprocessor barriers.
>
> I don't think user space should attempt to be SMP-aware.

Uhh, what? It pretty much has to. Since userspace cannot assume UP, it
must assume SMP.

> For perf tools, on __x86_64__ it looks like smp_rmb() is only a
> compiler barrier, whereas rmb() is an "lfence" memory barrier
> instruction, so this patch does not seem to do what the commit message
> says, at least for x86.

The commit message is somewhat confused; *mb() are not UP barriers
(although they are available and useful on UP). They're device/dma
barriers.

> With regard to the AUX area, we don't know in general how data gets
> there, so using memory barriers seems sensible.

IIRC (but I didn't check) the rule was that the kernel needs to ensure
the AUX area is complete before it updates the head pointer. So if
userspace can observe the head pointer, it must then also be able to
observe the data. This is not something userspace can fix up anyway.

The ordering here is between the head pointer and the data, and from a
userspace perspective that's a regular smp ordering. Similar for the
tail update: that's between our reading the data and writing the tail,
regular cache-coherent smp ordering.

So ACK on the patch, it's sane and an optimization for both x86 and ARM.
Just the Changelog needs work.

> > [patch diff quoted in full, snipped]
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 8:11 ` Peter Zijlstra @ 2021-05-27 8:25 ` Adrian Hunter 2021-05-27 9:24 ` Adrian Hunter 2021-05-27 9:45 ` Peter Zijlstra 0 siblings, 2 replies; 22+ messages in thread From: Adrian Hunter @ 2021-05-27 8:25 UTC (permalink / raw To: Peter Zijlstra Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel On 27/05/21 11:11 am, Peter Zijlstra wrote: > On Thu, May 27, 2021 at 10:54:56AM +0300, Adrian Hunter wrote: >> On 19/05/21 5:03 pm, Leo Yan wrote: >>> The AUX ring buffer's head and tail can be accessed from multiple CPUs >>> on SMP system, so changes to use SMP memory barriers to replace the >>> uniprocessor barriers. >> >> I don't think user space should attempt to be SMP-aware. > > Uhh, what? It pretty much has to. Since userspace cannot assume UP, it > must assume SMP. Yeah that is what I meant, but consequently we generally shouldn't be using functions called smp_<anything> > >> For perf tools, on __x86_64__ it looks like smp_rmb() is only a compiler barrier, whereas >> rmb() is a "lfence" memory barrier instruction, so this patch does not >> seem to do what the commit message says at least for x86. > > The commit message is somewhat confused; *mb() are not UP barriers > (although they are available and useful on UP). They're device/dma > barriers. > >> With regard to the AUX area, we don't know in general how data gets there, >> so using memory barriers seems sensible. > > IIRC (but I ddn't check) the rule was that the kernel needs to ensure > the AUX area is complete before it updates the head pointer. So if > userspace can observe the head pointer, it must then also be able to > observe the data. This is not something userspace can fix up anyway. > > The ordering here is between the head pointer and the data, and from a > userspace perspective that's a regular smp ordering. 
Similar for the > tail update, that's between our reading the data and writing the tail, > regular cache coherent smp ordering. > > So ACK on the patch, it's sane and an optimization for both x86 and ARM. > Just the Changelog needs work. If all we want is a compiler barrier, then shouldn't that be what we use? i.e. barrier() > >>> Signed-off-by: Leo Yan <leo.yan@linaro.org> >>> --- >>> tools/perf/util/auxtrace.h | 6 +++--- >>> 1 file changed, 3 insertions(+), 3 deletions(-) >>> >>> diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h >>> index 472c0973b1f1..8bed284ccc82 100644 >>> --- a/tools/perf/util/auxtrace.h >>> +++ b/tools/perf/util/auxtrace.h >>> @@ -452,7 +452,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm) >>> u64 head = READ_ONCE(pc->aux_head); >>> >>> /* Ensure all reads are done after we read the head */ >>> - rmb(); >>> + smp_rmb(); >>> return head; >>> } >>> >>> @@ -466,7 +466,7 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm) >>> #endif >>> >>> /* Ensure all reads are done after we read the head */ >>> - rmb(); >>> + smp_rmb(); >>> return head; >>> } >>> >>> @@ -478,7 +478,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail) >>> #endif >>> >>> /* Ensure all reads are done before we write the tail out */ >>> - mb(); >>> + smp_mb(); >>> #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT) >>> pc->aux_tail = tail; >>> #else >>> >> ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 8:25 ` Adrian Hunter @ 2021-05-27 9:24 ` Adrian Hunter 2021-05-27 9:57 ` Peter Zijlstra 2021-05-27 9:45 ` Peter Zijlstra 1 sibling, 1 reply; 22+ messages in thread From: Adrian Hunter @ 2021-05-27 9:24 UTC (permalink / raw To: Peter Zijlstra Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel On 27/05/21 11:25 am, Adrian Hunter wrote: > On 27/05/21 11:11 am, Peter Zijlstra wrote: >> On Thu, May 27, 2021 at 10:54:56AM +0300, Adrian Hunter wrote: >>> On 19/05/21 5:03 pm, Leo Yan wrote: >>>> The AUX ring buffer's head and tail can be accessed from multiple CPUs >>>> on SMP system, so changes to use SMP memory barriers to replace the >>>> uniprocessor barriers. >>> >>> I don't think user space should attempt to be SMP-aware. >> >> Uhh, what? It pretty much has to. Since userspace cannot assume UP, it >> must assume SMP. > > Yeah that is what I meant, but consequently we generally shouldn't be > using functions called smp_<anything> > >> >>> For perf tools, on __x86_64__ it looks like smp_rmb() is only a compiler barrier, whereas >>> rmb() is a "lfence" memory barrier instruction, so this patch does not >>> seem to do what the commit message says at least for x86. >> >> The commit message is somewhat confused; *mb() are not UP barriers >> (although they are available and useful on UP). They're device/dma >> barriers. >> >>> With regard to the AUX area, we don't know in general how data gets there, >>> so using memory barriers seems sensible. >> >> IIRC (but I ddn't check) the rule was that the kernel needs to ensure >> the AUX area is complete before it updates the head pointer. So if >> userspace can observe the head pointer, it must then also be able to >> observe the data. This is not something userspace can fix up anyway. 
>> >> The ordering here is between the head pointer and the data, and from a >> userspace perspective that's a regular smp ordering. Similar for the >> tail update, that's between our reading the data and writing the tail, >> regular cache coherent smp ordering. >> >> So ACK on the patch, it's sane and an optimization for both x86 and ARM. >> Just the Changelog needs work. > > If all we want is a compiler barrier, then shouldn't that be what we use? > i.e. barrier() I guess you are saying we still need to stop potential re-ordering across CPUs, so please ignore my comments. > >> >>>> Signed-off-by: Leo Yan <leo.yan@linaro.org> >>>> --- >>>> tools/perf/util/auxtrace.h | 6 +++--- >>>> 1 file changed, 3 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h >>>> index 472c0973b1f1..8bed284ccc82 100644 >>>> --- a/tools/perf/util/auxtrace.h >>>> +++ b/tools/perf/util/auxtrace.h >>>> @@ -452,7 +452,7 @@ static inline u64 auxtrace_mmap__read_snapshot_head(struct auxtrace_mmap *mm) >>>> u64 head = READ_ONCE(pc->aux_head); >>>> >>>> /* Ensure all reads are done after we read the head */ >>>> - rmb(); >>>> + smp_rmb(); >>>> return head; >>>> } >>>> >>>> @@ -466,7 +466,7 @@ static inline u64 auxtrace_mmap__read_head(struct auxtrace_mmap *mm) >>>> #endif >>>> >>>> /* Ensure all reads are done after we read the head */ >>>> - rmb(); >>>> + smp_rmb(); >>>> return head; >>>> } >>>> >>>> @@ -478,7 +478,7 @@ static inline void auxtrace_mmap__write_tail(struct auxtrace_mmap *mm, u64 tail) >>>> #endif >>>> >>>> /* Ensure all reads are done before we write the tail out */ >>>> - mb(); >>>> + smp_mb(); >>>> #if BITS_PER_LONG == 64 || !defined(HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT) >>>> pc->aux_tail = tail; >>>> #else >>>> >>> > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 9:24 ` Adrian Hunter @ 2021-05-27 9:57 ` Peter Zijlstra 2021-05-31 14:53 ` Leo Yan 0 siblings, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2021-05-27 9:57 UTC (permalink / raw To: Adrian Hunter Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel

On Thu, May 27, 2021 at 12:24:15PM +0300, Adrian Hunter wrote:
> > If all we want is a compiler barrier, then shouldn't that be what we use?
> > i.e. barrier()
>
> I guess you are saying we still need to stop potential re-ordering across
> CPUs, so please ignore my comments.

Right; so the ordering issue is real, consider:

    CPU0 (kernel)       CPU1 (user)

    write data          read head
    smp_wmb()           smp_rmb()
    write head          read data

Without explicit ordering (on either side), we might either read data
that isn't written yet:

                      ,--(read data)
    write data        |
    smp_wmb()         |
    write head ---.   |
                  `-->|  read head
                      `- read data

Where the head load observes the new head written, but the data load is
speculated and loads data before it is written.

Or, we can write the head before the data write is visible:

      ,-- write data
      |   write head
      |              read head
      |              smp_rmb()
      |              read data
      `-> (data visible)

Where we read the new head, but still observe stale data because the
stores got committed out of order.

x86 is TSO, so neither reordering is actually possible, hence both
barriers being a compiler barrier (to ensure the compiler doesn't
reorder them for us). But weaker hardware *will* allow those orderings
and we very much need actual barriers there.

Welcome to the magical world of memory ordering and weak architectures.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 9:57 ` Peter Zijlstra @ 2021-05-31 14:53 ` Leo Yan 2021-05-31 15:48 ` Peter Zijlstra 0 siblings, 1 reply; 22+ messages in thread From: Leo Yan @ 2021-05-31 14:53 UTC (permalink / raw To: Peter Zijlstra Cc: Adrian Hunter, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel

Hi Peter, Adrian,

On Thu, May 27, 2021 at 11:57:37AM +0200, Peter Zijlstra wrote:
> On Thu, May 27, 2021 at 12:24:15PM +0300, Adrian Hunter wrote:
> >
> > > If all we want is a compiler barrier, then shouldn't that be what we use?
> > > i.e. barrier()

Sorry for being a bit late. Just bringing up one question before I
respin this patch set.

> > I guess you are saying we still need to stop potential re-ordering across
> > CPUs, so please ignore my comments.
>
> Right; so the ordering issue is real, consider:
>
>     CPU0 (kernel)       CPU1 (user)
>
>     write data          read head
>     smp_wmb()           smp_rmb()
>     write head          read data

One thing should be mentioned: the Linux kernel has _not_ used an
explicit "smp_wmb()" between writing AUX trace data and updating the
header "aux_head". Please see the function perf_aux_output_end():

  void perf_aux_output_end(..., size)
  {
	...

	if (size || (handle->aux_flags & ~(u64)PERF_AUX_FLAG_OVERWRITE))
		perf_event_aux_event(handle->event, aux_head, size,
				     handle->aux_flags);

	WRITE_ONCE(rb->user_page->aux_head, rb->aux_head);

	...
  }

But I think it's needless to add "smp_wmb()" prior to the WRITE_ONCE()
statement. This is because, before updating "aux_head", it calls
perf_event_aux_event(), so a PERF_RECORD_AUX event is filled into the
perf ring buffer, which executes smp_wmb() and then updates the header
"user_page->data_head"; so the flow is like below:

  Fill AUX trace data to AUX ring buffer
  Fill RECORD_AUX event into perf ring buffer
  smp_wmb()
  update "user_page->data_head"  -> See perf_event_aux_event()/perf_output_end()
  update "user_page->aux_head"

It is a bit weird that the two ring buffers (the AUX ring buffer and the
generic perf ring buffer) share the same memory barrier between writing
data and writing their headers.

Do you think I understand this correctly? Or should we add an explicit
"smp_wmb()" before WRITE_ONCE(rb->user_page->aux_head, ...)?

Thanks,
Leo

> Without explicit ordering (on either side), we might either read data
> that isn't written yet:
>
>                       ,--(read data)
>     write data        |
>     smp_wmb()         |
>     write head ---.   |
>                   `-->|  read head
>                       `- read data
>
> Where the head load observes the new head written, but the data load is
> speculated and loads data before it is written.
>
> Or, we can write the head before the data write is visible:
>
>       ,-- write data
>       |   write head
>       |              read head
>       |              smp_rmb()
>       |              read data
>       `-> (data visible)
>
> Where we read the new head, but still observe stale data because the
> stores got committed out of order.
>
> x86 is TSO, so neither reordering is actually possible, hence both
> barriers being a compiler barrier (to ensure the compiler doesn't
> reorder them for us). But weaker hardware *will* allow those orderings
> and we very much need actual barriers there.
>
> Welcome to the magical world of memory ordering and weak architectures.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-31 14:53 ` Leo Yan @ 2021-05-31 15:48 ` Peter Zijlstra 2021-06-01 3:21 ` Leo Yan 0 siblings, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2021-05-31 15:48 UTC (permalink / raw To: Leo Yan Cc: Adrian Hunter, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel On Mon, May 31, 2021 at 10:53:02PM +0800, Leo Yan wrote: > Hi Peter, Adrian, > > On Thu, May 27, 2021 at 11:57:37AM +0200, Peter Zijlstra wrote: > > On Thu, May 27, 2021 at 12:24:15PM +0300, Adrian Hunter wrote: > > > > > > If all we want is a compiler barrier, then shouldn't that be what we use? > > > > i.e. barrier() > > Sorry for a bit late. Just bring up one question before I respin > this patch set. > > > > I guess you are saying we still need to stop potential re-ordering across > > > CPUs, so please ignore my comments. > > > > Right; so the ordering issue is real, consider: > > > > CPU0 (kernel) CPU1 (user) > > > > write data read head > > smp_wmb() smp_rmb() > > write head read data > > One thing should be mentioned is the Linux kernel has _not_ used an > explict "smb_wmb()" between writing AUX trace data and updating header > "aux_head". Please see the function perf_aux_output_end(): I think we pushed that into the driver. There is nothing the generic code can do here. It is the drivers responsibility of ensuring the data is stable before calling perf_aux_output_end() or something along those lines. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-31 15:48 ` Peter Zijlstra @ 2021-06-01 3:21 ` Leo Yan 0 siblings, 0 replies; 22+ messages in thread From: Leo Yan @ 2021-06-01 3:21 UTC (permalink / raw To: Peter Zijlstra Cc: Adrian Hunter, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel

On Mon, May 31, 2021 at 05:48:03PM +0200, Peter Zijlstra wrote:
> On Mon, May 31, 2021 at 10:53:02PM +0800, Leo Yan wrote:
> > Hi Peter, Adrian,
> >
> > On Thu, May 27, 2021 at 11:57:37AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 27, 2021 at 12:24:15PM +0300, Adrian Hunter wrote:
> > > >
> > > > > If all we want is a compiler barrier, then shouldn't that be what we use?
> > > > > i.e. barrier()
> >
> > Sorry for a bit late. Just bring up one question before I respin
> > this patch set.
> >
> > > > I guess you are saying we still need to stop potential re-ordering across
> > > > CPUs, so please ignore my comments.
> > >
> > > Right; so the ordering issue is real, consider:
> > >
> > >     CPU0 (kernel)       CPU1 (user)
> > >
> > >     write data          read head
> > >     smp_wmb()           smp_rmb()
> > >     write head          read data
> >
> > One thing should be mentioned is the Linux kernel has _not_ used an
> > explict "smb_wmb()" between writing AUX trace data and updating header
> > "aux_head". Please see the function perf_aux_output_end():
>
> I think we pushed that into the driver. There is nothing the generic
> code can do here.
>
> It is the drivers responsibility of ensuring the data is stable before
> calling perf_aux_output_end() or something along those lines.

Thanks for the explanation.

I reviewed the drivers; some of them use memory barriers (e.g. Intel PT,
Arm SPE), but some drivers fail to use memory barriers before calling
perf_aux_output_end() (like Arm CoreSight and Intel BTS). Will address
this issue in the next patch set.

Thanks,
Leo

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers 2021-05-27 8:25 ` Adrian Hunter 2021-05-27 9:24 ` Adrian Hunter @ 2021-05-27 9:45 ` Peter Zijlstra 1 sibling, 0 replies; 22+ messages in thread From: Peter Zijlstra @ 2021-05-27 9:45 UTC (permalink / raw To: Adrian Hunter Cc: Leo Yan, Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland, Alexander Shishkin, Namhyung Kim, Andi Kleen, linux-perf-users, linux-kernel On Thu, May 27, 2021 at 11:25:40AM +0300, Adrian Hunter wrote: > On 27/05/21 11:11 am, Peter Zijlstra wrote: > > On Thu, May 27, 2021 at 10:54:56AM +0300, Adrian Hunter wrote: > >> On 19/05/21 5:03 pm, Leo Yan wrote: > >>> The AUX ring buffer's head and tail can be accessed from multiple CPUs > >>> on SMP system, so changes to use SMP memory barriers to replace the > >>> uniprocessor barriers. > >> > >> I don't think user space should attempt to be SMP-aware. > > > > Uhh, what? It pretty much has to. Since userspace cannot assume UP, it > > must assume SMP. > > Yeah that is what I meant, but consequently we generally shouldn't be > using functions called smp_<anything> Of course we should; they're the SMP class of barriers. > > So ACK on the patch, it's sane and an optimization for both x86 and ARM. > > Just the Changelog needs work. > > If all we want is a compiler barrier, then shouldn't that be what we use? > i.e. barrier() No, we want the SMP barriers, smp_rmb() happens to be a compiler barrier on x86, but it will be dmb(ishld) on Aarrgh64 for example. ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads: [~2021-06-01 14:57 UTC | newest]

Thread overview: 22+ messages:
2021-05-19 14:03 [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers Leo Yan
2021-05-19 14:03 ` [PATCH v1 2/2] perf auxtrace: Optimize barriers with load-acquire and store-release Leo Yan
2021-05-31 15:10   ` Leo Yan
2021-05-31 15:55     ` Peter Zijlstra
2021-05-31 19:03       ` Adrian Hunter
2021-06-01  6:33         ` Leo Yan
2021-06-01  6:58           ` Peter Zijlstra
2021-06-01  9:07             ` Adrian Hunter
2021-06-01  9:17               ` Peter Zijlstra
2021-06-01  9:45                 ` Adrian Hunter
2021-06-01  9:48                   ` Peter Zijlstra
2021-06-01 14:56                     ` Leo Yan
2021-06-01  6:48         ` Peter Zijlstra
2021-05-27  7:54 ` [PATCH v1 1/2] perf auxtrace: Change to use SMP memory barriers Adrian Hunter
2021-05-27  8:11   ` Peter Zijlstra
2021-05-27  8:25     ` Adrian Hunter
2021-05-27  9:24       ` Adrian Hunter
2021-05-27  9:57         ` Peter Zijlstra
2021-05-31 14:53           ` Leo Yan
2021-05-31 15:48             ` Peter Zijlstra
2021-06-01  3:21               ` Leo Yan
2021-05-27  9:45       ` Peter Zijlstra