* [RFC PATCH] powerpc: Optimise barriers for fully ordered atomics
From: Nicholas Piggin @ 2024-04-12 17:25 UTC
  To: linuxppc-dev; +Cc: Nicholas Piggin

"Fully ordered" atomics (RMW that return a value) are said to have a
full barrier before and after the atomic operation. This is implemented
as:

  hwsync
  larx
  ...
  stcx.
  bne-
  hwsync
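
For concreteness, this is roughly (from memory -- the tree generates
these via macros in asm/atomic.h, so details may differ) how the two
barrier macros being patched frame a fully ordered RMW such as
atomic_add_return:

  static __inline__ int atomic_add_return(int a, atomic_t *v)
  {
          int t;

          __asm__ __volatile__(
          PPC_ATOMIC_ENTRY_BARRIER        /* currently hwsync */
  "1:     lwarx   %0,0,%2\n"              /* load v->counter, set reservation */
  "       add     %0,%1,%0\n"
  "       stwcx.  %0,0,%2\n"              /* store only if reservation intact */
  "       bne-    1b"                     /* reservation lost: retry */
          PPC_ATOMIC_EXIT_BARRIER         /* currently hwsync */
          : "=&r" (t)
          : "r" (a), "r" (&v->counter)
          : "cc", "memory");

          return t;
  }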

This is slow on POWER processors because hwsync and stcx. require a
round-trip to the nest (~= L2 cache). The hwsyncs can be avoided with
the sequence:

  lwsync
  larx
  ...
  stcx.
  bne-
  isync

lwsync prevents all reorderings except store/load reordering, so the
larx could be executed ahead of a prior store becoming visible.
However, the stcx. is a store, so it is ordered by the lwsync against
all prior accesses, and it will fail if the value in memory has been
modified since the larx. So the point at which the larx executes is
not a concern, because the stcx. always verifies the memory was
unchanged.

The isync prevents subsequent instructions from being executed before
the stcx. executes, and the stcx. is necessarily visible to the system
after it executes, so there is no opportunity for it (or prior stores,
thanks to the lwsync) to become visible after a subsequent load or
store.

This sequence requires only one L2 round-trip, and so is around 2x
faster as measured on a POWER10 with back-to-back atomic ops on cached
memory.

[ It remains to be seen whether this is always faster when there is
other activity going on, and whether it's faster on non-POWER CPUs or
perhaps older ones like the 970 that might not optimise isync so much. ]
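
[ A minimal userspace sketch of this kind of back-to-back measurement,
for anyone who wants to try other CPUs. Illustrative only, not the
exact harness used here; the op count and timebase units are arbitrary
choices:

  #include <stdio.h>

  static inline unsigned long mftb(void)
  {
          unsigned long tb;
          asm volatile("mftb %0" : "=r" (tb));
          return tb;
  }

  /* the proposed lwsync ... larx/stcx. ... isync flavour */
  static inline int add_return(int a, int *p)
  {
          int t;
          asm volatile(
          "       lwsync\n"
          "1:     lwarx   %0,0,%2\n"
          "       add     %0,%1,%0\n"
          "       stwcx.  %0,0,%2\n"
          "       bne-    1b\n"
          "       isync"
          : "=&r" (t) : "r" (a), "r" (p) : "cc", "memory");
          return t;
  }

  int main(void)
  {
          int v = 0;
          const int n = 1000000;
          unsigned long t0, t1;

          t0 = mftb();
          for (int i = 0; i < n; i++)
                  add_return(1, &v);
          t1 = mftb();
          printf("%lu timebase ticks per op\n", (t1 - t0) / n);
          return 0;
  }
]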

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/synch.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/synch.h b/arch/powerpc/include/asm/synch.h
index b0b4c64870d7..0b1718eb9a40 100644
--- a/arch/powerpc/include/asm/synch.h
+++ b/arch/powerpc/include/asm/synch.h
@@ -60,8 +60,8 @@ static inline void ppc_after_tlbiel_barrier(void)
 	MAKE_LWSYNC_SECTION_ENTRY(97, __lwsync_fixup);
 #define PPC_ACQUIRE_BARRIER	 "\n" stringify_in_c(__PPC_ACQUIRE_BARRIER)
 #define PPC_RELEASE_BARRIER	 stringify_in_c(LWSYNC) "\n"
-#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(sync) "\n"
-#define PPC_ATOMIC_EXIT_BARRIER	 "\n" stringify_in_c(sync) "\n"
+#define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(LWSYNC) "\n"
+#define PPC_ATOMIC_EXIT_BARRIER	 "\n" stringify_in_c(isync) "\n"
 #else
 #define PPC_ACQUIRE_BARRIER
 #define PPC_RELEASE_BARRIER
-- 
2.43.0



* Re: [RFC PATCH] powerpc: Optimise barriers for fully ordered atomics
From: Michael Ellerman @ 2024-04-13  9:48 UTC
  To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin

Nicholas Piggin <npiggin@gmail.com> writes:
> "Fully ordered" atomics (RMW that return a value) are said to have a
> full barrier before and after the atomic operation. This is implemented
> as:
>
>   hwsync
>   larx
>   ...
>   stcx.
>   bne-
>   hwsync
>
> This is slow on POWER processors because hwsync and stcx. require a
> round-trip to the nest (~= L2 cache). The hwsyncs can be avoided with
> the sequence:
>
>   lwsync
>   larx
>   ...
>   stcx.
>   bne-
>   isync
>
> lwsync prevents all reorderings except store/load reordering, so the
> larx could be executed ahead of a prior store becoming visible.
> However, the stcx. is a store, so it is ordered by the lwsync against
> all prior accesses, and it will fail if the value in memory has been
> modified since the larx. So the point at which the larx executes is
> not a concern, because the stcx. always verifies the memory was
> unchanged.
>
> The isync prevents subsequent instructions from being executed before
> the stcx. executes, and the stcx. is necessarily visible to the system
> after it executes, so there is no opportunity for it (or prior stores,
> thanks to the lwsync) to become visible after a subsequent load or
> store.

AFAICS commit b97021f85517 ("powerpc: Fix atomic_xxx_return barrier
semantics") disagrees.

That was 2011, so maybe it's wrong or outdated?

Either way it would be good to have some litmus tests to back this up.
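
Something of this shape is what I have in mind -- a completely untested
sketch from memory of the litmus format (test name, registers and the
Exit0 label are arbitrary; may need syntax fixes before herd7/ppcmem
will take it):

  PPC SB+lwsync-stcx-isync+sync
  {
  0:r1=1; 0:r2=x; 0:r4=z; 0:r6=y;
  1:r1=1; 1:r2=y; 1:r4=x;
  }
   P0              | P1            ;
   stw r1,0(r2)    | stw r1,0(r2)  ;
   lwsync          | sync          ;
   lwarx r3,0,r4   | lwz r3,0(r4)  ;
   addi r3,r3,1    |               ;
   stwcx. r3,0,r4  |               ;
   bne Exit0       |               ;
   isync           |               ;
   lwz r5,0(r6)    |               ;
   Exit0:          |               ;
  exists
  (z=1 /\ 0:r5=0 /\ 1:r3=0)

If that outcome is Allowed -- both loads reading 0 even though the
stwcx. succeeded (z=1) -- then lwsync + isync is not giving the atomic
full-barrier semantics.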

cheers

ps. isn't there a rule against sending barrier patches after midnight? ;)


* Re: [RFC PATCH] powerpc: Optimise barriers for fully ordered atomics
From: Nicholas Piggin @ 2024-04-16  2:12 UTC
  To: Michael Ellerman, linuxppc-dev

On Sat Apr 13, 2024 at 7:48 PM AEST, Michael Ellerman wrote:
> Nicholas Piggin <npiggin@gmail.com> writes:
> > "Fully ordered" atomics (RMW that return a value) are said to have a
> > full barrier before and after the atomic operation. This is implemented
> > as:
> >
> >   hwsync
> >   larx
> >   ...
> >   stcx.
> >   bne-
> >   hwsync
> >
> > This is slow on POWER processors because hwsync and stcx. require a
> > round-trip to the nest (~= L2 cache). The hwsyncs can be avoided with
> > the sequence:
> >
> >   lwsync
> >   larx
> >   ...
> >   stcx.
> >   bne-
> >   isync
> >
> > lwsync prevents all reorderings except store/load reordering, so the
> > larx could be executed ahead of a prior store becoming visible.
> > However, the stcx. is a store, so it is ordered by the lwsync against
> > all prior accesses, and it will fail if the value in memory has been
> > modified since the larx. So the point at which the larx executes is
> > not a concern, because the stcx. always verifies the memory was
> > unchanged.
> >
> > The isync prevents subsequent instructions from being executed before
> > the stcx. executes, and the stcx. is necessarily visible to the system
> > after it executes, so there is no opportunity for it (or prior stores,
> > thanks to the lwsync) to become visible after a subsequent load or
> > store.
>
> AFAICS commit b97021f85517 ("powerpc: Fix atomic_xxx_return barrier
> semantics") disagrees.
>
> That was 2011, so maybe it's wrong or outdated?

Hmm, thanks for the reference. I didn't know about that.

isync, or otherwise ordering execution / completion of a load after a
previous plain store, doesn't guarantee visibility ordering, because
stores drain from store queues and become visible some time after they
complete.
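
i.e. the hazard I mean is something like (sketch):

  stw     r1,0(r2)   # completes, then sits in the store queue
  lwsync
  larx
  ...
  stcx.
  bne-
  isync              # orders *execution* of the next load, but
  lwz     r3,0(r4)   # the stw may still not be visible to other
                     # CPUs by the time this load is satisfied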

Okay, but I was thinking a successful stcx. should have to be visible.
I guess that's not true if it broke on P7. Maybe the architecture does
permit a successful stcx. not to have this property, though; I find
this part of the architecture pretty hard to read. Clearly it has to
be visible to other processors performing larx/stcx., otherwise the
canonical stcx. ; bne- ; isync sequence (sketched below) couldn't
provide mutual exclusion.
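
That is, the usual test-and-set acquire, roughly (register numbers
illustrative):

  1:    lwarx   r9,0,r3    # load lock word, set reservation
        cmpwi   r9,0       # already held?
        bne-    1b
        stwcx.  r10,0,r3   # r10 = owner token, try to take it
        bne-    1b         # reservation lost, retry
        isync              # acquire: the critical section can't
                           # execute before the lock is won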

I wonder if the P7 breakage was caused by the "wormhole coherency"
thing that was changed ("fixed") in POWER9. I'll have to look into it
a bit more. The cache coherency guy I was talking to retired before
answering :/ I'll have to find another victim.

I should try to break a P8 too.

>
> Either way it would be good to have some litmus tests to back this up.
>
> cheers
>
> ps. isn't there a rule against sending barrier patches after midnight? ;)

There should be if not.

Thanks,
Nick

