* [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs
@ 2024-03-29  8:03 Eric Biggers
From: Eric Biggers @ 2024-03-29  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

This patchset adds new AES-XTS implementations that accelerate disk and
file encryption on modern x86_64 CPUs.

The largest improvements are seen on CPUs that support the VAES
extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
later.  However, an implementation using plain AESNI + AVX is also
added, providing a boost on older CPUs.

To try to handle the mess that is x86 SIMD, the code for all the new
AES-XTS implementations is generated from an assembly macro.  This
avoids, for example, having to maintain entirely different source code
just for different vector lengths (xmm, ymm, zmm).

To avoid downclocking effects, zmm registers aren't used on certain
Intel CPU models such as Ice Lake.  These CPU models default to an
implementation using ymm registers instead.
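
The general shape of that policy (sketched below with an illustrative
model list; the real list and mechanism are in the last patch) is to
match the CPU against an exclusion list at registration time and, on a
match, give the 512-bit algorithm a very low priority so that the
256-bit one wins by default:

    #include <asm/cpu_device_id.h>
    #include <asm/intel-family.h>

    /* Sketch only; see the last patch for the actual code. */
    static const struct x86_cpu_id zmm_exclusion_list[] = {
            X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, NULL),
            X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, NULL),
            X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, NULL),
            {},
    };

    /* ...then, when registering the algorithms: */
    if (x86_match_cpu(zmm_exclusion_list))
            aes_xts_alg_vaes_avx10_512.base.cra_priority = 1;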

To make testing easier, all four new AES-XTS implementations are
registered separately with the crypto API.  They are prioritized
appropriately so that the best one for the CPU is used by default.
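
Concretely, the glue code in the later patches registers them like this
(higher priority wins when multiple implementations are usable):

    DEFINE_XTS_ALG(aesni_avx,      "xts-aes-aesni-avx",      500);
    DEFINE_XTS_ALG(vaes_avx2,      "xts-aes-vaes-avx2",      600);
    DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700);
    /* xts-aes-vaes-avx10_512 is registered with the highest priority,
       subject to the zmm policy described above. */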

There's no separate kconfig option for the new implementations, as they
are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.

This patchset increases the throughput of AES-256-XTS by the following
amounts on the following CPUs:

                          | 4096-byte messages | 512-byte messages |
    ----------------------+--------------------+-------------------+
    Intel Skylake         |        6%          |       31%         |
    Intel Cascade Lake    |        4%          |       26%         |
    Intel Ice Lake        |       127%         |      120%         |
    Intel Sapphire Rapids |       151%         |      112%         |
    AMD Zen 1             |        61%         |       73%         |
    AMD Zen 2             |        36%         |       59%         |
    AMD Zen 3             |       138%         |       99%         |
    AMD Zen 4             |       155%         |      117%         |

To summarize how the XTS implementations perform in general, here are
benchmarks of all of them on AMD Zen 4, with 4096-byte messages.  (Of
course, in practice only the best one for the CPU actually gets used.)

    xts-aes-aesni                  4247 MB/s
    xts-aes-aesni-avx              5669 MB/s
    xts-aes-vaes-avx2              9588 MB/s
    xts-aes-vaes-avx10_256         9631 MB/s
    xts-aes-vaes-avx10_512         10868 MB/s

... and on Intel Sapphire Rapids:

    xts-aes-aesni                  4848 MB/s
    xts-aes-aesni-avx              5287 MB/s
    xts-aes-vaes-avx2              11685 MB/s
    xts-aes-vaes-avx10_256         11938 MB/s
    xts-aes-vaes-avx10_512         12176 MB/s

Notes about benchmarking methods:

- All my benchmarks were done using a custom kernel module that invokes
  the crypto_skcipher API directly (a minimal sketch of this approach is
  shown after these notes).  Benchmarking the crypto API from userspace
  using AF_ALG, e.g. as 'cryptsetup benchmark' does, is a poor way to
  measure fast algorithms due to the syscall overhead of AF_ALG, so I
  don't recommend that method.  I instead measured the crypto
  performance itself, as that's what this patchset focuses on.

- All numbers I give are for decryption.  However, on all the CPUs I
  tested, encryption performs almost identically to decryption.
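
For reference, here is a minimal sketch of the measurement approach (not
the actual module I used; error handling and async completion handling
are omitted):

    #include <crypto/skcipher.h>
    #include <linux/scatterlist.h>
    #include <linux/slab.h>
    #include <linux/ktime.h>
    #include <linux/random.h>
    #include <linux/printk.h>

    /* Time AES-256-XTS decryption of 4096-byte requests by calling the
     * crypto_skcipher API directly, avoiding AF_ALG syscall overhead. */
    static void bench_aes_256_xts(void)
    {
            struct crypto_skcipher *tfm = crypto_alloc_skcipher("xts(aes)", 0, 0);
            struct skcipher_request *req = skcipher_request_alloc(tfm, GFP_KERNEL);
            struct scatterlist sg;
            u8 *buf = kmalloc(4096, GFP_KERNEL);
            u8 key[64], iv[16] = {};
            u64 t0, t1;
            int i;

            get_random_bytes(key, sizeof(key));        /* 2x32-byte keys */
            crypto_skcipher_setkey(tfm, key, sizeof(key));
            sg_init_one(&sg, buf, 4096);
            skcipher_request_set_crypt(req, &sg, &sg, 4096, iv);

            t0 = ktime_get_ns();
            for (i = 0; i < 100000; i++)
                    crypto_skcipher_decrypt(req);
            t1 = ktime_get_ns();
            pr_info("~%llu MB/s\n", (4096ULL * 100000 * 1000) / (t1 - t0));
    }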

Open questions:

- Is the policy that I implemented for preferring ymm registers to zmm
  registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
  only Skylake has the bad downclocking.  My current proposal is a bit
  more conservative; it also excludes Ice Lake and Tiger Lake.  Those
  CPUs supposedly still have some downclocking, though not as much.

- Should the policy on the use of zmm registers be in a centralized
  place?  It probably doesn't make sense to have random different
  policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).

- Are there any other known issues with using AVX512 in kernel mode?  It
  seems to work, and technically it's not new because Poly1305 and ARIA
  already use AVX512, including the mask registers and zmm registers up
  to 31.  So if there was a major issue, like the new registers not
  being properly saved and restored, it probably would have already been
  found.  But AES-XTS support would introduce a wider use of it.

- Should we perhaps not even bother with AVX512 / AVX10 at all for now,
  given that on current CPUs most of the improvement is achieved by
  going to VAES + AVX2?  I.e. should we skip the last two patches?  I'm
  hoping the improvement will be greater on future CPUs, though.

Changed in v2:
  - Additional optimizations:
      - Interleaved the tweak computation with AES en/decryption.  This
        helps significantly on some CPUs, especially those without VAES.
      - Further optimized for single-page sources and destinations.
      - Used fewer instructions to update tweaks in VPCLMULQDQ case.
      - Improved handling of "round 0".
      - Eliminated a jump instruction from the main loop.
  - Other
      - Fixed zmm_exclusion_list[] to be null-terminated.
      - Added missing #ifdef to unregister_xts_algs().
      - Added some more comments.
      - Improved cover letter and some commit messages.
      - Now that the next tweak is always computed anyway, return it
        unconditionally.
      - Moved the IV encryption to a separate function.

Eric Biggers (6):
  x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
  crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
  crypto: x86/aes-xts - wire up AESNI + AVX implementation
  crypto: x86/aes-xts - wire up VAES + AVX2 implementation
  crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
  crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation

 arch/x86/Kconfig.assembler           |  10 +
 arch/x86/crypto/Makefile             |   3 +-
 arch/x86/crypto/aes-xts-avx-x86_64.S | 838 +++++++++++++++++++++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 270 ++++++++-
 4 files changed, 1118 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S


base-commit: 4cece764965020c22cff7665b18a012006359095
-- 
2.44.0



* [PATCH v2 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
@ 2024-03-29  8:03 ` Eric Biggers
From: Eric Biggers @ 2024-03-29  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae,
	Ingo Molnar

From: Eric Biggers <ebiggers@google.com>

Add config symbols AS_VAES and AS_VPCLMULQDQ that expose whether the
assembler supports the vector AES and carryless multiplication
cryptographic extensions.
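
Later patches use these symbols to gate the VAES code paths at build
time; for example, the glue code in a later patch wraps its registration
in:

    #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
    DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
    #endif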

Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/Kconfig.assembler | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/Kconfig.assembler b/arch/x86/Kconfig.assembler
index 8ad41da301e5..59aedf32c4ea 100644
--- a/arch/x86/Kconfig.assembler
+++ b/arch/x86/Kconfig.assembler
@@ -23,9 +23,19 @@ config AS_TPAUSE
 config AS_GFNI
 	def_bool $(as-instr,vgf2p8mulb %xmm0$(comma)%xmm1$(comma)%xmm2)
 	help
 	  Supported by binutils >= 2.30 and LLVM integrated assembler
 
+config AS_VAES
+	def_bool $(as-instr,vaesenc %ymm0$(comma)%ymm1$(comma)%ymm2)
+	help
+	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
+config AS_VPCLMULQDQ
+	def_bool $(as-instr,vpclmulqdq \$0x10$(comma)%ymm0$(comma)%ymm1$(comma)%ymm2)
+	help
+	  Supported by binutils >= 2.30 and LLVM integrated assembler
+
 config AS_WRUSS
 	def_bool $(as-instr,wrussq %rax$(comma)(%rbx))
 	help
 	  Supported by binutils >= 2.31 and LLVM integrated assembler
-- 
2.44.0



* [PATCH v2 2/6] crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
@ 2024-03-29  8:03 ` Eric Biggers
From: Eric Biggers @ 2024-03-29  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an assembly file aes-xts-avx-x86_64.S which contains a macro that
expands into AES-XTS implementations for x86_64 CPUs that support at
least AES-NI and AVX, optionally also taking advantage of VAES,
VPCLMULQDQ, and AVX512 or AVX10.

This patch doesn't expand the macro at all.  Later patches will do so,
adding each implementation individually so that the motivation and use
case for each individual implementation can be fully presented.

The file also provides a function aes_xts_encrypt_iv() which handles the
encryption of the IV (tweak), using AES-NI and AVX.
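
As background for reviewing the tweak handling in the assembly: after
the IV is encrypted, each subsequent tweak is derived by multiplying the
previous one by x in GF(2^128) with the reducing polynomial
x^128 + x^7 + x^2 + x + 1.  A plain C reference for the single-tweak
update that the assembly's _next_tweak macro performs (illustration
only, not part of the patch):

    #include <linux/types.h>
    #include <asm/unaligned.h>

    /* Multiply a 128-bit XTS tweak (little-endian) by x in GF(2^128). */
    static void xts_next_tweak(u8 t[16])
    {
            u64 lo = get_unaligned_le64(t);
            u64 hi = get_unaligned_le64(t + 8);
            u64 carry = hi >> 63;              /* bit 127 shifted out */

            hi = (hi << 1) | (lo >> 63);
            lo = (lo << 1) ^ (carry * 0x87);   /* x^7 + x^2 + x + 1 */

            put_unaligned_le64(lo, t);
            put_unaligned_le64(hi, t + 8);
    }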

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/Makefile             |   3 +-
 arch/x86/crypto/aes-xts-avx-x86_64.S | 800 +++++++++++++++++++++++++++
 2 files changed, 802 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/crypto/aes-xts-avx-x86_64.S

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index 9aa46093c91b..9c5ce5613738 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -46,11 +46,12 @@ obj-$(CONFIG_CRYPTO_CHACHA20_X86_64) += chacha-x86_64.o
 chacha-x86_64-y := chacha-avx2-x86_64.o chacha-ssse3-x86_64.o chacha_glue.o
 chacha-x86_64-$(CONFIG_AS_AVX512) += chacha-avx512vl-x86_64.o
 
 obj-$(CONFIG_CRYPTO_AES_NI_INTEL) += aesni-intel.o
 aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o
-aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
+aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o \
+	aes_ctrby8_avx-x86_64.o aes-xts-avx-x86_64.o
 
 obj-$(CONFIG_CRYPTO_SHA1_SSSE3) += sha1-ssse3.o
 sha1-ssse3-y := sha1_avx2_x86_64_asm.o sha1_ssse3_asm.o sha1_ssse3_glue.o
 sha1-ssse3-$(CONFIG_AS_SHA1_NI) += sha1_ni_asm.o
 
diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
new file mode 100644
index 000000000000..a5e2783c46ec
--- /dev/null
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -0,0 +1,800 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * AES-XTS for modern x86_64 CPUs
+ *
+ * Copyright 2024 Google LLC
+ *
+ * Author: Eric Biggers <ebiggers@google.com>
+ */
+
+/*
+ * This file implements AES-XTS for modern x86_64 CPUs.  To handle the
+ * complexities of coding for x86 SIMD, e.g. where every vector length needs
+ * different code, it uses a macro to generate several implementations that
+ * share similar source code but are targeted at different CPUs, listed below:
+ *
+ * AES-NI + AVX
+ *    - 128-bit vectors (1 AES block per vector)
+ *    - VEX-coded instructions
+ *    - xmm0-xmm15
+ *    - This is for older CPUs that lack VAES but do have AVX.
+ *
+ * VAES + VPCLMULQDQ + AVX2
+ *    - 256-bit vectors (2 AES blocks per vector)
+ *    - VEX-coded instructions
+ *    - ymm0-ymm15
+ *    - This is for CPUs that have VAES but lack AVX512 or AVX10,
+ *      e.g. Intel's Alder Lake and AMD's Zen 3.
+ *
+ * VAES + VPCLMULQDQ + AVX10/256 + BMI2
+ *    - 256-bit vectors (2 AES blocks per vector)
+ *    - EVEX-coded instructions
+ *    - ymm0-ymm31
+ *    - This is for CPUs that have AVX512 but where using zmm registers causes
+ *      downclocking, and for CPUs that have AVX10/256 but not AVX10/512.
+ *    - By "AVX10/256" we really mean (AVX512BW + AVX512VL) || AVX10/256.
+ *      To avoid confusion with 512-bit, we just write AVX10/256.
+ *
+ * VAES + VPCLMULQDQ + AVX10/512 + BMI2
+ *    - Same as the previous one, but upgrades to 512-bit vectors
+ *      (4 AES blocks per vector) in zmm0-zmm31.
+ *    - This is for CPUs that have good AVX512 or AVX10/512 support.
+ *
+ * This file doesn't have an implementation for AES-NI alone (without AVX), as
+ * the lack of VEX would make all the assembly code different.
+ *
+ * When we use VAES, we also use VPCLMULQDQ to parallelize the computation of
+ * the XTS tweaks.  This avoids a bottleneck.  Currently there don't seem to be
+ * any CPUs that support VAES but not VPCLMULQDQ.  If that changes, we might
+ * need to start also providing an implementation using VAES alone.
+ *
+ * The AES-XTS implementations in this file support everything required by the
+ * crypto API, including support for arbitrary input lengths and multi-part
+ * processing.  However, they are most heavily optimized for the common case of
+ * power-of-2 length inputs that are processed in a single part (disk sectors).
+ */
+
+#include <linux/linkage.h>
+#include <linux/cfi_types.h>
+
+.section .rodata
+.p2align 4
+.Lgf_poly:
+	// The low 64 bits of this value represent the polynomial x^7 + x^2 + x
+	// + 1.  It is the value that must be XOR'd into the low 64 bits of the
+	// tweak each time a 1 is carried out of the high 64 bits.
+	//
+	// The high 64 bits of this value is just the internal carry bit that
+	// exists when there's a carry out of the low 64 bits of the tweak.
+	.quad	0x87, 1
+
+	// This table contains constants for vpshufb and vpblendvb, used to
+	// handle variable byte shifts and blending during ciphertext stealing
+	// on CPUs that don't support AVX10-style masking.
+.Lcts_permute_table:
+	.byte	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
+	.byte	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
+	.byte	0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
+	.byte	0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
+	.byte	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
+	.byte	0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80
+.text
+
+// Function parameters
+.set	KEY,		%rdi	// Initially points to crypto_aes_ctx, then is
+				// advanced to point directly to the round keys
+.set	SRC,		%rsi	// Pointer to next source data
+.set	DST,		%rdx	// Pointer to next destination data
+.set	LEN,		%rcx	// Remaining length in bytes
+.set	TWEAK,		%r8	// Pointer to next tweak
+
+// %r9d holds the AES key length in bytes.
+.set	KEYLEN,		%r9d
+
+// %rax and %r10-r11 are available as temporaries.
+
+.macro	_define_Vi	i
+.if VL == 16
+	.set	V\i,		%xmm\i
+.elseif VL == 32
+	.set	V\i,		%ymm\i
+.elseif VL == 64
+	.set	V\i,		%zmm\i
+.else
+	.error "Unsupported Vector Length (VL)"
+.endif
+.endm
+
+.macro _define_aliases
+	// Define register aliases V0-V15, or V0-V31 if all 32 SIMD registers
+	// are available, that map to the xmm, ymm, or zmm registers according
+	// to the selected Vector Length (VL).
+	_define_Vi	0
+	_define_Vi	1
+	_define_Vi	2
+	_define_Vi	3
+	_define_Vi	4
+	_define_Vi	5
+	_define_Vi	6
+	_define_Vi	7
+	_define_Vi	8
+	_define_Vi	9
+	_define_Vi	10
+	_define_Vi	11
+	_define_Vi	12
+	_define_Vi	13
+	_define_Vi	14
+	_define_Vi	15
+.if USE_AVX10
+	_define_Vi	16
+	_define_Vi	17
+	_define_Vi	18
+	_define_Vi	19
+	_define_Vi	20
+	_define_Vi	21
+	_define_Vi	22
+	_define_Vi	23
+	_define_Vi	24
+	_define_Vi	25
+	_define_Vi	26
+	_define_Vi	27
+	_define_Vi	28
+	_define_Vi	29
+	_define_Vi	30
+	_define_Vi	31
+.endif
+
+	// V0-V3 hold the data blocks during the main loop, or temporary values
+	// otherwise.  V4-V5 hold temporary values.
+
+	// V6-V9 hold XTS tweaks.  Each 128-bit lane holds one tweak.
+	.set	TWEAK0_XMM,	%xmm6
+	.set	TWEAK0,		V6
+	.set	TWEAK1_XMM,	%xmm7
+	.set	TWEAK1,		V7
+	.set	TWEAK2,		V8
+	.set	TWEAK3,		V9
+
+	// V10-V13 are used for computing the next values of TWEAK[0-3].
+	.set	NEXT_TWEAK0,	V10
+	.set	NEXT_TWEAK1,	V11
+	.set	NEXT_TWEAK2,	V12
+	.set	NEXT_TWEAK3,	V13
+
+	// V14 holds the constant from .Lgf_poly, copied to all 128-bit lanes.
+	.set	GF_POLY_XMM,	%xmm14
+	.set	GF_POLY,	V14
+
+	// V15 holds the first AES round key, copied to all 128-bit lanes.
+	.set	KEY0_XMM,	%xmm15
+	.set	KEY0,		V15
+
+	// If 32 SIMD registers are available, then V16-V29 hold the remaining
+	// AES round keys, copied to all 128-bit lanes.
+.if USE_AVX10
+	.set	KEY1_XMM,	%xmm16
+	.set	KEY1,		V16
+	.set	KEY2_XMM,	%xmm17
+	.set	KEY2,		V17
+	.set	KEY3_XMM,	%xmm18
+	.set	KEY3,		V18
+	.set	KEY4_XMM,	%xmm19
+	.set	KEY4,		V19
+	.set	KEY5_XMM,	%xmm20
+	.set	KEY5,		V20
+	.set	KEY6_XMM,	%xmm21
+	.set	KEY6,		V21
+	.set	KEY7_XMM,	%xmm22
+	.set	KEY7,		V22
+	.set	KEY8_XMM,	%xmm23
+	.set	KEY8,		V23
+	.set	KEY9_XMM,	%xmm24
+	.set	KEY9,		V24
+	.set	KEY10_XMM,	%xmm25
+	.set	KEY10,		V25
+	.set	KEY11_XMM,	%xmm26
+	.set	KEY11,		V26
+	.set	KEY12_XMM,	%xmm27
+	.set	KEY12,		V27
+	.set	KEY13_XMM,	%xmm28
+	.set	KEY13,		V28
+	.set	KEY14_XMM,	%xmm29
+	.set	KEY14,		V29
+.endif
+	// V30-V31 are currently unused.
+.endm
+
+// Move a vector between memory and a register.
+.macro	_vmovdqu	src, dst
+.if VL < 64
+	vmovdqu		\src, \dst
+.else
+	vmovdqu8	\src, \dst
+.endif
+.endm
+
+// Broadcast a 128-bit value into a vector.
+.macro	_vbroadcast128	src, dst
+.if VL == 16 && !USE_AVX10
+	vmovdqu		\src, \dst
+.elseif VL == 32 && !USE_AVX10
+	vbroadcasti128	\src, \dst
+.else
+	vbroadcasti32x4	\src, \dst
+.endif
+.endm
+
+// XOR two vectors together.
+.macro	_vpxor	src1, src2, dst
+.if USE_AVX10
+	vpxord		\src1, \src2, \dst
+.else
+	vpxor		\src1, \src2, \dst
+.endif
+.endm
+
+// XOR three vectors together.
+.macro	_xor3	src1, src2, src3_and_dst
+.if USE_AVX10
+	// vpternlogd with immediate 0x96 is a three-argument XOR.
+	vpternlogd	$0x96, \src1, \src2, \src3_and_dst
+.else
+	vpxor		\src1, \src3_and_dst, \src3_and_dst
+	vpxor		\src2, \src3_and_dst, \src3_and_dst
+.endif
+.endm
+
+// Given a 128-bit XTS tweak in the xmm register \src, compute the next tweak
+// (by multiplying by the polynomial 'x') and write it to \dst.
+.macro	_next_tweak	src, tmp, dst
+	vpshufd		$0x13, \src, \tmp
+	vpaddq		\src, \src, \dst
+	vpsrad		$31, \tmp, \tmp
+	vpand		GF_POLY_XMM, \tmp, \tmp
+	vpxor		\tmp, \dst, \dst
+.endm
+
+// Given the XTS tweak(s) in the vector \src, compute the next vector of
+// tweak(s) (by multiplying by the polynomial 'x^(VL/16)') and write it to \dst.
+//
+// If VL > 16, then there are multiple tweaks, and we use vpclmulqdq to compute
+// all tweaks in the vector in parallel.  If VL=16, we just do the regular
+// computation without vpclmulqdq, as it's the faster method for a single tweak.
+.macro	_next_tweakvec	src, tmp1, tmp2, dst
+.if VL == 16
+	_next_tweak	\src, \tmp1, \dst
+.else
+	vpsrlq		$64 - VL/16, \src, \tmp1
+	vpclmulqdq	$0x01, GF_POLY, \tmp1, \tmp2
+	vpslldq		$8, \tmp1, \tmp1
+	vpsllq		$VL/16, \src, \dst
+	_xor3		\tmp1, \tmp2, \dst
+.endif
+.endm
+
+// Given the first XTS tweak at (TWEAK), compute the first set of tweaks and
+// store them in the vector registers TWEAK0-TWEAK3.  Clobbers V0-V5.
+.macro	_compute_first_set_of_tweaks
+	vmovdqu		(TWEAK), TWEAK0_XMM
+	_vbroadcast128	.Lgf_poly(%rip), GF_POLY
+.if VL == 16
+	// With VL=16, multiplying by x serially is fastest.
+	_next_tweak	TWEAK0, %xmm0, TWEAK1
+	_next_tweak	TWEAK1, %xmm0, TWEAK2
+	_next_tweak	TWEAK2, %xmm0, TWEAK3
+.else
+.if VL == 32
+	// Compute the second block of TWEAK0.
+	_next_tweak	TWEAK0_XMM, %xmm0, %xmm1
+	vinserti128	$1, %xmm1, TWEAK0, TWEAK0
+.elseif VL == 64
+	// Compute the remaining blocks of TWEAK0.
+	_next_tweak	TWEAK0_XMM, %xmm0, %xmm1
+	_next_tweak	%xmm1, %xmm0, %xmm2
+	_next_tweak	%xmm2, %xmm0, %xmm3
+	vinserti32x4	$1, %xmm1, TWEAK0, TWEAK0
+	vinserti32x4	$2, %xmm2, TWEAK0, TWEAK0
+	vinserti32x4	$3, %xmm3, TWEAK0, TWEAK0
+.endif
+	// Compute TWEAK[1-3] from TWEAK0.
+	vpsrlq		$64 - 1*VL/16, TWEAK0, V0
+	vpsrlq		$64 - 2*VL/16, TWEAK0, V2
+	vpsrlq		$64 - 3*VL/16, TWEAK0, V4
+	vpclmulqdq	$0x01, GF_POLY, V0, V1
+	vpclmulqdq	$0x01, GF_POLY, V2, V3
+	vpclmulqdq	$0x01, GF_POLY, V4, V5
+	vpslldq		$8, V0, V0
+	vpslldq		$8, V2, V2
+	vpslldq		$8, V4, V4
+	vpsllq		$1*VL/16, TWEAK0, TWEAK1
+	vpsllq		$2*VL/16, TWEAK0, TWEAK2
+	vpsllq		$3*VL/16, TWEAK0, TWEAK3
+.if USE_AVX10
+	vpternlogd	$0x96, V0, V1, TWEAK1
+	vpternlogd	$0x96, V2, V3, TWEAK2
+	vpternlogd	$0x96, V4, V5, TWEAK3
+.else
+	vpxor		V0, TWEAK1, TWEAK1
+	vpxor		V2, TWEAK2, TWEAK2
+	vpxor		V4, TWEAK3, TWEAK3
+	vpxor		V1, TWEAK1, TWEAK1
+	vpxor		V3, TWEAK2, TWEAK2
+	vpxor		V5, TWEAK3, TWEAK3
+.endif
+.endif
+.endm
+
+// Do one step in computing the next set of tweaks using the method of just
+// multiplying by x repeatedly (the same method _next_tweak uses).
+.macro	_tweak_step_mulx	i
+.if \i == 0
+	.set PREV_TWEAK, TWEAK3
+	.set NEXT_TWEAK, NEXT_TWEAK0
+.elseif \i == 5
+	.set PREV_TWEAK, NEXT_TWEAK0
+	.set NEXT_TWEAK, NEXT_TWEAK1
+.elseif \i == 10
+	.set PREV_TWEAK, NEXT_TWEAK1
+	.set NEXT_TWEAK, NEXT_TWEAK2
+.elseif \i == 15
+	.set PREV_TWEAK, NEXT_TWEAK2
+	.set NEXT_TWEAK, NEXT_TWEAK3
+.endif
+.if \i < 20 && \i % 5 == 0
+	vpshufd		$0x13, PREV_TWEAK, V5
+.elseif \i < 20 && \i % 5 == 1
+	vpaddq		PREV_TWEAK, PREV_TWEAK, NEXT_TWEAK
+.elseif \i < 20 && \i % 5 == 2
+	vpsrad		$31, V5, V5
+.elseif \i < 20 && \i % 5 == 3
+	vpand		GF_POLY, V5, V5
+.elseif \i < 20 && \i % 5 == 4
+	vpxor		V5, NEXT_TWEAK, NEXT_TWEAK
+.elseif \i == 1000
+	vmovdqa		NEXT_TWEAK0, TWEAK0
+	vmovdqa		NEXT_TWEAK1, TWEAK1
+	vmovdqa		NEXT_TWEAK2, TWEAK2
+	vmovdqa		NEXT_TWEAK3, TWEAK3
+.endif
+.endm
+
+// Do one step in computing the next set of tweaks using the VPCLMULQDQ method
+// (the same method _next_tweakvec uses for VL > 16).  This means multiplying
+// each tweak by x^(4*VL/16) independently.  Since 4*VL/16 is a multiple of 8
+// when VL > 16 (which it is here), the needed shift amounts are byte-aligned,
+// which allows the use of vpsrldq and vpslldq to do 128-bit wide shifts.
+.macro	_tweak_step_pclmul	i
+.if \i == 2
+	vpsrldq		$(128 - 4*VL/16) / 8, TWEAK0, NEXT_TWEAK0
+.elseif \i == 4
+	vpsrldq		$(128 - 4*VL/16) / 8, TWEAK1, NEXT_TWEAK1
+.elseif \i == 6
+	vpsrldq		$(128 - 4*VL/16) / 8, TWEAK2, NEXT_TWEAK2
+.elseif \i == 8
+	vpsrldq		$(128 - 4*VL/16) / 8, TWEAK3, NEXT_TWEAK3
+.elseif \i == 10
+	vpclmulqdq	$0x00, GF_POLY, NEXT_TWEAK0, NEXT_TWEAK0
+.elseif \i == 12
+	vpclmulqdq	$0x00, GF_POLY, NEXT_TWEAK1, NEXT_TWEAK1
+.elseif \i == 14
+	vpclmulqdq	$0x00, GF_POLY, NEXT_TWEAK2, NEXT_TWEAK2
+.elseif \i == 16
+	vpclmulqdq	$0x00, GF_POLY, NEXT_TWEAK3, NEXT_TWEAK3
+.elseif \i == 1000
+	vpslldq		$(4*VL/16) / 8, TWEAK0, TWEAK0
+	vpslldq		$(4*VL/16) / 8, TWEAK1, TWEAK1
+	vpslldq		$(4*VL/16) / 8, TWEAK2, TWEAK2
+	vpslldq		$(4*VL/16) / 8, TWEAK3, TWEAK3
+	_vpxor		NEXT_TWEAK0, TWEAK0, TWEAK0
+	_vpxor		NEXT_TWEAK1, TWEAK1, TWEAK1
+	_vpxor		NEXT_TWEAK2, TWEAK2, TWEAK2
+	_vpxor		NEXT_TWEAK3, TWEAK3, TWEAK3
+.endif
+.endm
+
+// _tweak_step does one step of the computation of the next set of tweaks from
+// TWEAK[0-3].  To complete all steps, this must be invoked with \i values 0
+// through at least 19, then 1000 which signals the last step.
+//
+// This is used to interleave the computation of the next set of tweaks with the
+// AES en/decryptions, which increases performance in some cases.
+.macro	_tweak_step	i
+.if VL == 16
+	_tweak_step_mulx	\i
+.else
+	_tweak_step_pclmul	\i
+.endif
+.endm
+
+// Load the round keys: just the first one if !USE_AVX10, otherwise all of them.
+.macro	_load_round_keys
+	_vbroadcast128	0*16(KEY), KEY0
+.if USE_AVX10
+	_vbroadcast128	1*16(KEY), KEY1
+	_vbroadcast128	2*16(KEY), KEY2
+	_vbroadcast128	3*16(KEY), KEY3
+	_vbroadcast128	4*16(KEY), KEY4
+	_vbroadcast128	5*16(KEY), KEY5
+	_vbroadcast128	6*16(KEY), KEY6
+	_vbroadcast128	7*16(KEY), KEY7
+	_vbroadcast128	8*16(KEY), KEY8
+	_vbroadcast128	9*16(KEY), KEY9
+	_vbroadcast128	10*16(KEY), KEY10
+	// Note: if it's AES-128 or AES-192, the last several round keys won't
+	// be used.  We do the loads anyway to save a conditional jump.
+	_vbroadcast128	11*16(KEY), KEY11
+	_vbroadcast128	12*16(KEY), KEY12
+	_vbroadcast128	13*16(KEY), KEY13
+	_vbroadcast128	14*16(KEY), KEY14
+.endif
+.endm
+
+// Do a single round of AES encryption (if \enc==1) or decryption (if \enc==0)
+// on the block(s) in \data using the round key(s) in \key.  The register length
+// determines the number of AES blocks en/decrypted.
+.macro	_vaes	enc, last, key, data
+.if \enc
+.if \last
+	vaesenclast	\key, \data, \data
+.else
+	vaesenc		\key, \data, \data
+.endif
+.else
+.if \last
+	vaesdeclast	\key, \data, \data
+.else
+	vaesdec		\key, \data, \data
+.endif
+.endif
+.endm
+
+// Do a single round of AES en/decryption on the block(s) in \data, using the
+// same key for all block(s).  The round key is loaded from the appropriate
+// register or memory location for round \i.  May clobber V4.
+.macro _vaes_1x		enc, last, i, xmm_suffix, data
+.if USE_AVX10
+	_vaes		\enc, \last, KEY\i\xmm_suffix, \data
+.else
+.ifnb \xmm_suffix
+	_vaes		\enc, \last, \i*16(KEY), \data
+.else
+	_vbroadcast128	\i*16(KEY), V4
+	_vaes		\enc, \last, V4, \data
+.endif
+.endif
+.endm
+
+// Do a single round of AES en/decryption on the blocks in registers V0-V3,
+// using the same key for all blocks.  The round key is loaded from the
+// appropriate register or memory location for round \i.  In addition, does step
+// \i of the computation of the next set of tweaks.  May clobber V4.
+.macro	_vaes_4x	enc, last, i
+.if USE_AVX10
+	_tweak_step	(2*(\i-1))
+	_vaes		\enc, \last, KEY\i, V0
+	_vaes		\enc, \last, KEY\i, V1
+	_tweak_step	(2*(\i-1) + 1)
+	_vaes		\enc, \last, KEY\i, V2
+	_vaes		\enc, \last, KEY\i, V3
+.else
+	_vbroadcast128	\i*16(KEY), V4
+	_tweak_step	(2*(\i-1))
+	_vaes		\enc, \last, V4, V0
+	_vaes		\enc, \last, V4, V1
+	_tweak_step	(2*(\i-1) + 1)
+	_vaes		\enc, \last, V4, V2
+	_vaes		\enc, \last, V4, V3
+.endif
+.endm
+
+// Do tweaked AES en/decryption (i.e., XOR with \tweak, then AES en/decrypt,
+// then XOR with \tweak again) of the block(s) in \data.  To process a single
+// block, use xmm registers and set \xmm_suffix=_XMM.  To process a vector of
+// length VL, use V* registers and leave \xmm_suffix empty.  May clobber V4.
+.macro	_aes_crypt	enc, xmm_suffix, tweak, data
+	_xor3		KEY0\xmm_suffix, \tweak, \data
+	_vaes_1x	\enc, 0, 1, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 2, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 3, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 4, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 5, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 6, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 7, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 8, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 9, \xmm_suffix, \data
+	cmp		$24, KEYLEN
+	jle		.Laes_128_or_192\@
+	_vaes_1x	\enc, 0, 10, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 11, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 12, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 13, \xmm_suffix, \data
+	_vaes_1x	\enc, 1, 14, \xmm_suffix, \data
+	jmp		.Laes_done\@
+.Laes_128_or_192\@:
+	je		.Laes_192\@
+	_vaes_1x	\enc, 1, 10, \xmm_suffix, \data
+	jmp		.Laes_done\@
+.Laes_192\@:
+	_vaes_1x	\enc, 0, 10, \xmm_suffix, \data
+	_vaes_1x	\enc, 0, 11, \xmm_suffix, \data
+	_vaes_1x	\enc, 1, 12, \xmm_suffix, \data
+.Laes_done\@:
+	_vpxor		\tweak, \data, \data
+.endm
+
+.macro	_aes_xts_crypt	enc
+	_define_aliases
+
+	// Load the AES key length: 16 (AES-128), 24 (AES-192), or 32 (AES-256).
+	movl		480(KEY), KEYLEN
+
+	// If decrypting, advance KEY to the decryption round keys.
+.if !\enc
+	add		$240, KEY
+.endif
+
+	// Check whether the data length is a multiple of the AES block length.
+	test		$15, LEN
+	jnz		.Lneed_cts\@
+.Lxts_init\@:
+
+	// Cache as many round keys as possible.
+	_load_round_keys
+
+	// Compute the first set of tweaks TWEAK[0-3].
+	_compute_first_set_of_tweaks
+
+	sub		$4*VL, LEN
+	jl		.Lhandle_remainder\@
+
+.Lmain_loop\@:
+	// This is the main loop, en/decrypting 4*VL bytes per iteration.
+
+	// XOR each source block with its tweak and the first round key.
+.if USE_AVX10
+	vmovdqu8	0*VL(SRC), V0
+	vmovdqu8	1*VL(SRC), V1
+	vmovdqu8	2*VL(SRC), V2
+	vmovdqu8	3*VL(SRC), V3
+	vpternlogd	$0x96, TWEAK0, KEY0, V0
+	vpternlogd	$0x96, TWEAK1, KEY0, V1
+	vpternlogd	$0x96, TWEAK2, KEY0, V2
+	vpternlogd	$0x96, TWEAK3, KEY0, V3
+.else
+	vpxor		0*VL(SRC), KEY0, V0
+	vpxor		1*VL(SRC), KEY0, V1
+	vpxor		2*VL(SRC), KEY0, V2
+	vpxor		3*VL(SRC), KEY0, V3
+	vpxor		TWEAK0, V0, V0
+	vpxor		TWEAK1, V1, V1
+	vpxor		TWEAK2, V2, V2
+	vpxor		TWEAK3, V3, V3
+.endif
+	// Do all the AES rounds on the data blocks, interleaved with
+	// the computation of the next set of tweaks.
+	_vaes_4x	\enc, 0, 1
+	_vaes_4x	\enc, 0, 2
+	_vaes_4x	\enc, 0, 3
+	_vaes_4x	\enc, 0, 4
+	_vaes_4x	\enc, 0, 5
+	_vaes_4x	\enc, 0, 6
+	_vaes_4x	\enc, 0, 7
+	_vaes_4x	\enc, 0, 8
+	_vaes_4x	\enc, 0, 9
+	// Try to optimize for AES-256 by keeping the code for AES-128 and
+	// AES-192 out-of-line.
+	cmp		$24, KEYLEN
+	jle		.Lencrypt_4x_aes_128_or_192\@
+	_vaes_4x	\enc, 0, 10
+	_vaes_4x	\enc, 0, 11
+	_vaes_4x	\enc, 0, 12
+	_vaes_4x	\enc, 0, 13
+	_vaes_4x	\enc, 1, 14
+.Lencrypt_4x_done\@:
+
+	// XOR in the tweaks again.
+	_vpxor		TWEAK0, V0, V0
+	_vpxor		TWEAK1, V1, V1
+	_vpxor		TWEAK2, V2, V2
+	_vpxor		TWEAK3, V3, V3
+
+	// Store the destination blocks.
+	_vmovdqu	V0, 0*VL(DST)
+	_vmovdqu	V1, 1*VL(DST)
+	_vmovdqu	V2, 2*VL(DST)
+	_vmovdqu	V3, 3*VL(DST)
+
+	// Finish computing the next set of tweaks.
+	_tweak_step	1000
+
+	add		$4*VL, SRC
+	add		$4*VL, DST
+	sub		$4*VL, LEN
+	jge		.Lmain_loop\@
+
+	// Check for the uncommon case where the data length isn't a multiple of
+	// 4*VL.  Handle it out-of-line in order to optimize for the common
+	// case.  In the common case, just fall through to the ret.
+	test		$4*VL-1, LEN
+	jnz		.Lhandle_remainder\@
+.Ldone\@:
+	// Store the next tweak back to *TWEAK to support continuation calls.
+	vmovdqu		TWEAK0_XMM, (TWEAK)
+.if VL > 16
+	vzeroupper
+.endif
+	RET
+
+.Lhandle_remainder\@:
+	add		$4*VL, LEN	// Undo the extra sub from earlier.
+
+	// En/decrypt any remaining full blocks, one vector at a time.
+.if VL > 16
+	sub		$VL, LEN
+	jl		.Lvec_at_a_time_done\@
+.Lvec_at_a_time\@:
+	_vmovdqu	(SRC), V0
+	_aes_crypt	\enc, , TWEAK0, V0
+	_vmovdqu	V0, (DST)
+	_next_tweakvec	TWEAK0, V0, V1, TWEAK0
+	add		$VL, SRC
+	add		$VL, DST
+	sub		$VL, LEN
+	jge		.Lvec_at_a_time\@
+.Lvec_at_a_time_done\@:
+	add		$VL-16, LEN	// Undo the extra sub from earlier.
+.else
+	sub		$16, LEN
+.endif
+
+	// En/decrypt any remaining full blocks, one at a time.
+	jl		.Lblock_at_a_time_done\@
+.Lblock_at_a_time\@:
+	vmovdqu		(SRC), %xmm0
+	_aes_crypt	\enc, _XMM, TWEAK0_XMM, %xmm0
+	vmovdqu		%xmm0, (DST)
+	_next_tweak	TWEAK0_XMM, %xmm0, TWEAK0_XMM
+	add		$16, SRC
+	add		$16, DST
+	sub		$16, LEN
+	jge		.Lblock_at_a_time\@
+.Lblock_at_a_time_done\@:
+	add		$16, LEN	// Undo the extra sub from earlier.
+
+.Lfull_blocks_done\@:
+	// Now 0 <= LEN <= 15.  If LEN is nonzero, do ciphertext stealing to
+	// process the last 16 + LEN bytes.  If LEN is zero, we're done.
+	test		LEN, LEN
+	jnz		.Lcts\@
+	jmp		.Ldone\@
+
+	// Out-of-line handling of AES-128 and AES-192
+.Lencrypt_4x_aes_128_or_192\@:
+	jz		.Lencrypt_4x_aes_192\@
+	_vaes_4x	\enc, 1, 10
+	jmp		.Lencrypt_4x_done\@
+.Lencrypt_4x_aes_192\@:
+	_vaes_4x	\enc, 0, 10
+	_vaes_4x	\enc, 0, 11
+	_vaes_4x	\enc, 1, 12
+	jmp		.Lencrypt_4x_done\@
+
+.Lneed_cts\@:
+	// The data length isn't a multiple of the AES block length, so
+	// ciphertext stealing (CTS) will be needed.  Subtract one block from
+	// LEN so that the main loop doesn't process the last full block.  The
+	// CTS step will process it specially along with the partial block.
+	sub		$16, LEN
+	jmp		.Lxts_init\@
+
+.Lcts\@:
+	// Do ciphertext stealing (CTS) to en/decrypt the last full block and
+	// the partial block.  CTS needs two tweaks.  TWEAK0_XMM contains the
+	// next tweak; compute the one after that.  Decryption uses these two
+	// tweaks in reverse order, so also define aliases to handle that.
+	_next_tweak	TWEAK0_XMM, %xmm0, TWEAK1_XMM
+.if \enc
+	.set		CTS_TWEAK0,	TWEAK0_XMM
+	.set		CTS_TWEAK1,	TWEAK1_XMM
+.else
+	.set		CTS_TWEAK0,	TWEAK1_XMM
+	.set		CTS_TWEAK1,	TWEAK0_XMM
+.endif
+
+	// En/decrypt the last full block.
+	vmovdqu		(SRC), %xmm0
+	_aes_crypt	\enc, _XMM, CTS_TWEAK0, %xmm0
+
+.if USE_AVX10
+	// Create a mask that has the first LEN bits set.
+	mov		$-1, %rax
+	bzhi		LEN, %rax, %rax
+	kmovq		%rax, %k1
+
+	// Swap the first LEN bytes of the above result with the partial block.
+	// Note that to support in-place en/decryption, the load from the src
+	// partial block must happen before the store to the dst partial block.
+	vmovdqa		%xmm0, %xmm1
+	vmovdqu8	16(SRC), %xmm0{%k1}
+	vmovdqu8	%xmm1, 16(DST){%k1}
+.else
+	lea		.Lcts_permute_table(%rip), %rax
+
+	// Load the src partial block, left-aligned.  Note that to support
+	// in-place en/decryption, this must happen before the store to the dst
+	// partial block.
+	vmovdqu		(SRC, LEN, 1), %xmm1
+
+	// Shift the first LEN bytes of the en/decryption of the last full block
+	// to the end of a register, then store it to DST+LEN.  This stores the
+	// dst partial block.  It also writes to the second part of the dst last
+	// full block, but that part is overwritten later.
+	vpshufb		(%rax, LEN, 1), %xmm0, %xmm2
+	vmovdqu		%xmm2, (DST, LEN, 1)
+
+	// Make xmm3 contain [16-LEN,16-LEN+1,...,14,15,0x80,0x80,...].
+	sub		LEN, %rax
+	vmovdqu		32(%rax), %xmm3
+
+	// Shift the src partial block to the beginning of its register.
+	vpshufb		%xmm3, %xmm1, %xmm1
+
+	// Do a blend to generate the src partial block followed by the second
+	// part of the en/decryption of the last full block.
+	vpblendvb	%xmm3, %xmm0, %xmm1, %xmm0
+.endif
+	// En/decrypt again and store the last full block.
+	_aes_crypt	\enc, _XMM, CTS_TWEAK1, %xmm0
+	vmovdqu		%xmm0, (DST)
+	jmp		.Ldone\@
+.endm
+
+// void aes_xts_encrypt_iv(const struct crypto_aes_ctx *tweak_key,
+//			   u8 iv[AES_BLOCK_SIZE]);
+SYM_FUNC_START(aes_xts_encrypt_iv)
+	vmovdqu		(%rsi), %xmm0
+	vpxor		0*16(%rdi), %xmm0, %xmm0
+	vaesenc		1*16(%rdi), %xmm0, %xmm0
+	vaesenc		2*16(%rdi), %xmm0, %xmm0
+	vaesenc		3*16(%rdi), %xmm0, %xmm0
+	vaesenc		4*16(%rdi), %xmm0, %xmm0
+	vaesenc		5*16(%rdi), %xmm0, %xmm0
+	vaesenc		6*16(%rdi), %xmm0, %xmm0
+	vaesenc		7*16(%rdi), %xmm0, %xmm0
+	vaesenc		8*16(%rdi), %xmm0, %xmm0
+	vaesenc		9*16(%rdi), %xmm0, %xmm0
+	cmpl		$24, 480(%rdi)
+	jle		.Lencrypt_iv_aes_128_or_192
+	vaesenc		10*16(%rdi), %xmm0, %xmm0
+	vaesenc		11*16(%rdi), %xmm0, %xmm0
+	vaesenc		12*16(%rdi), %xmm0, %xmm0
+	vaesenc		13*16(%rdi), %xmm0, %xmm0
+	vaesenclast	14*16(%rdi), %xmm0, %xmm0
+.Lencrypt_iv_done:
+	vmovdqu		%xmm0, (%rsi)
+	RET
+
+	// Out-of-line handling of AES-128 and AES-192
+.Lencrypt_iv_aes_128_or_192:
+	jz		.Lencrypt_iv_aes_192
+	vaesenclast	10*16(%rdi), %xmm0, %xmm0
+	jmp		.Lencrypt_iv_done
+.Lencrypt_iv_aes_192:
+	vaesenc		10*16(%rdi), %xmm0, %xmm0
+	vaesenc		11*16(%rdi), %xmm0, %xmm0
+	vaesenclast	12*16(%rdi), %xmm0, %xmm0
+	jmp		.Lencrypt_iv_done
+SYM_FUNC_END(aes_xts_encrypt_iv)
+
+// Below are the actual AES-XTS encryption and decryption functions,
+// instantiated from the above macro.  They all have the following prototype:
+//
+// void (*xts_asm_func)(const struct crypto_aes_ctx *key,
+//			const u8 *src, u8 *dst, size_t len,
+//			u8 tweak[AES_BLOCK_SIZE]);
+//
+// |key| is the data key.  |tweak| contains the next tweak; the encryption of
+// the original IV with the tweak key was already done.  This function supports
+// incremental computation, but |len| must always be >= 16 (AES_BLOCK_SIZE), and
+// |len| must be a multiple of 16 except on the last call.  If |len| is a
+// multiple of 16, then this function updates |tweak| to contain the next tweak.
-- 
2.44.0



* [PATCH v2 3/6] crypto: x86/aes-xts - wire up AESNI + AVX implementation
@ 2024-03-29  8:03 ` Eric Biggers
From: Eric Biggers @ 2024-03-29  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-aesni-avx" for x86_64 CPUs that
have the AES-NI and AVX extensions but not VAES.  It's similar to the
existing xts-aes-aesni in that it uses xmm registers to operate on one
AES
block at a time.  It differs from xts-aes-aesni in the following ways:

- It uses the VEX-coded (non-destructive) instructions from AVX.
  This improves performance slightly.
- It incorporates some additional optimizations such as interleaving the
  tweak computation with AES en/decryption, handling single-page
  messages more efficiently, and caching the first round key.
- It supports only 64-bit (x86_64).
- It's generated by an assembly macro that will also be used to generate
  VAES-based implementations.

The performance improvement over xts-aes-aesni varies from small to
large, depending on the CPU and other factors such as the size of the
messages en/decrypted.  For example, the following increases in
AES-256-XTS decryption throughput are seen on the following CPUs:

                          | 4096-byte messages | 512-byte messages |
    ----------------------+--------------------+-------------------+
    Intel Skylake         |        6%          |       31%         |
    Intel Cascade Lake    |        4%          |       26%         |
    AMD Zen 1             |        61%         |       73%         |
    AMD Zen 2             |        36%         |       59%         |

(The above CPUs don't support VAES, so they can't use VAES instead.)

While this isn't as large an improvement as what VAES provides, this
still seems worthwhile.  This implementation is fairly easy to provide
based on the assembly macro that's needed for VAES anyway, and it will
be the best implementation on a large number of CPUs (very roughly, the
CPUs launched by Intel and AMD from 2011 to 2018).

This makes the existing xts-aes-aesni *mostly* obsolete.  For now, leave
it in place to support 32-bit kernels and also CPUs like Intel Westmere
that support AES-NI but not AVX.  (We could potentially remove it anyway
and just rely on the indirect acceleration via ecb-aes-aesni in those
cases, but that change will need to be considered separately.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S |   9 ++
 arch/x86/crypto/aesni-intel_glue.c   | 202 ++++++++++++++++++++++++++-
 2 files changed, 209 insertions(+), 2 deletions(-)

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index a5e2783c46ec..32e26f562cf0 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -796,5 +796,14 @@ SYM_FUNC_END(aes_xts_encrypt_iv)
 // |key| is the data key.  |tweak| contains the next tweak; the encryption of
 // the original IV with the tweak key was already done.  This function supports
 // incremental computation, but |len| must always be >= 16 (AES_BLOCK_SIZE), and
 // |len| must be a multiple of 16 except on the last call.  If |len| is a
 // multiple of 16, then this function updates |tweak| to contain the next tweak.
+
+.set	VL, 16
+.set	USE_AVX10, 0
+SYM_TYPED_FUNC_START(aes_xts_encrypt_aesni_avx)
+	_aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_aesni_avx)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_aesni_avx)
+	_aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_aesni_avx)
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index b1d90c25975a..10e283721a85 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1135,11 +1135,200 @@ static struct skcipher_alg aesni_xctr = {
 	.encrypt	= xctr_crypt,
 	.decrypt	= xctr_crypt,
 };
 
 static struct simd_skcipher_alg *aesni_simd_xctr;
-#endif /* CONFIG_X86_64 */
+
+asmlinkage void aes_xts_encrypt_iv(const struct crypto_aes_ctx *tweak_key,
+				   u8 iv[AES_BLOCK_SIZE]);
+
+typedef void (*xts_asm_func)(const struct crypto_aes_ctx *key,
+			     const u8 *src, u8 *dst, size_t len,
+			     u8 tweak[AES_BLOCK_SIZE]);
+
+/* This handles cases where the source and/or destination span pages. */
+static noinline int
+xts_crypt_slowpath(struct skcipher_request *req, xts_asm_func asm_func)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct aesni_xts_ctx *ctx = aes_xts_ctx(tfm);
+	int tail = req->cryptlen % AES_BLOCK_SIZE;
+	struct scatterlist sg_src[2], sg_dst[2];
+	struct skcipher_request subreq;
+	struct skcipher_walk walk;
+	struct scatterlist *src, *dst;
+	int err;
+
+	/*
+	 * If the message length isn't divisible by the AES block size, then
+	 * separate off the last full block and the partial block.  This ensures
+	 * that they are processed in the same call to the assembly function,
+	 * which is required for ciphertext stealing.
+	 */
+	if (tail) {
+		skcipher_request_set_tfm(&subreq, tfm);
+		skcipher_request_set_callback(&subreq,
+					      skcipher_request_flags(req),
+					      NULL, NULL);
+		skcipher_request_set_crypt(&subreq, req->src, req->dst,
+					   req->cryptlen - tail - AES_BLOCK_SIZE,
+					   req->iv);
+		req = &subreq;
+	}
+
+	err = skcipher_walk_virt(&walk, req, false);
+
+	while (walk.nbytes) {
+		unsigned int nbytes = walk.nbytes;
+
+		if (nbytes < walk.total)
+			nbytes = round_down(nbytes, AES_BLOCK_SIZE);
+
+		kernel_fpu_begin();
+		(*asm_func)(&ctx->crypt_ctx, walk.src.virt.addr,
+			    walk.dst.virt.addr, nbytes, req->iv);
+		kernel_fpu_end();
+		err = skcipher_walk_done(&walk, walk.nbytes - nbytes);
+	}
+
+	if (err || !tail)
+		return err;
+
+	/* Do ciphertext stealing with the last full block and partial block. */
+
+	dst = src = scatterwalk_ffwd(sg_src, req->src, req->cryptlen);
+	if (req->dst != req->src)
+		dst = scatterwalk_ffwd(sg_dst, req->dst, req->cryptlen);
+
+	skcipher_request_set_crypt(req, src, dst, AES_BLOCK_SIZE + tail,
+				   req->iv);
+
+	err = skcipher_walk_virt(&walk, req, false);
+	if (err)
+		return err;
+
+	kernel_fpu_begin();
+	(*asm_func)(&ctx->crypt_ctx, walk.src.virt.addr, walk.dst.virt.addr,
+		    walk.nbytes, req->iv);
+	kernel_fpu_end();
+
+	return skcipher_walk_done(&walk, 0);
+}
+
+/* __always_inline to avoid indirect call in fastpath */
+static __always_inline int
+xts_crypt2(struct skcipher_request *req, xts_asm_func asm_func)
+{
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct aesni_xts_ctx *ctx = aes_xts_ctx(tfm);
+	const unsigned int cryptlen = req->cryptlen;
+	struct scatterlist *src = req->src;
+	struct scatterlist *dst = req->dst;
+
+	if (unlikely(cryptlen < AES_BLOCK_SIZE))
+		return -EINVAL;
+
+	kernel_fpu_begin();
+	aes_xts_encrypt_iv(&ctx->tweak_ctx, req->iv);
+
+	/*
+	 * In practice, virtually all XTS plaintexts and ciphertexts are either
+	 * 512 or 4096 bytes, aligned such that they don't span page boundaries.
+	 * To optimize the performance of these cases, and also any other case
+	 * where no page boundary is spanned, the below fast-path handles
+	 * single-page sources and destinations as efficiently as possible.
+	 */
+	if (likely(src->length >= cryptlen && dst->length >= cryptlen &&
+		   src->offset + cryptlen <= PAGE_SIZE &&
+		   dst->offset + cryptlen <= PAGE_SIZE)) {
+		struct page *src_page = sg_page(src);
+		struct page *dst_page = sg_page(dst);
+		void *src_virt = kmap_local_page(src_page) + src->offset;
+		void *dst_virt = kmap_local_page(dst_page) + dst->offset;
+
+		(*asm_func)(&ctx->crypt_ctx, src_virt, dst_virt, cryptlen,
+			    req->iv);
+		kunmap_local(dst_virt);
+		kunmap_local(src_virt);
+		kernel_fpu_end();
+		return 0;
+	}
+	kernel_fpu_end();
+	return xts_crypt_slowpath(req, asm_func);
+}
+
+#define DEFINE_XTS_ALG(suffix, driver_name, priority)			       \
+									       \
+asmlinkage void aes_xts_encrypt_##suffix(const struct crypto_aes_ctx *key,     \
+					 const u8 *src, u8 *dst, size_t len,   \
+					 u8 tweak[AES_BLOCK_SIZE]);	       \
+asmlinkage void aes_xts_decrypt_##suffix(const struct crypto_aes_ctx *key,     \
+					 const u8 *src, u8 *dst, size_t len,   \
+					 u8 tweak[AES_BLOCK_SIZE]);	       \
+									       \
+static int xts_encrypt_##suffix(struct skcipher_request *req)		       \
+{									       \
+	return xts_crypt2(req, aes_xts_encrypt_##suffix);		       \
+}									       \
+									       \
+static int xts_decrypt_##suffix(struct skcipher_request *req)		       \
+{									       \
+	return xts_crypt2(req, aes_xts_decrypt_##suffix);		       \
+}									       \
+									       \
+static struct skcipher_alg aes_xts_alg_##suffix = {			       \
+	.base = {							       \
+		.cra_name		= "__xts(aes)",			       \
+		.cra_driver_name	= "__" driver_name,		       \
+		.cra_priority		= priority,			       \
+		.cra_flags		= CRYPTO_ALG_INTERNAL,		       \
+		.cra_blocksize		= AES_BLOCK_SIZE,		       \
+		.cra_ctxsize		= XTS_AES_CTX_SIZE,		       \
+		.cra_module		= THIS_MODULE,			       \
+	},								       \
+	.min_keysize	= 2 * AES_MIN_KEY_SIZE,				       \
+	.max_keysize	= 2 * AES_MAX_KEY_SIZE,				       \
+	.ivsize		= AES_BLOCK_SIZE,				       \
+	.walksize	= 2 * AES_BLOCK_SIZE,				       \
+	.setkey		= xts_aesni_setkey,				       \
+	.encrypt	= xts_encrypt_##suffix,				       \
+	.decrypt	= xts_decrypt_##suffix,				       \
+};									       \
+									       \
+static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
+
+DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
+
+static int __init register_xts_algs(void)
+{
+	int err;
+
+	if (!boot_cpu_has(X86_FEATURE_AVX))
+		return 0;
+	err = simd_register_skciphers_compat(&aes_xts_alg_aesni_avx, 1,
+					     &aes_xts_simdalg_aesni_avx);
+	if (err)
+		return err;
+	return 0;
+}
+
+static void unregister_xts_algs(void)
+{
+	if (aes_xts_simdalg_aesni_avx)
+		simd_unregister_skciphers(&aes_xts_alg_aesni_avx, 1,
+					  &aes_xts_simdalg_aesni_avx);
+}
+#else /* CONFIG_X86_64 */
+static int __init register_xts_algs(void)
+{
+	return 0;
+}
+
+static void unregister_xts_algs(void)
+{
+}
+#endif /* !CONFIG_X86_64 */
 
 #ifdef CONFIG_X86_64
 static int generic_gcmaes_set_key(struct crypto_aead *aead, const u8 *key,
 				  unsigned int key_len)
 {
@@ -1274,17 +1463,25 @@ static int __init aesni_init(void)
 						     &aesni_simd_xctr);
 	if (err)
 		goto unregister_aeads;
 #endif /* CONFIG_X86_64 */
 
+	err = register_xts_algs();
+	if (err)
+		goto unregister_xts;
+
 	return 0;
 
+unregister_xts:
+	unregister_xts_algs();
 #ifdef CONFIG_X86_64
+	if (aesni_simd_xctr)
+		simd_unregister_skciphers(&aesni_xctr, 1, &aesni_simd_xctr);
 unregister_aeads:
+#endif /* CONFIG_X86_64 */
 	simd_unregister_aeads(aesni_aeads, ARRAY_SIZE(aesni_aeads),
 				aesni_simd_aeads);
-#endif /* CONFIG_X86_64 */
 
 unregister_skciphers:
 	simd_unregister_skciphers(aesni_skciphers, ARRAY_SIZE(aesni_skciphers),
 				  aesni_simd_skciphers);
 unregister_cipher:
@@ -1301,10 +1498,11 @@ static void __exit aesni_exit(void)
 	crypto_unregister_alg(&aesni_cipher_alg);
 #ifdef CONFIG_X86_64
 	if (boot_cpu_has(X86_FEATURE_AVX))
 		simd_unregister_skciphers(&aesni_xctr, 1, &aesni_simd_xctr);
 #endif /* CONFIG_X86_64 */
+	unregister_xts_algs();
 }
 
 late_initcall(aesni_init);
 module_exit(aesni_exit);
 
-- 
2.44.0



* [PATCH v2 4/6] crypto: x86/aes-xts - wire up VAES + AVX2 implementation
@ 2024-03-29  8:03 ` Eric Biggers
From: Eric Biggers @ 2024-03-29  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-vaes-avx2" for x86_64 CPUs with
the VAES, VPCLMULQDQ, and AVX2 extensions, but not AVX512 or AVX10.
This implementation uses ymm registers to operate on two AES blocks at a
time.  The assembly code is instantiated using a macro so that most of
the source code is shared with other implementations.
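
For readers less familiar with VAES: a single vaesenc/vaesdec on a ymm
register performs one AES round on two independent 128-bit blocks at
once.  In userspace-intrinsics form, purely as an illustration (the
kernel code issues the instruction directly from assembly):

    #include <immintrin.h>

    /* One AES encryption round applied to two blocks packed in a ymm. */
    __m256i aes_round_x2(__m256i blocks, __m256i round_key)
    {
            return _mm256_aesenc_epi128(blocks, round_key);
    }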

This is the optimal implementation on AMD Zen 3.  It should also be the
optimal implementation on Intel Alder Lake, which similarly supports
VAES but not AVX512.  Comparing to xts-aes-aesni-avx on Zen 3,
xts-aes-vaes-avx2 provides 70% higher AES-256-XTS decryption throughput
with 4096-byte messages, or 23% higher with 512-byte messages.

A large improvement is also seen with CPUs that do support AVX512 (e.g.,
98% higher AES-256-XTS decryption throughput on Ice Lake with 4096-byte
messages), though the following patches add AVX512 optimized
implementations to get a bit more performance on those CPUs.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S | 11 +++++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 20 ++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index 32e26f562cf0..43706213dfca 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -805,5 +805,16 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_aesni_avx)
 	_aes_xts_crypt	1
 SYM_FUNC_END(aes_xts_encrypt_aesni_avx)
 SYM_TYPED_FUNC_START(aes_xts_decrypt_aesni_avx)
 	_aes_xts_crypt	0
 SYM_FUNC_END(aes_xts_decrypt_aesni_avx)
+
+#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
+.set	VL, 32
+.set	USE_AVX10, 0
+SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx2)
+	_aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_vaes_avx2)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx2)
+	_aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_vaes_avx2)
+#endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 10e283721a85..4cc15c7207f3 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1295,10 +1295,13 @@ static struct skcipher_alg aes_xts_alg_##suffix = {			       \
 };									       \
 									       \
 static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
 
 DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
+#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
+DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
+#endif
 
 static int __init register_xts_algs(void)
 {
 	int err;
 
@@ -1306,18 +1309,35 @@ static int __init register_xts_algs(void)
 		return 0;
 	err = simd_register_skciphers_compat(&aes_xts_alg_aesni_avx, 1,
 					     &aes_xts_simdalg_aesni_avx);
 	if (err)
 		return err;
+#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
+	if (!boot_cpu_has(X86_FEATURE_AVX2) ||
+	    !boot_cpu_has(X86_FEATURE_VAES) ||
+	    !boot_cpu_has(X86_FEATURE_VPCLMULQDQ) ||
+	    !boot_cpu_has(X86_FEATURE_PCLMULQDQ) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL))
+		return 0;
+	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx2, 1,
+					     &aes_xts_simdalg_vaes_avx2);
+	if (err)
+		return err;
+#endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
 	return 0;
 }
 
 static void unregister_xts_algs(void)
 {
 	if (aes_xts_simdalg_aesni_avx)
 		simd_unregister_skciphers(&aes_xts_alg_aesni_avx, 1,
 					  &aes_xts_simdalg_aesni_avx);
+#if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
+	if (aes_xts_simdalg_vaes_avx2)
+		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
+					  &aes_xts_simdalg_vaes_avx2);
+#endif
 }
 #else /* CONFIG_X86_64 */
 static int __init register_xts_algs(void)
 {
 	return 0;
-- 
2.44.0



* [PATCH v2 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
@ 2024-03-29  8:03 ` Eric Biggers
From: Eric Biggers @ 2024-03-29  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-vaes-avx10_256" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/256 or AVX512BW + AVX512VL
extensions.  This implementation avoids using zmm registers, instead
using ymm registers to operate on two AES blocks at a time.  The
assembly code is instantiated using a macro so that most of the source
code is shared with other implementations.

This is the optimal implementation on CPUs that support VAES and AVX512
but where the zmm registers should not be used due to downclocking
effects, for example Intel's Ice Lake.  It should also be the optimal
implementation on future CPUs that support AVX10/256 but not AVX10/512.

The performance is slightly better than that of xts-aes-vaes-avx2, which
uses the same 256-bit vector length, due to factors such as being able
to use ymm16-ymm31 to cache the AES round keys, and being able to use
the vpternlogd instruction to do XORs more efficiently.  For example, on
Ice Lake, the throughput of decrypting 4096-byte messages with
AES-256-XTS is 6.6% higher with xts-aes-vaes-avx10_256 than with
xts-aes-vaes-avx2.  While this is a small improvement, it is
straightforward to provide this implementation (xts-aes-vaes-avx10_256)
as long as we are providing xts-aes-vaes-avx2 and xts-aes-vaes-avx10_512
anyway, due to the way the _aes_xts_crypt macro is structured.
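
As an aside on vpternlogd: the immediate 0x96 selects the truth table
for a three-way XOR, so one instruction replaces two vpxor's.  In
userspace-intrinsics form, purely as an illustration:

    #include <immintrin.h>

    /* a ^ b ^ c in a single vpternlogd (truth table 0x96). */
    __m256i xor3(__m256i a, __m256i b, __m256i c)
    {
            return _mm256_ternarylogic_epi32(a, b, c, 0x96);
    }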

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S |  9 +++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 16 ++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index 43706213dfca..71be474b22da 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -815,6 +815,15 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx2)
 	_aes_xts_crypt	1
 SYM_FUNC_END(aes_xts_encrypt_vaes_avx2)
 SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx2)
 	_aes_xts_crypt	0
 SYM_FUNC_END(aes_xts_decrypt_vaes_avx2)
+
+.set	VL, 32
+.set	USE_AVX10, 1
+SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_256)
+	_aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_256)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_256)
+	_aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_256)
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 4cc15c7207f3..914cbf5d1f5c 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1297,10 +1297,11 @@ static struct skcipher_alg aes_xts_alg_##suffix = {			       \
 static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
 
 DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
 DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
+DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700);
 #endif
 
 static int __init register_xts_algs(void)
 {
 	int err;
@@ -1320,10 +1321,22 @@ static int __init register_xts_algs(void)
 		return 0;
 	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx2, 1,
 					     &aes_xts_simdalg_vaes_avx2);
 	if (err)
 		return err;
+
+	if (!boot_cpu_has(X86_FEATURE_AVX512BW) ||
+	    !boot_cpu_has(X86_FEATURE_AVX512VL) ||
+	    !boot_cpu_has(X86_FEATURE_BMI2) ||
+	    !cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+			       XFEATURE_MASK_AVX512, NULL))
+		return 0;
+
+	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1,
+					     &aes_xts_simdalg_vaes_avx10_256);
+	if (err)
+		return err;
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
 	return 0;
 }
 
 static void unregister_xts_algs(void)
@@ -1333,10 +1346,13 @@ static void unregister_xts_algs(void)
 					  &aes_xts_simdalg_aesni_avx);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
 	if (aes_xts_simdalg_vaes_avx2)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
 					  &aes_xts_simdalg_vaes_avx2);
+	if (aes_xts_simdalg_vaes_avx10_256)
+		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1,
+					  &aes_xts_simdalg_vaes_avx10_256);
 #endif
 }
 #else /* CONFIG_X86_64 */
 static int __init register_xts_algs(void)
 {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
  2024-03-29  8:03 [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
                   ` (4 preceding siblings ...)
  2024-03-29  8:03 ` [PATCH v2 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation Eric Biggers
@ 2024-03-29  8:03 ` Eric Biggers
  2024-04-04 20:34   ` Dave Hansen
  2024-03-29  9:03 ` [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs Ard Biesheuvel
  6 siblings, 1 reply; 15+ messages in thread
From: Eric Biggers @ 2024-03-29  8:03 UTC (permalink / raw)
  To: linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

From: Eric Biggers <ebiggers@google.com>

Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs
with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL
extensions.  This implementation uses zmm registers to operate on four
AES blocks at a time.  The assembly code is instantiated using a macro
so that most of the source code is shared with other implementations.

To avoid downclocking on older Intel CPU models, an exclusion list is
used to prevent this 512-bit implementation from being used by default
on some CPU models.  They will use xts-aes-vaes-avx10_256 instead.  For
now, this exclusion list is simply coded into aesni-intel_glue.c.  It
may make sense to eventually move it into a more central location.

xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on
some current CPUs.  E.g., on AMD Zen 4, AES-256-XTS decryption
throughput increases by 13% with 4096-byte inputs, or 14% with 512-byte
inputs.  On Intel Sapphire Rapids, AES-256-XTS decryption throughput
increases by 2% with 4096-byte inputs, or 3% with 512-byte inputs.

Future CPUs may provide stronger 512-bit support, in which case a larger
benefit should be seen.

Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/x86/crypto/aes-xts-avx-x86_64.S |  9 ++++++++
 arch/x86/crypto/aesni-intel_glue.c   | 32 ++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/crypto/aes-xts-avx-x86_64.S b/arch/x86/crypto/aes-xts-avx-x86_64.S
index 71be474b22da..b8005d0205f8 100644
--- a/arch/x86/crypto/aes-xts-avx-x86_64.S
+++ b/arch/x86/crypto/aes-xts-avx-x86_64.S
@@ -824,6 +824,15 @@ SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_256)
 	_aes_xts_crypt	1
 SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_256)
 SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_256)
 	_aes_xts_crypt	0
 SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_256)
+
+.set	VL, 64
+.set	USE_AVX10, 1
+SYM_TYPED_FUNC_START(aes_xts_encrypt_vaes_avx10_512)
+	_aes_xts_crypt	1
+SYM_FUNC_END(aes_xts_encrypt_vaes_avx10_512)
+SYM_TYPED_FUNC_START(aes_xts_decrypt_vaes_avx10_512)
+	_aes_xts_crypt	0
+SYM_FUNC_END(aes_xts_decrypt_vaes_avx10_512)
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
diff --git a/arch/x86/crypto/aesni-intel_glue.c b/arch/x86/crypto/aesni-intel_glue.c
index 914cbf5d1f5c..0855ace8659c 100644
--- a/arch/x86/crypto/aesni-intel_glue.c
+++ b/arch/x86/crypto/aesni-intel_glue.c
@@ -1298,12 +1298,33 @@ static struct simd_skcipher_alg *aes_xts_simdalg_##suffix
 
 DEFINE_XTS_ALG(aesni_avx, "xts-aes-aesni-avx", 500);
 #if defined(CONFIG_AS_VAES) && defined(CONFIG_AS_VPCLMULQDQ)
 DEFINE_XTS_ALG(vaes_avx2, "xts-aes-vaes-avx2", 600);
 DEFINE_XTS_ALG(vaes_avx10_256, "xts-aes-vaes-avx10_256", 700);
+DEFINE_XTS_ALG(vaes_avx10_512, "xts-aes-vaes-avx10_512", 800);
 #endif
 
+/*
+ * This is a list of CPU models that are known to suffer from downclocking when
+ * zmm registers (512-bit vectors) are used.  On these CPUs, the AES-XTS
+ * implementation with zmm registers won't be used by default.  An
+ * implementation with ymm registers (256-bit vectors) will be used instead.
+ */
+static const struct x86_cpu_id zmm_exclusion_list[] = {
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
+	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
+	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
+	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
+	{},
+};
+
 static int __init register_xts_algs(void)
 {
 	int err;
 
 	if (!boot_cpu_has(X86_FEATURE_AVX))
@@ -1333,10 +1354,18 @@ static int __init register_xts_algs(void)
 
 	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_256, 1,
 					     &aes_xts_simdalg_vaes_avx10_256);
 	if (err)
 		return err;
+
+	if (x86_match_cpu(zmm_exclusion_list))
+		aes_xts_alg_vaes_avx10_512.base.cra_priority = 1;
+
+	err = simd_register_skciphers_compat(&aes_xts_alg_vaes_avx10_512, 1,
+					     &aes_xts_simdalg_vaes_avx10_512);
+	if (err)
+		return err;
 #endif /* CONFIG_AS_VAES && CONFIG_AS_VPCLMULQDQ */
 	return 0;
 }
 
 static void unregister_xts_algs(void)
@@ -1349,10 +1378,13 @@ static void unregister_xts_algs(void)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx2, 1,
 					  &aes_xts_simdalg_vaes_avx2);
 	if (aes_xts_simdalg_vaes_avx10_256)
 		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_256, 1,
 					  &aes_xts_simdalg_vaes_avx10_256);
+	if (aes_xts_simdalg_vaes_avx10_512)
+		simd_unregister_skciphers(&aes_xts_alg_vaes_avx10_512, 1,
+					  &aes_xts_simdalg_vaes_avx10_512);
 #endif
 }
 #else /* CONFIG_X86_64 */
 static int __init register_xts_algs(void)
 {
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-03-29  8:03 [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
                   ` (5 preceding siblings ...)
  2024-03-29  8:03 ` [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Eric Biggers
@ 2024-03-29  9:03 ` Ard Biesheuvel
  2024-03-29  9:31   ` Eric Biggers
  6 siblings, 1 reply; 15+ messages in thread
From: Ard Biesheuvel @ 2024-03-29  9:03 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, x86, linux-kernel, Andy Lutomirski, Chang S . Bae

On Fri, 29 Mar 2024 at 10:06, Eric Biggers <ebiggers@kernel.org> wrote:
>
> This patchset adds new AES-XTS implementations that accelerate disk and
> file encryption on modern x86_64 CPUs.
>
> The largest improvements are seen on CPUs that support the VAES
> extension: Intel Ice Lake (2019) and later, and AMD Zen 3 (2020) and
> later.  However, an implementation using plain AESNI + AVX is also added
> and provides a boost on older CPUs too.
>
> To try to handle the mess that is x86 SIMD, the code for all the new
> AES-XTS implementations is generated from an assembly macro.  This makes
> it so that we e.g. don't have to have entirely different source code
> just for different vector lengths (xmm, ymm, zmm).
>
> To avoid downclocking effects, zmm registers aren't used on certain
> Intel CPU models such as Ice Lake.  These CPU models default to an
> implementation using ymm registers instead.
>
> To make testing easier, all four new AES-XTS implementations are
> registered separately with the crypto API.  They are prioritized
> appropriately so that the best one for the CPU is used by default.
>
> There's no separate kconfig option for the new implementations, as they
> are included in the existing option CONFIG_CRYPTO_AES_NI_INTEL.
>
> This patchset increases the throughput of AES-256-XTS by the following
> amounts on the following CPUs:
>
>                           | 4096-byte messages | 512-byte messages |
>     ----------------------+--------------------+-------------------+
>     Intel Skylake         |        6%          |       31%         |
>     Intel Cascade Lake    |        4%          |       26%         |
>     Intel Ice Lake        |       127%         |      120%         |
>     Intel Sapphire Rapids |       151%         |      112%         |
>     AMD Zen 1             |        61%         |       73%         |
>     AMD Zen 2             |        36%         |       59%         |
>     AMD Zen 3             |       138%         |       99%         |
>     AMD Zen 4             |       155%         |      117%         |
>
> To summarize how the XTS implementations perform in general, here are
> benchmarks of all of them on AMD Zen 4, with 4096-byte messages.  (Of
> course, in practice only the best one for the CPU actually gets used.)
>
>     xts-aes-aesni                  4247 MB/s
>     xts-aes-aesni-avx              5669 MB/s
>     xts-aes-vaes-avx2              9588 MB/s
>     xts-aes-vaes-avx10_256         9631 MB/s
>     xts-aes-vaes-avx10_512         10868 MB/s
>
> ... and on Intel Sapphire Rapids:
>
>     xts-aes-aesni                  4848 MB/s
>     xts-aes-aesni-avx              5287 MB/s
>     xts-aes-vaes-avx2              11685 MB/s
>     xts-aes-vaes-avx10_256         11938 MB/s
>     xts-aes-vaes-avx10_512         12176 MB/s
>
> Notes about benchmarking methods:
>
> - All my benchmarks were done using a custom kernel module that invokes
>   the crypto_skcipher API.  Note that benchmarking the crypto API from
>   userspace using AF_ALG, e.g. as 'cryptsetup benchmark' does, is bad at
>   measuring fast algorithms due to the syscall overhead of AF_ALG.  I
>   don't recommend that method.  Instead, I measured the crypto
>   performance directly, as that's what this patchset focuses on.
>
> - All numbers I give are for decryption.  However, on all the CPUs I
>   tested, encryption performs almost identically to decryption.
>
> Open questions:
>
> - Is the policy that I implemented for preferring ymm registers to zmm
>   registers the right one?  arch/x86/crypto/poly1305_glue.c thinks that
>   only Skylake has the bad downclocking.  My current proposal is a bit
>   more conservative; it also excludes Ice Lake and Tiger Lake.  Those
>   CPUs supposedly still have some downclocking, though not as much.
>
> - Should the policy on the use of zmm registers be in a centralized
>   place?  It probably doesn't make sense to have random different
>   policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
>
> - Are there any other known issues with using AVX512 in kernel mode?  It
>   seems to work, and technically it's not new because Poly1305 and ARIA
>   already use AVX512, including the mask registers and zmm registers up
>   to 31.  So if there was a major issue, like the new registers not
>   being properly saved and restored, it probably would have already been
>   found.  But AES-XTS support would introduce a wider use of it.
>
> - Should we perhaps not even bother with AVX512 / AVX10 at all for now,
>   given that on current CPUs most of the improvement is achieved by
>   going to VAES + AVX2?  I.e. should we skip the last two patches?  I'm
>   hoping the improvement will be greater on future CPUs, though.
>
> Changed in v2:
>   - Additional optimizations:
>       - Interleaved the tweak computation with AES en/decryption.  This
>         helps significantly on some CPUs, especially those without VAES.
>       - Further optimized for single-page sources and destinations.
>       - Used fewer instructions to update tweaks in VPCLMULQDQ case.
>       - Improved handling of "round 0".
>       - Eliminated a jump instruction from the main loop.
>   - Other
>       - Fixed zmm_exclusion_list[] to be null-terminated.
>       - Added missing #ifdef to unregister_xts_algs().
>       - Added some more comments.
>       - Improved cover letter and some commit messages.
>       - Now that the next tweak is always computed anyways, made it be
>         returned unconditionally.
>       - Moved the IV encryption to a separate function.
>
> Eric Biggers (6):
>   x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support
>   crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs
>   crypto: x86/aes-xts - wire up AESNI + AVX implementation
>   crypto: x86/aes-xts - wire up VAES + AVX2 implementation
>   crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation
>   crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
>

Retested this v2:

Tested-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>

Hopefully, the AES-KL keylocker implementation can be based on this
template as well. I wouldn't mind retiring the existing xts(aesni)
code entirely, and using the xts() wrapper around ecb-aes-aesni on
32-bit and on non-AVX uarchs with AES-NI.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-03-29  9:03 ` [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs Ard Biesheuvel
@ 2024-03-29  9:31   ` Eric Biggers
  2024-04-03  0:44     ` Eric Biggers
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Biggers @ 2024-03-29  9:31 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, x86, linux-kernel, Andy Lutomirski, Chang S . Bae

On Fri, Mar 29, 2024 at 11:03:07AM +0200, Ard Biesheuvel wrote:
> 
> Retested this v2:
> 
> Tested-by: Ard Biesheuvel <ardb@kernel.org>
> Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> 
> Hopefully, the AES-KL keylocker implementation can be based on this
> template as well.

As-is, it would be a bit ugly to add keylocker support to my template because my
template always processes 4 registers of AES blocks per iteration of the main
loop (like the existing aes-xts-aesni), whereas the keylocker instructions are
hardcoded to operate on 8 AES blocks at a time in xmm0-xmm7, presumably to
reduce the overhead of unwrapping the key.

I did try an 8-wide version briefly.  There are some older CPUs on which it
helps.  (On newer CPUs, AES latency is lower, and the width increases by moving
to ymm or zmm registers anyway.)  But it didn't seem too attractive to me.  It
causes registers to spill, and it becomes a bit awkward to unroll the AES rounds
when the code size is twice as large, so it may need to be re-rolled.  I should
take a closer look, but I decided to just stay with a 4-wide version for now.

So I *think* AES-KL is best kept separate for now.  I do wonder if the AES-KL
code should adopt the idea of using VEX-coded instructions, though --- surely
it's the case that in practice, any CPU with AES-KL also supports AVX.

> I wouldn't mind retiring the existing xts(aesni)
> code entirely, and using the xts() wrapper around ecb-aes-aesni on
> 32-bit and on non-AVX uarchs with AES-NI.

Yes, it will need to be benchmarked, but that probably makes sense.  If
Wikipedia is to be trusted, on the Intel side only Westmere (from 2010) has
AES-NI but not AVX, and on the AMD side all CPUs with AES-NI have AVX...

- Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs
  2024-03-29  9:31   ` Eric Biggers
@ 2024-04-03  0:44     ` Eric Biggers
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Biggers @ 2024-04-03  0:44 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-crypto, x86, linux-kernel, Andy Lutomirski, Chang S . Bae

On Fri, Mar 29, 2024 at 02:31:30AM -0700, Eric Biggers wrote:
> > I wouldn't mind retiring the existing xts(aesni)
> > code entirely, and using the xts() wrapper around ecb-aes-aesni on
> > 32-bit and on non-AVX uarchs with AES-NI.
> 
> Yes, it will need to be benchmarked, but that probably makes sense.  If
> Wikipedia is to be trusted, on the Intel side only Westmere (from 2010) has
> AES-NI but not AVX, and on the AMD side all CPUs with AES-NI have AVX...

It looks like I missed some low-power CPUs.  Intel's Silvermont (2013), Goldmont
(2016), Goldmont Plus (2017), and Tremont (2020) support AES-NI but not AVX.
Their successor, Gracemont (2021), supports AVX.

I don't have any one of those immediately available to run a test on.  But just
doing a quick benchmark on Zen 1, xts-aes-aesni has 62% higher throughput than
xts(ecb-aes-aesni).  The significant difference seems expected, since there's a
lot of API overhead in the xts template, and it computes all the tweaks twice in
C code.

So I'm thinking we'll need to keep xts-aes-aesni around for now, alongside
xts-aes-aesni-avx.

(And with all the SIMD instructions taking a different number of arguments and
having different names for AVX vs non-AVX, I don't see a clean way to unify them
in assembly.  They could be unified if we used C intrinsics instead of assembly
and compiled a C function with and without the "avx" target.  However,
intrinsics bring their own issues and make it hard to control the generated
code.  I don't really want to rely on intrinsics for this code.)
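
For concreteness, here is a toy sketch of the "compile the same C function
with and without the 'avx' target" idea mentioned above.  It is not something
this series does, and the names are made up for illustration.  Built with
e.g. "gcc -O2 -c", the second copy is emitted with VEX encodings:

	#include <immintrin.h>

	/* One shared body, kept inline so each caller's target applies. */
	static inline void xor_block16(unsigned char *dst, const unsigned char *src)
	{
		__m128i a = _mm_loadu_si128((const __m128i *)dst);
		__m128i b = _mm_loadu_si128((const __m128i *)src);

		_mm_storeu_si128((__m128i *)dst, _mm_xor_si128(a, b));
	}

	/* Baseline instantiation: legacy SSE encodings (pxor, movdqu). */
	void xor_block16_sse(unsigned char *dst, const unsigned char *src)
	{
		xor_block16(dst, src);
	}

	/* Same body with the "avx" target: VEX encodings (vpxor, vmovdqu). */
	__attribute__((target("avx")))
	void xor_block16_avx(unsigned char *dst, const unsigned char *src)
	{
		xor_block16(dst, src);
	}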

- Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
  2024-03-29  8:03 ` [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Eric Biggers
@ 2024-04-04 20:34   ` Dave Hansen
  2024-04-04 23:36     ` Eric Biggers
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2024-04-04 20:34 UTC (permalink / raw)
  To: Eric Biggers, linux-crypto, x86
  Cc: linux-kernel, Ard Biesheuvel, Andy Lutomirski, Chang S . Bae

On 3/29/24 01:03, Eric Biggers wrote:
> +static const struct x86_cpu_id zmm_exclusion_list[] = {
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
> +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
> +	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
> +	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
> +	{},
> +};

A hard-coded model/family exclusion list is not great.

It'll break when running in guests on newer CPUs that fake any of these
models.  Some folks will also surely disagree with the kernel policy
implemented here.

Is there no way to implement this other than a hard-coded kernel policy?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
  2024-04-04 20:34   ` Dave Hansen
@ 2024-04-04 23:36     ` Eric Biggers
  2024-04-04 23:53       ` Dave Hansen
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Biggers @ 2024-04-04 23:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-crypto, x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Chang S . Bae

On Thu, Apr 04, 2024 at 01:34:04PM -0700, Dave Hansen wrote:
> On 3/29/24 01:03, Eric Biggers wrote:
> > +static const struct x86_cpu_id zmm_exclusion_list[] = {
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_SKYLAKE_X },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_X },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_D },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_L },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_ICELAKE_NNPI },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE_L },
> > +	{ .vendor = X86_VENDOR_INTEL, .family = 6, .model = INTEL_FAM6_TIGERLAKE },
> > +	/* Allow Rocket Lake and later, and Sapphire Rapids and later. */
> > +	/* Also allow AMD CPUs (starting with Zen 4, the first with AVX-512). */
> > +	{},
> > +};
> 
> A hard-coded model/family exclusion list is not great.
> 
> It'll break when running in guests on newer CPUs that fake any of these
> models.  Some folks will also surely disagree with the kernel policy
> implemented here.
> 
> Is there no way to implement this other than a hard-coded kernel policy?

Besides the hardcoded CPU exclusion list, the options are:

1. Never use zmm registers.

2. Ignore the issue and use zmm registers even on these CPU models.  Systemwide
   performance may suffer due to downclocking.

3. Do a runtime test to detect whether using zmm registers causes downclocking.
   This seems impractical.

4. Keep the proposed policy as the default behavior, but allow it to be
   overridden on the kernel command line.  This would be a bit more flexible;
   however, most people don't change defaults anyway.  (A rough sketch of what
   this could look like is included below.)
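
To make option (4) concrete, here is a hypothetical fragment (not part of
this series) of what the override could look like in aesni-intel_glue.c from
this patchset.  The parameter name "force_zmm" is made up for illustration;
it defaults to off, so the exclusion-list policy stays the default behavior:

	/* Hypothetical command-line knob; name and semantics are illustrative. */
	static bool force_zmm;
	module_param(force_zmm, bool, 0444);
	MODULE_PARM_DESC(force_zmm,
			 "Use the 512-bit AES-XTS implementation even on excluded CPU models");

	/* In register_xts_algs(), the priority demotion would then be gated: */
	if (!force_zmm && x86_match_cpu(zmm_exclusion_list))
		aes_xts_alg_vaes_avx10_512.base.cra_priority = 1;

Booting with "aesni_intel.force_zmm=1" would then leave the 512-bit
implementation at its normal priority even on the excluded models.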

When you write "Some folks will also surely disagree with the kernel policy
implemented here", are there any specific concerns that you anticipate?  Note
that Intel has acknowledged the zmm downclocking issues on Ice Lake and
suggested that using ymm registers instead would be reasonable:
https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/

If there is really a controversy, my vote is that for now we just go with option
(1), i.e. drop this patch from the series.  We can reconsider the issue when a
CPU is released with better 512-bit support.

- Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
  2024-04-04 23:36     ` Eric Biggers
@ 2024-04-04 23:53       ` Dave Hansen
  2024-04-05  0:11         ` Eric Biggers
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Hansen @ 2024-04-04 23:53 UTC (permalink / raw)
  To: Eric Biggers
  Cc: linux-crypto, x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Chang S . Bae

On 4/4/24 16:36, Eric Biggers wrote:
> 1. Never use zmm registers.
...
> 4. Keep the proposed policy as the default behavior, but allow it to be
>    overridden on the kernel command line.  This would be a bit more flexible;
>    however, most people don't change defaults anyway.
> 
> When you write "Some folks will also surely disagree with the kernel policy
> implemented here", are there any specific concerns that you anticipate?

Some people care less about the frequency throttling and only care about
max performance _using_ AVX512.

> Note that Intel has acknowledged the zmm downclocking issues on Ice
> Lake and suggested that using ymm registers instead would be
> reasonable:
> https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/
> 
> If there is really a controversy, my vote is that for now we just go with option
> (1), i.e. drop this patch from the series.  We can reconsider the issue when a
> CPU is released with better 512-bit support.

(1) is fine with me.

(4) would also be fine.  But I don't think it absolutely _has_ to be a
boot-time switch.  What prevents you from registering, say,
"xts-aes-vaes-avx10" and then doing:

	if (avx512_is_desired())
		xts-aes-vaes-avx10_512(...);
	else
		xts-aes-vaes-avx10_256(...);

at runtime?

Where avx512_is_desired() can be changed willy-nilly, either with a
command-line parameter or runtime knob.  Sure, the performance might
change versus what was measured, but I don't think that's a deal breaker.

Then if folks want to do fancy benchmarks or model/family checks or
whatever, they can do it in userspace at runtime.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
  2024-04-04 23:53       ` Dave Hansen
@ 2024-04-05  0:11         ` Eric Biggers
  2024-04-05  7:20           ` Herbert Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Biggers @ 2024-04-05  0:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-crypto, x86, linux-kernel, Ard Biesheuvel, Andy Lutomirski,
	Chang S . Bae

On Thu, Apr 04, 2024 at 04:53:12PM -0700, Dave Hansen wrote:
> On 4/4/24 16:36, Eric Biggers wrote:
> > 1. Never use zmm registers.
> ...
> > 4. Keep the proposed policy as the default behavior, but allow it to be
> >    overridden on the kernel command line.  This would be a bit more flexible;
> >    however, most people don't change defaults anyway.
> > 
> > When you write "Some folks will also surely disagree with the kernel policy
> > implemented here", are there any specific concerns that you anticipate?
> 
> Some people care less about the frequency throttling and only care about
> max performance _using_ AVX512.
> 
> > Note that Intel has acknowledged the zmm downclocking issues on Ice
> > Lake and suggested that using ymm registers instead would be
> > reasonable:
> > https://lore.kernel.org/linux-crypto/e8ce1146-3952-6977-1d0e-a22758e58914@intel.com/
> > 
> > If there is really a controversy, my vote is that for now we just go with option
> > (1), i.e. drop this patch from the series.  We can reconsider the issue when a
> > CPU is released with better 512-bit support.
> 
> (1) is fine with me.
> 
> (4) would also be fine.  But I don't think it absolutely _has_ to be a
> boot-time switch.  What prevents you from registering, say,
> "xts-aes-vaes-avx10" and then doing:
> 
> 	if (avx512_is_desired())
> 		xts-aes-vaes-avx10_512(...);
> 	else
> 		xts-aes-vaes-avx10_256(...);
> 
> at runtime?
> 
> Where avx512_is_desired() can be changed willy-nilly, either with a
> command-line parameter or runtime knob.  Sure, the performance might
> change versus what was measured, but I don't think that's a deal breaker.
> 
> Then if folks want to do fancy benchmarks or model/family checks or
> whatever, they can do it in userspace at runtime.

It's certainly possible for a single crypto algorithm (using "algorithm" in the
crypto API sense of the word) to have multiple alternative code paths, and there
are examples of this in arch/x86/crypto/.  However, I think this is a poor
practice, at least as the crypto API is currently designed, because it makes it
difficult to test the different code paths.  Alternatives are best handled by
registering them as separate algorithms with different cra_priority values.

Also, I forgot one property of my patch, which is that because I made the
zmm_exclusion_list just decrease the priority of xts-aes-vaes-avx10_512 rather
than skipping registering it, the change actually can be undone at runtime by
increasing the priority of xts-aes-vaes-avx10_512 back to its original value.
Userspace can do it using the "crypto user configuration API"
(include/uapi/linux/cryptouser.h), specifically CRYPTO_MSG_UPDATEALG.
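
For reference, here is a rough, untested userspace sketch of doing that with
NETLINK_CRYPTO: it sends a CRYPTO_MSG_UPDATEALG request for the driver name
with a CRYPTOCFGA_PRIORITY_VAL attribute restoring the original priority of
800 from patch 6.  Error handling and reading the netlink ACK are omitted,
and it needs CAP_NET_ADMIN:

	#include <linux/cryptouser.h>
	#include <linux/netlink.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		struct {
			struct nlmsghdr nlh;
			struct crypto_user_alg alg;
			struct nlattr attr;
			__u32 prio;
		} req = {};
		struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
		int fd;

		req.nlh.nlmsg_len = sizeof(req);
		req.nlh.nlmsg_type = CRYPTO_MSG_UPDATEALG;
		req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
		strcpy(req.alg.cru_driver_name, "xts-aes-vaes-avx10_512");
		req.attr.nla_len = sizeof(req.attr) + sizeof(req.prio);
		req.attr.nla_type = CRYPTOCFGA_PRIORITY_VAL;
		req.prio = 800;	/* original cra_priority of xts-aes-vaes-avx10_512 */

		fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_CRYPTO);
		sendto(fd, &req, sizeof(req), 0,
		       (struct sockaddr *)&kernel, sizeof(kernel));
		close(fd);
		return 0;
	}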

Maybe that is enough configurability already?

- Eric

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation
  2024-04-05  0:11         ` Eric Biggers
@ 2024-04-05  7:20           ` Herbert Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Herbert Xu @ 2024-04-05  7:20 UTC (permalink / raw)
  To: Eric Biggers
  Cc: dave.hansen, linux-crypto, x86, linux-kernel, ardb, luto,
	chang.seok.bae

Eric Biggers <ebiggers@kernel.org> wrote:
>
> Also, I forgot one property of my patch, which is that because I made the
> zmm_exclusion_list just decrease the priority of xts-aes-vaes-avx10_512 rather
> than skipping registering it, the change actually can be undone at runtime by
> increasing the priority of xts-aes-vaes-avx10_512 back to its original value.
> Userspace can do it using the "crypto user configuration API"
> (include/uapi/linux/cryptouser.h), specifically CRYPTO_MSG_UPDATEALG.
> 
> Maybe that is enough configurability already?

Yes I think that's more than sufficient.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-04-05  7:20 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2024-03-29  8:03 [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs Eric Biggers
2024-03-29  8:03 ` [PATCH v2 1/6] x86: add kconfig symbols for assembler VAES and VPCLMULQDQ support Eric Biggers
2024-03-29  8:03 ` [PATCH v2 2/6] crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs Eric Biggers
2024-03-29  8:03 ` [PATCH v2 3/6] crypto: x86/aes-xts - wire up AESNI + AVX implementation Eric Biggers
2024-03-29  8:03 ` [PATCH v2 4/6] crypto: x86/aes-xts - wire up VAES + AVX2 implementation Eric Biggers
2024-03-29  8:03 ` [PATCH v2 5/6] crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation Eric Biggers
2024-03-29  8:03 ` [PATCH v2 6/6] crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation Eric Biggers
2024-04-04 20:34   ` Dave Hansen
2024-04-04 23:36     ` Eric Biggers
2024-04-04 23:53       ` Dave Hansen
2024-04-05  0:11         ` Eric Biggers
2024-04-05  7:20           ` Herbert Xu
2024-03-29  9:03 ` [PATCH v2 0/6] Faster AES-XTS on modern x86_64 CPUs Ard Biesheuvel
2024-03-29  9:31   ` Eric Biggers
2024-04-03  0:44     ` Eric Biggers
