* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
       [not found] <20231004175203.943277832@linuxfoundation.org>
@ 2023-10-05 17:49 ` Naresh Kamboju
  2023-10-06 16:20   ` Liam R. Howlett
  0 siblings, 1 reply; 14+ messages in thread
From: Naresh Kamboju @ 2023-10-05 17:49 UTC
  To: Greg Kroah-Hartman
  Cc: stable, patches, linux-kernel, torvalds, akpm, linux, shuah,
	patches, lkft-triage, pavel, jonathanh, f.fainelli,
	sudipm.mukherjee, srw, rwarsow, conor, Chengming Zhou,
	Liam R. Howlett, Joel Fernandes, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, Paul E. McKenney, rcu

On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
>
> This is the start of the stable review cycle for the 5.15.134 release.
> There are 183 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> Anything received after that time might be too late.
>
> The whole patch series can be found in one patch at:
>         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> or in the git tree and branch at:
>         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> and the diffstat can be found below.
>
> thanks,
>
> greg k-h

Results from Linaro’s test farm.
Regressions on x86.

The following kernel warning was noticed on x86 while booting stable-rc
5.15.134-rc1 with a kernel built using the selftests merge config.

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>

Has anyone else noticed this kernel warning?

It is always reproducible when booting x86 with this config.

x86 boot log:
-----
[    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake) (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP @1696443178
...
[    1.480701] ------------[ cut here ]------------
[    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958 trc_inspect_reader+0x80/0xb0
[    1.481296] Modules linked in:
[    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
[    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.5 11/26/2020
[    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
[    1.481296] Code: b6 83 45 04 00 00 84 c0 75 48 c6 83 45 04 00 00 01 b8 01 00 00 00 5b 41 5c 5d c3 cc cc cc cc 0f 94 c0 eb b4 f6 43 2c 02 75 02 <0f> 0b 48 83 05 36 f8 ee 02 01 b8 01 00 00 00 48 83 05 21 f8 ee 02
[    1.481296] RSP: 0000:ffffb25e000afd70 EFLAGS: 00010046
[    1.481296] RAX: 0000000000000000 RBX: ffff9b40c080d040 RCX: 0000000000000003
[    1.481296] RDX: ffff9b4427b80000 RSI: 0000000000000000 RDI: ffff9b40c080d040
[    1.481296] RBP: ffffb25e000afd80 R08: e32db91cdfdc3bef R09: 00000000035b89d4
[    1.481296] R10: 000000006a495065 R11: 0000000000000030 R12: ffffffffae692100
[    1.481296] R13: 0000000000000000 R14: ffff9b40c080d9a8 R15: 0000000000000000
[    1.481296] FS:  0000000000000000(0000) GS:ffff9b4427a00000(0000) knlGS:0000000000000000
[    1.481296] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.481296] CR2: ffff9b4297201000 CR3: 00000002d5e26001 CR4: 00000000003706f0
[    1.481296] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    1.481296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    1.481296] Call Trace:
[    1.481296]  <TASK>
[    1.481296]  ? show_regs.cold+0x1a/0x1f
[    1.481296]  ? __warn+0x88/0x120
[    1.481296]  ? trc_inspect_reader+0x80/0xb0
[    1.481296]  ? report_bug+0xa8/0xd0
[    1.481296]  ? handle_bug+0x40/0x70
[    1.481296]  ? exc_invalid_op+0x18/0x70
[    1.481296]  ? asm_exc_invalid_op+0x1b/0x20
[    1.481296]  ? rcu_tasks_kthread+0x250/0x250
[    1.481296]  ? trc_inspect_reader+0x80/0xb0
[    1.481296]  ? rcu_tasks_kthread+0x250/0x250
[    1.481296]  try_invoke_on_locked_down_task+0x109/0x120
[    1.481296]  trc_wait_for_one_reader.part.0+0x48/0x270
[    1.481296]  rcu_tasks_trace_postscan+0x76/0xb0
[    1.481296]  rcu_tasks_wait_gp+0x186/0x380
[    1.481296]  ? _raw_spin_unlock_irqrestore+0x35/0x50
[    1.481296]  rcu_tasks_kthread+0x145/0x250
[    1.481296]  ? do_wait_intr_irq+0xc0/0xc0
[    1.481296]  ? synchronize_rcu_tasks_rude+0x20/0x20
[    1.481296]  kthread+0x146/0x170
[    1.481296]  ? set_kthread_struct+0x50/0x50
[    1.481296]  ret_from_fork+0x1f/0x30
[    1.481296]  </TASK>
[    1.481296] irq event stamp: 132
[    1.481296] hardirqs last  enabled at (131): [<ffffffffaf7936a5>] _raw_spin_unlock_irqrestore+0x35/0x50
[    1.481296] hardirqs last disabled at (132): [<ffffffffaf79345b>] _raw_spin_lock_irqsave+0x5b/0x60
[    1.481296] softirqs last  enabled at (54): [<ffffffffae69201c>] rcu_tasks_kthread+0x16c/0x250
[    1.481296] softirqs last disabled at (50): [<ffffffffae69201c>] rcu_tasks_kthread+0x16c/0x250
[    1.481296] ---[ end trace 5a00c61d8412a9ac ]---


Links:
----
 - https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.15.y/build/v5.15.133-184-g6f28ecf24aef/testrun/20260259/suite/log-parser-boot/test/check-kernel-exception/log
 - https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.15.y/build/v5.15.133-184-g6f28ecf24aef/testrun/20260259/suite/log-parser-boot/tests/
 Build: https://storage.tuxsuite.com/public/linaro/lkft/builds/2WJFhcfqqG69pqj6LWuI14kVoP5/

Steps to reproduce:
--------
 - https://storage.tuxsuite.com/public/linaro/lkft/builds/2WJFhcfqqG69pqj6LWuI14kVoP5/tuxmake_reproducer.sh

## Build
* kernel: 5.15.134-rc1
* git: https://gitlab.com/Linaro/lkft/mirrors/stable/linux-stable-rc
* git branch: linux-5.15.y
* git commit: 6f28ecf24aef2896f4071dc6268d3fb5f8259c77
* git describe: v5.15.133-184-g6f28ecf24aef
* test details: https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.15.y/build/v5.15.133-184-g6f28ecf24aef

## Test Regressions (compared to v5.15.133)
* x86, log-parser-boot
  - check-kernel-exception
  - check-kernel-warning

* x86, log-parser-test
  - check-kernel-exception
  - check-kernel-warning


## Metric Regressions (compared to v5.15.133)

## Test Fixes (compared to v5.15.133)

## Metric Fixes (compared to v5.15.133)

## Test result summary
total: 90392, pass: 71514, fail: 2557, skip: 16224, xfail: 97

## Build Summary
* arc: 4 total, 4 passed, 0 failed
* arm: 114 total, 114 passed, 0 failed
* arm64: 42 total, 42 passed, 0 failed
* i386: 32 total, 31 passed, 1 failed
* mips: 27 total, 26 passed, 1 failed
* parisc: 4 total, 4 passed, 0 failed
* powerpc: 26 total, 25 passed, 1 failed
* riscv: 11 total, 11 passed, 0 failed
* s390: 12 total, 11 passed, 1 failed
* sh: 13 total, 11 passed, 2 failed
* sparc: 8 total, 8 passed, 0 failed
* x86_64: 38 total, 38 passed, 0 failed

## Test suites summary
* boot
* kselftest-android
* kselftest-arm64
* kselftest-breakpoints
* kselftest-capabilities
* kselftest-cgroup
* kselftest-clone3
* kselftest-core
* kselftest-cpu-hotplug
* kselftest-cpufreq
* kselftest-drivers-dma-buf
* kselftest-efivarfs
* kselftest-exec
* kselftest-filesystems
* kselftest-filesystems-binderfs
* kselftest-filesystems-epoll
* kselftest-firmware
* kselftest-fpu
* kselftest-ftrace
* kselftest-futex
* kselftest-gpio
* kselftest-intel_pstate
* kselftest-ipc
* kselftest-ir
* kselftest-kcmp
* kselftest-kexec
* kselftest-kvm
* kselftest-lib
* kselftest-membarrier
* kselftest-memfd
* kselftest-memory-hotplug
* kselftest-mincore
* kselftest-mount
* kselftest-mqueue
* kselftest-net
* kselftest-net-forwarding
* kselftest-net-mptcp
* kselftest-netfilter
* kselftest-nsfs
* kselftest-openat2
* kselftest-pid_namespace
* kselftest-pidfd
* kselftest-proc
* kselftest-pstore
* kselftest-ptrace
* kselftest-rseq
* kselftest-rtc
* kselftest-seccomp
* kselftest-sigaltstack
* kselftest-size
* kselftest-splice
* kselftest-static_keys
* kselftest-sync
* kselftest-sysctl
* kselftest-tc-testing
* kselftest-timens
* kselftest-tmpfs
* kselftest-tpm2
* kselftest-user
* kselftest-user_events
* kselftest-vDSO
* kselftest-vm
* kselftest-watchdog
* kselftest-x86
* kselftest-zram
* kunit
* kvm-unit-tests
* libgpiod
* log-parser-boot
* log-parser-test
* ltp-cap_bounds
* ltp-commands
* ltp-containers
* ltp-controllers
* ltp-cpuhotplug
* ltp-crypto
* ltp-cve
* ltp-dio
* ltp-fcntl-locktests
* ltp-filecaps
* ltp-fs
* ltp-fs_bind
* ltp-fs_perms_simple
* ltp-fsx
* ltp-hugetlb
* ltp-io
* ltp-ipc
* ltp-math
* ltp-mm
* ltp-nptl
* ltp-pty
* ltp-sched
* ltp-securebits
* ltp-smoke
* ltp-syscalls
* ltp-tracing
* network-basic-tests
* perf
* rcutorture
* v4l2-compliance

--
Linaro LKFT
https://lkft.linaro.org


* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-05 17:49 ` [PATCH 5.15 000/183] 5.15.134-rc1 review Naresh Kamboju
@ 2023-10-06 16:20   ` Liam R. Howlett
  2023-10-06 16:47     ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Liam R. Howlett @ 2023-10-06 16:20 UTC
  To: Naresh Kamboju
  Cc: Greg Kroah-Hartman, stable, patches, linux-kernel, torvalds, akpm,
	linux, shuah, patches, lkft-triage, pavel, jonathanh, f.fainelli,
	sudipm.mukherjee, srw, rwarsow, conor, Chengming Zhou,
	Joel Fernandes, Peter Zijlstra, Ovidiu Panait, Ingo Molnar,
	Paul E. McKenney, rcu

* Naresh Kamboju <naresh.kamboju@linaro.org> [231005 13:49]:
> On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > This is the start of the stable review cycle for the 5.15.134 release.
> > There are 183 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> >         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > or in the git tree and branch at:
> >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > and the diffstat can be found below.
> >
> > thanks,
> >
> > greg k-h
> 
> Results from Linaro’s test farm.
> Regressions on x86.
> 
> The following kernel warning was noticed on x86 while booting stable-rc
> 5.15.134-rc1 with a kernel built using the selftests merge config.
> 
> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> 
> Has anyone else noticed this kernel warning?
> 
> It is always reproducible when booting x86 with this config.

From that config:
#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

#
# RCU Debugging
#
CONFIG_PROVE_RCU=y
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_TRACE=y
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging


> 
> x86 boot log:
> -----
> [    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake)
> (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> for Debian) 2.40) #1 SMP @1696443178
> ...
> [    1.480701] ------------[ cut here ]------------
> [    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> trc_inspect_reader+0x80/0xb0
> [    1.481296] Modules linked in:
> [    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> [    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.5 11/26/2020
> [    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0

This function has changed a lot, including the dropping of this
WARN_ON_ONCE().  The warning was replaced in 897ba84dc5aa ("rcu-tasks:
Handle idle tasks for recently offlined CPUs") with something that looks
equivalent so I'm not sure why it would not trigger in newer revisions.

Obviously the behaviour I changed was the test for the task being idle.
I am not sure how best to short-circuit that test from happening during
boot as I am not familiar with the RCU code.

It's also worth noting that the bug this fixes wasn't exposed until the
maple tree (added in v6.1) was used for the IRQ descriptors (added in
v6.5).



* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-06 16:20   ` Liam R. Howlett
@ 2023-10-06 16:47     ` Paul E. McKenney
  2023-10-06 17:57       ` Liam R. Howlett
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2023-10-06 16:47 UTC
  To: Liam R. Howlett
  Cc: Naresh Kamboju, Greg Kroah-Hartman, stable, patches, linux-kernel,
	torvalds, akpm, linux, shuah, patches, lkft-triage, pavel,
	jonathanh, f.fainelli, sudipm.mukherjee, srw, rwarsow, conor,
	Chengming Zhou, Joel Fernandes, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Fri, Oct 06, 2023 at 12:20:38PM -0400, Liam R. Howlett wrote:
> * Naresh Kamboju <naresh.kamboju@linaro.org> [231005 13:49]:
> > On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> > <gregkh@linuxfoundation.org> wrote:
> > >
> > > This is the start of the stable review cycle for the 5.15.134 release.
> > > There are 183 patches in this series, all will be posted as a response
> > > to this one.  If anyone has any issues with these being applied, please
> > > let me know.
> > >
> > > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > > Anything received after that time might be too late.
> > >
> > > The whole patch series can be found in one patch at:
> > >         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > > or in the git tree and branch at:
> > >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > and the diffstat can be found below.
> > >
> > > thanks,
> > >
> > > greg k-h
> > 
> > Results from Linaro’s test farm.
> > Regressions on x86.
> > 
> > The following kernel warning was noticed on x86 while booting stable-rc
> > 5.15.134-rc1 with a kernel built using the selftests merge config.
> > 
> > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > 
> > Has anyone else noticed this kernel warning?
> > 
> > It is always reproducible when booting x86 with this config.
> 
> From that config:
> #
> # RCU Subsystem
> #
> CONFIG_TREE_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_SRCU=y
> CONFIG_TREE_SRCU=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> # end of RCU Subsystem
> 
> #
> # RCU Debugging
> #
> CONFIG_PROVE_RCU=y
> # CONFIG_RCU_SCALE_TEST is not set
> # CONFIG_RCU_TORTURE_TEST is not set
> # CONFIG_RCU_REF_SCALE_TEST is not set
> CONFIG_RCU_CPU_STALL_TIMEOUT=21
> CONFIG_RCU_TRACE=y
> # CONFIG_RCU_EQS_DEBUG is not set
> # end of RCU Debugging
> 
> 
> > 
> > x86 boot log:
> > -----
> > [    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake)
> > (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> > for Debian) 2.40) #1 SMP @1696443178
> > ...
> > [    1.480701] ------------[ cut here ]------------
> > [    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> > trc_inspect_reader+0x80/0xb0
> > [    1.481296] Modules linked in:
> > [    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> > [    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > 2.5 11/26/2020
> > [    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
> 
> This function has changed a lot, including the dropping of this
> WARN_ON_ONCE().  The warning was replaced in 897ba84dc5aa ("rcu-tasks:
> Handle idle tasks for recently offlined CPUs") with something that looks
> equivalent so I'm not sure why it would not trigger in newer revisions.
> 
> Obviously the behaviour I changed was the test for the task being idle.
> I am not sure how best to short-circuit that test from happening during
> boot as I am not familiar with the RCU code.

The usual test for RCU's notion of early boot being completed is
(rcu_scheduler_active != RCU_SCHEDULER_INIT).

Except that "ofl" should always be false that early in boot, at least
in mainline.

> It's also worth noting that the bug this fixes wasn't exposed until the
> maple tree (added in v6.1) was used for the IRQ descriptors (added in
> v6.5).

Lots of latent bugs, to be sure, even with rcutorture.  :-/

							Thanx, Paul


* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-06 16:47     ` Paul E. McKenney
@ 2023-10-06 17:57       ` Liam R. Howlett
  2023-10-06 18:20         ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Liam R. Howlett @ 2023-10-06 17:57 UTC
  To: Paul E. McKenney
  Cc: Naresh Kamboju, Greg Kroah-Hartman, stable, patches, linux-kernel,
	torvalds, akpm, linux, shuah, patches, lkft-triage, pavel,
	jonathanh, f.fainelli, sudipm.mukherjee, srw, rwarsow, conor,
	Chengming Zhou, Joel Fernandes, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

* Paul E. McKenney <paulmck@kernel.org> [231006 12:47]:
> On Fri, Oct 06, 2023 at 12:20:38PM -0400, Liam R. Howlett wrote:
> > * Naresh Kamboju <naresh.kamboju@linaro.org> [231005 13:49]:
> > > On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> > > <gregkh@linuxfoundation.org> wrote:
> > > >
> > > > This is the start of the stable review cycle for the 5.15.134 release.
> > > > There are 183 patches in this series, all will be posted as a response
> > > > to this one.  If anyone has any issues with these being applied, please
> > > > let me know.
> > > >
> > > > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > > > Anything received after that time might be too late.
> > > >
> > > > The whole patch series can be found in one patch at:
> > > >         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > > > or in the git tree and branch at:
> > > >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > and the diffstat can be found below.
> > > >
> > > > thanks,
> > > >
> > > > greg k-h
> > > 
> > > Results from Linaro’s test farm.
> > > Regressions on x86.
> > > 
> > > Following kernel warning noticed on x86 while booting stable-rc 5.15.134-rc1
> > > with selftest merge config built kernel.
> > > 
> > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > 
> > > Anyone noticed this kernel warning ?
> > > 
> > > This is always reproducible while booting x86 with a given config.
> > 
> > From that config:
> > #
> > # RCU Subsystem
> > #
> > CONFIG_TREE_RCU=y
> > # CONFIG_RCU_EXPERT is not set
> > CONFIG_SRCU=y
> > CONFIG_TREE_SRCU=y
> > CONFIG_TASKS_RCU_GENERIC=y
> > CONFIG_TASKS_RUDE_RCU=y
> > CONFIG_TASKS_TRACE_RCU=y
> > CONFIG_RCU_STALL_COMMON=y
> > CONFIG_RCU_NEED_SEGCBLIST=y
> > # end of RCU Subsystem    
> > 
> > #
> > # RCU Debugging
> > #
> > CONFIG_PROVE_RCU=y
> > # CONFIG_RCU_SCALE_TEST is not set
> > # CONFIG_RCU_TORTURE_TEST is not set
> > # CONFIG_RCU_REF_SCALE_TEST is not set
> > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > CONFIG_RCU_TRACE=y
> > # CONFIG_RCU_EQS_DEBUG is not set
> > # end of RCU Debugging
> > 
> > 
> > > 
> > > x86 boot log:
> > > -----
> > > [    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake)
> > > (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> > > for Debian) 2.40) #1 SMP @1696443178
> > > ...
> > > [    1.480701] ------------[ cut here ]------------
> > > [    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> > > trc_inspect_reader+0x80/0xb0
> > > [    1.481296] Modules linked in:
> > > [    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> > > [    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > 2.5 11/26/2020
> > > [    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
> > 
> > This function has changed a lot, including the dropping of this
> > WARN_ON_ONCE().  The warning was replaced in 897ba84dc5aa ("rcu-tasks:
> > Handle idle tasks for recently offlined CPUs") with something that looks
> > equivalent so I'm not sure why it would not trigger in newer revisions.
> > 
> > Obviously the behaviour I changed was the test for the task being idle.
> > I am not sure how best to short-circuit that test from happening during
> > boot as I am not familiar with the RCU code.
> 
> The usual test for RCU's notion of early boot being completed is
> (rcu_scheduler_active != RCU_SCHEDULER_INIT).
> 
> Except that "ofl" should always be false that early in boot, at least
> in mainline.

Is this still true in the final version of the patch where we set the
boot task as !idle until just before the early boot is finished?  I
wouldn't think of this as 'early in boot' anymore as much as the entire
kernel setup.  Maybe we need to shorten the time we stay in !idle mode
for earlier kernels?

How frequently is this function called?  We could check something for
early boot... or track down where the cpu is put online and restore idle
before that happens?

> 
> > It's also worth noting that the bug this fixes wasn't exposed until the
> > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > v6.5).
> 
> Lots of latent bugs, to be sure, even with rcutorture.  :-/

The Right Thing is to fix the bug all the way back to the introduction,
but what fallout makes the backport less desirable than living with the
unexposed bug?


Thanks,
Liam


* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-06 17:57       ` Liam R. Howlett
@ 2023-10-06 18:20         ` Paul E. McKenney
  2023-10-08  1:22           ` Joel Fernandes
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2023-10-06 18:20 UTC
  To: Liam R. Howlett
  Cc: Naresh Kamboju, Greg Kroah-Hartman, stable, patches, linux-kernel,
	torvalds, akpm, linux, shuah, patches, lkft-triage, pavel,
	jonathanh, f.fainelli, sudipm.mukherjee, srw, rwarsow, conor,
	Chengming Zhou, Joel Fernandes, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Fri, Oct 06, 2023 at 01:57:14PM -0400, Liam R. Howlett wrote:
> * Paul E. McKenney <paulmck@kernel.org> [231006 12:47]:
> > On Fri, Oct 06, 2023 at 12:20:38PM -0400, Liam R. Howlett wrote:
> > > * Naresh Kamboju <naresh.kamboju@linaro.org> [231005 13:49]:
> > > > On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> > > > <gregkh@linuxfoundation.org> wrote:
> > > > >
> > > > > This is the start of the stable review cycle for the 5.15.134 release.
> > > > > There are 183 patches in this series, all will be posted as a response
> > > > > to this one.  If anyone has any issues with these being applied, please
> > > > > let me know.
> > > > >
> > > > > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > > > > Anything received after that time might be too late.
> > > > >
> > > > > The whole patch series can be found in one patch at:
> > > > >         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > > > > or in the git tree and branch at:
> > > > >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > and the diffstat can be found below.
> > > > >
> > > > > thanks,
> > > > >
> > > > > greg k-h
> > > > 
> > > > Results from Linaro’s test farm.
> > > > Regressions on x86.
> > > > 
> > > > Following kernel warning noticed on x86 while booting stable-rc 5.15.134-rc1
> > > > with selftest merge config built kernel.
> > > > 
> > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > 
> > > > Anyone noticed this kernel warning ?
> > > > 
> > > > This is always reproducible while booting x86 with a given config.
> > > 
> > > From that config:
> > > #
> > > # RCU Subsystem
> > > #
> > > CONFIG_TREE_RCU=y
> > > # CONFIG_RCU_EXPERT is not set
> > > CONFIG_SRCU=y
> > > CONFIG_TREE_SRCU=y
> > > CONFIG_TASKS_RCU_GENERIC=y
> > > CONFIG_TASKS_RUDE_RCU=y
> > > CONFIG_TASKS_TRACE_RCU=y
> > > CONFIG_RCU_STALL_COMMON=y
> > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > # end of RCU Subsystem    
> > > 
> > > #
> > > # RCU Debugging
> > > #
> > > CONFIG_PROVE_RCU=y
> > > # CONFIG_RCU_SCALE_TEST is not set
> > > # CONFIG_RCU_TORTURE_TEST is not set
> > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > CONFIG_RCU_TRACE=y
> > > # CONFIG_RCU_EQS_DEBUG is not set
> > > # end of RCU Debugging
> > > 
> > > 
> > > > 
> > > > x86 boot log:
> > > > -----
> > > > [    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake)
> > > > (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> > > > for Debian) 2.40) #1 SMP @1696443178
> > > > ...
> > > > [    1.480701] ------------[ cut here ]------------
> > > > [    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> > > > trc_inspect_reader+0x80/0xb0
> > > > [    1.481296] Modules linked in:
> > > > [    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> > > > [    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.5 11/26/2020
> > > > [    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
> > > 
> > > This function has changed a lot, including the dropping of this
> > > WARN_ON_ONCE().  The warning was replaced in 897ba84dc5aa ("rcu-tasks:
> > > Handle idle tasks for recently offlined CPUs") with something that looks
> > > equivalent so I'm not sure why it would not trigger in newer revisions.
> > > 
> > > Obviously the behaviour I changed was the test for the task being idle.
> > > I am not sure how best to short-circuit that test from happening during
> > > boot as I am not familiar with the RCU code.
> > 
> > The usual test for RCU's notion of early boot being completed is
> > (rcu_scheduler_active != RCU_SCHEDULER_INIT).
> > 
> > Except that "ofl" should always be false that early in boot, at least
> > in mainline.
> 
> Is this still true in the final version of the patch where we set the
> boot task as !idle until just before the early boot is finished?  I
> wouldn't think of this as 'early in boot' anymore as much as the entire
> kernel setup.  Maybe we need to shorten the time we stay in !idle mode
> for earlier kernels?

In mainline, the ofl variable is defined as cpu_is_offline(cpu), and
during boot, the boot CPU is guaranteed to be online.  (As opposed to
the boot CPU's idle-task state.)
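
For illustration only, here is a minimal user-space sketch of that
distinction.  It is not the kernel code: the model_* names and the
four-CPU array are hypothetical, and the shape of the check (complain
when a running task sits on an offline CPU and is not an idle task) is
an assumption inferred from this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for this sketch */

/* Hypothetical stand-in for the kernel's CPU online mask: only CPU 0 is up. */
static bool model_cpu_online[MODEL_NR_CPUS] = { true, false, false, false };

/* Hypothetical stand-in for the task state the check looks at. */
struct model_task {
	int cpu;	/* CPU the task is running on */
	bool is_idle;	/* models the task being an idle task */
};

/* Models cpu_is_offline(cpu): true when the CPU is not marked online. */
static bool model_cpu_is_offline(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return !model_cpu_online[cpu];
}

/*
 * Assumed shape of the check: complain only when the CPU is offline
 * and the task running there is not an idle task.
 */
static bool model_should_warn(const struct model_task *t)
{
	bool ofl = model_cpu_is_offline(t->cpu);

	return ofl && !t->is_idle;
}
```

In this model, a task on the online CPU 0 never trips the check,
whatever its idle state; only a non-idle task on a CPU marked offline
does.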

> How frequently is this function called?  We could check something for
> early boot... or track down where the cpu is put online and restore idle
> before that happens?

Once per RCU Tasks Trace grace period per reader seen to be blocking
that grace period.  Its performance is an issue, but not to anywhere
near the same extent as (say) rcu_read_lock_trace().

> > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > v6.5).
> > 
> > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> 
> The Right Thing is to fix the bug all the way back to the introduction,
> but what fallout makes the backport less desirable than living with the
> unexposed bug?

You are quite right that it is possible for the risk of a backport to
exceed the risk of the original bug.

I defer to Joel (CCed) on how best to resolve this in -stable.

							Thanx, Paul

* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-06 18:20         ` Paul E. McKenney
@ 2023-10-08  1:22           ` Joel Fernandes
  2023-10-09  1:20             ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2023-10-08  1:22 UTC
  To: paulmck
  Cc: Liam R. Howlett, Naresh Kamboju, Greg Kroah-Hartman, stable,
	patches, linux-kernel, torvalds, akpm, linux, shuah, patches,
	lkft-triage, pavel, jonathanh, f.fainelli, sudipm.mukherjee, srw,
	rwarsow, conor, Chengming Zhou, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Fri, Oct 6, 2023 at 2:20 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Fri, Oct 06, 2023 at 01:57:14PM -0400, Liam R. Howlett wrote:
> > * Paul E. McKenney <paulmck@kernel.org> [231006 12:47]:
> > > On Fri, Oct 06, 2023 at 12:20:38PM -0400, Liam R. Howlett wrote:
> > > > * Naresh Kamboju <naresh.kamboju@linaro.org> [231005 13:49]:
> > > > > On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > >
> > > > > > This is the start of the stable review cycle for the 5.15.134 release.
> > > > > > There are 183 patches in this series, all will be posted as a response
> > > > > > to this one.  If anyone has any issues with these being applied, please
> > > > > > let me know.
> > > > > >
> > > > > > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > > > > > Anything received after that time might be too late.
> > > > > >
> > > > > > The whole patch series can be found in one patch at:
> > > > > >         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > > > > > or in the git tree and branch at:
> > > > > >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > and the diffstat can be found below.
> > > > > >
> > > > > > thanks,
> > > > > >
> > > > > > greg k-h
> > > > >
> > > > > Results from Linaro’s test farm.
> > > > > Regressions on x86.
> > > > >
> > > > > Following kernel warning noticed on x86 while booting stable-rc 5.15.134-rc1
> > > > > with selftest merge config built kernel.
> > > > >
> > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > >
> > > > > Anyone noticed this kernel warning ?
> > > > >
> > > > > This is always reproducible while booting x86 with a given config.
> > > >
> > > > > From that config:
> > > > #
> > > > # RCU Subsystem
> > > > #
> > > > CONFIG_TREE_RCU=y
> > > > # CONFIG_RCU_EXPERT is not set
> > > > CONFIG_SRCU=y
> > > > CONFIG_TREE_SRCU=y
> > > > CONFIG_TASKS_RCU_GENERIC=y
> > > > CONFIG_TASKS_RUDE_RCU=y
> > > > CONFIG_TASKS_TRACE_RCU=y
> > > > CONFIG_RCU_STALL_COMMON=y
> > > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > > # end of RCU Subsystem
> > > >
> > > > #
> > > > # RCU Debugging
> > > > #
> > > > CONFIG_PROVE_RCU=y
> > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > > CONFIG_RCU_TRACE=y
> > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > # end of RCU Debugging
> > > >
> > > >
> > > > >
> > > > > x86 boot log:
> > > > > -----
> > > > > [    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake)
> > > > > (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> > > > > for Debian) 2.40) #1 SMP @1696443178
> > > > > ...
> > > > > [    1.480701] ------------[ cut here ]------------
> > > > > [    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> > > > > trc_inspect_reader+0x80/0xb0
> > > > > [    1.481296] Modules linked in:
> > > > > [    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> > > > > [    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > 2.5 11/26/2020
> > > > > [    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
> > > >
> > > > This function has changed a lot, including the dropping of this
> > > > WARN_ON_ONCE().  The warning was replaced in 897ba84dc5aa ("rcu-tasks:
> > > > Handle idle tasks for recently offlined CPUs") with something that looks
> > > > equivalent so I'm not sure why it would not trigger in newer revisions.
> > > >
> > > > Obviously the behaviour I changed was the test for the task being idle.
> > > > I am not sure how best to short-circuit that test from happening during
> > > > boot as I am not familiar with the RCU code.
> > >
> > > The usual test for RCU's notion of early boot being completed is
> > > (rcu_scheduler_active != RCU_SCHEDULER_INIT).
> > >
> > > Except that "ofl" should always be false that early in boot, at least
> > > in mainline.
> >
> > Is this still true in the final version of the patch where we set the
> > boot task as !idle until just before the early boot is finished?  I
> > wouldn't think of this as 'early in boot' anymore as much as the entire
> > kernel setup.  Maybe we need to shorten the time we stay in !idle mode
> > for earlier kernels?
>
> In mainline, the ofl variable is defined as cpu_is_offline(cpu), and
> during boot, the boot CPU is guaranteed to be online.  (As opposed to
> the boot CPU's idle-task state.)
>
> > > How frequently is this function called?  We could check something for
> > early boot... or track down where the cpu is put online and restore idle
> > before that happens?
>
> Once per RCU Tasks Trace grace period per reader seen to be blocking
> > that grace period.  Its performance is an issue, but not to anywhere
> near the same extent as (say) rcu_read_lock_trace().
>
> > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > v6.5).
> > >
> > > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> >
> > The Right Thing is to fix the bug all the way back to the introduction,
> > but what fallout makes the backport less desirable than living with the
> > unexposed bug?
>
> You are quite right that it is possible for the risk of a backport to
> exceed the risk of the original bug.
>
> I defer to Joel (CCed) on how best to resolve this in -stable.

Maybe I am missing something, but this issue should also be happening
in mainline, right?

Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
for recently offlined CPUs"), the warning should still be happening
due to Liam's "kernel/sched: Modify initial boot task idle setup"
because the warning is just rearranged a bit but essentially the same.

IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
fix it in mainline (using the ideas described in this thread), then
backport both that new fix and Liam's patch to 5.15.

Or is there a reason this warning does not show up in mainline?

My impression is that dropping Liam's patch for the stable release and
revisiting it later is a better approach since tiny RCU is used way
less in the wild than tree/tasks RCU. Thoughts?

thanks,

 - Joel

* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-08  1:22           ` Joel Fernandes
@ 2023-10-09  1:20             ` Paul E. McKenney
  2023-10-11  1:34               ` Paul E. McKenney
  2023-10-11  2:44               ` Joel Fernandes
  0 siblings, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2023-10-09  1:20 UTC
  To: Joel Fernandes
  Cc: Liam R. Howlett, Naresh Kamboju, Greg Kroah-Hartman, stable,
	patches, linux-kernel, torvalds, akpm, linux, shuah, patches,
	lkft-triage, pavel, jonathanh, f.fainelli, sudipm.mukherjee, srw,
	rwarsow, conor, Chengming Zhou, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Sat, Oct 07, 2023 at 09:22:55PM -0400, Joel Fernandes wrote:
> On Fri, Oct 6, 2023 at 2:20 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Fri, Oct 06, 2023 at 01:57:14PM -0400, Liam R. Howlett wrote:
> > > * Paul E. McKenney <paulmck@kernel.org> [231006 12:47]:
> > > > On Fri, Oct 06, 2023 at 12:20:38PM -0400, Liam R. Howlett wrote:
> > > > > * Naresh Kamboju <naresh.kamboju@linaro.org> [231005 13:49]:
> > > > > > On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > >
> > > > > > > This is the start of the stable review cycle for the 5.15.134 release.
> > > > > > > There are 183 patches in this series, all will be posted as a response
> > > > > > > to this one.  If anyone has any issues with these being applied, please
> > > > > > > let me know.
> > > > > > >
> > > > > > > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > > > > > > Anything received after that time might be too late.
> > > > > > >
> > > > > > > The whole patch series can be found in one patch at:
> > > > > > >         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > > > > > > or in the git tree and branch at:
> > > > > > >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > > and the diffstat can be found below.
> > > > > > >
> > > > > > > thanks,
> > > > > > >
> > > > > > > greg k-h
> > > > > >
> > > > > > Results from Linaro’s test farm.
> > > > > > Regressions on x86.
> > > > > >
> > > > > > Following kernel warning noticed on x86 while booting stable-rc 5.15.134-rc1
> > > > > > with selftest merge config built kernel.
> > > > > >
> > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > >
> > > > > > Anyone noticed this kernel warning ?
> > > > > >
> > > > > > This is always reproducible while booting x86 with a given config.
> > > > >
> > > > > > From that config:
> > > > > #
> > > > > # RCU Subsystem
> > > > > #
> > > > > CONFIG_TREE_RCU=y
> > > > > # CONFIG_RCU_EXPERT is not set
> > > > > CONFIG_SRCU=y
> > > > > CONFIG_TREE_SRCU=y
> > > > > CONFIG_TASKS_RCU_GENERIC=y
> > > > > CONFIG_TASKS_RUDE_RCU=y
> > > > > CONFIG_TASKS_TRACE_RCU=y
> > > > > CONFIG_RCU_STALL_COMMON=y
> > > > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > > > # end of RCU Subsystem
> > > > >
> > > > > #
> > > > > # RCU Debugging
> > > > > #
> > > > > CONFIG_PROVE_RCU=y
> > > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > > > CONFIG_RCU_TRACE=y
> > > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > > # end of RCU Debugging
> > > > >
> > > > >
> > > > > >
> > > > > > x86 boot log:
> > > > > > -----
> > > > > > [    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake)
> > > > > > (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> > > > > > for Debian) 2.40) #1 SMP @1696443178
> > > > > > ...
> > > > > > [    1.480701] ------------[ cut here ]------------
> > > > > > [    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> > > > > > trc_inspect_reader+0x80/0xb0
> > > > > > [    1.481296] Modules linked in:
> > > > > > [    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> > > > > > [    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > > 2.5 11/26/2020
> > > > > > [    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
> > > > >
> > > > > This function has changed a lot, including the dropping of this
> > > > > WARN_ON_ONCE().  The warning was replaced in 897ba84dc5aa ("rcu-tasks:
> > > > > Handle idle tasks for recently offlined CPUs") with something that looks
> > > > > equivalent so I'm not sure why it would not trigger in newer revisions.
> > > > >
> > > > > Obviously the behaviour I changed was the test for the task being idle.
> > > > > I am not sure how best to short-circuit that test from happening during
> > > > > boot as I am not familiar with the RCU code.
> > > >
> > > > The usual test for RCU's notion of early boot being completed is
> > > > (rcu_scheduler_active != RCU_SCHEDULER_INIT).
> > > >
> > > > Except that "ofl" should always be false that early in boot, at least
> > > > in mainline.
> > >
> > > Is this still true in the final version of the patch where we set the
> > > boot task as !idle until just before the early boot is finished?  I
> > > wouldn't think of this as 'early in boot' anymore as much as the entire
> > > kernel setup.  Maybe we need to shorten the time we stay in !idle mode
> > > for earlier kernels?
> >
> > In mainline, the ofl variable is defined as cpu_is_offline(cpu), and
> > during boot, the boot CPU is guaranteed to be online.  (As opposed to
> > the boot CPU's idle-task state.)
> >
> > > > How frequently is this function called?  We could check something for
> > > early boot... or track down where the cpu is put online and restore idle
> > > before that happens?
> >
> > Once per RCU Tasks Trace grace period per reader seen to be blocking
> > > that grace period.  Its performance is an issue, but not to anywhere
> > near the same extent as (say) rcu_read_lock_trace().
> >
> > > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > > v6.5).
> > > >
> > > > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> > >
> > > The Right Thing is to fix the bug all the way back to the introduction,
> > > but what fallout makes the backport less desirable than living with the
> > > unexposed bug?
> >
> > You are quite right that it is possible for the risk of a backport to
> > exceed the risk of the original bug.
> >
> > I defer to Joel (CCed) on how best to resolve this in -stable.
> 
> Maybe I am missing something, but this issue should also be happening
> in mainline, right?
> 
> Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
> for recently offlined CPUs"), the warning should still be happening
> due to Liam's "kernel/sched: Modify initial boot task idle setup"
> because the warning is just rearranged a bit but essentially the same.
> 
> IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
> fix it in mainline (using the ideas described in this thread), then
> backport both that new fix and Liam's patch to 5.15.
> 
> Or is there a reason this warning does not show up in mainline?
> 
> My impression is that dropping Liam's patch for the stable release and
> revisiting it later is a better approach since tiny RCU is used way
> less in the wild than tree/tasks RCU. Thoughts?

I think that this one is strange enough that we need to write down the
situation in detail, make sure we have all the corner cases covered in
both mainline and -stable, and decide what to do from there.

Yes, I know, this email thread contains much of this information, but
a little organizing of it would be good.

Would you like to put that together, or should I?  If me, I will get
a draft out by the end of this coming Tuesday, Pacific Time.

							Thanx, Paul

* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-09  1:20             ` Paul E. McKenney
@ 2023-10-11  1:34               ` Paul E. McKenney
  2023-10-11  5:05                 ` Joel Fernandes
  2023-10-11 13:47                 ` Frederic Weisbecker
  2023-10-11  2:44               ` Joel Fernandes
  1 sibling, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2023-10-11  1:34 UTC
  To: Joel Fernandes
  Cc: Liam R. Howlett, Naresh Kamboju, Greg Kroah-Hartman, stable,
	patches, linux-kernel, torvalds, akpm, linux, shuah, patches,
	lkft-triage, pavel, jonathanh, f.fainelli, sudipm.mukherjee, srw,
	rwarsow, conor, Chengming Zhou, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Sun, Oct 08, 2023 at 06:20:53PM -0700, Paul E. McKenney wrote:
> On Sat, Oct 07, 2023 at 09:22:55PM -0400, Joel Fernandes wrote:
> > On Fri, Oct 6, 2023 at 2:20 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Fri, Oct 06, 2023 at 01:57:14PM -0400, Liam R. Howlett wrote:
> > > > * Paul E. McKenney <paulmck@kernel.org> [231006 12:47]:
> > > > > On Fri, Oct 06, 2023 at 12:20:38PM -0400, Liam R. Howlett wrote:
> > > > > > * Naresh Kamboju <naresh.kamboju@linaro.org> [231005 13:49]:
> > > > > > > On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> > > > > > > <gregkh@linuxfoundation.org> wrote:
> > > > > > > >
> > > > > > > > This is the start of the stable review cycle for the 5.15.134 release.
> > > > > > > > There are 183 patches in this series, all will be posted as a response
> > > > > > > > to this one.  If anyone has any issues with these being applied, please
> > > > > > > > let me know.
> > > > > > > >
> > > > > > > > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > > > > > > > Anything received after that time might be too late.
> > > > > > > >
> > > > > > > > The whole patch series can be found in one patch at:
> > > > > > > >         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > > > > > > > or in the git tree and branch at:
> > > > > > > >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > > > and the diffstat can be found below.
> > > > > > > >
> > > > > > > > thanks,
> > > > > > > >
> > > > > > > > greg k-h
> > > > > > >
> > > > > > > Results from Linaro’s test farm.
> > > > > > > Regressions on x86.
> > > > > > >
> > > > > > > Following kernel warning noticed on x86 while booting stable-rc 5.15.134-rc1
> > > > > > > with selftest merge config built kernel.
> > > > > > >
> > > > > > > Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
> > > > > > >
> > > > > > > Anyone noticed this kernel warning ?
> > > > > > >
> > > > > > > This is always reproducible while booting x86 with a given config.
> > > > > >
> > > > > > > From that config:
> > > > > > #
> > > > > > # RCU Subsystem
> > > > > > #
> > > > > > CONFIG_TREE_RCU=y
> > > > > > # CONFIG_RCU_EXPERT is not set
> > > > > > CONFIG_SRCU=y
> > > > > > CONFIG_TREE_SRCU=y
> > > > > > CONFIG_TASKS_RCU_GENERIC=y
> > > > > > CONFIG_TASKS_RUDE_RCU=y
> > > > > > CONFIG_TASKS_TRACE_RCU=y
> > > > > > CONFIG_RCU_STALL_COMMON=y
> > > > > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > > > > # end of RCU Subsystem
> > > > > >
> > > > > > #
> > > > > > # RCU Debugging
> > > > > > #
> > > > > > CONFIG_PROVE_RCU=y
> > > > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > > > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > > > > CONFIG_RCU_TRACE=y
> > > > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > > > # end of RCU Debugging
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > x86 boot log:
> > > > > > > -----
> > > > > > > [    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake)
> > > > > > > (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> > > > > > > for Debian) 2.40) #1 SMP @1696443178
> > > > > > > ...
> > > > > > > [    1.480701] ------------[ cut here ]------------
> > > > > > > [    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> > > > > > > trc_inspect_reader+0x80/0xb0
> > > > > > > [    1.481296] Modules linked in:
> > > > > > > [    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> > > > > > > [    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > > > 2.5 11/26/2020
> > > > > > > [    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
> > > > > >
> > > > > > This function has changed a lot, including the dropping of this
> > > > > > WARN_ON_ONCE().  The warning was replaced in 897ba84dc5aa ("rcu-tasks:
> > > > > > Handle idle tasks for recently offlined CPUs") with something that looks
> > > > > > equivalent so I'm not sure why it would not trigger in newer revisions.
> > > > > >
> > > > > > Obviously the behaviour I changed was the test for the task being idle.
> > > > > > I am not sure how best to short-circuit that test from happening during
> > > > > > boot as I am not familiar with the RCU code.
> > > > >
> > > > > The usual test for RCU's notion of early boot being completed is
> > > > > (rcu_scheduler_active != RCU_SCHEDULER_INIT).
> > > > >
> > > > > Except that "ofl" should always be false that early in boot, at least
> > > > > in mainline.
> > > >
> > > > Is this still true in the final version of the patch where we set the
> > > > boot task as !idle until just before the early boot is finished?  I
> > > > wouldn't think of this as 'early in boot' anymore as much as the entire
> > > > kernel setup.  Maybe we need to shorten the time we stay in !idle mode
> > > > for earlier kernels?
> > >
> > > In mainline, the ofl variable is defined as cpu_is_offline(cpu), and
> > > during boot, the boot CPU is guaranteed to be online.  (As opposed to
> > > the boot CPU's idle-task state.)
> > >
> > > > How frequently is this function called?  We could check something for
> > > > early boot... or track down where the cpu is put online and restore idle
> > > > before that happens?
> > >
> > > Once per RCU Tasks Trace grace period per reader seen to be blocking
> > > that grace period.  Its performance is an issue, but not to anywhere
> > > near the same extent as (say) rcu_read_lock_trace().
> > >
> > > > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > > > v6.5).
> > > > >
> > > > > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> > > >
> > > > The Right Thing is to fix the bug all the way back to the introduction,
> > > > but what fallout makes the backport less desirable than living with the
> > > > unexposed bug?
> > >
> > > You are quite right that it is possible for the risk of a backport to
> > > exceed the risk of the original bug.
> > >
> > > I defer to Joel (CCed) on how best to resolve this in -stable.
> > 
> > Maybe I am missing something but this issue should also be happening
> > in mainline right?
> > 
> > Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
> > for recently offlined CPUs") , the warning should still be happening
> > due to Liam's "kernel/sched: Modify initial boot task idle setup"
> > because the warning is just rearranged a bit but essentially the same.
> > 
> > IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
> > fix it in mainline (using the ideas described in this thread), then
> > backport both that new fix and Liam's patch to 5.15.
> > 
> > Or is there a reason this warning does not show up on the mainline?

There is not a whole lot of commonality between the v5.15.134 version of
RCU Tasks Trace and that of mainline.  In theory, in mainline, CPU hotplug
is supposed to be disabled across all calls to trc_inspect_reader(),
which means that there would not be any CPU coming or going.

But there could potentially be some time between when a CPU was
marked as online and its idle task was marked PF_IDLE.  And in
fact x86 start_secondary() invokes set_cpu_online() before it calls
cpu_startup_entry(), and it is the latter that sets PF_IDLE.

The same is true of alpha, arc, arm, arm64, csky, ia64, loongarch, mips,
openrisc, parisc, powerpc, riscv, s390, sh, sparc32, sparc64, x86 xen,
and xtensa, which is everybody.

One reason why my testing did not reproduce this is because I was running
against v6.6-rc1, and cff9b2332ab7 ("kernel/sched: Modify initial boot
task idle setup") went into v6.6-rc3.  An initial run merging in current
mainline also failed to reproduce this, but I am running overnight.
If that doesn't reproduce, I will try inserting delays between the
set_cpu_online() and the cpu_startup_entry().
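
The ordering described above can be pictured with a small userspace
model.  This is only a sketch, not kernel code: struct cpu_model, enum
bringup_step, model_state_at() and in_online_nonidle_window() are names
made up for illustration, and the two flags merely stand in for the
CPU-online bit and the idle task's PF_IDLE flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only: stand-ins for the online bit and PF_IDLE. */
struct cpu_model {
	bool online;	/* models set_cpu_online() having marked the CPU online */
	bool pf_idle;	/* models PF_IDLE being set on that CPU's idle task */
};

/* Hypothetical steps along the start_secondary() path. */
enum bringup_step {
	STEP_BEFORE_ONLINE,	/* CPU is running, not yet marked online */
	STEP_AFTER_ONLINE,	/* set_cpu_online() done, cpu_startup_entry() not yet */
	STEP_IN_IDLE_LOOP,	/* cpu_startup_entry() has set PF_IDLE */
};

/* Return the modeled flags as they would look at the given step. */
struct cpu_model model_state_at(enum bringup_step step)
{
	struct cpu_model m = { .online = false, .pf_idle = false };

	assert(step <= STEP_IN_IDLE_LOOP);
	if (step >= STEP_AFTER_ONLINE)
		m.online = true;
	if (step >= STEP_IN_IDLE_LOOP)
		m.pf_idle = true;
	return m;
}

/* True while the CPU is online but its idle task is not yet PF_IDLE. */
bool in_online_nonidle_window(struct cpu_model m)
{
	return m.online && !m.pf_idle;
}
```

In this model only STEP_AFTER_ONLINE falls inside the window; inserting
delays between the two calls would widen that step without changing
the other two.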

If this problem is real, fixes include:

o	Revert Liam's patch and make Tiny RCU's call_rcu() deal with
	the problem.  This is overhead and non-tinyness, but to Joel's
	point, it might be best.

o	Go back to something more like Liam's original patch, which
	cleared PF_IDLE only for the boot CPU.

o	Set PF_IDLE before calling set_cpu_online().  This would work,
	but it would also be rather ugly, reaching into each and every
	architecture.

o	Move the call to set_cpu_online() into cpu_startup_entry().
	This would require some serious inspection to prove that it is
	safe, assuming that it is in fact safe.

o	Drop the WARN_ON_ONCE() from trc_inspect_reader().  Not all
	that excited by losing this diagnostic, but then again it
	has been a while since it has caught anything.

o	Change the WARN_ON_ONCE() condition in trc_inspect_reader() into
	a "return false" to retry later.  Ditto, also not liking the
	possibility of indefinite deferral with no warning.

There are likely other approaches.
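
To make the concern about the "return false" option concrete, here is a
rough userspace model of it.  It is a sketch under stated assumptions,
not the kernel function: struct reader_snapshot, model_inspect_retry()
and model_attempts_needed() are hypothetical names, and the three
booleans only stand in for cpu_is_offline(), task_curr() and
is_idle_task().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the inspection sees; not kernel types. */
struct reader_snapshot {
	bool ofl;	/* models cpu_is_offline() for the task's CPU */
	bool curr;	/* models task_curr(t) */
	bool idle;	/* models is_idle_task(t) */
};

/*
 * Model of the "return false" variant: the state that used to warn now
 * defers the inspection.  Returns true when inspection may proceed.
 */
bool model_inspect_retry(struct reader_snapshot s)
{
	if (s.ofl && s.curr && !s.idle)
		return false;	/* retry later, silently */
	return true;
}

/*
 * Count attempts until the inspection proceeds.  The task is modeled as
 * non-idle for the first idle_after attempts.  Returns -1 if max_tries
 * is exhausted, which is the silent indefinite-deferral case.
 */
int model_attempts_needed(int idle_after, int max_tries)
{
	struct reader_snapshot s = { .ofl = true, .curr = true, .idle = false };

	assert(max_tries > 0);
	for (int attempt = 1; attempt <= max_tries; attempt++) {
		s.idle = (attempt > idle_after);
		if (model_inspect_retry(s))
			return attempt;
	}
	return -1;
}
```

If the task never becomes idle within the retry budget, the model
returns -1 and nothing has been reported, which is exactly the
drawback noted in the last bullet.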

> > My impression is that dropping Liam's patch for the stable release and
> > revisiting it later is a better approach since tiny RCU is used way
> > less in the wild than tree/tasks RCU. Thoughts?
> 
> I think that this one is strange enough that we need to write down the
> situation in detail, make sure we have all the corner cases covered in
> both mainline and -stable, and decide what to do from there.
> 
> Yes, I know, this email thread contains much of this information, but
> a little organizing of it would be good.
> 
> Would you like to put that together, or should I?  If me, I will get
> a draft out by the end of this coming Tuesday, Pacific Time.

And I guess that this is that draft.

It is quite possible that Tasks RCU also has issues with momentary
online non-idleness of non-boot-CPU idle tasks, but checking that is a
task for another time.

							Thanx, Paul


* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-09  1:20             ` Paul E. McKenney
  2023-10-11  1:34               ` Paul E. McKenney
@ 2023-10-11  2:44               ` Joel Fernandes
  2023-10-11  3:11                 ` Paul E. McKenney
  1 sibling, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2023-10-11  2:44 UTC (permalink / raw
  To: paulmck
  Cc: Liam R. Howlett, Naresh Kamboju, Greg Kroah-Hartman, stable,
	patches, linux-kernel, torvalds, akpm, linux, shuah, patches,
	lkft-triage, pavel, jonathanh, f.fainelli, sudipm.mukherjee, srw,
	rwarsow, conor, Chengming Zhou, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Sun, Oct 8, 2023 at 9:20 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
> > > > How frequently is this function called?  We could check something for
> > > > early boot... or track down where the cpu is put online and restore idle
> > > > before that happens?
> > >
> > > Once per RCU Tasks Trace grace period per reader seen to be blocking
> > > that grace period.  Its performance is an issue, but not to anywhere
> > > near the same extent as (say) rcu_read_lock_trace().
> > >
> > > > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > > > v6.5).
> > > > >
> > > > > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> > > >
> > > > The Right Thing is to fix the bug all the way back to the introduction,
> > > > but what fallout makes the backport less desirable than living with the
> > > > unexposed bug?
> > >
> > > You are quite right that it is possible for the risk of a backport to
> > > exceed the risk of the original bug.
> > >
> > > I defer to Joel (CCed) on how best to resolve this in -stable.
> >
> > Maybe I am missing something but this issue should also be happening
> > in mainline right?
> >
> > Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
> > for recently offlined CPUs") , the warning should still be happening
> > due to Liam's "kernel/sched: Modify initial boot task idle setup"
> > because the warning is just rearranged a bit but essentially the same.
> >
> > IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
> > fix it in mainline (using the ideas described in this thread), then
> > backport both that new fix and Liam's patch to 5.15.
> >
> > Or is there a reason this warning does not show up on the mainline?
> >
> > My impression is that dropping Liam's patch for the stable release and
> > revisiting it later is a better approach since tiny RCU is used way
> > less in the wild than tree/tasks RCU. Thoughts?
>
> I think that this one is strange enough that we need to write down the
> situation in detail, make sure we have all the corner cases covered in
> both mainline and -stable, and decide what to do from there.
>
> Yes, I know, this email thread contains much of this information, but
> a little organizing of it would be good.
>
> Would you like to put that together, or should I?  If me, I will get
> a draft out by the end of this coming Tuesday, Pacific Time.

I apologize, I haven't been able to do any real work as I was OOO for
the most part due to dental issues. I am about 25% back now. I will
review your other email writeup and thanks for putting it together!

 - Joel


* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-11  2:44               ` Joel Fernandes
@ 2023-10-11  3:11                 ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2023-10-11  3:11 UTC (permalink / raw
  To: Joel Fernandes
  Cc: Liam R. Howlett, Naresh Kamboju, Greg Kroah-Hartman, stable,
	patches, linux-kernel, torvalds, akpm, linux, shuah, patches,
	lkft-triage, pavel, jonathanh, f.fainelli, sudipm.mukherjee, srw,
	rwarsow, conor, Chengming Zhou, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Tue, Oct 10, 2023 at 10:44:16PM -0400, Joel Fernandes wrote:
> On Sun, Oct 8, 2023 at 9:20 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> [...]
> > > > > How frequently is this function called?  We could check something for
> > > > > early boot... or track down where the cpu is put online and restore idle
> > > > > before that happens?
> > > >
> > > > Once per RCU Tasks Trace grace period per reader seen to be blocking
> > > > that grace period.  Its performance is an issue, but not to anywhere
> > > > near the same extent as (say) rcu_read_lock_trace().
> > > >
> > > > > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > > > > v6.5).
> > > > > >
> > > > > > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> > > > >
> > > > > The Right Thing is to fix the bug all the way back to the introduction,
> > > > > but what fallout makes the backport less desirable than living with the
> > > > > unexposed bug?
> > > >
> > > > You are quite right that it is possible for the risk of a backport to
> > > > exceed the risk of the original bug.
> > > >
> > > > I defer to Joel (CCed) on how best to resolve this in -stable.
> > >
> > > Maybe I am missing something but this issue should also be happening
> > > in mainline right?
> > >
> > > Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
> > > for recently offlined CPUs") , the warning should still be happening
> > > due to Liam's "kernel/sched: Modify initial boot task idle setup"
> > > because the warning is just rearranged a bit but essentially the same.
> > >
> > > IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
> > > fix it in mainline (using the ideas described in this thread), then
> > > backport both that new fix and Liam's patch to 5.15.
> > >
> > > Or is there a reason this warning does not show up on the mainline?
> > >
> > > My impression is that dropping Liam's patch for the stable release and
> > > revisiting it later is a better approach since tiny RCU is used way
> > > less in the wild than tree/tasks RCU. Thoughts?
> >
> > I think that this one is strange enough that we need to write down the
> > situation in detail, make sure we have all the corner cases covered in
> > both mainline and -stable, and decide what to do from there.
> >
> > Yes, I know, this email thread contains much of this information, but
> > a little organizing of it would be good.
> >
> > Would you like to put that together, or should I?  If me, I will get
> > a draft out by the end of this coming Tuesday, Pacific Time.
> 
> I apologize, I haven't been able to do any real work as I was OOO for
> the most part due to dental issues. I am about 25% back now. I will
> review your other email writeup and thanks for putting it together!

No need to apologize!  If anything, it is I who should apologize for
not digging deeply into this to begin with.  As always, there were
distractions.  ;-)

							Thanx, Paul


* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-11  1:34               ` Paul E. McKenney
@ 2023-10-11  5:05                 ` Joel Fernandes
  2023-10-11 10:25                   ` Paul E. McKenney
  2023-10-11 13:47                 ` Frederic Weisbecker
  1 sibling, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2023-10-11  5:05 UTC (permalink / raw
  To: Paul E. McKenney
  Cc: Liam R. Howlett, Naresh Kamboju, Greg Kroah-Hartman, stable,
	patches, linux-kernel, torvalds, akpm, linux, shuah, patches,
	lkft-triage, pavel, jonathanh, f.fainelli, sudipm.mukherjee, srw,
	rwarsow, conor, Chengming Zhou, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Tue, Oct 10, 2023 at 06:34:35PM -0700, Paul E. McKenney wrote:
[...]
> > > > > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > > > > v6.5).
> > > > > >
> > > > > > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> > > > >
> > > > > The Right Thing is to fix the bug all the way back to the introduction,
> > > > > but what fallout makes the backport less desirable than living with the
> > > > > unexposed bug?
> > > >
> > > > You are quite right that it is possible for the risk of a backport to
> > > > exceed the risk of the original bug.
> > > >
> > > > I defer to Joel (CCed) on how best to resolve this in -stable.
> > > 
> > > Maybe I am missing something but this issue should also be happening
> > > in mainline right?
> > > 
> > > Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
> > > for recently offlined CPUs") , the warning should still be happening
> > > due to Liam's "kernel/sched: Modify initial boot task idle setup"
> > > because the warning is just rearranged a bit but essentially the same.
> > > 
> > > IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
> > > fix it in mainline (using the ideas described in this thread), then
> > > backport both that new fix and Liam's patch to 5.15.
> > > 
> > > Or is there a reason this warning does not show up on the mainline?
> 
> There is not a whole lot of commonality between the v5.15.134 version of
> RCU Tasks Trace and that of mainline.  In theory, in mainline, CPU hotplug
> is supposed to be disabled across all calls to trc_inspect_reader(),
> which means that there would not be any CPU coming or going.
> 
> But there could potentially be some time between when a CPU was
> marked as online and its idle task was marked PF_IDLE.  And in
> fact x86 start_secondary() invokes set_cpu_online() before it calls
> cpu_startup_entry(), and it is the latter that sets PF_IDLE.
> 
> The same is true of alpha, arc, arm, arm64, csky, ia64, loongarch, mips,
> openrisc, parisc, powerpc, riscv, s390, sh, sparc32, sparc64, x86 xen,
> and xtensa, which is everybody.
> 
> One reason why my testing did not reproduce this is because I was running
> against v6.6-rc1, and cff9b2332ab7 ("kernel/sched: Modify initial boot
> task idle setup") went into v6.6-rc3.  An initial run merging in current
> mainline also failed to reproduce this, but I am running overnight.
> If that doesn't reproduce, I will try inserting delays between the
> set_cpu_online() and the cpu_startup_entry().

I thought the warning happens before set_cpu_online() is even called, because
under such situation, ofl == true and the task is not set to PF_IDLE yet:

                  WARN_ON_ONCE(ofl && task_curr(t) && !is_idle_task(t));

> If this problem is real, fixes include:
> 
> o	Revert Liam's patch and make Tiny RCU's call_rcu() deal with
> 	the problem.  This is overhead and non-tinyness, but to Joel's
> 	point, it might be best.
> 
> o	Go back to something more like Liam's original patch, which
> 	cleared PF_IDLE only for the boot CPU.
> 
> o	Set PF_IDLE before calling set_cpu_online().  This would work,
> 	but it would also be rather ugly, reaching into each and every
> 	architecture.
> 
> o	Move the call to set_cpu_online() into cpu_startup_entry().
> 	This would require some serious inspection to prove that it is
> 	safe, assuming that it is in fact safe.
> 
> o	Drop the WARN_ON_ONCE() from trc_inspect_reader().  Not all
> 	that excited by losing this diagnostic, but then again it
> 	has been a while since it has caught anything.
> 
> o	Change the WARN_ON_ONCE() condition in trc_inspect_reader() into
> 	a "return false" to retry later.  Ditto, also not liking the
> 	possibility of indefinite deferral with no warning.

Just for completeness, 

 o      Since it is just a warning, check for task_struct::pid == 0 instead of is_idle_task()?
        Though PF_IDLE is also set in play_idle_precise().

 o	Change warning to:
                  WARN_ON_ONCE(ofl && task_curr(t) && (!is_idle_task(t) && t->pid != 0));
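
The difference between the existing check and the proposed one can be
seen in a small userspace model.  This is a sketch only: struct
task_model, warn_current(), warn_proposed() and
check_proposed_is_subset() are illustrative names, and the fields just
stand in for task_curr(), is_idle_task() (PF_IDLE) and t->pid.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few task properties the check looks at. */
struct task_model {
	bool curr;	/* models task_curr(t) */
	bool pf_idle;	/* models is_idle_task(t), i.e. PF_IDLE being set */
	int pid;	/* models t->pid; idle tasks have pid 0 */
};

/* Condition as in the existing warning. */
bool warn_current(bool ofl, struct task_model t)
{
	return ofl && t.curr && !t.pf_idle;
}

/* Condition as in the proposed warning, which also exempts pid 0. */
bool warn_proposed(bool ofl, struct task_model t)
{
	return ofl && t.curr && (!t.pf_idle && t.pid != 0);
}

/* The proposed check may only fire where the existing one also fires. */
void check_proposed_is_subset(bool ofl, struct task_model t)
{
	assert(!warn_proposed(ofl, t) || warn_current(ofl, t));
}
```

For a pid-0 task that is running on an offline CPU without PF_IDLE set,
the existing condition is true and the proposed one is false; for any
non-zero pid in the same state both are true, so the diagnostic is kept
for ordinary tasks.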

thanks,

 - Joel



* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-11  5:05                 ` Joel Fernandes
@ 2023-10-11 10:25                   ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2023-10-11 10:25 UTC (permalink / raw
  To: Joel Fernandes
  Cc: Liam R. Howlett, Naresh Kamboju, Greg Kroah-Hartman, stable,
	patches, linux-kernel, torvalds, akpm, linux, shuah, patches,
	lkft-triage, pavel, jonathanh, f.fainelli, sudipm.mukherjee, srw,
	rwarsow, conor, Chengming Zhou, Peter Zijlstra, Ovidiu Panait,
	Ingo Molnar, rcu

On Wed, Oct 11, 2023 at 05:05:04AM +0000, Joel Fernandes wrote:
> On Tue, Oct 10, 2023 at 06:34:35PM -0700, Paul E. McKenney wrote:
> [...]
> > > > > > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > > > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > > > > > v6.5).
> > > > > > >
> > > > > > > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> > > > > >
> > > > > > The Right Thing is to fix the bug all the way back to the introduction,
> > > > > > but what fallout makes the backport less desirable than living with the
> > > > > > unexposed bug?
> > > > >
> > > > > You are quite right that it is possible for the risk of a backport to
> > > > > exceed the risk of the original bug.
> > > > >
> > > > > I defer to Joel (CCed) on how best to resolve this in -stable.
> > > > 
> > > > Maybe I am missing something but this issue should also be happening
> > > > in mainline right?
> > > > 
> > > > Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
> > > > for recently offlined CPUs") , the warning should still be happening
> > > > due to Liam's "kernel/sched: Modify initial boot task idle setup"
> > > > because the warning is just rearranged a bit but essentially the same.
> > > > 
> > > > IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
> > > > fix it in mainline (using the ideas described in this thread), then
> > > > backport both that new fix and Liam's patch to 5.15.
> > > > 
> > > > Or is there a reason this warning does not show up on the mainline?
> > 
> > There is not a whole lot of commonality between the v5.15.134 version of
> > RCU Tasks Trace and that of mainline.  In theory, in mainline, CPU hotplug
> > is supposed to be disabled across all calls to trc_inspect_reader(),
> > which means that there would not be any CPU coming or going.
> > 
> > But there could potentially be some time between when a CPU was
> > marked as online and its idle task was marked PF_IDLE.  And in
> > fact x86 start_secondary() invokes set_cpu_online() before it calls
> > cpu_startup_entry(), and it is the latter that sets PF_IDLE.
> > 
> > The same is true of alpha, arc, arm, arm64, csky, ia64, loongarch, mips,
> > openrisc, parisc, powerpc, riscv, s390, sh, sparc32, sparc64, x86 xen,
> > and xtensa, which is everybody.
> > 
> > One reason why my testing did not reproduce this is because I was running
> > against v6.6-rc1, and cff9b2332ab7 ("kernel/sched: Modify initial boot
> > task idle setup") went into v6.6-rc3.  An initial run merging in current
> > mainline also failed to reproduce this, but I am running overnight.
> > If that doesn't reproduce, I will try inserting delays between the
> > set_cpu_online() and the cpu_startup_entry().
> 
> I thought the warning happens before set_cpu_online() is even called, because
> under such situation, ofl == true and the task is not set to PF_IDLE yet:
> 
>                   WARN_ON_ONCE(ofl && task_curr(t) && !is_idle_task(t));

That case is supposed to be excluded by the cpus_read_lock() calls.
Yes, key phrase "supposed to be".  ;-)

> > If this problem is real, fixes include:
> > 
> > o	Revert Liam's patch and make Tiny RCU's call_rcu() deal with
> > 	the problem.  This is overhead and non-tinyness, but to Joel's
> > 	point, it might be best.
> > 
> > o	Go back to something more like Liam's original patch, which
> > 	cleared PF_IDLE only for the boot CPU.
> > 
> > o	Set PF_IDLE before calling set_cpu_online().  This would work,
> > 	but it would also be rather ugly, reaching into each and every
> > 	architecture.
> > 
> > o	Move the call to set_cpu_online() into cpu_startup_entry().
> > 	This would require some serious inspection to prove that it is
> > 	safe, assuming that it is in fact safe.
> > 
> > o	Drop the WARN_ON_ONCE() from trc_inspect_reader().  Not all
> > 	that excited by losing this diagnostic, but then again it
> > 	has been a while since it has caught anything.
> > 
> > o	Change the WARN_ON_ONCE() condition in trc_inspect_reader() into
> > 	a "return false" to retry later.  Ditto, also not liking the
> > 	possibility of indefinite deferral with no warning.
> 
> Just for completeness, 
> 
>  o      Since it is just a warning, check for task_struct::pid == 0 instead of is_idle_task()?
>         Though PF_IDLE is also set in play_idle_precise().
> 
>  o	Change warning to:
>                   WARN_ON_ONCE(ofl && task_curr(t) && (!is_idle_task(t) && t->pid != 0));

This change does look promising, thank you!

							Thanx, Paul


* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-11  1:34               ` Paul E. McKenney
  2023-10-11  5:05                 ` Joel Fernandes
@ 2023-10-11 13:47                 ` Frederic Weisbecker
  2023-10-11 16:31                   ` Paul E. McKenney
  1 sibling, 1 reply; 14+ messages in thread
From: Frederic Weisbecker @ 2023-10-11 13:47 UTC (permalink / raw
  To: Paul E. McKenney
  Cc: Joel Fernandes, Liam R. Howlett, Naresh Kamboju,
	Greg Kroah-Hartman, stable, patches, linux-kernel, torvalds, akpm,
	linux, shuah, patches, lkft-triage, pavel, jonathanh, f.fainelli,
	sudipm.mukherjee, srw, rwarsow, conor, Chengming Zhou,
	Peter Zijlstra, Ovidiu Panait, Ingo Molnar, rcu

Le Tue, Oct 10, 2023 at 06:34:35PM -0700, Paul E. McKenney a écrit :
> If this problem is real, fixes include:
> 
> o	Revert Liam's patch and make Tiny RCU's call_rcu() deal with
> 	the problem.  This is overhead and non-tinyness, but to Joel's
> 	point, it might be best.

But what is calling call_rcu() or start_poll_synchronize_rcu() so
early that the CPU is not even online? (that's before boot_cpu_init() !)

Deferring PF_IDLE setting might pave the way for more issues like this one,
present or future. Though is_idle_task() returning true when the task is not
in the idle loop but is playing the init/0 role is debatable.

An alternative for tiny RCU is to force waking up ksoftirqd when call_rcu()
is in the idle task. Since rcu_qs() during the context switch raises a softirq
anyway. It's more overhead for start_poll_synchronize_rcu() though but do we
expect much RCU polling in idle?

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a92bce40b04b..6ab15233e2be 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -604,6 +604,7 @@ extern void __raise_softirq_irqoff(unsigned int nr);
 
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
+extern void raise_ksoftirqd_irqsoff(unsigned int nr);
 
 DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
 
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 42f7589e51e0..872dab8b8b53 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -189,12 +189,12 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 	local_irq_save(flags);
 	*rcu_ctrlblk.curtail = head;
 	rcu_ctrlblk.curtail = &head->next;
-	local_irq_restore(flags);
 
 	if (unlikely(is_idle_task(current))) {
 		/* force scheduling for rcu_qs() */
-		resched_cpu(0);
+		raise_ksoftirqd_irqsoff(RCU_SOFTIRQ);
 	}
+	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(call_rcu);
 
@@ -225,10 +225,13 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
 unsigned long start_poll_synchronize_rcu(void)
 {
 	unsigned long gp_seq = get_state_synchronize_rcu();
+	unsigned long flags;
 
 	if (unlikely(is_idle_task(current))) {
+		local_irq_save(flags);
 		/* force scheduling for rcu_qs() */
-		resched_cpu(0);
+		raise_ksoftirqd_irqsoff(RCU_SOFTIRQ);
+		local_irq_restore(flags);
 	}
 	return gp_seq;
 }
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 210cf5f8d92c..ef105cbdc705 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -695,6 +695,14 @@ void __raise_softirq_irqoff(unsigned int nr)
 	or_softirq_pending(1UL << nr);
 }
 
+#ifdef CONFIG_RCU_TINY
+void raise_ksoftirqd_irqsoff(unsigned int nr)
+{
+	__raise_softirq_irqoff(nr);
+	wakeup_softirqd();
+}
+#endif
+
 void open_softirq(int nr, void (*action)(struct softirq_action *))
 {
 	softirq_vec[nr].action = action;






* Re: [PATCH 5.15 000/183] 5.15.134-rc1 review
  2023-10-11 13:47                 ` Frederic Weisbecker
@ 2023-10-11 16:31                   ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2023-10-11 16:31 UTC (permalink / raw
  To: Frederic Weisbecker
  Cc: Joel Fernandes, Liam R. Howlett, Naresh Kamboju,
	Greg Kroah-Hartman, stable, patches, linux-kernel, torvalds, akpm,
	linux, shuah, patches, lkft-triage, pavel, jonathanh, f.fainelli,
	sudipm.mukherjee, srw, rwarsow, conor, Chengming Zhou,
	Peter Zijlstra, Ovidiu Panait, Ingo Molnar, rcu

On Wed, Oct 11, 2023 at 03:47:23PM +0200, Frederic Weisbecker wrote:
> Le Tue, Oct 10, 2023 at 06:34:35PM -0700, Paul E. McKenney a écrit :
> > If this problem is real, fixes include:
> > 
> > o	Revert Liam's patch and make Tiny RCU's call_rcu() deal with
> > 	the problem.  This is overhead and non-tinyness, but to Joel's
> > 	point, it might be best.
> 
> But what is calling call_rcu() or start_poll_synchronize_rcu() so
> early that the CPU is not even online? (that's before boot_cpu_init() !)
> 
> Deferring PF_IDLE setting might pave the way for more issues like this one,
> present or future. Though is_idle_task() returning true when the task is not
> in the idle loop but is playing the init/0 role is debatable.
> 
> An alternative for tiny RCU is to force waking up ksoftirqd when call_rcu()
> is in the idle task. Since rcu_qs() during the context switch raises a softirq
> anyway. It's more overhead for start_poll_synchronize_rcu() though but do we
> expect much RCU polling in idle?

Nice!!!

This does solve the original problem with little or no additional overhead
(perhaps even with decreased overhead), and avoids the other RCU Tasks
issues.

						Thanx, Paul

> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index a92bce40b04b..6ab15233e2be 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -604,6 +604,7 @@ extern void __raise_softirq_irqoff(unsigned int nr);
>  
>  extern void raise_softirq_irqoff(unsigned int nr);
>  extern void raise_softirq(unsigned int nr);
> +extern void raise_ksoftirqd_irqsoff(unsigned int nr);
>  
>  DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
>  
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index 42f7589e51e0..872dab8b8b53 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -189,12 +189,12 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  	local_irq_save(flags);
>  	*rcu_ctrlblk.curtail = head;
>  	rcu_ctrlblk.curtail = &head->next;
> -	local_irq_restore(flags);
>  
>  	if (unlikely(is_idle_task(current))) {
>  		/* force scheduling for rcu_qs() */
> -		resched_cpu(0);
> +		raise_ksoftirqd_irqsoff(RCU_SOFTIRQ);
>  	}
> +	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(call_rcu);
>  
> @@ -225,10 +225,13 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
>  unsigned long start_poll_synchronize_rcu(void)
>  {
>  	unsigned long gp_seq = get_state_synchronize_rcu();
> +	unsigned long flags;
>  
>  	if (unlikely(is_idle_task(current))) {
> +		local_irq_save(flags);
>  		/* force scheduling for rcu_qs() */
> -		resched_cpu(0);
> +		raise_ksoftirqd_irqsoff(RCU_SOFTIRQ);
> +		local_irq_restore(flags);
>  	}
>  	return gp_seq;
>  }
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 210cf5f8d92c..ef105cbdc705 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -695,6 +695,14 @@ void __raise_softirq_irqoff(unsigned int nr)
>  	or_softirq_pending(1UL << nr);
>  }
>  
> +#ifdef CONFIG_RCU_TINY
> +void raise_ksoftirqd_irqsoff(unsigned int nr)
> +{
> +	__raise_softirq_irqoff(nr);
> +	wakeup_softirqd();
> +}
> +#endif
> +
>  void open_softirq(int nr, void (*action)(struct softirq_action *))
>  {
>  	softirq_vec[nr].action = action;
> 
> 
> 
> 


end of thread, other threads:[~2023-10-11 16:31 UTC | newest]

Thread overview: 14+ messages
     [not found] <20231004175203.943277832@linuxfoundation.org>
2023-10-05 17:49 ` [PATCH 5.15 000/183] 5.15.134-rc1 review Naresh Kamboju
2023-10-06 16:20   ` Liam R. Howlett
2023-10-06 16:47     ` Paul E. McKenney
2023-10-06 17:57       ` Liam R. Howlett
2023-10-06 18:20         ` Paul E. McKenney
2023-10-08  1:22           ` Joel Fernandes
2023-10-09  1:20             ` Paul E. McKenney
2023-10-11  1:34               ` Paul E. McKenney
2023-10-11  5:05                 ` Joel Fernandes
2023-10-11 10:25                   ` Paul E. McKenney
2023-10-11 13:47                 ` Frederic Weisbecker
2023-10-11 16:31                   ` Paul E. McKenney
2023-10-11  2:44               ` Joel Fernandes
2023-10-11  3:11                 ` Paul E. McKenney
