From: Salvatore Bonaccorso <carnil@debian.org>
To: Nick Hastings <nicholaschastings@gmail.com>, 1036530@bugs.debian.org
Cc: Mario Limonciello <mario.limonciello@amd.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Len Brown <lenb@kernel.org>,
	linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
	regressions@lists.linux.dev
Subject: Re: Bug#1036530: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
Date: Tue, 30 May 2023 13:22:40 +0200
Message-ID: <ZHXcgPC+7u04RuGD@eldamar.lan>
In-Reply-To: <ZHWfMBeAONerAJmd@xps>

Hi Nick,

Thanks to you both for triaging the issue!

On Tue, May 30, 2023 at 04:01:04PM +0900, Nick Hastings wrote:
> Hi,
> 
> * Mario Limonciello <mario.limonciello@amd.com> [230530 13:00]:
> > On 5/29/23 18:01, Nick Hastings wrote:
> > > Hi,
> > > 
> > > * Nick Hastings <nicholaschastings@gmail.com> [230529 12:51]:
> > > > * Mario Limonciello <mario.limonciello@amd.com> [230529 10:14]:
> > > > > On 5/28/23 19:56, Nick Hastings wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > * Mario Limonciello <mario.limonciello@amd.com> [230528 21:44]:
> > > > > > > On 5/28/23 01:49, Salvatore Bonaccorso wrote:
> > > > > > > > Hi Mario
> > > > > > > > 
> > > > > > > > Nick Hastings reported in Debian in https://bugs.debian.org/1036530
> > > > > > > > lockups from his system after updating from a 6.0 based version to
> > > > > > > > > 6.1.y.
> > > > > > > > > 
> > > > > > > > #regzbot ^introduced 24867516f06d
> > > > > > > > 
> > > > > > > > he bisected the issue and tracked it down to:
> > > > > > > > 
> > > > > > > > On Sun, May 28, 2023 at 10:14:51AM +0900, Nick Hastings wrote:
> > > > > > > > > Control: tags -1 - moreinfo
> > > > > > > > > 
> > > > > > > > > Hi,
> > > > > > > > > 
> > > > > > > > > I repeated the git bisect, and the bad commit seems to be:
> > > > > > > > > 
> > > > > > > > > (git)-[v6.1-rc1~206^2~4^5~3|bisect] % git bisect bad
> > > > > > > > > 24867516f06dabedef3be7eea0ef0846b91538bc is the first bad commit
> > > > > > > > > commit 24867516f06dabedef3be7eea0ef0846b91538bc
> > > > > > > > > Author: Mario Limonciello <mario.limonciello@amd.com>
> > > > > > > > > Date:   Tue Aug 23 13:51:31 2022 -0500
> > > > > > > > > 
> > > > > > > > >        ACPI: OSI: Remove Linux-Dell-Video _OSI string
> > > > > > > > >        This string was introduced because drivers for NVIDIA hardware
> > > > > > > > >        had bugs supporting RTD3 in the past.
> > > > > > > > >        Before proprietary NVIDIA driver started to support RTD3, Ubuntu had
> > > > > > > > >        had a mechanism for switching PRIME on and off, though it had required
> > > > > > > > >        to logout/login to make the library switch happen.
> > > > > > > > >        When the PRIME had been off, the mechanism had unloaded the NVIDIA
> > > > > > > > >        driver and put the device into D3cold, but the GPU had never come back
> > > > > > > > >        to D0 again which is why ODMs used the _OSI to expose an old _DSM
> > > > > > > > >        method to switch the power on/off.
> > > > > > > > >        That has been fixed by commit 5775b843a619 ("PCI: Restore config space
> > > > > > > > >        on runtime resume despite being unbound"). so vendors shouldn't be
> > > > > > > > >        using this string to modify ASL any more.
> > > > > > > > >        Reviewed-by: Lyude Paul <lyude@redhat.com>
> > > > > > > > >        Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
> > > > > > > > >        Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > > > > 
> > > > > > > > >     drivers/acpi/osi.c | 9 ---------
> > > > > > > > >     1 file changed, 9 deletions(-)
> > > > > > > > > 
> > > > > > > > > This machine is a Dell with an nvidia chip so it looks like this really
> > > > > > > > > > could be the commit that is causing the problems. The description
> > > > > > > > > of the commit also seems (to my untrained eye) to be consistent with the
> > > > > > > > > error reported on the console when the lockup occurs:
> > > > > > > > > 
> > > > > > > > > [   58.729863] ACPI Error: Aborting method \_SB.PCI0.PGON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > > [   58.729904] ACPI Error: Aborting method \_SB.PCI0.PEG0.PG00._ON due to previous error (AE_AML_LOOP_TIMEOUT) (20220331/psparse-529)
> > > > > > > > > [   60.083261] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > > > > > > > 
> > > > > > > > > Hopefully this is enough information for experts to resolve this.
> > > > > > > > 
> > > > > > > > Does this ring some bell for you? Do you need any further information
> > > > > > > > from Nick?
> > > > > > > > 
> > > > > > > > Regards,
> > > > > > > > Salvatore
> > > > > > > 
> > > > > > 
> > > > > > > Have Nick try using "pcie_port_pm=off" and see if it helps the issue.
> > > > > > 
> > > > > > I booted into a 6.1 kernel with this option. It has been running without
> > > > > > problems for 1.5 hours. Usually I would expect the lockup to have
> > > > > > occurred by now.
> > > > 
> > > > I let this run for 3 hours without issue.
> > > > 
> > > > > > > Does this happen in the latest 6.4 RC as well?
> > > > > > 
> > > > > > I have compiled that kernel and will boot into it after running this one
> > > > > > with the pcie_port_pm=off for another hour or so.
> > > > 
> > > > I'm now running 6.4.0-rc4 without seeing the problem after 1 hour.
> > > 
> > > I did eventually see a lockup of this kernel. On the console I saw:
> > > 
> > > [  151.035036] vfio-pci 0000:01:00.0 Unable to change power state from D3cold to D0, device inaccessible
> > > 
> > > I did not see the other two lines that were present in earlier lock ups.
> > > 
> > > > I did however see two unrelated problems that I include here for
> > > > completeness:
> > > > 1. iwlwifi module did not automatically load
> > > > 2. Xwayland used huge amount of CPU even though was not running any X
> > > > programs. Recompiling my wayland compositor without XWayland support
> > > > "fixed" this.
> > > > 
> > > > > > > I think we need to see a full dmesg and acpidump to better
> > > > > > > characterize it.
> > > > > > 
> > > > > > Please find attached. Let me know if there is anything else I can provide.
> > > > > > 
> > > > > > Regards,
> > > > > > 
> > > > > > Nick.
> > > > > 
> > > > > I don't see nouveau loading, are you explicitly preventing it from
> > > > > loading?
> > > > 
> > > > Yes nouveau is blacklisted.
> > > > 
> > > > > Can I see the journal from a boot when it reproduced?
> > > > 
> > > > Hmm not sure which n for "journalctl -b n" maps to which kernel (is that
> > > > what you are requesting?). The commit hash does not seem to be
> > > > listed. I may have to boot into a bad kernel again.
> > > 
> > > Please find attached the output from a "journalctl --system -bN" for a
> > > kernel that has this issue.
> > > 
> > > Regards,
> > > 
> > > Nick.
> > 
> > In this log I see nouveau loaded, but I also don't see the failure
> > occurring.
> 
> I never saw anything in the logs from a lockup either. I had assumed it
> was no longer able to write to disk. The failure did occur on that
> occasion.

Could you check whether netconsole lets you capture more output from
the lockup?

https://www.kernel.org/doc/html/latest/networking/netconsole.html
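
As a rough sketch of what such a setup could look like: the netconsole=
module parameter has the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac.
All ports, addresses, the interface name and the MAC below are placeholders
I made up for illustration, not values from your system. The snippet only
prints the modprobe command so the values can be checked first:

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with your own network details.
SRC_PORT=6665                # UDP source port on the machine that locks up
SRC_IP=192.168.1.10          # IP address of the machine that locks up
DEV=eth0                     # network interface used to send the messages
TGT_PORT=6666                # UDP port the log receiver listens on
TGT_IP=192.168.1.20          # IP address of the log receiver
TGT_MAC=aa:bb:cc:dd:ee:ff    # MAC address of the receiver (or of the gateway)

# Documented format: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
NETCONSOLE_PARAM="${SRC_PORT}@${SRC_IP}/${DEV},${TGT_PORT}@${TGT_IP}/${TGT_MAC}"

# Only print the command; run it as root once the values look right.
echo "modprobe netconsole netconsole=${NETCONSOLE_PARAM}"
```

On the receiving machine, something like "nc -u -l 6666" (or
"nc -u -l -p 6666", depending on the netcat variant) should then show the
kernel messages as they arrive.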

Regards,
Salvatore
