Re: [PATCH] parisc: Try to fix random segmentation faults in package builds

From: <Vidra.Jonas@seznam.cz>
To: <linux-parisc@vger.kernel.org>
Cc: "John David Anglin" <dave@parisc-linux.org>,
	"Helge Deller" <deller@gmx.de>
Subject: Re: [PATCH] parisc: Try to fix random segmentation faults in package builds
Date: Sun, 12 May 2024 08:57:48 +0200 (CEST)
Message-ID: <E1b.NdM}.1zk9vH6PTNN.1cG6Xi@seznam.cz>
In-Reply-To: <91563ff7-349b-4815-bcfe-99f8f34b0b16@bell.net>

---------- Original e-mail ----------
From: John David Anglin
To: linux-parisc@vger.kernel.org
CC: John David Anglin, Helge Deller
Date: 8. 5. 2024 17:23:27
Subject: Re: [PATCH] parisc: Try to fix random segmentation faults in package builds

> In my opinion, the 6.1.x branch is the most stable branch on parisc.  6.6.x and later
> branches have folio changes and haven't had very much testing in build environments.
> I did run 6.8.7 and 6.8.8 on rp3440 for some time but I have gone back to a slightly
> modified 6.1.90.

OK, thanks, I'll roll back as well.


>> My machine is affected heavily by the segfaults – with some kernel
>> configurations, I get several per hour when compiling Gentoo packages
> That's more than normal although number seems to depend on package.
> At this rate, you wouldn't be able to build gcc.

Well, yeah. :-) The crashes are rarer when using a kernel with many
debugging options turned on, which suggests that it's some kind of
race condition. Unfortunately, that also means it doesn't manifest when
the program is run under strace or gdb. I build large packages with -j1,
as the crashes are rarer under a lighter load.

The worst offender is the `moc` program used in builds of Qt packages;
it crashes a lot.


>> on all four cores. This patch doesn't fix them, though. On the patched
> Okay.  There are likely multiple problems.  The problem I was trying to address is null
> objects in the hash tables used by ld and as.  The symptom is usually a null pointer
> dereference after pointer has been loaded from null object.  These occur in multiple
> places in libbfd during hash table traversal.  Typically, a couple would occur in a gcc
> testsuite run.  _objalloc_alloc uses malloc.  One can see the faults on the console and
> in the gcc testsuite log.
>
> How these null objects are generated is not known.  It must be a kernel issue because
> they don't occur with qemu.  I think the frequency of these faults is reduced with the
> patch.  I suspect the objects are zeroed after they are initialized.  In some cases, ld can
> successfully link by ignoring null objects.
>
> The next time I see a fault caused by a null object, I think it would be useful to see if
> we have a full null page.  This might indicate a swap problem.

I did see a full zeroed page at least once, but it's hard to debug.
Also, I'm not sure whether core dumps are reliable in this case – since
this is a kernel bug, the view of memory stored in a core dump might be
different from what the program saw at the time of the crash.
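
A minimal sketch of how one might look for fully zeroed pages in a dump
of memory (for example, the raw bytes of a core file). The function name
`zero_page_offsets` and the 4 KiB page size are assumptions made for
illustration, not something taken from the kernel or from this thread.
Zero pages also occur legitimately (untouched heap, .bss), so every hit
is only a candidate, and the caveat above about core dumps not matching
what the program saw still applies.

```python
from typing import Iterator

# Assumption: 4 KiB pages; adjust if the kernel was built with a larger page size.
PAGE_SIZE = 4096


def zero_page_offsets(data: bytes, page_size: int = PAGE_SIZE) -> Iterator[int]:
    """Yield the offset of every aligned, page-sized chunk that is all zero bytes."""
    zero = bytes(page_size)  # reference page of zero bytes
    # Only whole pages are examined; a trailing partial page is ignored.
    for off in range(0, len(data) - page_size + 1, page_size):
        if data[off:off + page_size] == zero:
            yield off


if __name__ == "__main__":
    import sys

    # Usage: python zero_pages.py <file>
    with open(sys.argv[1], "rb") as f:
        for off in zero_page_offsets(f.read()):
            print(f"zeroed page at file offset {off:#x}")
```

The offsets printed are file offsets, not virtual addresses; mapping them
back to addresses would require reading the ELF program headers of the
core file, which this sketch does not attempt.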


>> kernel, it happened after ~8h of uptime during installation of the
>> perl-core/Test-Simple package. I got no error output from the running
>> program, but an HPMC was logged to the serial console:
>>
>> [30007.186309] mm/pgtable-generic.c:54: bad pmd 539b0030.
>> <Cpu3> 78000c6203e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
>> <Cpu0> e800009800e00000 0000000041093be4 CC_ERR_CHECK_HPMC
>> <Cpu1> e800009801e00000 00000000404ce130 CC_ERR_CHECK_HPMC
>> <Cpu3> 76000c6803e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
>> <Cpu0> 37000f7300e00000 84000[30007.188321] Backtrace:
>> [30007.188321] [<00000000404eef9c>] pte_offset_map_nolock+0xe8/0x150
>> [30007.188321] [<00000000404d6784>] __handle_mm_fault+0x138/0x17e8
>> [30007.188321] [<00000000404d8004>] handle_mm_fault+0x1d0/0x3b0
>> [30007.188321] [<00000000401e4c98>] do_page_fault+0x1e4/0x8a0
>> [30007.188321] [<00000000401e95c0>] handle_interruption+0x330/0xe60
>> [30007.188321] [<0000000040295b44>] schedule_tail+0x78/0xe8
>> [30007.188321] [<00000000401e0f6c>] finish_child_return+0x0/0x58
>>
>> A longer excerpt of the logs is attached. The error happened at boot
>> time 30007, the preceding unaligned accesses seem to be unrelated.
> I doubt this HPMC is related to the patch.  In the above, the pmd table appears to have
> become corrupted.

I see all kinds of corruption in both kernel space and user space, and I
assumed they all share the same underlying mechanism, but you're right
that there might be multiple unrelated causes.


>> I don't think it's a hardware error, as HP-UX 11i v1 works flawlessly on
>> the same machine. The errors seem to be more frequent with a heavy IO
>> load, so it might be system-bus or PCI-bus-related. Using X11 causes
>> lockups rather quickly, but that could be caused by unrelated errors in
>> the graphics subsystem and/or the Radeon drivers.
> I am not using X11 on my c8000.  I have frame buffer support on. Radeon acceleration
> is broken on parisc.

Yeah, accel doesn't work, but unaccelerated graphics works fine. Except
for the crashes, that is.


>> 2. The segfault is sometimes preceded by an unaligned access, which I
>> believe is also caused by a corrupted machine state rather than by a
>> coding error in the program – sometimes a bunch of unaligned accesses
>> show up in the logs just prior to a segfault / lockup, even from
>> unrelated programs such as random bash processes. Sometimes the machine
>> keeps working afterwards (although I typically reboot it immediately
>> to limit the consequences of potential kernel data structure damage),
>> sometimes it HPMCs or LPMCs. This is difficult to explain by just a wild
>> zeroed page appearance. But this typically happens when running X11, so
>> again, it might be caused by another bug, such as the GPU randomly
>> writing to memory via misconfigured DMA.
> There was a bug in the unaligned handler for double word instructions (ldd) that was
> recently fixed.  ldd/std are not used in userspace, so this problem didn't affect it.

Yes, but that fix covers the case where a program has a coding bug,
performs an unaligned access, and the kernel has to emulate the load.
What I'm seeing is that sometimes several programs that usually run just
fine with no unaligned accesses all perform one at the same time, which
seems very weird. I sometimes (but not always) see this on X11 startup.
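
A rough sketch of how such bursts could be picked out of a saved kernel
log: group "unaligned access" messages by timestamp and report time
windows in which more than one distinct program shows up. The line shape
assumed by the regular expression ("[time] comm(pid): unaligned access
...") and the names `unaligned_bursts`, `window` and `min_programs` are
assumptions for illustration; the pattern would need adjusting to match
what the kernel actually prints.

```python
import re

# Assumed line shape: "[  123.456789] comm(pid): unaligned access ..."
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(\S+?)\((\d+)\): unaligned access")


def unaligned_bursts(lines, window=1.0, min_programs=2):
    """Return (start_time, program_names) for each burst of unaligned accesses.

    A burst is a run of events within `window` seconds of its first event
    that involves at least `min_programs` distinct program names.
    """
    events = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2)))
    events.sort()

    bursts = []
    i = 0
    while i < len(events):
        start = events[i][0]
        names = set()
        j = i
        # Collect every event that falls inside the window opened at `start`.
        while j < len(events) and events[j][0] - start <= window:
            names.add(events[j][1])
            j += 1
        if len(names) >= min_programs:
            bursts.append((start, sorted(names)))
        i = j  # continue with the first event outside this window
    return bursts
```

Feeding it the lines of a captured serial console log would list the
moments at which several unrelated programs faulted together, which
could then be compared against the times of segfaults or HPMCs.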


> We have observed that the faults appear SMP and memory size related.  A rp4440 with
> 6 CPUs and 4 GB RAM faulted a lot.  It's mostly a PA8800/PA8900 issue.
>
> It's months since I had a HPMC or LPMC on rp3440 and c8000.  Stalls still happen but they
> are rare.

I have 16 GiB of memory and 4 × PA8900 @ 1 GHz. But I've seen a lot of
faults even with 2 GiB.
