All the mail mirrored from lore.kernel.org
From: Alex Constantino <dreaming.about.electric.sheep@gmail.com>
To: regressions@leemhuis.info
Cc: 1054514@bugs.debian.org, airlied@redhat.com, carnil@debian.org,
	daniel@ffwll.ch, dri-devel@lists.freedesktop.org,
	kraxel@redhat.com, linux-kernel@vger.kernel.org,
	regressions@lists.linux.dev, spice-devel@lists.freedesktop.org,
	timo.lindfors@iki.fi, tzimmermann@suse.de,
	virtualization@lists.linux-foundation.org,
	Alex Constantino <dreaming.about.electric.sheep@gmail.com>
Subject: [PATCH 0/1] drm/qxl: fixes qxl_fence_wait
Date: Fri,  8 Mar 2024 01:08:50 +0000
Message-ID: <20240308010851.17104-1-dreaming.about.electric.sheep@gmail.com>
In-Reply-To: <fb0fda6a-3750-4e1b-893f-97a3e402b9af@leemhuis.info>

Hi,
As initially reported by Timo, the QXL driver will crash given enough
workload:
https://lore.kernel.org/regressions/fb0fda6a-3750-4e1b-893f-97a3e402b9af@leemhuis.info/
I initially came across this problem when migrating Debian VMs from Bullseye
to Bookworm. The bug happens somewhat randomly but reliably, even when
just using neovim with plugins or playing a video. The resulting error
then cascades and makes Xorg crash too.

The error log from dmesg would have `[TTM] Buffer eviction failed` followed
by either a `failed to allocate VRAM BO` or `failed to allocate GEM object`.
And the error log from Xorg would have `qxl(0): error doing QXL_ALLOC`
followed by a backtrace and segmentation fault.
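
To check a captured log for the signatures listed above, something along
these lines may help. It is only a sketch: the function name
scan_qxl_signatures is hypothetical and not part of the patch, and it
matches just the four strings quoted in this mail.
```bash
#!/bin/bash
# Hypothetical helper (not part of the patch): read log text on stdin and
# print one line per known error signature found in it.
# Returns 0 if at least one signature matched, 1 otherwise.
scan_qxl_signatures() {
    local log sig found
    log="$(cat)"
    found=1
    for sig in \
        "[TTM] Buffer eviction failed" \
        "failed to allocate VRAM BO" \
        "failed to allocate GEM object" \
        "error doing QXL_ALLOC"; do
        # -F matches fixed strings, so "[TTM]" is not parsed as a regex class.
        if printf '%s\n' "$log" | grep -qF -- "$sig"; then
            echo "found: $sig"
            found=0
        fi
    done
    return "$found"
}
```
It can be fed from any source of log text, for example
`dmesg | scan_qxl_signatures`, or from the Xorg log file wherever your
system keeps it.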

I can confirm the problem still exists in latest kernel versions:
https://gitlab.freedesktop.org/drm/kernel @ c6d6a82d8a9f
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git @ 1870cdc0e8de

While investigating this issue I wrote a script which triggers it in just
a couple of minutes when executed under uxterm. YMMV depending on your
system: for example, with urxvt crashes were not as consistent, likely
because it is more efficient and makes fewer video memory allocations.
For me this is the fastest way to trigger the bug. Here it is:
```
#!/bin/bash
print_gradient_with_awk() {
    local arg="$1"
    if [[ -n $arg ]]; then
        arg=" ($arg)"
    fi
    awk -v arg="$arg" 'BEGIN{
        s="/\\/\\/\\/\\/\\"; s=s s s s s s s s;
        for (colnum = 0; colnum<77; colnum++) {
            r = 255-(colnum*255/76);
            g = (colnum*510/76);
            b = (colnum*255/76);
            if (g>255) g = 510-g;
            printf "\033[48;2;%d;%d;%dm", r,g,b;
            printf "\033[38;2;%d;%d;%dm", 255-r,255-g,255-b;
            printf "%s\033[0m", substr(s,colnum+1,1);
        }
        printf "%s\n", arg;
    }'
}
for i in {1..10000}; do
    print_gradient_with_awk "$i"
done
```

Timo initially reported:
commit 5f6c871fe919 ("drm/qxl: properly free qxl releases") as working fine
commit 5a838e5d5825 ("drm/qxl: simplify qxl_fence_wait") introducing the bug

The bug occurs whenever a timeout is reached in wait_event_timeout.
To fix this issue I updated the code to include busy-wait logic, which
is how the last working version operated. That fixes the bug while still
keeping the code simple (which I suspect was the motivation for commit
5a838e5d5825 in the first place), as opposed to just reverting to the
last working version at 5f6c871fe919.
I chose HZ as the scaling factor for the loop because it is also used by
ttm_bo_wait_ctx, which is one of the indirect callers of qxl_fence_wait,
the other being ttm_bo_delayed_delete.
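
As a rough user-space analogy only (this is not the kernel change, which
lives in drivers/gpu/drm/qxl/qxl_release.c), the idea of re-checking a
condition a bounded number of times instead of giving up on the first
timed-out wait could look like the sketch below. The name poll_until and
its parameters are made up for illustration.
```bash
#!/bin/bash
# Illustrative analogy, NOT the kernel patch: run a check command up to
# max_tries times, sleeping `interval` seconds after each failed attempt.
# Returns 0 as soon as the check succeeds, 1 if every attempt failed.
poll_until() {
    local max_tries interval i
    max_tries="$1"
    interval="$2"
    shift 2
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    return 1
}
```
For example, `poll_until 10 1 test -e /tmp/ready` would check for a
(hypothetical) marker file up to ten times, sleeping one second after
each failed check.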

To confirm the problem no longer manifests I have:
- executed my own test case pasted above
- executed Timo's test case pasted below
- played a video stream in mplayer for 3h (no audio stream because
  apparently pulseaudio and/or alsa have memory leaks that make the
  system run out of memory)

For quick reference here is Timo's script:
```
#!/bin/bash
chvt 3
for j in $(seq 80); do
    echo "$(date) starting round $j"
    if [ "$(journalctl --boot | grep "failed to allocate VRAM BO")" != "" ]; then
        echo "bug was reproduced after $j tries"
        exit 1
    fi
    for i in $(seq 100); do
        dmesg > /dev/tty3
    done
done
echo "bug could not be reproduced"
exit 0
```

From what I could find online, it seems that affected users tend to just
move from QXL to VirtIO, which is why this bug has been hiding for over
3 years now.
This issue was initially reported by Timo 4 months ago but the discussion
seems to have stalled.
It would be great if this could be addressed rather than falling through
the cracks.

Thank you for your time.


---

Alex Constantino (1):
  drm/qxl: fixes qxl_fence_wait

 drivers/gpu/drm/qxl/qxl_release.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)


base-commit: 1870cdc0e8dee32e3c221704a2977898ba4c10e8
-- 
2.39.2

