From: Wren Turkal <wt@penguintechs.org>
To: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: Lk Sii <lk_sii@163.com>,
linux-bluetooth@vger.kernel.org,
Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>,
Zijun Hu <quic_zijuhu@quicinc.com>,
Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Subject: Re: path to landing patch to fix warm boot issue for qca6390
Date: Tue, 14 May 2024 22:08:34 -0700 [thread overview]
Message-ID: <452ec12b-b6bc-483e-a766-e5a1a09b8b3a@penguintechs.org> (raw)
In-Reply-To: <8ad37b23-df40-4d3e-ac7d-6dbaf92249e7@penguintechs.org>
Luiz,
On 5/14/24 3:17 PM, Wren Turkal wrote:
> Hey Luiz,
>
> On 5/13/24 1:46 PM, Luiz Augusto von Dentz wrote:
>> Hi Wren,
>>
>> On Mon, May 13, 2024 at 4:13 PM Wren Turkal <wt@penguintechs.org> wrote:
>>>
>>> On 5/10/24 11:25 PM, Lk Sii wrote:
>>>> On 2024/5/11 07:33, Wren Turkal wrote:
>>>>> On 5/10/24 2:25 PM, Luiz Augusto von Dentz wrote:
>>>>>> Hi Wren,
>>>>>>
>>>>>> On Fri, May 10, 2024 at 4:54 PM Wren Turkal <wt@penguintechs.org>
>>>>>> wrote:
>>>>>>>
>>>>>>> On 5/10/24 12:48 PM, Luiz Augusto von Dentz wrote:
>>>>>>>> Hi Wren,
>>>>>>>>
>>>>>>>> On Fri, May 10, 2024 at 3:14 PM Wren Turkal <wt@penguintechs.org>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 5/6/24 12:49 PM, Luiz Augusto von Dentz wrote:
>>>>>>>>>> Hi Wren,
>>>>>>>>>>
>>>>>>>>>> On Mon, May 6, 2024 at 3:24 PM Wren Turkal <wt@penguintechs.org>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Krzysztof,
>>>>>>>>>>>
>>>>>>>>>>> I am reaching out to you as you had the most important
>>>>>>>>>>> objections
>>>>>>>>>>> to the
>>>>>>>>>>> change to fix qca6390 for the warm boot/module reload bug
>>>>>>>>>>> that I am
>>>>>>>>>>> experiencing.
>>>>>>>>>>>
>>>>>>>>>>> For context, the problem is that the hci_uart module will send
>>>>>>>>>>> specific
>>>>>>>>>>> vendor specfic commands during shutdown of the hardware under
>>>>>>>>>>> most
>>>>>>>>>>> situations. These VSCs put the bluetooth device into a
>>>>>>>>>>> non-working state
>>>>>>>>>>> on my Dell XPS 13 9310 with qca6390 bluetooth hardware.
>>>>>>>>>>>
>>>>>>>>>>> Zijun's proposed fix is to not send these commands when it's not
>>>>>>>>>>> appropriate for the hardware. The vendor commands should be
>>>>>>>>>>> avoided when
>>>>>>>>>>> the hardware does not have persistent configuration or when the
>>>>>>>>>>> device
>>>>>>>>>>> is in setup state (indicating that is has never been setup and
>>>>>>>>>>> should
>>>>>>>>>>> not be sent the VSCs on the shutdown path). This is what Zijun's
>>>>>>>>>>> patch
>>>>>>>>>>> implements.
>>>>>>>>>>>
>>>>>>>>>>> In addition, Zijun's change removes the influence of both
>>>>>>>>>>> the QCA_BT_OFF qca flag and and HCI_RUNNING hdev flag. Zijun
>>>>>>>>>>> asserts
>>>>>>>>>>> that those flags should not influence the sending of the VSCs
>>>>>>>>>>> in the
>>>>>>>>>>> shutdown path. If I understand KK's objections properly, this is
>>>>>>>>>>> where
>>>>>>>>>>> his objection is stemming from. KK, is this correct?
>>>>>>>>>>>
>>>>>>>>>>> Zijun's proposed fix can be seen here:
>>>>>>>>>>> https://patchwork.kernel.org/project/bluetooth/patch/1713932807-19619-3-git-send-email-quic_zijuhu@quicinc.com/
>>>>>>>>>>>
>>>>>>>>>>> I'm wondering if we can resolve this impasse by splitting the
>>>>>>>>>>> change
>>>>>>>>>>> into two changes, as follows:
>>>>>>>>>>>
>>>>>>>>>>> 1. Change that removes the influence of the QCA_BT_OFF and
>>>>>>>>>>> HCI_RUNNING
>>>>>>>>>>> flags in the shutdown path.
>>>>>>>>>>> 2. Add the quirk from Zijun's patch that fixes my hardward
>>>>>>>>>>> configuration.
>>>>>>>>>>>
>>>>>>>>>>> I'm hoping that better clearer descriptions for #1 can help
>>>>>>>>>>> get that
>>>>>>>>>>> landed since the logic current appears to be at odds with how
>>>>>>>>>>> the
>>>>>>>>>>> hardware works.
>>>>>>>>>>>
>>>>>>>>>>> Also, I am happy to split the patches into the two patches, or
>>>>>>>>>>> (maybe
>>>>>>>>>>> more ideally) just modify the commit message to better
>>>>>>>>>>> indicate the
>>>>>>>>>>> reason the change. I just need guidance from maintainers so that
>>>>>>>>>>> whatever work I do leads to something acceptable for y'all.
>>>>>>>>>>>
>>>>>>>>>>> So, please help me get this done. I am just a user with broken
>>>>>>>>>>> hardware
>>>>>>>>>>> and a fondness for Linux. I would love to help do what's needed
>>>>>>>>>>> to get
>>>>>>>>>>> this fix landed.
>>>>>>>>>>
>>>>>>>>>> Ive also objected to that change, in fact the whole shutdown
>>>>>>>>>> sequence
>>>>>>>>>> is sort of bogus in my opinion and the driver shall really
>>>>>>>>>> have some
>>>>>>>>>> means to find out what mode it is in when it reboots,
>>>>>>>>>> regardless if
>>>>>>>>>> cold or warm boot, since otherwise we are in trouble if the
>>>>>>>>>> user is
>>>>>>>>>> booting from another OS that doesn't do the expected shutdown
>>>>>>>>>> sequence.
>>>>>>>>>
>>>>>>>>> This criticism makes a ton of sense. I'm sorry I missed it before.
>>>>>>>>> There
>>>>>>>>> were a lot of threads moving in parallel. However, I am
>>>>>>>>> curious. Given
>>>>>>>>> that the patch improves the situation for users (like me). Is
>>>>>>>>> there
>>>>>>>>> any
>>>>>>>>> way we can separate the redesign of the shutdown sequence and
>>>>>>>>> the UX
>>>>>>>>> improvement that comes with this patch?
>>>>>>>>>
>>>>>>>>> Here's my concern. I am happy to do the work to redesign this.
>>>>>>>>> However,
>>>>>>>>> I don't think I have the information needed to do this since I
>>>>>>>>> don't
>>>>>>>>> have access to the technical docs for the qca6390. I am worried
>>>>>>>>> that not
>>>>>>>>> accepting some form of this patch is letting perfect be the enemy
>>>>>>>>> of the
>>>>>>>>> good. And I am not sure how I personally can help with that. If
>>>>>>>>> you
>>>>>>>>> think it's possible for me to do this without the docs for the
>>>>>>>>> hardware,
>>>>>>>>> I am willing to give it a shot if I can get some guidance.
>>>>>>>>> Honestly, I
>>>>>>>>> wish I had the skill to be confident about a change like this, but
>>>>>>>>> I don't.
>>>>>>>>>
>>>>>>>>> Any ideas on how to move forward would be greatly appreciated.
>>>>>>>>>
>>>>>>>>> And just to be perfectly clear, I have tested this patch on my
>>>>>>>>> laptop.
>>>>>>>>> It greatly enhances my ability to use my hardware since I can
>>>>>>>>> reboot the
>>>>>>>>> machine without having to make sure to power cycle the laptop.
>>>>>>>>> This is
>>>>>>>>> not a theoretical improvement.
>>>>>>>>
>>>>>>>> I would really love some explanation why can't the driver know
>>>>>>>> under
>>>>>>>> what mode the controller is when it gets probed, because to me we
>>>>>>>> cannot accept a driver that only works under certain condition
>>>>>>>> after
>>>>>>>> the boot and in case it is really impossible, can't even power
>>>>>>>> cycle
>>>>>>>> it to get it back to cold boot stage???
>>>>>>>
>>>>>>> This is a great technical criticism of the driver, and I think you
>>>>>>> deserve that explanation.
>>>>>>>
>>>>>>> However, with the driver already in the kernel, shouldn't the
>>>>>>> bias be
>>>>>>> toward mitigating the extremely bad UX and not hold users hostage
>>>>>>> for
>>>>>>> the bad design which has already been approved and landed in the
>>>>>>> kernel?
>>>>>>>
>>>>>>>> Also the criticism here should be directed to the vendor, how long
>>>>>>>> have we been discussing problems in the QCA driver? And the only
>>>>>>>> thing
>>>>>>>> I see coming our way are work-arounds of the problems, the
>>>>>>>> address not
>>>>>>>> being unique coming from the firmware itself and when provided
>>>>>>>> via DT
>>>>>>>> the address is in the wrong byteorder and now that the driver must
>>>>>>>> communicate the firmware on shutdown in order to get it working
>>>>>>>> properly on the next boot.
>>>>>>>
>>>>>>> I agree that Qualcomm should get flack for this, however, the UX
>>>>>>> problem
>>>>>>> can be mitigated with a logic fix that doesn't make the
>>>>>>> init/shutdown
>>>>>>> design problem any worse than it currently seems to be. I mean,
>>>>>>> wouldn't
>>>>>>> this logic have to exist somewhere even if it weren't the
>>>>>>> shutdown path?
>>>>>>>
>>>>>>> If you are trying to use this as leverage to get Qualcomm to do a
>>>>>>> bigger
>>>>>>> thing (redesign the init/shutdown logic), I do think that tactic
>>>>>>> needlessly puts users in the crossfire. I can totally understand why
>>>>>>> you'd do it. I am just suffering the crossfire in the meantime,
>>>>>>> and it
>>>>>>> doesn't feel great.
>>>>>>
>>>>>> So you prefer to risk getting a kernel crash due to UAF over
>>>>>> Bluetooth
>>>>>> not working? Really? Because I haven't seen any configuration that
>>>>>> those changes you tested don't reintroduce the UAF, which is why I
>>>>>> haven't applied that change to begin with, so no I'm not holding back
>>>>>> to pressure Qualcomm to redesign the shutdown logic, it just these
>>>>>> things got entangled because I just realized the shutdown thingy is
>>>>>> really out of place, imo, but I'd be fine if there is a temporary fix
>>>>>> until someone finally decide to spend some time to really fix the
>>>>>> shutdown logic.
>>>>>
>>>>> Luiz, I'm sorry. I do not want a crash instead. I didn't understand
>>>>> that
>>>>> the solution I proposed (i.e. adding Zijun's logic without removing
>>>>> KK's
>>>>> logic) would introduce a new crash opportunity. I previously
>>>>> thought you
>>>>> were saying one of the following things:
>>>>> 1. The crash opportunity already existed due the init/shudown
>>>>> sequences.
>>>>> 2. The crash opportunity already existed due the init/shudown
>>>>> sequences
>>>>> when removing KK's logic.
>>>>>
>>>>> If it was #1, I was hoping that adding the logic would make the
>>>>> risk no
>>>>> worse.
>>>>>
>>>>> If it was #2, I was hoping that my suggestion of adding Zijun's logic
>>>>> without removing KK's logic might represent an acceptable middleground
>>>>> for a temporary fix that would "correct" the logic without
>>>>> introducing a
>>>>> new crash opportunity.
>>>>>
>>>>> I feel like I may not be clear about what I mean by adding Zijun's
>>>>> logic
>>>>> and not removing KK's logic. Maybe something like this diff:
>>>>>
>>>>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
>>>>> index 2f7ae38d85eb..fcac44ae7898 100644
>>>>> --- a/drivers/bluetooth/hci_qca.c
>>>>> +++ b/drivers/bluetooth/hci_qca.c
>>>>> @@ -2456,6 +2456,10 @@ static void qca_serdev_shutdown(struct
>>>>> device *dev)
>>>>> !test_bit(HCI_RUNNING, &hdev->flags))
>>>>> return;
>>>>>
>>>>> + if (test_bit(HCI_QUIRK_NON_PERSISTENT_SETUP,
>>>>> &hdev->quirks) ||
>>>>> + hci_dev_test_flag(hdev, HCI_SETUP))
>>>>> + return;
>>>>> +
>>>>> serdev_device_write_flush(serdev);
>>>>> ret = serdev_device_write_buf(serdev, ibs_wake_cmd,
>>>>>
>>>>> sizeof(ibs_wake_cmd));
>>>>>
>>>>> I think this diff is mangled due to using Thunderbird, but I hope this
>>>>> helps convey what I was asking about.
>>>>>
>>>>> If I am understanding you correctly now, you are saying that simply
>>>>> introducing Zijun's logic (without removing KK's logic) will
>>>>> introduce a
>>>>> new crash opportunity. Is that correct?
>>>>>
>>>>
>>>> as Zijun declared. i believe Zijun's change will solve both this
>>>> reported regression issue and the use-after-free(crash) issue.
>>>
>>> I did see Zijun's claim of that. However, I think that both KK and Luiz
>>> are not convinced by the explanation. Also, if that explanation does
>>> convince KK and Luiz, I think that the explanation needs to be added to
>>> the commit message.
>>>
>>> I'm hoping that Luiz will at least respond to the middleground I
>>> proposed as a workaround.
>>
>> I recall suggesting using HCI_UART_PROTO_READY instead since that
>> tells when serdev_device_close has been run:
>>
>> diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c
>> index 0c9c9ee56592..bbbc86d4932a 100644
>> --- a/drivers/bluetooth/hci_qca.c
>> +++ b/drivers/bluetooth/hci_qca.c
>> @@ -2455,8 +2455,8 @@ static void qca_serdev_shutdown(struct device *dev)
>> const u8 edl_reset_soc_cmd[] = { 0x01, 0x00, 0xFC, 0x01, 0x05 };
>>
>> if (qcadev->btsoc_type == QCA_QCA6390) {
>> - if (test_bit(QCA_BT_OFF, &qca->flags) ||
>> - !test_bit(HCI_RUNNING, &hdev->flags))
>> + /* Check if serdev_device_close() has already been
>> called. */
>> + if (!test_bit(HCI_UART_PROTO_READY, &hu->flags))
>> return;
>>
>> serdev_device_write_flush(serdev);
>
> I am testing this right now. I will let you know how it goes.
I applied your patch and tested a number of scenarios. Here are the
results. Apologies for any bad line wraps. I guess I haven't managed to
tame Thunderbird yet:
warm boot:
unpatched kernel to patched kernel : doesn't work
patched kernel to patched kernel : works
patched kernel tests:
starting from cold boot:
works
disable/re-enable bt by plasma UI:
works
restart bluetooth service:
works
unload/re-load hci_uart and bluetooth service:
doesn't work after reload; can't enable BT
I think this worked with Zijun's commit, but I'm not 100% sure.
disable bluetooth service, warm boot, start bluetooth service:
fails to connect to devices
What follows are logs for the non-working cases that only involve the
patched kernel:
logs for unload/re-load hci_uart and bluetooth service test when
reloading the module:
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: HCI UART driver
ver 2.3
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: HCI UART
protocol H4 registered
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: HCI UART
protocol QCA registered
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: setting
up ROME/QCA6390
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: QCA
Product ID :0x00000010
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: QCA SOC
Version :0x400a0200
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: QCA ROM
Version :0x00000200
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: QCA Patch
Version:0x00003ac0
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: QCA
controller version 0x02000200
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: QCA
Downloading qca/htbtfw20.tlv
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: QCA
Failed to send TLV segment (-110)
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: QCA
Failed to download patch (-110)
May 14 21:42:34 braindead.localdomain kernel: Bluetooth: hci0: Retry BT
power ON:0
May 14 21:42:37 braindead.localdomain kernel: Bluetooth: hci0: command
0xfc00 tx timeout
May 14 21:42:37 braindead.localdomain kernel: Bluetooth: hci0: Reading
QCA version information failed (-110)
May 14 21:42:37 braindead.localdomain kernel: Bluetooth: hci0: Retry BT
power ON:1
May 14 21:42:39 braindead.localdomain kernel: Bluetooth: hci0: command
0xfc00 tx timeout
May 14 21:42:39 braindead.localdomain kernel: Bluetooth: hci0: Reading
QCA version information failed (-110)
May 14 21:42:39 braindead.localdomain kernel: Bluetooth: hci0: Retry BT
power ON:2
May 14 21:42:41 braindead.localdomain kernel: Bluetooth: hci0: command
0xfc00 tx timeout
May 14 21:42:41 braindead.localdomain kernel: Bluetooth: hci0: Reading
QCA version information failed (-110)
logs for disable bluetooth service, warm boot, start bluetooth service
manually test starting at the point of starting bluetooth service:
May 14 21:45:53 braindead.localdomain kernel: Bluetooth: hci0:
hci_devcd_init Return:-95
May 14 21:46:59 braindead.localdomain kernel: Bluetooth: BNEP (Ethernet
Emulation) ver 1.3
May 14 21:46:59 braindead.localdomain kernel: Bluetooth: BNEP filters:
protocol multicast
May 14 21:46:59 braindead.localdomain kernel: Bluetooth: BNEP socket
layer initialized
May 14 21:46:59 braindead.localdomain kernel: Bluetooth: MGMT ver 1.22
May 14 21:46:59 braindead.localdomain kernel: Bluetooth: RFCOMM TTY
layer initialized
May 14 21:46:59 braindead.localdomain kernel: Bluetooth: RFCOMM socket
layer initialized
May 14 21:46:59 braindead.localdomain kernel: Bluetooth: RFCOMM ver 1.11
May 14 21:47:01 braindead.localdomain kernel: Bluetooth: hci0: Opcode
0x0c03 failed: -110
May 14 21:47:01 braindead.localdomain kernel: Bluetooth: hci0:
mem_dump_status: 2
May 14 21:47:03 braindead.localdomain kernel: Bluetooth: hci0: Opcode
0x0c03 failed: -110
May 14 21:47:13 braindead.localdomain kernel: Bluetooth: hci0: Opcode
0x0c03 failed: -110
So, I noticed that your patch doesn't include the quirk conditional from
Zijun's commit. I am assuming that that was intentional. However, I
wanted to point it out in case that was somehow material to the results.
Please let me know what else I can do to help. And again, I really
appreciate your effort on this.
wt
--
You're more amazing than you think!
next prev parent reply other threads:[~2024-05-15 5:08 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-06 19:21 path to landing patch to fix warm boot issue for qca6390 Wren Turkal
2024-05-06 19:49 ` Luiz Augusto von Dentz
2024-05-10 19:13 ` Wren Turkal
2024-05-10 19:48 ` Luiz Augusto von Dentz
2024-05-10 20:54 ` Wren Turkal
2024-05-10 21:25 ` Luiz Augusto von Dentz
2024-05-10 22:57 ` Paul Menzel
2024-05-10 23:20 ` Luiz Augusto von Dentz
2024-05-10 23:33 ` Wren Turkal
2024-05-11 6:25 ` Lk Sii
2024-05-13 20:13 ` Wren Turkal
2024-05-13 20:46 ` Luiz Augusto von Dentz
2024-05-14 22:17 ` Wren Turkal
2024-05-15 5:08 ` Wren Turkal [this message]
2024-05-07 6:27 ` Krzysztof Kozlowski
2024-05-07 14:34 ` Lk Sii
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=452ec12b-b6bc-483e-a766-e5a1a09b8b3a@penguintechs.org \
--to=wt@penguintechs.org \
--cc=bartosz.golaszewski@linaro.org \
--cc=krzysztof.kozlowski@linaro.org \
--cc=linux-bluetooth@vger.kernel.org \
--cc=lk_sii@163.com \
--cc=luiz.dentz@gmail.com \
--cc=quic_zijuhu@quicinc.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).