Linux-RDMA Archive mirror
From: Przemek Kitszel <przemyslaw.kitszel@intel.com>
To: Greg KH <gregkh@linuxfoundation.org>, Shay Drory <shayd@nvidia.com>
Cc: <netdev@vger.kernel.org>, <pabeni@redhat.com>,
	<davem@davemloft.net>, <kuba@kernel.org>, <edumazet@google.com>,
	<david.m.ertman@intel.com>, <rafael@kernel.org>,
	<ira.weiny@intel.com>, <linux-rdma@vger.kernel.org>,
	<leon@kernel.org>, <tariqt@nvidia.com>,
	Parav Pandit <parav@nvidia.com>
Subject: Re: [PATCH net-next v4 1/2] driver core: auxiliary bus: show auxiliary device IRQs
Date: Fri, 10 May 2024 14:54:49 +0200
Message-ID: <22533dbb-3be9-4ff2-9b59-b3d6a650f7b3@intel.com>
In-Reply-To: <2024051056-encrypt-divided-30d2@gregkh>

On 5/10/24 10:15, Greg KH wrote:
> On Thu, May 09, 2024 at 12:14:10PM +0300, Shay Drory wrote:
>> PCI subfunctions (SF) are anchored on the auxiliary bus.
> 
> "Some PCI subfunctions can be on the auxiliary bus"
> 
> Or maybe "Sometimes the auxiliary bus interface is used for PCI
> subfunctions."
> 
> Either way, the text here as-is is not correct as that is not how the
> auxbus code is always used, sorry.
> 
>> PCI physical
>> and virtual functions are anchored on the PCI bus;  the irq information
> 
> Odd use of ';'?  And an extra ' '?
> 
>> of each such function is visible to users via sysfs directory "msi_irqs"
>> containing file for each irq entry. However, for PCI SFs such information
>> is unavailable. Due to this users have no visibility on IRQs used by the
>> SFs.
> 
> Not even in /proc/irq/ ?
> 
>> Secondly, an SF is a multi function device supporting rdma, netdevice
> 
> Not "is", it should be "can be"  Not all the world is your crazy
> hardware :)
> 
>> and more. Without irq information at the bus level, the user is unable
>> to view or use the affinity of the SF IRQs.
> 
> How would affinity be relevant here?  You are just allowing them to be
> viewed, not set.
> 
>> Hence to match to the equivalent PCI PFs and VFs, add "irqs" directory,
>> for supporting auxiliary devices, containing file for each irq entry.
>>
>> Additionally, the PCI SFs sometimes share the IRQs with peer SFs. This
>> information is also not available to the users. To overcome this
>> limitation, each irq sysfs entry shows if irq is exclusive or shared.
>>
>> For example:
>> $ ls /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/
>> 50  51  52  53  54  55  56  57  58
>> $ cat /sys/bus/auxiliary/devices/mlx5_core.sf.1/irqs/52
>> exclusive
>>
>> Reviewed-by: Parav Pandit <parav@nvidia.com>
>> Signed-off-by: Shay Drory <shayd@nvidia.com>
>>
>> ---
>> v3->4:
>> - remove global mutex (Przemek)

Thanks, and sorry for not getting back in time on the v3 discussion.

>> v2->v3:
>> - fix function declaration in case SYSFS isn't defined (Parav)
>> - convert auxdev->groups array with auxiliary_irqs_groups (Przemek)
>> v1->v2:
>> - move #ifdefs from drivers/base/auxiliary.c to
>>    include/linux/auxiliary_bus.h (Greg)
>> - use EXPORT_SYMBOL_GPL instead of EXPORT_SYMBOL (Greg)
>> - Fix kzalloc(ref) to kzalloc(*ref) (Simon)
>> - Add return description in auxiliary_device_sysfs_irq_add() kdoc (Simon)
>> - Fix auxiliary_irq_mode_show doc (kernel test robot)
>> ---
>>   Documentation/ABI/testing/sysfs-bus-auxiliary |  14 ++
>>   drivers/base/auxiliary.c                      | 178 +++++++++++++++++-
>>   include/linux/auxiliary_bus.h                 |  24 ++-
>>   3 files changed, 213 insertions(+), 3 deletions(-)
>>   create mode 100644 Documentation/ABI/testing/sysfs-bus-auxiliary
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-auxiliary b/Documentation/ABI/testing/sysfs-bus-auxiliary
>> new file mode 100644
>> index 000000000000..3b8299d49d9e
>> --- /dev/null
>> +++ b/Documentation/ABI/testing/sysfs-bus-auxiliary
>> @@ -0,0 +1,14 @@
>> +What:		/sys/bus/auxiliary/devices/.../irqs/
>> +Date:		April, 2024
>> +Contact:	Shay Drory <shayd@nvidia.com>
>> +Description:
>> +		The /sys/devices/.../irqs directory contains a variable set of
>> +		files, with each file is named as irq number similar to PCI PF
>> +		or VF's irq number located in msi_irqs directory.
> 
> So this can be msi irqs?  Or not msi irqs?  How do we know?
> 
> 
>> +
>> +What:		/sys/bus/auxiliary/devices/.../irqs/<N>
>> +Date:		April, 2024
>> +Contact:	Shay Drory <shayd@nvidia.com>
>> +Description:
>> +		auxiliary devices can share IRQs. This attribute indicates if
>> +		the irq is shared with other SFs or exclusively used by the SF.
>> diff --git a/drivers/base/auxiliary.c b/drivers/base/auxiliary.c
>> index d3a2c40c2f12..def02f5f1220 100644
>> --- a/drivers/base/auxiliary.c
>> +++ b/drivers/base/auxiliary.c
>> @@ -158,6 +158,176 @@
>>    *	};
>>    */
>>   
>> +#ifdef CONFIG_SYSFS
>> +/* Xarray of irqs to determine if irq is exclusive or shared. */
>> +static DEFINE_XARRAY(irqs);
>> +
>> +struct auxiliary_irq_info {
>> +	struct device_attribute sysfs_attr;
>> +	int irq;
>> +};
>> +
>> +static struct attribute *auxiliary_irq_attrs[] = {
>> +	NULL
>> +};
>> +
>> +static const struct attribute_group auxiliary_irqs_group = {
>> +	.name = "irqs",
>> +	.attrs = auxiliary_irq_attrs,
>> +};
>> +
>> +static const struct attribute_group *auxiliary_irqs_groups[2] = {
> 
> Why list the array size?
> 
>> +	&auxiliary_irqs_group,
>> +	NULL
>> +};
>> +
>> +/* Auxiliary devices can share IRQs. Expose to user whether the provided IRQ is
>> + * shared or exclusive.
>> + */
>> +static ssize_t auxiliary_irq_mode_show(struct device *dev,
>> +				       struct device_attribute *attr, char *buf)
>> +{
>> +	struct auxiliary_irq_info *info =
>> +		container_of(attr, struct auxiliary_irq_info, sysfs_attr);
>> +
>> +	if (refcount_read(xa_load(&irqs, info->irq)) > 1)
> 
> refcount combined with xa?  That feels wrong, why is refcount used for
> this at all?

Not long ago I commented on a similar usage in the ice driver,
~"since you are locking anyway, this could be a plain counter",
and the author replied
~"the additional semantics (like saturation) of refcount make me feel warm
and fuzzy" (sorry if I'm misquoting too much).
That convinced me back then, so I kept quiet about it here.

The "use the least powerful option" rule of thumb is perhaps more
important, though.
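
To make the plain-counter alternative concrete, here is a minimal
userspace sketch. It is not the patch's code: a fixed-size table and a
pthread mutex stand in for the xarray and xa_lock(), and the
irq_users_*() names and MAX_IRQS are hypothetical, chosen only for
illustration. It shows that once every access is made under one lock,
a plain unsigned counter is enough to answer "shared or exclusive".

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical model: a small fixed table stands in for the xarray. */
#define MAX_IRQS 256

static pthread_mutex_t irq_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int irq_users[MAX_IRQS];	/* 0 means "not tracked" */

/* Record one more user of @irq; returns 0 or a negative errno. */
static int irq_users_get(int irq)
{
	if (irq < 0 || irq >= MAX_IRQS)
		return -EINVAL;

	pthread_mutex_lock(&irq_lock);
	irq_users[irq]++;
	pthread_mutex_unlock(&irq_lock);
	return 0;
}

/* Drop one user of @irq; returns 0 or a negative errno. */
static int irq_users_put(int irq)
{
	int ret = 0;

	if (irq < 0 || irq >= MAX_IRQS)
		return -EINVAL;

	pthread_mutex_lock(&irq_lock);
	if (irq_users[irq])
		irq_users[irq]--;
	else
		ret = -ENOENT;	/* unbalanced put: report it, do not wrap */
	pthread_mutex_unlock(&irq_lock);
	return ret;
}

/* String a show() callback would emit; NULL if nobody uses @irq. */
static const char *irq_users_mode(int irq)
{
	unsigned int users;

	if (irq < 0 || irq >= MAX_IRQS)
		return NULL;

	pthread_mutex_lock(&irq_lock);
	users = irq_users[irq];
	pthread_mutex_unlock(&irq_lock);

	if (!users)
		return NULL;
	return users > 1 ? "shared" : "exclusive";
}
```

In this model there is no allocation on the "get" path and no
compare-and-exchange retry, because the lock already serializes the
lookup and the update. What is lost relative to refcount_t is the
saturation behaviour; the sketch only guards against an unbalanced put.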

@Greg, WDYT?

> 
>> +		return sysfs_emit(buf, "%s\n", "shared");
>> +	else
>> +		return sysfs_emit(buf, "%s\n", "exclusive");
>> +}
>> +
>> +static void auxiliary_irq_destroy(int irq)
>> +{
>> +	refcount_t *ref;
>> +
>> +	xa_lock(&irqs);
>> +	ref = xa_load(&irqs, irq);
>> +	if (refcount_dec_and_test(ref)) {
>> +		__xa_erase(&irqs, irq);
>> +		kfree(ref);
>> +	}
>> +	xa_unlock(&irqs);
>> +}
>> +
>> +static int auxiliary_irq_create(int irq)
>> +{
>> +	refcount_t *new_ref = kzalloc(sizeof(*new_ref), GFP_KERNEL);
>> +	refcount_t *ref;
>> +	int ret = 0;
>> +
>> +	if (!new_ref)
>> +		return -ENOMEM;
>> +
>> +	xa_lock(&irqs);
>> +	ref = xa_load(&irqs, irq);
>> +	if (ref) {
>> +		kfree(new_ref);
>> +		refcount_inc(ref);
> 
> Why do you need to use refcounts for these?  What does that help out
> with?
> 
>> +		goto out;
>> +	}
>> +
>> +	refcount_set(new_ref, 1);
>> +	ref = __xa_cmpxchg(&irqs, irq, NULL, new_ref, GFP_KERNEL);
>> +	if (ref) {
>> +		kfree(new_ref);
>> +		if (xa_is_err(ref)) {
>> +			ret = xa_err(ref);
>> +			goto out;
>> +		}
>> +
>> +		/* Another thread beat us to creating the entry. */
>> +		refcount_inc(ref);
> 
> How can that happen?  Why not just use a normal simple lock for all of
> this so you don't have to mess with refcounts at all?  This is not
> performance-relevant code at all, but yet with a refcount you cause
> almost the same issues that a normal lock would have, plus the increased
> complexity of all of the surrounding code (like this, and the crazy
> __xa_cmpxchg() call)
> 
> Make this simple please.

I find the current xarray API not ideal for this use case and would like
to fix it, but let me write a proper RFC so as not to derail (or slow
down) this series.

> 
> 
>> +	}
>> +
>> +out:
>> +	xa_unlock(&irqs);
>> +	return ret;
>> +}
>> +
>> +/**
>> + * auxiliary_device_sysfs_irq_add - add a sysfs entry for the given IRQ
>> + * @auxdev: auxiliary bus device to add the sysfs entry.
>> + * @irq: The associated Linux interrupt number.
>> + *
>> + * This function should be called after auxiliary device have successfully
>> + * received the irq.
>> + *
>> + * Return: zero on success or an error code on failure.
>> + */
>> +int auxiliary_device_sysfs_irq_add(struct auxiliary_device *auxdev, int irq)
>> +{
>> +	struct device *dev = &auxdev->dev;
>> +	struct auxiliary_irq_info *info;
>> +	int ret;
>> +
>> +	ret = auxiliary_irq_create(irq);
>> +	if (ret)
>> +		return ret;
>> +
>> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
>> +	if (!info) {
>> +		ret = -ENOMEM;
>> +		goto info_err;
>> +	}
>> +
>> +	sysfs_attr_init(&info->sysfs_attr.attr);
>> +	info->sysfs_attr.attr.name = kasprintf(GFP_KERNEL, "%d", irq);
>> +	if (!info->sysfs_attr.attr.name) {
>> +		ret = -ENOMEM;
>> +		goto name_err;
>> +	}
>> +	info->irq = irq;
>> +	info->sysfs_attr.attr.mode = 0444;
>> +	info->sysfs_attr.show = auxiliary_irq_mode_show;
>> +
>> +	ret = xa_insert(&auxdev->irqs, irq, info, GFP_KERNEL);
>> +	if (ret)
>> +		goto auxdev_xa_err;
>> +
>> +	ret = sysfs_add_file_to_group(&dev->kobj, &info->sysfs_attr.attr,
>> +				      auxiliary_irqs_group.name);
> 
> Adding dynamic sysfs attributes like this means that you normally just
> raced with userspace and lost.  How are you ensuring that you did not
> just do that?
> 
>> +/**
>> + * auxiliary_device_sysfs_irq_remove - remove a sysfs entry for the given IRQ
>> + * @auxdev: auxiliary bus device to add the sysfs entry.
>> + * @irq: the IRQ to remove.
>> + *
>> + * This function should be called to remove an IRQ sysfs entry.
>> + */
>> +void auxiliary_device_sysfs_irq_remove(struct auxiliary_device *auxdev, int irq)
>> +{
>> +	struct auxiliary_irq_info *info = xa_load(&auxdev->irqs, irq);
>> +	struct device *dev = &auxdev->dev;
>> +
>> +	if (WARN_ON(!info))
> 
> How can this ever happen?  If not, don't check for it please.  If it can
> happen, properly handle it and move on, don't reboot the box.
> 
> thanks,
> 
> greg k-h
> 


