[PATCH v2 0/3] Introduce per NUMA node memory error statistics

* [PATCH v2 0/3] Introduce per NUMA node memory error statistics
@ 2023-01-20  3:46 Jiaqi Yan
  2023-01-20  3:46 ` [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs Jiaqi Yan
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Jiaqi Yan @ 2023-01-20  3:46 UTC
  To: tony.luck, naoya.horiguchi
  Cc: jiaqiyan, duenwen, rientjes, linux-mm, shy828301, akpm,
	wangkefeng.wang

Background
==========
In the RFC for Kernel Support of Memory Error Detection [1], one advantage
of software-based scanning over the hardware patrol scrubber is the ability
to make statistics visible to system administrators. The statistics fall
into two categories:
* Memory error statistics, for example, how many memory errors are
  encountered and how many of them are recovered by the kernel. Note that
  these memory errors are non-fatal to the kernel: during machine check
  exception (MCE) handling, the kernel has already classified the MCE's
  severity as not requiring a panic (it is either action required or
  action optional).
* Scanner statistics, for example, how many times the scanner has fully
  scanned a NUMA node and how many errors were first detected by the
  scanner.

The memory error statistics are useful to userspace and actually not
specific to scanner detected memory errors, and are the focus of this
patchset.

Motivation
==========
Memory error stats are important to userspace, but what the kernel offers
today is insufficient. Datacenter administrators can better monitor a
machine's memory health when these stats are visible. For example, while
memory errors are inevitable on servers with 10+ TB of memory, starting
server maintenance when there are only 1~2 recovered memory errors could
be an overreaction; in a cloud production environment, maintenance usually
means live-migrating all the workloads running on the server, which
usually causes nontrivial disruption to the customer. Providing insight
into the scope of memory errors on a system helps to determine the
appropriate follow-up action. In addition, the kernel's existing memory
error stats need to be standardized so that userspace can reliably depend
on them.

Today the kernel provides the following memory error info to userspace,
but each source is insufficient or has disadvantages:
* HardwareCorrupted in /proc/meminfo: the total amount of memory poisoned,
  but not broken down per NUMA node
* ras:memory_failure_event: only available after being explicitly enabled
* /dev/mcelog: provides much useful info about the MCEs, but doesn't
  capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text

Exposing memory error stats is also a good start for the in-kernel memory
error detector. Today the data sources of memory error stats are either
direct memory error consumption or hardware patrol scrubber detection
(signaled as either UCNA or SRAO). Once the in-kernel memory scanner is
implemented, it will be the main source, as it is usually configured to
scan memory DIMMs constantly and faster than the hardware patrol scrubber.

How Implemented
===============
As Naoya pointed out [2], exposing memory error statistics to userspace
is useful independent of any software or hardware scanner. Therefore we
implement the memory error statistics independently of the in-kernel
memory error detector. This patchset exposes the following per NUMA node
memory error counters:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and, after the
kernel's recovery attempts, how they were resolved: how many are
recovered, ignored, failed, or delayed respectively. This approach can be
easier to extend for future use cases than /proc/meminfo, trace events,
and logs. The following identity holds for the statistics:
* total = recovered + ignored + failed + delayed
These memory error stats are reset during machine boot.
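
As an illustration of how userspace might consume these counters, here is
a minimal sketch in C. It is not part of the patchset: the helper names
(mf_read_node, mf_consistent) and the struct are hypothetical, and the
base directory is a parameter only so the helper can be pointed at a test
tree. The five files are read one after another, not atomically, so a
memory failure handled between two reads could make the relation above
appear violated for a moment.

```c
#include <assert.h>
#include <stdio.h>

/* Counter names as listed in this cover letter; index 0 is "total". */
enum { MF_NR = 5 };
static const char *const mf_names[MF_NR] = {
	"total", "recovered", "ignored", "failed", "delayed",
};

/* Hypothetical container, values stored in the same order as mf_names. */
struct mf_counters {
	unsigned long v[MF_NR];
};

/*
 * Read the five counters of node @nid below @base, which is normally
 * "/sys/devices/system/node". Returns 0 on success, -1 if any file is
 * missing or cannot be parsed.
 */
static int mf_read_node(const char *base, int nid, struct mf_counters *out)
{
	char path[256];

	assert(base != NULL && out != NULL);
	for (int i = 0; i < MF_NR; i++) {
		int n = snprintf(path, sizeof(path),
				 "%s/node%d/memory_failure/%s",
				 base, nid, mf_names[i]);
		if (n < 0 || (size_t)n >= sizeof(path))
			return -1;

		FILE *f = fopen(path, "r");
		if (!f)
			return -1;
		int ok = fscanf(f, "%lu", &out->v[i]);
		fclose(f);
		if (ok != 1)
			return -1;
	}
	return 0;
}

/* Check total == recovered + ignored + failed + delayed. */
static int mf_consistent(const struct mf_counters *c)
{
	return c->v[0] == c->v[1] + c->v[2] + c->v[3] + c->v[4];
}
```

A monitoring agent could call mf_read_node("/sys/devices/system/node", 0, &c)
for each online node and treat a failed mf_consistent() check as a reason
to re-read, not as an error.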

The 1st commit introduces these sysfs entries. The 2nd commit populates
the memory error stats every time memory_failure attempts memory error
recovery. The 3rd commit adds documentation for the introduced stats.

[1] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#mc22959244f5388891c523882e61163c6e4d703af
[2] https://lore.kernel.org/linux-mm/7E670362-C29E-4626-B546-26530D54F937@gmail.com/T/#m52d8d7a333d8536bd7ce74253298858b1c0c0ac6

Changelog

v2 changes:
- Incorporate feedback from Andrew Morton <akpm@linux-foundation.org>
  and Horiguchi Naoya <naoya.horiguchi@nec.com>.
- Correction in cover letter: both UCNA and SRAO are handled by
  memory_failure().
- Rename `pages_poisoned` to `total`.
- Remove the "pages_" prefix from counter names.
- Correction in cover letter and commit message:
  `total` * PAGE_SIZE * #nodes is not exactly equal to
  /proc/meminfo/HardwareCorrupted because some cases are not accounted.

Jiaqi Yan (3):
  mm: memory-failure: Add memory failure stats to sysfs
  mm: memory-failure: Bump memory failure stats to pglist_data
  mm: memory-failure: Document memory failure stats

 Documentation/ABI/stable/sysfs-devices-node | 39 +++++++++++
 drivers/base/node.c                         |  3 +
 include/linux/mm.h                          |  5 ++
 include/linux/mmzone.h                      | 28 ++++++++
 mm/memory-failure.c                         | 71 +++++++++++++++++++++
 5 files changed, 146 insertions(+)

-- 
2.39.0.246.g2a6d74b583-goog

* [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs
  2023-01-20  3:46 [PATCH v2 0/3] Introduce per NUMA node memory error statistics Jiaqi Yan
@ 2023-01-20  3:46 ` Jiaqi Yan
  2023-01-23  2:42   ` HORIGUCHI NAOYA(堀口　直也)
  2023-02-02  6:54   ` Kefeng Wang
  2023-01-20  3:46 ` [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data Jiaqi Yan
  2023-01-20  3:46 ` [PATCH v2 3/3] mm: memory-failure: Document memory failure stats Jiaqi Yan
  2 siblings, 2 replies; 11+ messages in thread
From: Jiaqi Yan @ 2023-01-20  3:46 UTC
  To: tony.luck, naoya.horiguchi
  Cc: jiaqiyan, duenwen, rientjes, linux-mm, shy828301, akpm,
	wangkefeng.wang

Today the kernel provides the following memory error info to userspace,
but each source has its own disadvantage:
* HardwareCorrupted in /proc/meminfo: the total amount of memory poisoned,
  but not broken down per NUMA node
* ras:memory_failure_event: only available after being explicitly enabled
* /dev/mcelog: provides much useful info about the MCEs, but
  doesn't capture how memory_failure recovered memory MCEs
* kernel logs: userspace needs to process log text

Expose per NUMA node memory error stats as sysfs entries:

  /sys/devices/system/node/node${X}/memory_failure/total
  /sys/devices/system/node/node${X}/memory_failure/recovered
  /sys/devices/system/node/node${X}/memory_failure/ignored
  /sys/devices/system/node/node${X}/memory_failure/failed
  /sys/devices/system/node/node${X}/memory_failure/delayed

These counters describe how many raw pages are poisoned and, after the
kernel's recovery attempts, how they were resolved: how many are
recovered, ignored, failed, or delayed respectively. The following
identity holds for the statistics:
* total = recovered + ignored + failed + delayed

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 drivers/base/node.c    |  3 +++
 include/linux/mm.h     |  5 +++++
 include/linux/mmzone.h | 28 ++++++++++++++++++++++++++++
 mm/memory-failure.c    | 35 +++++++++++++++++++++++++++++++++++
 4 files changed, 71 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index faf3597a96da..b46db17124f3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -586,6 +586,9 @@ static const struct attribute_group *node_dev_groups[] = {
 	&node_dev_group,
 #ifdef CONFIG_HAVE_ARCH_NODE_DEV_GROUP
 	&arch_node_dev_group,
+#endif
+#ifdef CONFIG_MEMORY_FAILURE
+	&memory_failure_attr_group,
 #endif
 	NULL
 };
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..888576884eb9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3521,6 +3521,11 @@ enum mf_action_page_type {
 	MF_MSG_UNKNOWN,
 };
 
+/*
+ * Sysfs entries for memory failure handling statistics.
+ */
+extern const struct attribute_group memory_failure_attr_group;
+
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
 extern void clear_huge_page(struct page *page,
 			    unsigned long addr_hint,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cd28a100d9e4..2c537b31fa7b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1110,6 +1110,31 @@ struct deferred_split {
 };
 #endif
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Per NUMA node memory failure handling statistics.
+ */
+struct memory_failure_stats {
+	/*
+	 * Number of raw pages poisoned.
+	 * Cases not accounted: memory outside kernel control, offline page,
+	 * arch-specific memory_failure (SGX), hwpoison_filter() filtered
+	 * error events, and unpoison actions from hwpoison_unpoison.
+	 */
+	unsigned long total;
+	/*
+	 * Recovery results of poisoned raw pages handled by memory_failure,
+	 * in sync with mf_result.
+	 * total = ignored + failed + delayed + recovered.
+	 * total * PAGE_SIZE * #nodes = /proc/meminfo/HardwareCorrupted.
+	 */
+	unsigned long ignored;
+	unsigned long failed;
+	unsigned long delayed;
+	unsigned long recovered;
+};
+#endif
+
 /*
  * On NUMA machines, each NUMA node would have a pg_data_t to describe
  * it's memory layout. On UMA machines there is a single pglist_data which
@@ -1253,6 +1278,9 @@ typedef struct pglist_data {
 #ifdef CONFIG_NUMA
 	struct memory_tier __rcu *memtier;
 #endif
+#ifdef CONFIG_MEMORY_FAILURE
+	struct memory_failure_stats mf_stats;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c77a9e37e27e..c628f1db3a4d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -87,6 +87,41 @@ inline void num_poisoned_pages_sub(unsigned long pfn, long i)
 		memblk_nr_poison_sub(pfn, i);
 }
 
+/**
+ * MF_ATTR_RO - Create sysfs entry for each memory failure statistics.
+ * @_name: name of the file in the per NUMA sysfs directory.
+ */
+#define MF_ATTR_RO(_name)					\
+static ssize_t _name##_show(struct device *dev,			\
+			    struct device_attribute *attr,	\
+			    char *buf)				\
+{								\
+	struct memory_failure_stats *mf_stats =			\
+		&NODE_DATA(dev->id)->mf_stats;			\
+	return sprintf(buf, "%lu\n", mf_stats->_name);		\
+}								\
+static DEVICE_ATTR_RO(_name)
+
+MF_ATTR_RO(total);
+MF_ATTR_RO(ignored);
+MF_ATTR_RO(failed);
+MF_ATTR_RO(delayed);
+MF_ATTR_RO(recovered);
+
+static struct attribute *memory_failure_attr[] = {
+	&dev_attr_total.attr,
+	&dev_attr_ignored.attr,
+	&dev_attr_failed.attr,
+	&dev_attr_delayed.attr,
+	&dev_attr_recovered.attr,
+	NULL,
+};
+
+const struct attribute_group memory_failure_attr_group = {
+	.name = "memory_failure",
+	.attrs = memory_failure_attr,
+};
+
 /*
  * Return values:
  *   1:   the page is dissolved (if needed) and taken off from buddy,
-- 
2.39.0.246.g2a6d74b583-goog



* [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data
  2023-01-20  3:46 [PATCH v2 0/3] Introduce per NUMA node memory error statistics Jiaqi Yan
  2023-01-20  3:46 ` [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs Jiaqi Yan
@ 2023-01-20  3:46 ` Jiaqi Yan
  2023-01-23  2:42   ` HORIGUCHI NAOYA(堀口　直也)
  2023-02-02  6:56   ` Kefeng Wang
  2023-01-20  3:46 ` [PATCH v2 3/3] mm: memory-failure: Document memory failure stats Jiaqi Yan
  2 siblings, 2 replies; 11+ messages in thread
From: Jiaqi Yan @ 2023-01-20  3:46 UTC
  To: tony.luck, naoya.horiguchi
  Cc: jiaqiyan, duenwen, rientjes, linux-mm, shy828301, akpm,
	wangkefeng.wang

Right before memory_failure finishes its handling, add the poisoned
page's resolution to the memory_failure_stats counters in pglist_data, so
that the corresponding sysfs entries are updated.

Tested:
1) Start an application to allocate memory buffer chunks
2) Convert random memory buffer addresses to physical addresses
3) Inject memory errors using EINJ at chosen physical addresses
4) Access poisoned memory buffer and recover from SIGBUS
5) Check counter values under
   /sys/devices/system/node/node*/memory_failure/*
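
Step 2 above can be sketched in C as follows. This is an assumption about
one common way to do the translation, via /proc/self/pagemap; the commit
message does not say which method was used. The helper names are
hypothetical. The layout relied on is the documented pagemap one: one
64-bit entry per virtual page, bit 63 set when the page is present in RAM,
and bits 0-54 holding the PFN. On current kernels the PFN field reads as
zero without CAP_SYS_ADMIN, so the test program would need to run
privileged.

```c
#define _XOPEN_SOURCE 700	/* for pread() */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PFN_MASK	((UINT64_C(1) << 55) - 1)	/* bits 0-54 */
#define PM_PRESENT	(UINT64_C(1) << 63)		/* page is in RAM */

/*
 * Decode one pagemap entry into a physical address. Returns 0 when the
 * page is not present or the PFN is hidden (reads back as zero).
 */
static uint64_t pagemap_entry_to_phys(uint64_t entry, uint64_t vaddr,
				      uint64_t page_size)
{
	uint64_t pfn = entry & PM_PFN_MASK;

	assert(page_size != 0);
	if (!(entry & PM_PRESENT) || pfn == 0)
		return 0;
	return pfn * page_size + vaddr % page_size;
}

/*
 * Translate an address in the calling process. The page should have been
 * touched first so that it is actually backed by a physical page.
 * Returns 0 on any failure.
 */
static uint64_t virt_to_phys(const void *addr)
{
	uint64_t vaddr = (uint64_t)(uintptr_t)addr;
	uint64_t entry;
	long ps = sysconf(_SC_PAGESIZE);
	ssize_t n;
	int fd;

	if (ps <= 0)
		return 0;
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return 0;
	/* Entry index is the virtual page number; each entry is 8 bytes. */
	n = pread(fd, &entry, sizeof(entry),
		  (off_t)(vaddr / (uint64_t)ps * sizeof(entry)));
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return 0;
	return pagemap_entry_to_phys(entry, vaddr, (uint64_t)ps);
}
```

The physical address returned by virt_to_phys() is what would then be
handed to the error injection interface in step 3.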

Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/memory-failure.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c628f1db3a4d..f4990839ea66 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1227,6 +1227,39 @@ static struct page_state error_states[] = {
 #undef slab
 #undef reserved
 
+static void update_per_node_mf_stats(unsigned long pfn,
+				     enum mf_result result)
+{
+	int nid = MAX_NUMNODES;
+	struct memory_failure_stats *mf_stats = NULL;
+
+	nid = pfn_to_nid(pfn);
+	if (unlikely(nid < 0 || nid >= MAX_NUMNODES)) {
+		WARN_ONCE(1, "Memory failure: pfn=%#lx, invalid nid=%d", pfn, nid);
+		return;
+	}
+
+	mf_stats = &NODE_DATA(nid)->mf_stats;
+	switch (result) {
+	case MF_IGNORED:
+		++mf_stats->ignored;
+		break;
+	case MF_FAILED:
+		++mf_stats->failed;
+		break;
+	case MF_DELAYED:
+		++mf_stats->delayed;
+		break;
+	case MF_RECOVERED:
+		++mf_stats->recovered;
+		break;
+	default:
+		WARN_ONCE(1, "Memory failure: mf_result=%d is not properly handled", result);
+		break;
+	}
+	++mf_stats->total;
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1237,6 +1270,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	trace_memory_failure_event(pfn, type, result);
 
 	num_poisoned_pages_inc(pfn);
+
+	update_per_node_mf_stats(pfn, result);
+
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
-- 
2.39.0.246.g2a6d74b583-goog



* [PATCH v2 3/3] mm: memory-failure: Document memory failure stats
  2023-01-20  3:46 [PATCH v2 0/3] Introduce per NUMA node memory error statistics Jiaqi Yan
  2023-01-20  3:46 ` [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs Jiaqi Yan
  2023-01-20  3:46 ` [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data Jiaqi Yan
@ 2023-01-20  3:46 ` Jiaqi Yan
  2023-01-23  2:43   ` HORIGUCHI NAOYA(堀口　直也)
  2 siblings, 1 reply; 11+ messages in thread
From: Jiaqi Yan @ 2023-01-20  3:46 UTC
  To: tony.luck, naoya.horiguchi
  Cc: jiaqiyan, duenwen, rientjes, linux-mm, shy828301, akpm,
	wangkefeng.wang

Add documentation for memory_failure's per NUMA node sysfs entries.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 Documentation/ABI/stable/sysfs-devices-node | 39 +++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 8db67aa472f1..402af4b2b905 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -182,3 +182,42 @@ Date:		November 2021
 Contact:	Jarkko Sakkinen <jarkko@kernel.org>
 Description:
 		The total amount of SGX physical memory in bytes.
+
+What:		/sys/devices/system/node/nodeX/memory_failure/total
+Date:		January 2023
+Contact:	Jiaqi Yan <jiaqiyan@google.com>
+Description:
+		The total number of raw poisoned pages (pages containing
+		corrupted data due to memory errors) on a NUMA node.
+
+What:		/sys/devices/system/node/nodeX/memory_failure/ignored
+Date:		January 2023
+Contact:	Jiaqi Yan <jiaqiyan@google.com>
+Description:
+		Of the raw poisoned pages on a NUMA node, how many pages were
+		ignored by the memory error recovery attempt, usually because
+		support for this type of page is unavailable and the kernel
+		gives up the recovery.
+
+What:		/sys/devices/system/node/nodeX/memory_failure/failed
+Date:		January 2023
+Contact:	Jiaqi Yan <jiaqiyan@google.com>
+Description:
+		Of the raw poisoned pages on a NUMA node, how many pages
+		failed the memory error recovery attempt. This usually means
+		a key recovery operation failed.
+
+What:		/sys/devices/system/node/nodeX/memory_failure/delayed
+Date:		January 2023
+Contact:	Jiaqi Yan <jiaqiyan@google.com>
+Description:
+		Of the raw poisoned pages on a NUMA node, how many pages had
+		their memory error recovery delayed. Delayed poisoned pages
+		will usually be retried by the kernel.
+
+What:		/sys/devices/system/node/nodeX/memory_failure/recovered
+Date:		January 2023
+Contact:	Jiaqi Yan <jiaqiyan@google.com>
+Description:
+		Of the raw poisoned pages on a NUMA node, how many pages were
+		recovered by the memory error recovery attempt.
-- 
2.39.0.246.g2a6d74b583-goog



* Re: [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs
  2023-01-20  3:46 ` [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs Jiaqi Yan
@ 2023-01-23  2:42   ` HORIGUCHI NAOYA(堀口　直也)
  2023-02-02  6:54   ` Kefeng Wang
  1 sibling, 0 replies; 11+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2023-01-23  2:42 UTC
  To: Jiaqi Yan
  Cc: tony.luck@intel.com, duenwen@google.com, rientjes@google.com,
	linux-mm@kvack.org, shy828301@gmail.com,
	akpm@linux-foundation.org, wangkefeng.wang@huawei.com

On Fri, Jan 20, 2023 at 03:46:20AM +0000, Jiaqi Yan wrote:
> Today kernel provides following memory error info to userspace, but each
> has its own disadvantage
> * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
>   not per NUMA node stats though
> * ras:memory_failure_event: only available after explicitly enabled
> * /dev/mcelog provides many useful info about the MCEs, but
>   doesn't capture how memory_failure recovered memory MCEs
> * kernel logs: userspace needs to process log text
> 
> Exposes per NUMA node memory error stats as sysfs entries:
> 
>   /sys/devices/system/node/node${X}/memory_failure/total
>   /sys/devices/system/node/node${X}/memory_failure/recovered
>   /sys/devices/system/node/node${X}/memory_failure/ignored
>   /sys/devices/system/node/node${X}/memory_failure/failed
>   /sys/devices/system/node/node${X}/memory_failure/delayed
> 
> These counters describe how many raw pages are poisoned and after the
> attempted recoveries by the kernel, their resolutions: how many are
> recovered, ignored, failed, or delayed respectively. The following
> math holds for the statistics:
> * total = recovered + ignored + failed + delayed
> 
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>

Looks good to me, thank you.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

* Re: [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data
  2023-01-20  3:46 ` [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data Jiaqi Yan
@ 2023-01-23  2:42   ` HORIGUCHI NAOYA(堀口　直也)
  2023-02-02  6:56   ` Kefeng Wang
  1 sibling, 0 replies; 11+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2023-01-23  2:42 UTC
  To: Jiaqi Yan
  Cc: tony.luck@intel.com, duenwen@google.com, rientjes@google.com,
	linux-mm@kvack.org, shy828301@gmail.com,
	akpm@linux-foundation.org, wangkefeng.wang@huawei.com

On Fri, Jan 20, 2023 at 03:46:21AM +0000, Jiaqi Yan wrote:
> Right before memory_failure finishes its handling, accumulate poisoned
> page's resolution counters to pglist_data's memory_failure_stats, so as
> to update the corresponding sysfs entries.
> 
> Tested:
> 1) Start an application to allocate memory buffer chunks
> 2) Convert random memory buffer addresses to physical addresses
> 3) Inject memory errors using EINJ at chosen physical addresses
> 4) Access poisoned memory buffer and recover from SIGBUS
> 5) Check counter values under
>    /sys/devices/system/node/node*/memory_failure/*
> 
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

* Re: [PATCH v2 3/3] mm: memory-failure: Document memory failure stats
  2023-01-20  3:46 ` [PATCH v2 3/3] mm: memory-failure: Document memory failure stats Jiaqi Yan
@ 2023-01-23  2:43   ` HORIGUCHI NAOYA(堀口　直也)
  0 siblings, 0 replies; 11+ messages in thread
From: HORIGUCHI NAOYA(堀口　直也) @ 2023-01-23  2:43 UTC
  To: Jiaqi Yan
  Cc: tony.luck@intel.com, duenwen@google.com, rientjes@google.com,
	linux-mm@kvack.org, shy828301@gmail.com,
	akpm@linux-foundation.org, wangkefeng.wang@huawei.com

On Fri, Jan 20, 2023 at 03:46:22AM +0000, Jiaqi Yan wrote:
> Add documentation for memory_failure's per NUMA node sysfs entries
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

* Re: [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs
  2023-01-20  3:46 ` [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs Jiaqi Yan
  2023-01-23  2:42   ` HORIGUCHI NAOYA(堀口　直也)
@ 2023-02-02  6:54   ` Kefeng Wang
  2023-02-04 23:21     ` Jiaqi Yan
  1 sibling, 1 reply; 11+ messages in thread
From: Kefeng Wang @ 2023-02-02  6:54 UTC
  To: Jiaqi Yan, tony.luck, naoya.horiguchi
  Cc: duenwen, rientjes, linux-mm, shy828301, akpm



On 2023/1/20 11:46, Jiaqi Yan wrote:
> Today kernel provides following memory error info to userspace, but each
> has its own disadvantage
> * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
>    not per NUMA node stats though
> * ras:memory_failure_event: only available after explicitly enabled
> * /dev/mcelog provides many useful info about the MCEs, but
>    doesn't capture how memory_failure recovered memory MCEs
> * kernel logs: userspace needs to process log text
> 
> Exposes per NUMA node memory error stats as sysfs entries:
> 
>    /sys/devices/system/node/node${X}/memory_failure/total
>    /sys/devices/system/node/node${X}/memory_failure/recovered
>    /sys/devices/system/node/node${X}/memory_failure/ignored
>    /sys/devices/system/node/node${X}/memory_failure/failed
>    /sys/devices/system/node/node${X}/memory_failure/delayed
> 
> These counters describe how many raw pages are poisoned and after the
> attempted recoveries by the kernel, their resolutions: how many are
> recovered, ignored, failed, or delayed respectively. The following
> math holds for the statistics:
> * total = recovered + ignored + failed + delayed
> 
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>   drivers/base/node.c    |  3 +++
>   include/linux/mm.h     |  5 +++++
>   include/linux/mmzone.h | 28 ++++++++++++++++++++++++++++
>   mm/memory-failure.c    | 35 +++++++++++++++++++++++++++++++++++
>   4 files changed, 71 insertions(+)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index faf3597a96da..b46db17124f3 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -586,6 +586,9 @@ static const struct attribute_group *node_dev_groups[] = {
>   	&node_dev_group,
>   #ifdef CONFIG_HAVE_ARCH_NODE_DEV_GROUP
>   	&arch_node_dev_group,
> +#endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	&memory_failure_attr_group,
>   #endif
>   	NULL
>   };
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3f196e4d66d..888576884eb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3521,6 +3521,11 @@ enum mf_action_page_type {
>   	MF_MSG_UNKNOWN,
>   };
>   
> +/*
> + * Sysfs entries for memory failure handling statistics.
> + */
> +extern const struct attribute_group memory_failure_attr_group;
> +

This should move under CONFIG_MEMORY_FAILURE

>   #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>   extern void clear_huge_page(struct page *page,
>   			    unsigned long addr_hint,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cd28a100d9e4..2c537b31fa7b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1110,6 +1110,31 @@ struct deferred_split {
>   };
>   #endif
>   
> +#ifdef CONFIG_MEMORY_FAILURE
> +/*
> + * Per NUMA node memory failure handling statistics.
> + */
> +struct memory_failure_stats {
> +	/*
> +	 * Number of raw pages poisoned.
> +	 * Cases not accounted: memory outside kernel control, offline page,
> +	 * arch-specific memory_failure (SGX), hwpoison_filter() filtered
> +	 * error events, and unpoison actions from hwpoison_unpoison.
> +	 */
> +	unsigned long total;
> +	/*
> +	 * Recovery results of poisoned raw pages handled by memory_failure,
> +	 * in sync with mf_result.
> +	 * total = ignored + failed + delayed + recovered.
> +	 * total * PAGE_SIZE * #nodes = /proc/meminfo/HardwareCorrupted.
> +	 */
> +	unsigned long ignored;
> +	unsigned long failed;
> +	unsigned long delayed;
> +	unsigned long recovered;
> +};
> +#endif
> +
>   /*
>    * On NUMA machines, each NUMA node would have a pg_data_t to describe
>    * it's memory layout. On UMA machines there is a single pglist_data which
> @@ -1253,6 +1278,9 @@ typedef struct pglist_data {
>   #ifdef CONFIG_NUMA
>   	struct memory_tier __rcu *memtier;
>   #endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	struct memory_failure_stats mf_stats;
> +#endif
>   } pg_data_t;
>   
>   #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index c77a9e37e27e..c628f1db3a4d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -87,6 +87,41 @@ inline void num_poisoned_pages_sub(unsigned long pfn, long i)
>   		memblk_nr_poison_sub(pfn, i);
>   }
>   
> +/**
> + * MF_ATTR_RO - Create sysfs entry for each memory failure statistics.
> + * @_name: name of the file in the per NUMA sysfs directory.
> + */
> +#define MF_ATTR_RO(_name)					\
> +static ssize_t _name##_show(struct device *dev,			\
> +			    struct device_attribute *attr,	\
> +			    char *buf)				\
> +{								\
> +	struct memory_failure_stats *mf_stats =			\
> +		&NODE_DATA(dev->id)->mf_stats;			\
> +	return sprintf(buf, "%lu\n", mf_stats->_name);		\
> +}								\
> +static DEVICE_ATTR_RO(_name)
> +
> +MF_ATTR_RO(total);
> +MF_ATTR_RO(ignored);
> +MF_ATTR_RO(failed);
> +MF_ATTR_RO(delayed);
> +MF_ATTR_RO(recovered);
> +
> +static struct attribute *memory_failure_attr[] = {
> +	&dev_attr_total.attr,
> +	&dev_attr_ignored.attr,
> +	&dev_attr_failed.attr,
> +	&dev_attr_delayed.attr,
> +	&dev_attr_recovered.attr,
> +	NULL,
> +};
> +
> +const struct attribute_group memory_failure_attr_group = {
> +	.name = "memory_failure",
> +	.attrs = memory_failure_attr,
> +};
> +
>   /*
>    * Return values:
>    *   1:   the page is dissolved (if needed) and taken off from buddy,


* Re: [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data
  2023-01-20  3:46 ` [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data Jiaqi Yan
  2023-01-23  2:42   ` HORIGUCHI NAOYA(堀口　直也)
@ 2023-02-02  6:56   ` Kefeng Wang
  2023-02-04 23:17     ` Jiaqi Yan
  1 sibling, 1 reply; 11+ messages in thread
From: Kefeng Wang @ 2023-02-02  6:56 UTC
  To: Jiaqi Yan, tony.luck, naoya.horiguchi
  Cc: duenwen, rientjes, linux-mm, shy828301, akpm



On 2023/1/20 11:46, Jiaqi Yan wrote:
> Right before memory_failure finishes its handling, accumulate poisoned
> page's resolution counters to pglist_data's memory_failure_stats, so as
> to update the corresponding sysfs entries.
> 
> Tested:
> 1) Start an application to allocate memory buffer chunks
> 2) Convert random memory buffer addresses to physical addresses
> 3) Inject memory errors using EINJ at chosen physical addresses
> 4) Access poisoned memory buffer and recover from SIGBUS
> 5) Check counter values under
>     /sys/devices/system/node/node*/memory_failure/*
> 
> Acked-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>   mm/memory-failure.c | 36 ++++++++++++++++++++++++++++++++++++
>   1 file changed, 36 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index c628f1db3a4d..f4990839ea66 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1227,6 +1227,39 @@ static struct page_state error_states[] = {
>   #undef slab
>   #undef reserved
>   
> +static void update_per_node_mf_stats(unsigned long pfn,
> +				     enum mf_result result)
> +{
> +	int nid = MAX_NUMNODES;
> +	struct memory_failure_stats *mf_stats = NULL;
> +
> +	nid = pfn_to_nid(pfn);
> +	if (unlikely(nid < 0 || nid >= MAX_NUMNODES)) {
> +		WARN_ONCE(1, "Memory failure: pfn=%#lx, invalid nid=%d", pfn, nid);
> +		return;
> +	}
> +
...
> +	default:
> +		WARN_ONCE(1, "Memory failure: mf_result=%d is not properly handled", result);
> +		break;
> +	}

We already define pr_fmt, the "Memory failure:" prefix should be dropped.


* Re: [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data
  2023-02-02  6:56   ` Kefeng Wang
@ 2023-02-04 23:17     ` Jiaqi Yan
  0 siblings, 0 replies; 11+ messages in thread
From: Jiaqi Yan @ 2023-02-04 23:17 UTC
  To: Kefeng Wang
  Cc: tony.luck, naoya.horiguchi, duenwen, rientjes, linux-mm,
	shy828301, akpm

On Wed, Feb 1, 2023 at 10:56 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2023/1/20 11:46, Jiaqi Yan wrote:
> > Right before memory_failure finishes its handling, accumulate poisoned
> > page's resolution counters to pglist_data's memory_failure_stats, so as
> > to update the corresponding sysfs entries.
> >
> > Tested:
> > 1) Start an application to allocate memory buffer chunks
> > 2) Convert random memory buffer addresses to physical addresses
> > 3) Inject memory errors using EINJ at chosen physical addresses
> > 4) Access poisoned memory buffer and recover from SIGBUS
> > 5) Check counter values under
> >     /sys/devices/system/node/node*/memory_failure/*
> >
> > Acked-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> >   mm/memory-failure.c | 36 ++++++++++++++++++++++++++++++++++++
> >   1 file changed, 36 insertions(+)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index c628f1db3a4d..f4990839ea66 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1227,6 +1227,39 @@ static struct page_state error_states[] = {
> >   #undef slab
> >   #undef reserved
> >
> > +static void update_per_node_mf_stats(unsigned long pfn,
> > +                                  enum mf_result result)
> > +{
> > +     int nid = MAX_NUMNODES;
> > +     struct memory_failure_stats *mf_stats = NULL;
> > +
> > +     nid = pfn_to_nid(pfn);
> > +     if (unlikely(nid < 0 || nid >= MAX_NUMNODES)) {
> > +             WARN_ONCE(1, "Memory failure: pfn=%#lx, invalid nid=%d", pfn, nid);
> > +             return;
> > +     }
> > +
> ...
> > +     default:
> > +             WARN_ONCE(1, "Memory failure: mf_result=%d is not properly handled", result);
> > +             break;
> > +     }
>
> We already define pr_fmt, the "Memory failure:" prefix should be dropped.

"Should be dropped" because it would print a duplicated prefix? Does
WARN_ONCE also apply pr_fmt automatically? From reading __warn_printk,
I don't think it does.

This is what I saw in dmesg after adding
`WARN_ONCE(1, "Memory failure: pfn=%#lx\n", pfn);`
at the beginning of `update_per_node_mf_stats` (above `nid = pfn_to_nid(pfn)`):

[  523.942688] ------------[ cut here ]------------
[  523.972026] Memory failure: pfn=0x309f8f3
[  523.972038] WARNING: CPU: 4 PID: 21119 at mm/memory-failure.c:1236 action_result+0xec/0x150
[  523.972044] Modules linked in: einj vfat fat i2c_mux_pca954x i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[  523.972054] CPU: 4 PID: 21119 Comm: usemem Tainted: G S M       O     6.2.0-smp-DEV #1
[  523.972059] RIP: 0010:action_result+0xec/0x150

So there is no duplicated "Memory failure:" prefix.

But I realize I should probably end the WARN_ONCE format strings with "\n".
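
The distinction discussed above can be modelled in plain userspace C. This
is only a sketch: toy_printk, toy_pr_err, TOY_WARN_ONCE and the demo_*
helpers are hypothetical stand-ins, not kernel APIs, and the assumption
(taken from the reading of __warn_printk above) is that pr_*() wraps its
format in pr_fmt() while a WARN_ONCE-style macro passes the format through
unchanged. It relies on the GNU `##__VA_ARGS__` extension, as kernel code
does.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Same idea as a kernel source file that defines pr_fmt before logging. */
#define pr_fmt(fmt) "Memory failure: " fmt

/* Toy log sink so the formatted message can be inspected afterwards. */
static char log_buf[256];

static void toy_printk(const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	vsnprintf(log_buf, sizeof(log_buf), fmt, args);
	va_end(args);
}

/* pr_err-like macro: the format string is wrapped in pr_fmt(). */
#define toy_pr_err(fmt, ...) toy_printk(pr_fmt(fmt), ##__VA_ARGS__)

/* WARN_ONCE-like macro: format is passed through as-is, and only once. */
#define TOY_WARN_ONCE(cond, fmt, ...)				\
	do {							\
		static int warned;				\
		if ((cond) && !warned) {			\
			warned = 1;				\
			toy_printk(fmt, ##__VA_ARGS__);		\
		}						\
	} while (0)

/* Returns the message produced by the pr_err-like path. */
const char *demo_pr_err(unsigned long pfn)
{
	toy_pr_err("pfn=%#lx\n", pfn);
	return log_buf;
}

/* Returns the message produced by the WARN_ONCE-like path ("" if silent). */
const char *demo_warn(unsigned long pfn)
{
	log_buf[0] = '\0';
	TOY_WARN_ONCE(1, "pfn=%#lx\n", pfn);
	return log_buf;
}
```

In this model, demo_pr_err() yields a message with the prefix added by
pr_fmt, demo_warn() yields only what the caller wrote, and a second
demo_warn() call yields an empty string because the warning fires once.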



* Re: [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs
  2023-02-02  6:54   ` Kefeng Wang
@ 2023-02-04 23:21     ` Jiaqi Yan
  0 siblings, 0 replies; 11+ messages in thread
From: Jiaqi Yan @ 2023-02-04 23:21 UTC
  To: Kefeng Wang
  Cc: tony.luck, naoya.horiguchi, duenwen, rientjes, linux-mm,
	shy828301, akpm

On Wed, Feb 1, 2023 at 10:54 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2023/1/20 11:46, Jiaqi Yan wrote:
> > Today the kernel provides the following memory error info to userspace,
> > but each has its own disadvantage:
> > * HardwareCorrupted in /proc/meminfo: number of bytes poisoned in total,
> >    not per NUMA node stats though
> > * ras:memory_failure_event: only available after being explicitly enabled
> > * /dev/mcelog provides much useful info about the MCEs, but
> >    doesn't capture how memory_failure recovered memory MCEs
> > * kernel logs: userspace needs to process log text
> >
> > Expose per NUMA node memory error stats as sysfs entries:
> >
> >    /sys/devices/system/node/node${X}/memory_failure/total
> >    /sys/devices/system/node/node${X}/memory_failure/recovered
> >    /sys/devices/system/node/node${X}/memory_failure/ignored
> >    /sys/devices/system/node/node${X}/memory_failure/failed
> >    /sys/devices/system/node/node${X}/memory_failure/delayed
> >
> > These counters describe how many raw pages are poisoned and, after the
> > kernel's attempted recoveries, how each one was resolved: recovered,
> > ignored, failed, or delayed. The following
> > math holds for the statistics:
> > * total = recovered + ignored + failed + delayed
> >
> > Acked-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> >   drivers/base/node.c    |  3 +++
> >   include/linux/mm.h     |  5 +++++
> >   include/linux/mmzone.h | 28 ++++++++++++++++++++++++++++
> >   mm/memory-failure.c    | 35 +++++++++++++++++++++++++++++++++++
> >   4 files changed, 71 insertions(+)
> >
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index faf3597a96da..b46db17124f3 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -586,6 +586,9 @@ static const struct attribute_group *node_dev_groups[] = {
> >       &node_dev_group,
> >   #ifdef CONFIG_HAVE_ARCH_NODE_DEV_GROUP
> >       &arch_node_dev_group,
> > +#endif
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +     &memory_failure_attr_group,
> >   #endif
> >       NULL
> >   };
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f3f196e4d66d..888576884eb9 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3521,6 +3521,11 @@ enum mf_action_page_type {
> >       MF_MSG_UNKNOWN,
> >   };
> >
> > +/*
> > + * Sysfs entries for memory failure handling statistics.
> > + */
> > +extern const struct attribute_group memory_failure_attr_group;
> > +
>
> This should move under CONFIG_MEMORY_FAILURE

Thanks! I will move it around in the new version.
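
A minimal sketch of what the requested change could look like. The exact
placement within include/linux/mm.h in the next revision is not shown in
this thread, so the surrounding context here is an assumption; only the
declaration and the config symbol come from the patch.

```c
/* include/linux/mm.h (sketch; surrounding context omitted) */
#ifdef CONFIG_MEMORY_FAILURE
/*
 * Sysfs entries for memory failure handling statistics.
 */
extern const struct attribute_group memory_failure_attr_group;
#endif
```

This matches the use site in drivers/base/node.c, which already references
memory_failure_attr_group only under #ifdef CONFIG_MEMORY_FAILURE.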

>
> >   #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> >   extern void clear_huge_page(struct page *page,
> >                           unsigned long addr_hint,
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index cd28a100d9e4..2c537b31fa7b 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1110,6 +1110,31 @@ struct deferred_split {
> >   };
> >   #endif
> >
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +/*
> > + * Per NUMA node memory failure handling statistics.
> > + */
> > +struct memory_failure_stats {
> > +     /*
> > +      * Number of raw pages poisoned.
> > +      * Cases not accounted: memory outside kernel control, offline page,
> > +      * arch-specific memory_failure (SGX), hwpoison_filter() filtered
> > +      * error events, and unpoison actions from hwpoison_unpoison.
> > +      */
> > +     unsigned long total;
> > +     /*
> > +      * Recovery results of poisoned raw pages handled by memory_failure,
> > +      * in sync with mf_result.
> > +      * total = ignored + failed + delayed + recovered.
> > +      * total * PAGE_SIZE * #nodes = /proc/meminfo/HardwareCorrupted.
> > +      */
> > +     unsigned long ignored;
> > +     unsigned long failed;
> > +     unsigned long delayed;
> > +     unsigned long recovered;
> > +};
> > +#endif
> > +
> >   /*
> >    * On NUMA machines, each NUMA node would have a pg_data_t to describe
> >    * it's memory layout. On UMA machines there is a single pglist_data which
> > @@ -1253,6 +1278,9 @@ typedef struct pglist_data {
> >   #ifdef CONFIG_NUMA
> >       struct memory_tier __rcu *memtier;
> >   #endif
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +     struct memory_failure_stats mf_stats;
> > +#endif
> >   } pg_data_t;
> >
> >   #define node_present_pages(nid)     (NODE_DATA(nid)->node_present_pages)
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index c77a9e37e27e..c628f1db3a4d 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -87,6 +87,41 @@ inline void num_poisoned_pages_sub(unsigned long pfn, long i)
> >               memblk_nr_poison_sub(pfn, i);
> >   }
> >
> > +/**
> > + * MF_ATTR_RO - Create sysfs entry for each memory failure statistics.
> > + * @_name: name of the file in the per NUMA sysfs directory.
> > + */
> > +#define MF_ATTR_RO(_name)                                    \
> > +static ssize_t _name##_show(struct device *dev,                      \
> > +                         struct device_attribute *attr,      \
> > +                         char *buf)                          \
> > +{                                                            \
> > +     struct memory_failure_stats *mf_stats =                 \
> > +             &NODE_DATA(dev->id)->mf_stats;                  \
> > +     return sprintf(buf, "%lu\n", mf_stats->_name);          \
> > +}                                                            \
> > +static DEVICE_ATTR_RO(_name)
> > +
> > +MF_ATTR_RO(total);
> > +MF_ATTR_RO(ignored);
> > +MF_ATTR_RO(failed);
> > +MF_ATTR_RO(delayed);
> > +MF_ATTR_RO(recovered);
> > +
> > +static struct attribute *memory_failure_attr[] = {
> > +     &dev_attr_total.attr,
> > +     &dev_attr_ignored.attr,
> > +     &dev_attr_failed.attr,
> > +     &dev_attr_delayed.attr,
> > +     &dev_attr_recovered.attr,
> > +     NULL,
> > +};
> > +
> > +const struct attribute_group memory_failure_attr_group = {
> > +     .name = "memory_failure",
> > +     .attrs = memory_failure_attr,
> > +};
> > +
> >   /*
> >    * Return values:
> >    *   1:   the page is dissolved (if needed) and taken off from buddy,
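
For the userspace side, the sysfs entries and the identity from the commit
message (total = recovered + ignored + failed + delayed) can be consumed
with a few lines of C. This is a sketch under stated assumptions: the
struct mf_counts type and the read_mf_counter, read_mf_counts and
mf_counts_consistent helpers are hypothetical names, and only the paths and
counter names come from the patch. The five files are read one after
another, so a failure handled between reads could make the snapshot
momentarily inconsistent.

```c
#include <stdio.h>

/* Counter names follow the sysfs entries listed in the commit message. */
struct mf_counts {
	unsigned long total;
	unsigned long recovered;
	unsigned long ignored;
	unsigned long failed;
	unsigned long delayed;
};

/*
 * Read one counter of NUMA node `node`.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed.
 */
int read_mf_counter(int node, const char *name, unsigned long *out)
{
	char path[128];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/memory_failure/%s",
		 node, name);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%lu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Read all five counters of one node; returns 0 only if every read works. */
int read_mf_counts(int node, struct mf_counts *c)
{
	if (read_mf_counter(node, "total", &c->total) ||
	    read_mf_counter(node, "recovered", &c->recovered) ||
	    read_mf_counter(node, "ignored", &c->ignored) ||
	    read_mf_counter(node, "failed", &c->failed) ||
	    read_mf_counter(node, "delayed", &c->delayed))
		return -1;
	return 0;
}

/* Check the identity stated in the commit message. */
int mf_counts_consistent(const struct mf_counts *c)
{
	return c->total ==
	       c->recovered + c->ignored + c->failed + c->delayed;
}
```

A monitoring agent could call read_mf_counts() for each online node and
treat a -1 return as "kernel built without CONFIG_MEMORY_FAILURE or without
this patch".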



end of thread, other threads:[~2023-02-04 23:21 UTC | newest]

Thread overview: 11+ messages
2023-01-20  3:46 [PATCH v2 0/3] Introduce per NUMA node memory error statistics Jiaqi Yan
2023-01-20  3:46 ` [PATCH v2 1/3] mm: memory-failure: Add memory failure stats to sysfs Jiaqi Yan
2023-01-23  2:42   ` HORIGUCHI NAOYA(堀口　直也)
2023-02-02  6:54   ` Kefeng Wang
2023-02-04 23:21     ` Jiaqi Yan
2023-01-20  3:46 ` [PATCH v2 2/3] mm: memory-failure: Bump memory failure stats to pglist_data Jiaqi Yan
2023-01-23  2:42   ` HORIGUCHI NAOYA(堀口　直也)
2023-02-02  6:56   ` Kefeng Wang
2023-02-04 23:17     ` Jiaqi Yan
2023-01-20  3:46 ` [PATCH v2 3/3] mm: memory-failure: Document memory failure stats Jiaqi Yan
2023-01-23  2:43   ` HORIGUCHI NAOYA(堀口　直也)
