Linux-Block Archive mirror
* [RFC blktests v1 0/1] Test case for 'nvme: short-circuit connection retries'
From: Daniel Wagner @ 2023-06-21 15:58 UTC
  To: linux-nvme
  Cc: linux-kernel, linux-block, Chaitanya Kulkarni,
	Shin'ichiro Kawasaki, Sagi Grimberg, Hannes Reinecke,
	James Smart, Martin George, Daniel Wagner

We had a longer discussion on how to interpret the DNR bit on reconnect attempts
in [1]. The conclusion (if I got this right) was that we should not try to
reconnect with the same parameters when the error response had the DNR bit set.

The FC transport already implemented this behavior with

  f25f8ef70ce2 ("nvme-fc: short-circuit reconnect retries")

Hannes also provided patches for TCP and RDMA [2]. With these patches this test
will pass.

nvme/050 implements this test case by (ab)using the queue count mechanism to
trigger a reconnect. Before the reconnect is triggered, the test sets the
attr_allow_any_host attribute to 0, which forces the reconnect to fail.
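
As a rough sketch of the two configfs writes described above (this is not the
test itself): the helper name force_rejected_reconnect is hypothetical, and
NVMET_CFS is assumed to point at the nvmet configfs root, typically
/sys/kernel/config/nvmet. As far as I understand the target code, writing
attr_qid_max tears down the existing controllers, which is what pushes the
host into its reconnect path.

```bash
#!/bin/bash
# Sketch only: force_rejected_reconnect is a hypothetical helper name.
# Assumption: NVMET_CFS is the nvmet configfs root.
NVMET_CFS="${NVMET_CFS:-/sys/kernel/config/nvmet}"

force_rejected_reconnect() {
	local subsys_name="$1"
	local cfs_path="${NVMET_CFS}/subsystems/${subsys_name}"

	# Stop accepting arbitrary hosts first, so the upcoming
	# reconnect attempt is rejected by the target.
	echo 0 > "${cfs_path}/attr_allow_any_host"

	# Changing the queue count is the (ab)used trigger: the host
	# loses its association and has to reconnect.
	echo 1 > "${cfs_path}/attr_qid_max"
}
```

The order matters: if the queue count were changed first, the host could
reconnect successfully before the access restriction takes effect.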

[1] https://lore.kernel.org/linux-nvme/20220927143157.3659-1-dwagner@suse.de/
[2] https://lore.kernel.org/linux-nvme/20220715063356.134124-1-hare@suse.de/


This patch is based on top of
  blktests: https://lore.kernel.org/linux-nvme/20230620132703.20648-1-dwagner@suse.de/
  linux: https://lore.kernel.org/linux-nvme/20230620133711.22840-1-dwagner@suse.de/


fc:

nvme/050 (test DNR is handled on connect attempt with invalid arguments) [passed]
    runtime  8.845s  ...  3.756s

tcp:

nvme/050 (test DNR is handled on connect attempt with invalid arguments) [failed]
    runtime  3.756s  ...  8.836s
    --- tests/nvme/050.out      2023-06-21 11:47:47.767788898 +0200
    +++ /home/wagi/work/blktests/results/nodev/nvme/050.out.bad 2023-06-21 15:19:08.368414289 +0200
    @@ -1,2 +1,3 @@
     Running nvme/050
    +controller "nvme2" not deleted within 5 seconds
     Test complete
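
One possible reading of the tcp failure, as simple arithmetic. This assumes the
fabrics layer derives the retry budget as ctrl_loss_tmo divided by
reconnect_delay, rounded up; the helper name max_reconnects is made up for
illustration. With the options used by the test (--ctrl-loss-tmo 10,
--reconnect-delay 1) a host that ignores DNR keeps retrying for roughly ten
seconds, which outlasts the 5 second wait for the controller to disappear.

```bash
#!/bin/bash
# Sketch only: max_reconnects is a hypothetical helper.
# Assumption: retry budget = ceil(ctrl_loss_tmo / reconnect_delay).
max_reconnects() {
	local ctrl_loss_tmo="$1"
	local reconnect_delay="$2"

	# Integer ceiling division
	echo $(( (ctrl_loss_tmo + reconnect_delay - 1) / reconnect_delay ))
}

# Values taken from the connect options in nvme/050
max_reconnects 10 1    # prints 10
```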

fc:

 run blktests nvme/050 at 2023-06-21 15:11:31
 loop0: detected capacity change from 0 to 32768
 nvmet: adding nsid 1 to subsystem blktests-subsystem-1
 nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002  rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
 (NULL device *): {0:0} Association created
 [7088] nvmet: ctrl 1 start keep-alive timer for 5 secs
 nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a.
 [6743] nvmet: adding queue 1 to ctrl 1.
 [6312] nvmet: adding queue 2 to ctrl 1.
 [7088] nvmet: adding queue 3 to ctrl 1.
 [6927] nvmet: adding queue 4 to ctrl 1.
 nvme nvme2: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1"
 nvme nvme2: NVME-FC{0}: io failed due to lldd error 6
 nvme nvme2: NVME-FC{0}: transport association event: transport detected io error
 nvme nvme2: NVME-FC{0}: resetting controller
 [7088] nvmet: ctrl 1 stop keep-alive
 (NULL device *): {0:0} Association deleted
 nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002  rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
 (NULL device *): {0:0} Association freed
 (NULL device *): {0:0} Association created
 (NULL device *): Disconnect LS failed: No Association
 nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a for subsystem blktests-subsystem-1 not allowed
 nvme_fabrics: nvmf_log_connect_error: DNR 1
 nvme nvme2: Connect for subsystem blktests-subsystem-1 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a
 nvme nvme2: NVME-FC{0}: reset: Reconnect attempt failed (16772)
 nvme nvme2: NVME-FC{0}: reconnect failure
 nvme nvme2: Removing ctrl: NQN "blktests-subsystem-1"
 (NULL device *): {0:0} Association deleted
 (NULL device *): {0:0} Association freed
 (NULL device *): Disconnect LS failed: No Association

tcp:

 run blktests nvme/050 at 2023-06-21 15:11:36
 loop0: detected capacity change from 0 to 32768
 nvmet: adding nsid 1 to subsystem blktests-subsystem-1
 nvmet_tcp: enabling port 0 (127.0.0.1:4420)
 [62] nvmet: ctrl 1 start keep-alive timer for 5 secs
 nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a.
 nvme nvme2: creating 4 I/O queues.
 nvme nvme2: mapped 4/0/0 default/read/poll queues.
 [62] nvmet: adding queue 1 to ctrl 1.
 [214] nvmet: adding queue 2 to ctrl 1.
 [215] nvmet: adding queue 3 to ctrl 1.
 [177] nvmet: adding queue 4 to ctrl 1.
 nvme nvme2: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420
 nvme nvme2: starting error recovery
 nvme nvme2: Reconnecting in 1 seconds...
 [6743] nvmet: ctrl 1 stop keep-alive
 nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a for subsystem blktests-subsystem-1 not allowed
 nvme_fabrics: nvmf_log_connect_error: DNR 1
 nvme nvme2: Connect for subsystem blktests-subsystem-1 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a
 nvme nvme2: failed to connect queue: 0 ret=16772
 nvme nvme2: Failed reconnect attempt 1
 nvme nvme2: Reconnecting in 1 seconds...
 nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a for subsystem blktests-subsystem-1 not allowed
 nvme_fabrics: nvmf_log_connect_error: DNR 1
 nvme nvme2: Connect for subsystem blktests-subsystem-1 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a
 nvme nvme2: failed to connect queue: 0 ret=16772
 nvme nvme2: Failed reconnect attempt 2
 nvme nvme2: Reconnecting in 1 seconds...
 nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a for subsystem blktests-subsystem-1 not allowed
 nvme_fabrics: nvmf_log_connect_error: DNR 1
 nvme nvme2: Connect for subsystem blktests-subsystem-1 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a
 nvme nvme2: failed to connect queue: 0 ret=16772
 nvme nvme2: Failed reconnect attempt 3
 nvme nvme2: Reconnecting in 1 seconds...
 nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a for subsystem blktests-subsystem-1 not allowed
 nvme_fabrics: nvmf_log_connect_error: DNR 1
 nvme nvme2: Connect for subsystem blktests-subsystem-1 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a
 nvme nvme2: failed to connect queue: 0 ret=16772
 nvme nvme2: Failed reconnect attempt 4
 nvme nvme2: Reconnecting in 1 seconds...
 nvmet: connect by host nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a for subsystem blktests-subsystem-1 not allowed
 nvme_fabrics: nvmf_log_connect_error: DNR 1
 nvme nvme2: Connect for subsystem blktests-subsystem-1 is not allowed, hostnqn: nqn.2014-08.org.nvmexpress:uuid:77b49aba-06b4-431a-9af8-75e318740f1a
 nvme nvme2: failed to connect queue: 0 ret=16772
 nvme nvme2: Failed reconnect attempt 5
 nvme nvme2: Reconnecting in 1 seconds...
 nvme nvme2: Removing ctrl: NQN "blktests-subsystem-1"
 nvme nvme2: Property Set error: 880, offset 0x14
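
For reference, the value 16772 reported in both logs can be decoded by hand.
The sketch below assumes the status layout from include/linux/nvme.h, where
NVME_SC_DNR is 0x4000 and the low 11 bits carry the status code;
decode_nvme_status is a hypothetical helper, not part of blktests.

```bash
#!/bin/bash
# Sketch only: decode_nvme_status is a hypothetical helper.
# Assumption: DNR is bit 0x4000 and the status code is the low 11 bits,
# as defined in include/linux/nvme.h.
decode_nvme_status() {
	local status="$1"
	local dnr=$(( (status & 0x4000) != 0 ))
	local sc=$(( status & 0x7ff ))

	printf 'status=0x%x sc=0x%x dnr=%d\n' "${status}" "${sc}" "${dnr}"
}

decode_nvme_status 16772    # prints status=0x4184 sc=0x184 dnr=1
```

Status code 0x184 is the Connect "invalid host" error, which matches the
"Connect for subsystem ... is not allowed" message, and dnr=1 matches the
"DNR 1" line printed by nvmf_log_connect_error.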

Daniel Wagner (1):
  nvme/050: test DNR handling on reconnect

 tests/nvme/050     | 126 +++++++++++++++++++++++++++++++++++++++++++++
 tests/nvme/050.out |   2 +
 2 files changed, 128 insertions(+)
 create mode 100644 tests/nvme/050
 create mode 100644 tests/nvme/050.out

-- 
2.41.0



* [RFC blktests v1 1/1] nvme/050: test DNR handling on reconnect
From: Daniel Wagner @ 2023-06-21 15:58 UTC
  To: linux-nvme
  Cc: linux-kernel, linux-block, Chaitanya Kulkarni,
	Shin'ichiro Kawasaki, Sagi Grimberg, Hannes Reinecke,
	James Smart, Martin George, Daniel Wagner

When the host gets disconnected and tries to reconnect,
it should honor the DNR bit and not retry the connect
with the same parameters.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
 tests/nvme/050     | 126 +++++++++++++++++++++++++++++++++++++++++++++
 tests/nvme/050.out |   2 +
 2 files changed, 128 insertions(+)
 create mode 100644 tests/nvme/050
 create mode 100644 tests/nvme/050.out

diff --git a/tests/nvme/050 b/tests/nvme/050
new file mode 100644
index 000000000000..d33eb24e2f13
--- /dev/null
+++ b/tests/nvme/050
@@ -0,0 +1,126 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2023 SUSE LLC
+#
+# Test DNR is handled on connect attempt with invalid arguments.
+
+. tests/nvme/rc
+
+DESCRIPTION="test DNR is handled on connect attempt with invalid arguments"
+
+requires() {
+	_nvme_requires
+	_require_nvme_trtype tcp rdma fc
+	_require_min_cpus 2
+}
+
+nvmf_wait_for_state() {
+	local def_state_timeout=5
+	local subsys_name="$1"
+	local state="$2"
+	local timeout="${3:-$def_state_timeout}"
+	local nvmedev
+	local state_file
+	local start_time
+	local end_time
+
+	nvmedev=$(_find_nvme_dev "${subsys_name}")
+	state_file="/sys/class/nvme-fabrics/ctl/${nvmedev}/state"
+
+	start_time=$(date +%s)
+	while ! grep -q "${state}" "${state_file}"; do
+		sleep 1
+		end_time=$(date +%s)
+		if (( end_time - start_time > timeout )); then
+			echo "expected state \"${state}\" not" \
+				"reached within ${timeout} seconds"
+			return 1
+		fi
+	done
+
+	return 0
+}
+
+nvmf_wait_for_ctrl_delete() {
+	local def_state_timeout=5
+	local nvmedev="$1"
+	local timeout="${2:-$def_state_timeout}"
+	local ctrl="/sys/class/nvme-fabrics/ctl/${nvmedev}/state"
+	local start_time
+	local end_time
+
+	start_time=$(date +%s)
+	while [ -f "${ctrl}" ]; do
+		sleep 1
+		end_time=$(date +%s)
+		if (( end_time - start_time > timeout )); then
+			echo "controller \"${nvmedev}\" not deleted" \
+				"within ${timeout} seconds"
+			return 1
+		fi
+	done
+
+	return 0
+}
+
+set_nvmet_attr_qid_max() {
+	local nvmet_subsystem="$1"
+	local qid_max="$2"
+	local cfs_path="${NVMET_CFS}/subsystems/${nvmet_subsystem}"
+
+	echo "${qid_max}" > "${cfs_path}/attr_qid_max"
+}
+
+test() {
+	echo "Running ${TEST_NAME}"
+
+	_setup_nvmet
+
+	local port
+	local loop_dev
+	local file_path="$TMPDIR/img"
+	local subsys_name="blktests-subsystem-1"
+	local hostid="77b49aba-06b4-431a-9af8-75e318740f1a"
+	local hostnqn="nqn.2014-08.org.nvmexpress:uuid:${hostid}"
+	local cfs_path="${NVMET_CFS}/subsystems/${subsys_name}"
+	local nvmedev
+
+	truncate -s "${nvme_img_size}" "${file_path}"
+
+	loop_dev="$(losetup -f --show "${file_path}")"
+
+	_create_nvmet_subsystem "${subsys_name}" "${loop_dev}" \
+		"91fdba0d-f87b-4c25-b80f-db7be1418b9e"
+	port="$(_create_nvmet_port "${nvme_trtype}")"
+	_add_nvmet_subsys_to_port "${port}" "${subsys_name}"
+
+	_nvme_connect_subsys "${nvme_trtype}" "${subsys_name}" \
+		--hostnqn "${hostnqn}" \
+		--reconnect-delay 1 \
+		--ctrl-loss-tmo 10
+
+	nvmf_wait_for_state "${subsys_name}" "live"
+	nvmedev=$(_find_nvme_dev "${subsys_name}")
+
+	# Only allow connects from ${def_hostnqn}
+	echo 0 > "${cfs_path}/attr_allow_any_host"
+
+	# Force a reconnect
+	set_nvmet_attr_qid_max "${subsys_name}" 1
+
+	# The reconnect fails with the DNR bit set
+	# Thus the host should remove the controller
+	nvmf_wait_for_ctrl_delete "${nvmedev}"
+
+	_nvme_disconnect_subsys "${subsys_name}" >> "$FULL" 2>&1
+
+	_remove_nvmet_subsystem_from_port "${port}" "${subsys_name}"
+	_remove_nvmet_subsystem "${subsys_name}"
+	_remove_nvmet_port "${port}"
+
+	losetup -d "${loop_dev}"
+
+	rm "${file_path}"
+
+	echo "Test complete"
+}
diff --git a/tests/nvme/050.out b/tests/nvme/050.out
new file mode 100644
index 000000000000..b78b05f78424
--- /dev/null
+++ b/tests/nvme/050.out
@@ -0,0 +1,2 @@
+Running nvme/050
+Test complete
-- 
2.41.0
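
The two wait helpers in the patch follow the same polling pattern. A minimal
sketch of how they might share one loop is shown here; wait_until is a
hypothetical name, not an existing blktests helper, and the timeout check
mirrors the one used in the patch.

```bash
#!/bin/bash
# Sketch only: wait_until is a hypothetical helper.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Polls once per second until COMMAND succeeds; returns 1 on timeout.
wait_until() {
	local timeout="$1"
	shift
	local start_time
	local end_time

	start_time=$(date +%s)
	until "$@"; do
		sleep 1
		end_time=$(date +%s)
		if (( end_time - start_time > timeout )); then
			return 1
		fi
	done

	return 0
}
```

With such a helper, waiting for the "live" state could be written as
`wait_until 5 grep -q live "${state_file}"`, and waiting for controller
removal as `wait_until 5 test ! -e "${ctrl}"`.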

