* [RFC 0/2] fuse: introduce fuse server recovery mechanism
@ 2024-05-24  6:40 Jingbo Xu
  2024-05-24  6:40 ` [RFC 1/2] fuse: introduce recovery mechanism for fuse server Jingbo Xu
  ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Jingbo Xu @ 2024-05-24  6:40 UTC (permalink / raw)
To: miklos, linux-fsdevel; +Cc: linux-kernel, winters.zc

Background
==========

The fd of '/dev/fuse' serves as a message transmission channel between the
FUSE filesystem (kernel space) and the fuse server (user space). Once the fd
gets closed (intentionally or unintentionally), the FUSE filesystem gets
aborted, and any attempt at filesystem access gets an -ECONNABORTED error
until the FUSE filesystem is finally unmounted.

Providing uninterruptible filesystem service is one of the requisites in
production environments. The most straightforward way, and maybe the most
widely used one, is to make another dedicated user daemon (similar to the
systemd fdstore) keep the device fd open. When the fuse daemon recovers from
a crash, it can retrieve the device fd from the fdstore daemon through the
socket takeover (Unix domain socket) method [1] or the pidfd_getfd() syscall
[2]. In this way, as long as the fdstore daemon doesn't exit, the FUSE
filesystem won't get aborted once the fuse daemon crashes, though the
filesystem service may hang for a while when the fuse daemon gets restarted
and has not yet completely recovered.

This picture indeed works and has been deployed in our internal production
environment, until the following issues were encountered:

1. The fdstore daemon may be killed by mistake, in which case the FUSE
   filesystem gets aborted and becomes irrecoverable.

2. In containerized deployments, the fuse daemon is deployed inside a
   container POD, and a dedicated fdstore daemon needs to be deployed for
   each fuse daemon. Each fdstore daemon consumes a certain amount of
   resources (e.g. memory footprint), which is not conducive to dense
   container deployment.

3. Each fuse daemon implementation needs to implement its own fdstore
   daemon. If we implement the recovery mechanism on the kernel side, all
   fuse daemon implementations could reuse it.

What we do
==========

Basic Recovery Mechanism
------------------------

We introduce a recovery mechanism for the fuse server on the kernel side.
To do this:

1. Introduce a new "tag=" mount option, with which users can identify a
   fuse connection with a unique name.

2. Introduce a new FUSE_DEV_IOC_ATTACH ioctl, with which the fuse server
   can reconnect to the fuse connection corresponding to the given tag.

3. Introduce a new FUSE_HAS_RECOVERY init flag. The fuse server should
   advertise this feature if it supports server recovery.

With the above recovery mechanism, the whole time sequence is as follows:

- At the initial mount, the fuse filesystem is mounted with the "tag="
  option.
- The fuse server advertises the FUSE_HAS_RECOVERY flag when replying to
  FUSE_INIT.
- When the fuse server crashes and the (/dev/fuse) device fd is closed,
  the fuse connection won't be aborted.
- The requests submitted after the server crash will stay in the iqueue;
  the processes submitting these requests will hang.
- The fuse server gets restarted and recovers the state prior to the crash
  (including the negotiation results of the last FUSE_INIT).
- The fuse server opens /dev/fuse to get a new device fd, and then runs
  the FUSE_DEV_IOC_ATTACH ioctl on the new device fd to retrieve the fuse
  connection with the tag previously used to mount the fuse filesystem.
- The fuse server issues a FUSE_NOTIFY_RESEND notification to request that
  the kernel resend those in-flight requests that had been sent to the
  fuse server before the crash but not yet been replied to.
- The fuse server starts to process requests normally (those queued in the
  iqueue and those resent via FUSE_NOTIFY_RESEND).

In summary, requests submitted after the server crash stay in the iqueue
and get serviced once the fuse server recovers from the crash and retrieves
the previous fuse connection. As for the in-flight requests that had been
sent to the fuse server before the crash but not yet been replied to, the
fuse server can ask the kernel to resend them through the
FUSE_NOTIFY_RESEND notification type.

Security Enhancement
--------------------

Besides, we offer a uid-based security enhancement for the fuse server
recovery mechanism; otherwise any malicious attacker could kill the fuse
server and take over the filesystem service through the recovery mechanism.
To implement this, we introduce a new "rescue_uid=" mount option specifying
the expected uid of the legitimate process running the fuse server. Only a
process with the matching uid is then permitted to retrieve the fuse
connection with the server recovery mechanism.

Limitation
==========

1. The current mechanism won't resend a new FUSE_INIT request to the fuse
   server and start a new negotiation when the fuse server attempts to
   re-attach to the fuse connection through the FUSE_DEV_IOC_ATTACH ioctl.
   Thus the fuse server needs to recover the state prior to the crash
   (including the negotiation results of the last FUSE_INIT) by itself.
   (PS: hence I had to apply some hacks to the libfuse passthrough_ll
   daemon when testing the recovery feature.)

2. With the current recovery mechanism, the fuse filesystem won't get
   aborted when the fuse server crashes, so a subsequent umount will hang.
   The call stack shows the hung task waiting for FUSE_GETATTR on the
   mountpoint:

   [<0>] request_wait_answer+0xe1/0x200
   [<0>] fuse_simple_request+0x18e/0x2a0
   [<0>] fuse_do_getattr+0xc9/0x180
   [<0>] vfs_statx+0x92/0x170
   [<0>] vfs_fstatat+0x7c/0xb0
   [<0>] __do_sys_newstat+0x1d/0x40
   [<0>] do_syscall_64+0x60/0x170
   [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

   It's not fixed yet in this RFC version.

3. I don't know if a kernel based recovery mechanism is welcome on the
   community side. Any comment is welcome. Thanks!

[1] https://copyconstruct.medium.com/file-descriptor-transfer-over-unix-domain-sockets-dcbbf5b3b6ec
[2] https://copyconstruct.medium.com/seamless-file-descriptor-transfer-between-processes-with-pidfd-and-pidfd-getfd-816afcd19ed4

Jingbo Xu (2):
  fuse: introduce recovery mechanism for fuse server
  fuse: uid-based security enhancement for the recovery mechanism

 fs/fuse/dev.c             | 55 ++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          | 15 +++++++++++
 fs/fuse/inode.c           | 46 +++++++++++++++++++++++++++++++-
 include/uapi/linux/fuse.h |  7 +++++
 4 files changed, 121 insertions(+), 2 deletions(-)

-- 
2.19.1.6.gb485710b

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [RFC 1/2] fuse: introduce recovery mechanism for fuse server
  2024-05-24  6:40 [RFC 0/2] fuse: introduce fuse server recovery mechanism Jingbo Xu
@ 2024-05-24  6:40 ` Jingbo Xu
  2024-05-24  6:40 ` [RFC 2/2] fuse: uid-based security enhancement for the recovery mechanism Jingbo Xu
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Jingbo Xu @ 2024-05-24  6:40 UTC (permalink / raw)
To: miklos, linux-fsdevel; +Cc: linux-kernel, winters.zc

Introduce a failover mechanism for the fuse server, with which the fuse
connection can stay alive while the fuse server crashes. The fuse server
can re-attach to the fuse connection after the crash and recover the
filesystem service.

The requests submitted after the server crash will stay in the iqueue and
get serviced once the fuse server recovers from the crash and retrieves
the previous fuse connection. As for the in-flight requests that had been
sent to the fuse server before the crash but not yet been replied to, the
fuse server can request the kernel to resend them through the
FUSE_NOTIFY_RESEND notification type.

To implement the above mechanism:

1. Introduce a new "tag=" mount option, with which users can identify a
   fuse connection with a unique name.

2. Introduce a new FUSE_DEV_IOC_ATTACH ioctl, with which the fuse server
   can reconnect to the fuse connection corresponding to the given tag.

3. Introduce a new FUSE_HAS_RECOVERY init flag. The fuse server should
   advertise this feature if it supports server recovery.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/fuse/dev.c             | 43 ++++++++++++++++++++++++++++++++++++++-
 fs/fuse/fuse_i.h          |  7 +++++++
 fs/fuse/inode.c           | 35 ++++++++++++++++++++++++++++++-
 include/uapi/linux/fuse.h |  7 +++++++
 4 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 3ec8bb5e68ff..7599138baac0 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2271,7 +2271,7 @@ int fuse_dev_release(struct inode *inode, struct file *file)
 		end_requests(&to_end);

 	/* Are we the last open device? */
-	if (atomic_dec_and_test(&fc->dev_count)) {
+	if (atomic_dec_and_test(&fc->dev_count) && !fc->recovery) {
 		WARN_ON(fc->iq.fasync != NULL);
 		fuse_abort_conn(fc);
 	}
@@ -2376,6 +2376,44 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
 	return fuse_backing_close(fud->fc, backing_id);
 }

+static inline bool fuse_device_attach_match(struct fuse_conn *fc,
+					    const char *tag)
+{
+	if (!fc->recovery)
+		return false;
+
+	return !strncmp(fc->tag, tag, FUSE_TAG_NAME_MAX);
+}
+
+static int fuse_device_attach(struct file *file, const char *tag)
+{
+	struct fuse_conn *fc;
+
+	list_for_each_entry(fc, &fuse_conn_list, entry) {
+		if (!fuse_device_attach_match(fc, tag))
+			continue;
+		return fuse_device_clone(fc, file);
+	}
+	return -ENOTTY;
+}
+
+static long fuse_dev_ioctl_attach(struct file *file, __u32 __user *argp)
+{
+	struct fuse_ioctl_attach attach;
+	int res;
+
+	if (copy_from_user(&attach, argp, sizeof(attach)))
+		return -EFAULT;
+
+	if (attach.tag[0] == '\0')
+		return -EINVAL;
+
+	mutex_lock(&fuse_mutex);
+	res = fuse_device_attach(file, attach.tag);
+	mutex_unlock(&fuse_mutex);
+	return res;
+}
+
 static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 			   unsigned long arg)
 {
@@ -2391,6 +2429,9 @@ static long fuse_dev_ioctl(struct file *file, unsigned int cmd,
 	case FUSE_DEV_IOC_BACKING_CLOSE:
 		return fuse_dev_ioctl_backing_close(file, argp);

+	case FUSE_DEV_IOC_ATTACH:
+		return fuse_dev_ioctl_attach(file, argp);
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f23919610313..e9832186f84f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -575,6 +575,7 @@ struct fuse_fs_context {
 	unsigned int max_read;
 	unsigned int blksize;
 	const char *subtype;
+	const char *tag;

 	/* DAX device, may be NULL */
 	struct dax_device *dax_dev;
@@ -860,6 +861,9 @@ struct fuse_conn {
 	/** Passthrough support for read/write IO */
 	unsigned int passthrough:1;

+	/** Support for fuse server recovery */
+	unsigned int recovery:1;
+
 	/** Maximum stack depth for passthrough backing files */
 	int max_stack_depth;

@@ -917,6 +921,9 @@ struct fuse_conn {
 	/** IDR for backing files ids */
 	struct idr backing_files_map;
 #endif
+
+	/* Tag of the connection used by fuse server recovery */
+	const char *tag;
 };

 /*
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 99e44ea7d875..1ab245d6ade3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -733,6 +733,7 @@ enum {
 	OPT_ALLOW_OTHER,
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
+	OPT_TAG,
 	OPT_ERR
 };

@@ -747,6 +748,7 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
 	fsparam_u32	("max_read",	OPT_MAX_READ),
 	fsparam_u32	("blksize",	OPT_BLKSIZE),
 	fsparam_string	("subtype",	OPT_SUBTYPE),
+	fsparam_string	("tag",		OPT_TAG),
 	{}
 };

@@ -830,6 +832,15 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
 		ctx->blksize = result.uint_32;
 		break;

+	case OPT_TAG:
+		if (ctx->tag)
+			return invalfc(fsc, "Multiple tags specified");
+		if (strlen(param->string) > FUSE_TAG_NAME_MAX)
+			return invalfc(fsc, "Tag name too long");
+		ctx->tag = param->string;
+		param->string = NULL;
+		return 0;
+
 	default:
 		return -EINVAL;
 	}
@@ -843,6 +854,7 @@ static void fuse_free_fsc(struct fs_context *fsc)

 	if (ctx) {
 		kfree(ctx->subtype);
+		kfree(ctx->tag);
 		kfree(ctx);
 	}
 }
@@ -969,6 +981,7 @@ void fuse_conn_put(struct fuse_conn *fc)
 		}
 		if (IS_ENABLED(CONFIG_FUSE_PASSTHROUGH))
 			fuse_backing_files_free(fc);
+		kfree(fc->tag);
 		call_rcu(&fc->rcu, delayed_release);
 	}
 }
@@ -1331,6 +1344,8 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 			}
 			if (flags & FUSE_NO_EXPORT_SUPPORT)
 				fm->sb->s_export_op = &fuse_export_fid_operations;
+			if (flags & FUSE_HAS_RECOVERY)
+				fc->recovery = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
 			fc->no_lock = 1;
@@ -1378,7 +1393,7 @@ void fuse_send_init(struct fuse_mount *fm)
 		FUSE_HANDLE_KILLPRIV_V2 | FUSE_SETXATTR_EXT | FUSE_INIT_EXT |
 		FUSE_SECURITY_CTX | FUSE_CREATE_SUPP_GROUP |
 		FUSE_HAS_EXPIRE_ONLY | FUSE_DIRECT_IO_ALLOW_MMAP |
-		FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND;
+		FUSE_NO_EXPORT_SUPPORT | FUSE_HAS_RESEND | FUSE_HAS_RECOVERY;
 #ifdef CONFIG_FUSE_DAX
 	if (fm->fc->dax)
 		flags |= FUSE_MAP_ALIGNMENT;
@@ -1520,6 +1535,17 @@ void fuse_dev_free(struct fuse_dev *fud)
 }
 EXPORT_SYMBOL_GPL(fuse_dev_free);

+static bool fuse_find_conn_tag(const char *tag)
+{
+	struct fuse_conn *fc;
+
+	list_for_each_entry(fc, &fuse_conn_list, entry) {
+		if (!strcmp(fc->tag, tag))
+			return true;
+	}
+	return false;
+}
+
 static void fuse_fill_attr_from_inode(struct fuse_attr *attr,
 				      const struct fuse_inode *fi)
 {
@@ -1727,6 +1753,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	fc->destroy = ctx->destroy;
 	fc->no_control = ctx->no_control;
 	fc->no_force_umount = ctx->no_force_umount;
+	fc->tag = ctx->tag;
+	ctx->tag = NULL;

 	err = -ENOMEM;
 	root = fuse_get_root_inode(sb, ctx->rootmode);
@@ -1742,6 +1770,11 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	if (ctx->fudptr && *ctx->fudptr)
 		goto err_unlock;

+	if (fc->tag && fuse_find_conn_tag(fc->tag)) {
+		pr_err("tag %s already exist\n", fc->tag);
+		goto err_unlock;
+	}
+
 	err = fuse_ctl_add_conn(fc);
 	if (err)
 		goto err_unlock;
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index d08b99d60f6f..054d6789b2fc 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -463,6 +463,7 @@ struct fuse_file_lock {
 #define FUSE_PASSTHROUGH	(1ULL << 37)
 #define FUSE_NO_EXPORT_SUPPORT	(1ULL << 38)
 #define FUSE_HAS_RESEND		(1ULL << 39)
+#define FUSE_HAS_RECOVERY	(1ULL << 40)

 /* Obsolete alias for FUSE_DIRECT_IO_ALLOW_MMAP */
 #define FUSE_DIRECT_IO_RELAX	FUSE_DIRECT_IO_ALLOW_MMAP
@@ -1079,12 +1080,18 @@ struct fuse_backing_map {
 	uint64_t	padding;
 };

+struct fuse_ioctl_attach {
+#define FUSE_TAG_NAME_MAX 128
+	char tag[FUSE_TAG_NAME_MAX];
+};
+
 /* Device ioctls: */
 #define FUSE_DEV_IOC_MAGIC		229
 #define FUSE_DEV_IOC_CLONE		_IOR(FUSE_DEV_IOC_MAGIC, 0, uint32_t)
 #define FUSE_DEV_IOC_BACKING_OPEN	_IOW(FUSE_DEV_IOC_MAGIC, 1, \
 					     struct fuse_backing_map)
 #define FUSE_DEV_IOC_BACKING_CLOSE	_IOW(FUSE_DEV_IOC_MAGIC, 2, uint32_t)
+#define FUSE_DEV_IOC_ATTACH		_IOW(FUSE_DEV_IOC_MAGIC, 3, struct fuse_ioctl_attach)

 struct fuse_lseek_in {
 	uint64_t	fh;
-- 
2.19.1.6.gb485710b

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* [RFC 2/2] fuse: uid-based security enhancement for the recovery mechanism
  2024-05-24  6:40 [RFC 0/2] fuse: introduce fuse server recovery mechanism Jingbo Xu
  2024-05-24  6:40 ` [RFC 1/2] fuse: introduce recovery mechanism for fuse server Jingbo Xu
@ 2024-05-24  6:40 ` Jingbo Xu
  2024-05-27 15:16 ` [RFC 0/2] fuse: introduce fuse server " Miklos Szeredi
  2024-05-28  8:38 ` Christian Brauner
  3 siblings, 0 replies; 15+ messages in thread
From: Jingbo Xu @ 2024-05-24  6:40 UTC (permalink / raw)
To: miklos, linux-fsdevel; +Cc: linux-kernel, winters.zc

Offer a uid-based security enhancement for the fuse server recovery
mechanism; otherwise any malicious attacker could kill the fuse server and
take over the filesystem service through the recovery mechanism.

Introduce a new "rescue_uid=" mount option specifying the expected uid of
the legitimate process running the fuse server. Only a process with the
matching uid is then permitted to retrieve the fuse connection with the
server recovery mechanism.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
 fs/fuse/dev.c    | 12 ++++++++++++
 fs/fuse/fuse_i.h |  8 ++++++++
 fs/fuse/inode.c  | 13 ++++++++++++-
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 7599138baac0..9db35a2bbd85 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2376,12 +2376,24 @@ static long fuse_dev_ioctl_backing_close(struct file *file, __u32 __user *argp)
 	return fuse_backing_close(fud->fc, backing_id);
 }

+static inline bool fuse_device_attach_permissible(struct fuse_conn *fc)
+{
+	const struct cred *cred = current_cred();
+
+	return (uid_eq(cred->euid, fc->rescue_uid) &&
+		uid_eq(cred->suid, fc->rescue_uid) &&
+		uid_eq(cred->uid, fc->rescue_uid));
+}
+
 static inline bool fuse_device_attach_match(struct fuse_conn *fc,
 					    const char *tag)
 {
 	if (!fc->recovery)
 		return false;

+	if (!fuse_device_attach_permissible(fc))
+		return false;
+
 	return !strncmp(fc->tag, tag, FUSE_TAG_NAME_MAX);
 }

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index e9832186f84f..c43026d7229c 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -560,6 +560,7 @@ struct fuse_fs_context {
 	unsigned int rootmode;
 	kuid_t user_id;
 	kgid_t group_id;
+	kuid_t rescue_uid;
 	bool is_bdev:1;
 	bool fd_present:1;
 	bool rootmode_present:1;
@@ -571,6 +572,7 @@ struct fuse_fs_context {
 	bool no_control:1;
 	bool no_force_umount:1;
 	bool legacy_opts_show:1;
+	bool rescue_uid_present:1;
 	enum fuse_dax_mode dax_mode;
 	unsigned int max_read;
 	unsigned int blksize;
@@ -616,6 +618,9 @@ struct fuse_conn {
 	/** The group id for this mount */
 	kgid_t group_id;

+	/* The expected user id of the fuse server */
+	kuid_t rescue_uid;
+
 	/** The pid namespace for this mount */
 	struct pid_namespace *pid_ns;

@@ -864,6 +869,9 @@ struct fuse_conn {
 	/** Support for fuse server recovery */
 	unsigned int recovery:1;

+	/** Is rescue_uid specified? */
+	unsigned int rescue_uid_present:1;
+
 	/** Maximum stack depth for passthrough backing files */
 	int max_stack_depth;

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 1ab245d6ade3..3b00482293b6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -734,6 +734,7 @@ enum {
 	OPT_MAX_READ,
 	OPT_BLKSIZE,
 	OPT_TAG,
+	OPT_RESCUE_UID,
 	OPT_ERR
 };

@@ -749,6 +750,7 @@ static const struct fs_parameter_spec fuse_fs_parameters[] = {
 	fsparam_u32	("blksize",	OPT_BLKSIZE),
 	fsparam_string	("subtype",	OPT_SUBTYPE),
 	fsparam_string	("tag",		OPT_TAG),
+	fsparam_u32	("rescue_uid",	OPT_RESCUE_UID),
 	{}
 };

@@ -841,6 +843,13 @@ static int fuse_parse_param(struct fs_context *fsc, struct fs_parameter *param)
 		param->string = NULL;
 		return 0;

+	case OPT_RESCUE_UID:
+		ctx->rescue_uid = make_kuid(fsc->user_ns, result.uint_32);
+		if (!uid_valid(ctx->rescue_uid))
+			return invalfc(fsc, "Invalid rescue_uid");
+		ctx->rescue_uid_present = true;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -1344,7 +1353,7 @@ static void process_init_reply(struct fuse_mount *fm, struct fuse_args *args,
 			}
 			if (flags & FUSE_NO_EXPORT_SUPPORT)
 				fm->sb->s_export_op = &fuse_export_fid_operations;
-			if (flags & FUSE_HAS_RECOVERY)
+			if (flags & FUSE_HAS_RECOVERY && fc->rescue_uid_present)
 				fc->recovery = 1;
 		} else {
 			ra_pages = fc->max_read / PAGE_SIZE;
@@ -1753,6 +1762,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 	fc->destroy = ctx->destroy;
 	fc->no_control = ctx->no_control;
 	fc->no_force_umount = ctx->no_force_umount;
+	fc->rescue_uid = ctx->rescue_uid;
+	fc->rescue_uid_present = ctx->rescue_uid_present;
 	fc->tag = ctx->tag;
 	ctx->tag = NULL;

-- 
2.19.1.6.gb485710b

^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-24  6:40 [RFC 0/2] fuse: introduce fuse server recovery mechanism Jingbo Xu
  2024-05-24  6:40 ` [RFC 1/2] fuse: introduce recovery mechanism for fuse server Jingbo Xu
  2024-05-24  6:40 ` [RFC 2/2] fuse: uid-based security enhancement for the recovery mechanism Jingbo Xu
@ 2024-05-27 15:16 ` Miklos Szeredi
  2024-05-28  2:45   ` Jingbo Xu
  2024-05-28  8:38 ` Christian Brauner
  3 siblings, 1 reply; 15+ messages in thread
From: Miklos Szeredi @ 2024-05-27 15:16 UTC (permalink / raw)
To: Jingbo Xu; +Cc: linux-fsdevel, linux-kernel, winters.zc

On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> 3. I don't know if a kernel based recovery mechanism is welcome on the
> community side. Any comment is welcome. Thanks!

I'd prefer something external to fuse.

Maybe a kernel based fdstore (lifetime connected to that of the
container) would be a useful service more generally?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-27 15:16 ` [RFC 0/2] fuse: introduce fuse server " Miklos Szeredi
@ 2024-05-28  2:45   ` Jingbo Xu
  2024-05-28  3:08     ` Jingbo Xu
  2024-05-28  7:45     ` Miklos Szeredi
  0 siblings, 2 replies; 15+ messages in thread
From: Jingbo Xu @ 2024-05-28  2:45 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, winters.zc

On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>> community side. Any comment is welcome. Thanks!
>
> I'd prefer something external to fuse.

Okay, understood.

>
> Maybe a kernel based fdstore (lifetime connected to that of the
> container) would be a useful service more generally?

Yeah, I had indeed considered this, but I'm afraid the VFS guys would be
concerned about why we do this on the kernel side rather than in user
space.

I'm not sure what the VFS guys think about this and whether the kernel
side shall care about this.

Many thanks!

-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  2:45   ` Jingbo Xu
@ 2024-05-28  3:08     ` Jingbo Xu
  2024-05-28  4:02       ` Gao Xiang
  2024-05-28  7:46       ` Miklos Szeredi
  2024-05-28  7:45     ` Miklos Szeredi
  1 sibling, 2 replies; 15+ messages in thread
From: Jingbo Xu @ 2024-05-28  3:08 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, winters.zc

On 5/28/24 10:45 AM, Jingbo Xu wrote:
>
> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>> community side. Any comment is welcome. Thanks!
>>
>> I'd prefer something external to fuse.
>
> Okay, understood.
>
>>
>> Maybe a kernel based fdstore (lifetime connected to that of the
>> container) would be a useful service more generally?
>
> Yeah, I had indeed considered this, but I'm afraid the VFS guys would be
> concerned about why we do this on the kernel side rather than in user
> space.
>
> I'm not sure what the VFS guys think about this and whether the kernel
> side shall care about this.

There was an RFC for a kernel-side fdstore [1], though it's also
implemented upon FUSE.

[1] https://lore.kernel.org/all/CA+a=Yy5rnqLqH2iR-ZY6AUkNJy48mroVV3Exmhmt-pfTi82kXA@mail.gmail.com/T/

-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  3:08     ` Jingbo Xu
@ 2024-05-28  4:02       ` Gao Xiang
  2024-05-28  8:43         ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Gao Xiang @ 2024-05-28  4:02 UTC (permalink / raw)
To: Jingbo Xu, Miklos Szeredi, Christian Brauner
Cc: linux-fsdevel, linux-kernel, winters.zc

On 2024/5/28 11:08, Jingbo Xu wrote:
>
> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>
>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>> community side. Any comment is welcome. Thanks!
>>>
>>> I'd prefer something external to fuse.
>>
>> Okay, understood.
>>
>>>
>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>> container) would be a useful service more generally?
>>
>> Yeah, I had indeed considered this, but I'm afraid the VFS guys would be
>> concerned about why we do this on the kernel side rather than in user space.

Just from my own perspective, even if it's in FUSE, the concern is
almost the same.

I wonder if on-demand cachefiles could keep fds too in the future (thus
e.g. a daemonless feature could even be implemented entirely with a
kernel fdstore), but it still has the same concern, or it's a source of
duplication.

Thanks,
Gao Xiang

>>
>> I'm not sure what the VFS guys think about this and if the kernel side
>> shall care about this.
>
> There was an RFC for a kernel-side fdstore [1], though it's also
> implemented upon FUSE.
>
> [1]
> https://lore.kernel.org/all/CA+a=Yy5rnqLqH2iR-ZY6AUkNJy48mroVV3Exmhmt-pfTi82kXA@mail.gmail.com/T/

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  4:02       ` Gao Xiang
@ 2024-05-28  8:43         ` Christian Brauner
  2024-05-28  9:13           ` Gao Xiang
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2024-05-28  8:43 UTC (permalink / raw)
To: Gao Xiang
Cc: Jingbo Xu, Miklos Szeredi, linux-fsdevel, linux-kernel, winters.zc

On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>
> On 2024/5/28 11:08, Jingbo Xu wrote:
> >
> > On 5/28/24 10:45 AM, Jingbo Xu wrote:
> > >
> > > On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> > > > On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > >
> > > > > 3. I don't know if a kernel based recovery mechanism is welcome on the
> > > > > community side. Any comment is welcome. Thanks!
> > > >
> > > > I'd prefer something external to fuse.
> > >
> > > Okay, understood.
> > >
> > > >
> > > > Maybe a kernel based fdstore (lifetime connected to that of the
> > > > container) would be a useful service more generally?
> > >
> > > Yeah, I had indeed considered this, but I'm afraid the VFS guys would be
> > > concerned about why we do this on the kernel side rather than in user space.
>
> Just from my own perspective, even if it's in FUSE, the concern is
> almost the same.
>
> I wonder if on-demand cachefiles could keep fds too in the future (thus
> e.g. a daemonless feature could even be implemented entirely with a
> kernel fdstore), but it still has the same concern, or it's a source of
> duplication.
>
> Thanks,
> Gao Xiang
>
> > >
> > > I'm not sure what the VFS guys think about this and if the kernel side
> > > shall care about this.

Fwiw, I'm not convinced, and I think that's a big can of worms both
security wise and semantics wise. I have discussed multiple times
whether a kernel-side fdstore would be something that systemd would use
if available, and they wouldn't use it because it provides them with no
benefits over having it in userspace.

Especially since it implements a lot of special semantics and policy
that we really don't want in the kernel. I think that's just not
something we should do. We should give userspace all the means to
implement fdstores in userspace but not hold fds ourselves.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  8:43         ` Christian Brauner
@ 2024-05-28  9:13           ` Gao Xiang
  2024-05-28  9:32             ` Christian Brauner
  0 siblings, 1 reply; 15+ messages in thread
From: Gao Xiang @ 2024-05-28  9:13 UTC (permalink / raw)
To: Christian Brauner
Cc: Jingbo Xu, Miklos Szeredi, linux-fsdevel, linux-kernel, winters.zc

Hi Christian,

On 2024/5/28 16:43, Christian Brauner wrote:
> On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>>
>> On 2024/5/28 11:08, Jingbo Xu wrote:
>>>
>>> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>>>
>>>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>>>> community side. Any comment is welcome. Thanks!
>>>>>
>>>>> I'd prefer something external to fuse.
>>>>
>>>> Okay, understood.
>>>>
>>>>>
>>>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>>>> container) would be a useful service more generally?
>>>>
>>>> Yeah, I had indeed considered this, but I'm afraid the VFS guys would be
>>>> concerned about why we do this on the kernel side rather than in user space.
>>
>> Just from my own perspective, even if it's in FUSE, the concern is
>> almost the same.
>>
>> I wonder if on-demand cachefiles could keep fds too in the future (thus
>> e.g. a daemonless feature could even be implemented entirely with a
>> kernel fdstore), but it still has the same concern, or it's a source of
>> duplication.
>>
>> Thanks,
>> Gao Xiang
>>
>>>>
>>>> I'm not sure what the VFS guys think about this and if the kernel side
>>>> shall care about this.
>
> Fwiw, I'm not convinced, and I think that's a big can of worms both
> security wise and semantics wise. I have discussed multiple times
> whether a kernel-side fdstore would be something that systemd would use
> if available, and they wouldn't use it because it provides them with no
> benefits over having it in userspace.

As far as I know, there are currently approximately two ways to do
failover mechanisms in the kernel.

The first model is much like the fuse model: in this mode, we keep and
pass an fd to maintain the active state, and currently userspace is
responsible for the permission/security issues when doing something like
passing fds.

The second model is a one-device-one-instance model, for example ublk
(if I understand correctly): each active instance (/dev/ublkbX) has its
own unique control device (/dev/ublkcX). Users can assign/change DAC/MAC
for each control device, and failover recovery just needs to reopen the
control device with the proper permission and do the recovery.

So, just my own thought: a kernel-side fdstore pseudo filesystem could
provide a DAC/MAC mechanism for the first model. That would be a much
cleaner way than doing a similar thing independently in each subsystem
that needs a DAC/MAC-like mechanism. But that is just my own thought.

Thanks,
Gao Xiang

>
> Especially since it implements a lot of special semantics and policy
> that we really don't want in the kernel. I think that's just not
> something we should do. We should give userspace all the means to
> implement fdstores in userspace but not hold fds ourselves.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  9:13           ` Gao Xiang
@ 2024-05-28  9:32             ` Christian Brauner
  2024-05-28  9:58               ` Gao Xiang
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2024-05-28  9:32 UTC (permalink / raw)
To: Gao Xiang
Cc: Jingbo Xu, Miklos Szeredi, linux-fsdevel, linux-kernel, winters.zc

On Tue, May 28, 2024 at 05:13:04PM +0800, Gao Xiang wrote:
> Hi Christian,
>
> On 2024/5/28 16:43, Christian Brauner wrote:
> > On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
> > >
> > > On 2024/5/28 11:08, Jingbo Xu wrote:
> > > >
> > > > On 5/28/24 10:45 AM, Jingbo Xu wrote:
> > > > >
> > > > > On 5/27/24 11:16 PM, Miklos Szeredi wrote:
> > > > > > On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> > > > > >
> > > > > > > 3. I don't know if a kernel based recovery mechanism is welcome on the
> > > > > > > community side. Any comment is welcome. Thanks!
> > > > > >
> > > > > > I'd prefer something external to fuse.
> > > > >
> > > > > Okay, understood.
> > > > >
> > > > > >
> > > > > > Maybe a kernel based fdstore (lifetime connected to that of the
> > > > > > container) would be a useful service more generally?
> > > > >
> > > > > Yeah, I had indeed considered this, but I'm afraid the VFS guys would be
> > > > > concerned about why we do this on the kernel side rather than in user space.
> > >
> > > Just from my own perspective, even if it's in FUSE, the concern is
> > > almost the same.
> > >
> > > I wonder if on-demand cachefiles could keep fds too in the future (thus
> > > e.g. a daemonless feature could even be implemented entirely with a
> > > kernel fdstore), but it still has the same concern, or it's a source of
> > > duplication.
> > >
> > > Thanks,
> > > Gao Xiang
> > >
> > > > >
> > > > > I'm not sure what the VFS guys think about this and if the kernel side
> > > > > shall care about this.
> >
> > Fwiw, I'm not convinced, and I think that's a big can of worms both
> > security wise and semantics wise. I have discussed multiple times
> > whether a kernel-side fdstore would be something that systemd would use
> > if available, and they wouldn't use it because it provides them with no
> > benefits over having it in userspace.
>
> As far as I know, there are currently approximately two ways to do
> failover mechanisms in the kernel.
>
> The first model is much like the fuse model: in this mode, we keep and
> pass an fd to maintain the active state, and currently userspace is
> responsible for the permission/security issues when doing something like
> passing fds.
>
> The second model is a one-device-one-instance model, for example ublk
> (if I understand correctly): each active instance (/dev/ublkbX) has its
> own unique control device (/dev/ublkcX). Users can assign/change DAC/MAC
> for each control device, and failover recovery just needs to reopen the
> control device with the proper permission and do the recovery.
>
> So, just my own thought: a kernel-side fdstore pseudo filesystem could
> provide a DAC/MAC mechanism for the first model. That would be a much
> cleaner way than doing a similar thing independently in each subsystem
> that needs a DAC/MAC-like mechanism. But that is just my own thought.

The failover mechanism for /dev/ublkcX could easily be implemented using
the fdstore. The fact that they rolled their own thing is orthogonal to
this imho.

Implementing retrieval policies like this in the kernel is slowly
advancing into /proc/$pid/fd/ levels of complexity. That's all better
handled with appropriate policies in userspace. And cachefilesd can
similarly just stash their fds in the fdstore.

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  9:32 ` Christian Brauner
@ 2024-05-28  9:58 ` Gao Xiang
  0 siblings, 0 replies; 15+ messages in thread
From: Gao Xiang @ 2024-05-28  9:58 UTC (permalink / raw)
To: Christian Brauner
Cc: Jingbo Xu, Miklos Szeredi, linux-fsdevel, linux-kernel, winters.zc

On 2024/5/28 17:32, Christian Brauner wrote:
> On Tue, May 28, 2024 at 05:13:04PM +0800, Gao Xiang wrote:
>> Hi Christian,
>>
>> On 2024/5/28 16:43, Christian Brauner wrote:
>>> On Tue, May 28, 2024 at 12:02:46PM +0800, Gao Xiang wrote:
>>>>
>>>> On 2024/5/28 11:08, Jingbo Xu wrote:
>>>>>
>>>>> On 5/28/24 10:45 AM, Jingbo Xu wrote:
>>>>>>
>>>>>> On 5/27/24 11:16 PM, Miklos Szeredi wrote:
>>>>>>> On Fri, 24 May 2024 at 08:40, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>
>>>>>>>> 3. I don't know if a kernel based recovery mechanism is welcome on the
>>>>>>>> community side. Any comment is welcome. Thanks!
>>>>>>>
>>>>>>> I'd prefer something external to fuse.
>>>>>>
>>>>>> Okay, understood.
>>>>>>
>>>>>>> Maybe a kernel based fdstore (lifetime connected to that of the
>>>>>>> container) would be a useful service more generally?
>>>>>>
>>>>>> Yeah I indeed had considered this, but I'm afraid VFS guys would be
>>>>>> concerned about why we do this on kernel side rather than in user space.
>>>>
>>>> Just from my own perspective, even if it's in FUSE, the concern is
>>>> almost the same.
>>>>
>>>> I wonder if on-demand cachefiles can keep fds too in the future
>>>> (thus e.g. daemonless feature could even be implemented entirely
>>>> with kernel fdstore) but it still has the same concern or it's
>>>> a source of duplication.
>>>>
>>>> Thanks,
>>>> Gao Xiang
>>>>
>>>>>> I'm not sure what the VFS guys think about this and if the kernel side
>>>>>> shall care about this.
>>>
>>> Fwiw, I'm not convinced and I think that's a big can of worms security
>>> wise and semantics wise. I have discussed whether a kernel-side fdstore
>>> would be something that systemd would use if available multiple times
>>> and they wouldn't use it because it provides them with no benefits over
>>> having it in userspace.
>>
>> As far as I know, currently there are approximately two ways to do
>> failover mechanisms in kernel.
>>
>> The first model is a fuse-like one: in this mode, an fd has to be kept
>> and passed around to maintain the active state. And currently,
>> userspace is responsible for the permission/security issues when doing
>> something like passing fds.
>>
>> The second model is a one-device-one-instance model, for example ublk
>> (if I understand correctly): each active instance (/dev/ublkbX) has its
>> own unique control device (/dev/ublkcX). Users can assign/change
>> DAC/MAC for each control device, and failover recovery just needs to
>> reopen the control device with the proper permissions and do recovery.
>>
>> So, just my own thought: a kernel-side fdstore pseudo filesystem could
>> provide a DAC/MAC mechanism for the first model. That would be much
>> cleaner than doing something similar independently in each subsystem
>> that needs a DAC/MAC-like mechanism. But that is just my own thought.
>
> The failover mechanism for /dev/ublkcX could easily be implemented using
> the fdstore. The fact that they rolled their own thing is orthogonal to
> this imho. Implementing retrieval policies like this in the kernel is
> slowly advancing into /proc/$pid/fd/ levels of complexity. That's all
> better handled with appropriate policies in userspace. And cachefilesd
> can similarly just stash their fds in the fdstore.

Ok, got it. I just wanted to know what a kernel fdstore would look like
(since Miklos mentioned it, I wondered whether it's feasible, as it
could benefit non-fuse cases too). I think a userspace fdstore works
for me (unless some other interesting use cases come up for evaluation
later).

Jingbo has an internal requirement for fuse; that is pure fuse stuff
and out of my scope, though.

Thanks,
Gao Xiang

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  3:08 ` Jingbo Xu
  2024-05-28  4:02 ` Gao Xiang
@ 2024-05-28  7:46 ` Miklos Szeredi
  1 sibling, 0 replies; 15+ messages in thread
From: Miklos Szeredi @ 2024-05-28  7:46 UTC (permalink / raw)
To: Jingbo Xu; +Cc: linux-fsdevel, linux-kernel, winters.zc

On Tue, 28 May 2024 at 05:08, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> There was an RFC for kernel-side fdstore [1], though it's also
> implemented upon FUSE.

I strongly believe that this needs to be disassociated from fuse. It
could be a pseudo filesystem, though.

Thanks,
Miklos

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  2:45 ` Jingbo Xu
  2024-05-28  3:08 ` Jingbo Xu
@ 2024-05-28  7:45 ` Miklos Szeredi
  1 sibling, 0 replies; 15+ messages in thread
From: Miklos Szeredi @ 2024-05-28  7:45 UTC (permalink / raw)
To: Jingbo Xu; +Cc: linux-fsdevel, linux-kernel, winters.zc

On Tue, 28 May 2024 at 04:45, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:

> Yeah I indeed had considered this, but I'm afraid VFS guys would be
> concerned about why we do this on kernel side rather than in user space.
>
> I'm not sure what the VFS guys think about this and if the kernel side
> shall care about this.

Yes, that is indeed something that needs to be discussed. I often find
that, when discussing something like this, a lot of good ideas can come
from different directions, so it can help move things forward.

Try something really simple first, and post a patch. Don't overthink
the first version.

Thanks,
Miklos

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-24  6:40 [RFC 0/2] fuse: introduce fuse server recovery mechanism Jingbo Xu
  ` (2 preceding siblings ...)
  2024-05-27 15:16 ` [RFC 0/2] fuse: introduce fuse server " Miklos Szeredi
@ 2024-05-28  8:38 ` Christian Brauner
  2024-05-28  9:45 ` Jingbo Xu
  3 siblings, 1 reply; 15+ messages in thread
From: Christian Brauner @ 2024-05-28  8:38 UTC (permalink / raw)
To: Jingbo Xu; +Cc: miklos, linux-fsdevel, linux-kernel, winters.zc

On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
> Background
> ==========
> The fd of '/dev/fuse' serves as a message transmission channel between
> FUSE filesystem (kernel space) and fuse server (user space). Once the
> fd gets closed (intentionally or unintentionally), the FUSE filesystem
> gets aborted, and any attempt of filesystem access gets -ECONNABORTED
> error until the FUSE filesystem finally umounted.
>
> It is one of the requisites in production environment to provide
> uninterruptible filesystem service. The most straightforward way, and
> maybe the most widely used way, is that make another dedicated user
> daemon (similar to systemd fdstore) keep the device fd open. When the
> fuse daemon recovers from a crash, it can retrieve the device fd from the
> fdstore daemon through socket takeover (Unix domain socket) method [1]
> or pidfd_getfd() syscall [2]. In this way, as long as the fdstore
> daemon doesn't exit, the FUSE filesystem won't get aborted once the fuse
> daemon crashes, though the filesystem service may hang there for a while
> when the fuse daemon gets restarted and has not been completely
> recovered yet.
>
> This picture indeed works and has been deployed in our internal
> production environment until the following issues are encountered:
>
> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
> filesystem gets aborted and irrecoverable.

That's only a problem if you use the fdstore of the per-user instance.
The main fdstore is part of PID 1 and you can't kill that. So really,
systemd needs to hand the fds from the per-user instance to the main
fdstore.

> 2. In scenarios of containerized deployment, the fuse daemon is deployed
> in a container POD, and a dedicated fdstore daemon needs to be deployed
> for each fuse daemon. The fdstore daemon could consume a amount of
> resources (e.g. memory footprint), which is not conducive to the dense
> container deployment.
>
> 3. Each fuse daemon implementation needs to implement its own fdstore
> daemon. If we implement the fuse recovery mechanism on the kernel side,
> all fuse daemon implementations could reuse this mechanism.

You can just use the global fdstore. That is a design limitation, not
an inherent limitation.

^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC 0/2] fuse: introduce fuse server recovery mechanism
  2024-05-28  8:38 ` Christian Brauner
@ 2024-05-28  9:45 ` Jingbo Xu
  0 siblings, 0 replies; 15+ messages in thread
From: Jingbo Xu @ 2024-05-28  9:45 UTC (permalink / raw)
To: Christian Brauner; +Cc: miklos, linux-fsdevel, linux-kernel, winters.zc

Hi, Christian,

Thanks for the review.

On 5/28/24 4:38 PM, Christian Brauner wrote:
> On Fri, May 24, 2024 at 02:40:28PM +0800, Jingbo Xu wrote:
>> Background
>> ==========
>> The fd of '/dev/fuse' serves as a message transmission channel between
>> FUSE filesystem (kernel space) and fuse server (user space). Once the
>> fd gets closed (intentionally or unintentionally), the FUSE filesystem
>> gets aborted, and any attempt of filesystem access gets -ECONNABORTED
>> error until the FUSE filesystem finally umounted.
>>
>> It is one of the requisites in production environment to provide
>> uninterruptible filesystem service. The most straightforward way, and
>> maybe the most widely used way, is that make another dedicated user
>> daemon (similar to systemd fdstore) keep the device fd open. When the
>> fuse daemon recovers from a crash, it can retrieve the device fd from the
>> fdstore daemon through socket takeover (Unix domain socket) method [1]
>> or pidfd_getfd() syscall [2]. In this way, as long as the fdstore
>> daemon doesn't exit, the FUSE filesystem won't get aborted once the fuse
>> daemon crashes, though the filesystem service may hang there for a while
>> when the fuse daemon gets restarted and has not been completely
>> recovered yet.
>>
>> This picture indeed works and has been deployed in our internal
>> production environment until the following issues are encountered:
>>
>> 1. The fdstore daemon may be killed by mistake, in which case the FUSE
>> filesystem gets aborted and irrecoverable.
>
> That's only a problem if you use the fdstore of the per-user instance.
> The main fdstore is part of PID 1 and you can't kill that. So really,
> systemd needs to hand the fds from the per-user instance to the main
> fdstore.

Systemd indeed has implemented its own fdstore mechanism in user space.
Nowadays more and more fuse daemons are running inside containers, but
a container generally has no systemd inside it.

>> 2. In scenarios of containerized deployment, the fuse daemon is deployed
>> in a container POD, and a dedicated fdstore daemon needs to be deployed
>> for each fuse daemon. The fdstore daemon could consume a amount of
>> resources (e.g. memory footprint), which is not conducive to the dense
>> container deployment.
>>
>> 3. Each fuse daemon implementation needs to implement its own fdstore
>> daemon. If we implement the fuse recovery mechanism on the kernel side,
>> all fuse daemon implementations could reuse this mechanism.
>
> You can just use the global fdstore. That is a design limitation, not
> an inherent limitation.

What I initially meant is that each fuse daemon implementation (e.g.
s3fs, ossfs, and other vendors) needs to build its own, similar
mechanism for daemon failover. There is no common fdstore component
for container scenarios comparable to the systemd fdstore.

I'd admit that it's controversial to implement a kernel-side fdstore.
Thus I only implement a failover mechanism for the fuse server in this
RFC patch. But I also understand Miklos's concern, as what we really
need to support daemon failover is just something like an fdstore to
keep the device fd alive.

-- 
Thanks,
Jingbo

^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-05-28  9:58 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-24  6:40 [RFC 0/2] fuse: introduce fuse server recovery mechanism Jingbo Xu
2024-05-24  6:40 ` [RFC 1/2] fuse: introduce recovery mechanism for fuse server Jingbo Xu
2024-05-24  6:40 ` [RFC 2/2] fuse: uid-based security enhancement for the recovery mechanism Jingbo Xu
2024-05-27 15:16 ` [RFC 0/2] fuse: introduce fuse server " Miklos Szeredi
2024-05-28  2:45   ` Jingbo Xu
2024-05-28  3:08     ` Jingbo Xu
2024-05-28  4:02       ` Gao Xiang
2024-05-28  8:43         ` Christian Brauner
2024-05-28  9:13           ` Gao Xiang
2024-05-28  9:32             ` Christian Brauner
2024-05-28  9:58               ` Gao Xiang
2024-05-28  7:46     ` Miklos Szeredi
2024-05-28  7:45   ` Miklos Szeredi
2024-05-28  8:38 ` Christian Brauner
2024-05-28  9:45   ` Jingbo Xu