From: Xiubo Li <xiubli@redhat.com>
To: David Howells <dhowells@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>,
	Greg Farnum <gfarnum@redhat.com>,
	Venky Shankar <vshankar@redhat.com>,
	Jeff Layton <jlayton@kernel.org>,
	ceph-devel@vger.kernel.org
Subject: Re: Modifying and fixing(?) the per-inode snap handling in ceph
Date: Wed, 17 Jan 2024 10:28:38 +0800	[thread overview]
Message-ID: <e871e640-bf26-4d12-bd98-e80be58641fc@redhat.com> (raw)
In-Reply-To: <2578305.1705401721@warthog.procyon.org.uk>


On 1/16/24 18:42, David Howells wrote:
> Xiubo Li <xiubli@redhat.com> wrote:
>
>> Please note that the 'ceph_cap_snap' struct is oriented toward the MDS
>> daemons in cephfs and covers metadata only, while the 'ceph_snap_context'
>> struct is oriented toward ceph RADOS and covers data only.
> That doesn't seem to be how it is implemented, though.  Looking at the
> writeback code, the ordered list of snaps to be written back *is* the
> ceph_cap_snap list.  The writeback code goes through the list in order,
> writing out the data associated with the snapc that is the ->context for each
> capsnap.

Did you see the code at 
https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/addr.c#L1065-L1076 ? 
Later, when writing a page out, it will use a 'snapc' that must be equal to 
the snapc already attached to that page.

And when the 'ci->ceph_cap_snap' list is empty, 'get_oldest_context()' will 
return the global realm's snapc instead.  The global realm always exists from 
the beginning.
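
Roughly, the check there works like the following simplified sketch (the 
helper name is made up; page_snap_context() is the real helper that reads 
the snapc stashed in page->private):

static bool may_write_page(struct page *page,
			   struct ceph_snap_context *oldest)
{
	/* the snapc that was attached when the page was first dirtied */
	struct ceph_snap_context *pgsnapc = page_snap_context(page);

	/*
	 * Only pages dirtied under the oldest dirty snap context may
	 * be written back in this pass; pages dirtied under a newer
	 * context must wait until the older snaps have been flushed.
	 */
	return pgsnapc == oldest;
}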


>   When a new snapshot occurs, the snap notification service code
> allocates and queues a new capsnap on each inode.  Is there any reason that
> cannot be done lazily?

As you know, cephfs snapshots are already taken lazily and are very fast. 
IMO we need to finalize the size, mtime, etc. for a cap_snap as soon as 
possible, which is why queuing it cannot be deferred.

> Also, it appears there may be an additional bug here: If memory allocation
> fails when allocating a new capsnap in queue_realm_cap_snaps(), it abandons
> the process (not unreasonably), but leaves a bunch of inodes with their old
> capsnaps.  If you can do this lazily, this can be better handled in the write
> routines where user interaction can be bounced.

Yeah, in a corner case this really could fail, and IMO we need to improve 
that. But always deferring this is not a good idea, because in the meantime 
the file size could change a lot.

>
>> As I mentioned above, once a page/folio has had its 'ceph_snap_context'
>> set, it shouldn't change any more.
> Until the page has been cleaned, I presume.

Correct.

>
>> Then could you explain how to pass the snapshot set to ceph RADOS when
>> flushing each dirty page/folio?
> The netfs_group pointer passed into netfs_write_group() is stored in the
> netfs_io_request struct and is thus accessible by the filesystem.  I've
> attached the three functions involved below.  Note that this is a work in
> progress and will most certainly require further changes.  It's also untested
> as I haven't got it to a stage that I can test this part of the code yet.
>
>> I don't understand how to bind the 'ceph_cap_snap' struct to dirty pages
>> here.
> Point folio->private at it and take a ref on it, much as is currently done
> for snapc.  The capsnap->context points to the snapc, and currently
> folio->private points to the snapc; in writeback we take the context snapc
> from the capsnap and iterate over all the folios on that inode looking
> for any with a private pointer that matches the snapc.  We could store the
> capsnap pointer there instead and iterate looking for that.
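
Okay, just to make sure I understand: the attach side would then look 
something like this sketch (the helper name 'ceph_folio_attach_capsnap' is 
made up; folio_attach_private() is the generic pagecache helper, and 'nref' 
is the existing refcount in 'struct ceph_cap_snap'):

static void ceph_folio_attach_capsnap(struct folio *folio,
				      struct ceph_cap_snap *capsnap)
{
	/* pin the capsnap for as long as the folio stays dirty */
	refcount_inc(&capsnap->nref);
	folio_attach_private(folio, capsnap);
}

Then writeback would match folio_get_private(folio) against the capsnap 
being flushed, and the ref would be dropped with ceph_put_cap_snap() once 
the folio is cleaned.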
>
> -~-
>
> I've pushed my patches out to:
>
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=ceph-iter
>
> if you want a look.  Note that the top 7 patches are the ones in flux at the
> moment while I try and write the netfs helpers for the ceph filesystem.  Don't
> expect any of those to compile.
>
> Actually, I'd appreciate it if you could have a look at the patches up to
> "rbd: Use ceph_databuf_enc_start/stop()".  Those are almost completely done,
> though with one exception.  There's one bit in RBD that still needs dealing
> with (marked with a #warn and an "[INCOMPLETE]" marker in the patch subject) -
> but RBD does otherwise work.
>
> The aim is to remove all the different data types, except for two: one where
> an iov_iter is provided (such as an ITER_XARRAY referring directly to the
> pagecache) and one where we have a "ceph_databuf" with a bvec[].  Ideally,
> there would only be one, but I'll see if I can manage that.

Okay, I will try to read these patches this week and see how they work.

Thanks, David

- Xiubo


> David
> ---
> /*
>  * Handle the termination of a write to the server.
>  */
> static void ceph_write_terminated(struct netfs_io_subrequest *subreq, int err)
> {
> 	struct ceph_io_subrequest *csub =
> 		container_of(subreq, struct ceph_io_subrequest, sreq);
> 	struct ceph_io_request *creq = csub->creq;
> 	struct ceph_osd_request *req = csub->req;
> 	struct inode *inode = creq->rreq.inode;
> 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
> 	struct ceph_client *cl = ceph_inode_to_client(inode);
> 	struct ceph_inode_info *ci = ceph_inode(inode);
>
> 	ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
> 				  req->r_end_latency, subreq->len, err);
>
> 	if (err != 0) {
> 		doutc(cl, "sync_write osd write returned %d\n", err);
> 		/* Version changed! Must re-do the rmw cycle */
> 		if ((creq->rmw_assert_version && (err == -ERANGE || err == -EOVERFLOW)) ||
> 		    (!creq->rmw_assert_version && err == -EEXIST)) {
> 			/* We should only ever see this on a rmw */
> 			WARN_ON_ONCE(!test_bit(NETFS_RREQ_RMW, &ci->netfs.flags));
>
> 			/* The version should never go backward */
> 			WARN_ON_ONCE(err == -EOVERFLOW);
>
> 			/* FIXME: limit number of times we loop? */
> 			set_bit(NETFS_RREQ_REPEAT_RMW, &ci->netfs.flags);
> 		}
> 		ceph_set_error_write(ci);
> 	} else {
> 		ceph_clear_error_write(ci);
> 	}
>
> 	csub->req = NULL;
> 	ceph_osdc_put_request(req);
> 	netfs_subreq_terminated(subreq, subreq->len, err);
> }
>
> static void ceph_upload_to_server(struct netfs_io_subrequest *subreq)
> {
> 	struct ceph_io_subrequest *csub =
> 		container_of(subreq, struct ceph_io_subrequest, sreq);
> 	struct ceph_osd_request *req = csub->req;
> 	struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode);
> 	struct ceph_osd_client *osdc = &fsc->client->osdc;
>
> 	trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
> 	ceph_osdc_start_request(osdc, req);
> }
>
> /*
>  * Set up write requests for a writeback slice.  We need to add a write request
>  * for each write we want to make.
>  */
> void ceph_create_write_requests(struct netfs_io_request *wreq, loff_t start, size_t remain)
> {
> 	struct netfs_io_subrequest *subreq;
> 	struct ceph_io_subrequest *csub;
> 	struct ceph_snap_context *snapc =
> 		container_of(wreq->group, struct ceph_snap_context, group);
> 	struct inode *inode = wreq->inode;
> 	struct ceph_inode_info *ci = ceph_inode(inode);
> 	struct ceph_fs_client *fsc =
> 		ceph_inode_to_fs_client(inode);
> 	struct ceph_client *cl = ceph_inode_to_client(inode);
> 	struct ceph_osd_client *osdc = &fsc->client->osdc;
> 	struct ceph_osd_request *req;
> 	struct timespec64 mtime = current_time(inode);
> 	bool rmw = test_bit(NETFS_RREQ_RMW, &ci->netfs.flags);
>
> 	doutc(cl, "create_write_reqs on inode %p %lld~%zu snapc %p seq %lld\n",
> 	      inode, start, remain, snapc, snapc->seq);
>
> 	do {
> 		unsigned long long llpart;
> 		size_t part = remain;
> 		u64 assert_ver = 0;
> 		u64 objnum, objoff;
>
> 		subreq = netfs_create_write_request(wreq, NETFS_UPLOAD_TO_SERVER,
> 						    start, remain, NULL);
> 		if (!subreq) {
> 			wreq->error = -ENOMEM;
> 			break;
> 		}
>
> 		csub = container_of(subreq, struct ceph_io_subrequest, sreq);
>
> 		/* clamp the length to the end of the first object */
> 		ceph_calc_file_object_mapping(&ci->i_layout, start, remain,
> 					      &objnum, &objoff, &part);
>
> 		llpart = part;
> 		doutc(cl, "create_write_reqs ino %llx %lld~%zu part %lld~%llu -- %srmw\n",
> 		      ci->i_vino.ino, start, remain, start, llpart, rmw ? "" : "no ");
>
> 		req = ceph_osdc_new_request(osdc, &ci->i_layout,
> 					    ci->i_vino, start, &llpart,
> 					    rmw ? 1 : 0, rmw ? 2 : 1,
> 					    CEPH_OSD_OP_WRITE,
> 					    CEPH_OSD_FLAG_WRITE,
> 					    snapc, ci->i_truncate_seq,
> 					    ci->i_truncate_size, false);
> 		if (IS_ERR(req)) {
> 			wreq->error = PTR_ERR(req);
> 			break;
> 		}
>
> 		part = llpart;
> 		doutc(cl, "write op %lld~%zu\n", start, part);
> 		/* attach the data to the write op (op index 1 when doing rmw) */
> 		osd_req_op_extent_osd_iter(req, rmw ? 1 : 0, &subreq->io_iter);
> 		req->r_inode = inode;
> 		req->r_mtime = mtime;
> 		req->subreq = subreq;
> 		csub->req = req;
>
> 		/*
> 		 * If we're doing an RMW cycle, set up an assertion that the
> 		 * remote data hasn't changed.  If we don't have a version
> 		 * number, then the object doesn't exist yet.  Use an exclusive
> 		 * create instead of a version assertion in that case.
> 		 */
> 		if (rmw) {
> 			if (assert_ver) {
> 				osd_req_op_init(req, 0, CEPH_OSD_OP_ASSERT_VER, 0);
> 				req->r_ops[0].assert_ver.ver = assert_ver;
> 			} else {
> 				osd_req_op_init(req, 0, CEPH_OSD_OP_CREATE,
> 						CEPH_OSD_OP_FLAG_EXCL);
> 			}
> 		}
>
> 		ceph_upload_to_server(subreq);
>
> 		start += part;
> 		remain -= part;
> 	} while (remain > 0 && !wreq->error);
>
> 	doutc(cl, "create_write_reqs returning %d\n", wreq->error);
> }
>

