From: Xiubo Li <xiubli@redhat.com>
To: David Howells <dhowells@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>,
Greg Farnum <gfarnum@redhat.com>,
Venky Shankar <vshankar@redhat.com>,
Jeff Layton <jlayton@kernel.org>,
ceph-devel@vger.kernel.org
Subject: Re: Modifying and fixing(?) the per-inode snap handling in ceph
Date: Wed, 17 Jan 2024 10:28:38 +0800 [thread overview]
Message-ID: <e871e640-bf26-4d12-bd98-e80be58641fc@redhat.com> (raw)
In-Reply-To: <2578305.1705401721@warthog.procyon.org.uk>
On 1/16/24 18:42, David Howells wrote:
> Xiubo Li <xiubli@redhat.com> wrote:
>
>> Please note that the 'ceph_cap_snap' struct is oriented toward the MDS
>> daemons in cephfs and covers metadata only, while the
>> 'ceph_snap_context' struct is oriented toward Ceph RADOS and covers
>> data only.
> That doesn't seem to be how it is implemented, though. Looking at the
> writeback code, the ordered list of snaps to be written back *is* the
> ceph_cap_snap list. The writeback code goes through the list in order,
> writing out the data associated with the snapc that is the ->context for each
> capsnap.
Did you see the code at
https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/addr.c#L1065-L1076
? Later, when writing a page out, it uses a 'snapc' that must be equal
to the snapc stored on the page.

When the 'ci->i_cap_snaps' list is empty, 'get_oldest_context()' returns
the global realm's snapc instead; the global realm always exists from
the beginning.
> When a new snapshot occurs, the snap notification service code
> allocates and queues a new capsnap on each inode. Is there any reason that
> cannot be done lazily?
As you know, cephfs snapshots are already lazy and very fast. IMO we
need to finalize the size, mtime, etc. for a cap_snap as soon as
possible.
> Also, it appears there may be an additional bug here: If memory allocation
> fails when allocating a new capsnap in queue_realm_cap_snaps(), it abandons
> the process (not unreasonably), but leaves a bunch of inodes with their old
> capsnaps. If you can do this lazily, this can be better handled in the write
> routines where user interaction can be bounced.
Yeah, in a corner case this really could fail, and IMO we need to
improve it. But always deferring this is not a good idea, because in
the meantime the size could change a lot.
>
>> As I mentioned above, once a page/folio has its 'ceph_snap_context'
>> set, it shouldn't change any more.
> Until the page has been cleaned, I presume.
Correct.
>
>> Then could you explain how the snapshot set is passed to Ceph RADOS
>> when flushing each dirty page/folio?
> The netfs_group pointer passed into netfs_write_group() is stored in the
> netfs_io_request struct and is thus accessible by the filesystem. I've
> attached the three functions involved below. Note that this is a work in
> progress and will most certainly require further changes. It's also untested
> as I haven't got it to a stage that I can test this part of the code yet.
>
>> I don't understand how to bind the 'ceph_cap_snap' struct to dirty pages
>> here.
> Point folio->private at it and take a ref on it, much as it is currently doing
> for snapc. The capsnap->context points to snapc and currently folio->private
> points to the snapc and then in writeback, and currently we take the context
> snapc from the capsnap and iterate over all the folios on that inode looking
> for any with a private pointer that matches the snapc. We could store the
> capsnap pointer there instead and iterate looking for that.
>
> -~-
>
> I've pushed my patches out to:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=ceph-iter
>
> if you want a look. Note that the top 7 patches are the ones in flux at the
> moment while I try and write the netfs helpers for the ceph filesystem. Don't
> expect any of those to compile.
>
> Actually, I'd appreciate it if you could have a look at the patches up to
> "rbd: Use ceph_databuf_enc_start/stop()". Those are almost completely done,
> though with one exception. There's one bit in RBD that still needs dealing
> with (marked with a #warn and an "[INCOMPLETE]" marked in the patch subject) -
> but RBD does otherwise work.
>
> The aim is to remove all the different data types, except for two: one where
> an iov_iter is provided (such as an ITER_XARRAY referring directly to the
> pagecache) and one where we have a "ceph_databuf" with a bvec[]. Ideally,
> there would only be one, but I'll see if I can manage that.
Okay, I will try to read these patches this week and see how they work.

Thanks, David.
- Xiubo
> David
> ---
> /*
> * Handle the termination of a write to the server.
> */
> static void ceph_write_terminated(struct netfs_io_subrequest *subreq, int err)
> {
>         struct ceph_io_subrequest *csub =
>                 container_of(subreq, struct ceph_io_subrequest, sreq);
>         struct ceph_io_request *creq = csub->creq;
>         struct ceph_osd_request *req = csub->req;
>         struct inode *inode = creq->rreq.inode;
>         struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
>         struct ceph_client *cl = ceph_inode_to_client(inode);
>         struct ceph_inode_info *ci = ceph_inode(inode);
>
>         ceph_update_write_metrics(&fsc->mdsc->metric, req->r_start_latency,
>                                   req->r_end_latency, subreq->len, err);
>
>         if (err != 0) {
>                 doutc(cl, "sync_write osd write returned %d\n", err);
>                 /* Version changed!  Must redo the rmw cycle. */
>                 if ((creq->rmw_assert_version && (err == -ERANGE || err == -EOVERFLOW)) ||
>                     (!creq->rmw_assert_version && err == -EEXIST)) {
>                         /* We should only ever see this on a rmw */
>                         WARN_ON_ONCE(!test_bit(NETFS_RREQ_RMW, &ci->netfs.flags));
>
>                         /* The version should never go backward */
>                         WARN_ON_ONCE(err == -EOVERFLOW);
>
>                         /* FIXME: limit number of times we loop? */
>                         set_bit(NETFS_RREQ_REPEAT_RMW, &ci->netfs.flags);
>                 }
>                 ceph_set_error_write(ci);
>         } else {
>                 ceph_clear_error_write(ci);
>         }
>
>         csub->req = NULL;
>         ceph_osdc_put_request(req);
>         netfs_subreq_terminated(subreq, err < 0 ? err : subreq->len, false);
> }
>
> static void ceph_upload_to_server(struct netfs_io_subrequest *subreq)
> {
>         struct ceph_io_subrequest *csub =
>                 container_of(subreq, struct ceph_io_subrequest, sreq);
>         struct ceph_osd_request *req = csub->req;
>         struct ceph_fs_client *fsc = ceph_inode_to_fs_client(subreq->rreq->inode);
>         struct ceph_osd_client *osdc = &fsc->client->osdc;
>
>         trace_netfs_sreq(subreq, netfs_sreq_trace_submit);
>         ceph_osdc_start_request(osdc, req);
> }
>
> /*
> * Set up write requests for a writeback slice. We need to add a write request
> * for each write we want to make.
> */
> void ceph_create_write_requests(struct netfs_io_request *wreq, loff_t start, size_t remain)
> {
>         struct netfs_io_subrequest *subreq;
>         struct ceph_io_subrequest *csub;
>         struct ceph_snap_context *snapc =
>                 container_of(wreq->group, struct ceph_snap_context, group);
>         struct inode *inode = wreq->inode;
>         struct ceph_inode_info *ci = ceph_inode(inode);
>         struct ceph_fs_client *fsc = ceph_inode_to_fs_client(inode);
>         struct ceph_client *cl = ceph_inode_to_client(inode);
>         struct ceph_osd_client *osdc = &fsc->client->osdc;
>         struct ceph_osd_request *req;
>         struct timespec64 mtime = current_time(inode);
>         bool rmw = test_bit(NETFS_RREQ_RMW, &ci->netfs.flags);
>
>         doutc(cl, "create_write_reqs on inode %p %lld~%zu snapc %p seq %lld\n",
>               inode, start, remain, snapc, snapc->seq);
>
>         do {
>                 unsigned long long llpart;
>                 size_t part = remain;
>                 u64 assert_ver = 0;
>                 u64 objnum, objoff;
>
>                 subreq = netfs_create_write_request(wreq, NETFS_UPLOAD_TO_SERVER,
>                                                     start, remain, NULL);
>                 if (!subreq) {
>                         wreq->error = -ENOMEM;
>                         break;
>                 }
>
>                 csub = container_of(subreq, struct ceph_io_subrequest, sreq);
>
>                 /* clamp the length to the end of first object */
>                 ceph_calc_file_object_mapping(&ci->i_layout, start, remain,
>                                               &objnum, &objoff, &part);
>
>                 llpart = part;
>                 doutc(cl, "create_write_reqs ino %llx %lld~%zu part %lld~%llu -- %srmw\n",
>                       ci->i_vino.ino, start, remain, start, llpart, rmw ? "" : "no ");
>
>                 req = ceph_osdc_new_request(osdc, &ci->i_layout,
>                                             ci->i_vino, start, &llpart,
>                                             rmw ? 1 : 0, rmw ? 2 : 1,
>                                             CEPH_OSD_OP_WRITE,
>                                             CEPH_OSD_FLAG_WRITE,
>                                             snapc, ci->i_truncate_seq,
>                                             ci->i_truncate_size, false);
>                 if (IS_ERR(req)) {
>                         wreq->error = PTR_ERR(req);
>                         break;
>                 }
>
>                 part = llpart;
>                 doutc(cl, "write op %lld~%zu\n", start, part);
>                 osd_req_op_extent_osd_iter(req, 0, &subreq->io_iter);
>                 req->r_inode = inode;
>                 req->r_mtime = mtime;
>                 req->subreq = subreq;
>                 csub->req = req;
>
>                 /*
>                  * If we're doing an RMW cycle, set up an assertion that the
>                  * remote data hasn't changed.  If we don't have a version
>                  * number, then the object doesn't exist yet.  Use an exclusive
>                  * create instead of a version assertion in that case.
>                  */
>                 if (rmw) {
>                         if (assert_ver) {
>                                 osd_req_op_init(req, 0, CEPH_OSD_OP_ASSERT_VER, 0);
>                                 req->r_ops[0].assert_ver.ver = assert_ver;
>                         } else {
>                                 osd_req_op_init(req, 0, CEPH_OSD_OP_CREATE,
>                                                 CEPH_OSD_OP_FLAG_EXCL);
>                         }
>                 }
>
>                 ceph_upload_to_server(subreq);
>
>                 start += part;
>                 remain -= part;
>         } while (remain > 0 && !wreq->error);
>
>         doutc(cl, "create_write_reqs returning %d\n", wreq->error);
> }
>
Thread overview: 3+ messages
[not found] <f535be3c-3d02-4a13-8aaf-5c634ffa218b@redhat.com>
2024-01-15 14:07 ` Modifying and fixing(?) the per-inode snap handling in ceph David Howells
2024-01-16 10:42 ` David Howells
2024-01-17 2:28 ` Xiubo Li [this message]