From: Jeff Garzik <jeff@garzik.org>
To: Project Hail <hail-devel@vger.kernel.org>
Subject: Ranged GET for chunkd, tabled
Date: Fri, 09 Jul 2010 18:21:02 -0400
Message-ID: <4C37A0CE.8080805@garzik.org>

Reading -less- than the entire file is a required feature of the S3
API: a Range HTTP header supplied with the GET method gives the byte
range for the request. This corrects an otherwise obvious limitation
in the protocol: if you desire only a 4k chunk of a 2GB file, you
should not be forced to download all of the 2GB file.

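
To make the 4k-of-2GB case concrete: such a request would carry a header like "Range: bytes=0-4095". Below is a minimal sketch, not tabled's actual code, of turning the simplest form of that header value into an offset and a length. The helper name parse_byte_range is hypothetical, and the suffix ("bytes=-500"), open-ended ("bytes=100-") and multi-range forms are deliberately rejected to keep it short.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse a Range header value of the form
 * "bytes=first-last" (both positions inclusive).
 * Returns 0 and fills *offset / *length on success, -1 otherwise.
 * Sketch only: no overflow checking on very large numbers.
 */
static int parse_byte_range(const char *value, uint64_t *offset,
                            uint64_t *length)
{
    static const char prefix[] = "bytes=";
    unsigned long long first, last;
    char *end;

    assert(value != NULL && offset != NULL && length != NULL);

    if (strncmp(value, prefix, sizeof(prefix) - 1) != 0)
        return -1;
    value += sizeof(prefix) - 1;

    /* first-byte-pos must be present and numeric */
    if (!isdigit((unsigned char)*value))
        return -1;
    first = strtoull(value, &end, 10);

    /* a '-' and a numeric last-byte-pos must follow */
    if (*end != '-' || !isdigit((unsigned char)end[1]))
        return -1;
    last = strtoull(end + 1, &end, 10);

    /* nothing may trail the range, and it must not be inverted */
    if (*end != '\0' || last < first)
        return -1;

    *offset = first;
    *length = last - first + 1;
    return 0;
}
```

For "bytes=0-4095" this yields offset 0 and length 4096, which is the pair a storage backend would then need in order to fetch only that portion.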
Partial-GET is also a must-have feature for my other two hacking
projects, itd and nfs4d. When executing a SCSI READ, itd will not want
to download a huge amount of data, just to handle a 4-LBA request.
Similarly with nfs4d, executing a READ of an NFS file should not require
nfs4d to download more data than required from chunkd.

For tabled, the implementation requires a bit of modification to the
event-driven GET code path, but nothing overly burdensome. It largely
relies on chunkd, though, to provide the ability to retrieve only a
portion of the specified object.

For chunkd, the implementation of partial-GET is also relatively
straightforward, but it introduces a few minor protocol issues.

Presently, we checksum the entire object at PUT time, and return that
checksum at GET time, so that the client may verify the [strong]
checksum to ensure no data corruption occurred.

A partial-GET renders that whole-object checksum useless: the checksum
must be recomputed for just the object subset being requested.
Unfortunately, this also implies that a key optimization, checksum
offload (which goes straight from kernel pages to NIC TCP output via
DMA, all in hardware), becomes impossible.

On an unencrypted GET, chunkd executes sendfile(2), thereby eliminating
several memory copies that would otherwise be made by the app and by the
kernel. sendfile(2) reads data from one fd and writes that data to
another fd, all without ever exposing that data directly to the app.
As such, partial-GET with checksumming would require replacing

	sendfile(out_fd, in_fd, &offset, bytes);

with

	while (requested range not completely written to out_fd)
		n = read(in_fd, buf, count)
		SHA1_hash(buf, n)
		write(out_fd, buf, n)

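
Spelled out a little further, the loop might look like the sketch below. This is illustration only, not chunkd code. The function name copy_range_hashed is made up, in_fd is assumed to be already positioned at the start of the requested range (e.g. via lseek), and a 64-bit FNV-1a hash stands in for SHA-1 purely so the example has no library dependency. FNV-1a is NOT a strong checksum.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* FNV-1a, 64-bit: a stand-in for SHA-1 in this sketch only. */
#define FNV1A_INIT 0xcbf29ce484222325ULL

static uint64_t fnv1a_update(uint64_t h, const unsigned char *p, size_t n)
{
    while (n--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/*
 * Hypothetical helper: copy `bytes` bytes from in_fd (already positioned
 * at the start of the range) to out_fd, hashing everything that is read.
 * Returns 0 and stores the hash in *sum on success, -1 on I/O error or
 * if the input ends before the range is complete.
 */
static int copy_range_hashed(int out_fd, int in_fd, size_t bytes,
                             uint64_t *sum)
{
    unsigned char buf[65536];
    uint64_t h = FNV1A_INIT;

    assert(sum != NULL);

    while (bytes > 0) {
        size_t want = bytes < sizeof(buf) ? bytes : sizeof(buf);
        ssize_t n = read(in_fd, buf, want);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;          /* range runs past end of input */

        /* hash exactly what was read, which may be less than `want` */
        h = fnv1a_update(h, buf, (size_t)n);

        /* write(2) may be partial too, so loop until the buffer is out */
        for (ssize_t done = 0; done < n; ) {
            ssize_t w = write(out_fd, buf + done, (size_t)(n - done));

            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += w;
        }
        bytes -= (size_t)n;
    }

    *sum = h;
    return 0;
}
```

Note that every byte now passes through a userspace buffer twice (once in, once out), which is exactly the copying that sendfile(2) avoids.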
The protocol issue is related. If we are to deliver the checksum in the
-header-, that implies that the entire partial-GET object data must be
read and checksummed prior to creating the message header. Then, the
message header and object data are sent. Incredibly inefficient. The
time-honored solution is putting the checksum at the end of the data
stream, thereby allowing the checksum to be generated during data
transmission.

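
For a feel of what a trailing checksum means on the receiving end, here is a rough client-side sketch. Everything about the wire layout is an assumption made for illustration: the payload length is taken as already known (say, from the response header), the trailer is 8 bytes in big-endian order, and FNV-1a again stands in for a strong hash such as SHA-1. The names read_full and recv_with_trailer are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* FNV-1a, 64-bit: placeholder for a strong checksum in this sketch. */
#define FNV1A_INIT 0xcbf29ce484222325ULL

static uint64_t fnv1a_update(uint64_t h, const unsigned char *p, size_t n)
{
    while (n--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Read exactly n bytes; 0 on success, -1 on error or early EOF. */
static int read_full(int fd, unsigned char *p, size_t n)
{
    while (n > 0) {
        ssize_t r = read(fd, p, n);

        if (r < 0 && errno == EINTR)
            continue;
        if (r <= 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/*
 * Assumed layout: `len` payload bytes followed by an 8-byte big-endian
 * checksum of that payload.
 * Returns 0 if the checksum matches, 1 on mismatch, -1 on I/O error.
 */
static int recv_with_trailer(int fd, unsigned char *dst, size_t len)
{
    unsigned char trailer[8];
    uint64_t want = 0, got;

    assert(dst != NULL);

    if (read_full(fd, dst, len) < 0)
        return -1;
    got = fnv1a_update(FNV1A_INIT, dst, len);

    if (read_full(fd, trailer, sizeof(trailer)) < 0)
        return -1;
    for (size_t i = 0; i < sizeof(trailer); i++)
        want = (want << 8) | trailer[i];

    return got == want ? 0 : 1;
}
```

The trade-off of this layout is that the client only learns of a mismatch after it has received the payload, so it must not act on the data until the trailer has been checked.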
Another issue this raises is checksum verification. Ideally we want to
have a pre-stored checksum, so that the local node can verify at data
transmission time that what it reads off disk matches what it wrote $N
days ago. Simply creating a checksum of what you write(2) to a TCP
connection does not protect against disk corruption.

One solution is to update the chunkd disk format (again), and introduce
checksums for each fixed-size block, i.e. one checksum for each 64k in
a file. This would enable chunkd to verify, prior to sending data on a
partial-GET, that the data pulled off disk is not corrupted.

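
A rough sketch of how per-block verification could work for a byte range follows. It is not a proposed disk format: it works on an in-memory buffer for brevity where chunkd would read from disk, the function names are made up, and FNV-1a stands in for whatever strong checksum would really be stored. The idea shown is only the block arithmetic: a range is verified by re-checking each 64k block it touches, not the whole object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 65536u                   /* one checksum per 64k */
#define FNV1A_INIT 0xcbf29ce484222325ULL    /* FNV-1a: placeholder hash */

static uint64_t fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = FNV1A_INIT;

    while (n--) {
        h ^= *p++;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Length of block i within an object of data_len bytes (last may be short). */
static size_t block_len(size_t data_len, size_t i)
{
    size_t off = i * BLOCK_SIZE;

    return data_len - off < BLOCK_SIZE ? data_len - off : BLOCK_SIZE;
}

/*
 * "PUT time": fill sums[] with one checksum per block. sums[] must have
 * room for (data_len + BLOCK_SIZE - 1) / BLOCK_SIZE entries.
 */
static void block_sums_build(const unsigned char *data, size_t data_len,
                             uint64_t *sums)
{
    size_t nblocks = (data_len + BLOCK_SIZE - 1) / BLOCK_SIZE;

    for (size_t i = 0; i < nblocks; i++)
        sums[i] = fnv1a(data + i * BLOCK_SIZE, block_len(data_len, i));
}

/*
 * "Partial-GET time": returns 1 if every block touched by the range
 * [offset, offset + len) still matches its stored checksum, 0 if any
 * block is corrupt or the range is empty or out of bounds.
 */
static int block_sums_verify_range(const unsigned char *data,
                                   size_t data_len, const uint64_t *sums,
                                   size_t offset, size_t len)
{
    assert(data != NULL && sums != NULL);

    if (len == 0 || offset >= data_len || len > data_len - offset)
        return 0;

    size_t first = offset / BLOCK_SIZE;
    size_t last = (offset + len - 1) / BLOCK_SIZE;

    for (size_t i = first; i <= last; i++) {
        const unsigned char *blk = data + i * BLOCK_SIZE;

        if (fnv1a(blk, block_len(data_len, i)) != sums[i])
            return 0;
    }
    return 1;
}
```

With this layout a 4k read costs at most two 64k block checks, and corruption elsewhere in the object does not fail a request that never touches the damaged block.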
Just some food for thought :)
Jeff