From: Patrick Steinhardt <ps@pks.im>
To: git@vger.kernel.org
Subject: Re: [PATCH v2 4/7] reftable/block: refactor binary search over restart points
Date: Wed, 3 Apr 2024 08:01:46 +0200 [thread overview]
Message-ID: <Zgzwyjwo780Ry1XL@tanuki> (raw)
In-Reply-To: <ympg3oam4pty53qnxhtnla2xac73gdqsoclwsuzcmjd4dn4xre@j64veypnecrj>
[-- Attachment #1: Type: text/plain, Size: 4817 bytes --]
On Tue, Apr 02, 2024 at 12:46:30PM -0500, Justin Tobler wrote:
> On 24/04/02 07:15PM, Patrick Steinhardt wrote:
> > On Tue, Apr 02, 2024 at 11:42:29AM -0500, Justin Tobler wrote:
> > > On 24/03/25 11:10AM, Patrick Steinhardt wrote:
> > > > When seeking a record in our block reader we perform a binary search
> > > > over the block's restart points so that we don't have to do a linear
> > > > scan over the whole block. The logic to do so is quite intricate though,
> > > > which makes it hard to understand.
> > > >
> > > > Improve documentation and rename some of the functions and variables so
> > > > that the code becomes easier to understand overall. This refactoring
> > > > should not result in any change in behaviour.
> > > >
> > > > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > > > ---
> > > ...
> > > > - i = binsearch(br->restart_count, &restart_key_less, &args);
> > > > + /*
> > > > + * Perform a binary search over the block's restart points, which
> > > > + * avoids doing a linear scan over the whole block. Like this, we
> > > > + * identify the section of the block that should contain our key.
> > > > + *
> > > > + * Note that we explicitly search for the first restart point _greater_
> > > > + * than the sought-after record, not _greater or equal_ to it. In case
> > > > + * the sought-after record is located directly at the restart point we
> > > > + * would otherwise start doing the linear search at the preceding
> > > > + * restart point. While that works alright, we would end up scanning
> > > > + * too many record.
> > > > + */
> > > > + i = binsearch(br->restart_count, &restart_needle_less, &args);
> > > > +
> > > > + /*
> > > > + * Now there are multiple cases:
> > > > + *
> > > > + * - `i == 0`: The wanted record must be contained before the first
> > > > + * restart point. We will thus start searching for the record in
> > > > + * that section after accounting for the header offset.
> > >
> > > To clarify my understanding, does a restart_offset not exist at the
> > > beginning of a block? I was originally under the impression that the
> > > first reset point would be at the beginning of the block (or just after
> > > the header). If this was the case, the first restart point being greater
> > > would indicate that the wanted record is not contained within the block.
> >
> > It wouldn't make much sense to include it as a restart point. A restart
> > point is a point where the prefix compression will be reset such that
> > the record at that point can be read without reading preceding records.
> > This is always implicitly true for the first record in a block as it is
> > never prefix-compressed. Consequently, writing a restart point for the
> > first record would be a waste of disk space.
>
> Thanks Patrick! Good to know :)
>
> From Documentation/technical/reftable.txt:
>
> > As the first ref block shares the first file block with the file
> > header, all restart_offset in the first block are relative to the
> > start of the file (position 0), and include the file header. This
> > forces the first restart_offset to be 28.
>
> I'm assumming this is out-of-date.
That paragraph actually made me re-check my own assumptions. Turns out
my claim is wrong. The function responsible for registering restarts is
`block_writer_register_restart()`, which gets a parameter `is_restart`
that is determined by `reftable_encode_key()`. And that function in turn
will set `restart = 1` whenever `prefix_len == 0`. And given that the
first record always has `prefix_len == 0`, it will thus have a restart
point, as well.
I'll update the comment like this:
diff --git a/reftable/block.c b/reftable/block.c
index 8bb4e43cec..298e8c56b9 100644
--- a/reftable/block.c
+++ b/reftable/block.c
@@ -417,9 +417,12 @@ int block_reader_seek(struct block_reader *br, struct block_iter *it,
/*
* Now there are multiple cases:
*
- * - `i == 0`: The wanted record must be contained before the first
- * restart point. We will thus start searching for the record in
- * that section after accounting for the header offset.
+ * - `i == 0`: The wanted record is smaller than the record found at
+ * the first restart point. As the first restart point is the first
+ * record in the block, our wanted record cannot be located in this
+ * block at all. We still need to position the iterator so that the
+ * next call to `block_iter_next()` will yield an end-of-iterator
+ * signal.
*
* - `i == restart_count`: The wanted record was not found at any of
* the restart points. As there is no restart point at the end of
Thanks for making me challenge my own assumptions!
Patrick
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
next prev parent reply other threads:[~2024-04-03 6:01 UTC|newest]
Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-22 12:22 [PATCH 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-03-22 18:46 ` Justin Tobler
2024-03-25 10:07 ` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-03-22 18:55 ` Justin Tobler
2024-03-25 10:07 ` Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-03-22 12:22 ` [PATCH 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-02 16:27 ` Justin Tobler
2024-04-02 17:15 ` Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-02 16:42 ` Justin Tobler
2024-04-02 17:15 ` Patrick Steinhardt
2024-04-02 17:46 ` Justin Tobler
2024-04-03 6:01 ` Patrick Steinhardt [this message]
2024-03-25 10:10 ` [PATCH v2 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-03-25 10:10 ` [PATCH v2 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-03-25 10:11 ` [PATCH v2 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-04-02 16:47 ` Justin Tobler
2024-04-02 17:15 ` Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-02 17:24 ` [PATCH v3 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-04-02 17:25 ` [PATCH v3 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-04-02 17:25 ` [PATCH v3 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
2024-04-02 17:49 ` [PATCH v3 0/7] reftable: improvements for the `binsearch()` mechanism Justin Tobler
2024-04-03 6:03 ` [PATCH v4 " Patrick Steinhardt
2024-04-03 6:03 ` [PATCH v4 1/7] reftable/basics: fix return type of `binsearch()` to be `size_t` Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 2/7] reftable/basics: improve `binsearch()` test Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 3/7] reftable/refname: refactor binary search over refnames Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 4/7] reftable/block: refactor binary search over restart points Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 5/7] reftable/block: fix error handling when searching " Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 6/7] reftable/record: extract function to decode key lengths Patrick Steinhardt
2024-04-03 6:04 ` [PATCH v4 7/7] reftable/block: avoid decoding keys when searching restart points Patrick Steinhardt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Zgzwyjwo780Ry1XL@tanuki \
--to=ps@pks.im \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).